Revisiting Car Prices | Python | requests, BeautifulSoup, Pandas

It feels like ages since I last posted something, and I’ve been trying to figure out how I can come up with the sort of scraping scripts I normally prepare…but in Python instead of R. In R, I would normally use the rvest package for basic scraping, and I would add RSelenium for more advanced stuff.

To retrieve the site’s HTML, I used the requests Python package, and BeautifulSoup for the scraping. I’ve grown accustomed to the sort of data types I’ve been using in R, and after some basic googling, it would appear that I wouldn’t be too far off from R if I used pandas as the package for compiling everything into a dataframe.

The website holds information regarding new and used cars, and doesn’t use much JavaScript, which is why I didn’t use Selenium. The code is a bit messy, and I used quite a few custom functions in order to make the transition from R to Python a little smoother. As always, the site URL in the code is not the actual site that I used in the script.

 
#import the necessary packages
import requests
import bs4
import pandas as pd


all_links = [] #empty list to be populated
for j in range(1, 69): #the maximum page number is hardcoded
    #Request the page from the site
    page = requests.get("https://www.dummy_car_site.com?page=" + str(j) + "&size=25")
    #Parse the page's details using the Beautiful Soup package
    soup = bs4.BeautifulSoup(page.content, "html.parser")

    #By inspecting the HTML in Chrome, the selector below retrieves the full list of posts from one page
    #The result is a list of bs4.element.Tag objects
    links_list = soup.select("h2 > a")

    #According to the documentation, a Tag object holds all its attributes in a dict
    #That means if there is a link (i.e. an href) it can be retrieved like a dictionary key
    for post in links_list:
        link = post.attrs["href"]
        all_links.append(link)


del page, links_list, j, post, soup, link #remove unneeded variables
 
all_cars = [] #an empty list to be populated and later converted to a pandas dataframe

#Start a loop to navigate to all the links and extract the information that is needed
for link in all_links:
    car = {}
    post_request = requests.get(link)
    post_soup = bs4.BeautifulSoup(post_request.content, "html.parser")

    #Selector for extracting the price
    price_selector = "div.listing__price.delta.weight--bold"
    price_tag = post_soup.select(price_selector)[1]
    car["price"] = price_tag.text

    #__1. Selectors for the details table__
    #___1a. Selector for the details title___
    details_title_selector = "span.list-item__title"
    headers = get_details_list(post_soup.select(details_title_selector))

    #___1b. Selector for the details value___
    details_value_selector = "span.float--right"
    values = get_details_list(post_soup.select(details_value_selector))

    compiled_details = dict(zip(headers, values)) #combine the headers and values into a dictionary

    car.update(compiled_details) #add to the original dictionary

    #__2. Selector for the date the post was updated__
    date_selector = "listing__updated"
    date_posted = post_soup.find_all(attrs={"class": date_selector})[1].text

    car["date"] = date_posted #update dictionary

    all_cars.append(car) #add to the final list of cars
    print(str(len(all_cars)) + " posts completed.")

#I didn't know the quickest way to clear all the variables I created, so I delete them manually
del car, compiled_details, date_posted, date_selector, details_title_selector, details_value_selector, headers
del link, post_request, post_soup, price_selector, price_tag, values

#convert all the details into a dataframe
raw_df = pd.DataFrame.from_records(all_cars)

I’ve only used one custom function in this script:

#to compile all the headers into one list
def get_details_list(result):
    new_results = []
    for i in range(0, len(result)):
        new_results.append(result[i].text)
    return new_results
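
On a side note, the same helper can be written as a list comprehension. This is just an equivalent sketch of the function above, not what I actually ran:

#equivalent one-liner: pull the .text of every tag in the result set
def get_details_list(result):
    return [tag.text for tag in result]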

It was much more difficult to try and clean the data using Python, not because data cleaning in Python is complicated, but rather because I’ve grown so accustomed to doing it in R that I kept confusing myself.


#___DATA CLEANING STARTS HERE__#

import datetime

#remove RM prefixes and commas
#clean prices

#isolate the values coming after "RM"
RM_prices = []
iso_prices = list(raw_df["price"])
for iso_price in iso_prices:
    for location in range(2, len(iso_price)):
        if iso_price[location - 2:location] == "RM":
            iso_price = iso_price[location - 2:len(iso_price)]
    RM_prices.append(iso_price)

#remove the currency prefix and the thousands separators
prices_clean = str_replace_all(str_replace_all(RM_prices, "RM ", ""), ",", "")
new_prices_clean=[]
for price_clean in prices_clean:
    try:
        new_prices_clean.append(int(price_clean))
    except ValueError:
        new_prices_clean.append(price_clean)

raw_df["prices_clean"] = new_prices_clean

#clean dates
dates_unclean = str_replace_all(raw_df["date"], "Updated on: ", "")

clean_dates = []
for unclean_date in dates_unclean:
    clean_date = datetime.datetime.strptime(unclean_date, "%B %d, %Y").strftime("%Y-%m-%d")
    clean_dates.append(clean_date)

raw_df["dates_clean"] = clean_dates

#clean engine capacity
#there are only two unique values, 1799 and 1800 cc.
#converting all 1799 values to 1800.
clean_eng_cap = str_replace_all(list(raw_df["Engine Capacity"]), "1799", "1800")
raw_df["eng_cap_clean"] = clean_eng_cap

#clean Mileage, by only taking the averages
clean_mileages = get_mileages(raw_df["Mileage"])
raw_df["mileages_clean"] = clean_mileages

#drop posts with no prices
no_prices = []
for i in range(0, len(raw_df['prices_clean'])):
    if isinstance(raw_df["prices_clean"][i], str):
        no_prices.append(i)

final_df = raw_df.drop(raw_df.index[no_prices])
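
The same filter could also be written with pandas directly, by coercing the cleaned prices to numbers and keeping only the rows that convert. A sketch (final_df_alt is just an illustrative name):

#coerce the cleaned prices to numbers and drop the rows that fail to convert
numeric_prices = pd.to_numeric(raw_df["prices_clean"], errors="coerce")
final_df_alt = raw_df[numeric_prices.notna()].copy()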

#delete unneeded variables
del clean_date, clean_dates, clean_eng_cap, clean_mileages, dates_unclean, new_prices_clean, prices_clean
del unclean_date, price_clean

I’ve used quite a few custom functions during the cleaning stage. To be honest, in retrospect, the entire code looks much messier after I finally managed to clean the data. I’ve noticed that this usually happens when I don’t know how to go about a certain task. As a result, the code reflects the exploratory nature of the exercise.

#replace all instances of a string with a replacement. Accepts lists
#the name comes from R's stringr package which has a function with the same name.
def str_replace_all(full_list, string, replacement):
    new_list = []
    for i in range(0, len(full_list)):
        new_list.append(full_list[i].replace(string, replacement))

    return new_list
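
For example, calling it on a small list of made-up prices:

#example usage (illustrative values)
str_replace_all(["RM 45,000", "RM 52,500"], ",", "")
#returns ['RM 45000', 'RM 52500']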

#check whether a given class appears in a parsed item (returns True if there is at least one match)
def check_result(list_item, find_item):
    a = bs4.BeautifulSoup(str(list_item), "html.parser")
    b = a.find_all(attrs={"class": find_item})
    if len(b) > 0:
        return True
    else:
        return False


#to clean the Mileage column
#runs three passes: lowercase "k" ranges, uppercase "K" ranges, then a final symbol clean-up.
def get_mileages(mileages_list):
    new_mileages = [] #empty list to populate with loop
    for mileage in mileages_list:
        if len(mileage) > 1: #if the mileage has length greater than one...
            try: #a good chance some errors will appear here
                k_location = mileage.index("k") #get the string location of the lowercase k
                mileage = mileage[0:k_location] #remove the "k"
                hyphen = mileage.index("-") #find location of the hyphen
                mileage_low = int(mileage[0:hyphen].strip()) #extract the low end of the range
                mileage_high = int((mileage[hyphen + 1:len(mileage)].strip())) #extract the high end
                mileage = ((mileage_low + mileage_high)/2)*1000 #convert to an average
                new_mileages.append(mileage) #append this new value to the empty list
            except ValueError:
                mileage = mileage #otherwise don't change anything...
                new_mileages.append(mileage) #...and append it to the list
        else:
            if mileage == "-": #if the value is just a hyphen
                mileage = 0 #then the value is actually zero
                new_mileages.append(mileage) #and append it to the list
            else:
                mileage = mileage
                new_mileages.append(mileage)

    #Do it all over again, except this time isolate the uppercase K...
    #...and leave everything else the same
    new_new_mileages = [] #new list
    for new_mileage in new_mileages:
        try:
            new_mileage_string = str(new_mileage)
            K_location = new_mileage_string.index("K") #get the string location of the uppercase K
            new_mileage_string = new_mileage_string[0:K_location] #remove the "K"
            hyphen = new_mileage_string.index("-") #find location of the hyphen
            mileage_low = int(new_mileage_string[0:hyphen].strip()) #extract the low end of the range
            mileage_high = int((new_mileage_string[hyphen + 1:len(new_mileage_string)].strip())) #extract the high end
            new_mileage_string = ((mileage_low + mileage_high)/2)*1000 #convert to an average
            new_new_mileages.append(new_mileage_string) #append this new value to the empty list
        except ValueError:
            new_mileage = new_mileage #otherwise don't change anything...
            new_new_mileages.append(new_mileage) #...and append it to the list

    final_mileages = []
    for new_new_mileage in new_new_mileages:
        if isinstance(new_new_mileage, str): #only unparsed strings still contain stray symbols
            new_new_mileage = remove_symbols(new_new_mileage) #averaged values are floats; leaving them alone preserves the decimal point
        final_mileages.append(float(new_new_mileage))

    return final_mileages
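
For what it’s worth, the two near-identical passes could be folded into one function with a case-insensitive regular expression. This is only a sketch, assuming the raw values look like "15k - 20k", "15K - 20K", or "-" (parse_mileage is an illustrative name, not something the script uses):

import re

#parses ranges like "15k - 20k" (upper- or lowercase K) into their average in km,
#maps "-" to 0, and leaves anything unrecognised as NaN
def parse_mileage(raw):
    raw = str(raw).strip()
    if raw == "-":
        return 0.0
    match = re.match(r"(\d+)\s*k?\s*-\s*(\d+)\s*k?", raw, flags=re.IGNORECASE)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) / 2 * 1000
    return float("nan")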

#to check if an element is neither a number nor a letter
#returns a boolean indicating whether there are any symbols
#if there are any symbols, it also returns the location in the string
#and what the symbol is
#the type returned is a dictionary
def check_characters(ch_string, symbols=["`", "!", "~", "@", "#", "$", "%", "^", "&", "*", "(", ")", "-", "_", "=", "+", "[", "{", "]", "}", ";", ":", "'", '"', ",", "<", ".", ">", "/", "?", " "]):
    if not isinstance(ch_string, str):
        print("Converting ch_string to a string...")
        ch_string = str(ch_string)

    len_string = len(ch_string)
    result = {"has_symbol": False, "location": "", "symbol": ""}
    for i in range(1, len_string + 1): #check every character, including the last one
        for symbol in symbols:
            if ch_string[i - 1:i] == symbol:
                result["has_symbol"] = True
                result["location"] = i
                result["symbol"] = symbol
                break
    return result

#remove all symbols from a string
def remove_symbols(string):
    while check_characters(string)["has_symbol"]:
        string = str(string)
        x = check_characters(string)["symbol"]
        string = string.replace(x, "")
    return string

#to remove letters from any of the series that are meant to be numbers
#tries to convert each character to a number; if that fails, it is not a number and gets removed.
def remove_letters(values):
    new_values = [] #new list to be populated later
    for value in values: #for each element in the list
        if isinstance(value, str): #if the element is a string then...
            for location in range(1, len(value)): #...for each character in that string...
                try: #...try to...
                    int(value[location - 1:location]) #...convert the character to an integer
                    value = value #...if it works assign the value to itself (probably unnecessary)
                except ValueError: #....but if there is an error...
                    character = value[location - 1:location] #...isolate that character..
                    value = value.replace(character, "") #...and replace it with nothing
            new_values.append(value) #append the new value to the new list
        else: #if it is not a string
            new_values.append(value) #then assign the value back to the new list
    return new_values #return the new values
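
Taken together, check_characters, remove_symbols, and remove_letters boil down to “keep only the characters that make up a number”. A regex sketch of the same idea (strip_to_digits is an illustrative name); it only replaces the character-level helpers, so the range-averaging in get_mileages would still have to happen separately:

import re

#keep only digits (and optionally a decimal point) from a value;
#everything else, whether symbol or letter, is stripped out
def strip_to_digits(value, keep_decimal=False):
    pattern = r"[^\d.]" if keep_decimal else r"[^\d]"
    return re.sub(pattern, "", str(value))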

I know, it’s a mess. My struggle to understand some of the methods and functions in Python hindered any intention or determination I had to keep the code tidy.

The advantages of R’s ggplot2 package never really occurred to me until I tried plotting in Python. The package I used is bokeh, a pretty neat interactive graphing package that lets you generate the sort of insightful graphs you need when performing exploratory data analysis.

from bokeh.charts import Bar, output_file, show, Scatter

#filter out only the information that is needed
plot_df = final_df.query("Year == '2017' | Year == '2016' | Year == '2015'")
plot_df.columns = str_replace_all(list(plot_df.columns.values), " ", "_")
plot_df = plot_df.query("Car_type == 'New Car' | Car_type == 'Used Car'")

#generate plot: Bar graph
p = Bar(plot_df, label='Car_type', values='prices_clean', agg='mean', group='Year',
        title="Average price by car type and year", legend='bottom_right')

#save a copy
output_file("bar.html")
show(p) #writes bar.html and opens it in the browser

#______________

from bokeh.charts import BoxPlot, output_file, show
from bokeh.models import NumeralTickFormatter

p = BoxPlot(plot_df, values='prices_clean', label=['Year', 'Car_type', 'Transmission'], color='Year',
            whisker_color='Car_type', title="Toyota Vios Prices: By year, car type, and transmission",
            legend=False, outliers=False)

p.plot_width = 900

p.yaxis[0].formatter = NumeralTickFormatter(format="0,000")

p.yaxis.ticker = list(range(0, 110000, 5000)) #tick every RM5,000 up to RM105,000

output_file("boxplot.html")
show(p) #writes boxplot.html and opens it in the browser

#__________________________________________________________________________

plot_df = final_df.query("Year == '2017' | Year == '2016' | Year == '2015'| Year == '2014'| Year == '2013'")
plot_df.columns = str_replace_all(list(plot_df.columns.values), " ", "_")
plot_df = plot_df.query("Car_type == 'Used Car' & prices_clean <= 100000")

#generate plot: Scatter plot
p = Scatter(plot_df, x='mileages_clean', y='prices_clean', title="Price vs Mileage",
            xlabel="Mileage", ylabel="Price", color='Year', legend="bottom_right")

#reformat axes
p.yaxis[0].formatter = NumeralTickFormatter(format="0,000")
p.xaxis[0].formatter = NumeralTickFormatter(format="0,000")

output_file("scatter.html")
show(p) #writes scatter.html and opens it in the browser

You can check out the plots here: Bar plot, Box plot, Scatter plot.

Not too bad, although I would still prefer ggplot2, mainly because of the facet_wrap() and facet_grid() functions. I couldn’t find something similar in bokeh.
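
The closest workaround I can think of is building one plot per facet level and arranging them in a grid. A rough sketch using bokeh.plotting and gridplot, with the column names from the cleaned dataframe above (facets.html is just an illustrative file name):

from bokeh.plotting import figure, output_file, show
from bokeh.layouts import gridplot

#one scatter plot per year, arranged in a grid -- a rough stand-in for facet_wrap()
panels = []
for year in sorted(plot_df["Year"].unique()):
    facet_df = plot_df[plot_df["Year"] == year]
    fig = figure(title="Year: " + str(year), plot_width=300, plot_height=300)
    fig.scatter(facet_df["mileages_clean"], facet_df["prices_clean"])
    panels.append(fig)

output_file("facets.html")
show(gridplot(panels, ncols=3))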

All in all, not a bad start to using Python. I’m still getting accustomed to the different data types and methods. Next up would be to use Selenium in Python.
