Navigating & Scraping a Job Site | rvest & RSelenium

One of my family members gave me the idea of scraping data from a job site and arranging it in a way that can then easily be filtered and checked in a spreadsheet. I’m actually a little embarrassed that I didn’t think of this myself. Needless to say, I was eager to try it out.

I picked a site and started inspecting the HTML code to see how I would get the information I needed from each job posting. Normally, the easiest scrapes (for me) are the ones where the site has two characteristics.

First, it helps if all (or at least most) of the information that I need to extract is on the site’s search results page. For instance, in the context of job postings, if you search for “Data Scientist”, and the search results show the job title, the company that’s hiring, the years of experience required, the location, and a short summary – then there is no real need to navigate to each post and get that data from the post itself.

The second characteristic is that the URL of the search results shows which results page you are currently on – or at least gives some indication of which search result number you are looking at. For instance, google “Data Scientist” and take note of the URL. Scroll down and click the second page, and notice that the URL now ends with “start=10”. Go to the third page and you’ll notice that it now ends with “start=20”. Although the URL doesn’t mention the page number itself, it does indicate that if you were to change those last digits to anything (go ahead and try), the search results would begin from start + 1; i.e. if start = 10, the results would begin with search result no. 11. If I’m lucky, some websites give a clearer indication in the URL, like “page=2”, which makes the task even easier.

Now why do these two characteristics make things so much easier? Mainly because you can split the URL into different parts, with only one variable – the page number – and then concatenate the parts back together. After that it’s just a matter of looping through these URLs and picking up the information you need from the HTML source, as sketched below.
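
To make that concrete, here’s a minimal sketch of that kind of loop using rvest. The base URL, the multiples of 10 in the “start=” parameter, and the “.job_title” selector are all placeholders I’ve made up for illustration – not the actual site used later in this post.

#Load packages
library(rvest)
library(dplyr)

#Hypothetical search URL, split just before the variable part
base_url = "https://www.example-jobsite.com/search?q=data+scientist&start="

all_titles = c()

for(page in 0:5){
  
  #Concatenate the fixed part of the URL with the page offset
  page_url = paste0(base_url, page * 10)
  
  #Read the page and pull out the job titles (placeholder selector)
  read_html(page_url) %>% 
    html_nodes(".job_title") %>% 
    html_text() -> titles
  
  all_titles = c(all_titles, titles)
  
  Sys.sleep(2) #Pause between requests so the server isn't burdened
}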

If the above two characteristics exist, all I need is the rvest package to make it all work, with dplyr and stringr for some of the “tidying”.

There are certain instances, however, when neither of these characteristics exists. It’s usually because the site incorporates some JavaScript, so the URL does not change as you move through the search pages. This means that in order to make this work, I would actually have to click the page buttons to get each page’s HTML source – and I can’t do that with rvest.

Enter RSelenium, the wonderful R package that allows me to do exactly that.

As always, I started off by loading the packages, assigning the URL of the search results page, and extracting the data for just the first page. You’ll have to excuse me for using the “=” operator for assignment. WordPress seems to screw up the formatting if I use the “less than” operator combined with a hyphen, which is sort of annoying.

#Load packages
library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)
library(beepr)

#Manually paste the URL for the search results here
link = "jobsite search results URL here"

#Get the html source of the URL
hlink = read_html(link)

#Extract Job Title
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
  html_text() %>% data.frame(stringsAsFactors = FALSE) -> a
names(a) = "Title"

#Extract Recruitment Company
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
  data.frame(stringsAsFactors = FALSE) -> b
names(b) = "Company"

#Extract Links to Job postings
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
  html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> c
names(c) = "Links"

At this point I’ve only extracted the job titles, the hiring companies’ names, and the links to the posts. To get the same details for the remaining posts, I need to navigate to the next page, which involves clicking the Next button at the bottom of the search results page.


#From RSelenium
checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser = remoteDriver(browserName = "chrome") #Use Chrome instead of the default browser
mybrowser$open(silent = TRUE) #Open the browser
Sys.sleep(5) #Wait a few seconds
mybrowser$navigate(link) #Navigate to URL
Sys.sleep(5) 

Pages = 16 #Select how many pages to go through

for(i in 1:Pages){ 
  
  #Find the "Next" button and click it
  try(wxbutton <- mybrowser$findElement(using = 'css selector', "a.pagination_item.next.lft")) #"<-" is needed here; "=" inside try() would be read as an argument name
  try(wxbutton$clickElement()) # Click
  
  Sys.sleep(8)
  
  hlink = read_html(mybrowser$getPageSource()[[1]]) #Get the html source from the site
  
  hlink %>% html_text() -> service_check
  
  #If there is a 503 error, go back
  if(grepl("503 Service", service_check)){ 
    
    mybrowser$goBack()
    
  }
  else
  {
    
    #Job Title
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
      html_text() %>% data.frame(stringsAsFactors = FALSE) -> x
    names(x) = "Title"
    a = rbind(a,x) #Add the new job postings to the ones extracted earlier
    
    #Recruitment Company
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
      data.frame(stringsAsFactors = FALSE) -> y
    names(y) = "Company"
    b = rbind(b,y)
    
    #Links
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
      html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> z
    names(z) = "Links"
    c = rbind(c,z)
    
  }
  
}

beep()

#Put everything together in one dataframe
compile = cbind(a,b,c)

#export a copy, for backup
write.csv(compile, "Backup.csv", row.names = FALSE)

#close server and browser
mybrowser$close()
mybrowser$closeServer()

Now that I have the links to all the posts, I can loop through the previously compiled dataframe and get the details from each URL.


#Make another copy to loop through
compile_2 = compile

#Create 8 new columns to represent the details to be extracted
compile_2$Location = NA
compile_2$Experience = NA
compile_2$Education = NA
compile_2$Stream = NA
compile_2$Function = NA
compile_2$Role = NA
compile_2$Industry = NA
compile_2$Posted_On = NA

#Three loops in total: the second and third are nested inside the first
#First loop to go through the links extracted
for(i in 1:nrow(compile_2)){
  
  hlink = NULL
  
  link = compile_2$Links[i]
  
  try(hlink <- read_html(link)) #"<-" again; "=" inside try() would be treated as an argument name
  
  if(!is.null(hlink)){ #Skip this post if the page could not be read
  
        hlink %>% html_nodes(".jd_infoh") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> a_column
        
        hlink %>% html_nodes(".jd_infotxt") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> l_column
   
  if(nrow(a_column) != 0){      
             
        #Second loop to drop blank rows from the values so they line up with the headers
        for(j in nrow(l_column):1){
          
          if(nchar(str_trim(l_column[j,1])) == 0){l_column[-j,] %>% data.frame(stringsAsFactors = FALSE) -> l_column}
          
        }
         
    if(nrow(a_column) == nrow(l_column)){
    
        cbind(a_column, l_column) -> comp_column
        
        #Third loop to update dataframe with all the details from each post
        for(k in 1:nrow(comp_column)){
          
          if(grepl("Location", comp_column[k,1])){compile_2$Location[i] = comp_column[k,2]} 
          
          if(grepl("Experience", comp_column[k,1])){compile_2$Experience[i] = comp_column[k,2]}
          
          if(grepl("Education", comp_column[k,1])){compile_2$Education[i] = comp_column[k,2]}
          
          if(grepl("Stream", comp_column[k,1])){compile_2$Stream[i] = comp_column[k,2]}
          
          if(grepl("Function", comp_column[k,1])){compile_2$Function[i] = comp_column[k,2]}
          
          if(grepl("Role", comp_column[k,1])){compile_2$Role[i] = comp_column[k,2]}
          
          if(grepl("Industry", comp_column[k,1])){compile_2$Industry[i] = comp_column[k,2]}
          
          if(grepl("Posted", comp_column[k,1])){compile_2$Posted_On[i] = comp_column[k,2]}
        }
  
  }
  }
  }
}

beep()

#Export a copy for backup
write.csv(compile_2, "Raw_Complete.csv", row.names = FALSE)

#Alert
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.3)
beep(sound = 8) #That one's just me goofing around

Alright, we now have a nice dataframe of 1840 jobs and 11 columns showing:

1. Title: The job title as advertised.
2. Company: The hiring company.
3. Links: The URL of the job posting.
4. Location: Where the job is situated.
5. Experience: The experience required for the job, shown as a range (e.g. 2-3 years).
6. Education: Minimum educational qualification.
7. Stream: Work stream category.
8. Function: Job function category.
9. Role: The job’s general role.
10. Industry: The industry the hiring company operates in.
11. Posted_On: The day the job was originally posted.

As a matter of convenience, I decided to split the fifth column, Experience, into two additional columns:

12. Min_Experience: Minimum years of experience required.
13. Max_Experience: Maximum years of experience required.

The code used to mutate this Experience column was:

com_clean = compile_2

#Logical vector flagging the observations where no details were extracted because of an error
is.na(com_clean[,4]) -> log_vec


#Place the NA rows in a separate dataframe
com_clean_NA = com_clean[log_vec,]

#Place the remaining rows in another dataframe
com_clean_OK = com_clean[!log_vec,]


com_clean_OK[,"Experience"] -> Exp

#Remove whitespace and the "years" part
str_replace_all(Exp, " ", "") %>% 
  str_replace_all(pattern = "years", replacement = "") -> Exp

#Assign the location of the hyphen to a list
str_locate_all(Exp, "-") -> hyphens

#Assign empty vectors to be populated with a loop
Min = c()
Max = c()

for(i in 1:length(Exp)){
  
  substr(Exp[i], 1, hyphens[[i]][1,1] - 1) %>% 
    as.integer() -> Min[i]
  
  substr(Exp[i], hyphens[[i]][1,1] + 1, nchar(Exp[i])) %>% 
    as.integer() -> Max[i]
  
}

#Assign results to new columns
com_clean_OK$Min_Experience = Min
com_clean_OK$Max_Experience = Max

#Rearrange the columns
select(com_clean_OK, 1:4, 12:13, 5:11) -> com_clean_OK

write.csv(com_clean_OK, "Complete_No_NA.csv", row.names = FALSE)
write.csv(com_clean_NA, "Complete_All_NA.csv", row.names = FALSE)
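
As an aside, the same split could probably be done more compactly with tidyr’s separate() instead of the hyphen loop above. Here is a minimal sketch, assuming the Experience strings all look something like “2 - 3 years” (com_clean_alt is just an illustrative name):

library(tidyr)

#Strip "years" and any whitespace, then split on the hyphen
#convert = TRUE turns the two new columns into integers
com_clean[!log_vec, ] %>% 
  mutate(Exp_clean = str_replace_all(Experience, "years|\\s", "")) %>% 
  separate(Exp_clean, into = c("Min_Experience", "Max_Experience"), 
           sep = "-", convert = TRUE) -> com_clean_alt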

And with that, I have a nice dataframe with all the information I need to go through the posts. I was flirting with the idea of writing some code that would automatically apply for a job if it meets certain criteria, e.g. if the job title equals X, the minimum experience is less than Y, and the location is in a list Z, then click this, and so on. Obviously, there is the question of how to get past the CAPTCHA walls, as a colleague once highlighted. In any case, I thought I’d leave that idea for a different post. Till then, I’ll be doing some intense googling to see if someone else has actually tried it (using R, or even Python) and maybe pick up a few things.
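
Just to illustrate the filtering half of that idea (the clicking and CAPTCHA half is another story), a shortlist could be pulled from the compiled dataframe with dplyr. The title pattern, experience threshold, and locations below are made-up examples:

#Made-up criteria, purely for illustration
target_locations = c("Location A", "Location B")

com_clean_OK %>% 
  filter(grepl("Data Scientist", Title), 
         Min_Experience <= 2, 
         Location %in% target_locations) -> shortlist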


4 Responses to Navigating & Scraping a Job Site | rvest & RSelenium

  1. Jim Plante says:

    Mutating the experience column is super easy with tidyr (another of Hadley Wickham’s neat packages). Use the swirl package to check it out.

    Thank you for the rest of the code above. I’ve been trying to learn how to scrape web pages, and you just advanced that knowledge.

    • I’ve heard of plyr and dplyr, but not tidyr. Thanks a lot for the tip. 🙂
      Everything I learned about scraping and R, I picked up online. So it’s a little rewarding if even a single person gets some value from my humble, and quite obscure, blog posts.

  2. Hendrik says:

    Thanks for your interesting post. I have been trying your code out, but am not sure which job search website you are referring to. Can you tell us, please?

    • You’re welcome Hendrik. 🙂
      Some sites state in their Terms of Use that they don’t allow accessing the site via automated means. This is to avoid too many people burdening the server with requests, which is why I don’t normally put the site’s URL in the “link” variable.

      However, is there any specific site that you’re planning on scraping info from? I can perhaps walk you through the steps I normally take, because each site requires different code depending on the HTML source.

      And sorry for the late reply to your comment. Been pretty busy lately.
