Scraping IMDB’s Search Results

I’ve been learning how to scrape web pages, and since I’m still a complete programming newbie, it took some getting used to. Credit goes to a certain Lee Hawthorn for explaining how to do this.

I used SelectorGadget, as suggested in Hawthorn’s article, and noticed that it also comes as a Chrome extension. This made the whole process even easier! I’m not even certain I did this right, but it gives me the results I want, which, when you’re starting out, is all that seems to matter.
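To get a feel for what a selector picked with SelectorGadget actually does, here’s a small sketch that runs against a made-up HTML snippet instead of the live site. The markup below is an assumption for illustration only and not IMDB’s real page source; the only thing it shares with the script further down is the .result_text class.

```r
library(rvest)

# Made-up markup for illustration; not IMDB's actual page source
snippet = '<table>
  <tr><td class="primary_photo">img</td><td class="result_text"> <a href="#">Like Crazy</a> (2011) </td></tr>
  <tr><td class="primary_photo">img</td><td class="result_text"> <a href="#">The Crazies</a> (2010) </td></tr>
</table>'

titles = read_html(snippet) %>%
  html_nodes(".result_text") %>%   # keep only elements with class "result_text"
  html_text() %>%                  # drop the tags, keep the text inside them
  trimws()                         # strip leading and trailing whitespace

print(titles)
```

The cells with the other class are skipped entirely, which is the whole point of picking a selector: you only get back the bits of the page you asked for.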

Whenever you search for a word or phrase in IMDB’s search bar and refine the search by movie titles, you get a maximum of 200 search results. I’m still not sure how to get more than 200, but for now this’ll have to do.

I created the function searchIMDB with the following script:

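searchIMDB = function(searchTerm){

  library(dplyr)
  library(rvest)
  library(stringr)

  # Build the search URL; the query parameters restrict results to movie titles
  fullURL = paste("http://www.imdb.com/find?q=", searchTerm, "&s=tt&ttype=ft&ref_=fn_ft", sep = "")

  # read_html() replaces html(), which newer versions of rvest deprecate
  page = read_html(fullURL)

  # .result_text is the selector picked with SelectorGadget for each search result
  movieTitles = page %>%
    html_nodes(".result_text") %>%
    html_text() %>%
    str_trim()   # strip surrounding whitespace (stringr has str_trim(), not trim())

  movieTitles.df = na.omit(tbl_df(data.frame(movieTitles)))

  movieTitles.df
}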
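One thing to watch out for: a search term with spaces or an ampersand in it may not survive being pasted straight into a URL. Here’s a minimal sketch of one way around that, assuming base R’s URLencode() does the escaping. buildSearchURL is a hypothetical helper name I made up for this example, not part of the script above.

```r
# buildSearchURL is a hypothetical helper, not part of the original script
buildSearchURL = function(searchTerm){
  # reserved = TRUE also escapes characters like & and ? that would
  # otherwise be read as part of the query string
  encodedTerm = URLencode(searchTerm, reserved = TRUE)
  paste("http://www.imdb.com/find?q=", encodedTerm, "&s=tt&ttype=ft&ref_=fn_ft", sep = "")
}

print(buildSearchURL("crazy stupid love"))
```

The string it returns could then be used in place of fullURL when fetching the page.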

And so running the script…

head(searchIMDB("crazy"))

gives me the following list of movie titles:

Source: local data frame [6 x 1]

movieTitles
1 C.R.A.Z.Y. (2005) aka “Crazy”
2 Crazy (I) (2000)
3 Crazy (II) (2008)
4 Crazy, Stupid, Love. (2011)
5 The Crazies (2010)
6 Like Crazy (2011)
