Plotting Twitter Data

I’ve only just come to realize that some social media sites allow you to download your entire historical activity in one large data file. I don’t know about other sites, but I’m aware that Twitter gives you the option to download a data file showing all your tweets and retweets, when you tweeted them, who you retweeted, and who you’ve replied to. All of this also includes any links you might have included in the tweet, along with… Continue reading


Exploring New and Used Car Data in Malaysia

I came across a local website where individuals/dealers in Malaysia can post information on used and new cars they are selling. Why I, in particular, would browse such a site might dumbfound some people (those who personally know me will know what I’m talking about), but I nevertheless found myself spending over an hour going through the posts put up by hundreds of users in Malaysia; and it got me wondering. It would be pretty interesting to explore these posts a little further by extracting… Continue reading


Web Scraping: The Sequel | Propwall.my

Alright. Time to take another shot at web scraping. My attempt at scraping data off iBilik.my left me a little frustrated because of how long it took, and also because I couldn’t get much information due to all the duplicated posts.

I think Propwall.my would be a much better… Continue reading


An Attempt at Web Scraping: Cyberjaya Rental Rates

I came across this short tutorial on how to use the rvest package to scrape information from websites. It looked pretty straightforward, although it took a while to get the hang of some of the HTML jargon. So I figured I should take a shot at this scraping myself. Continue reading


Eroding Commitment

There’s an old saying that goes: “If you dream and want something hard enough, but have no commitment… then you’re probably full of shit”. Since finishing the Getting and Cleaning Data course on Coursera, I hadn’t typed a single line of R code; and since I was only just starting out on the basics, my foundations in the language got shaky again. So I had to start all over again.

Below is a function that loops through an entire dataframe and finds the names of all columns that contain any of a given set of values.

searchCol = function(name, dataframe){
  x = 0
  newList = c() #This will be the vector that gets returned

  #Start loop over all the values to be searched for

  for(i in 1:length(name)){
    #Start loop over each row in the dataframe...
    for(j in 1:nrow(dataframe)){

      #...and the same thing for the columns
      for(k in 1:ncol(dataframe)){

        #If the matching criteria is met, add the column name to the newList vector
        if(dataframe[j,k] == name[i]){

          newList[x+1] = names(dataframe)[k]
          x = x+1
        }
      }
    }
  }
  return(newList)
}

Below is a simple demonstration:

mat = data.frame(matrix(1:50, 5, 10))
mat

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  1  6 11 16 21 26 31 36 41  46
2  2  7 12 17 22 27 32 37 42  47
3  3  8 13 18 23 28 33 38 43  48
4  4  9 14 19 24 29 34 39 44  49
5  5 10 15 20 25 30 35 40 45  50

So, to find which columns the figures 28 and 39 fall in…

tst = c(28, 39)
searchCol(tst, mat)

…we would get:

[1] "X6" "X8"

God, I hate having to start from scratch again.


Summarizing/Exploring Global Wealth Data

Alright. Time to take a crack at another data set, and see if I can hone some of the data cleaning skills I’ve just learned on Coursera.

The data sets I’m using are from the World Bank’s Changing Wealth of Nations report, with data compiled for 1995, 2000, and 2005. The data all came combined in one Excel workbook, so I saved each year’s data in a separate file and then loaded them into R as data95, data20, and data05 for 1995, 2000, and 2005, respectively.

Given that the data is not exactly arranged in a tidy manner, some cleaning had to be done. I experimented with combining the data frames into a list so that I could loop over all of them at once. I’ve been told that the plyr package provides that kind of functionality, but I’m not used to it yet; besides, I’m still going through all the online tutorials.

Below is the code I used to combine the data sets and clean them by removing the variables/columns that aren’t needed.

library(plyr)
library(dplyr)
library(stringr)

#adding the year column to each data frame
data95$year = "1995"
data20$year = "2000"
data05$year = "2005"

#combine all datasets in a list
dflist = list(data95, data20, data05)

for(i in 1:length(dflist)){

  #remove all rows where the second column shows no data
  for(j in nrow(dflist[[i]]):1){
    if(nchar(dflist[[i]][j,2]) == 0){dflist[[i]] = dflist[[i]][-j,]}
  }

  #remove all columns where all values are NA
  #(naAll.omit is my own helper; see the "R Functions: Removing NA Columns" post)
  dflist[[i]] = naAll.omit(dflist[[i]])
  #remove all rows where there is an NA value
  dflist[[i]] = na.omit(dflist[[i]])

  #remove all commas from all columns
  #(note: this coerces every column to character, so the numeric
  #columns need converting back before any arithmetic)
  for(k in ncol(dflist[[i]]):1){
    dflist[[i]][,k] = str_replace_all(dflist[[i]][,k], ",", "")
  }

}

#convert from a list to a dataframe
final = ldply(dflist)
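
Once those plyr/dplyr habits sink in, the loops above should shrink considerably. In the meantime, here’s a rough loop-free sketch of the same cleaning pass (clean_df is a hypothetical helper; it makes the same all-character coercion as the comma-stripping above):

clean_df = function(df) {
  #drop rows where the second column is empty (or NA)
  df = df[!is.na(df[[2]]) & nchar(df[[2]]) > 0, ]
  #drop columns where every value is NA
  df = df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
  #drop any remaining rows that contain an NA
  df = na.omit(df)
  #strip commas from every column (coerces everything to character)
  df[] = lapply(df, function(col) gsub(",", "", col))
  df
}

dflist = lapply(dflist, clean_df) #then convert to one dataframe with ldply, as above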

The psych R package has the helpful describeBy() function, which runs summary statistics for every unique value of a particular variable in a given dataframe. And although you can subset its output to extract any particular group’s summary stats, I didn’t find it intuitive enough for manipulation. So I thought a custom function would be more helpful… at least for me.
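
For comparison, here’s roughly what the describeBy() route looks like; a minimal sketch, where the group label being extracted is hypothetical:

library(psych)

#one list element per unique Region, each holding that group's summary stats
byRegion = describeBy(final, group = final$Region)

#pulling out one group means indexing the list by its exact name
byRegion[["Sub-Saharan Africa"]] #hypothetical region label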

Below is the function, which returns a single data frame holding the summary stats for every group. Its parameters are the dataframe and the name of the grouping column; the resulting data frame was assigned to the variable sumFinal.

deStats = function(data, column){

  library(psych) #for describe()
  library(plyr)  #for ldply()

  #index number of the column name
  gIndex = which(colnames(data) == column)

  #assign the unique group values to a variable
  groups = unique(data[,gIndex])

  #create an empty list, to be filled later
  clist = list()

  #a loop to create a list of summary data frames, one per group
  for(i in 1:length(groups)){
    y = groups[i]
    z = subset(data, data[,gIndex] == y)

    a = describe(z)

    #add columns that include the group name and..
    #..measurements
    a$group = y
    a$measurements = rownames(a)

    #add the data frame to the list
    clist[[i]] = a

  }
  #convert list to dataframe
  b = ldply(clist)

  limit1 = ncol(b)-1
  limit2 = ncol(a)-2

  cPost = c(limit1:ncol(b))

  #re-position the group and measurement columns to the front
  b = b[,c(cPost,1:limit2)]

  #remove row names
  rownames(b) = NULL

  return(b)
}
sumFinal = deStats(final, "Region")

Most of the NA values that have been generated are a result of running summary stats on the character variables. You can find the data output at the end of this post.

I thought it might also be a good time to try out the explData() function I mentioned in my first post, and see how it fares with this tidy data. It kept giving me an error, as I seem to have not set up the character class filters in the function’s code properly. So I had to run the function on only the numeric columns, like so…

explData(final[,5:21])

…which gives me the following output:

[Plot: corrplot of the numeric wealth variables]

To elaborate on the plot once more: the blue boxes represent a positive relationship, and the strength of that relationship is signified by the color’s intensity. The crossed-out boxes simply mean that the relationship is not statistically significant at an alpha level of 0.05.

I was also interested in checking whether I could derive the percentage change in each variable from 1995 to 2005. The only problem is that not all countries have data for both 1995 and 2005, so the loops would inevitably create quite a number of NA values along the way. I also noticed that I’m using a whole bunch of loops in my data cleaning and summarization; something I’m hoping to stop doing once I get the hang of the plyr and dplyr packages. Anyway, here’s the code for calculating the percentages, working on the final tidy data set that was created earlier.

#list of unique countries
uniCount = unique(final$Economy)

#create an empty data frame to fill in later using the loop
perDF = data.frame()

#(the numeric columns are assumed to have been converted back
#from character after the comma-stripping during cleaning)
for(i in 1:length(uniCount)){

  x = final[final$Economy == uniCount[i] & final$year == "1995",]
  y = final[final$Economy == uniCount[i] & final$year == "2005",]

  for(j in 1:ncol(y)){
    if(class(x[1,j]) != "character"){

      #percentage change from 1995 to 2005
      perDF[i,j] = (y[1,j] - x[1,j])/x[1,j]} else {

        #keep the first four identifier columns as they are..
        if(j <= 4){perDF[i,j] = x[1,j]}else{
          #..and NA out any other character column
          perDF[i,j] = NA
        }
      }
  }
}

#drop rows that came out entirely NA
for(i in nrow(perDF):1){
  if(sum(is.na(perDF[i,])) == ncol(perDF)){perDF = perDF[-i,]}
}

colnames(perDF) = make.names(colnames(final))
perDF = naAll.omit(perDF)
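
For the record, the same numbers can be had without the nested loops. A sketch, under the same assumptions (numeric columns converted back from character, and the first four columns being identifiers):

base95 = final[final$year == "1995", ]
base05 = final[final$year == "2005", ]

#keep only countries that appear in both years, aligned by name
common = intersect(base95$Economy, base05$Economy)
base95 = base95[match(common, base95$Economy), ]
base05 = base05[match(common, base05$Economy), ]

num = sapply(final, is.numeric)
perDF2 = base95[, 1:4] #the identifier columns
perDF2[, names(final)[num]] = (base05[, num] - base95[, num]) / base95[, num]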

Using the ever so wonderful ggplot2 package (and also with the help of this awesome handbook), I’ve plotted the percentage change in each country’s pasture land wealth, color-coded by the country’s Region. Below is the code and the output:

library(ggplot2)

plot = ggplot(perDF, aes(x = reorder(Economy,Pasture.Land), y = Pasture.Land, fill = Region)) + 
  geom_bar(stat = "identity")

plot + coord_flip() + labs(x = "Country", y = "% Change") + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black")) + 
  ggtitle("Percentage Change in Pasture Land Wealth - 1995 to 2005") + 
  theme(plot.title = element_text(face = "bold", size=rel(1.5)))
 

[Plot: percentage change in pasture land wealth by country, 1995 to 2005]

Interesting note on the above plot: countries in sub-Saharan Africa seem to have experienced the largest reduction in wealth derived from pasture land, while the opposite seems to hold for countries in Latin America and the Caribbean.

There’s actually still a whole lot more room for exploration in this dataset, but this is all I’m going to post, because the code for exploring the rest is pretty much the same.

If you’re interested in the data sets I’ve used and the outputs that resulted from the summarization, I’ve uploaded the files below:

Raw datasets:

data95, data20, data05

Outputs:

final, sumFinal, perDF


Summarizing Development Funds Data

Still going through the Getting and Cleaning Data course on Coursera, while also enrolled in the Data Manipulation using dplyr course on DataCamp. So I figured working on a teeny data exploration task with actual raw data would help me remember all these new functions.

The data I’ve used is the “full version” of the Research Release file compiled by AidData. The zip file was about 180MB in size, and about 600MB when extracted. The csv file’s dimensions were 44,210 rows by 99 columns, with data from 1946 to 2013.

What I was curious to find out was which country contributed the most to each development/assistance category shown in the file, with Somalia as the recipient. Emergency relief funds and emergency food aid were the categories excluded.

I couldn’t find out the units used in the funds commitments variable, so I’ve shown the data as they appear in the file. Having said that, and after skimming through the data a bit, I find it a little difficult to believe that these figures are not in units of 1,000 (if not 1,000,000). What I can say, however, is that the amounts are converted to USD and discounted to present-day dollars as of the day of the file’s publication. You can get more information about the data file from AidData’s website.

I started off by picking only the columns I needed, and then subsetted the data to keep only the rows where Somalia was the recipient. There were some funds whose purpose was not declared, so those were labeled “UNDISCLOSED” in the purpose column. This was done using a simple loop.

The second task, also done with a loop, was to create a list of data frames, one per donor country, with the funds summed for each purpose category.

The last loop row-binds all the data frames contained in the generated list and assigns the result to a variable e, which is our final dataset. A few NA rows were naturally generated because of the UNDISCLOSED observations, but those are then removed from the final data frame.

Here’s the code I ran to come up with the final data frame. The code assumes the data is already imported and named “aid”. You’ll have to excuse me, as the code is pretty sloppy…

library(dplyr)

#Exclude columns that are not needed..
aid_sub = select(aid, year, donor, donor_type, recipient,
                 crs_purpose_name, commitment_amount_usd_constant)
#...filter data where the recipient is Somalia
aid_sub_som = subset(aid_sub, recipient == "Somalia")

#aid where the purpose is not available is labeled as UNDISCLOSED
x = aid_sub_som$crs_purpose_name
for(i in 1:length(x)){
  if(x[i] == ""){x[i] = "UNDISCLOSED"}
}
aid_sub_som$crs_purpose_name = x

#to create a list of dataframes that hold all the summarised data:
#columns 5 and 6 are crs_purpose_name and commitment_amount_usd_constant
uDonor = unique(aid_sub_som$donor)
a = list()
for(i in 1:length(uDonor)){
  x = subset(aid_sub_som, donor == uDonor[i])
  y = summarise_each(group_by(x[,5:6], crs_purpose_name), funs(sum))
  #tag each summary with its donor
  y = cbind(donor = as.character(uDonor[i]), y, stringsAsFactors = FALSE)
  a[[i]] = y
}

#to deconstruct the list of dataframes and bind them into one dataframe
e = data.frame()
for(i in 1:length(a)){
  e = rbind(e, a[[i]])
}
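
Incidentally, the list-building and re-binding could collapse into one grouped summary using the same dplyr verbs; a sketch that should yield an equivalent table:

e2 = aid_sub_som %>%
  group_by(donor, crs_purpose_name) %>%
  summarise(commitment_amount_usd_constant = sum(commitment_amount_usd_constant))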

…and here’s the final output after plotting it in Tableau.

[Tableau chart: Somalia development funds by donor and purpose category]

Notable observations are:

  1. The United States and Italy seem to be at the forefront in the areas of agricultural development, food crop production, and livestock-related assistance.
  2. With almost $4M in funds, Sweden has provided the most for basic healthcare and, along with the US, also leads in civilian peace building and conflict resolution with $2.8M.
  3. Among the donor countries hosting refugees, Finland has contributed the most, spending almost $5.5M over the period.
  4. Japan and the Netherlands seem to have assisted the most in the category of relief coordination and support services.
  5. It is interesting to see that Norway has focused its assistance mainly on public sector policy and administration ($2.3M) and basic health infrastructure ($1M), while Canada has contributed to basic drinking water supply and sanitation ($1.2M).

There is one very notable exception: the United Arab Emirates, who supplied somewhere in the range of $10M in funds. However, that aid was provided in the 1980s and was not categorized, so it was removed from the table for the sake of the analysis.


R Functions: Removing NA Columns

I realize that this might sound a little nerdy, but making simple and useful R functions can be quite fun.

Here are a couple of simple R functions I occasionally use that remove columns with NAs.

This function removes all columns from a data frame that contain any NA values:

AnyNaCol = function(x) {
  limit = ncol(x)
  for(i in limit:1){
    #drop the column if it contains any NA value
    if(anyNA(x[,i])) {x[,i] = NULL}
  }
  return(x)
}

And here’s one that only removes columns where every value is NA:

AllNaCol = function(x) {
  limit = ncol(x)
  for(i in limit:1){
    #drop the column only if the NA count equals the number of rows
    if(sum(is.na(x[,i])) == nrow(x)) {x[,i] = NULL}
  }
  return(x)
}
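
For what it’s worth, both can also be written without a loop; a couple of one-line sketches (the function names are my own):

#same effect as AnyNaCol: keep only the columns with zero NAs
noAnyNaCol = function(x) x[, colSums(is.na(x)) == 0, drop = FALSE]

#same effect as AllNaCol: keep the columns with at least one non-NA value
noAllNaCol = function(x) x[, colSums(is.na(x)) < nrow(x), drop = FALSE]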

I usually put all kinds of comments in my code, but I thought these were pretty straightforward.


Scraping IMDB’s Search Results

I’ve been learning how to scrape web pages, and given that I’m still a complete programming newbie, it took some getting used to. Credit to a certain Lee Hawthorn for showing how to do this.

I used SelectorGadget, as suggested in Hawthorn’s article, and I noticed that it has a Chrome extension, which made the whole process even easier! I’m not even certain I did this right, but it seems to give me the results I want; and when you’re starting out, that’s all that seems to matter.

Whenever you search for a word or phrase in IMDB’s search bar and refine the search by movie titles, you get a maximum of 200 search results. I’m still not too sure how I can get more than 200, but for now, I guess this’ll have to do.

I created the function searchIMDB with the following script:

searchIMDB = function(searchTerm){

  library(dplyr)
  library(rvest)
  library(stringr)

  fullURL = paste("http://www.imdb.com/find?q=", searchTerm,
                  "&s=tt&ttype=ft&ref_=fn_ft", sep = "")

  #html() was rvest's page reader at the time; current rvest uses read_html()
  page = html(fullURL)

  movieTitles = page %>%
    html_nodes(".result_text") %>%
    html_text()

  movieTitles.df = na.omit(tbl_df(data.frame(movieTitles, stringsAsFactors = FALSE)))

  #strip the leading/trailing whitespace around each title
  movieTitles.df$movieTitles = str_trim(movieTitles.df$movieTitles)
  movieTitles.df
}
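
One caveat worth flagging: a multi-word search term needs encoding before it’s pasted into the URL. Base R’s URLencode() handles that; a small hypothetical wrapper:

#searchIMDBSafe is my own name; URLencode() percent-encodes spaces
#and other reserved characters so the query survives the paste()
searchIMDBSafe = function(searchTerm){
  searchIMDB(URLencode(searchTerm, reserved = TRUE))
}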

And so running the script…

head(searchIMDB("crazy"))

gives me the following list of movie titles:

Source: local data frame [6 x 1]

movieTitles
1 C.R.A.Z.Y. (2005) aka “Crazy”
2 Crazy (I) (2000)
3 Crazy (II) (2008)
4 Crazy, Stupid, Love. (2011)
5 The Crazies (2010)
6 Like Crazy (2011)


First R function

One thing that I’ve always wanted is a function that runs corrplot on all the numerical variables in a data frame.

Since I’m still new to R, this is all I could come up with. It definitely needs some more work, but this’ll have to do for now:

explData = function(data, shape = "square", sig = 0.05, insign = "pch") {
  library(corrplot)
  library(dplyr)
  library(ggvis)

  #***1. Removing character columns***

  z = 0
  for(i in ncol(data):1){
    if(is.character(data[,i])){ #check the column's class
      data[,i] = NULL
      z = z + 1
    }
  }

  print(paste(z, "columns with class character were deleted.", sep = " "))

  y = 0 #Counter to tally how many columns were removed
  for(i in ncol(data):1){
    curr.column = data[,i]
    #If no value in the column can be coerced to a number...
    if(sum(is.na(suppressWarnings(as.numeric(as.character(curr.column))))) == length(curr.column)){
      data[,i] = NULL #...then delete the column
      y = y + 1 #Add one to counter
    }
  }

  print(paste(y, "factor columns were deleted. Could not coerce into integers.", sep = " ")) #Notify how many columns were removed

  a = 0
  for(i in ncol(data):1){
    #Constant columns have zero (or undefined) standard deviation and would break cor.test()
    if(is.na(sd(data[,i])) | sd(data[,i]) == 0){
      data[,i] = NULL
      a = a + 1
    }
  }

  print(paste(a, "columns with standard deviation equaling zero, were deleted.", sep = " ")) #Notify how many columns were removed
  #***_____________________________***


  #***2. Creating a dataframe of all p.values from correlation of all variables***
  corrs.pvalues = data.frame()
  for(i in 1:ncol(data)){
    for(j in 1:ncol(data)){
      corrs.pvalues[i,j] = cor.test(data[,i], data[,j])$p.value
    }
  }
  #***__________________________________________________________________________***


  corrs = cor(data) #matrix of correlation coefficients

  #***3. Plot the correlations, crossing out statistically insignificant relationships
  corrplot(corrs, p.mat = as.matrix(corrs.pvalues), sig.level = sig,
           method = shape, type = "lower", order = "FPC", addrect = 2,
           insig = insign)
}

#______________END!_____________________
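
A quick usage sketch, just to show the intended call; both datasets ship with base R:

explData(mtcars)            #all columns are numeric, so nothing gets dropped
explData(iris, sig = 0.01)  #the Species factor column gets filtered out first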