Latest Posts
After a couple of all-nighters, we're finally done with our undergraduate statistics thesis. The abstract provides a brief overview of what we set out to accomplish:
We explore the possibility of improving data analysis through the use of interactive visualization.
Exploration of data and models is an iterative process. We hypothesize that dynamic, interactive visualizations
greatly reduce the cost of new iterations and thus facilitate agile investigation and rapid prototyping. Our
web-application framework, flyvis.com, offers evidence for such a hypothesis for a dataset consisting of airline
on-time flight performance between 2006-2008. Utilizing our framework we are able to study the feasibility of modeling
subsets of flight delays from temporal data, which fails on the full dataset.
Although virtually obsolete, Roman Numerals are subtly embedded in our culture. From the Super Bowl and Olympics to royal titles, Roman Numerals refuse to be fully extinguished from our everyday lives. And that's not without reason. All numbers are beautiful, and Roman Numerals are no exception, even if they are written a little differently from their Arabic counterparts.
In this post, we'll examine some fascinating properties of Roman Numerals - namely the lengths of Roman Numerals in succession.
First, we define a simple Arabic --> Roman Numeral converter. Start by creating two vectors, one for the 13 Roman symbols and another for the Arabic counterparts. Next, a simple for/while combination iterates through the arrays and chooses the appropriate Roman symbols while iteratively decreasing the input variable.
arabic <- c(1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1)
roman <- c("M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I")
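Putting the pieces together, the converter described above looks something like this (the function name `to_roman` is mine, and the function carries its own copies of the two vectors so the snippet stands alone):

```r
to_roman <- function(n) {
  arabic <- c(1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1)
  roman  <- c("M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I")
  out <- ""
  for (i in seq_along(arabic)) {   # walk the symbols from largest to smallest
    while (n >= arabic[i]) {       # subtract the value while it still fits
      out <- paste0(out, roman[i])
      n <- n - arabic[i]
    }
  }
  out
}

to_roman(1987)  # "MCMLXXXVII"
```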
Many of my research and personal interests lie in the realm of Machine Learning for several reasons. To me, it is the perfect blend of mathematics, statistics, and computer science. Also, it is extremely pervasive in today's society. Everything from improved web-search to self-driving cars can be attributed to developments in Machine Learning.
For one of my computational finance classes, I attempted to implement a Machine Learning algorithm to predict stock prices, namely S&P 500 Adjusted Close prices. To do this, I turned to Artificial Neural Networks (ANNs) for a plethora of reasons. ANNs are known to work well for computationally intensive problems where the user may not have a clear hypothesis of how the inputs should interact. As such, ANNs pick up hidden patterns within the data so well that they often overfit!
Keeping this in mind, I experimented with a technique known as a 'sliding window'. Rather than train the model on years of S&P 500 data, I created an ANN that trains over the past 30 days (t-30, ..., t) to predict the close price at t+1. A 30-day sliding window seemed a good fit: narrow enough to capture the current market atmosphere, yet wide enough not to be hypersensitive to recent erratic movements.
Then, I had to decide on the input variables I was going to use. Many stock market models are pure autoregressive time-series functions, but a benefit of ANNs is that we can use them as a more traditional Machine Learning technique, with several inputs (and not only previous prices). This is helpful because there exists an extremely large number of technical indicators that may uncover some significance in the market. I defined several inputs of my own that I thought would be significant predictors, such as:
- 28 Day Moving Average
- 14 Day Moving Average
- 7 Day Moving Average
- Previous Day Close Price
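The inputs above can be assembled roughly as follows. This is a sketch under assumptions: `close` here is a simulated stand-in for the real price series, and the trailing moving average via `stats::filter` is my implementation, not necessarily the post's:

```r
set.seed(1)
close <- cumsum(rnorm(100, mean = 0.1)) + 1500  # simulated stand-in prices

moving_average <- function(x, k) {
  # trailing k-day moving average; NA until k observations are available
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

inputs <- data.frame(
  ma28       = moving_average(close, 28),
  ma14       = moving_average(close, 14),
  ma7        = moving_average(close, 7),
  prev.close = c(NA, head(close, -1))  # previous day's close price
)
```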
I'm going to share a short code snippet that I thought was interesting. This post is inspired by one of my engineering computation classes at Rice. The program initializes a pair of coordinates 'z' and iteratively updates z by matrix multiplication, with the update chosen by random number generation. After each successive coordinate update, the new 'z' is plotted.
library(ggplot2)
z <- matrix(c(0, 0), nrow = 2)
x <- c()
y <- c()
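The update loop itself isn't shown in the snippet. Here is a minimal sketch of the kind of iteration described, repeating the initialization above so it stands alone and using the classic Sierpinski-triangle 'chaos game' as a stand-in (the actual matrices and random criteria from the class assignment may differ):

```r
library(ggplot2)

z <- matrix(c(0, 0), nrow = 2)
x <- c()
y <- c()

vertices <- list(c(0, 0), c(1, 0), c(0.5, 1))
A <- matrix(c(0.5, 0, 0, 0.5), nrow = 2)  # contraction matrix: halve distances

for (i in 1:5000) {
  v <- vertices[[sample(3, 1)]]  # random criterion: pick a vertex at random
  z <- A %*% (z - v) + v         # matrix update: move halfway toward it
  x <- c(x, z[1])
  y <- c(y, z[2])
}

qplot(x, y, size = I(0.3))
```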
I've been using the Global Terrorism Database a lot lately so I decided to share an interesting plot I made with the data.
The GTD provides over 100,000 observations of terrorist incidents between 1970 and 2011. Of these, there are about 2400 observations in the USA. While this is not a large number, the graph still provides some interesting and intuitive results.
## Load libraries
library(ggplot2)
library(plyr)
library(maps)
library(stringr)
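A sketch of the kind of map described (assumptions: the GTD export is saved as "gtd.csv", and the column names `longitude`, `latitude`, and `country_txt` follow the GTD codebook; the post's actual script may differ):

```r
gtd <- read.csv("gtd.csv", header = TRUE)
usa <- subset(gtd, country_txt == "United States")  # ~2400 US incidents

states <- map_data("state")
ggplot() +
  geom_polygon(data = states, aes(long, lat, group = group),
               fill = "white", colour = "grey70") +
  geom_point(data = usa, aes(longitude, latitude),
             colour = "red", alpha = 0.3)
```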
Now that the Facebook Hacker Cup is coming to an end I figured I'd post my solution to one of the challenges. Unfortunately I only made it to Round 1, but I was able to answer this rather interesting Qualification Round problem.
The Problem:
When John was a little kid he didn't have much to do. There was no internet, no Facebook, and no programs to hack on.
So he did the only thing he could... he evaluated the beauty of strings in a quest to discover the most beautiful string in the world.
Given a string s, little Johnny defined the beauty of the string as the sum of the beauty of the letters in it.
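The full statement isn't quoted here, but in the well-known version of this problem each letter is assigned a distinct beauty from 1 to 26, chosen to maximize the total. The greedy solution: count letter frequencies (case-insensitively, ignoring non-letters), sort them in decreasing order, and pair them with 26, 25, and so on. A sketch:

```r
beauty <- function(s) {
  chars <- strsplit(tolower(gsub("[^A-Za-z]", "", s)), "")[[1]]
  freq <- sort(table(chars), decreasing = TRUE)  # most frequent letters first
  sum(as.numeric(freq) * seq(26, by = -1, length.out = length(freq)))
}

beauty("ABbCcc")  # 152: 'c' appears 3 times (worth 26), 'b' twice (25), 'a' once (24)
```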
I've been playing around with the 'twitteR' package for R ever since I heard of its existence. Twitter is great and easy to mine because the messages are all-text and most people's profiles are public. This process is made even easier with the 'twitteR' package, which takes advantage of the Twitter API.
After exploring some of the package's capabilities, I decided to conduct a pretty basic sentiment analysis on some tweets with various hashtags. Specifically, I analyzed the polarity of each tweet - whether the tweet is positive, negative, or neutral.
The hashtags I used were: #YOLO, #FML, #blessed, #bacon
The actual script is fairly simple and repetitive but does yield some interesting results:
library(twitteR)
library(sentiment)
library(ggplot2)
library(RJSONIO)
library(wordcloud)
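For one hashtag, the gist was roughly the following. This is a sketch under assumptions: it presumes Twitter OAuth is already configured for 'twitteR', and `classify_polarity` comes from the 'sentiment' package; the actual script repeats this for each tag:

```r
tweets <- searchTwitter("#YOLO", n = 200)         # pull recent tweets
texts <- sapply(tweets, function(t) t$getText())  # extract the raw text

# classify each tweet as positive, negative, or neutral
polarity <- classify_polarity(texts, algorithm = "bayes")[, "BEST_FIT"]

qplot(polarity, fill = polarity, main = "Polarity of #YOLO tweets")
```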
The inspiration for this post came as I was browsing texts and articles about USA's GDP and I wondered what might have a positive relationship to GDP that would be interesting to graph and explore.
I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough so I decided to write up a quick R program to see what I could find.
library(ggplot2)
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)
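A sketch of the comparison (the column names `State`, `Percent.Bachelors`, and `GDP.Per.Capita` are assumptions about the cleaned .csv files, not necessarily what the post used):

```r
edu.gdp <- merge(bachelors, GDP, by = "State")  # join the two tables on state

ggplot(edu.gdp, aes(x = Percent.Bachelors, y = GDP.Per.Capita)) +
  geom_point() +
  geom_smooth(method = "lm")  # linear trend line
```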
I was inspired by a few animated GIFs I saw recently, so I decided to make one of my own. For this project, I sought out a way to effectively visualize how McDonald's expanded throughout the world. To do this, I created a heatmap of the world and used animation to efficiently map out how McDonald's became more popular over time.
The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R, because some of the countries' spellings in the article did not match those used in the R 'maps' package.
library(ggplot2)
library(maps)
library(mapproj)
library(lubridate)
library(animation)
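The animation loop was presumably something along these lines. A sketch under assumptions: a data frame `mcd` with columns `region` and `year.opened` matching the 'maps' country names (both names are mine), with one GIF frame per five-year step:

```r
world <- map_data("world")

saveGIF({
  for (yr in seq(1955, 2010, by = 5)) {
    # mark countries that had a McDonald's by year `yr`
    world$has.mcd <- world$region %in% mcd$region[mcd$year.opened <= yr]
    p <- ggplot(world, aes(long, lat, group = group, fill = has.mcd)) +
      geom_polygon(colour = "grey50") +
      ggtitle(yr)
    print(p)  # each printed plot becomes one frame of the GIF
  }
}, movie.name = "mcdonalds.gif", interval = 0.5)
```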
In Problem 9 of Project Euler we are tasked with finding the product (abc) of the Pythagorean Triplet (a, b, c) such that a + b + c = 1000.
A Pythagorean triplet is a set of three natural numbers such that a^2 + b^2 = c^2.
To solve this problem, we first see that c = sqrt(a^2 + b^2). Without loss of generality, we need only run the for loop over a and b, since c is uniquely determined by any given a and b.
The code I used:
for (a in 1:499) {
  found <- FALSE
  for (b in 1:499) {
    c <- sqrt(a^2 + b^2)
    if (a + b + c == 1000) {
      print(a * b * c)
      found <- TRUE
      break
    }
  }
  if (found) break
}
Now that I'm done with finals, I finally have time to update my blog. I managed to find three separate Wikipedia entries: one about the Quality of Life Scores of several different countries, one about the number of Nobel Laureates per capita, and one that is a List of Countries by Income Inequality which uses Gini index to rank countries.
I then plotted the data to see if there was a discernible relationship between the three. Most of the work for this project had to do with merging and cleaning the data. I began by pasting the tables from the Wikipedia articles into a .csv file. Since the tables were all different lengths and some countries were missing values for some parameters, I had to tidy the dataset up quite a bit.
The result, featured below the code, is pretty interesting.
library(ggplot2)
nobel.data <- read.csv("nobel.csv", header = TRUE)
fixed.nobel.data <- matrix(nrow = 64, ncol = 4)
colnames(fixed.nobel.data) <- c("Country", "Laureates.Per.10.Million", "Quality.Of.Life.Score", "Gini.Score")
Today I'll be dealing with a dataset that has the price, carat, and several other attributes of almost 54,000 diamonds. It is publicly available in the ggplot2 package. Let's jump right into it.
library(ggplot2)
library(gridExtra)
data(diamonds)
plot1 <- qplot(cut, data = diamonds)
plot2 <- qplot(carat, data = diamonds, binwidth = .1)
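Since gridExtra is loaded above, the two histograms were presumably arranged together, e.g.:

```r
grid.arrange(plot1, plot2, ncol = 2)  # place the two histograms side by side
```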
Problem 8 of Project Euler asks us to find the greatest product of five consecutive digits of a 1000 digit number. The problem and 1000 digit number can be found here.
I first saved the number in a text file and used the scan function to import it into R. At first it is a character vector of 20 strings, because "scan" separates items based on line breaks. The paste function then allows me to concatenate the strings into one character string. After that, I use strsplit to separate each digit of the string into its own position in a vector. Finally, I convert the vector of digit strings into a numeric vector so we can use mathematical operations on it.
It is easy to see that if we take products of 5 consecutive digits of a number, there are a total of x - 4 products, where x is the total number of digits. (If it's a 5-digit number, only 1 product exists. For a 6-digit number, we can multiply digits 1 through 5 or 2 through 6, for 2 different products.)
Therefore, for this problem we had to consider 996 products. After allocating memory for an empty numeric vector, I ran a for loop 996 times to fill up the entries of the 'products' vector. The first entry became the product of the first 5 digits of our number, the second entry the product of digits 2-6, and so on.
string <- scan(file = "8.txt", what="")
string <- paste(string, collapse="")
string <- unlist(strsplit(string, split=""))
string <- as.numeric(string)
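The 996-iteration loop described above, written as a small function so it can be checked on a short example; applied to the `string` vector just built, it scans every 5-digit window:

```r
max_product <- function(digits, k = 5) {
  products <- numeric(length(digits) - k + 1)   # one entry per window
  for (i in seq_along(products)) {
    products[i] <- prod(digits[i:(i + k - 1)])  # product of digits i..i+k-1
  }
  max(products)
}

max_product(c(1, 2, 3, 4, 5, 6))  # 720, from the window 2*3*4*5*6
# max_product(string) gives the answer for the 1000-digit number
```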
I recently discovered an awesome R package called zipcode so I decided to play around with it a little bit. I found some IRS data on the 100 highest and 100 lowest income zip codes in the US. After cleaning up and modifying the data a little bit I plotted it onto a map projection of the US.
library(ggplot2)
library(maps)
library(zipcode)
data(zipcode)
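A sketch of the plotting step (assuming the cleaned IRS table is "irs.csv" with a `zip` column and an `income.group` factor; both names are mine). The `zipcode` data frame supplies latitude and longitude per ZIP code:

```r
irs <- read.csv("irs.csv", header = TRUE)
irs <- merge(irs, zipcode, by = "zip")  # attach latitude/longitude

ggplot(irs, aes(longitude, latitude, colour = income.group)) +
  borders("state", colour = "grey70") +  # US state outlines underneath
  geom_point()
```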
For this post, I attempted to reconstruct a famous visualization of Napoleon's March to Moscow. The French Invasion of Russia is considered a major turning point in the Napoleonic Wars. Up until that point, Napoleon's army was vast in size. By the end of his March on Moscow, the French army was reduced to a tiny fraction of its original size.

Pictured above is Charles Minard's flow map of Napoleon's march. It is simply amazing that such a detailed and innovative graphic was published in 1869 (way before the first computer). Minard was truly a pioneer in the use of graphics in engineering and statistics.
The troops text file contains the longitude and latitude of Napoleon's army, both on the attack and retreat, on his march. The cities file contains the longitudes and latitudes of a few major Russian cities that were in Napoleon's path.
library(ggplot2)
library(maps)
library(mapproj)
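The plotting code was presumably in the spirit of the well-known ggplot2 reconstruction of Minard's map (the column names `long`, `lat`, `survivors`, `direction`, and `group` for the troops file, and `long`, `lat`, `city` for the cities file, are assumptions about the data):

```r
troops <- read.table("troops.txt", header = TRUE)
cities <- read.table("cities.txt", header = TRUE)

ggplot(troops, aes(long, lat)) +
  geom_path(aes(size = survivors, colour = direction, group = group),
            lineend = "round") +                 # band width ~ army size
  geom_text(data = cities, aes(label = city), size = 3) +
  coord_map()
```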