Summer Reading on Data Science

Aug 6th, 2014

One of the benefits of being at Insight has been a reasonably large library stocked with great material for learning data science. If you’re looking to brush up on your skills or break into the industry, I recommend checking out the following:

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.
Russell, M. A. (2013). Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O’Reilly Media, Inc.
McDowell, G. L. (2013). Cracking the Coding Interview: 150 Programming Questions and Solutions. CareerCup.
Chang, W. (2012). R graphics cookbook. O’Reilly Media, Inc.

I actually read through Winston’s cookbook before Insight, and it’s been an invaluable resource ever since. Why write 20 lines of matplotlib or R base graphics when you can produce a better graph with 5 lines of ggplot2?

Recursion in R

Jul 26th, 2014

Most technical interviews will ask you to whiteboard some type of recursive function in your favorite programming language. Although Python seems to be the dominant language in data science, recursion can be a powerful tool in R as well.

What is recursion?

Recursive functions call themselves. That is, they break a problem down into smaller components, and the function calls itself on each of those components until it reaches a case simple enough to solve directly. Afterward, the results are put together to solve the original problem. Let’s take a look at more concrete examples.
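As a minimal sketch of the idea (the function name is only illustrative and is not taken from the full post), a factorial can be written recursively in R. The base case stops the recursion, and the recursive case hands a smaller version of the same problem back to the function:

```r
# Recursive factorial: n! = n * (n - 1)!, with 0! = 1 as the base case.
# `recursive_factorial` is an illustrative name chosen for this sketch.
recursive_factorial <- function(n) {
  # Only a single non-negative whole number makes sense here
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n == floor(n))
  if (n == 0) {
    return(1)                       # base case: nothing left to break down
  }
  n * recursive_factorial(n - 1)    # recursive case: same problem, one step smaller
}

recursive_factorial(5)  # 120
```

In practice R's built-in factorial() is the better choice; the point here is only the shape of a recursive function, which is what a whiteboard question usually looks for.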

Installing Old R Packages for New Installations

Jul 14th, 2014

New versions of R are pushed frequently to fix bugs and address performance concerns. However, in order to avoid conflicts between R and packages that were compiled for older versions of R, every upgrade defines a new system and user library location in which to install packages (e.g., /Library/Frameworks/R.framework/Versions/3.1/).
So how does one avoid installing each package manually?

I wrote the following code for my lab to automate the re-installation of an R system library after version upgrades. It reads the old package names into R as a character vector and recompiles each package for the new version of R, when available.

Re-installing R packages for newer versions

#!/usr/bin/env Rscript
# Automatic Package Reinstallation
# Author:  Jason A. French, Northwestern University
# GPL v2.0

# Locate the previous R version automatically (OS X framework layout)
versions.dir <- '/Library/Frameworks/R.framework/Versions'
versions <- setdiff(list.files(versions.dir), 'Current')
if (length(versions) < 2) stop('No previous R version found in ', versions.dir)
# Sort as version numbers so that, e.g., 3.10 comes after 3.9
versions <- versions[order(numeric_version(versions))]
previous.version <- versions[length(versions) - 1]
full.dir <- file.path(versions.dir, previous.version, 'Resources', 'library')
packages <- list.files(full.dir)

# Skip packages that already exist in the new library (e.g., base packages)
packages <- setdiff(packages, rownames(installed.packages()))

install.packages(packages, type = 'source')

update.packages(ask = FALSE)

Using R With MySQL Databases

Jul 3rd, 2014

Overview

When I began using R, like most researchers I kept all my data in some combination of R’s native data.frame format and CSV files that my analysis would continually read. However, as I began to analyze big datasets at the SAPA Project and at Insight, I realized that there is a lot of value in keeping your data in a MySQL database instead and streaming it into R when necessary. This post will briefly outline a few advantages of using a database to store data and run through a basic example of using R to transfer data to MySQL.

Fixing Knitr: Formatting Statistical Output to 2 Digits in R

Apr 25th, 2014

Overview of reproducible research

Reproducible research is a phrase that describes an academic paper or manuscript that contains the code and data in addition to what is usually published: the researcher’s interpretation. Because the full analysis used to produce the results is submitted along with the final paper, the experimental design and method of analysis are easily replicated by unaffiliated labs and critiqued by reviewers. One way of producing reproducible research is to use R code directly inside your LaTeX document. To facilitate the combination of statistical code and manuscript writing, two R packages in particular have arisen: Sweave and knitr. knitr is an R package designed as a replacement for Sweave, but both packages combine your R analysis with your LaTeX manuscript (i.e., knitr = R + LaTeX).

One advantage of knitr is that the researcher can easily create ANOVA and demographic tables directly from the data without messing around in Excel. However, as we’ll see, both knitr and Sweave can run into problems when formatting your table values to 2 decimal places. In this post, I’ll detail my proposed method of fixing that, which can be applied to your entire manuscript by editing the beginning of your knitr preamble.
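To illustrate one common form of the problem (an assumption on my part about what typically goes wrong, not necessarily the exact case covered in the full post), base R drops trailing zeros when a rounded number is converted to text, whereas formatC() keeps a fixed number of decimals. The helper name fmt2 is hypothetical:

```r
# `fmt2` is a hypothetical helper name used only for this illustration.
# formatC() with format = "f" always prints the requested number of decimals.
fmt2 <- function(x) formatC(x, format = "f", digits = 2)

as.character(round(3.10, 2))  # "3.1"   (trailing zero is dropped)
fmt2(3.10)                    # "3.10"
fmt2(12.5)                    # "12.50"
```

A helper along these lines could be registered once near the top of a document with knitr's knit_hooks$set(inline = ...), so that every inline numeric result is formatted the same way.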

Faster Tetrachoric Correlations

Nov 7th, 2013

What are tetra- and polychoric correlations?

Polychoric correlations estimate the correlation between two theorized normal distributions given two ordinal variables. In psychological research, much of our data fits this definition. For example, many survey studies used with introductory psychology pools use Likert scale items. The responses to these items typically range from 1 (Strongly disagree) to 6 (Strongly agree). However, we don’t really think that a person’s relationship to the item is actually polytomous. Instead, it’s an imperfect approximation.

Similarly, tetrachorics are a special case of polychoric correlations in which the variables of interest are dichotomous. The participant may have gotten the item either correct (i.e., 1) or incorrect (i.e., 0), but the underlying knowledge that led to the item’s response is probably a continuous distribution.

When you have polytomous rating scales but want to disattenuate the correlations to more accurately estimate the correlation between the latent continuous variables, one way of doing this is to use a tetrachoric or polychoric correlation coefficient.

The problem

At the SAPA Project, the majority of our data is polytomous. We ask you the degree to which you like to go to lively parties to estimate your score on latent extraversion. Presently, we use mixed.cor(), which calls a combination of the tetrachoric() and polychoric() functions in the psych package (Revelle, W., 2013).

However, each time we build a new dataset from the website’s SQL server, it takes hours. And that’s if everything goes well. If there’s an error in the code or a bug in a new function, it may take hours to hit the error, wasting your day.

After a bit of profiling, we found that much of the time spent building the SAPA dataset went to estimating the tetrachoric and polychoric correlation coefficients. When you do this for 250,000+ participants and 10,000+ variables, it takes a long time. So Bill and I thought about how we could speed them up, and we feel others may benefit from our optimization.

With the help of Bill Revelle, tetrachoric() and polychoric() received a serious speedup. The reduction in run time is roughly 1 - (nc-1)² / nc², where nc is the number of categories. Thus, for tetrachorics where nc=2, this is a 75% reduction, whereas for polychorics of 6 item responses this is just a 30% reduction.
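The arithmetic behind those two figures can be checked with a short sketch; the function name is illustrative, and it simply evaluates the formula quoted above:

```r
# Approximate proportional reduction from the formula in the text,
# where nc is the number of response categories.
# `reduction` is an illustrative name for this sketch.
reduction <- function(nc) 1 - (nc - 1)^2 / nc^2

reduction(2)            # 0.75 for dichotomous items (tetrachoric)
round(reduction(6), 2)  # 0.31 for six-category items, i.e. roughly the 30% quoted above
```

The savings shrink quickly as the number of categories grows, which is why the gain is largest for tetrachorics.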

Analyze Student Exam Items Using IRT

Oct 25th, 2013

I’ve cross-posted this at the SAPA Project’s blog.

Item Response Theory can be used to evaluate the effectiveness of exams given to students. One distinguishing feature from other paradigms is that it does not assume that every question is equally difficult (or that the difficulty is tied to what the researcher said). In this way, it is an empirical investigation into the effectiveness of a given exam and can help the researcher 1) eliminate bad or problematic items and 2) judge whether the test was too difficult or the students simply didn’t study.
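To make the idea of item-specific difficulty concrete, here is a minimal sketch of a two-parameter logistic item response function in base R. This is a standard textbook form of IRT and is an assumption for illustration; it is not necessarily the model estimated in the tutorial, and the names are hypothetical:

```r
# Two-parameter logistic (2PL) item response function.
# theta: student ability; a: item discrimination; b: item difficulty.
# `p_correct` and its argument names are illustrative.
p_correct <- function(theta, a = 1, b = 0) plogis(a * (theta - b))

p_correct(theta = 0, b = 0)   # 0.5: ability equals the item's difficulty
# A harder item (larger b) gives the same student a lower chance of success
p_correct(theta = 0, b = 1) < p_correct(theta = 0, b = -1)  # TRUE
```

Because each item has its own difficulty and discrimination, a poorly behaving item shows up as an unusual curve rather than being averaged into a total score.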

In the following tutorial, we’ll use R (R Core Team, 2013) along with the psych package (Revelle, W., 2013) to look at a hypothetical exam.

Easy Sweave for LaTeX and R

Aug 16th, 2013

When you’re writing up reports using statistics from R, it can be tiresome to constantly copy and paste results from the R Console. To get around this, many of us use Sweave, which allows us to embed R code in LaTeX files. Sweave is an R function that runs the R code embedded in a document and weaves its output into LaTeX, a document typesetting language. This enables accurate, shareable analyses as well as high-resolution graphs that are publication quality.

Needless to say, the marriage of statistics with documents makes writing up APA-style reports a bit easier, especially with Brian Beitzel’s amazing apa6 class for LaTeX.

Installing R in Linux

Mar 11th, 2013

This guide is intended to facilitate the installation of up-to-date R packages for users new to either R or Linux. Unlike Windows binaries or Mac packages, Linux software is often distributed as source code and then compiled by package maintainers. The use of package managers has many advantages that I won’t discuss here (see Wikipedia).
More importantly, the difference can be initially intimidating.
However, once the user gets used to using package managers such as apt or yum to install software, I’m confident they’ll appreciate their ease of use.

These instructions are organized by system type.

Debian-based Distributions

Ubuntu

Full installation instructions for Ubuntu can be found here. Luckily, CRAN mirrors have compiled binaries of R which can be installed using the apt-get package manager. To accomplish this, we’ll first add the CRAN repo for Ubuntu packages to /etc/apt/sources.list. If you prefer to manually edit the sources.list file, you can do so by issuing the following in the terminal:

Inspecting sources.list

sudo nano /etc/apt/sources.list

Installing R in Ubuntu

# Grabs your Ubuntu codename (e.g., precise) as a Bash variable
CODENAME=$(lsb_release -cs)

# Appends the CRAN repository to your sources.list file.
# tee runs under sudo, while $CODENAME is expanded by your own shell.
echo "deb http://cran.rstudio.com/bin/linux/ubuntu $CODENAME/" | sudo tee -a /etc/apt/sources.list

# Adds the CRAN GPG key, which is used to sign the R packages for security.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

sudo apt-get update
sudo apt-get install r-base r-base-dev

Faster SSCC Access Using Bash

Apr 5th, 2012

I use SSH regularly to log in remotely to servers for experiments and data analysis. For instance, Northwestern’s Social Sciences Computing Cluster is available over SSH, and with X11 forwarding I can access RStudio and run analyses that require more memory than my office iMac has. However, logging into the SSCC over SSH isn’t as quick as launching a program in Spotlight.

While browsing a friend’s .bashrc on Github, I realized I could use a simple Bash function to speed things up. Copy and paste the following into Terminal:

Launching RStudio Remotely over SSH

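cat >> ~/.profile << 'EOF'
# -X enables X11 forwarding and -C compresses the connection.
# Newer OpenSSH releases no longer offer these ciphers; drop the -c option there.
Rsscc() {
    ssh -c arcfour,blowfish-cbc \
        -XC netid@hardin.it.northwestern.edu rstudio
}
EOF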

After you restart Terminal.app, you can launch RStudio remotely by typing Rsscc, or whatever you renamed my function to. In principle, you could also create a simple menu for choosing among multiple servers or programs using a bit of read and case.

Creating a Simple Command Menu

task_menu () {
cat << EOF
$(tput setaf 5)Remote Login$(tput sgr0)
$(tput setaf 5)============$(tput sgr0)

Please choose one of the following:

1) RStudio
2) Stata

EOF
  read -r choice
  case "$choice" in
      1) task="rstudio" ;;
      2) task="xstata" ;;
      # Ask again on invalid input, then stop so ssh is not run a second time
      *) echo "Please choose a number!"; task_menu; return ;;
  esac
  ssh -c arcfour,blowfish-cbc \
      -XC netid@hardin.it.northwestern.edu "$task"
}

Note: This works best if you’re using an up-to-date version of X11, such as XQuartz, and are accessing the SSCC over Ethernet.
