Jason A. French

Northwestern University

Faster Tetrachoric Correlations


What are tetra- and polychoric correlations?

Polychoric correlations estimate the correlation between two theorized normal distributions given two ordinal variables. In psychological research, much of our data fits this definition. For example, many survey studies used with introductory psychology pools use Likert scale items, with responses typically ranging from 1 (Strongly disagree) to 6 (Strongly agree). However, we don't really think that a person's relationship to the item is actually polytomous; instead, the ordinal response is an imperfect approximation of a continuous latent variable.

Similarly, tetrachorics are special cases of polychoric correlations where the variable of interest is dichotomous. The participant may have gotten the item either correct (i.e., 1) or incorrect (i.e., 0), but the underlying knowledge that led to the item's response is probably a continuous distribution.

When you have polytomous rating scales but want to disattenuate the correlations to more accurately estimate the correlation between the latent continuous variables, one way of doing so is to use a tetrachoric or polychoric correlation coefficient.
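As a concrete illustration, here is a minimal sketch using the tetrachoric() function from the psych package on simulated data (the latent correlation of .5 and the cut point are arbitrary choices for the example):

```r
library(psych)
library(MASS)

# Simulate two correlated latent normal variables (rho = .5)
set.seed(42)
latent <- mvrnorm(n = 1000, mu = c(0, 0),
                  Sigma = matrix(c(1, .5, .5, 1), 2))

# Dichotomize them, as if scoring items correct/incorrect
items <- data.frame(x = as.numeric(latent[, 1] > 0),
                    y = as.numeric(latent[, 2] > 0))

# The Pearson correlation of the 0/1 items is attenuated...
cor(items$x, items$y)

# ...while the tetrachoric correlation recovers an estimate
# much closer to the latent correlation of .5
tetrachoric(items)$rho
```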

The problem

At the SAPA Project, the majority of our data is polytomous. We ask you the degree to which you like to go to lively parties in order to estimate your score on latent extraversion. Presently, we use mixed.cor(), which calls a combination of the tetrachoric() and polychoric() functions in the psych package (Revelle, W., 2013).

However, each time we build a new dataset from the website’s SQL server, it takes hours. And that’s if everything goes well. If there’s an error in the code or a bug in a new function, it may take hours to hit the error, wasting your day.

A bit of profiling revealed that much of the time spent building the SAPA dataset went to estimating the tetrachoric and polychoric correlation coefficients. When you do this for 250,000+ participants on 10,000+ variables, it takes a long time. So Bill and I thought about how to speed them up, and we suspect others may benefit from our optimization.

A serious speedup to tetrachoric and polychoric was initiated with the help of Bill Revelle. The reduction in computation time is roughly 1 - (nc - 1)^2 / nc^2, where nc is the number of categories. Thus, for tetrachorics, where nc = 2, this is a 75% reduction, whereas for polychorics of items with six response categories it is only about a 30% reduction.
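As a quick check of the formula above, the expected reduction can be computed directly:

```r
# Approximate reduction in computation time as a function of
# the number of response categories
speedup <- function(nc) 1 - (nc - 1)^2 / nc^2

speedup(2)  # tetrachoric: 0.75
speedup(6)  # six-category polychoric: about 0.31
```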

Analyze Student Exam Items Using IRT


I’ve cross-posted this at the SAPA Project’s blog.

Item Response Theory can be used to evaluate the effectiveness of exams given to students. One distinguishing feature from other paradigms is that it does not assume that every question is equally difficult (or that the difficulty is whatever the test's author intended). In this way, it is an empirical investigation into the effectiveness of a given exam and can help the researcher 1) eliminate bad or problematic items and 2) judge whether the test was too difficult or the students simply didn't study.

In the following tutorial, we’ll use R (R Core Team, 2013) along with the psych package (Revelle, W., 2013) to look at a hypothetical exam.
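A minimal sketch of the kind of analysis the tutorial walks through, using simulated dichotomous items and the psych package (the simulation settings here are arbitrary choices for illustration):

```r
library(psych)

# Simulate 500 students answering 10 dichotomous exam items
# with psych's built-in IRT data simulator
set.seed(1)
exam <- sim.irt(nvar = 10, n = 500)

# Fit an IRT model via factor analysis of tetrachoric correlations
fit <- irt.fa(exam$items)

# Inspect item difficulty and discrimination estimates
fit$irt

# Item characteristic curves show where each item discriminates best
plot(fit, type = "ICC")
```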

Easy Sweave for LaTeX and R


When you're writing up reports using statistics from R, it can be tiresome to constantly copy and paste results from the R Console. To get around this, many of us use Sweave, which allows us to embed R code in LaTeX files. Sweave is an R tool that evaluates R code embedded in a LaTeX document and weaves the results into the typeset output. This enables accurate, shareable analyses as well as high-resolution graphs that are publication quality.

Needless to say, the marriage of statistics with documents makes writing up APA-style reports a bit easier, especially with Brian Beitzel’s amazing apa6 class for LaTeX.
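For instance, a minimal .Rnw file might look like the following (compile with R CMD Sweave and then pdflatex; the iris example is just a stand-in dataset):

```latex
\documentclass{article}
\begin{document}

% Inline result: \Sexpr{} drops an R value into the prose
The mean petal length was \Sexpr{round(mean(iris$Petal.Length), 2)} cm.

% Code chunk: fig=TRUE includes the plot in the PDF
<<echo=FALSE, fig=TRUE>>=
hist(iris$Petal.Length, main = "", xlab = "Petal length (cm)")
@

\end{document}
```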

Installing R in Linux


This guide is intended to facilitate the installation of up-to-date R packages for users new to either R or Linux. Unlike Windows binaries or Mac packages, Linux software is often distributed as source code and then compiled by package maintainers. The use of package managers has many advantages that I won't discuss here (see Wikipedia). More importantly, the difference can be initially intimidating. However, once the user gets used to using package managers such as apt or yum to install software, I'm confident they'll appreciate their ease of use.

These instructions are organized by system type.

Debian-based Distributions

Ubuntu

Full installation instructions for Ubuntu can be found here. Luckily, CRAN mirrors have compiled binaries of R which can be installed using the apt-get package manager. To accomplish this, we’ll first add the CRAN repo for Ubuntu packages to /etc/apt/sources.list. If you prefer to manually edit the sources.list file, you can do so by issuing the following in the terminal:

Inspecting sources.list
sudo nano /etc/apt/sources.list
Installing R in Ubuntu
# Grabs your version of Ubuntu as a BASH variable
CODENAME=`grep CODENAME /etc/lsb-release | cut -c 18-`

# Appends the CRAN repository to your sources.list file
# (double quotes so $CODENAME is expanded by your shell)
sudo sh -c "echo 'deb http://cran.rstudio.com/bin/linux/ubuntu $CODENAME/' >> /etc/apt/sources.list"

# Adds the CRAN GPG key, which is used to sign the R packages for security
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

sudo apt-get update
sudo apt-get install r-base r-base-dev

Faster SSCC Access Using Bash


I use SSH regularly to log in remotely to servers for experiments and data analysis. For instance, Northwestern's Social Sciences Computing Cluster is available via SSH remote login, and using X11 forwarding, I can access RStudio and run analyses that require more memory than my office iMac has. However, logging into the SSCC over SSH isn't as quick as launching a program in Spotlight.

While browsing a friend’s .bashrc on Github, I realized I could use a simple Bash function to speed things up. Copy and paste the following into Terminal:

Launching RStudio Remotely over SSH
cat >> ~/.profile <<'EOF'
function Rsscc() {
    ssh -c arcfour,blowfish-cbc \
        -XC netid@hardin.it.northwestern.edu rstudio
}
EOF

After you restart Terminal.app, you can launch RStudio remotely by typing Rsscc, or whatever you renamed my function to. In principle, you could also create a simple menu for choosing among multiple servers or programs using a bit of read and case.

Creating a Simple Command Menu
task_menu () {
cat << EOF
$(tput setaf 5)Remote Login$(tput sgr0)
$(tput setaf 5)============$(tput sgr0)

Please choose one of the following:

1) RStudio
2) Stata

EOF
  read -r choice
  case "$choice" in
      1) task="rstudio" ;;
      2) task="xstata" ;;
      *) echo "Please choose a number!" && task_menu ;;
  esac
  ssh -c arcfour,blowfish-cbc \
      -XC netid@hardin.it.northwestern.edu "$task"
}

Note: This works best if you're using an up-to-date version of X11, such as XQuartz, and are accessing the SSCC over Ethernet.

Analyzing Qualtrics Data in R Using Github Packages


Qualtrics is an online survey platform similar to SurveyMonkey that is used by researchers to collect data. Until recently, one had to manually download the data in either SPSS or .csv format, making it difficult to check on an ongoing basis whether the trend of the incoming data supports the hypothesis.

Jason Bryer has recently developed an R package published to Github for downloading data from Qualtrics within R using the Qualtrics API (see his Github repo). Using this package, you can integrate your Qualtrics data with other experimental data collected in the lab and, by running an Rscript as a cronjob, get daily updates for your analyses in R. I’ll demonstrate the use of this package below.
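The basic workflow looks something like the sketch below. The function names and arguments here are illustrative, based on the package at the time of writing; check the repo's README for the current API, and note that the credentials are placeholders:

```r
# install.packages("devtools")
library(devtools)
install_github("qualtrics", "jbryer")
library(qualtrics)

# Illustrative calls -- see the package README for current
# function names and where to find your Qualtrics API credentials
surveys   <- getSurveys(username = "you@u.northwestern.edu",
                        password = "your_api_token")
responses <- getSurveyResults(username = "you@u.northwestern.edu",
                              password = "your_api_token",
                              surveyid = surveys$SurveyID[1])
head(responses)
```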

Graphing Error Bars for Repeated-Measures Variables With Ggplot2


When presenting data, confidence intervals and error bars let the audience know the amount of uncertainty in the data, and see how much of the variance is explained by the reported effect of an experiment. While this is straightforward for between-subject variables, it’s less clear for mixed- and repeated-measures designs.

Consider the following. When running an ANOVA, the test accounts for three sources of variance: 1) the fixed effect of the condition, 2) the ability of the participants, and 3) the random error, as data = model + error. Plotting the repeated measures without taking the different sources of variance into consideration would result in overlapping error bars that include between-subject variability, confusing the presentation's audience. While the ANOVA partials out the differences between the participants and allows you to assess the effect of the repeated measure, computing a regular confidence interval from the standard error doesn't work in this way.

Winston Chang has developed a set of R functions on his wiki, based on Morey (2008) and Cousineau (2005), that deal with this problem: the data are first normalized to remove between-subject variability, and the sample variance in each condition is then multiplied by M/(M-1), where M is the number of within-subject conditions.

See his wiki here for more info.
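The core of the normalization can be sketched in a few lines. This assumes long-format data with columns named subject and value (hypothetical names for this example); Chang's functions add the M/(M-1) correction and the summary statistics on top of this:

```r
# Cousineau (2005)-style normalization: remove between-subject
# variability by subtracting each subject's mean and adding
# back the grand mean
normalize_within <- function(df) {
  subj_means <- ave(df$value, df$subject)
  df$value_norm <- df$value - subj_means + mean(df$value)
  df
}

# Morey (2008) correction: multiply each condition's variance of
# the normalized data by M / (M - 1), M = number of conditions
morey_correct <- function(v, M) v * M / (M - 1)
```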

Using Figures Within Tables in LaTeX


By using LaTeX to author APA manuscripts, researchers can address many problems associated with formatting their results into tables and figures. For example, ANOVA tables can be readily generated using the xtable package in R, and graphs from ggplot2 can be rendered within the manuscript using Sweave (see Wikipedia). However, more complicated layouts can be difficult to achieve.

In order to make test items or stimuli easier to understand, researchers occasionally organize examples in a table or figure. Using the standard table environment in LaTeX, it's possible to include figures in individual table cells without breaking the apa6.cls class. For example:
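A minimal sketch of the idea follows; the image file names are placeholders for your own stimulus graphics:

```latex
\begin{table}[htbp]
  \caption{Example stimuli by condition}
  \begin{tabular}{lc}
    \hline
    Condition & Example stimulus \\
    \hline
    % Each cell can hold an \includegraphics call
    Congruent   & \includegraphics[width=2cm]{stim-congruent}   \\
    Incongruent & \includegraphics[width=2cm]{stim-incongruent} \\
    \hline
  \end{tabular}
\end{table}
```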