Example re-analysis of hemagglutination inhibition data from an ImmPort Study

Version 2.0, last modified April 2017 Python 3.6 Numpy 1.12.1 Pandas 0.19.2 Plotly 1.12.9 Scipy 0.19.0

What is ImmPort ?

The Immunology Database and Analysis Portal (ImmPort) project provides tools and services for the collection, QC, curation, and sharing of immunological data. The purpose of ImmPort is to provide a long term, sustainable source for promoting the re-use of these data provided by NIAID DAIT and DMID funded investigators.

The study chosen for this tutorial is one of 229 currently available for download as of DR20.

Title: Apoptosis and other immune biomarkers predict influenza vaccine (TIV 2008) responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to indentify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.

We will focus on the HAI results for this tutorial.


Study packages can be downloaded from ImmPort after free registration. Study download packages are available in several formats. For this tutorial we will be using tab-separated text files (meaning a "tab" character is used as the column separator), found in the following zip archive:
SDYxxx-DRnn_Tab.zip where xxx is the study accession and nn the data release version.
Details on the tables included in this archive are available here.

To get the archive, go to the study page, in our case SDY212, and click on 'download'. If you haven't registered to ImmPort yet, you will need to create an account (it only takes a minute!). You might need to install Aspera if you haven't already.
Once logged in, you should see a list of directories and files. Look for SDY212-DR20_Tab.zip and click on it to download.

For this tutorial, we will also use python packages: Numpy, Scipy, Pandas and Plotly. Each of these packages should be installed for this notebook to run smoothly. For instructions on how to do this with conda, check out this tutorial.

Additional Resources

Jupyter notebooks can be used for data analysis and as a teaching tool. There are a multitude of very good tutorials online about how to use notebooks, below are a selected few:

  • Teaching Numerical Methods with IPython Notebooks by D.I. Ketcheson and A. Ahmadia at the 2014 SciPy conference.
    Part 1 Part 2 Part 3

  • Statistcial Data Analysis in Python by C. Fonnesbeck at the 2013 SciPy conference.
    Part 1 Part 2 Part 3 Part 4.

  • Exploring Machine Learning with Scikit-learn by J. Vanderplas and O. Grisel at the 2014 PyCon.
    Part 1

A more extensive list of interesting tutorials can be found here.


Load in the Python libraries used for this analysis

In [1]:
import numpy as np
import pandas as pd
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from scipy import stats

Explore the Overall Study Design and Subject Characteristics

In this section you will be introduced to some of the content available for download from ImmPort, which can be used for re-analysis of the study. We demonstrate here how to load in information from tab-separated text files, merge their content and generate simple descriptive statistics.
The files needed for this section can be found in the SDY212-DR20_Tab.zip archive. Once this archive is unzipped, you should have a new folder called SDY212-DR20, itself containing a 'Tab' folder. This is where you will find the following files:

  • subjects.txt - general subject demographic information.
  • arm_2_subject.txt - mapping from subjects to study arms/cohorts.
  • arm_or_cohort - study arm names and descriptions

Load in study files

To load the files we'll be using in memory, you can either type in the path to your files, or use the input() function, which in a Jupyter notebook will print a question, and capture your answer in a text box as a string. Either line of code works, though note that tab-complete will work in a Jupyter notebook if you hit the tab key within double quotes, but not with the input() function text box.
In the three code cells below, don't forget to change the path to where your files are located!

In [2]:
#subject_file = input("Enter the path to subject.txt: ")
subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/subject.txt"
In [3]:
#arm_2_subject_file = input("Enter the path to arm_2_subject.txt: ")
arm_2_subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_2_subject.txt"
In [4]:
#arm_or_cohort_file = input("Enter the path to arm_or_cohort.txt: ")
arm_or_cohort_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_or_cohort.txt"

The files are read in as pandas dataframe with the pd.read_table() function:

In [5]:
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")

Review file content

The pandas method df.head(), where 'df' indicates a dataframe, returns the first 5 rows of a dataframe object and is a good first step to look briefly at the data you will be working with:

In [6]:
0 SUB134268 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
1 SUB134304 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female Asian NaN Homo sapiens NaN NaN 2883
2 SUB134309 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male White NaN Homo sapiens NaN NaN 2883
3 SUB134324 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male Other White, Asian Homo sapiens NaN NaN 2883
4 SUB134240 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
In [7]:
0 ARM894 SUB134323 Age at Study Day 0 NaN Years 23.82 23.82 Non-twin
1 ARM894 SUB134252 Age at Study Day 0 NaN Years 24.15 24.15 Non-twin
2 ARM895 SUB134256 Age at Study Day 0 NaN Years 83.89 83.89 Non-twin
3 ARM895 SUB134262 Age at Study Day 0 NaN Years 84.17 84.17 Non-twin
4 ARM895 SUB134265 Age at Study Day 0 NaN Years 85.82 85.82 Non-twin
In [8]:
0 ARM895 Older participants aged 60 to 89 years, vaccin... Cohort_2 SDY212 Experimental 2883
1 ARM894 Young participants aged 20 to 30 years, vaccin... Cohort_1 SDY212 Experimental 2883

Pandas dataframes column headings are also accessible as follows:

In [9]:
In [10]:
In [11]:

To get an idea of the number of rows and/or columns in a table, you can use shape. The first dimension reported by shape is the number of rows and the second one is the number of columns.
Note here that the number of rows reported for the subjects and arms_2_subject files match. In these two files, each row corresponds to a subject. The fact that they have the same number of rows is a good indicator that there should be a one to one mapping between the two files.

In [12]:
(91, 11)
In [13]:
(91, 8)
In [14]:
(2, 6)

To make sure that the set of subject accessions in each table contains only one instance of each subject, you can take advantage of the built-in Python function set(), and of the built-in Pandas access to column content through column header. We can compare the number of unique elements returned by set() to the number of elements including potential duplicates. If they are the same, there are no duplicates.

In [15]:
if len(set(subjects.SUBJECT_ACCESSION)) != len(subjects.SUBJECT_ACCESSION):
    print("some subject accessions are duplicated in subject.txt")
In [16]:
if len(set(arm_2_subject.SUBJECT_ACCESSION)) != len(arm_2_subject.SUBJECT_ACCESSION):
    print("some subject accessions are duplicated in arm_2_subject.txt")

Of course, to determine beyond any doubt that the set of subject accessions are the same in both tables, you need to compare them. The pandas df.isin() method checks if all elements from one set exist in another given set. The resulting structure contains boolean ('True' or 'False') which can then be counted with df.sum(). Here, we expect the sum to be the same than the number of subject accessions (in other words, the number of rows in subjects), because 'True' counts as 1.

In [17]:
df = subjects.SUBJECT_ACCESSION.isin(arm_2_subject.SUBJECT_ACCESSION)

*Side note: ImmPort data is de-identified and curated such that accessions are unique identifiers*

An alternative to df.head() is to review the content by looking at a single row in each file with df.ix. This makes looking at many columns a bit easier.

In [18]:
SUBJECT_ACCESSION                                                 SUB134268
ANCESTRAL_POPULATION                                                    NaN
DESCRIPTION               This subject record was used to consolidate du...
ETHNICITY                                            Not Hispanic or Latino
GENDER                                                               Female
RACE                                                                  White
RACE_SPECIFY                                                            NaN
SPECIES                                                        Homo sapiens
STRAIN                                                                  NaN
STRAIN_CHARACTERISTICS                                                  NaN
WORKSPACE_ID                                                           2883
Name: 0, dtype: object
In [19]:
ARM_ACCESSION                    ARM894
SUBJECT_ACCESSION             SUB134326
AGE_EVENT            Age at Study Day 0
AGE_EVENT_SPECIFY                   NaN
AGE_UNIT                          Years
MAX_SUBJECT_AGE                   25.39
MIN_SUBJECT_AGE                   25.39
SUBJECT_PHENOTYPE              Non-twin
Name: 10, dtype: object

A pandas dataframe content can also be accessed by using the headers. Note that while ix returns the full content of a row (one value per column), the following returns all rows for the specified columns (one value per row):

In [20]:
0 ARM895 Cohort_2 Older participants aged 60 to 89 years, vaccin...
1 ARM894 Cohort_1 Young participants aged 20 to 30 years, vaccin...

In this particular case, the names of each arm aren't very descriptive, We can add a column to the dataframe with a more meaningful name:

In [21]:
arm_or_cohort['ARM_NAME'] = ['Old','Young']
In [22]:
0 ARM895 Cohort_2 Old Older participants aged 60 to 89 years, vaccin...
1 ARM894 Cohort_1 Young Young participants aged 20 to 30 years, vaccin...

Merge data into one table

The goal of this section is to create one dataframe containing only the information we want to proceed. We've seen above that subjects and arm_2_subject contain information pertaining to the same sets of subject accessions, so we know we can merge them together along subject accession with little issues.
The other table we have, arm_or_cohort, contains information about each arm mentioned in the mapping table, arm_2_subject. We can merge them together thanks to the arm accession.
There are several methods in pandas to merge dataframes together. Here, we use the function pd.merge(), which does all the hard work of merging gracefully together the dataframes.

In [23]:
# as a reminder:
print(subjects.columns, arm_or_cohort.columns, arm_2_subject.columns, sep="\n")
In [24]:
subjects_merged_1 = pd.merge(subjects, arm_2_subject, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
subjects_merged_2 = pd.merge(subjects_merged_1, arm_or_cohort, left_on='ARM_ACCESSION',right_on='ARM_ACCESSION')
SUBJECT_ACCESSION                                                 SUB134268
ANCESTRAL_POPULATION                                                    NaN
DESCRIPTION_x             This subject record was used to consolidate du...
ETHNICITY                                            Not Hispanic or Latino
GENDER                                                               Female
RACE                                                                  White
RACE_SPECIFY                                                            NaN
SPECIES                                                        Homo sapiens
STRAIN                                                                  NaN
STRAIN_CHARACTERISTICS                                                  NaN
WORKSPACE_ID_x                                                         2883
ARM_ACCESSION                                                        ARM894
AGE_EVENT                                                Age at Study Day 0
AGE_EVENT_SPECIFY                                                       NaN
AGE_UNIT                                                              Years
MAX_SUBJECT_AGE                                                       29.71
MIN_SUBJECT_AGE                                                       29.71
SUBJECT_PHENOTYPE                                                  Non-twin
DESCRIPTION_y             Young participants aged 20 to 30 years, vaccin...
NAME                                                               Cohort_1
STUDY_ACCESSION                                                      SDY212
TYPE                                                           Experimental
WORKSPACE_ID_y                                                         2883
ARM_NAME                                                              Young
Name: 0, dtype: object

Note in the list of columns above that pandas modified the name of some columns by appending a '_x' or '_y'.

The new dataframe created, subjects_merged contains all columns from all 3 tables, most of which are irrelevant at this time. We can drop the unnecessary columns by creating a new dataframe with only the columns of interest as follows:

In [25]:
subjects_merged = subjects_merged_2[['SUBJECT_ACCESSION','ARM_ACCESSION','ARM_NAME','GENDER','RACE','MIN_SUBJECT_AGE']]
(91, 6)
In [26]:
ARM_ACCESSION           ARM894
ARM_NAME                 Young
GENDER                  Female
RACE                     White
MIN_SUBJECT_AGE          29.71
Name: 0, dtype: object

Simple descriptive statistics

Below we show some descriptive statistics just to get a feel for the distribution of age, gender and race.

We are going to use two pandas methods here, chained by a dot operator. The object returned by the first function is taken as an argument by the next function. For instance below, the df.groupby() method groups rows by values contained in the specified column. It returns an object on which the next function, df.count(), operates. The df.count() method returns the number of observations in a given column.

  • How many subjects are in each arm?
In [27]:
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64
  • How many subjects in each gender?
In [28]:
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64
  • What is the race distribution overall?
In [29]:
American Indian or Alaska Native     1
Asian                                8
Other                                7
White                               75
Name: SUBJECT_ACCESSION, dtype: int64
  • What is the age distribution in each arm?

df.describe() is a very useful method which returns standard summary statistics (count, mean, median, standard dev, quartiles) on a given series of numbers.
Note in the following example that the only column in our dataframe containing numbers is MIN_SUBJECT_AGE, so it is the only one returned.

In [30]:
subjects_by_arm = subjects_merged.groupby('ARM_NAME')
Old count 61.000000
mean 77.151803
std 9.558915
min 61.200000
25% 68.080000
50% 78.750000
75% 85.170000
max 90.000000
Young count 30.000000
mean 25.403000
std 2.670702
min 20.680000
25% 23.882500
50% 25.050000
75% 27.115000
max 30.750000

Simple Descriptive Plots

In this version of the tutorial, we will be using the Plotly library for graphs instead of MatPlotLib. For more advanced users, there is a way to feed directly pandas dataframe into Plotly by using yet another library.

Plotly generates interactive graphs. By default, plots are saved to a server in a Plotly account that you have to set up. To enable local generation of your plots instead, use the following line:

In [31]: