Version 2.0, last modified April 2017 Python 3.6 Numpy 1.12.1 Pandas 0.19.2 Plotly 1.12.9 Scipy 0.19.0
The Immunology Database and Analysis Portal (ImmPort) project provides tools and services for the collection, QC, curation, and sharing of immunological data. The purpose of ImmPort is to provide a long term, sustainable source for promoting the re-use of these data provided by NIAID DAIT and DMID funded investigators.
The study chosen for this tutorial is one of 229 currently available for download as of DR20.
Title: Apoptosis and other immune biomarkers predict influenza vaccine (TIV 2008) responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to indentify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.
We will focus on the HAI results for this tutorial.
Study packages can be downloaded from ImmPort after free registration. Study download packages are available in several formats. For this tutorial we will be using tab-separated text files (meaning a "tab" character is used as the column separator), found in the following zip archive:
SDYxxx-DRnn_Tab.zip where xxx is the study accession and nn the data release version.
Details on the tables included in this archive are available here.
To get the archive, go to the study page, in our case SDY212, and click on 'download'. If you haven't registered to ImmPort yet, you will need to create an account (it only takes a minute!). You might need to install Aspera if you haven't already.
Once logged in, you should see a list of directories and files. Look for SDY212-DR20_Tab.zip and click on it to download.
For this tutorial, we will also use python packages: Numpy, Scipy, Pandas and Plotly. Each of these packages should be installed for this notebook to run smoothly. For instructions on how to do this with conda, check out this tutorial.
Jupyter notebooks can be used for data analysis and as a teaching tool. There are a multitude of very good tutorials online about how to use notebooks, below are a selected few:
Exploring Machine Learning with Scikit-learn by J. Vanderplas and O. Grisel at the 2014 PyCon.
A more extensive list of interesting tutorials can be found here.
import numpy as np import pandas as pd import plotly import plotly.plotly as py import plotly.graph_objs as go from scipy import stats
In this section you will be introduced to some of the content available for download from ImmPort, which can be used for re-analysis of the study. We demonstrate here how to load in information from tab-separated text files, merge their content and generate simple descriptive statistics.
The files needed for this section can be found in the SDY212-DR20_Tab.zip archive. Once this archive is unzipped, you should have a new folder called SDY212-DR20, itself containing a 'Tab' folder. This is where you will find the following files:
To load the files we'll be using in memory, you can either type in the path to your files, or use the input() function, which in a Jupyter notebook will print a question, and capture your answer in a text box as a string. Either line of code works, though note that tab-complete will work in a Jupyter notebook if you hit the tab key within double quotes, but not with the input() function text box.
In the three code cells below, don't forget to change the path to where your files are located!
#subject_file = input("Enter the path to subject.txt: ") subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/subject.txt"
#arm_2_subject_file = input("Enter the path to arm_2_subject.txt: ") arm_2_subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_2_subject.txt"
#arm_or_cohort_file = input("Enter the path to arm_or_cohort.txt: ") arm_or_cohort_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_or_cohort.txt"
subjects = pd.read_table(subject_file, sep="\t") arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t") arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
|0||SUB134268||NaN||This subject record was used to consolidate du...||Not Hispanic or Latino||Female||White||NaN||Homo sapiens||NaN||NaN||2883|
|1||SUB134304||NaN||This subject record was used to consolidate du...||Not Hispanic or Latino||Female||Asian||NaN||Homo sapiens||NaN||NaN||2883|
|2||SUB134309||NaN||This subject record was used to consolidate du...||Not Hispanic or Latino||Male||White||NaN||Homo sapiens||NaN||NaN||2883|
|3||SUB134324||NaN||This subject record was used to consolidate du...||Not Hispanic or Latino||Male||Other||White, Asian||Homo sapiens||NaN||NaN||2883|
|4||SUB134240||NaN||This subject record was used to consolidate du...||Not Hispanic or Latino||Female||White||NaN||Homo sapiens||NaN||NaN||2883|
|0||ARM894||SUB134323||Age at Study Day 0||NaN||Years||23.82||23.82||Non-twin|
|1||ARM894||SUB134252||Age at Study Day 0||NaN||Years||24.15||24.15||Non-twin|
|2||ARM895||SUB134256||Age at Study Day 0||NaN||Years||83.89||83.89||Non-twin|
|3||ARM895||SUB134262||Age at Study Day 0||NaN||Years||84.17||84.17||Non-twin|
|4||ARM895||SUB134265||Age at Study Day 0||NaN||Years||85.82||85.82||Non-twin|
|0||ARM895||Older participants aged 60 to 89 years, vaccin...||Cohort_2||SDY212||Experimental||2883|
|1||ARM894||Young participants aged 20 to 30 years, vaccin...||Cohort_1||SDY212||Experimental||2883|
Pandas dataframes column headings are also accessible as follows:
Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY', 'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN', 'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'], dtype='object')
Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY', 'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'], dtype='object')
Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE', 'WORKSPACE_ID'], dtype='object')
To get an idea of the number of rows and/or columns in a table, you can use shape. The first dimension reported by shape is the number of rows and the second one is the number of columns.
Note here that the number of rows reported for the subjects and arms_2_subject files match. In these two files, each row corresponds to a subject. The fact that they have the same number of rows is a good indicator that there should be a one to one mapping between the two files.
To make sure that the set of subject accessions in each table contains only one instance of each subject, you can take advantage of the built-in Python function set(), and of the built-in Pandas access to column content through column header. We can compare the number of unique elements returned by set() to the number of elements including potential duplicates. If they are the same, there are no duplicates.
if len(set(subjects.SUBJECT_ACCESSION)) != len(subjects.SUBJECT_ACCESSION): print("some subject accessions are duplicated in subject.txt")
if len(set(arm_2_subject.SUBJECT_ACCESSION)) != len(arm_2_subject.SUBJECT_ACCESSION): print("some subject accessions are duplicated in arm_2_subject.txt")
Of course, to determine beyond any doubt that the set of subject accessions are the same in both tables, you need to compare them. The pandas df.isin() method checks if all elements from one set exist in another given set. The resulting structure contains boolean ('True' or 'False') which can then be counted with df.sum(). Here, we expect the sum to be the same than the number of subject accessions (in other words, the number of rows in subjects), because 'True' counts as 1.
df = subjects.SUBJECT_ACCESSION.isin(arm_2_subject.SUBJECT_ACCESSION) df.sum()
*Side note: ImmPort data is de-identified and curated such that accessions are unique identifiers*
An alternative to df.head() is to review the content by looking at a single row in each file with df.ix. This makes looking at many columns a bit easier.
SUBJECT_ACCESSION SUB134268 ANCESTRAL_POPULATION NaN DESCRIPTION This subject record was used to consolidate du... ETHNICITY Not Hispanic or Latino GENDER Female RACE White RACE_SPECIFY NaN SPECIES Homo sapiens STRAIN NaN STRAIN_CHARACTERISTICS NaN WORKSPACE_ID 2883 Name: 0, dtype: object
ARM_ACCESSION ARM894 SUBJECT_ACCESSION SUB134326 AGE_EVENT Age at Study Day 0 AGE_EVENT_SPECIFY NaN AGE_UNIT Years MAX_SUBJECT_AGE 25.39 MIN_SUBJECT_AGE 25.39 SUBJECT_PHENOTYPE Non-twin Name: 10, dtype: object
A pandas dataframe content can also be accessed by using the headers. Note that while ix returns the full content of a row (one value per column), the following returns all rows for the specified columns (one value per row):
|0||ARM895||Cohort_2||Older participants aged 60 to 89 years, vaccin...|
|1||ARM894||Cohort_1||Young participants aged 20 to 30 years, vaccin...|
In this particular case, the names of each arm aren't very descriptive, We can add a column to the dataframe with a more meaningful name:
arm_or_cohort['ARM_NAME'] = ['Old','Young']
|0||ARM895||Cohort_2||Old||Older participants aged 60 to 89 years, vaccin...|
|1||ARM894||Cohort_1||Young||Young participants aged 20 to 30 years, vaccin...|
The goal of this section is to create one dataframe containing only the information we want to proceed. We've seen above that subjects and arm_2_subject contain information pertaining to the same sets of subject accessions, so we know we can merge them together along subject accession with little issues.
The other table we have, arm_or_cohort, contains information about each arm mentioned in the mapping table, arm_2_subject. We can merge them together thanks to the arm accession.
There are several methods in pandas to merge dataframes together. Here, we use the function pd.merge(), which does all the hard work of merging gracefully together the dataframes.
# as a reminder: print(subjects.columns, arm_or_cohort.columns, arm_2_subject.columns, sep="\n")
Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY', 'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN', 'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'], dtype='object') Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE', 'WORKSPACE_ID', 'ARM_NAME'], dtype='object') Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY', 'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'], dtype='object')
subjects_merged_1 = pd.merge(subjects, arm_2_subject, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION') subjects_merged_2 = pd.merge(subjects_merged_1, arm_or_cohort, left_on='ARM_ACCESSION',right_on='ARM_ACCESSION') subjects_merged_2.ix
SUBJECT_ACCESSION SUB134268 ANCESTRAL_POPULATION NaN DESCRIPTION_x This subject record was used to consolidate du... ETHNICITY Not Hispanic or Latino GENDER Female RACE White RACE_SPECIFY NaN SPECIES Homo sapiens STRAIN NaN STRAIN_CHARACTERISTICS NaN WORKSPACE_ID_x 2883 ARM_ACCESSION ARM894 AGE_EVENT Age at Study Day 0 AGE_EVENT_SPECIFY NaN AGE_UNIT Years MAX_SUBJECT_AGE 29.71 MIN_SUBJECT_AGE 29.71 SUBJECT_PHENOTYPE Non-twin DESCRIPTION_y Young participants aged 20 to 30 years, vaccin... NAME Cohort_1 STUDY_ACCESSION SDY212 TYPE Experimental WORKSPACE_ID_y 2883 ARM_NAME Young Name: 0, dtype: object
Note in the list of columns above that pandas modified the name of some columns by appending a '_x' or '_y'.
The new dataframe created, subjects_merged contains all columns from all 3 tables, most of which are irrelevant at this time. We can drop the unnecessary columns by creating a new dataframe with only the columns of interest as follows:
subjects_merged = subjects_merged_2[['SUBJECT_ACCESSION','ARM_ACCESSION','ARM_NAME','GENDER','RACE','MIN_SUBJECT_AGE']] subjects_merged.shape
SUBJECT_ACCESSION SUB134268 ARM_ACCESSION ARM894 ARM_NAME Young GENDER Female RACE White MIN_SUBJECT_AGE 29.71 Name: 0, dtype: object
Below we show some descriptive statistics just to get a feel for the distribution of age, gender and race.
We are going to use two pandas methods here, chained by a dot operator. The object returned by the first function is taken as an argument by the next function. For instance below, the df.groupby() method groups rows by values contained in the specified column. It returns an object on which the next function, df.count(), operates. The df.count() method returns the number of observations in a given column.
ARM_NAME Old 61 Young 30 Name: SUBJECT_ACCESSION, dtype: int64
GENDER Female 54 Male 37 Name: SUBJECT_ACCESSION, dtype: int64
RACE American Indian or Alaska Native 1 Asian 8 Other 7 White 75 Name: SUBJECT_ACCESSION, dtype: int64
df.describe() is a very useful method which returns standard summary statistics (count, mean, median, standard dev, quartiles) on a given series of numbers.
Note in the following example that the only column in our dataframe containing numbers is MIN_SUBJECT_AGE, so it is the only one returned.
subjects_by_arm = subjects_merged.groupby('ARM_NAME') subjects_by_arm.describe()
In this version of the tutorial, we will be using the Plotly library for graphs instead of MatPlotLib. For more advanced users, there is a way to feed directly pandas dataframe into Plotly by using yet another library.
Plotly generates interactive graphs. By default, plots are saved to a server in a Plotly account that you have to set up. To enable local generation of your plots instead, use the following line: