Version 2.0, last modified April 2017. Python 3.6, Numpy 1.12.1, Pandas 0.19.2, Plotly 1.12.9, Scipy 0.19.0
The Immunology Database and Analysis Portal (ImmPort) project provides tools and services for the collection, QC, curation, and sharing of immunological data. The purpose of ImmPort is to provide a long-term, sustainable source promoting the re-use of data provided by NIAID DAIT- and DMID-funded investigators.
The study chosen for this tutorial is one of 229 currently available for download as of DR20.
SDY212
Title: Apoptosis and other immune biomarkers predict influenza vaccine (TIV 2008) responsiveness
Principal Investigator: Mark M. Davis
Description: In an effort to identify benchmarks of immunological health, influenza vaccination was used in 30 young (20 to 30 years) and 59 older subjects (60 to 89 years) as models for strong and weak immune responses, respectively.
We will focus on the HAI results for this tutorial.
Study packages can be downloaded from ImmPort after free registration. Study download packages are available in several formats. For this tutorial we will be using tab-separated text files (meaning a "tab" character is used as the column separator), found in the following zip archive:
SDYxxx-DRnn_Tab.zip where xxx is the study accession and nn the data release version.
Details on the tables included in this archive are available here.
To get the archive, go to the study page, in our case SDY212, and click on 'download'. If you haven't registered with ImmPort yet, you will need to create an account (it only takes a minute!). You might need to install Aspera if you haven't already.
Once logged in, you should see a list of directories and files. Look for SDY212-DR20_Tab.zip and click on it to download.
For this tutorial, we will also use the Python packages Numpy, Scipy, Pandas, and Plotly. Each of these packages should be installed for this notebook to run smoothly. For instructions on how to do this with conda, check out this tutorial.
Jupyter notebooks can be used for data analysis and as a teaching tool. There are many very good online tutorials about how to use notebooks; below is a selection:
Teaching Numerical Methods with IPython Notebooks by D.I. Ketcheson and A. Ahmadia at the 2014 SciPy conference.
Part 1 Part 2 Part 3
Statistical Data Analysis in Python by C. Fonnesbeck at the 2013 SciPy conference.
Part 1 Part 2 Part 3 Part 4.
Exploring Machine Learning with Scikit-learn by J. Vanderplas and O. Grisel at the 2014 PyCon.
Part 1
A more extensive list of interesting tutorials can be found here.
import numpy as np
import pandas as pd
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from scipy import stats
In this section you will be introduced to some of the content available for download from ImmPort, which can be used for re-analysis of the study. We demonstrate here how to load in information from tab-separated text files, merge their content and generate simple descriptive statistics.
The files needed for this section can be found in the SDY212-DR20_Tab.zip archive. Once this archive is unzipped, you should have a new folder called SDY212-DR20, itself containing a 'Tab' folder. This is where you will find the following files:
To load the files we'll be using into memory, you can either type the path to your files directly, or use the input() function, which in a Jupyter notebook prints a question and captures your answer in a text box as a string. Either line of code works; note, however, that tab-completion works in a Jupyter notebook when you hit the Tab key inside double quotes, but not inside the input() text box.
In the three code cells below, don't forget to change the path to where your files are located!
#subject_file = input("Enter the path to subject.txt: ")
subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/subject.txt"
#arm_2_subject_file = input("Enter the path to arm_2_subject.txt: ")
arm_2_subject_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_2_subject.txt"
#arm_or_cohort_file = input("Enter the path to arm_or_cohort.txt: ")
arm_or_cohort_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/arm_or_cohort.txt"
The files are read in as pandas dataframes with the pd.read_table() function:
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
subjects.head()
arm_2_subject.head()
arm_or_cohort.head()
The column headings of a pandas dataframe are also accessible as follows:
subjects.columns
arm_2_subject.columns
arm_or_cohort.columns
To get an idea of the number of rows and/or columns in a table, you can use shape. The first dimension reported by shape is the number of rows and the second one is the number of columns.
Note here that the number of rows reported for the subjects and arm_2_subject files match. In these two files, each row corresponds to a subject. The fact that they have the same number of rows is a good indicator that there should be a one-to-one mapping between the two files.
subjects.shape
arm_2_subject.shape
arm_or_cohort.shape
To make sure that the set of subject accessions in each table contains only one instance of each subject, you can take advantage of the built-in Python function set() and of Pandas' access to column content through the column header. We can compare the number of unique elements returned by set() to the number of elements including potential duplicates. If they are the same, there are no duplicates.
if len(set(subjects.SUBJECT_ACCESSION)) != len(subjects.SUBJECT_ACCESSION):
print("some subject accessions are duplicated in subject.txt")
if len(set(arm_2_subject.SUBJECT_ACCESSION)) != len(arm_2_subject.SUBJECT_ACCESSION):
print("some subject accessions are duplicated in arm_2_subject.txt")
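Pandas also provides built-ins for this kind of check. A minimal sketch on a toy dataframe (the accession values are made up): duplicated() flags every repeat after its first occurrence, and nunique() performs the same comparison as set() above.

```python
import pandas as pd

# Toy stand-in for subjects; "SUB2" is duplicated on purpose
toy = pd.DataFrame({"SUBJECT_ACCESSION": ["SUB1", "SUB2", "SUB2", "SUB3"]})

# duplicated() marks repeats after the first occurrence
dups = toy[toy.SUBJECT_ACCESSION.duplicated()]
print(dups.SUBJECT_ACCESSION.tolist())   # ['SUB2']

# nunique() counts distinct values, like len(set(...))
print(toy.SUBJECT_ACCESSION.nunique() == len(toy.SUBJECT_ACCESSION))   # False
```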
Of course, to determine beyond any doubt that the sets of subject accessions are the same in both tables, you need to compare them. The pandas df.isin() method checks, element by element, whether each value exists in the given set. The resulting structure contains booleans (True or False), which can then be counted with df.sum(). Here, we expect the sum to be the same as the number of subject accessions (in other words, the number of rows in subjects), because True counts as 1.
df = subjects.SUBJECT_ACCESSION.isin(arm_2_subject.SUBJECT_ACCESSION)
df.sum()
*Side note: ImmPort data is de-identified and curated such that accessions are unique identifiers*
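isin() only checks membership in one direction. To confirm that two sets of accessions are identical, you can also compare them directly as Python sets; a minimal sketch on toy values (the accessions are made up):

```python
import pandas as pd

# Two toy frames with the same accessions in different orders
a = pd.DataFrame({"SUBJECT_ACCESSION": ["SUB1", "SUB2", "SUB3"]})
b = pd.DataFrame({"SUBJECT_ACCESSION": ["SUB3", "SUB1", "SUB2"]})

# isin() is directional: is every accession of a present in b?
forward = a.SUBJECT_ACCESSION.isin(b.SUBJECT_ACCESSION).sum()
print(forward)   # 3, i.e. all of them

# Comparing the sets directly checks both directions at once
same = set(a.SUBJECT_ACCESSION) == set(b.SUBJECT_ACCESSION)
print(same)   # True
```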
An alternative to df.head() is to review the content by looking at a single row in each file with df.iloc (positional row selection; older pandas code often uses df.ix for this, which was deprecated in later pandas versions). This makes looking at many columns a bit easier.
subjects.iloc[0]
arm_2_subject.iloc[10]
The content of a pandas dataframe can also be accessed using the column headers. Note that while selecting a single row returns the full content of that row (one value per column), the following returns all rows for the specified columns (one value per row):
arm_or_cohort[['ARM_ACCESSION','NAME','DESCRIPTION']]
In this particular case, the names of each arm aren't very descriptive, so we can add a column to the dataframe with more meaningful names:
arm_or_cohort['ARM_NAME'] = ['Old','Young']
arm_or_cohort[['ARM_ACCESSION','NAME','ARM_NAME','DESCRIPTION']]
The goal of this section is to create one dataframe containing only the information we need to proceed. We've seen above that subjects and arm_2_subject contain information about the same set of subject accessions, so we know we can merge them on subject accession without issue.
The other table we have, arm_or_cohort, contains information about each arm mentioned in the mapping table, arm_2_subject. We can merge it in using the arm accession.
There are several ways in pandas to merge dataframes. Here, we use the function pd.merge(), which does all the hard work of merging the dataframes gracefully.
# as a reminder:
print(subjects.columns, arm_or_cohort.columns, arm_2_subject.columns, sep="\n")
subjects_merged_1 = pd.merge(subjects, arm_2_subject, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
subjects_merged_2 = pd.merge(subjects_merged_1, arm_or_cohort, left_on='ARM_ACCESSION',right_on='ARM_ACCESSION')
subjects_merged_2.iloc[0]
Note in the list of columns above that pandas modified the name of some columns by appending a '_x' or '_y'.
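This suffixing can be illustrated on a toy merge (the frames and column names here are made up for this example). By default pandas appends '_x' and '_y' to columns that exist in both frames; the suffixes parameter lets you choose more descriptive names:

```python
import pandas as pd

left = pd.DataFrame({"KEY": [1, 2], "WORKSPACE_ID": ["a", "b"]})
right = pd.DataFrame({"KEY": [1, 2], "WORKSPACE_ID": ["c", "d"]})

# WORKSPACE_ID exists in both frames, so it gets suffixed
merged = pd.merge(left, right, on="KEY")
print(list(merged.columns))   # ['KEY', 'WORKSPACE_ID_x', 'WORKSPACE_ID_y']

# Explicit suffixes make the origin of each column clearer
merged2 = pd.merge(left, right, on="KEY", suffixes=("_subj", "_arm"))
print(list(merged2.columns))  # ['KEY', 'WORKSPACE_ID_subj', 'WORKSPACE_ID_arm']
```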
The new dataframe, subjects_merged_2, contains all columns from all 3 tables, most of which are irrelevant at this point. We can drop the unnecessary columns by creating a new dataframe with only the columns of interest as follows:
subjects_merged = subjects_merged_2[['SUBJECT_ACCESSION','ARM_ACCESSION','ARM_NAME','GENDER','RACE','MIN_SUBJECT_AGE']]
subjects_merged.shape
subjects_merged.iloc[0]
Below we show some descriptive statistics just to get a feel for the distribution of age, gender and race.
We are going to use two pandas methods here, chained by the dot operator: each method in the chain is called on the object returned by the previous one. For instance below, the df.groupby() method groups rows by the values contained in the specified column. It returns an object on which the next method, df.count(), operates. The df.count() method returns the number of observations in a given column.
subjects_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']
subjects_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']
subjects_merged.groupby('RACE').count()['SUBJECT_ACCESSION']
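As a side note, value_counts() collapses the groupby/count chain into one call; a minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"GENDER": ["Female", "Male", "Female", "Female"]})

# value_counts() returns counts per distinct value, sorted by frequency
counts = toy.GENDER.value_counts()
print(counts.to_dict())   # {'Female': 3, 'Male': 1}
```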
df.describe() is a very useful method which returns standard summary statistics (count, mean, standard deviation, min, quartiles, max) on a given series of numbers.
Note in the following example that the only column in our dataframe containing numbers is MIN_SUBJECT_AGE, so it is the only one returned.
subjects_by_arm = subjects_merged.groupby('ARM_NAME')
subjects_by_arm.describe()
In this version of the tutorial, we will be using the Plotly library for graphs instead of Matplotlib. For more advanced users, there is a way to feed pandas dataframes directly into Plotly by using an additional library.
Plotly generates interactive graphs. By default, plots are saved to a server in a Plotly account that you have to set up. To enable local generation of your plots instead, use the following line:
plotly.offline.init_notebook_mode()
For this plot, we want to look at the counts of each gender in each arm. The pd.crosstab() function lets us do that by computing a simple cross-tabulation of the specified factors:
arm_counts_gender = pd.crosstab(subjects_merged.ARM_NAME, subjects_merged.GENDER)
arm_counts_gender.head()
We can now use this new object as input to generate a simple bar plot.
Plotly is a graphing library which generates interactive plots. You can zoom in, zoom out, see information about the data when mousing over the graph, etc. You can also save the plot as a png by clicking on the camera icon in the Plotly toolbar at the top right of each plot.
Plotly uses a consistent syntax across plot types. Each call to make a plot takes in two things: data and layout. Data contains the actual data to plot, along with details such as how to plot it, what to call it, how to color it, etc. Layout is an object containing the plot configuration: chart title, legend font, location, color, etc.
Plotly made a cheat sheet of their most common plots for quick reference. The following plots will be barplots.
# define traces:
trace1 = go.Bar(
x = arm_counts_gender.index,
y = arm_counts_gender.Female,
name = "Female",
marker = {
"color" : "orange"
}
)
trace2 = go.Bar(
x = arm_counts_gender.index,
y = arm_counts_gender.Male,
name = "Male",
marker = {
"color" : "teal"
}
)
# store traces into a single object:
data1 = [trace1, trace2]
# define layout:
layout1 = go.Layout(
barmode = 'group',
bargroupgap = 0.1,
title = "Counts of Subjects by Arm and Gender",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "Counts",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig1 = go.Figure(data=data1, layout=layout1)
# call the plot:
plotly.offline.iplot(fig1)
For this plot, we want to look at the distribution of race in each arm.
arm_counts_race = pd.crosstab(subjects_merged.ARM_NAME, subjects_merged.RACE)
arm_counts_race.head()
This time let's use a loop so we don't have to type (or copy and paste) a trace for each race. This approach has another advantage: you might not know how many races there are in your data. Since we're using a loop, let's also set up a color palette so we don't have to hard-code the color every time (though we could also use the default Plotly palette by not declaring any color).
The colors below are from the standard HTML color palette. We could get fancier and use RGB values too. More info on HTML color names is available here.
palette = ['orange','teal','purple','maroon','lavender']
# defining traces.
# 1. set up the data object
data2 = []
i = 0
# 2. loop over races:
for race in arm_counts_race:
# set up a trace object
trace = go.Bar(
x = arm_counts_race.index,
y = arm_counts_race[race],
name = race,
marker = {
"color": palette[i]
}
)
# increment i to change color at each iteration of the loop
i += 1
data2.append(trace)
# define layout:
layout2 = go.Layout(
barmode = 'group',
bargroupgap = 0.1,
title = "Counts of Subjects by Arm and Race",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "Counts",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig2 = go.Figure(data=data2, layout=layout2)
# call the plot:
plotly.offline.iplot(fig2)
For this plot, we want to look at the distribution of age in each of our arms, separated by gender. Just as for the previous plots, we need to massage the data into an object that we can use to plot: namely, we want the MIN_SUBJECT_AGE series of values for each arm and each gender.
One way to do this is to subset the subjects_merged dataframe into one for each gender, and plot along ARM_NAME:
subjects_female = subjects_merged[subjects_merged['GENDER']=="Female"]
subjects_male = subjects_merged[subjects_merged['GENDER']=="Male"]
subjects_female.head()
For this plot, we'll do a boxplot.
# define traces for each group:
fem_bp = go.Box(
y = subjects_female.MIN_SUBJECT_AGE,
x = subjects_female.ARM_NAME,
name = "Female",
jitter = 0.5,
pointpos = -2,
boxpoints = "all",
marker = {
"color" : palette[0],
"size" : 4
}
)
male_bp = go.Box(
y = subjects_male.MIN_SUBJECT_AGE,
x = subjects_male.ARM_NAME,
name = "Male",
jitter = 0.5,
pointpos = -2,
boxpoints = "all",
marker = {
"color" : palette[1],
"size" : 4
}
)
# store traces in a list:
data3 = [fem_bp, male_bp]
# define layout:
layout3 = go.Layout(
boxmode = "group",
boxgroupgap = 0.5,
title = "Age at Enrollment by Arm and Gender",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "Age at Enrollment",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig3 = go.Figure(data=data3, layout=layout3)
# call the plot:
plotly.offline.iplot(fig3)
#hai_result_file = input("Enter the path to hai_result.txt: ")
hai_result_file = "/Users/thomascg/Desktop/immport_workshop/SDY212-DR20_Tab/Tab/hai_result.txt"
hai_result_tmp = pd.read_table(hai_result_file)
hai_result_tmp.columns
We don't need all these columns for further analysis, so let's trim the table:
hai_result = hai_result_tmp[['SUBJECT_ACCESSION','STUDY_TIME_COLLECTED','VALUE_PREFERRED','VIRUS_STRAIN_REPORTED']]
hai_result.head()
hai_result.shape
Note here that there are far more rows than the number of subjects above: there are several results per subject accession. Before going forward, let's check how many subjects are represented in this file:
len(set(hai_result.SUBJECT_ACCESSION))
Let's explore this table before we move on. We've seen that there are several records per subject accession. STUDY_TIME_COLLECTED indicates when the samples were collected for analysis. Typically, samples are collected at Day 0 of the study (which in our case also happens to be the day subjects received the flu vaccine) and some time later; here, the first 5 rows suggest 28 days later. Let's check that our assumption that there are only 2 values in STUDY_TIME_COLLECTED is accurate:
set(hai_result.STUDY_TIME_COLLECTED)
VIRUS_STRAIN_REPORTED indicates which virus strain was used for the hemagglutination inhibition assay. Just as we did with STUDY_TIME_COLLECTED, we can check how many unique values there are in this column:
set(hai_result.VIRUS_STRAIN_REPORTED)
# sanity check: rows per subject (expecting 2 time points x 3 strains)
534 / 89
It seems then that unless there is missing data, we should expect 3 results per study time collected for each subject. Let's go with this assumption for now. To make further analysis easier, we can reorganize the data so that VALUE_PREFERRED for day 0 and day 28 are in two separate columns. The end result is going to be a table with a row for each combination of SUBJECT_ACCESSION and VIRUS_STRAIN_REPORTED, and values at Day 0 and Day 28.
# 1. subset data from day 0.
study_day_0 = hai_result[hai_result['STUDY_TIME_COLLECTED'] == 0].copy()
# 2. to avoid later confusion, rename the VALUE_PREFERRED column to 'Day0'
# and drop the now pointless STUDY_TIME_COLLECTED column.
study_day_0.rename(columns = {"VALUE_PREFERRED" : "Day0"}, inplace=True)
study_day_0.drop(['STUDY_TIME_COLLECTED'], axis=1, inplace=True)
study_day_0.head()
# 3. subset data from day 28.
study_day_28 = hai_result[hai_result['STUDY_TIME_COLLECTED'] == 28].copy()
# 4. to avoid later confusion, rename the VALUE_PREFERRED column to 'Day28'
# and drop the now pointless STUDY_TIME_COLLECTED column.
study_day_28.rename(columns={'VALUE_PREFERRED':'Day28'}, inplace=True)
study_day_28.drop(['STUDY_TIME_COLLECTED'], axis=1, inplace=True)
study_day_28.head()
study_day_0.shape
study_day_28.shape
Our assumption seems to be holding up so far, since there is an identical number of records for each study day. Now we can merge the two tables into one dataframe:
join_columns = ['SUBJECT_ACCESSION','VIRUS_STRAIN_REPORTED']
hai_values = pd.merge(study_day_0, study_day_28, left_on=join_columns, right_on=join_columns)
hai_values.head()
We can make sure that our assumption of three tests per subject per study day was right by checking the shape of hai_values. If we still have 267 rows, it means that the set of unique subject accession/virus strain pairs was the same in both study_day_0 and study_day_28.
hai_values.shape
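As a side note, the subset/rename/merge sequence above can also be done in one reshape with pivot_table, which spreads the STUDY_TIME_COLLECTED values into columns. A sketch on a small made-up frame mimicking hai_result's layout:

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_ACCESSION":     ["SUB1", "SUB1", "SUB2", "SUB2"],
    "VIRUS_STRAIN_REPORTED": ["B", "B", "B", "B"],
    "STUDY_TIME_COLLECTED":  [0, 28, 0, 28],
    "VALUE_PREFERRED":       [20.0, 80.0, 40.0, 40.0],
})

# One row per subject/strain combination, one column per time point
wide = toy.pivot_table(index=["SUBJECT_ACCESSION", "VIRUS_STRAIN_REPORTED"],
                       columns="STUDY_TIME_COLLECTED",
                       values="VALUE_PREFERRED").reset_index()
wide = wide.rename(columns={0: "Day0", 28: "Day28"})
print(wide[["SUBJECT_ACCESSION", "Day0", "Day28"]])
```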
Let's start by calculating the fold change between day 0 and day 28 and convert it to log2 values to improve visualization later.
hai_values['FOLD_CHANGE'] = hai_values['Day28'] / hai_values['Day0']
hai_values['FOLD_CHANGE_LOG2'] = np.log2(hai_values['FOLD_CHANGE'])
# Let's add a categorical column here for future boxplot
hai_values['response'] = np.where(hai_values.FOLD_CHANGE_LOG2>=2, "High","Low")
hai_values.head()
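As a reminder of why log2 is convenient here: each doubling of the titer adds exactly 1 to the log2 fold change. A quick standalone check:

```python
import numpy as np

day0 = np.array([20.0, 20.0, 20.0])
day28 = np.array([20.0, 40.0, 80.0])   # no change, 2-fold, 4-fold

# Each doubling adds exactly 1 to the log2 fold change
fold = day28 / day0
print(np.log2(fold))   # [0. 1. 2.]
```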
Now that we have fold changes, we can integrate the subject information into the HAI results.
You might have noticed that the subject counts differ by 2 between the two tables. We could use the same method as in the first section to figure out which subjects differ, or we can simply merge the hai_values table with subjects_merged, which will result in a table containing only the subjects common to both.
hai_analysis = pd.merge(hai_values, subjects_merged, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
hai_analysis.head()
We can check how many subjects we still have:
len(set(hai_analysis.SUBJECT_ACCESSION))
We're going to generate a scatter plot to look at pre vs. post vaccination titers. Because it's a scatter plot, overlap between plotted dots is likely and makes seeing all the data at once difficult. One way around it is to use transparency in the colors, which is what we'll do in the following plot.
First, just as for the previous plots, we need to group the data:
hai_by_virus = hai_analysis.groupby('VIRUS_STRAIN_REPORTED')
To implement the transparency, we can set up a new palette in which the colors are defined with rgba() -- Red, Green, Blue, Alpha, where alpha determines opacity (transparency).
rgba_palette = ["rgba(255,165,0,0.5)",
"rgba(0,128,128,0.5)",
"rgba(128,0,128,0.5)",
"rgba(128,0,0,0.5)",
"rgba(230,230,250,0.5)"]
We can now use the Plotly scatter plot function to generate traces.
To iterate over the virus strains in the groupby object hai_by_virus, we can use a for loop as described here.
# set up an empty list to store traces:
data4 = []
# set up counter for changing color at each iteration of the loop
j = 0
for virus_strain, group in hai_by_virus:
# Define traces for each virus strain:
virus_trace = go.Scatter(
x = group.Day0,
y = group.Day28,
name = virus_strain + " Strain",
mode = "markers",
marker = {
"size" : 10,
"color" : rgba_palette[j],
"line" : {
"width": 2,
"color" : palette[j]
}
}
)
# increment counter!
j += 1
# add trace to data
data4.append(virus_trace)
# define layout for this plot
layout4 = go.Layout(
title = "Pre vs. Post-Vaccination HAI Titers",
xaxis = {
"title": "HAI Titer - Pre-Vaccination",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
yaxis ={
"title": "HAI Titer - Post-Vaccination",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig4 = go.Figure(data=data4, layout=layout4)
# call the plot:
plotly.offline.iplot(fig4)
Even with the transparency, there are too many data points stacked together to see all of them. An alternative to using transparency is to introduce a jitter to separate the dots on the plot (though some statisticians discourage this).
Jitter is built into Plotly's box plots, as we've seen above, but for some reason hasn't made it into scatter plots yet. Instead, we can use one of the following two functions to modify the values slightly, creating a jitter for dots stacked on top of each other.
def rand_jitter(arr):
stdev = .001*(max(arr)-min(arr))
return arr + np.random.randn(len(arr)) * stdev
# for log value of data
def rand_jitter_log(arr):
stdev = .01*(max(arr)-min(arr))
return arr + np.random.randn(len(arr)) * stdev
To separate the data further, we can also try plotting the log values.
# set up an empty list to store traces:
data5 = []
# set up counter for changing color at each iteration of the loop
k = 0
for v_strain, grp in hai_by_virus:
# Define traces for each virus strain:
v_trace = go.Scatter(
x = rand_jitter_log(np.log2(grp.Day0)),
y = rand_jitter_log(np.log2(grp.Day28)),
name = v_strain + " Strain",
mode = "markers",
marker = {
"size" : 10,
"color" : rgba_palette[k],
"line" : {
"width": 2,
"color" : palette[k]
}
}
)
# increment counter!
k += 1
# add trace to data
data5.append(v_trace)
# we can re-use layout4 since this is basically the same plot.
# create a figure object:
fig5 = go.Figure(data=data5, layout=layout4)
# call the plot:
plotly.offline.iplot(fig5)
Yet another way to try to see every data point is to plot each category separately in subplots. In Plotly, one way to generate subplots is to change the configuration of the layout, more specifically to re-define what portion of the plotting area is dedicated to each trace.
Plotly traces have attributes called xaxis and yaxis, which specify which axes a given trace is drawn against. By default, they are not defined:
print(data5[0]['xaxis'], data5[0]['yaxis'], data5[1]['xaxis'], data5[1]['yaxis'], data5[2]['xaxis'], data5[2]['yaxis'])
Let's give these attributes a value. We don't need to change the first trace, since an undefined value maps to the default xaxis and yaxis.
# add axis name:
data5[1]['yaxis'] = 'y2'
data5[1]['xaxis'] = 'x2'
data5[2]['yaxis'] = 'y3'
data5[2]['xaxis'] = 'x3'
Now we can set up a new layout. We need to define domains for our new axes, and add a plot title and axis titles.
layout5 = go.Layout(
title = "Pre vs. Post-Vaccination HAI Titers",
xaxis = {
"domain" : [0, 0.3],
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
yaxis ={
"title": "HAI Titer - Post-Vaccination",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
xaxis2 = {
"domain" : [0.35, 0.65],
"title": "HAI Titer - Pre-Vaccination"
},
xaxis3 = {"domain" : [0.7, 1]},
yaxis2 = {"anchor" : 'x2'},
yaxis3 = {"anchor" : 'x3'}
)
# create a figure object:
fig5_2 = go.Figure(data=data5, layout=layout5)
# call the plot:
plotly.offline.iplot(fig5_2)
Conclusion: These plots show the fairly typical triangular shape seen when subjects already have a certain HAI titer for a strain before vaccination (x axis): the titer will most likely not go down after vaccination (y axis).
Now let's look at the HAI fold changes compared to pre-vaccination titers.
# set up an empty list to store traces:
data6 = []
# set up counter for changing color at each iteration of the loop
m = 0
for vstrain, gp in hai_by_virus:
# Define traces for each virus strain:
vtrace = go.Scatter(
x = rand_jitter_log(np.log2(gp.Day0)),
y = rand_jitter_log(gp.FOLD_CHANGE_LOG2),
name = vstrain + " Strain",
mode = "markers",
marker = {
"size" : 10,
"color" : rgba_palette[m],
"line" : {
"width": 2,
"color" : palette[m]
}
}
)
# increment counter!
m += 1
# add trace to data
data6.append(vtrace)
# add annotation to the plot
note = go.Scatter(
x = [10.3, 10.3],
y = [1, 3],
mode='text',
textfont = {"color" : "rgb(100,100,100)"},
text = ["Low Response", "High Response"],
showlegend=False
)
data6.append(note)
# define layout for this plot
layout6 = go.Layout(
title = "HAI Response vs. Pre-Vaccination HAI Titers",
xaxis = {
"title": "HAI Titer - Pre-Vaccination",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
yaxis ={
"title": "HAI Response (Post/Pre-Vaccination)",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 12,
"color" : "rgb(107, 107, 107)"
}
},
# add a horizontal line to separate high and low responders
shapes = [{
'type': 'line',
'x0': 2,
'y0': 2,
'x1': 11,
'y1': 2,
'line': {
'color': 'rgb(100, 100, 100)',
'width': 2,
'dash': 'dot',
},
},
]
)
# create a figure object:
fig6 = go.Figure(data=data6, layout=layout6)
# call the plot:
plotly.offline.iplot(fig6)
The scatter plot here doesn't clearly show whether there is a difference between responses depending on the virus strain. Let's investigate this further. We can now take advantage of the categorical column response created above to group the data by virus strain and type of response. The pandas GroupBy.size() method returns the size of each group, and df.unstack() returns a dataframe that is nicer to look at and easier to work with:
hai_by_virus_by_response = hai_analysis.groupby(['VIRUS_STRAIN_REPORTED', 'response']).size().unstack()
hai_by_virus_by_response
Conclusion: There are more subjects showing a low response to the B and H1N1 strains, while the opposite is true for H3N2.
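As a side note, the same strain-by-response table can also be built with pd.crosstab, as used earlier for the subject counts; a sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "VIRUS_STRAIN_REPORTED": ["B", "B", "H1N1", "H1N1", "H3N2"],
    "response":              ["Low", "Low", "High", "Low", "High"],
})

# Counts of each response category per strain; missing combinations are 0
table = pd.crosstab(toy.VIRUS_STRAIN_REPORTED, toy.response)
print(table)
```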
Now let's examine whether the age of the subjects and the virus strain influence the influenza vaccine response.
# define traces for each group:
data7 = []
n = 0
for virus, groups in hai_by_virus:
boxplot = go.Box(
y = groups.FOLD_CHANGE_LOG2,
name = virus + " Strain",
jitter = 0.5,
pointpos = -2,
boxpoints = "all",
marker = {
"color" : palette[n],
"size" : 4
}
)
n += 1
data7.append(boxplot)
# define layout:
layout7 = go.Layout(
boxmode = "group",
boxgroupgap = 0.5,
title = "HAI Response per Virus Strain",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "HAI Response (Post/Pre-Vaccination)",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig7 = go.Figure(data=data7, layout=layout7)
# call the plot:
plotly.offline.iplot(fig7)
Conclusion: The HAI responses seem to vary by virus strain.
# define traces for each group:
old_vs_young = go.Box(
y = hai_analysis.FOLD_CHANGE_LOG2,
x = hai_analysis.ARM_NAME,
jitter = 0.5,
pointpos = -2,
boxpoints = "all",
marker = {
"color" : palette[3],
"size" : 4
}
)
data8 = [old_vs_young]
# define layout:
layout8 = go.Layout(
boxmode = "group",
boxgroupgap = 0.5,
title = "HAI Response per Arm",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "HAI Response (Post/Pre-Vaccination)",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig8 = go.Figure(data=data8, layout=layout8)
# call the plot:
plotly.offline.iplot(fig8)
Conclusion: The HAI responses seem to vary by age group (arm).
Do younger people have a higher response for each of the three virus strains?
# define traces for each group:
hai_by_age = hai_analysis.groupby('ARM_NAME')
data9 = []
p = 2
for age, data in hai_by_age:
bp_age = go.Box(
y = data.FOLD_CHANGE_LOG2,
x = data.VIRUS_STRAIN_REPORTED,
name = age,
jitter = 0.5,
pointpos = -2,
boxpoints = "all",
marker = {
"color" : palette[p],
"size" : 4
}
)
p += 1
data9.append(bp_age)
# define layout:
layout9 = go.Layout(
boxmode = "group",
boxgroupgap = 0.5,
title = "HAI Response by Arm",
xaxis = {
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
yaxis = {
"title" : "HAI Response (Post/Pre-Vaccination)",
"titlefont" : {
"size" : 16,
"color" : "rgb(107, 107, 107)"
},
"tickfont" : {
"size" : 14,
"color" : "rgb(107, 107, 107)"
}
},
)
# create a figure object:
fig9 = go.Figure(data=data9, layout=layout9)
# call the plot:
plotly.offline.iplot(fig9)
Conclusion: It seems the response to the H3N2 strain is independent of the age group, while young people tend to respond more strongly on average than older ones to the H1N1 and B strains.
To test our conclusion, we can compare the HAI response values between arms with a Wilcoxon rank-sum test, also called a Mann-Whitney U test, thanks to the scipy function mannwhitneyu. The first two arguments to this function are the two sets of values being compared. The third argument indicates whether or not to apply a continuity correction, and the last one selects a one- or two-sided hypothesis for the p-value.
The first step is to get the sets of fold changes for each arm and each virus strain, which we can do thanks to pandas' indexing and slicing capabilities. For the first test, for instance, we want all values of the HAI response (in other words, the content of FOLD_CHANGE_LOG2) for all subjects with ARM_NAME 'Young' and VIRUS_STRAIN_REPORTED 'B'. This is what the first line below does: t1_B is a series containing those HAI response values.
t1_B = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='B')]
t2_B = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='B')]
t1_H1N1 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H1N1')]
t2_H1N1 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H1N1')]
t1_H3N2 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Young') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H3N2')]
t2_H3N2 = hai_analysis.FOLD_CHANGE_LOG2[(hai_analysis.ARM_NAME=='Old') & (hai_analysis.VIRUS_STRAIN_REPORTED=='H3N2')]
u_stat_B, p_val_B = stats.mannwhitneyu(t1_B, t2_B, use_continuity=True, alternative='two-sided')
u_stat_H1N1, p_val_H1N1 = stats.mannwhitneyu(t1_H1N1, t2_H1N1, use_continuity=True, alternative='two-sided')
u_stat_H3N2, p_val_H3N2 = stats.mannwhitneyu(t1_H3N2, t2_H3N2, use_continuity=True, alternative='two-sided')
print("--B Strain--\nU statistic: " + str(u_stat_B) + "\np-value: " + str(p_val_B))
print("\n\n--H1N1 Strain--\nU statistic: " + str(u_stat_H1N1) + "\np-value: " + str(p_val_H1N1))
print("\n\n--H3N2 Strain--\nU statistic: " + str(u_stat_H3N2) + "\np-value: " + str(p_val_H3N2))
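The three near-identical calls above can also be written as a loop over a groupby object. A sketch on synthetic values (the real analysis would iterate over hai_by_virus instead of this toy frame):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for hai_analysis: 10 subjects per arm per strain
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "ARM_NAME": ["Young"] * 30 + ["Old"] * 30,
    "VIRUS_STRAIN_REPORTED": ["B", "H1N1", "H3N2"] * 20,
    "FOLD_CHANGE_LOG2": np.concatenate([rng.normal(3, 1, 30),
                                        rng.normal(1, 1, 30)]),
})

# One Mann-Whitney U test per strain, Young vs. Old
for strain, grp in toy.groupby("VIRUS_STRAIN_REPORTED"):
    young = grp.FOLD_CHANGE_LOG2[grp.ARM_NAME == "Young"]
    old = grp.FOLD_CHANGE_LOG2[grp.ARM_NAME == "Old"]
    u, p = stats.mannwhitneyu(young, old, use_continuity=True,
                              alternative="two-sided")
    print("--%s Strain--\nU statistic: %s\np-value: %s\n" % (strain, u, p))
```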
Conclusion: The results of these statistical tests show that the HAI response differs significantly between the age groups at the 0.05 level for the B and H1N1 strains, but not for the H3N2 strain.