Version 1.0, last modified November 2017
Python 3.6
pandas 0.21.0
seaborn 0.8.1
matplotlib 2.1.0
This is a tutorial on using Python tools to prepare ImmPort study information for analysis. This tutorial should NOT be considered as a real scientific analysis, but is ONLY intended to show how be prepare data for real analysis. This tutorial will use SDY4 as the example study, but we will download all ALLSTUDIES package, to show how you can extract individual studies for analysis. If you are only interested in the analysis of one specific study, individual download packages are available.
This tutorial concentrates on how to prepare the lab test and assessments information for analysis. The download packages contain many more types of data that you can explore. The data in the ALLSTUDIES package was extraced from a MySQL database and the content of each file in the ALLSTUDIES package contains the data from a table. An overview of the ImmPort data model is available here, and the table definitions are available here
For this analysis we will start by creating a top level directory named ALLSTUDIES. Below the ALLSTUDIES directory three directories where created: data, downloads and notebooks. The ALLSTUDIES_DR24_Tab.zip file was downloaded from the ImmPort Data Browser web site to the downloads directory.
The following commands were used to unzip and move the contents to the data directory
cd downloads
unzip ALLSTUDIES_DR24_Tab.zip
cd ALLSTUDES-DR24_Tab/Tab
mv * ../../../data
To start the Jupyter Notebook the following commands were used.
cd ../../../notebooks
jupyter notebook
Import the Python modules we will be using for analysis.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Preload all the data we will be using for the tutorial. For each file loaded into a DataFrame, the column names and the number of rows and columns for each table will be printed.
In the ImmPort model a study can have one or more arms_or_cohort records, and a subject is assigned to arm_or_cohort. Read in the study, arm_or_cohort and arm_2_subject data, so we can build a mapping that includes the study_accession, arm_accession and subject_accession. We will use this information later to map arm to assessment and lab_test using the subject_accession.
We will not use the study table for this but you may want to review the column names and content. If you want to make a quick review of the types of data available in any table, you can use the df.head() command to get a tabular view of the table, after using the pd.read_table command. The study.head() command should be the last line in a notebook cell, to get the table to display.
print("STUDY")
print("============================================================")
study = pd.read_table("../data/study.txt",sep="\t")
print(study.columns)
print(study.shape)
# study.head()
print("")
print("ARM_OR_COHORT")
print("============================================================")
arm_or_cohort = pd.read_table("../data/arm_or_cohort.txt",sep="\t")
print(arm_or_cohort.columns)
print(arm_or_cohort.shape)
# arm_or_cohort.head()
print("")
print("ARM_2_SUBJECT")
print("============================================================")
arm_2_subject = pd.read_table("../data/arm_2_subject.txt",sep="\t")
print(arm_2_subject.columns)
print(arm_2_subject.shape)
# arm_2_subject.head()
print("")
print("BIOSAMPLE")
print("============================================================")
biosample = pd.read_table("../data/biosample.txt",sep="\t")
print(biosample.columns)
print(biosample.shape)
arm_or_cohort_short = arm_or_cohort[['STUDY_ACCESSION','ARM_ACCESSION']]
study_arm_subject = pd.merge(arm_or_cohort_short,arm_2_subject,left_on='ARM_ACCESSION',right_on='ARM_ACCESSION')
# Uncomment to review contents
# study_arm_subject.head()
study_arm_subject_mapping = study_arm_subject[['STUDY_ACCESSION','ARM_ACCESSION','SUBJECT_ACCESSION']]
print(study_arm_subject_mapping.shape)
study_arm_subject_mapping.head()
biosample_short = biosample[['STUDY_ACCESSION','BIOSAMPLE_ACCESSION','SUBJECT_ACCESSION',
'PLANNED_VISIT_ACCESSION','STUDY_TIME_COLLECTED']]
print(biosample_short.shape)
biosample_short.head()
Assessment information is contained in 2 tables in the ImmPort data model. The assessment_panel is used to group individual assessment_component records into panels. To perform analysis it is useful to merge the information from these 2 tables into 1 DataFrame
print("ASSESSMENT_PANEL")
print("============================================================")
assessment_panel = pd.read_table("../data/assessment_panel.txt",sep="\t")
print(assessment_panel.columns)
print(assessment_panel.shape)
print("")
print("ASSESSMENT_COMPONENT")
print("============================================================")
assessment_component = pd.read_table("../data/assessment_component.txt",sep="\t")
print(assessment_component.columns)
print(assessment_component.shape)
assessment_panel_short = assessment_panel[['ASSESSMENT_PANEL_ACCESSION','NAME_REPORTED','STUDY_ACCESSION']]
assessment_panel_short = assessment_panel_short.rename(columns={'NAME_REPORTED': "PANEL_NAME"})
# assessment_panel_short.head()
assessment_component_short = assessment_component[['ASSESSMENT_PANEL_ACCESSION','NAME_REPORTED',
'PLANNED_VISIT_ACCESSION','RESULT_VALUE_REPORTED',
'RESULT_UNIT_REPORTED','STUDY_DAY','SUBJECT_ACCESSION']]
assessment_component_short = assessment_component_short.rename(columns={'NAME_REPORTED': "COMPONENT_NAME"})
# assessment_component_short.head()
assessments = pd.merge(assessment_panel_short, assessment_component_short,
left_on='ASSESSMENT_PANEL_ACCESSION', right_on='ASSESSMENT_PANEL_ACCESSION')
print(assessments.shape)
# assessments.head()
In this case we are using the smaller study_arm_subject_mapping DataFrame because we do not want to include additional information from the study_arm_subject DataFrame. But in many cases you may want to use the full study_arm_subject DataFrame, to include information like age at the time of the study or subject_phenotype, etc.
assessments = pd.merge(assessments,study_arm_subject_mapping, left_on=['STUDY_ACCESSION','SUBJECT_ACCESSION'],
right_on=['STUDY_ACCESSION','SUBJECT_ACCESSION'])
print(assessments.shape)
assessments.head()
Lab test information is contained in 2 tables in the ImmPort data model. The lab_test_panel is used to group individual lab_test records into panels. To perform analysis it is useful to merge the information from these 2 tables into 1 DataFrame.
print("LAB_TEST_PANEL")
print("============================================================")
lab_test_panel = pd.read_table("../data/lab_test_panel.txt",sep="\t")
print(lab_test_panel.columns)
print(lab_test_panel.shape)
print("")
print("LAB_TEST")
print("============================================================")
lab_test = pd.read_table("../data/lab_test.txt",sep="\t")
print(lab_test.columns)
print(lab_test.shape)
lab_test_panel_short = lab_test_panel[['LAB_TEST_PANEL_ACCESSION','NAME_REPORTED','STUDY_ACCESSION']]
lab_test_panel_short = lab_test_panel_short.rename(columns={'NAME_REPORTED': 'PANEL_NAME'})
#lab_test_panel_short.head()
lab_test_short = lab_test[['LAB_TEST_PANEL_ACCESSION','BIOSAMPLE_ACCESSION','NAME_REPORTED',
'RESULT_VALUE_PREFERRED','RESULT_VALUE_REPORTED','RESULT_UNIT_REPORTED']]
lab_test_short = lab_test_short = lab_test_short.rename(columns={'NAME_REPORTED': 'LAB_TEST_NAME'})
#lab_test_short.head()
lab_tests = pd.merge(lab_test_panel_short,lab_test_short,
left_on='LAB_TEST_PANEL_ACCESSION',right_on='LAB_TEST_PANEL_ACCESSION')
print(lab_tests.shape)
# lab_tests.head()
lab_tests = pd.merge(lab_tests,biosample_short, left_on=['STUDY_ACCESSION','BIOSAMPLE_ACCESSION'],
right_on=['STUDY_ACCESSION','BIOSAMPLE_ACCESSION'])
lab_tests = pd.merge(lab_tests,study_arm_subject_mapping, left_on=['STUDY_ACCESSION','SUBJECT_ACCESSION'],
right_on=['STUDY_ACCESSION','SUBJECT_ACCESSION'])
print(lab_test.shape)
lab_tests.head()
The methods below can be used to explore the types of assessments or labtests that are available on a study basis. At the end of this section there are a few example of how to use these methods to explore study information. When looking at the results returned by these methods, you may want to adjust the options to control the number of rows to display, or the option to control column width, examples of how to set these options are in the code.
def assessment_panels_by_study(study_accession,assessments):
study_assessments = assessments[assessments['STUDY_ACCESSION']==study_accession]
study_panels = study_assessments[['PANEL_NAME','COMPONENT_NAME']]
study_panels = study_panels.groupby(['PANEL_NAME','COMPONENT_NAME']).size().reset_index(name="COUNTS")
return study_panels
def lab_test_panels_by_study(study_accession,lab_tests):
study_lab_tests = lab_tests[lab_tests['STUDY_ACCESSION']==study_accession]
study_panels = study_lab_tests[['PANEL_NAME','LAB_TEST_NAME']]
study_panels = study_panels.groupby(['PANEL_NAME','LAB_TEST_NAME']).size().reset_index(name="COUNTS")
return study_panels
For these examples, the maximum number of rows will be set to 10 and the maximum column width will be set to 100. Set the max_rows option to see more of the information. The value passed into the df.head() method should also be increased. At the end of the section,these options will be reset to their default values.
pd.set_option("display.max_rows",10)
pd.set_option("display.max_colwidth",100)
# Show SDY4 ??
assessment_panels = assessment_panels_by_study('SDY4',assessments)
print(assessment_panels.shape)
assessment_panels.head(10)
labtest_panels = lab_test_panels_by_study('SDY4',lab_tests)
print(labtest_panels.shape)
labtest_panels.head(10)
pd.reset_option("display.max_rows")
pd.reset_option("display.max_colwidth")
The information in the assessments and labtests DataFrame is in a format sometimes referred to as Long and Narrow, this means each row contains one row for each subject and value, but in order to do analysis we would like to pivot the data based on columns like subject_accession, study_time_collected, etc. so the data is in more of a tabluar format, much like an Excel spreadsheet.
If you Google terms like "pivot tables python", "reshape data long to wide", or "long vs wide format", there are plenty of examples what this means with respect to data analysis.
So for the next part of this tutorial, we will show examples of pivoting the original DataFrame into the wide format. There are many different ways to arrive at the final wide format and these are only examples of a few methods.
How you treat null values in a data set varies widely and depends on your final analysis goals. For this tutorial we will gloss over the problem of nulls, so these are NOT good examples of how to handle null values.
SDY4_CBC_Results = lab_tests[(lab_tests['STUDY_ACCESSION'] == 'SDY4') &
(lab_tests['PANEL_NAME'] == 'CBC_Results')]
print(SDY4_CBC_Results.columns)
print(SDY4_CBC_Results.shape)
SDY4_CBC_Results = SDY4_CBC_Results.reset_index(drop=True)
SDY4_CBC_Results.head(10)
The code below was used to identify that all the results were captured for a single time point, show the distribution of results by ARM, and preliminarly check that each subject has the same number of lab tests.
The results below indicate the data makes sense. If you look at the number of each subject in each Arm the counts per Arm look okay.
ARM | Name | Subject | Description |
---|---|---|---|
ARM241 | AD+ Acute EH+HSV+ | 21 | AD Subjects with acute Eczema Herpeticum and recurrent herpes simplex virus to follow-up 4-12 weeks after baseline. |
ARM242 | AD+ with history of EH+ HSV+ | 31 | AD Subjects with history of Eczema Herpeticum and recurrent herpes simplex virus. |
ARM243 | AD+ EH- HSV+ | 61 | AD Subjects without Eczema Herpeticum but with recurrent herpes simplex virus. |
ARM244 | AD+ EH- HSV- | 61 | AD Subjects with no history of Eczema Herpeticum and no recurrent herpes simplex virus. |
ARM245 | Healthy controls | 61 | Non-AD (healthy volunteers). |
print(SDY4_CBC_Results['STUDY_TIME_COLLECTED'].value_counts())
print(SDY4_CBC_Results['ARM_ACCESSION'].value_counts())
# print(SDY4_CBC_Results['SUBJECT_ACCESSION'].value_counts())
For this example we will be using the RESULT_VALUE_PREFERRED column, so first we check for nulls and we want to make sure the column contains Float values.
print("Number of Rows with Null values")
print("===============================")
print(SDY4_CBC_Results['RESULT_VALUE_PREFERRED'].isnull().sum())
print("")
print("DataType of the column")
print("===============================")
print(SDY4_CBC_Results['RESULT_VALUE_PREFERRED'].dtype)
Based on our data exploration above we noticed one row where the RESULT_VALUE_PREFERRED was NaN. In ImmPort the RESULT_VALUE_PREFFERED column is populated by transforming the RESULT_VALUE_REPORTED column into the proper data type. In this case (refer to row below), it looks like the RESULT_VALUE_REPORTED was not entered properly, so the RESULT_VALUE_PREFERRED could not be populated.
To assign the missing value, we will take the mean of all the other rows for this ARM and LAB_TEST. Then assign this value to the missing value. Remember this may not be to correct way to assign this missing value, it is only one way to handle the missing value.
SDY4_CBC_Results[SDY4_CBC_Results['RESULT_VALUE_PREFERRED'].isnull()]
SDY4_CBC_Results.iloc[5376,:]
mean_value = SDY4_CBC_Results[(SDY4_CBC_Results['ARM_ACCESSION']=='ARM245') &
(SDY4_CBC_Results['LAB_TEST_NAME']=='CD19 positive cells')]['RESULT_VALUE_PREFERRED'].mean()
mean_value
SDY4_CBC_Results.loc[5376,'RESULT_VALUE_PREFERRED'] = 186.0
SDY4_CBC_Results.iloc[5376,:]
SDY4_CBC_Results[SDY4_CBC_Results['RESULT_VALUE_PREFERRED'].isnull()]
XX = SDY4_CBC_Results[['SUBJECT_ACCESSION','ARM_ACCESSION','LAB_TEST_NAME',
'RESULT_VALUE_PREFERRED']]
XXX = XX.set_index(['SUBJECT_ACCESSION','ARM_ACCESSION']).copy()
SDY4_CBC_Results_wide = XXX.pivot(columns='LAB_TEST_NAME')
SDY4_CBC_Results_wide.head()
SDY4_CBC_Results_flat = SDY4_CBC_Results_wide.reset_index()
column_names = list(SDY4_CBC_Results_flat.columns.droplevel(0))
column_names[0] = 'Subject'
column_names[1] = 'ARM'
SDY4_CBC_Results_flat.columns = column_names
SDY4_CBC_Results_flat.head()
SDY4_CBC_Results_flat.groupby('ARM')['CD19 positive cells percentage'].mean()
CBC_Means = pd.DataFrame(index=['ARM241','ARM242','ARM243','ARM244','ARM245'])
for column_name in SDY4_CBC_Results_flat.columns[2:]:
CBC_Means[column_name] = SDY4_CBC_Results_flat.groupby('ARM')[column_name].mean()
CBC_Means.head()
sns.boxplot(x='ARM',y='CD19 positive cells percentage',data=SDY4_CBC_Results_flat);
sns.boxplot(x='ARM',y='CD16 positive/CD56 positive',data=SDY4_CBC_Results_flat);
for column_name in SDY4_CBC_Results_flat.columns[2:]:
plt.figure()
sns.boxplot(x='ARM',y=column_name,data=SDY4_CBC_Results_flat);