Multiple Methods for Downloading and Analyzing ImmPort data in Python¶

In this tutorial we explore three methods for downloading information from ImmPort, along with ways to format the data so it is ready for analysis. Ways to analyze the data are covered in both R and Python tutorials, which are available at immport.org. The Python version is a very good source of information on both downloading and analyzing ImmPort data, and it is the basis for much of the code in this tutorial. The original Python tutorial is available in HTML and in Jupyter Notebook formats.

This tutorial focuses only on downloading and preparing data for analysis, not on the analysis itself. The choice of analysis language is left to the researcher; this tutorial shows how the work can be done in Python. A similar tutorial using R is planned.

Overview of Access to ImmPort data¶

There are three alternative methods for downloading data from ImmPort. Which one to choose is up to the researcher, and each method has advantages depending on your analysis plan.

Method 1: Download using the ImmPort Data Browser¶

The ImmPort Data Browser is a web-based application that lets the user download individual files or folders of files, and it may be a good choice for beginning data scientists. The screen shot below shows the starting screen for downloading data for SDY212.

For this tutorial we will be downloading the SDY212-DR23_Tab.zip file. This zip package contains all the data files for SDY212 in TSV format. Each file represents the content of a database table containing rows of information for study SDY212. An overview of the ImmPort database model is available here and detailed table information is available here. One advantage of this method is that you can get all the files you may need with one click.

If you are interested in doing cross-study analysis, there is an ALLSTUDIES-DRXX_Tab.zip file that contains files with the data for all the studies. Another alternative is to download individual files from the ALLSTUDIES/ALLSTUDIES-DRXX_Metadata folder. Example screen shot below.
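Once the package has been downloaded through the Data Browser, it can be unpacked from Python as well as by hand. The sketch below uses the standard zipfile module. `extract_tab_files` is a hypothetical helper name, and the paths in the usage comment are assumptions based on the directory layout used later in this tutorial.

```python
import zipfile
from pathlib import Path


def extract_tab_files(zip_path, target_dir):
    """Unpack a downloaded zip package and return the extracted .txt files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # The Tab packages hold one tab-separated .txt file per database table.
    return sorted(target.rglob("*.txt"))


# Possible usage (paths are assumptions, adjust to your own layout):
# files = extract_tab_files("../data/SDY212-DR23_Tab.zip", "../data/SDY212-DR23_Tab")
# for path in files:
#     print(path.name)
```

Searching recursively with `rglob` means the helper does not depend on how deeply the files are nested inside the archive.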

Method 2: Download using the ImmPort File Download API¶

This method allows the user to download files of interest through a programmatic API. A bash shell script has been developed that can be used to download files, and this tutorial will illustrate how you can download files using Python. More details on how to use Python are given in the section for this method. One advantage of this method is that you can download individual files programmatically rather than through a web interface.

The ImmPort files are hosted and downloaded using an Aspera file system. Aspera is an application that greatly increases performance when downloading large files. The Aspera company provides executables, which are packaged in the shell script zip package that must be downloaded from ImmPort.

Method 3: Download using the ImmPort Query API¶

This method allows the user to obtain data from a REST API. Currently this REST API supports downloading assay results for experiments such as ELISA, ELISPOT, HAI, etc. If you are interested in downloading raw experiment files like FCS files, there is an API call to support this need, and a small example of how to use it appears later in this tutorial. The types of REST methods available and the types of filters you can add to your query are detailed in another document. The information returned by the API methods contains many metadata elements that, in method 1 or 2, may be spread across 4 files. It may therefore be easier to start with this method until you are comfortable merging multiple files into one coherent data set with Python pandas or R data frames.

Access to all shared ImmPort data requires you to become a registered user. Registration is simple and you can start [here](https://immport-user-admin.niaid.nih.gov:8443/registrationuser/registration). Because access is limited to registered users, the API methods require an authorization token to be sent as part of the request. The software we make available hides this step, which is not complex, but you need to be aware that a token must be acquired.

Getting Started¶

The section below sets up the Python environment that is used by all the methods. This tutorial assumes you have downloaded the File Download Tool distribution available from ImmPort, which contains the bash shell script, the immport_download.py file and the Aspera executables. The immport_download.py file contains convenience functions for accessing the API; please review it if you want more details. The tutorial also assumes you have unzipped the distribution into a directory at the same level as the notebook directory.

import sys
import os
import pandas as pd

# Set the Python path to the location of the directory containing "immport_download.py"
immport_download_code = "../bin/"
sys.path.insert(0,immport_download_code)
os.chdir(immport_download_code)

import immport_download

Example Configuration Properties¶

user_name = "REPLACE"
password = "REPLACE"
download_directory = "../output"
data_directory = "../data"
sdy212_directory= "../data/SDY212-DR23_Tab/Tab"
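To keep credentials out of a shared notebook, the user name and password could be read from the environment instead of being typed into the cell above. This is only a sketch: `load_credentials` and the variable names `IMMPORT_USERNAME` and `IMMPORT_PASSWORD` are hypothetical and are not defined by ImmPort or by immport_download.py.

```python
import getpass
import os


def load_credentials(env=os.environ):
    """Return (user_name, password), preferring environment variables."""
    # IMMPORT_USERNAME and IMMPORT_PASSWORD are hypothetical variable names.
    user = env.get("IMMPORT_USERNAME")
    secret = env.get("IMMPORT_PASSWORD")
    # Fall back to an interactive prompt when a value is not set.
    if user is None:
        user = input("ImmPort user name: ")
    if secret is None:
        secret = getpass.getpass("ImmPort password: ")
    return user, secret


# Possible usage:
# user_name, password = load_credentials()
```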

Method 1: Download using the ImmPort Data Browser¶

In preparation for this step, the SDY212-DR23_Tab.zip file was downloaded using the Data Browser and unzipped into the ../data/SDY212-DR23_Tab/Tab directory. The following 4 files from this directory will be used:

subject.txt - general subject demographic information.
arm_2_subject.txt - mapping from subjects to study arms/cohorts.
arm_or_cohort.txt - study arm names and descriptions.
hai_result.txt - HAI results.

More details on preparing and analyzing the information from these files are available in the tutorial mentioned at the top of this tutorial.

# To view the contents of the ../data/SDY212-DR23_Tab/Tab directory, uncomment the command below
# %ls $sdy212_directory

Read in the 4 files and load into pandas DataFrames¶

subject_file = sdy212_directory + "/subject.txt"
arm_2_subject_file = sdy212_directory + "/arm_2_subject.txt"
arm_or_cohort_file = sdy212_directory + "/arm_or_cohort.txt"
hai_result_file = sdy212_directory + "/hai_result.txt"
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
hai_results = pd.read_table(hai_result_file,sep="\t")
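The same four reads can also be expressed as a loop, which makes it easier to fail early when a file is missing. The following is a minimal sketch; `load_tables` and `TABLE_NAMES` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd

TABLE_NAMES = ["subject", "arm_2_subject", "arm_or_cohort", "hai_result"]


def load_tables(directory, names=TABLE_NAMES):
    """Read <name>.txt from directory into a dict of DataFrames keyed by name."""
    tables = {}
    for name in names:
        path = Path(directory) / f"{name}.txt"
        if not path.is_file():
            raise FileNotFoundError(f"Expected table file not found: {path}")
        # The ImmPort Tab files are tab-separated text with a header row.
        tables[name] = pd.read_csv(path, sep="\t")
    return tables


# Possible usage with the configuration defined earlier:
# tables = load_tables(sdy212_directory)
# subjects = tables["subject"]
```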

Review content of the 4 Data Frames¶

subjects.head(5)

arm_2_subject.head(5)

arm_or_cohort.head()

hai_results.head()

# To review the columns for all data frames
print(subjects.columns)
print(arm_2_subject.columns)
print(arm_or_cohort.columns)
print(hai_results.columns)

Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY',
       'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN',
       'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'],
      dtype='object')
Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY',
       'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'],
      dtype='object')
Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE',
       'WORKSPACE_ID'],
      dtype='object')
Index(['RESULT_ID', 'ARM_ACCESSION', 'BIOSAMPLE_ACCESSION', 'COMMENTS',
       'EXPERIMENT_ACCESSION', 'EXPSAMPLE_ACCESSION', 'REPOSITORY_ACCESSION',
       'REPOSITORY_NAME', 'STUDY_ACCESSION', 'STUDY_TIME_COLLECTED',
       'STUDY_TIME_COLLECTED_UNIT', 'SUBJECT_ACCESSION', 'UNIT_PREFERRED',
       'UNIT_REPORTED', 'VALUE_PREFERRED', 'VALUE_REPORTED',
       'VIRUS_STRAIN_PREFERRED', 'VIRUS_STRAIN_REPORTED', 'WORKSPACE_ID'],
      dtype='object')

Clean up and merge data for analysis¶

The steps below remove columns not needed for analysis and assign more meaningful labels to the ARM names. From the arm_or_cohort contents reviewed above, we can see that ARM894/Cohort_1 corresponds to the Young participants and ARM895/Cohort_2 corresponds to the Old participants. We then merge the ARM information with the subject demographic information.

subjects = subjects[['SUBJECT_ACCESSION','GENDER','RACE']]
# .copy() gives an independent frame, so the ARM_NAME assignments below do not raise SettingWithCopyWarning
arm_2_subject = arm_2_subject[['SUBJECT_ACCESSION','ARM_ACCESSION','MIN_SUBJECT_AGE']].copy()
arm_2_subject['ARM_NAME'] = ""
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
# Both frames share the key column, so a single on= argument is enough
arm_2_subject_merged = pd.merge(subjects, arm_2_subject, on='SUBJECT_ACCESSION')
arm_2_subject_merged.head()
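As an alternative to the two `.loc` assignments, the labels can be applied with a dictionary and `Series.map`, and the merge can be asked to check that each subject appears only once on each side. The sketch below uses a few rows copied from the output tables in this tutorial rather than the full files; the `_demo` names are introduced here for illustration.

```python
import pandas as pd

# A handful of rows taken from this tutorial's output tables.
subjects_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB134304", "SUB134309", "SUB134258"],
    "GENDER": ["Female", "Male", "Female"],
    "RACE": ["Asian", "White", "White"],
})
arms_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB134304", "SUB134309", "SUB134258"],
    "ARM_ACCESSION": ["ARM894", "ARM894", "ARM895"],
    "MIN_SUBJECT_AGE": [24.59, 24.99, 85.41],
})

# Map each arm accession to its label in one step.
arm_labels = {"ARM894": "Young", "ARM895": "Old"}
arms_demo["ARM_NAME"] = arms_demo["ARM_ACCESSION"].map(arm_labels)

# validate="one_to_one" raises MergeError if a subject is duplicated on either side.
merged_demo = pd.merge(subjects_demo, arms_demo, on="SUBJECT_ACCESSION",
                       how="inner", validate="one_to_one")
print(merged_demo)
```

Arms missing from `arm_labels` would come out as NaN rather than an empty string, which makes unlabeled rows easy to spot with `isna()`.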

Simple descriptive statistics¶

We will stop here with method 1 after the simple counts below. The next step would be to merge the hai_result information into the arm_2_subject_merged data frame so that the HAI results could be analyzed. Details on how to accomplish this, plus examples of analysis and plotting of results, can be found in the more detailed tutorial mentioned at the beginning of this tutorial.

arm_2_subject_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']

ARM_NAME
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64

arm_2_subject_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']

GENDER
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64
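To give a rough idea of the next step described above, the sketch below merges HAI rows into a subject frame and summarizes the titers. It is not the method used in the detailed tutorial. The rows are synthetic and the subject accessions are made up, but the column names follow the hai_result.txt header shown earlier.

```python
import pandas as pd

# Synthetic stand-in for arm_2_subject_merged (subject accessions are made up).
cohort_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB1", "SUB2"],
    "ARM_ACCESSION": ["ARM894", "ARM895"],
    "ARM_NAME": ["Young", "Old"],
})
# Synthetic stand-in for hai_results, with two time points per subject.
hai_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB1", "SUB1", "SUB2", "SUB2"],
    "ARM_ACCESSION": ["ARM894", "ARM894", "ARM895", "ARM895"],
    "STUDY_TIME_COLLECTED": [0, 28, 0, 28],
    "VIRUS_STRAIN_PREFERRED": ["A/Brisbane/10/2007"] * 4,
    "VALUE_PREFERRED": [40, 320, 20, 40],
})

# Merging on both keys avoids duplicated ARM_ACCESSION_x / ARM_ACCESSION_y columns.
hai_merged_demo = pd.merge(
    cohort_demo,
    hai_demo,
    on=["SUBJECT_ACCESSION", "ARM_ACCESSION"],
    how="inner",
)

# Median titer per arm and collection day.
median_titer = (hai_merged_demo
                .groupby(["ARM_NAME", "STUDY_TIME_COLLECTED"])["VALUE_PREFERRED"]
                .median())
print(median_titer)
```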

Method 2: Download using the ImmPort File Download API¶

For this method, we will be using the files that are available from the ALLSTUDIES/ALLSTUDIES-DR23_Metadata directory. We will download the same 4 files that we used for method 1, but will use the Download API to download them programmatically. Because the files contain the data for all studies, we will then filter out all content not necessary for the SDY212 analysis.

The DR23 above represents the version of the data at a specific point in time. ImmPort releases updated and new studies approximately 6 times a year, so you will need to replace DR23 with the latest release number.

immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/subject.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/arm_2_subject.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/arm_or_cohort.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/hai_result.txt",download_directory)
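Because the release number changes several times a year, it may help to keep it in one variable and build the remote paths from it. This is a small sketch; `RELEASE`, `TABLE_FILES` and `metadata_paths` are hypothetical names, and the download call is left as a comment because it needs valid credentials.

```python
RELEASE = "DR23"  # replace with the latest ImmPort data release
TABLE_FILES = ["subject.txt", "arm_2_subject.txt", "arm_or_cohort.txt", "hai_result.txt"]


def metadata_paths(release, file_names=TABLE_FILES):
    """Build the ALLSTUDIES metadata paths for one data release."""
    folder = f"/ALLSTUDIES/ALLSTUDIES-{release}_Metadata"
    return [f"{folder}/{name}" for name in file_names]


remote_paths = metadata_paths(RELEASE)
for remote_path in remote_paths:
    print(remote_path)

# Each path could then be passed to the download function used above:
# for remote_path in remote_paths:
#     immport_download.download_file(user_name, password, remote_path, download_directory)
```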

Check that download was successful¶

%ls ../output

arm_2_subject.txt  aspera-scp-transfer.0.log  hai_result.txt
arm_or_cohort.txt  aspera-scp-transfer.log    subject.txt

Read in the 4 files and load into pandas DataFrames, filter to SDY212¶

For this step we follow a process similar to the one used for method 1 to build our analysis data frame, with added filtering to include only information from SDY212. We can simplify the filtering because we already know that the 2 ARM_ACCESSION values of interest for SDY212 are ARM894 and ARM895.

subject_file = download_directory + "/subject.txt"
arm_2_subject_file = download_directory + "/arm_2_subject.txt"
arm_or_cohort_file = download_directory + "/arm_or_cohort.txt"
hai_result_file = download_directory + "/hai_result.txt"
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
hai_results = pd.read_table(hai_result_file,sep="\t")

subjects = subjects[['SUBJECT_ACCESSION','GENDER','RACE']]
arm_2_subject = arm_2_subject[['SUBJECT_ACCESSION','ARM_ACCESSION','MIN_SUBJECT_AGE']]
# Filter only records for SDY212 using ARM_ACCESSION; .copy() avoids SettingWithCopyWarning on the assignments below
arm_2_subject = arm_2_subject[arm_2_subject['ARM_ACCESSION'].isin(['ARM894','ARM895'])].copy()
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
# The inner merge keeps only subjects that belong to the two SDY212 arms
arm_2_subject_merged = pd.merge(subjects, arm_2_subject, on='SUBJECT_ACCESSION')
arm_2_subject_merged.head()
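When the arm accessions for a study are not known in advance, they could be looked up in arm_or_cohort through its STUDY_ACCESSION column instead of being hard-coded. The sketch below uses small synthetic frames. The SDY212 rows match the tables in this tutorial, while SDY000, ARM0000 and SUB000000 are made-up values that stand in for another study.

```python
import pandas as pd

arm_or_cohort_demo = pd.DataFrame({
    "ARM_ACCESSION": ["ARM895", "ARM894", "ARM0000"],   # ARM0000 is made up
    "NAME": ["Cohort_2", "Cohort_1", "Other"],
    "STUDY_ACCESSION": ["SDY212", "SDY212", "SDY000"],  # SDY000 is made up
})
arm_2_subject_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB134323", "SUB134256", "SUB000000"],
    "ARM_ACCESSION": ["ARM894", "ARM895", "ARM0000"],
})

# Look up the arms that belong to the study of interest.
sdy212_arms = arm_or_cohort_demo.loc[
    arm_or_cohort_demo["STUDY_ACCESSION"] == "SDY212", "ARM_ACCESSION"
]

# Keep only the subjects assigned to those arms.
sdy212_subjects = arm_2_subject_demo[
    arm_2_subject_demo["ARM_ACCESSION"].isin(sdy212_arms)
].copy()
print(sdy212_subjects)
```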

Simple Descriptive Statistics¶

At this point we should have built a data frame with the same content as in method 1, so we will look at a few descriptive statistics to confirm.

arm_2_subject_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']

ARM_NAME
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64

arm_2_subject_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']

GENDER
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64

Method 3: Download using the ImmPort Query API¶

This method will use the Query API to download a JSON file of HAI results for SDY212. Because the JSON file contains all the columns we need for analysis, this simplifies the creation of the data frame. The final counts for the descriptive statistics may differ slightly from those in methods 1 and 2, but this is expected because not all the subjects in the 2 arms have HAI results.

# Request a token, make the API call, then load the result into a pandas DataFrame
token = immport_download.request_immport_token(user_name, password)
r = immport_download.api("https://api.immport.org/data/query/result/hai?studyAccession=SDY212",token)
hai_results = pd.read_json(r)

hai_results.columns

Index(['ageEvent', 'ageEventSpecify', 'ageUnit', 'ancestralPopulation',
       'armAccession', 'armName', 'biosampleAccession', 'biosampleSubtype',
       'biosampleType', 'clinical', 'comments', 'ethnicity',
       'experimentAccession', 'expsampleAccession', 'gender', 'maxSubjectAge',
       'measurementTechnique', 'minSubjectAge', 'plannedVisitAccession',
       'race', 'raceSpecify', 'repositoryAccession', 'repositoryName',
       'resultId', 'species', 'strain', 'studyAccession', 'studyTimeCollected',
       'studyTimeCollectedUnit', 'studyTimeT0Event', 'studyTimeT0EventSpecify',
       'studyTitle', 'subjectAccession', 'subjectPhenotype',
       'treatmentAccession', 'unitPreferred', 'unitReported', 'valuePreferred',
       'valueReported', 'virusStrainPreferred', 'virusStrainReported'],
      dtype='object')

Remove columns and rename to match names in the first 2 methods¶

arm_2_subject_merged = hai_results[['subjectAccession','armAccession','minSubjectAge','gender','race']]
# Rename the camelCase columns to the upper-case names used in methods 1 and 2
arm_2_subject_merged = arm_2_subject_merged.rename(columns={
    'subjectAccession':'SUBJECT_ACCESSION',
    'armAccession':'ARM_ACCESSION',
    'minSubjectAge':'MIN_SUBJECT_AGE',
    'gender':'GENDER',
    'race':'RACE'})
arm_2_subject_merged.loc[arm_2_subject_merged['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject_merged.loc[arm_2_subject_merged['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
arm_2_subject_merged.head()
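The Query API returns camelCase column names, while the tab-separated files use upper-case names with underscores. If many columns need renaming, the mapping could be generated instead of typed out. This is a sketch only: `camel_to_upper_snake` is a hypothetical helper, and it has been checked here only against simple names such as the five used above.

```python
import re


def camel_to_upper_snake(name):
    """Convert a camelCase name such as 'minSubjectAge' to 'MIN_SUBJECT_AGE'."""
    # Insert an underscore before every capital letter that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).upper()


wanted = ["subjectAccession", "armAccession", "minSubjectAge", "gender", "race"]
rename_map = {name: camel_to_upper_snake(name) for name in wanted}
print(rename_map)

# rename_map can then be passed to DataFrame.rename(columns=rename_map).
```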

Simple Descriptive Statistics¶

At this point we should have built a data frame with content similar to that in methods 1 and 2, so we will look at a few descriptive statistics to confirm. Because each subject may have multiple HAI results, we need to remove duplicates before counting.

arm_2_subject_merged[['ARM_NAME','SUBJECT_ACCESSION']].drop_duplicates().groupby('ARM_NAME').count()

arm_2_subject_merged[['GENDER','SUBJECT_ACCESSION']].drop_duplicates().groupby('GENDER').count()
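An equivalent way to count each subject once is `nunique`, which counts distinct values within each group without a separate `drop_duplicates` step. The sketch below uses a small synthetic frame with made-up subject accessions.

```python
import pandas as pd

# Synthetic frame: one row per HAI result, so subjects repeat.
results_demo = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB1", "SUB1", "SUB2", "SUB3", "SUB3"],  # made up
    "ARM_NAME": ["Young", "Young", "Old", "Old", "Old"],
})

# Number of distinct subjects in each arm.
subjects_per_arm = results_demo.groupby("ARM_NAME")["SUBJECT_ACCESSION"].nunique()
print(subjects_per_arm)
```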

Download Files Using Data Query and File Download API¶

In this final section, we have an example of using the Query API to identify flow cytometry FCS files for SDY212 and ARM Cohort_1 using the filePath API method. Then once we have identified the file paths for these files of interest, we will download them using the Download API. We will only download 5 files for this example.

token = immport_download.request_immport_token(user_name, password)
r = immport_download.api("https://api.immport.org/data/query/result/filePath?studyAccession=SDY212&armName=Cohort_1&measurementTechnique=Flow%20cytometry",token)
df = pd.read_json(r)
df = df[df['fileDetail'] == "Flow cytometry result"]
df_fcs = df[['subjectAccession','armAccession','armName','gender','race','minSubjectAge','studyTimeCollected',
             'fileDetail','filePath']]
df_fcs.head()
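The query string in the call above is written by hand, including the %20 for the space in "Flow cytometry". As a sketch, the same URL can be assembled with the standard library so that filter values are encoded automatically. `build_query_url` is a hypothetical helper, and the filter names are the ones used in the call above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.immport.org/data/query/result/filePath"


def build_query_url(base_url, **filters):
    """Append URL-encoded filters to a Query API endpoint."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base_url}?{urlencode(filters, quote_via=quote)}"


file_path_url = build_query_url(
    BASE_URL,
    studyAccession="SDY212",
    armName="Cohort_1",
    measurementTechnique="Flow cytometry",
)
print(file_path_url)
```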

unique_file_paths = df_fcs.filePath.unique()
unique_fcs_paths = [path for path in unique_file_paths if path.endswith(".fcs")]
# Download at most the first 5 FCS files; slicing avoids an IndexError if fewer are found
for fcs_path in unique_fcs_paths[:5]:
    print("Downloading: ", fcs_path)
    immport_download.download_file(user_name, password, fcs_path, download_directory)

Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_FLU 010V1.390100.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_FLU 018V1.390104.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_022V1.390144.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_047V1.390160.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_039V1.391052.fcs

%ls ../output

arm_2_subject.txt
arm_or_cohort.txt
aspera-scp-transfer.0.log
aspera-scp-transfer.log
hai_result.txt
PHOSPHOFLOW SPECIMEN_022V1.390144.fcs
PHOSPHOFLOW SPECIMEN_039V1.391052.fcs
PHOSPHOFLOW SPECIMEN_047V1.390160.fcs
PHOSPHOFLOW SPECIMEN_FLU 010V1.390100.fcs
PHOSPHOFLOW SPECIMEN_FLU 018V1.390104.fcs
subject.txt

	SUBJECT_ACCESSION	ANCESTRAL_POPULATION	DESCRIPTION	ETHNICITY	GENDER	RACE	RACE_SPECIFY	SPECIES	STRAIN	STRAIN_CHARACTERISTICS	WORKSPACE_ID
0	SUB134268	NaN	This subject record was used to consolidate du...	Not Hispanic or Latino	Female	White	NaN	Homo sapiens	NaN	NaN	2883
1	SUB134304	NaN	This subject record was used to consolidate du...	Not Hispanic or Latino	Female	Asian	NaN	Homo sapiens	NaN	NaN	2883
2	SUB134309	NaN	This subject record was used to consolidate du...	Not Hispanic or Latino	Male	White	NaN	Homo sapiens	NaN	NaN	2883
3	SUB134324	NaN	This subject record was used to consolidate du...	Not Hispanic or Latino	Male	Other	White, Asian	Homo sapiens	NaN	NaN	2883
4	SUB134240	NaN	This subject record was used to consolidate du...	Not Hispanic or Latino	Female	White	NaN	Homo sapiens	NaN	NaN	2883

	ARM_ACCESSION	SUBJECT_ACCESSION	AGE_EVENT	AGE_EVENT_SPECIFY	AGE_UNIT	MAX_SUBJECT_AGE	MIN_SUBJECT_AGE	SUBJECT_PHENOTYPE
0	ARM894	SUB134323	Age at Study Day 0	NaN	Years	23.82	23.82	Non-twin
1	ARM894	SUB134252	Age at Study Day 0	NaN	Years	24.15	24.15	Non-twin
2	ARM895	SUB134256	Age at Study Day 0	NaN	Years	83.89	83.89	Non-twin
3	ARM895	SUB134262	Age at Study Day 0	NaN	Years	84.17	84.17	Non-twin
4	ARM895	SUB134265	Age at Study Day 0	NaN	Years	85.82	85.82	Non-twin

	RESULT_ID	ARM_ACCESSION	BIOSAMPLE_ACCESSION	COMMENTS	EXPERIMENT_ACCESSION	EXPSAMPLE_ACCESSION	REPOSITORY_ACCESSION	REPOSITORY_NAME	STUDY_ACCESSION	STUDY_TIME_COLLECTED	STUDY_TIME_COLLECTED_UNIT	SUBJECT_ACCESSION	UNIT_PREFERRED	UNIT_REPORTED	VALUE_PREFERRED	VALUE_REPORTED	VIRUS_STRAIN_PREFERRED	VIRUS_STRAIN_REPORTED	WORKSPACE_ID
0	7037	ARM894	BS697119	NaN	EXP13382	ES760535	NaN	NaN	SDY212	28	Days	SUB134319	NaN	NaN	1280	1280	A/Brisbane/10/2007	H3N2	2883
1	6842	ARM894	BS697053	NaN	EXP13382	ES760340	NaN	NaN	SDY212	28	Days	SUB134253	NaN	NaN	320	320	A/Brisbane/10/2007	H3N2	2883
2	7009	ARM895	BS697110	NaN	EXP13382	ES760507	NaN	NaN	SDY212	28	Days	SUB134310	NaN	NaN	20	20	B/Florida/4/2006	B	2883
3	6636	ARM895	BS694718	NaN	EXP13382	ES742210	NaN	NaN	SDY212	0	Days	SUB134308	NaN	NaN	20	20	A/Brisbane/59/2007	H1N1	2883
4	6785	ARM895	BS694691	NaN	EXP13382	ES742131	NaN	NaN	SDY212	0	Days	SUB134281	NaN	NaN	40	40	A/Brisbane/10/2007	H3N2	2883

	ARM_ACCESSION	DESCRIPTION	NAME	STUDY_ACCESSION	TYPE	WORKSPACE_ID
0	ARM895	Older participants aged 60 to 89 years, vaccin...	Cohort_2	SDY212	Experimental	2883
1	ARM894	Young participants aged 20 to 30 years, vaccin...	Cohort_1	SDY212	Experimental	2883

	SUBJECT_ACCESSION	GENDER	RACE	ARM_ACCESSION	MIN_SUBJECT_AGE	ARM_NAME
0	SUB134258	Female	White	ARM895	85.41	Old
1	SUB134296	Female	White	ARM895	87.10	Old
2	SUB134304	Female	Asian	ARM894	24.59	Young
3	SUB134309	Male	White	ARM894	24.99	Young
4	SUB134324	Male	Other	ARM894	20.68	Young

	SUBJECT_ACCESSION	ARM_ACCESSION	MIN_SUBJECT_AGE	GENDER	race	ARM_NAME
0	SUB134240	ARM895	62.86	Female	White	Old
1	SUB134251	ARM894	29.23	Female	White	Young
2	SUB134258	ARM895	85.41	Female	White	Old
3	SUB134264	ARM895	68.08	Female	White	Old
4	SUB134271	ARM895	86.61	Female	White	Old

	subjectAccession	armAccession	armName	gender	race	minSubjectAge	fileDetail	filePath
0	SUB134242	ARM894	Cohort_1	Male	White	26.86	Flow cytometry result	/SDY212/ResultFiles/Flow_cytometry_result/pFlo...
1	SUB134249	ARM894	Cohort_1	Female	Other	26.01	Flow cytometry result	/SDY212/ResultFiles/Flow_cytometry_result/pFlo...
24	SUB134242	ARM894	Cohort_1	Male	White	26.86	Flow cytometry result	/SDY212/ResultFiles/Flow_cytometry_result/PHOS...
25	SUB134242	ARM894	Cohort_1	Male	White	26.86	Flow cytometry result	/SDY212/ResultFiles/Flow_cytometry_result/PHOS...
26	SUB134249	ARM894	Cohort_1	Female	Other	26.01	Flow cytometry result	/SDY212/ResultFiles/Flow_cytometry_result/PHOS...