Multiple Methods for Downloading and Analyzing ImmPort data in Python

In this tutorial we explore 3 methods for downloading information from ImmPort, along with ways to format the data so that it is ready for analysis. Approaches to analyzing the data are covered in separate R and Python tutorials, which are available at immport.org. The Python version is a very good source of information on both downloading and analyzing ImmPort data, and it is the basis for much of the code in this tutorial. The original Python tutorial is available in HTML and Jupyter Notebook formats.

This tutorial focuses only on downloading and preparing data for analysis, not on the analysis itself. The choice of analysis language is left to the researcher; this tutorial shows how the work can be done in Python. A similar tutorial using R is planned.

Overview of Access to ImmPort data

There are 3 alternative methods for downloading data from ImmPort. Which one to choose is up to the researcher, but each method has advantages depending on your analysis plan.

Method 1: Download using the ImmPort Data Browser

The ImmPort Data Browser is a web-based application that allows the user to download individual files or folders of files, and it might be a good choice for beginning data scientists. The screenshot below shows the starting screen for downloading data for SDY212.

For this tutorial we will be downloading the SDY212-DR23_Tab.zip file. This zip package contains all the data files for SDY212 in TSV format. Each file represents the content of a database table containing rows of information for study SDY212. An overview of the ImmPort database model is available here and detailed table information is available here. One advantage of this method is that you can get all the files you may need with one click.

If you are interested in doing cross-study analysis, there is an ALLSTUDIES-DRXX_Tab.zip file that contains files that have the data for all the studies. Another alternative is to download individual files, which are in the ALLSTUDIES/ALLSTUDIES-DRXX_Metadata folder. An example screenshot is shown below.

Method 2: Download using the ImmPort File Download API

This method allows the user to download files of interest through a programmatic API. A bash shell script has been developed for this purpose, and this tutorial illustrates how to download files using Python; more details are in the section for this method. One advantage of this method is that you can download individual files programmatically rather than through a web interface.

The ImmPort files are hosted on and downloaded from an Aspera file system. Aspera is an application that greatly improves performance when downloading large files. The Aspera company provides executables, which are packaged in the shell script zip package and must be downloaded from ImmPort.

Method 3: Download using the ImmPort Query API

This method allows the user to obtain data from a REST API. Currently this REST API supports downloading assay results for experiment types such as ELISA, ELISPOT, and HAI. If you are interested in downloading raw experiment files such as FCS files, there is an API call to support this need, and a small example of it appears later in this tutorial. The types of REST methods available and the filters you can add to a query are detailed in another document. The information returned by the API methods contains many metadata elements that may be spread across 4 files in method 1 or 2, so it may be easier to start with this method until you are comfortable merging multiple files into one coherent data set with Python pandas or R data frames.

Access to all shared ImmPort data requires you to become a registered user. Registration is simple, and you can start [here](https://immport-user-admin.niaid.nih.gov:8443/registrationuser/registration). Because access is limited to registered users, the API methods require an authorization token as part of each request. The software we make available hides most of this complexity, but you need to be aware of the need to acquire a token.
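To make the role of the token more concrete, here is a minimal sketch of how a token could be attached to a request. It assumes the token is sent as a bearer token in the Authorization header, which is an assumption of this sketch rather than something this tutorial confirms; the helper name build_authorized_request is hypothetical. The immport_download.py functions used later in this tutorial handle this step for you, so review that file for the actual scheme.

```python
import urllib.request

def build_authorized_request(url, token):
    """Build (but do not send) a GET request that carries the token.

    Assumption: the token travels as a bearer token in the Authorization
    header; check immport_download.py for the scheme actually used.
    """
    return urllib.request.Request(url, headers={"Authorization": "bearer " + token})

# Example with a placeholder token; nothing is sent over the network here.
request = build_authorized_request(
    "https://api.immport.org/data/query/result/hai?studyAccession=SDY212",
    "PLACEHOLDER_TOKEN",
)
print(request.get_header("Authorization"))
```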

Getting Started

The section below sets up the Python environment that is used by all the methods. This tutorial assumes you have downloaded the File Download Tool distribution available from ImmPort, which contains the bash shell script, the immport_download.py file, and the Aspera executables. The immport_download.py file contains convenience functions for accessing the API; please review it if you want more details. The tutorial also assumes you have unzipped the distribution into a directory at the same level as the notebook directory.

In [1]:
import sys
import os
import pandas as pd

# Set the Python path to the location of the directory containing "immport_download.py"
immport_download_code = "../bin/"
sys.path.insert(0,immport_download_code)
os.chdir(immport_download_code)

import immport_download

Example Configuration Properties

In [2]:
user_name = "REPLACE"
password = "REPLACE"
download_directory = "../output"
data_directory = "../data"
sdy212_directory= "../data/SDY212-DR23_Tab/Tab"
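Rather than typing credentials into the notebook, one option is to read them from environment variables. The sketch below assumes two variable names, IMMPORT_USERNAME and IMMPORT_PASSWORD, which are hypothetical and chosen only for this example; the helper load_credentials is likewise not part of immport_download.py.

```python
import os

def load_credentials(environ=os.environ):
    """Return (user_name, password) read from environment variables.

    IMMPORT_USERNAME and IMMPORT_PASSWORD are hypothetical names chosen
    for this sketch; ImmPort does not define them.
    """
    try:
        return environ["IMMPORT_USERNAME"], environ["IMMPORT_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None

# Example with an explicit mapping so the sketch runs without real credentials.
user_name, password = load_credentials(
    {"IMMPORT_USERNAME": "example_user", "IMMPORT_PASSWORD": "example_password"}
)
```

In a real session you would call load_credentials() with no argument, so the values come from your shell environment and never appear in the saved notebook.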

Method 1: Download using the ImmPort Data Browser

In preparation for this step, the SDY212-DR23_Tab.zip file was downloaded using the DataBrowser and unzipped into the ../data/SDY212-DR23_Tab/Tab directory. The following 4 files from this directory will be used:

  • subject.txt - general subject demographic information.
  • arm_2_subject.txt - mapping from subjects to study arms/cohorts.
  • arm_or_cohort.txt - study arm names and descriptions.
  • hai_result.txt - HAI results.

More details on preparing and analyzing the information from these files are available in the tutorial mentioned at the top of this tutorial.

In [3]:
# To view the contents of the ../data/SDY212-DR23_Tab/Tab directory, uncomment the command below
# %ls $sdy212_directory

Read in the 4 files and load them into pandas data frames

In [4]:
subject_file = sdy212_directory + "/subject.txt"
arm_2_subject_file = sdy212_directory + "/arm_2_subject.txt"
arm_or_cohort_file = sdy212_directory + "/arm_or_cohort.txt"
hai_result_file = sdy212_directory + "/hai_result.txt"
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
hai_results = pd.read_table(hai_result_file,sep="\t")

Review content of the 4 Data Frames

In [5]:
subjects.head(5)
Out[5]:
SUBJECT_ACCESSION ANCESTRAL_POPULATION DESCRIPTION ETHNICITY GENDER RACE RACE_SPECIFY SPECIES STRAIN STRAIN_CHARACTERISTICS WORKSPACE_ID
0 SUB134268 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
1 SUB134304 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female Asian NaN Homo sapiens NaN NaN 2883
2 SUB134309 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male White NaN Homo sapiens NaN NaN 2883
3 SUB134324 NaN This subject record was used to consolidate du... Not Hispanic or Latino Male Other White, Asian Homo sapiens NaN NaN 2883
4 SUB134240 NaN This subject record was used to consolidate du... Not Hispanic or Latino Female White NaN Homo sapiens NaN NaN 2883
In [6]:
arm_2_subject.head(5)
Out[6]:
ARM_ACCESSION SUBJECT_ACCESSION AGE_EVENT AGE_EVENT_SPECIFY AGE_UNIT MAX_SUBJECT_AGE MIN_SUBJECT_AGE SUBJECT_PHENOTYPE
0 ARM894 SUB134323 Age at Study Day 0 NaN Years 23.82 23.82 Non-twin
1 ARM894 SUB134252 Age at Study Day 0 NaN Years 24.15 24.15 Non-twin
2 ARM895 SUB134256 Age at Study Day 0 NaN Years 83.89 83.89 Non-twin
3 ARM895 SUB134262 Age at Study Day 0 NaN Years 84.17 84.17 Non-twin
4 ARM895 SUB134265 Age at Study Day 0 NaN Years 85.82 85.82 Non-twin
In [7]:
arm_or_cohort.head()
Out[7]:
ARM_ACCESSION DESCRIPTION NAME STUDY_ACCESSION TYPE WORKSPACE_ID
0 ARM895 Older participants aged 60 to 89 years, vaccin... Cohort_2 SDY212 Experimental 2883
1 ARM894 Young participants aged 20 to 30 years, vaccin... Cohort_1 SDY212 Experimental 2883
In [8]:
hai_results.head()
Out[8]:
RESULT_ID ARM_ACCESSION BIOSAMPLE_ACCESSION COMMENTS EXPERIMENT_ACCESSION EXPSAMPLE_ACCESSION REPOSITORY_ACCESSION REPOSITORY_NAME STUDY_ACCESSION STUDY_TIME_COLLECTED STUDY_TIME_COLLECTED_UNIT SUBJECT_ACCESSION UNIT_PREFERRED UNIT_REPORTED VALUE_PREFERRED VALUE_REPORTED VIRUS_STRAIN_PREFERRED VIRUS_STRAIN_REPORTED WORKSPACE_ID
0 7037 ARM894 BS697119 NaN EXP13382 ES760535 NaN NaN SDY212 28 Days SUB134319 NaN NaN 1280 1280 A/Brisbane/10/2007 H3N2 2883
1 6842 ARM894 BS697053 NaN EXP13382 ES760340 NaN NaN SDY212 28 Days SUB134253 NaN NaN 320 320 A/Brisbane/10/2007 H3N2 2883
2 7009 ARM895 BS697110 NaN EXP13382 ES760507 NaN NaN SDY212 28 Days SUB134310 NaN NaN 20 20 B/Florida/4/2006 B 2883
3 6636 ARM895 BS694718 NaN EXP13382 ES742210 NaN NaN SDY212 0 Days SUB134308 NaN NaN 20 20 A/Brisbane/59/2007 H1N1 2883
4 6785 ARM895 BS694691 NaN EXP13382 ES742131 NaN NaN SDY212 0 Days SUB134281 NaN NaN 40 40 A/Brisbane/10/2007 H3N2 2883
In [9]:
# To review the columns for all data frames
print(subjects.columns)
print(arm_2_subject.columns)
print(arm_or_cohort.columns)
print(hai_results.columns)
Index(['SUBJECT_ACCESSION', 'ANCESTRAL_POPULATION', 'DESCRIPTION', 'ETHNICITY',
       'GENDER', 'RACE', 'RACE_SPECIFY', 'SPECIES', 'STRAIN',
       'STRAIN_CHARACTERISTICS', 'WORKSPACE_ID'],
      dtype='object')
Index(['ARM_ACCESSION', 'SUBJECT_ACCESSION', 'AGE_EVENT', 'AGE_EVENT_SPECIFY',
       'AGE_UNIT', 'MAX_SUBJECT_AGE', 'MIN_SUBJECT_AGE', 'SUBJECT_PHENOTYPE'],
      dtype='object')
Index(['ARM_ACCESSION', 'DESCRIPTION', 'NAME', 'STUDY_ACCESSION', 'TYPE',
       'WORKSPACE_ID'],
      dtype='object')
Index(['RESULT_ID', 'ARM_ACCESSION', 'BIOSAMPLE_ACCESSION', 'COMMENTS',
       'EXPERIMENT_ACCESSION', 'EXPSAMPLE_ACCESSION', 'REPOSITORY_ACCESSION',
       'REPOSITORY_NAME', 'STUDY_ACCESSION', 'STUDY_TIME_COLLECTED',
       'STUDY_TIME_COLLECTED_UNIT', 'SUBJECT_ACCESSION', 'UNIT_PREFERRED',
       'UNIT_REPORTED', 'VALUE_PREFERRED', 'VALUE_REPORTED',
       'VIRUS_STRAIN_PREFERRED', 'VIRUS_STRAIN_REPORTED', 'WORKSPACE_ID'],
      dtype='object')

Clean up and merge data for analysis

The steps below remove columns not needed for analysis and assign more meaningful labels to the ARM names. From the arm_or_cohort contents reviewed above, we can see that ARM894/Cohort_1 corresponds to the Young participants and ARM895/Cohort_2 corresponds to the Old participants. We then merge the ARM information with the subject demographic information.

In [10]:
# Keep only the columns needed for analysis; .copy() avoids pandas' SettingWithCopyWarning
subjects = subjects[['SUBJECT_ACCESSION','GENDER','RACE']]
arm_2_subject = arm_2_subject[['SUBJECT_ACCESSION','ARM_ACCESSION','MIN_SUBJECT_AGE']].copy()
# Label each arm with a more meaningful name
arm_2_subject['ARM_NAME'] = ""
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
# Merge the arm information with the subject demographics
arm_2_subject_merged = pd.merge(subjects, arm_2_subject, on='SUBJECT_ACCESSION')
arm_2_subject_merged.head()
Out[10]:
SUBJECT_ACCESSION GENDER RACE ARM_ACCESSION MIN_SUBJECT_AGE ARM_NAME
0 SUB134268 Female White ARM894 29.71 Young
1 SUB134304 Female Asian ARM894 24.59 Young
2 SUB134309 Male White ARM894 24.99 Young
3 SUB134324 Male Other ARM894 20.68 Young
4 SUB134240 Female White ARM895 62.86 Old

Simple descriptive statistics

We will stop here with method 1 after the two counts below. The next step would be to merge the hai_result information into the arm_2_subject_merged data frame, so we could analyze the HAI results. Details on how to accomplish this, plus examples of analysis and plotting of results, can be found in the more detailed tutorial mentioned at the beginning of this tutorial.

In [11]:
arm_2_subject_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']
Out[11]:
ARM_NAME
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64
In [12]:
arm_2_subject_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']
Out[12]:
GENDER
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64
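The merge of HAI results into the subject data frame, described earlier in this section as the next step, might look like the sketch below. It uses small made-up data frames that only mimic the column layout shown above, so the accessions and values are illustrative and not taken from SDY212.

```python
import pandas as pd

# Toy rows that only mimic the column layout shown above; the accessions
# and values are made up for illustration.
demo_subjects = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB_A", "SUB_B", "SUB_C"],
    "GENDER": ["Female", "Male", "Female"],
    "ARM_NAME": ["Young", "Old", "Old"],
})
demo_hai = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB_A", "SUB_A", "SUB_B"],
    "STUDY_TIME_COLLECTED": [0, 28, 0],
    "VIRUS_STRAIN_PREFERRED": ["Strain X", "Strain X", "Strain X"],
    "VALUE_PREFERRED": [40, 320, 20],
})

# An inner join keeps one row per HAI result and drops subjects without results.
demo_merged = pd.merge(demo_subjects, demo_hai, on="SUBJECT_ACCESSION", how="inner")
print(demo_merged)
```

With the real files, hai_results also carries an ARM_ACCESSION column, so you may want to select only the HAI columns you need before merging to avoid duplicated column names.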

Method 2: Download using the ImmPort File Download API

For this method, we will use the files that are available from the ALLSTUDIES/ALLSTUDIES-DR23_Metadata directory. We will download the same 4 files that we used for method 1, but will use the Download API to download them programmatically. Because the files contain the data for all studies, we will then filter out all content not needed for the SDY212 analysis.

The DR23 above represents the version of the data at a specific point in time. ImmPort releases updated and new studies approximately 6 times a year, so you will need to replace DR23 with the latest release number.
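Because the release number changes, it can help to build the file paths from a single variable instead of editing each string. The helper below, allstudies_metadata_path, is a hypothetical name used only in this sketch; it follows the path pattern used in this tutorial.

```python
def allstudies_metadata_path(release, file_name):
    """Build the ALLSTUDIES metadata path for a given data release number."""
    return f"/ALLSTUDIES/ALLSTUDIES-DR{release}_Metadata/{file_name}"

file_names = ["subject.txt", "arm_2_subject.txt", "arm_or_cohort.txt", "hai_result.txt"]
# 23 is the release used in this tutorial; replace it with the latest release.
paths = [allstudies_metadata_path(23, name) for name in file_names]
for path in paths:
    print(path)
```

Each resulting path could then be passed to immport_download.download_file together with user_name, password, and download_directory.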
In [13]:
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/subject.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/arm_2_subject.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/arm_or_cohort.txt",download_directory)
immport_download.download_file(user_name,password,
                                    "/ALLSTUDIES/ALLSTUDIES-DR23_Metadata/hai_result.txt",download_directory)

Check that download was successful

In [14]:
%ls ../output
arm_2_subject.txt  aspera-scp-transfer.0.log  hai_result.txt
arm_or_cohort.txt  aspera-scp-transfer.log    subject.txt
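Beyond listing the directory, the same check could be scripted so that a missing or empty file is reported explicitly. This is a minimal sketch using only the standard library; missing_or_empty is a hypothetical helper, and the demonstration runs in a temporary directory rather than ../output.

```python
import tempfile
from pathlib import Path

def missing_or_empty(directory, file_names):
    """Return the names that are absent from directory or have zero size."""
    directory = Path(directory)
    return [
        name for name in file_names
        if not (directory / name).is_file() or (directory / name).stat().st_size == 0
    ]

# Demonstration in a temporary directory instead of ../output.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "subject.txt").write_text("SUBJECT_ACCESSION\n")
    (Path(demo_dir) / "hai_result.txt").write_text("")
    problems = missing_or_empty(
        demo_dir, ["subject.txt", "hai_result.txt", "arm_2_subject.txt"]
    )
print(problems)
```

An empty list from missing_or_empty(download_directory, [...]) would suggest that all 4 files arrived with some content.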

Read in the 4 files, load them into pandas data frames, and filter to SDY212

For this step we follow a process similar to the one used for method 1 to build our analysis data frame, with added filtering so that only information from SDY212 is included. We can simplify the filtering because we already know that, for SDY212, the 2 ARM_ACCESSION values we are interested in are ARM894 and ARM895.

In [15]:
subject_file = download_directory + "/subject.txt"
arm_2_subject_file = download_directory + "/arm_2_subject.txt"
arm_or_cohort_file = download_directory + "/arm_or_cohort.txt"
hai_result_file = download_directory + "/hai_result.txt"
subjects = pd.read_table(subject_file, sep="\t")
arm_2_subject = pd.read_table(arm_2_subject_file, sep="\t")
arm_or_cohort = pd.read_table(arm_or_cohort_file, sep="\t")
hai_results = pd.read_table(hai_result_file,sep="\t")
In [16]:
subjects = subjects[['SUBJECT_ACCESSION','GENDER','RACE']]
arm_2_subject = arm_2_subject[['SUBJECT_ACCESSION','ARM_ACCESSION','MIN_SUBJECT_AGE']]
# Filter only records for SDY212 using ARM_ACCESSION; .copy() avoids pandas' SettingWithCopyWarning
arm_2_subject = arm_2_subject[arm_2_subject['ARM_ACCESSION'].isin(['ARM894','ARM895'])].copy()
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject.loc[arm_2_subject['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
arm_2_subject_merged = pd.merge(subjects,arm_2_subject, left_on='SUBJECT_ACCESSION', right_on='SUBJECT_ACCESSION')
arm_2_subject_merged.head()
Out[16]:
SUBJECT_ACCESSION GENDER RACE ARM_ACCESSION MIN_SUBJECT_AGE ARM_NAME
0 SUB134258 Female White ARM895 85.41 Old
1 SUB134296 Female White ARM895 87.10 Old
2 SUB134304 Female Asian ARM894 24.59 Young
3 SUB134309 Male White ARM894 24.99 Young
4 SUB134324 Male Other ARM894 20.68 Young
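If the arm accessions for a study are not known in advance, they could be looked up from the STUDY_ACCESSION column of arm_or_cohort before filtering. The sketch below illustrates the idea with small made-up data frames; "SDY_OTHER", "ARM_X", and the subject accessions are placeholders, while the ARM894/ARM895 to SDY212 mapping comes from the arm_or_cohort output shown earlier.

```python
import pandas as pd

# Toy rows that mimic arm_or_cohort.txt and arm_2_subject.txt from the
# ALLSTUDIES folder; "SDY_OTHER", "ARM_X" and the subjects are made up.
demo_arm_or_cohort = pd.DataFrame({
    "ARM_ACCESSION": ["ARM894", "ARM895", "ARM_X"],
    "STUDY_ACCESSION": ["SDY212", "SDY212", "SDY_OTHER"],
})
demo_arm_2_subject = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB_A", "SUB_B", "SUB_C"],
    "ARM_ACCESSION": ["ARM894", "ARM_X", "ARM895"],
})

# Look up the arms that belong to the study, then keep only their subjects.
is_sdy212 = demo_arm_or_cohort["STUDY_ACCESSION"] == "SDY212"
study_arms = demo_arm_or_cohort.loc[is_sdy212, "ARM_ACCESSION"]
study_subjects = demo_arm_2_subject[
    demo_arm_2_subject["ARM_ACCESSION"].isin(study_arms)
].copy()
print(study_subjects)
```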

Simple Descriptive Statistics

At this point we should have built a data frame with the same content as in method 1, so we will look at a few descriptive statistics to confirm.

In [17]:
arm_2_subject_merged.groupby('ARM_NAME').count()['SUBJECT_ACCESSION']
Out[17]:
ARM_NAME
Old      61
Young    30
Name: SUBJECT_ACCESSION, dtype: int64
In [18]:
arm_2_subject_merged.groupby('GENDER').count()['SUBJECT_ACCESSION']
Out[18]:
GENDER
Female    54
Male      37
Name: SUBJECT_ACCESSION, dtype: int64

Method 3: Download using the ImmPort Query API

This method uses the Query API to download a JSON file of HAI results for SDY212. Because the JSON file contains all the columns we need for analysis, creating the data frame for analysis is simpler. The final counts for the descriptive statistics may differ slightly from those in methods 1 and 2, but this is expected because not all the subjects in the 2 arms have HAI results.

In [19]:
# Request a token, make the API call, then load the result into a pandas DataFrame
token = immport_download.request_immport_token(user_name, password)
r = immport_download.api("https://api.immport.org/data/query/result/hai?studyAccession=SDY212",token)
hai_results = pd.read_json(r)
In [20]:
hai_results.columns
Out[20]:
Index(['ageEvent', 'ageEventSpecify', 'ageUnit', 'ancestralPopulation',
       'armAccession', 'armName', 'biosampleAccession', 'biosampleSubtype',
       'biosampleType', 'clinical', 'comments', 'ethnicity',
       'experimentAccession', 'expsampleAccession', 'gender', 'maxSubjectAge',
       'measurementTechnique', 'minSubjectAge', 'plannedVisitAccession',
       'race', 'raceSpecify', 'repositoryAccession', 'repositoryName',
       'resultId', 'species', 'strain', 'studyAccession', 'studyTimeCollected',
       'studyTimeCollectedUnit', 'studyTimeT0Event', 'studyTimeT0EventSpecify',
       'studyTitle', 'subjectAccession', 'subjectPhenotype',
       'treatmentAccession', 'unitPreferred', 'unitReported', 'valuePreferred',
       'valueReported', 'virusStrainPreferred', 'virusStrainReported'],
      dtype='object')

Remove columns and rename the rest to match the names in the first 2 methods

In [21]:
# Keep only the columns of interest; .copy() avoids pandas' SettingWithCopyWarning
arm_2_subject_merged = hai_results[['subjectAccession','armAccession','minSubjectAge','gender','race']].copy()
# Rename the columns to match the names used in the first 2 methods
arm_2_subject_merged = arm_2_subject_merged.rename(columns={'subjectAccession':'SUBJECT_ACCESSION',
                                                            'armAccession':'ARM_ACCESSION',
                                                            'minSubjectAge':'MIN_SUBJECT_AGE',
                                                            'gender':'GENDER',
                                                            'race':'RACE'})
arm_2_subject_merged.loc[arm_2_subject_merged['ARM_ACCESSION'] == 'ARM894','ARM_NAME'] = "Young"
arm_2_subject_merged.loc[arm_2_subject_merged['ARM_ACCESSION'] == 'ARM895','ARM_NAME'] = "Old"
arm_2_subject_merged.head()
Out[21]:
SUBJECT_ACCESSION ARM_ACCESSION MIN_SUBJECT_AGE GENDER RACE ARM_NAME
0 SUB134240 ARM895 62.86 Female White Old
1 SUB134251 ARM894 29.23 Female White Young
2 SUB134258 ARM895 85.41 Female White Old
3 SUB134264 ARM895 68.08 Female White Old
4 SUB134271 ARM895 86.61 Female White Old
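The Query API returns camelCase column names, while the downloaded files use upper-case names with underscores. As an alternative to listing each rename by hand, the mapping could be generated. The function camel_to_upper_snake below is a hypothetical helper written for this sketch, not part of immport_download.py.

```python
import re

def camel_to_upper_snake(name):
    """Convert e.g. 'subjectAccession' to 'SUBJECT_ACCESSION'."""
    # Insert an underscore before every capital letter except a leading one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).upper()

query_columns = ["subjectAccession", "armAccession", "minSubjectAge", "gender", "race"]
rename_map = {name: camel_to_upper_snake(name) for name in query_columns}
print(rename_map)
```

The resulting dictionary can be passed to DataFrame.rename(columns=rename_map) to rename all selected columns in one call.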

Simple Descriptive Statistics

At this point we should have built a data frame with content similar to that in methods 1 and 2, so we will look at a few descriptive statistics to confirm. Because each subject may have multiple HAI results, we need to remove duplicates before counting.

In [22]:
arm_2_subject_merged[['ARM_NAME','SUBJECT_ACCESSION']].drop_duplicates().groupby('ARM_NAME').count()
Out[22]:
SUBJECT_ACCESSION
ARM_NAME
Old 60
Young 29
In [23]:
arm_2_subject_merged[['GENDER','SUBJECT_ACCESSION']].drop_duplicates().groupby('GENDER').count()
Out[23]:
SUBJECT_ACCESSION
GENDER
Female 54
Male 35
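Another way to count each subject once, without calling drop_duplicates first, is to count distinct values per group with nunique. The sketch below uses a small made-up data frame in which one subject appears twice; the accessions are placeholders.

```python
import pandas as pd

# Toy rows: SUB_A has two HAI results, so it appears twice.
demo_results = pd.DataFrame({
    "SUBJECT_ACCESSION": ["SUB_A", "SUB_A", "SUB_B", "SUB_C"],
    "ARM_NAME": ["Young", "Young", "Old", "Old"],
})

# nunique() counts distinct subjects per group, ignoring repeated rows.
subjects_per_arm = demo_results.groupby("ARM_NAME")["SUBJECT_ACCESSION"].nunique()
print(subjects_per_arm)
```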

Download Files Using Data Query and File Download API

In this final section, we have an example of using the Query API to identify flow cytometry FCS files for SDY212 and ARM Cohort_1 using the filePath API method. Then once we have identified the file paths for these files of interest, we will download them using the Download API. We will only download 5 files for this example.

In [24]:
token = immport_download.request_immport_token(user_name, password)
r = immport_download.api("https://api.immport.org/data/query/result/filePath?studyAccession=SDY212&armName=Cohort_1&measurementTechnique=Flow%20cytometry",token)
df = pd.read_json(r)
df = df[df['fileDetail'] == "Flow cytometry result"]
df_fcs = df[['subjectAccession','armAccession','armName','gender','race','minSubjectAge','studyTimeCollected', \
               'fileDetail','filePath']]
df_fcs.head()
Out[24]:
subjectAccession armAccession armName gender race minSubjectAge studyTimeCollected fileDetail filePath
0 SUB134242 ARM894 Cohort_1 Male White 26.86 0 Flow cytometry result /SDY212/ResultFiles/Flow_cytometry_result/pFlo...
1 SUB134249 ARM894 Cohort_1 Female Other 26.01 0 Flow cytometry result /SDY212/ResultFiles/Flow_cytometry_result/pFlo...
24 SUB134242 ARM894 Cohort_1 Male White 26.86 0 Flow cytometry result /SDY212/ResultFiles/Flow_cytometry_result/PHOS...
25 SUB134242 ARM894 Cohort_1 Male White 26.86 0 Flow cytometry result /SDY212/ResultFiles/Flow_cytometry_result/PHOS...
26 SUB134249 ARM894 Cohort_1 Female Other 26.01 0 Flow cytometry result /SDY212/ResultFiles/Flow_cytometry_result/PHOS...
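The query string in the request above contains a filter value with a space, which is written as %20. Instead of encoding it by hand, the URL could be assembled from a dictionary of filters with the standard library, as in this sketch. The filter names and values are the ones used in the cell above.

```python
from urllib.parse import quote, urlencode

base_url = "https://api.immport.org/data/query/result/filePath"
filters = {
    "studyAccession": "SDY212",
    "armName": "Cohort_1",
    "measurementTechnique": "Flow cytometry",
}
# quote_via=quote encodes the space as %20 instead of the default "+".
query_url = base_url + "?" + urlencode(filters, quote_via=quote)
print(query_url)
```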
In [25]:
unique_file_paths = df_fcs.filePath.unique()
unique_fcs_paths = [path for path in unique_file_paths if path.endswith(".fcs")]
# Download only the first 5 FCS files; slicing also copes with fewer than 5 matches
for fcs_path in unique_fcs_paths[:5]:
    print("Downloading: ", fcs_path)
    immport_download.download_file(user_name, password, fcs_path, download_directory)
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_FLU 010V1.390100.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_FLU 018V1.390104.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_022V1.390144.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_047V1.390160.fcs
Downloading:  /SDY212/ResultFiles/Flow_cytometry_result/PHOSPHOFLOW SPECIMEN_039V1.391052.fcs
In [26]:
%ls ../output
arm_2_subject.txt
arm_or_cohort.txt
aspera-scp-transfer.0.log
aspera-scp-transfer.log
hai_result.txt
PHOSPHOFLOW SPECIMEN_022V1.390144.fcs
PHOSPHOFLOW SPECIMEN_039V1.391052.fcs
PHOSPHOFLOW SPECIMEN_047V1.390160.fcs
PHOSPHOFLOW SPECIMEN_FLU 010V1.390100.fcs
PHOSPHOFLOW SPECIMEN_FLU 018V1.390104.fcs
subject.txt
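FCS files can be large, so when re-running the notebook it may be worth skipping files that are already in the download directory. The sketch below assumes that each file is saved under its base name, as the listing above suggests; paths_to_download is a hypothetical helper, and the demonstration uses made-up paths in a temporary directory.

```python
import posixpath
import tempfile
from pathlib import Path

def paths_to_download(remote_paths, directory, limit=5):
    """Return up to `limit` remote paths whose file is not yet in directory."""
    existing = {entry.name for entry in Path(directory).iterdir()}
    pending = [p for p in remote_paths if posixpath.basename(p) not in existing]
    return pending[:limit]

# Demonstration with made-up paths and a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "a.fcs").write_text("placeholder")
    pending = paths_to_download(
        ["/demo/a.fcs", "/demo/b.fcs", "/demo/c.fcs"], demo_dir, limit=5
    )
print(pending)
```

In the notebook, the download loop could then iterate over paths_to_download(unique_fcs_paths, download_directory) instead of the full list.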