Use Deidentified Data

Last modified 2024-07-15

Abbreviations Key
ATAC: assay for transposase-accessible chromatin
dbGaP: Database of Genotypes and Phenotypes (NIH)
DOB: date of birth
GEO: Gene Expression Omnibus (NCBI)
HIPAA: Health Insurance Portability and Accountability Act
HISE: Human Immune System Explorer
IDE: integrated development environment
NCBI: National Center for Biotechnology Information
NIH: National Institutes of Health
PHI: protected health information
PII: personally identifiable information
RNAseq: ribonucleic acid sequencing
SRA: Sequence Read Archive (NCBI)
WGS: whole genome sequencing

At a Glance

HISE supports the delivery of deidentified data from human subjects research. AIFI is careful not to ingest data that contains PHI. All sensitive data remains housed outside of HISE in third-party systems at collaborating clinical sites. Instead, we use random subject IDs that can't be linked to specific information about human subjects.

Description

Open science at AIFI depends on responsible collaboration with members of our partnership collective. When we exchange data, we aim to balance our scientific needs with the privacy rights of the human subjects who participate in our research. We also strive to maintain the trust of IRBs tasked with overseeing such research.

Data Sharing Policy

In addition to basic data about cohorts, samples, specimens, and subject demographics ingested through AIFI's LIMS, HISE accepts clinical questionnaire data, CBC results, and selected subject metadata. The time frame for public release of data is governed by applicable data-sharing agreements, laws, and regulations. Studies that receive even partial NIH funding, for example, are subject to agency regulations. These rules require that deidentified data be made publicly available no more than 12 months after the completion of each longitudinal study cohort. When we share such data in HISE, we follow NIH guidance, which recommends two data access tiers:

Tier 1

Tier 1 includes whole genome or whole exome sequencing data. If there is a risk that a patient could be identified, only deidentified data is shared. It's placed in an NIH-designated managed-access repository.

Tier 2

Tier 2 includes all other kinds of deidentified data. Such data poses little or no risk of exposing the patient's identity, and the data can therefore be shared publicly in HISE.

Definitions

For key terms that pertain to AIFI data sharing, see the following table.

Term: Definition
information: Techniques and methods, test data, results (including pharmacological, toxicological, and clinical test data and results), analytical and quality control data, and algorithms.
PHI: A subtype of PII that includes all individually identifiable health information, including demographic data, medical histories, test results, insurance information, and other information used to identify a patient or provide healthcare services or coverage.
PII: Information that can be used to distinguish or trace an individual's identity, either alone (direct) or in combination with other personal or identifying information linked to a specific individual (indirect).
protected: Information covered by the Privacy Rule issued under HIPAA, a 1996 U.S. law that protects patients' privacy rights.
sample: A biological sample obtained from a human subject.

Data Masking

Data masking is the practice of handling data in a way that protects personally identifiable information (PII) while keeping the anonymized data available for analysis and testing. If you have a Data App, it's important to mask PII fields on your Certificate of Reproducibility (CertPro). You can either mask selected fields on a vertex or delete the metadata for an entire vertex (that is, remove the vertex entry from the metadata field). If the metadata has a revision history, you should also remove the other revisions.
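The two masking options can be illustrated with a minimal Python sketch. It is not the HISE API: the certificate layout (a "metadata" field keyed by vertex ID, plus a "revisions" list), the function names, and the mask placeholder are all assumptions made for illustration.

```python
import copy

MASK = "[MASKED]"  # hypothetical placeholder; use whatever convention your team agrees on

# Hypothetical certificate layout: metadata keyed by vertex ID, plus older revisions.
certificate = {
    "metadata": {
        "v1": {"birth_date": "1980-05-17", "assay": "flow cytometry"},
        "v2": {"assay": "RNAseq"},
    },
    "revisions": [{"v1": {"birth_date": "1980-05-17"}}],
}


def mask_fields(cert, vertex_id, pii_fields):
    """Return a copy with the named fields on one vertex replaced by MASK."""
    masked = copy.deepcopy(cert)
    vertex_metadata = masked["metadata"].get(vertex_id, {})
    for name in pii_fields:
        if name in vertex_metadata:
            vertex_metadata[name] = MASK
    # Older revisions may still hold the unmasked values, so drop them.
    masked["revisions"] = []
    return masked


def delete_vertex_metadata(cert, vertex_id):
    """Return a copy with the whole vertex entry removed from the metadata field."""
    masked = copy.deepcopy(cert)
    masked["metadata"].pop(vertex_id, None)
    masked["revisions"] = []
    return masked
```

Both functions return a copy rather than editing in place, so the original certificate can be compared against the masked version before anything is shared.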

Data Release

During the 12 months preceding public release of NIH-associated data, members of the partnership collective have full access to it in HISE. Supporting data for any interim published results is shared in accordance with journal requirements. This policy covers raw data files and analyzed data in the following areas:

Area: Special considerations (if any)
Human metadata: To facilitate full deidentification, partner organizations that provide samples remove the month of the specified event, such as a blood draw. To allow longitudinal data analysis, they instead preserve a variable that represents, for example, the number of days elapsed since a specified baseline event, such as the number of days elapsed from the initial study sample collection to the current blood draw.
Plasma proteomics/targeted proteomics: None
Flow cytometry data: None
Single cell and bulk ATACseq data: Processed H5 data files are made publicly available within the same time frame as single cell and bulk ATACseq data. Supporting RNAseq data for interim results is placed in an open access data repository, such as SRA or GEO (both hosted by NCBI), and H5 files are made publicly available in HISE.
WGS: Supporting WGS data for interim results is placed in dbGaP or another NIH-designated controlled-access repository.
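The elapsed-days variable described for human metadata can be sketched as follows. This is an illustration only: the field names (sample_id, draw_date, days_since_baseline) are hypothetical, and the sketch assumes the baseline event is the earliest collection date for the subject.

```python
from datetime import date


def to_elapsed_days(visits):
    """Replace calendar dates with days elapsed since the earliest visit."""
    baseline = min(v["draw_date"] for v in visits)
    return [
        {
            "sample_id": v["sample_id"],
            # Only the offset is kept; the calendar date is not carried over.
            "days_since_baseline": (v["draw_date"] - baseline).days,
        }
        for v in visits
    ]


visits = [
    {"sample_id": "S-001", "draw_date": date(2023, 1, 10)},
    {"sample_id": "S-002", "draw_date": date(2023, 2, 9)},
]
deidentified = to_elapsed_days(visits)
```

The output keeps the spacing between visits, which is what longitudinal analysis needs, without revealing when any visit took place.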

Data Processing

When we receive a metadata payload, we compare the data set with a known data dictionary and allowlist only recognized fields. All other data is rejected. We also validate selected fields. For example, the DOB field is validated to ensure that we receive the birth year and month but not the day.
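A minimal Python sketch of this kind of check is shown below. The field names in the data dictionary, the YYYY-MM date format, and the choice to drop unrecognized fields (rather than reject the whole payload) are assumptions for illustration, not a description of the HISE ingest code.

```python
import re

# Hypothetical data dictionary: the only fields the ingest step recognizes.
DATA_DICTIONARY = {"subject_id", "dob", "sex", "cohort"}

# Assumed DOB format: four-digit year and two-digit month, with no day.
DOB_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def process_payload(payload):
    """Keep allowlisted fields, report the rest, and validate the DOB field."""
    accepted = {k: v for k, v in payload.items() if k in DATA_DICTIONARY}
    rejected = sorted(set(payload) - DATA_DICTIONARY)
    dob = accepted.get("dob")
    if dob is not None and not DOB_PATTERN.fullmatch(str(dob)):
        raise ValueError("dob must contain year and month only (YYYY-MM)")
    return accepted, rejected


accepted, rejected = process_payload(
    {"subject_id": "SUBJ-7F3A", "dob": "1980-05", "home_address": "redacted"}
)
```

Allowlisting means a field passes only if it is explicitly recognized, so a new or misspelled field that happens to carry identifying data is excluded by default.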

Revision history

Processed metadata is stored in a multitenant database. To record any changes in the metadata, we keep a revision history. It documents which entries were changed, when they were changed (for example, during reingest), and what the previous values were.
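One way to picture such a revision history is the sketch below. The record layout, the function name, and the "reason" label are hypothetical; the sketch only shows the three things the history documents: which entry changed, when, and what the previous value was.

```python
from datetime import datetime, timezone


def apply_update(record, history, updates, reason, now=None):
    """Apply updates to a metadata record and log each changed entry."""
    timestamp = now or datetime.now(timezone.utc)
    for field, new_value in updates.items():
        previous = record.get(field)
        if previous != new_value:
            history.append(
                {
                    "field": field,           # which entry changed
                    "changed_at": timestamp,  # when it changed
                    "previous": previous,     # what the old value was
                    "reason": reason,         # for example, "reingest"
                }
            )
            record[field] = new_value
    return record


record = {"subject_id": "SUBJ-7F3A", "cohort": "A"}
history = []
apply_update(record, history, {"cohort": "B", "subject_id": "SUBJ-7F3A"}, "reingest")
```

Unchanged values produce no history entry, so the log reflects only real edits.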

Legal review

Any data dictionary AIFI uses to process metadata is considered part of the research agreement and is therefore subject to legal review. If you work with one of our partners, be sure you understand your institution's data sharing policy. For detailed information, check with your legal representative.


Related Resources

Attach Metadata

Create or Delete Metadata

Submit and Monitor Pipeline Batches