Use Deidentified Data

Last modified 2024-07-15

	Abbreviations Key
ATAC-seq	assay for transposase-accessible chromatin with sequencing
dbGaP	Database of Genotypes and Phenotypes (NIH)
DOB	date of birth
GEO	Gene Expression Omnibus (NCBI)
HIPAA	Health Insurance Portability and Accountability Act
HISE	Human Immune System Explorer
IDE	integrated development environment
NCBI	National Center for Biotechnology Information
NIH	National Institutes of Health
PHI	protected health information
PII	personally identifiable information
RNA-seq	ribonucleic acid sequencing
SRA	Sequence Read Archive (NCBI)
WGS	whole-genome sequencing

At a Glance

HISE supports the delivery of deidentified data from human subjects research. AIFI is careful not to ingest data that contains PHI. All sensitive data remains housed outside of HISE in third-party systems at collaborating clinical sites. In place of identifying details, HISE uses random subject IDs that can't be linked to specific information about human subjects.

Description

Open science at AIFI depends on responsible collaboration with members of our partnership collective. When we exchange data, we aim to balance our scientific needs with the privacy rights of the human subjects who participate in our research. We also strive to maintain the trust of IRBs tasked with overseeing such research.

Data Sharing Policy

In addition to basic data about cohorts, samples, specimens, and subject demographics ingested through AIFI's LIMS, HISE accepts clinical questionnaire data, CBC results, and selected subject metadata. The time frame for public release of data is governed by applicable data-sharing agreements, laws, and regulations. Studies that receive even partial NIH funding, for example, are subject to agency regulations. These rules require that deidentified data be made publicly available no more than 12 months after the completion of each longitudinal study cohort. When we share such data in HISE, we follow NIH guidance, which recommends two data access tiers:

Tier 1

Tier 1 includes whole-genome or whole-exome sequencing data. Because there is a risk that a patient could be identified from such data, only deidentified data is shared, and it's placed in an NIH-designated managed-access repository.

Tier 2

Tier 2 includes all other kinds of deidentified data. Such data poses little or no risk of exposing the patient's identity, and the data can therefore be shared publicly in HISE.

Definitions

For key terms that pertain to AIFI data sharing, see the following table.

Term	Definition
information	Techniques and methods, test data, results (including pharmacological, toxicological, and clinical test data and results), analytical and quality control data, and algorithms.
PHI	A subtype of PII that includes all individually identifiable health information, including demographic data, medical histories, test results, insurance information, and other information used to identify a patient or provide healthcare services or coverage.
PII	Information that can be used to distinguish or trace an individual’s identity, either alone (direct) or in combination with other personal or identifying information linked to a specific individual (indirect).
protected	Information covered by the Privacy Rule issued under HIPAA, a 1996 U.S. law that protects patients' privacy rights.
sample	A biological sample obtained from a human subject.

Data Masking

Data masking is the process of handling data in a way that protects PII but keeps the anonymized data available for analysis and testing. If you have a Data App, it's important to mask PII fields on your Certificate of Reproducibility (CertPro). You can either mask selected fields on a vertex or delete the metadata for an entire vertex (that is, remove the vertex entry from the metadata field). If the metadata has a revision history, you should also remove the other revisions so that the unmasked values can't be recovered from them.
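
As a rough illustration of the two masking options, the following Python sketch treats a certificate's metadata as a dictionary keyed by vertex ID. The dictionary layout, the field names, the placeholder value, and the helpers mask_fields and drop_vertex_metadata are all assumptions made for this example. They are not part of the HISE API.

```python
import copy

MASK = "***MASKED***"  # placeholder value; an assumption for this sketch


def mask_fields(metadata, vertex_id, fields):
    """Return a copy of metadata with the named fields on one vertex masked."""
    masked = copy.deepcopy(metadata)
    vertex = masked.get(vertex_id, {})
    for name in fields:
        if name in vertex:
            vertex[name] = MASK
    return masked


def drop_vertex_metadata(metadata, vertex_id):
    """Return a copy of metadata without the entry for one vertex."""
    return {
        vid: copy.deepcopy(entry)
        for vid, entry in metadata.items()
        if vid != vertex_id
    }


# Hypothetical metadata for two vertices on a certificate.
metadata = {
    "vertex-1": {"subject_id": "S-0042", "clinician_name": "Example Name"},
    "vertex-2": {"file_type": "h5"},
}

partially_masked = mask_fields(metadata, "vertex-1", ["clinician_name"])
without_vertex = drop_vertex_metadata(metadata, "vertex-1")
```

Both helpers return copies and leave the input untouched, so the original values exist only in the object you started with. In a real workflow, any stored revisions of the metadata would need the same treatment.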

Data Release

During the 12 months preceding public release of NIH-associated data, members of the partnership collective have full access to it in HISE. Supporting data for any interim published results is shared in accordance with journal requirements. This policy covers raw data files and analyzed data in the following areas:

Area	Special considerations (if any)
Human metadata	To facilitate full deidentification, partner organizations that provide samples remove the month of the specified event, such as a blood draw. To allow longitudinal data analysis, they instead preserve a variable that represents the number of days elapsed since a specified baseline event, for example, the number of days from the initial study sample collection to the current blood draw.
Plasma proteomics/targeted proteomics	None
Flow cytometry data	None
Single-cell and bulk ATAC-seq data	Processed H5 data files are made publicly available within the same time frame as single-cell and bulk ATAC-seq data. Supporting RNA-seq data for interim results is placed in an open-access data repository, such as SRA or GEO (both hosted by NCBI), and H5 files are made publicly available in HISE.
WGS	Supporting WGS data for interim results is placed in dbGaP or another NIH-designated controlled-access repository.
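
The elapsed-days variable described for human metadata can be sketched in a few lines of Python. This example assumes the baseline is the earliest sample collection for a subject. The dates and the function names are hypothetical and only illustrate the arithmetic a partner organization might apply before sending data.

```python
from datetime import date


def days_since_baseline(baseline, event):
    """Number of days from the baseline event to a later event."""
    return (event - baseline).days


def deidentify_draws(draw_dates):
    """Replace calendar dates with offsets from the earliest draw."""
    baseline = min(draw_dates)
    return [days_since_baseline(baseline, d) for d in sorted(draw_dates)]


# Hypothetical blood draw dates for one subject.
draws = [date(2023, 1, 10), date(2023, 1, 17), date(2023, 4, 10)]
offsets = deidentify_draws(draws)
print(offsets)  # [0, 7, 90]
```

The offsets keep the spacing between visits, which is what longitudinal analysis needs, without revealing when any visit took place.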

Data Processing

When we receive a metadata payload, we compare the data set with a known data dictionary and allowlist only recognized fields. All other data is rejected. We also validate selected fields. For example, the DOB field is validated to ensure that we receive the birth year and month but not the day.
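
A minimal Python sketch of this kind of check follows. The field names in the data dictionary and the YYYY-MM format for the DOB value are assumptions chosen for illustration, and the sketch reads "rejected" as dropping unrecognized fields instead of refusing the whole payload. The actual dictionary is defined by the applicable research agreement.

```python
import re

# Hypothetical data dictionary: the set of recognized field names.
DATA_DICTIONARY = {"subject_id", "dob", "sex", "cohort"}

# Year and month only (YYYY-MM); a value with a day component does not match.
DOB_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def filter_payload(payload):
    """Split a payload into accepted (recognized) and rejected field names."""
    accepted = {k: v for k, v in payload.items() if k in DATA_DICTIONARY}
    rejected = sorted(k for k in payload if k not in DATA_DICTIONARY)
    return accepted, rejected


def is_valid_dob(value):
    """True when value has a birth year and month but no day."""
    return isinstance(value, str) and DOB_PATTERN.fullmatch(value) is not None


# Hypothetical payload with one unrecognized field.
payload = {"subject_id": "S-0042", "dob": "1980-05", "home_address": "redacted"}
accepted, rejected = filter_payload(payload)
```

Here accepted keeps subject_id and dob, rejected lists home_address, and is_valid_dob returns False for a value such as "1980-05-17" because it includes the day.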

Revision history

Processed metadata is stored in a multitenant database. To record any changes in the metadata, we keep a revision history. It documents which entries were changed, when they were changed (for example, during reingest), and what the previous values were.
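
To make the three recorded facts concrete (which entry changed, when, and what the previous value was), here is a small in-memory Python sketch. The Revision and MetadataRecord classes and the reason field are hypothetical. They show the shape of a revision log, not how the HISE database stores it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    entry: str            # which entry changed
    changed_at: datetime  # when it changed
    previous: object      # the value before the change
    reason: str           # for example, "reingest"


@dataclass
class MetadataRecord:
    values: dict
    history: list = field(default_factory=list)

    def update(self, entry, new_value, reason):
        """Set a value and record the old one if it actually changed."""
        old = self.values.get(entry)
        if old == new_value:
            return
        self.history.append(
            Revision(entry, datetime.now(timezone.utc), old, reason)
        )
        self.values[entry] = new_value


record = MetadataRecord({"cohort": "A"})
record.update("cohort", "B", reason="reingest")
record.update("cohort", "B", reason="reingest")  # no change, so nothing is recorded
```

After these calls, record.history holds a single revision whose previous value is "A". Because earlier values survive in the history, masking a field in the current metadata is not enough on its own to remove it.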

Legal review

Any data dictionary AIFI uses to process metadata is considered part of the research agreement and is therefore subject to legal review. If you work with one of our partners, be sure you understand your institution's data sharing policy. For detailed information, check with your legal representative.

Related Resources

Attach Metadata

Create or Delete Metadata

Submit and Monitor Pipeline Batches