Regulatory Science Virtual Symposium: “Innovation to Translation: Role of Genomics in Medical Product Development:” Session 5: Big Data and Genomics (2021)


My Perspective

  1. PhD in Chemical and Biomolecular Engineering (JHU)
  2. Ellison Institute & University of Southern California
  3. National Cancer Institute

How did we get here?

  1. The struggle against internet overload is real
  2. We now create roughly 5 exabytes of data every two days
  3. How do we absorb this data?
    1. The curation of data
      1. National Cancer Institute, BloodPac BDGC, ECC, etc.
      2. Open-source frameworks, with data sets that also interoperate with each other
  4. The human genome turns 20!
    1. Technology is becoming more accessible
    2. Cost to do one genome is much less today
      1. Potential overload of data now

NIH Announces Two Integral Components of The Cancer Genome Atlas Pilot Project

  1. Feasibility of using large-scale genome analysis technologies
  2. How to make it accessible and respect the privacy of research participants
  3. Balance between data protection and data utility/value

Translation Data Spheres

  1. Data curation ties each translation of the clinical trials
  2. When analyses are conducted without understanding the basic science or clinical context of data capture, there is a risk of inappropriate interpretation, lack of clinical relevance, or lack of feasible implementation for broader impact

Making Data FAIR and Transparent

  1. Findable
  2. Accessible
  3. Interoperable
  4. Reusable

Why can’t we just have a universal consent?

  1. People do not read terms and conditions

Visualization of a Genome

  1. 3 billion base pairs
  2. Genome = a book
  3. Written in 4 letters
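
To put the "book written in 4 letters" picture into numbers, here is a back-of-the-envelope sketch. It assumes a fixed-width encoding with the fewest whole bits per letter and counts only one copy of the roughly 3 billion base pairs; the function names are illustrative and do not come from the talk.

```python
# Rough estimate of the raw size of one genome "book".
GENOME_BASE_PAIRS = 3_000_000_000  # ~3 billion base pairs, as stated in the notes
ALPHABET = "ACGT"                  # the 4 letters


def bits_per_base(alphabet_size: int) -> int:
    """Smallest whole number of bits that can distinguish alphabet_size symbols."""
    return (alphabet_size - 1).bit_length()


def raw_genome_bytes(base_pairs: int, alphabet_size: int) -> int:
    """Bytes needed to store the sequence at the minimal fixed-width encoding."""
    total_bits = base_pairs * bits_per_base(alphabet_size)
    return (total_bits + 7) // 8  # round up to whole bytes


size_bytes = raw_genome_bytes(GENOME_BASE_PAIRS, len(ALPHABET))
print(f"{bits_per_base(len(ALPHABET))} bits per base, about {size_bytes / 1e6:.0f} MB")
# prints "2 bits per base, about 750 MB"
```

Files produced by sequencing instruments are typically far larger than this minimum, because they store many overlapping reads along with per-base quality scores.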

Human Autocorrect Works (Mostly)

  1. Some errors can arise, however

Single Nucleotide Polymorphisms (SNPs)

  1. A SNP does not just change the letter; it can change the entire sentence
  2. SNPs are variations that involve a change in just one nucleotide
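
The "one letter can change the entire sentence" idea can be illustrated with a minimal sketch. The 12-base sequence below is a toy example, not a real gene, and the codon table holds only the handful of standard codons the example needs. The helper names `translate` and `apply_snp` are hypothetical.

```python
# Partial standard codon table (DNA coding strand); only the codons used below.
CODON_TABLE = {
    "ATG": "Met", "CCT": "Pro", "GAG": "Glu", "GAA": "Glu",
    "GTG": "Val", "AAG": "Lys", "TAG": "Stop",
}


def translate(dna: str) -> list[str]:
    """Read the sequence three letters at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


def apply_snp(dna: str, position: int, new_base: str) -> str:
    """Return a copy of the sequence with one base (0-based position) replaced."""
    return dna[:position] + new_base + dna[position + 1:]


reference = "ATGCCTGAGAAG"  # toy sequence: Met-Pro-Glu-Lys
print("reference", "-".join(translate(reference)))
for label, position, base in [("silent", 8, "A"), ("missense", 7, "T"), ("nonsense", 6, "T")]:
    variant = apply_snp(reference, position, base)
    print(label, "-".join(translate(variant)))
```

In this toy case the same codon gives three different outcomes. A change at the third base leaves the protein unchanged. A change at the second base swaps Glu for Val. A change at the first base introduces a stop codon and truncates everything after it.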

Copy Number Variants (CNVs)

  1. CNVs are defined as chromosomal segments of 1,000+ base pairs that are deleted, copied, flipped, or otherwise rearranged

Read/Write Error

  1. Transcription → Translation: there can be a rewrite error at the protein level

Janet Rowley’s 1970s microscopy studies of leukemia cell chromosomes suggested certain alterations lead to cancer

  1. By early 1999, six months after beginning the phase 1 study…virtually all of our patients were responding and experiencing few, if any, side effects… Internet chat rooms were a new phenomenon, and patients were describing their experiences with imatinib even before we had presented clinical data or published our results
  2. Vast technological advancement in the past 20 years
  3. Initial sequencing and analysis of the human genome
    1. 2001

Cancer has been progressively redefined over the past 20 years

Tumor, Cancer, and Metastasis: Length-scale and Time-scale Matter

  1. Green – Localized
  2. Red – Metastasis

Engineer’s Dream: Develop A Continuum

Potential Use of New Technologies

  1. Imaging and blood-based technologies can only sample once patients have discovered that there is something to find

Early Years

  1. Jean McKibben, an ovarian-cancer survivor, rushed to take OvaSure on the first day it was available, and her results showed a 0.00 chance of cancer.
  2. A week later, scans showed that her cancer was back. She was crushed.
  3. FDA on 7 August 2008 sent a letter to LabCorp saying that the test ‘has not received adequate clinical validation and may harm the public health’.
  4. A second letter, sent on 29 September 2008, alleged that LabCorp did not have the necessary marketing clearance or approval for the test from the FDA.
  5. LabCorp replied to the FDA on 20 October 2008, disagreeing with the agency’s assertions, but agreed to pull OvaSure from the market.

2006-2015: A Decade of Illuminating the Underlying Causes of Primary Untreated Tumors

  1. An 8-year-long project that cost half a billion dollars

An Open Letter to Cancer Researchers

  1. The goal was to accelerate the discovery of cures for cancers
  2. Skepticism about feasibility, as 99% of the mutations that remained after removing errors did not alter the protein

First Pass at Cancer Genome Reveals Complex Landscape

Adding an Engineer Perspective

  1. Anna Barker as the lead

Potential Source of Variability

Platforms Only

  1. Maturity and Heterogeneity of Platforms
  2. Lots of variability within the platforms
    1. IBM, 454, Illumina, PacBio
  3. Every data platform had to be hard-coded

Rapid Acceleration from Stimulus Funding (2009-2011)

  1. President Obama’s stimulus funding was used to accelerate this project and develop the genome

QA, QC, and Optimization: Metadata Matters

  1. Many samples would degrade or have other preservation issues
  2. Overall 58%

The Cancer Genome Atlas – 2013

  1. Reveal common underlying mechanisms between different cancers

Personalized Oncology Through Integrative High-Throughput Sequencing: A Pilot Study

  1. First paper describing a sequencing tumor board

Still Learning: ACMG Secondary Findings v3.0 [2021]

  1. Another update to the 2013 document
  2. Now 73 genes

Drug Discovery and Development

  1. Accelerate the translation of patient genomic data into clinical application
    1. Innovate by integrating computational mining with large-scale genomic data analysis
    2. Identify and confirm new therapeutic target candidates
  2. How to connect cancer therapeutics and cancer genomics


  1. How do I take this data and make it actionable
  2. CRISPR for example.

NCI-MATCH Central Screening Summary

  1. Precision medicine trial explores treating patients based on the molecular profiles of their tumors
  2. NCI-MATCH is for adults with tumors that no longer respond to standard treatment
  3. The biopsied tumor tissue will undergo gene sequencing
  4. Sequencing succeeded for many patients, but only 18% matched
    1. This had to be done in patients who no longer responded to the standard of care, which is why the match rate was low
    2. Enrollment rate: 69%


  1. VP Biden had a great passion for treating this disease
  2. How many patients from the atlas are still alive? We could not track this
  3. Only some data can be shared in the public domain
    1. Longitudinal data
    2. Some data shared amongst partners only
  4. NCI Genomic Data Commons (GDC)
    1. Continuing to be developed
  5. At the June 29th Cancer Moonshot Summit, Foundation Medicine announced the release of 18,000 genomic profiles to the NCI GDC
    1. Expansion of private companies donating their data
      1. Over 117,000 cases

Without a National Learning Healthcare System for Cancer

  1. President Obama just launched the precision medicine initiative
  2. The US has about 1.8 million cases diagnosed each year
  3. 85% of cancer patients are first diagnosed and treated in a community setting
  4. Unable to share cancer data/difficulty in getting these metrics on patients (i.e. telling them “sign this consent it’s your last treatment”)
  5. APOLLO Program
    • The nation’s first integrated, molecularly driven cancer care system, spanning early discovery to clinical health care implementation, for active-duty (AYA), beneficiary, and veteran cancer patients
      • DoD and VA made it possible to really look at this 85%
      • Similar to the ATOM example
      • The Cancer Institute is working on a personal identification ID
        1. COVID and vaccinations are very relevant to this
      • Low accuracy and Low precision
        1. The target moves as a patient lives his or her life
      • Need professionals to develop ecosystems that expose students to this
      • Cancer registries
      • Difficulties in death registry and certificates in the way they code the ultimate reason (i.e. double coding)

What about liquid biopsies?

  1. Another question Biden brought up
  2. CTC and Circulating tumor DNA
    1. An example of how variable a tube of blood can be and the difficulties that ensue

White House Cancer Moonshot

  1. Pulled companies together and asked them to pool their data
  2. BloodPac Collaboration, a 501(c) organization

BloodPac Pre-Analytical Requirements

  1. Minimum Clinical Trial Elements for Liquid Biopsy Data Submitted to Public Databases study
  2. Open data and increased collaboration

Precision Health and Precision Oncology

  1. Expanding in 2017 on the Oncology Precision Medicine Data Landscape
  2. APOLLO Program and BloodPac are great models
  3. In 2017 FDA approves first cancer treatment for any solid tumor with a specific genetic feature
  4. In 2018 FDA approves first treatment for breast cancer with a certain inherited mutation
  5. Harnessing the Power of Collaboration and Training within the Clinical Data Science to Generate Real-World Evidence in the Era of Precision Oncology


  1. Provide data that could help advance new, successful methods
    1. Chemical and in vitro biological data from more than 2 million compounds from its historic and current screening collection
    2. Preclinical and clinical information on 500 molecules that failed in development

Precision Health and Precision Medicine

  1. Exist on a continuum together
  2. Healthy, morbidity, multi-morbidity, disease
  3. Promote patient access to their health information in a single longitudinal format that is easy to understand, secure and updated automatically

2019 and 2020

  1. FDA’s Framework for Real-World Evidence
  2. Continuation of FDA approvals for new drugs

Your Insights Matter and are Critical

  1. Harnessing the Power of Collaboration and Training Within Clinical Data Science to Generate Real-World Evidence in the Era of Precision Oncology

An integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics

  1. 100+ issues resolved
  2. 4 endpoints derived and evaluated

Tempus and Leidos Biomedical Research Inc.

  1. Continuing to add data, keep it up to date, and re-curate it
  2. Continue to obtain clinical and molecular relevance

Patient Trajectory Comprised of All Relevant Cancer Case and Cofactor Trajectories

  1. How do we continue to improve these outcomes
  2. Welcome more discussions


  1. Will TCGA tap into the additional phenotype data in dbGaP that is deposited by cancer researchers?
  2. Is there an application to actually go back and reuse data from failed drug development projects?
  3. How do we future-proof our data collection efforts? Are there best practices for consenting patients for future big data efforts?
  4. Is there a type of repository where researchers, whether sanctioned by the government or for some reason exempt from HIPAA laws, have access to the actual names of patients and would be able to conduct a new type of clinical trial using the data from old trials?
  5. How should we be training the next generation to tackle these problems?
  6. Every year many new cases of cancer and recurrent cancer are reported, but data collection is still low or slow. Is there any way to increase its rate?


Accompanying text created by Annie Ly | Graduate Student, Regulatory Science, USC School of Pharmacy and Emily Donahue | Undergraduate Student, Pharmacology and Drug Development, USC School of Pharmacy


Jerry S. H. Lee, PhD
Chief Science and Innovation Officer, Lawrence J. Ellison Institute; Associate Professor of Clinical Medicine and Chemical Engineering at Keck School of Medicine and Viterbi School of Engineering
