This project is a new collaboration between NIST's Human Genomics Team, Mass Spectrometry Data Center, and Information Technology Laboratory to develop and apply artificial intelligence methods (deep learning and other machine learning) for a new class of genomics Standard Reference Materials and proteomics Standard Reference Data. Research includes uncertainty analysis and explainability of deep learning methods.
Omics, the measurement of diverse molecular classes in biospecimens containing millions of molecules, is increasingly important in the life sciences and medicine. It is used to find novel cancer therapeutics, discover COVID-19 infection pathways, and diagnose diseases. NIST Reference Materials (RMs) and Standard Reference Data (SRDs) for genomics (the measurement of DNA sequence) and proteomics (the measurement of proteins translated from the genome) are widely used in omics research and in regulated clinical applications, including the first FDA authorization of a next-generation sequencing device.
However, these standards carry limited uncertainty information and cover only 80-90% of the genome and less than 20% of the proteome. They omit challenging but important genetic loci, such as loci essential to understanding the immune system. Closing this coverage gap requires analyzing billions of data points, which is beyond human abilities but not beyond the capability of modern AI. We propose to create a suite of AI models, trained on our large, high-quality genomic and proteomic datasets, to increase coverage of the genome and proteome to >99%.
It is essential that AI used in SRMs and SRDs be certified and trustworthy so that our industrial, academic, and government stakeholders can perform groundbreaking biomedical research and create novel applications for the clinic, such as testing for immunological and neurological disorders. To that end, we will investigate two bases of trust: accurate estimates of uncertainty and explanations of AI outputs. Both are active areas of research in AI and largely unexplored in biometrology. Applying these methods in light of our deep understanding of the structures and biases in genomic and proteomic data will therefore expand the use of AI not only into the different omics subfields but throughout biometrology.