Assembling the Human Genome
The Human Genome Project (HGP), a global collaboration involving 20 groups, released the first draft of the human reference genome in 2001. Since then, the HGP reference genome has formed the backbone of human genomics research and led to numerous discoveries in human healthcare. However, the initial reference genome was incomplete.
Over the years, the reference genome has undergone several updates, including the Genome Reference Consortium Human Build 38 patch release 7 (GRCh38.p7) and, more recently, the Telomere-to-Telomere (T2T) Consortium's assembly of the CHM13 cell line (T2T-CHM13). T2T-CHM13 was assembled mainly from PacBio and Oxford Nanopore sequencing data. These platforms generate long reads that span repetitive sequence and help resolve complex genomic structures such as centromeres and telomeres.
Towards Capturing the Full Diversity of Human Genome Variation
The T2T-CHM13 assembly provides a more complete and contiguous sequence, with fewer gaps and improved accuracy in challenging genomic regions. However, it has become clear that reference genomes derived from a small number of individuals cannot capture the full extent of genetic diversity within the human population. For instance, more than two-thirds of structural variants, which include insertions, deletions, duplications, inversions, and translocations, are overlooked when sequencing data are aligned to a single reference genome. Addressing this problem is critical because structural variants often have a greater impact on gene function than single nucleotide polymorphisms (SNPs) or small indels.
In a recent publication in Nature, Liao et al. report the first release of a human pangenome reference from the Human Pangenome Reference Consortium (HPRC). The HPRC pangenome reference consists of high-quality genome assemblies from a diverse set of individuals and aims to better capture global genomic diversity.
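To make the limitation of a single reference concrete, here is a toy sketch in Python. It is not how a real aligner works: it only checks what fraction of a read's k-mers occur in a set of sequences. The sequences, the `fraction_explained` helper, and the choice of k = 5 are all hypothetical and chosen purely for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_explained(read, sequences, k):
    """Fraction of the read's k-mers found in any of the given sequences."""
    index = set()
    for s in sequences:
        index |= kmers(s, k)
    read_kmers = kmers(read, k)
    if not read_kmers:
        raise ValueError("read is shorter than k")
    return len(read_kmers & index) / len(read_kmers)


# Hypothetical toy sequences (not real genomic data).
reference = "ACGTACGGATCCTTAGC"
insertion = "GGGTTTAAACCC"
# A second haplotype that carries the insertion relative to the reference.
haplotype = reference[:8] + insertion + reference[8:]

# A read sampled entirely from inside the insertion.
read = insertion

print(fraction_explained(read, [reference], k=5))             # 0.0
print(fraction_explained(read, [reference, haplotype], k=5))  # 1.0
```

In this toy case, the read has nothing to anchor to in the single reference, but it is fully accounted for once a second haplotype carrying the insertion is included.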
Assembling the First Human Pangenome Reference
The HPRC pangenome reference consists of 47 genome assemblies, with 29 samples sequenced by the HPRC using long-read and linked-read sequencing and 18 samples sequenced by other efforts. The 29 HPRC samples, selected from 1000 Genomes Project (1KG) lymphoblastoid cell lines with normal karyotypes and low passage numbers, were sequenced on PacBio High Fidelity (HiFi) and Oxford Nanopore long-read sequencers, as well as Illumina short-read sequencers, to obtain reads with varying lengths and error profiles.
The samples were sequenced to an average depth of 39.7X, with a read N50 of 19.6 kb. N50 is a measure of contiguity: half of all sequenced bases lie in sequences of this length or longer. The average quality value (QV) was 53.57, which corresponds to 1 error per 227,509 bp. The individual haploid genomes were assembled using the Trio-Hifiasm software, followed by annotation using a custom Ensembl mapping pipeline to label GENCODE genes and transcripts.
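Both summary statistics are simple to compute. The Python sketch below assumes the standard definitions: N50 is the length at which the longest sequences cumulatively cover at least half of all bases, and the quality value is Phred-scaled, QV = -10 log10(error rate). The function names and the toy read lengths are hypothetical, not taken from the paper's pipeline.

```python
import math


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_bases_per_error(qv):
    """Phred-scaled QV -> expected number of bases per single error."""
    return 10 ** (qv / 10)


def bases_per_error_to_qv(bases_per_error):
    """Expected bases per single error -> Phred-scaled QV."""
    return 10 * math.log10(bases_per_error)


# Hypothetical toy read lengths in kb; their mean is 4 but their N50 is 5.
toy_read_lengths = [2, 3, 4, 5, 6]
print(n50(toy_read_lengths))                      # 5
print(round(bases_per_error_to_qv(227_509), 2))   # 53.57
```

Note that N50 is weighted towards long sequences, so it is generally not the same as the average length. On the Phred scale, every 10 points means a tenfold drop in error rate, so 1 error per 227,509 bp works out to a QV of about 53.6.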
New Human Pangenome Reference Outperforms Existing Reference Genomes
The assembled genomes were aligned to T2T-CHM13 to assess completeness and copy number polymorphisms, and they showed high concordance. Additionally, over 99% of protein-coding genes and transcripts were identified in the HPRC genomes. Together, these results indicate that the HPRC assemblies are high quality, structurally sound, and complete, and that they encompass the human copy number variation known from the latest genome release.
The HPRC human pangenome was drafted from the 47 genome assemblies using three graph construction tools: Minigraph, Minigraph-Cactus (MC), and PanGenome Graph Builder (PGGB). The average length of the pangenome was over 3 Gb, with the MC graph reporting the most accurate alignment. The MC graph also showed the highest recall and precision for small variants when variants decoded from the pangenome were compared with the GRCh38 variant sets. The authors showed that alignment to the HPRC pangenome outperforms the current reference genomes in capturing genomic variation, such as SNPs, indels, and structural variants (SVs), and that most errors reported by conventional mapping techniques are real variants.
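For readers less familiar with these metrics, here is a minimal Python sketch of how precision and recall are defined for a set of variant calls compared against a truth set. It uses exact matching on hypothetical (chromosome, position, ref, alt) tuples. Real benchmarking tools also normalise variant representations before matching, which this sketch does not do, and it is not the evaluation code used by the authors.

```python
def precision_recall(called, truth):
    """Precision and recall of called variants against a truth set (exact match)."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Precision: what fraction of the calls are in the truth set?
    precision = true_positives / len(called) if called else 0.0
    # Recall: what fraction of the truth set was called?
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy variants as (chromosome, position, ref allele, alt allele).
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "T", "TA"),
    ("chr2", 40, "CG", "C"),
    ("chr2", 90, "G", "T"),
}
call_set = {
    ("chr1", 100, "A", "G"),   # true positive
    ("chr1", 250, "T", "TA"),  # true positive
    ("chr2", 75, "C", "A"),    # false positive
}

precision, recall = precision_recall(call_set, truth_set)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Here two of the three calls are correct (precision 0.67) and two of the four true variants are recovered (recall 0.50).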
Future Implications for Life Science Research
Overall, the human pangenome is a major step towards a more comprehensive human reference genome that represents a broader range of populations. Because the pangenome reference better captures the genomic diversity of the global population, it paves the way for more accurate detection of structural variants, SNPs, and indels than current standards allow. This will allow us to detect novel variants that would otherwise have been misclassified as unaligned reads or errors, and to evaluate them as potential biomarkers of human disease.
Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help
Groundbreaking progress in drug development, including targeted cancer therapies, is made possible by technological advances making biological data generation, storage, and analysis faster and more accessible than ever before. From pipeline development and software engineering to deploying existing bioinformatics tools, Bridge Informatics can help you on every step of your research journey.
Haider M. Hassan, Data Scientist, Bridge Informatics
Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.
Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks.