Item Type: | Article |
---|---|
Title: | Gaps and complex structurally variant loci in phased genome assemblies |
Creators Name: | Porubsky, D., Vollger, M.R., Harvey, W.T., Rozanski, A.N., Ebert, P., Hickey, G., Hasenfeld, P., Sanders, A.D., Stober, C., Korbel, J.O., Paten, B., Marschall, T. and Eichler, E.E. |
Abstract: | There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation. |
Keywords: | DNA Sequence Analysis, Genetic Polymorphism, Genomic Segmental Duplications, Haplotypes, Satellite DNA |
Source: | Genome Research |
ISSN: | 1088-9051 |
Publisher: | Cold Spring Harbor Laboratory Press |
Volume: | 33 |
Number: | 4 |
Page Range: | 496-510 |
Date: | April 2023 |
Official Publication: | https://doi.org/10.1101/gr.277334.122 |