Robust detection of tandem repeat expansions from long DNA reads

Mapping Intimacies ◽

10.1101/356931 ◽

2018 ◽

Cited By ~ 1

Author(s):

Satomi Mitsuhashi ◽

Martin C Frith ◽

Takeshi Mizuguchi ◽

Satoko Miyatake ◽

Tomoko Toyota ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

Genetic Diseases ◽

Error Rates ◽

Robust Detection ◽

Sequencing Errors ◽

Tandem Repeat Sequences ◽

Long Read ◽

Repeat Expansions ◽

The Many

Abstract: Tandemly repeated sequences are highly mutable and variable features of genomes. Tandem repeat expansions are responsible for a growing list of human diseases, even though it is hard to determine tandem repeat sequences with current DNA sequencing technology. Recent long-read technologies are promising, because the DNA reads are often longer than the repetitive regions, but are hampered by high error rates. Here, we report robust detection of human repeat expansions from careful alignments of long (PacBio and nanopore) reads to a reference genome. Our method (tandem-genotypes) is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we can prioritize pathological expansions within the top 10 out of 700,000 tandem repeats in the genome. This may help to elucidate the many genetic diseases whose causes remain unknown.

Human-specific tandem repeat expansion and differential gene expression during primate evolution

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1912175116 ◽

2019 ◽

Vol 116 (46) ◽

pp. 23243-23253 ◽

Cited By ~ 13

Author(s):

Arvis Sulovari ◽

Ruiyang Li ◽

Peter A. Audano ◽

David Porubsky ◽

Mitchell R. Vollger ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

Sequence Data ◽

Variable Number ◽

Specific Expression ◽

Sequence Composition ◽

Transcription Profiles ◽

Long Read ◽

Repeat Expansions ◽

Human Specific

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element–VNTR–Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.

Finding Long Tandem Repeats In Long Noisy Reads

Bioinformatics ◽

10.1093/bioinformatics/btaa865 ◽

2020 ◽

Author(s):

Shinichi Morishita ◽

Kazuki Ichikawa ◽

Gene Myers

Keyword(s):

Tandem Repeat ◽

Error Rate ◽

Tandem Repeats ◽

Repeat Unit ◽

Error Rates ◽

De Bruijn Graph ◽

Frequency Distributions ◽

Sequencing Technologies ◽

Long Reads ◽

Repeat Expansions

Abstract

Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10,000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10%-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder (TRF), a widely used program for finding tandem repeats, in terms of sensitivity.

Software availability: https://github.com/morisUtokyo/mTR
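The window-screening idea in this abstract can be illustrated with a minimal Python sketch. It is not the mTR implementation: the function names, the window size, the k-mer length, and the threshold are all assumptions chosen for a toy read, and the de Bruijn graph consensus step is not covered. The sketch only shows that a window containing a tandem repeat keeps a high share of recurring k-mers even when one copy carries an error.

```python
from collections import Counter


def repeated_kmer_fraction(window: str, k: int) -> float:
    """Fraction of k-mer occurrences in `window` whose k-mer appears more than once."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(kmers)


def candidate_windows(read: str, window_size: int = 60, k: int = 5,
                      threshold: float = 0.5):
    """Return (start, end, fraction) for fixed-size windows dominated by recurring k-mers.

    Windows are non-overlapping; a trailing partial window is ignored.
    All default values are illustrative assumptions, not published parameters.
    """
    hits = []
    for start in range(0, max(len(read) - window_size, 0) + 1, window_size):
        window = read[start:start + window_size]
        fraction = repeated_kmer_fraction(window, k)
        if fraction >= threshold:
            hits.append((start, start + len(window), fraction))
    return hits


# Toy read: 60 nt of non-repetitive flank, then 20 copies of "ACG"
# in which one copy carries a substitution ("ATG").
flank = "ATGGCGTACCTGAAGTTCGACCATTGCAAGGCTAGTCCGATAGGCTTAACGGATCCTGAA"
repeat = "ACG" * 10 + "ATG" + "ACG" * 9
print(candidate_windows(flank + repeat))
```

In this toy case only the second window is reported, because the substitution removes just the five k-mers that overlap it while the rest still recur.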

Accurate measurement of microsatellite length by disrupting its tandem repeat structure

10.1101/2021.12.09.471828 ◽

2021 ◽

Author(s):

Dan Levy ◽

Zihua Wang ◽

Andrea Moffitt ◽

Michael H. Wigler

Keyword(s):

Tandem Repeat ◽

Error Rate ◽

Tandem Repeats ◽

Clinical Applications ◽

Error Rates ◽

Sequence Motifs ◽

High Error Rate ◽

Repeat Structure ◽

Flanking Regions ◽

Simple Sequence

Replication of tandem repeats of simple sequence motifs, also known as microsatellites, is error prone and variable lengths frequently occur during population expansions. Therefore, microsatellite length variations could serve as markers for cancer. However, accurate error-free quantitation of microsatellite lengths is difficult with current methods because of a high error rate during amplification and sequencing. We have solved this problem by using partial mutagenesis to disrupt enough of the repeat structure so that it can replicate faithfully, yet not so much that the flanking regions cannot be reliably identified. In this work we use bisulfite mutagenesis to convert a C to a U, later read as T. Compared to untreated templates, we achieve three orders of magnitude reduction in the error rate per round of replication. By requiring two independent first copies of an initial template, we reach error rates below one in a million. We discuss potential clinical applications of this method.
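As a rough illustration of why partial mutagenesis disrupts repeat structure, the following Python sketch models bisulfite treatment as an independent C-to-T conversion at each cytosine. This is a toy model under stated assumptions, not the authors' protocol or analysis code: the function names, the 50% conversion rate, and the example template are hypothetical, and strand effects and amplification are ignored.

```python
import random


def partial_bisulfite(seq: str, conversion_rate: float, rng: random.Random) -> str:
    """Convert each C to T independently with probability `conversion_rate`."""
    return "".join(
        "T" if base == "C" and rng.random() < conversion_rate else base
        for base in seq
    )


def longest_pure_run(seq: str, motif: str) -> int:
    """Largest number of back-to-back exact copies of `motif` anywhere in `seq`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    best = 0
    step = len(motif)
    for start in range(len(seq)):
        copies = 0
        while seq.startswith(motif, start + copies * step):
            copies += 1
        best = max(best, copies)
    return best


# Hypothetical template: a (CA)n microsatellite between two short flanks.
template = "GATTACA" + "CA" * 20 + "TGGTACC"
rng = random.Random(0)  # fixed seed so the toy run is reproducible
converted = partial_bisulfite(template, conversion_rate=0.5, rng=rng)

# Compare the longest uninterrupted CA tract before and after conversion.
print(longest_pure_run(template, "CA"), longest_pure_run(converted, "CA"))
```

The converted sequence has the same length as the template, so the repeat length can still be read off, while the long uninterrupted tract is broken into shorter stretches.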

Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing

10.1101/2021.09.27.21263187 ◽

2021 ◽

Author(s):

Igor Stevanovski ◽

Sanjog R. Chintalaphani ◽

Hasindu Gamaarachchi ◽

James M. Ferguson ◽

Sandy S. Pineda ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

Fragile X ◽

Genetic Diagnosis ◽

Neuromuscular Diseases ◽

Nanopore Sequencing ◽

Molecular Tests ◽

Genetic Landscape ◽

Long Read ◽

Short Tandem

Abstract: Short-tandem repeat (STR) expansions are an important class of pathogenic genetic variants. Over forty neurological and neuromuscular diseases are caused by STR expansions, with 37 different genes implicated to date. Here we describe the use of programmable targeted long-read sequencing with Oxford Nanopore’s ReadUntil function for parallel genotyping of all known neuropathogenic STRs in a single, simple assay. Our approach enables accurate, haplotype-resolved assembly and DNA methylation profiling of expanded and non-expanded STR sites. In doing so, the assay correctly diagnoses all individuals in a cohort of patients (n = 27) with various neurogenetic diseases, including Huntington’s disease, fragile X syndrome and cerebellar ataxia (CANVAS), among others. Targeted long-read sequencing solves large and complex STR expansions that confound established molecular tests and short-read sequencing, and identifies non-canonical STR motif conformations and internal sequence interruptions. Even in our relatively small cohort, we observe a wide diversity of STR alleles of known and unknown pathogenicity, suggesting that long-read sequencing will redefine the genetic landscape of STR expansion disorders. Finally, we show how the flexible inclusion of pharmacogenomics (PGx) genes as secondary ReadUntil targets can identify clinically actionable PGx genotypes to further inform patient care, at no extra cost. Our study addresses the need for improved techniques for genetic diagnosis of STR expansion disorders and illustrates the broad utility of programmable long-read sequencing for clinical genomics.

One sentence summary: This study describes the development and validation of a programmable targeted nanopore sequencing assay for parallel genetic diagnosis of all known pathogenic short-tandem repeats (STRs) in a single, simple test.

Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads

Genome Biology ◽

10.1186/s13059-019-1667-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 25

Author(s):

Satomi Mitsuhashi ◽

Martin C. Frith ◽

Takeshi Mizuguchi ◽

Satoko Miyatake ◽

Tomoko Toyota ◽

...

Keyword(s):

Tandem Repeat ◽

Robust Detection ◽

Repeat Expansions

NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION

Genome Biology ◽

10.1186/s13059-019-1856-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 10

Author(s):

Arne De Roeck ◽

Wouter De Coster ◽

Liene Bossaerts ◽

Rita Cacace ◽

Tim De Pooter ◽

...

Keyword(s):

Tandem Repeat ◽

Large Scale ◽

Tandem Repeats ◽

Current Data ◽

Flip Flop ◽

Base Calling ◽

Oxford Nanopore ◽

Long Read ◽

Technological Limitations ◽

Repeat Assessment

Abstract: Technological limitations have hindered the large-scale genetic investigation of tandem repeats in disease. We show that long-read sequencing with a single Oxford Nanopore Technologies PromethION flow cell per individual achieves 30× human genome coverage and enables accurate assessment of tandem repeats including the 10,000-bp Alzheimer’s disease-associated ABCA7 VNTR. The Guppy “flip-flop” base caller and tandem-genotypes tandem repeat caller are efficient for large-scale tandem repeat assessment, but base calling and alignment challenges persist. We present NanoSatellite, which analyzes tandem repeats directly on electric current data and improves calling of GC-rich tandem repeats, expanded alleles, and motif interruptions.

Diverse origins of high copy tandem repeats in grass genomes

10.7287/peerj.preprints.2314 ◽

2016 ◽

Author(s):

Paul Bilinski ◽

Yonghua Han ◽

Matthew B Hufford ◽

Anne Lorant ◽

Pingdong Zhang ◽

...

Keyword(s):

Tandem Repeat ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

De Novo ◽

Repetitive Sequences ◽

Read Mapping ◽

Hybridization Data ◽

Centromeric Repeat ◽

Tandem Repeat Sequences

In studying genomic architecture, highly repetitive regions have historically posed a challenge when investigating sequence variation and content. High-throughput sequencing has enabled researchers to use whole-genome shotgun sequencing to estimate the abundance of repetitive sequence, and these methodologies have been recently applied to centromeres. Here, we utilize sequence assembly and read mapping to identify and quantify the genomic abundance of different tandem repeat sequences. Previous research has posited that the highest abundance tandem repeat in eukaryotic genomes is often the centromeric repeat, and we pair our bioinformatic pipeline with fluorescent in-situ hybridization data to test this hypothesis. We find that de novo assembly and bioinformatic filters can successfully identify repeats with homology to known tandem repeats. Fluorescent in-situ hybridization, however, shows that de novo assembly fails to identify novel centromeric repeats, instead identifying other potentially important repetitive sequences. Together, our results test the applicability and limitations of using de novo repeat assembly of tandem repeats to identify novel centromeric repeats. Building on our findings of genomic composition, we also set forth a method for exploring the repetitive regions of non-model genomes whose diversity limits the applicability of established genetic resources.

Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz484 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4809-4811 ◽

Cited By ~ 8

Author(s):

Robert S Harris ◽

Monika Cechova ◽

Kateryna D Makova

Keyword(s):

Tandem Repeats ◽

Error Rates ◽

Superior Performance ◽

Supplementary Information ◽

Whole Genome Sequencing Data ◽

Dna Repeats ◽

Sequencing Data ◽

Heat Shock Stress ◽

Noise Cancelling ◽

Long Read

Abstract

Summary: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.

Availability and implementation: NCRF is implemented in C, supported by several Python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.

Supplementary information: Supplementary data are available at Bioinformatics online.

Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences

Genome Biology ◽

10.1186/s13059-021-02447-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Readman Chiu ◽

Indhu-Shree Rajan-Babu ◽

Jan M. Friedman ◽

Inanc Birol

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Tandem Repeat ◽

Neurological Disorders ◽

Software Tool ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Long Read ◽

Repeat Expansions

Abstract: Tandem repeat (TR) expansion is the underlying cause of over 40 neurological disorders. Long-read sequencing offers an exciting avenue over conventional technologies for detecting TR expansions. Here, we present Straglr, a robust software tool for both targeted genotyping and novel expansion detection from long-read alignments. We benchmark Straglr using various simulations, targeted genotyping data of cell lines carrying expansions of known diseases, and whole genome sequencing data with chromosome-scale assembly. Our results suggest that Straglr may be useful for investigating disease-associated TR expansions using long-read sequencing.

Induction of Recombinant Lectin Expression by an Artificially Constructed Tandem Repeat Structure: A Case Study Using Bryopsis plumosa Mannose-Binding Lectin

Biomolecules ◽

10.3390/biom8040146 ◽

2018 ◽

Vol 8 (4) ◽

pp. 146 ◽

Cited By ~ 3

Author(s):

Hyun-Ju Hwang ◽

Jin-Woo Han ◽

Hancheol Jeon ◽

Jong Han

Keyword(s):

Tandem Repeat ◽

Large Scale ◽

Tandem Repeats ◽

Expression System ◽

Mannose Binding Lectin ◽

Repeat Structure ◽

Tandem Repeat Sequences ◽

Mannose Binding ◽

Bryopsis Plumosa ◽

Binding Lectin

Lectin is an important protein in medical and pharmacological applications. Impurities in lectin derived from natural sources and the generation of inactive proteins by recombinant technology are major obstacles for the use of lectins. Expressing recombinant lectin with a tandem repeat structure can potentially overcome these problems, but few studies have systematically examined this possibility. This was investigated in the present study using three distinct forms of recombinant mannose-binding lectin from Bryopsis plumosa (BPL2): the monomer (rD1BPL2), and the dimer (rD2BPL2) and tetramer (rD4BPL2) arranged as tandem repeats. The concentration of the inducer molecule isopropyl β-D-1-thiogalactopyranoside and the induction time had no effect on the efficiency of the expression of each construct. Of the tested constructs, only rD4BPL2 showed hemagglutination activity towards horse erythrocytes; the activity of rD4BPL2 towards the former was 64 times higher than that of native BPL2. Recombinant and native BPL2 showed differences in carbohydrate specificity; the activity of rD4BPL2 was inhibited by the glycoprotein fetuin, whereas that of native BPL2 was also inhibited by d-mannose. Our results indicate that expression as tandem repeat sequences can increase the efficiency of lectin production on a large scale using a bacterial expression system.
