The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences

Avraam Tapinos; Bede Constantinides; My V. T. Phan; Samaneh Kouchaki; Matthew Cotten; David L. Robertson

doi:10.3390/v11050394

The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences

Viruses ◽

10.3390/v11050394 ◽

2019 ◽

Vol 11 (5) ◽

pp. 394

Author(s):

Avraam Tapinos ◽

Bede Constantinides ◽

My V. T. Phan ◽

Samaneh Kouchaki ◽

Matthew Cotten ◽

...

Keyword(s):

Dimensionality Reduction ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Sequential Data ◽

Biological Sequence ◽

Viral Pathogen ◽

Data Intensive ◽

Reduction Methods ◽

Virus Sequence

Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.

Download Full-text

The Utility of Data Transformation for Alignment, de novo Assembly and Classification of Short Read Virus Sequences

10.20944/preprints201904.0014.v1 ◽

2019 ◽

Author(s):

Avraam Tapinos ◽

Bede Constantinides ◽

My VT Phan ◽

Samaneh Kouchaki ◽

Matthew Cotten ◽

...

Keyword(s):

Dimensionality Reduction ◽

De Novo ◽

Sequence Data ◽

Sequential Data ◽

Biological Sequence ◽

Viral Pathogen ◽

Data Intensive ◽

Reduction Methods ◽

Virus Sequence ◽

Comparable Accuracy

Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work we explore the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Despite using highly compressed sequence transformations to accelerate the processes, our sequence processing approach yielded comparable accuracy to existing approaches, and are ideally suited for sequences originating from highly diverse virus populations. We demonstrate the application of our methodology to both synthetic and real viral pathogen sequence data. Our results show that the use of highly compressed sequence approximations can provide accurate results and that useful analytical performance can be retained and even enhanced through appropriate dimensionality reduction of sequence data.

Download Full-text

Alignment by numbers: sequence assembly using compressed numerical representations

10.1101/011940 ◽

2014 ◽

Cited By ~ 2

Author(s):

Avraam Tapinos ◽

Bede Constantinides ◽

Douglas B Kell ◽

David L Robertson

Keyword(s):

Dimensionality Reduction ◽

Sequence Alignment ◽

De Novo ◽

Sequence Data ◽

Sequence Assembly ◽

Viral Population ◽

Sequential Data ◽

Data Intensive ◽

Reduction Methods ◽

Feature Selection Approach

Motivation: DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally use nucleotide-level resolution of sequences, and while such approaches can be very accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for applications such as genome and metagenome assembly. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction methods are routinely used to reduce the computational burden of analyses. We therefore seek to address the question of whether it is possible to improve the efficiency of sequence alignment by applying dimensionality reduction methods to numerically represented nucleotide sequences. Results: To explore the applicability of signal transformation and dimensionality reduction methods to sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners. Using our sequence transformation and feature selection approach, alignment time was reduced by up to 14-fold compared to uncompressed sequences and without reducing alignment accuracy. Despite using highly compressed sequence transformations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. Our results demonstrate that full sequence resolution is not a prerequisite of accurate sequence alignment and that analytical performance can be retained and even enhanced through appropriate dimensionality reduction of sequences.

Download Full-text

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line was originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-reads sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2 N = 64). It is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and additional 15 smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.

Download Full-text

The Chaperonin-60 Universal Target Is a Barcode for Bacteria That Enables De Novo Assembly of Metagenomic Sequence Data

PLoS ONE ◽

10.1371/journal.pone.0049755 ◽

2012 ◽

Vol 7 (11) ◽

pp. e49755 ◽

Cited By ~ 83

Author(s):

Matthew G. Links ◽

Tim J. Dumonceaux ◽

Sean M. Hemmingsen ◽

Janet E. Hill

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Metagenomic Sequence ◽

Chaperonin 60 ◽

Metagenomic Sequence Data

Download Full-text

De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae

Genome Research ◽

10.1101/gr.083311.108 ◽

2008 ◽

Vol 19 (2) ◽

pp. 294-305 ◽

Cited By ~ 103

Author(s):

J. A. Reinhardt ◽

D. A. Baltrus ◽

M. T. Nishimura ◽

W. R. Jeck ◽

C. D. Jones ◽

...

Keyword(s):

Pseudomonas Syringae ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Short Read ◽

Short Read Sequence ◽

Low Coverage

Download Full-text

Faculty Opinions recommendation of De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1136870.593958 ◽

2008 ◽

Author(s):

Steven Salzberg ◽

Michael Schatz

Keyword(s):

Pseudomonas Syringae ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Short Read ◽

Short Read Sequence ◽

Low Coverage

Download Full-text

A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours

Bioinformatics ◽

10.1093/bioinformatics/btaa890 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i625-i633

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Removal Rate ◽

Noise Removal ◽

Memory Usage ◽

Counting Method ◽

Sequencing Data ◽

Sequencing Errors ◽

Human Sequence

Abstract Motivation In de novo sequence assembly, a standard pre-processing step is k-mer counting, which computes the number of occurrences of every length-k sub-sequence in the sequencing reads. Sequencing errors can produce many k-mers that do not appear in the genome, leading to the need for an excessive amount of memory during counting. This issue is particularly serious when the genome to be assembled is large, the sequencing depth is high, or when the memory available is limited. Results Here, we propose a fast near-exact k-mer counting method, CQF-deNoise, which has a module for dynamically removing noisy false k-mers. It automatically determines the suitable time and number of rounds of noise removal according to a user-specified wrong removal rate. We tested CQF-deNoise comprehensively using data generated from a diverse set of genomes with various data properties, and found that the memory consumed was almost constant regardless of the sequencing errors while the noise removal procedure had minimal effects on counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consistently performed the best in terms of memory usage, consuming 49–76% less memory than the second best method. When counting the k-mers from a human dataset with around 60× coverage, the peak memory usage of CQF-deNoise was only 10.9 GB (gigabytes) for k = 28 and 21.5 GB for k = 55. De novo assembly of 106× human sequencing data using CQF-deNoise for k-mer counting required only 2.7 h and 90 GB peak memory. Availability and implementation The source codes of CQF-deNoise and SH-assembly are available at https://github.com/Christina-hshi/CQF-deNoise.git and https://github.com/Christina-hshi/SH-assembly.git, respectively, both under the BSD 3-Clause license.

Download Full-text

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals structural genome variation in rainbow trout

10.1101/2020.12.28.424581 ◽

2020 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C. Waldbieser ◽

Ramey C. Youngblood ◽

Paul A. Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Haploid Chromosome Number ◽

Long Reads ◽

Homozygous Line ◽

Igh Genes

AbstractCurrently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line was originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-reads sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N=64). It is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and additional 15 smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.Article SummaryA de-novo genome assembly was generated for the Arlee homozygous line of rainbow trout to enable identification and characterization of genome variants towards developing a rainbow trout pan-genome reference. The new assembly was generated using the PacBio sequencing technology and scaffolding with Hi-C contact maps and Bionano optical mapping. A contiguous genome assembly was obtained, with the contig and scaffold N50 over 15.6 Mb and 39 Mb, respectively, and 95% of the assembly in chromosome sequences. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes.

Download Full-text

Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus)

Genome ◽

10.1139/gen-2016-0202 ◽

2017 ◽

Vol 60 (9) ◽

pp. 743-755 ◽

Cited By ~ 21

Author(s):

Sorel Fitz-Gibbon ◽

Andrew L. Hipp ◽

Kasey K. Pham ◽

Paul S. Manos ◽

Victoria L. Sork

Keyword(s):

De Novo Assembly ◽

Reference Genome ◽

De Novo ◽

Sequence Data ◽

Draft Genome ◽

Reduced Representation ◽

Advantages And Disadvantages ◽

Mapping Sequence ◽

Wide Range ◽

Short Read Sequence

The emergence of next generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-sited associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we know little about the potential benefits and drawbacks of de novo assembly versus assembly by mapping to a reference genome. Using RADseq data for 83 oak samples representing 16 taxa, we identified variants via three pipelines: mapping sequence reads to a recently published draft genome of Quercus lobata, and de novo assembly under two sets of locus filters. For each pipeline, we inferred the maximum likelihood phylogeny. All pipelines produced similar trees, with minor shifts in relationships within well-supported clades, despite the fact that they yielded different numbers of loci (68 000 – 111 000 loci) and different degrees of overlap with the reference genome. We conclude that both the reference-aligned and de novo assembly pipelines yield reliable results, and that advantages and disadvantages of these approaches pertain mainly to downstream uses of RADseq data, not to phylogenetic inference per se.

Download Full-text

Speech Emotion Recognition Based on Sparse Representation

Archives of Acoustics ◽

10.2478/aoa-2013-0055 ◽

2013 ◽

Vol 38 (4) ◽

pp. 465-470 ◽

Cited By ~ 11

Author(s):

Jingjie Yan ◽

Xiaolan Wang ◽

Weiyi Gu ◽

LiLi Ma

Keyword(s):

Dimensionality Reduction ◽

Emotion Recognition ◽

Least Squares ◽

Partial Least Squares ◽

Partial Least Squares Regression ◽

Speech Emotion Recognition ◽

Least Squares Regression ◽

Computer Science Pedagogy ◽

Reduction Methods ◽

Analysis Computer

Abstract Speech emotion recognition is deemed to be a meaningful and intractable issue among a number of do- mains comprising sentiment analysis, computer science, pedagogy, and so on. In this study, we investigate speech emotion recognition based on sparse partial least squares regression (SPLSR) approach in depth. We make use of the sparse partial least squares regression method to implement the feature selection and dimensionality reduction on the whole acquired speech emotion features. By the means of exploiting the SPLSR method, the component parts of those redundant and meaningless speech emotion features are lessened to zero while those serviceable and informative speech emotion features are maintained and selected to the following classification step. A number of tests on Berlin database reveal that the recogni- tion rate of the SPLSR method can reach up to 79.23% and is superior to other compared dimensionality reduction methods.

Download Full-text