Sequence Compression Benchmark (SCB) database — a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

2019 ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract Background Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings We systematically benchmarked 410 settings of 44 compressors (including 26 specialized sequence compressors and 18 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 25 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows building custom visualizations for selected subsets of benchmark results. Conclusion We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows comparing compressors and their settings using a variety of performance measures, offering the opportunity to select the optimal compressor based on the data type and usage scenario specific to a particular application.

GigaScience ◽  
2020 ◽  
Vol 9 (7) ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract Background Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results. Conclusion We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
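The kind of measurement the benchmark reports can be illustrated with a minimal sketch. This uses Python's standard-library codecs (gzip, bzip2, xz) as stand-ins for the 48 benchmarked compressors, and a synthetic FASTA-like record in place of the real test datasets; it computes two of the 17 performance measures, compressed-size ratio and compression time.

```python
import bz2, gzip, lzma, random, time

# A synthetic FASTA-like DNA record stands in for the real test datasets.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(200_000))
data = (">toy_record\n" + seq + "\n").encode()

# Stand-ins for the benchmarked compressors, each at one setting.
codecs = {
    "gzip -9": lambda d: gzip.compress(d, compresslevel=9),
    "bzip2 -9": lambda d: bz2.compress(d, compresslevel=9),
    "xz -6": lambda d: lzma.compress(d, preset=6),
}

results = {}
for name, compress in codecs.items():
    t0 = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - t0
    # Two of the benchmark's measures: compressed size relative to the
    # input, and compression time.
    results[name] = {"ratio": len(out) / len(data), "seconds": elapsed}

for name, r in results.items():
    print(f"{name}: ratio={r['ratio']:.3f}, time={r['seconds']:.3f}s")
```

The full benchmark additionally measures decompression time, peak memory, and combined transfer-plus-decompression measures, which is why the optimal choice depends on the usage scenario.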


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johannes Linder ◽  
Georg Seelig

Abstract Background Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence. Results Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp’s capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor. Conclusions Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
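The basic activation-maximization loop the abstract describes can be sketched as follows. This is a simplified illustration, not the paper's Fast SeqProp method: a toy linear scorer stands in for a trained deep predictor, and the loop uses a plain softmax relaxation with gradient ascent on unconstrained logits, without the straight-through approximation or parameter normalization that the paper adds.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 20, 4  # sequence length, alphabet size (A, C, G, T)

# Toy differentiable "predictor": a linear scorer over per-position
# nucleotide probabilities (a stand-in for a trained deep model).
W = rng.normal(size=(L, A))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def score(p):
    return float((W * p).sum())

# Activation maximization: gradient ascent on unconstrained logits,
# mapped through a softmax so each position stays a probability vector.
logits = rng.normal(size=(L, A))
lr = 0.5
history = []
for _ in range(200):
    p = softmax(logits)
    history.append(score(p))
    grad_p = W  # d score / d p for the linear scorer
    # Chain rule through the row-wise softmax Jacobian.
    grad_logits = p * (grad_p - (grad_p * p).sum(axis=-1, keepdims=True))
    logits += lr * grad_logits

# Discretize the final relaxed sequence back to nucleotides.
seq = "".join("ACGT"[i] for i in np.argmax(softmax(logits), axis=1))
```

The vanishing-gradient problem the abstract mentions shows up here when the softmax saturates; Fast SeqProp's normalization of the input-distribution parameters is one remedy.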


1969 ◽  
Vol 115 (3) ◽  
pp. 377-382 ◽  
Author(s):  
D. C. Watts ◽  
J. D. Reid

1. Although the total weight of leg muscle increased with the age of a normal mouse, the DNA and RNA content per leg did not change significantly.
2. The weight of leg muscle from a dystrophic mouse was only about 45% of that from a normal mouse, but the DNA and RNA contents were the same and hence similar DNA/RNA ratios were obtained.
3. The total ribosome contents of normal and dystrophic mice were the same on a whole-leg basis, and for both the free ribosomes were about 60% of the total. However, comparison with similar data from liver suggested that some loss of ribosomes occurred during the isolation procedure.
4. The polyribosome patterns obtained by density-gradient centrifugation were the same for normal and dystrophic muscle, and comparable polyribosome fractions of different sizes obtained from such gradients had similar capacities for the incorporation of radioactive amino acids in a standard protein-synthesizing system.
5. With a standard protein-synthesizing system containing normal polyribosomes, similar extents of incorporation were found with the normal- or dystrophic-muscle pH5 fraction or a partially purified transfer RNA preparation.
6. It is concluded that there is no absolute difference between the protein-synthesizing systems of normal and dystrophic mouse muscle and that the observed apparent differences result from concentration differences caused by changes in muscle volume.
7. A possible cause of the failure of dystrophic muscle to resynthesize myofibrils is also suggested.


2021 ◽  
Author(s):  
Shaofang Li ◽  
Lang Liu ◽  
Wenxian Sun ◽  
Xueping Zhou ◽  
Huanbin Zhou

The high-activity adenine base editors (ABEs), engineered with the recently developed tRNA adenosine deaminases (TadA8e and TadA9), show robust base editing activity but raise concerns about off-target effects. In this study, we performed a comprehensive evaluation of ABE8e- and ABE9-induced DNA and RNA mutations in Oryza sativa. Whole-genome sequencing analysis of plants transformed with four ABEs, including SpCas9n-TadA8e, SpCas9n-TadA9, SpCas9n-NG-TadA8e, and SpCas9n-NG-TadA9, revealed that ABEs harboring TadA9 lead to a higher number of off-target A-to-G (A>G) single-nucleotide variants (SNVs), and that those harboring the CRISPR/SpCas9n-NG lead to a higher total number of off-target SNVs in the rice genome. An analysis of the T-DNAs carrying the ABEs indicated that the on-target mutations could be introduced before and/or after T-DNA integration into plant genomes, with more off-target A>G SNVs forming after the ABEs had integrated into the plant genome. Furthermore, we detected off-target A>G RNA mutations in plants with high expression of ABEs but not in plants with low expression of ABEs. The off-target A>G RNA mutations tended to cluster, while off-target A>G DNA mutations rarely clustered. Our findings that Cas proteins, TadA variants, temporal expression of ABEs, and expression levels of ABEs contribute to ABE specificity in rice provide insight into the specificity of ABEs and suggest alternative ways to increase ABE specificity besides engineering TadA variants.


2018 ◽  
Vol 144 (2) ◽  
pp. 04017072 ◽  
Author(s):  
Eneliko Mulokozi ◽  
Hualiang (Harry) Teng ◽  
Valerian Kwigizile ◽  
Deo Chimba ◽  
Thobias Sando

2020 ◽  
Author(s):  
Thordis Thorarinsdottir ◽  
Jana Sillmann ◽  
Marion Haugen

Climate models aim to project future changes in important drivers of climate, including the atmosphere, oceans, and ice, and their interactions. A comprehensive evaluation of climate models thus requires evaluation methods, or performance measures, that are flexible and specific and can also address extreme events. Climate models have traditionally been assessed by comparing summary statistics or point estimates derived from the simulated model output to corresponding observed quantities using, e.g., the RMSE. However, it has been argued persuasively that probability distributions of model output need to be compared to the corresponding empirical distributions of observations or observation-based data products. Observation-based gridded datasets for climate extremes, despite having limitations, are particularly useful and necessary to assess model performance with respect to extremes. We discuss proper performance measures for comparing distributions of model output against corresponding distributions from data products that are flexible and robust enough to handle the particular aspects of extremes, such as limited data availability. The new measures are applied to evaluate CMIP5 and CMIP6 projections of extreme temperature indices over Europe and North America against the HadEX2 data set as well as the ERA5 and ERA-Interim reanalyses. Several models perform well to the extent that, when compared to the HadEX2 data product, their performance is competitive with that of the reanalyses. While the model rankings vary with region, season, and index, the model evaluation is robust against changes in the grid resolution considered in the analysis.
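One well-known proper divergence for comparing two empirical distributions is the integrated quadratic distance between their CDFs. The sketch below is a minimal empirical version, offered only as an illustration of distribution-to-distribution comparison; the synthetic Gaussian samples stand in for simulated and observation-based index values, and the abstract's actual measures are not specified here.

```python
import numpy as np

def ecdf(sample, grid):
    # Empirical CDF of `sample` evaluated on `grid`.
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, grid, side="right") / sample.size

def integrated_quadratic_distance(x, y, n_grid=2000):
    """Approximate d(F, G) = integral of (F(t) - G(t))^2 dt between the
    empirical CDFs of samples x and y, via a Riemann sum."""
    lo = min(np.min(x), np.min(y))
    hi = max(np.max(x), np.max(y))
    grid = np.linspace(lo, hi, n_grid)
    dt = (hi - lo) / (n_grid - 1)
    return float(np.sum((ecdf(x, grid) - ecdf(y, grid)) ** 2) * dt)

rng = np.random.default_rng(0)
model = rng.normal(0.0, 1.0, size=5000)  # stand-in for a simulated index
obs = rng.normal(0.5, 1.0, size=5000)    # stand-in for HadEX2-style data

print(integrated_quadratic_distance(model, obs))
```

Unlike a comparison of point estimates such as means, this measure is zero only when the two distributions agree everywhere, which is what makes distribution-based evaluation sensitive to tail behavior.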


CNS Spectrums ◽  
1999 ◽  
Vol 4 (5) ◽  
pp. 59-74 ◽  
Author(s):  
Pamela Sklar ◽  
David Altshuler ◽  
Michele Cargill ◽  
Joel N. Hirschhorn

Abstract As the Human Genome Project completes the first human genome sequence, attention has turned to how this information can be used to understand disease. The availability of sequences for all genes will allow a comprehensive evaluation of each gene's contribution to disease. Approaches involving collecting specific gene variants and monitoring expression levels using DNA microarrays facilitate collecting information about DNA and RNA in a rapid and highly parallel manner. Developing an extensive catalogue of polymorphisms will become increasingly important in the context of studies of complex genetic diseases such as schizophrenia and bipolar disorder.


1999 ◽  
Vol 122 (4) ◽  
pp. 753-759 ◽  
Author(s):  
Jinwook Kim ◽  
Changbeom Park ◽  
Jongwon Kim ◽  
F. C. Park

In this paper we propose a set of criteria to evaluate the performance of various parallel mechanism architectures for CNC machining applications. In the robotics literature, mathematical formulations of qualities like manipulability, stiffness, and workspace volume have been proposed to evaluate the performance of general-purpose robots. Here we propose a set of performance measures that specifically address features of the machining process. We define precise notions of machine tool workspace, joint and link stiffness, and position and orientation manipulability. The performance of various existing 6 d.o.f. architectures is evaluated with these measures. The analytical methodology presented here, in combination with a graphics-based CAD software environment, can serve as a useful tool in the design of high-performance parallel mechanism machine tools. [S1087-1357(00)01804-9]
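The paper's own workspace and stiffness measures are not reproduced here, but the manipulability quantity it builds on can be sketched with the standard Yoshikawa measure. The Jacobian below is hypothetical, a random perturbation of the identity standing in for a 6-d.o.f. parallel mechanism at one pose.

```python
import numpy as np

def manipulability(J):
    """Yoshikawa's manipulability measure sqrt(det(J @ J.T)).
    For a square, nonsingular Jacobian this equals |det(J)|; it falls
    toward zero as the mechanism approaches a singular configuration,
    where some end-effector motions become unreachable."""
    return float(np.sqrt(np.linalg.det(J @ J.T)))

# Hypothetical 6x6 Jacobian of a 6-d.o.f. parallel mechanism at one pose.
rng = np.random.default_rng(1)
J = np.eye(6) + 0.1 * rng.normal(size=(6, 6))

print(manipulability(np.eye(6)))  # identity Jacobian gives 1.0
print(manipulability(J))
```

Splitting the Jacobian into translational and rotational rows and applying the same formula to each block gives separate position and orientation manipulability numbers, in the spirit of the distinction the abstract draws.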


Author(s):  
B.A. Hamkalo ◽  
S. Narayanswami ◽  
A.P. Kausch

The availability of nonradioactive methods to label nucleic acids, and the resulting faster and more sensitive detection, has catapulted in situ hybridization to the method of choice for locating specific DNA and RNA sequences on chromosomes and in whole cells in cytological preparations in many areas of biology. It is being applied to problems of fundamental interest to basic cell and molecular biologists, such as the organization of the interphase nucleus in the context of putative functional domains; it is making major contributions to genome mapping efforts; and it is being applied to the analysis of clinical specimens. Although fluorescence detection of nucleic acid hybrids is routinely used, certain questions require greater resolution. For example, very closely linked sequences may not be separable using fluorescence; the precise location of sequences with respect to chromosome structures may be below the resolution of light microscopy (LM); and determining the relative positions of sequences on very small chromosomes may not be feasible.

