Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm

2016
Author(s):
Aleksey V. Zimin
Daniela Puiu
Ming-Cheng Luo
Tingting Zhu
Sergey Koren
...  

Abstract. Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use these data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 bp. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
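To make the hybrid idea concrete, here is a deliberately minimal sketch in which accurate short reads vote on each base of a noisy long read. It is not the mega-reads algorithm itself, which first assembles Illumina data into longer "super-reads" and threads them along each PacBio read; the pre-computed alignment offsets used below are an assumed input.

```python
# Minimal illustration of hybrid correction: accurate short reads aligned
# to a noisy long read vote on each base. NOT the mega-reads algorithm;
# read offsets are assumed to be known (hypothetical input).
from collections import Counter

def correct_long_read(long_read, aligned_short_reads):
    """aligned_short_reads: list of (offset, sequence) pairs."""
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1
    # Keep the original base wherever short-read coverage is absent.
    return "".join(
        v.most_common(1)[0][0] if v else b
        for b, v in zip(long_read, votes)
    )

if __name__ == "__main__":
    noisy = "ACGTTXGCA"                    # 'X' marks a sequencing error
    shorts = [(2, "GTTG"), (4, "TGGC")]    # accurate reads with offsets
    print(correct_long_read(noisy, shorts))  # -> ACGTTGGCA
```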

Author(s):  
Lior Shamir

Abstract. Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. Both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence yields a dipole axis with statistical significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$, well within the $1\sigma$ error range of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.
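As an illustration of the cosine fit described above, the sketch below scans a grid of candidate dipole axes and, for each, fits the spin signs to $d\cos\theta$ by least squares. The grid resolution, input format, and fit are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: fit galaxy spin signs to a cosine (dipole) dependence by
# grid-scanning candidate axes. Illustrative, not the paper's method.
import numpy as np

def radec_to_unit(ra_deg, dec_deg):
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def best_dipole(ra, dec, spin):  # spin: +1 or -1 per galaxy
    gal = np.stack([radec_to_unit(r, d) for r, d in zip(ra, dec)])
    best_axis, best_amp = None, -np.inf
    for ax_ra in range(0, 360, 5):            # coarse grid over candidate axes
        for ax_dec in range(-90, 91, 5):
            axis = radec_to_unit(ax_ra, ax_dec)
            cos_t = gal @ axis
            # least-squares amplitude of spin ~ d * cos(theta)
            d = (spin * cos_t).sum() / (cos_t ** 2).sum()
            if abs(d) > best_amp:
                best_axis, best_amp = (ax_ra, ax_dec), abs(d)
    return best_axis, best_amp

rng = np.random.default_rng(0)
n = 5000
ra = rng.uniform(0, 360, n)
dec = np.degrees(np.arcsin(rng.uniform(-1, 1, n)))  # uniform on the sphere
spin = rng.choice([-1, 1], n)                        # random spins -> amplitude near zero
print(best_dipole(ra, dec, spin))
```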


2019
Author(s):
Reto Sterchi
Pascal Haegeli
Patrick Mair

Abstract. While guides in mechanized skiing operations use a well-established terrain selection process to limit their exposure to avalanche hazard and keep the residual risk at an acceptable level, the relationship between the open/closed status of runs and environmental factors is complex and has so far received only limited research attention. Using a large data set of over 25 000 operational run list codes from a mechanized skiing operation, we applied a general linear mixed effects model to explore the relationship between acceptable skiing terrain (i.e., status open) and avalanche hazard conditions. Our results show that the magnitude of the effect of avalanche hazard on run list codes depends on the type of terrain being assessed by the guiding team. Ski runs in severe alpine terrain with steep lines through large avalanche slopes are much more susceptible to increases in avalanche hazard than less severe terrain. However, our results also highlight the strong effect of recent skiing on the run coding and thus the importance of prior first-hand experience. Expressing these relationships numerically is an important step towards meaningful decision aids that can help commercial operations manage their avalanche risk more effectively and efficiently.
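A minimal sketch of such a mixed-effects analysis is shown below, using statsmodels with the ski run as a random effect. The column names and synthetic data are hypothetical, and a linear mixed model on a numeric run code is a simplification; ordinal open/closed codes would more properly call for a generalized or ordinal mixed model.

```python
# Hedged sketch: mixed-effects model of run status vs. avalanche hazard,
# with a random intercept per ski run. Columns and data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_runs, n_days = 40, 60
df = pd.DataFrame({
    "run_id": np.repeat(np.arange(n_runs), n_days),
    "hazard": np.tile(rng.integers(1, 5, n_days), n_runs),            # danger rating 1-4
    "terrain_severity": np.repeat(rng.integers(1, 4, n_runs), n_days),
    "skied_yesterday": rng.integers(0, 2, n_runs * n_days),
})
# Toy run status code (higher = more open), with a hazard x terrain effect
# and a boost from recent skiing, mirroring the relationships described.
df["run_code"] = (10 - df["hazard"] * df["terrain_severity"]
                  + 2 * df["skied_yesterday"]
                  + rng.normal(0, 1, len(df)))

model = smf.mixedlm("run_code ~ hazard * terrain_severity + skied_yesterday",
                    data=df, groups=df["run_id"])   # random intercept per run
print(model.fit().summary())
```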


2020
Author(s):
Markus Wiedemann
Bernhard S.A. Schuberth
Lorenzo Colli
Hans-Peter Bunge
Dieter Kranzlmüller

Precise knowledge of the forces acting at the base of tectonic plates is of fundamental importance, but models of mantle dynamics are still often qualitative in nature to date. One particular problem is that we cannot access the deep interior of our planet and can therefore not make direct in situ measurements of the relevant physical parameters. Fortunately, modern software and powerful high-performance computing infrastructures allow us to generate complex three-dimensional models of the time evolution of mantle flow through large-scale numerical simulations.

In this project, we aim to visualize the resulting convective patterns that occur thousands of kilometres below our feet and to make them "accessible" using high-end virtual reality (VR) techniques.

Models with several hundred million grid cells are nowadays possible using modern supercomputing facilities, such as those available at the Leibniz Supercomputing Centre. These models provide quantitative estimates of the inaccessible parameters, such as buoyancy and temperature, as well as predictions of the associated gravity field and seismic wavefield that can be tested against Earth observations.

3-D visualizations of the computed physical parameters let us inspect the models as if we were actually travelling down into the Earth. In this way, the convective processes occurring thousands of kilometres below our feet become virtually accessible by combining the simulations with high-end VR techniques.

The large data set used here poses severe challenges for real-time visualization, because it cannot fit into graphics memory while requiring rendering with strict deadlines. This raises the necessity to balance the amount of displayed data against the time needed to render it.

As a solution, we introduce a rendering framework and describe the workflow that allows us to visualize this geoscientific dataset. Our example exceeds 16 TByte in size, which is beyond the capabilities of most visualization tools. To display this dataset in real time, we reduce and declutter it through isosurfacing and mesh optimization techniques.

Our rendering framework relies on multithreading and data decoupling mechanisms that allow us to upload data to graphics memory while maintaining high frame rates. The final visualization application can be executed in a CAVE installation as well as on head-mounted displays such as the HTC Vive or Oculus Rift. The latter devices will allow for viewing our example on-site at the EGU conference.
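The decoupling idea can be illustrated with a small sketch: a background thread streams data chunks into a bounded queue while the render loop keeps its frame deadline and uploads at most a fixed budget of chunks per frame. All names and timings are hypothetical stand-ins for the GPU-oriented framework described above.

```python
# Illustrative sketch of decoupled loading: a loader thread fills a bounded
# queue; the render loop never blocks on it and caps upload work per frame.
import queue
import threading
import time

chunks = queue.Queue(maxsize=8)   # bounded: loader blocks, renderer never does

def loader(n_chunks):
    for i in range(n_chunks):
        time.sleep(0.05)          # stands in for disk I/O + decompression
        chunks.put(f"mesh-chunk-{i}")

threading.Thread(target=loader, args=(32,), daemon=True).start()

UPLOAD_BUDGET_PER_FRAME = 2       # cap upload work to protect the frame rate
for frame in range(120):
    frame_start = time.perf_counter()
    for _ in range(UPLOAD_BUDGET_PER_FRAME):
        try:
            chunk = chunks.get_nowait()   # never wait on the loader
        except queue.Empty:
            break
        # here: upload `chunk` to graphics memory
    # here: render the currently resident chunks
    # sleep out the remainder of a ~90 Hz VR frame budget
    time.sleep(max(0.0, 1 / 90 - (time.perf_counter() - frame_start)))
```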


2007
Vol 4 (5)
pp. 3639-3671
Author(s):
A. V. Borges
B. Tilbrook
N. Metzl
A. Lenton
B. Delille

Abstract. We compiled a large data set of the partial pressure of CO2 (pCO2) in surface waters from 22 cruises spanning 1991 to 2003, covering the continental shelf (CS) and adjacent open ocean (43° to 46° S; 145° to 150° E) south of Tasmania. Sea surface temperature (SST) anomalies (as intense as 2°C) are apparent in the subtropical zone (STZ) and sub-Antarctic zone (SAZ). These SST anomalies also occur on the CS and seem to be related to large-scale coupled atmosphere-ocean oscillations. Anomalies of pCO2 normalized to a constant temperature are negatively related to SST anomalies. A depressed winter-time vertical input of dissolved inorganic carbon (DIC) during phases of positive SST anomalies, related to a poleward shift of the westerly winds, together with a concomitant local decrease in wind stress, is the likely cause of this negative relationship. The observed trend is an increase of the sink for atmospheric CO2 associated with positive SST anomalies, although strongly modulated by inter-annual variability of wind speed. Assuming that phases of positive SST anomalies are indicative of the future evolution of regional ocean biogeochemistry under global warming, we show, using a purely observation-based approach, that some provinces of the Southern Ocean could provide a potential negative feedback on increasing atmospheric CO2.
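For reference, a widely used way to normalize pCO2 to a constant temperature is the empirical ~4.23% per °C thermodynamic correction of Takahashi et al. (1993); whether this exact form was applied in the study is an assumption.

```python
# Standard Takahashi et al. (1993) temperature normalization of pCO2;
# assumed here as an illustration of "pCO2 normalized to a constant
# temperature", not confirmed as this study's exact method.
import math

def normalize_pco2(pco2_obs, sst_obs_c, t_ref_c):
    """Normalize observed pCO2 (uatm) from in-situ SST to a reference T (deg C)."""
    return pco2_obs * math.exp(0.0423 * (t_ref_c - sst_obs_c))

# Example: an observation at 14 C normalized to a record-mean 13 C.
print(round(normalize_pco2(360.0, 14.0, 13.0), 1))  # lower once the thermal effect is removed
```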


2019
Author(s):
Sylvain Lehmann
Christophe Hirtz
Jérôme Vialaret
Maxence Ory
Guillaume Gras Combes
...  

Summary. The extraction of accurate physiological parameters from clinical samples provides a unique perspective for understanding disease etiology and evolution, including under therapy. We introduce a new proteomics framework to map patient proteome dynamics in vivo, either proteome-wide or in large targeted panels. We applied it to ventricular cerebrospinal fluid (CSF) and determined the turnover parameters of almost 200 proteins, whereas previously only a handful were known. We covered a large number of neuron biology- and immune system-related proteins, including many biomarkers and drug targets. This first large data set revealed a significant relationship between turnover and protein origin, which bears on our ability to investigate central nervous system physiology precisely in future studies. Our data constitute a reference in CSF biology as well as a repertoire of peptides for the community to design new proteome dynamics analyses. The disclosed methods apply to other fluids or tissues, provided sequential sample collection can be performed.
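As an illustration of how turnover parameters can be extracted from labeling time courses, the sketch below fits a one-compartment model in which the labeled fraction of a protein follows 1 - exp(-kt), giving a half-life of ln(2)/k. The model choice and the numbers are illustrative, not the paper's exact framework.

```python
# Hedged sketch: extract a protein turnover rate from a stable-isotope
# labeling time course under a one-compartment model. Data are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def labeled_fraction(t, k):
    return 1.0 - np.exp(-k * t)

t_days = np.array([0.5, 1, 2, 4, 8])              # sampling times
frac = np.array([0.09, 0.18, 0.33, 0.55, 0.79])   # hypothetical labeled fractions

(k,), _ = curve_fit(labeled_fraction, t_days, frac, p0=[0.1])
print(f"k = {k:.3f} /day, half-life = {np.log(2) / k:.1f} days")
```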


Author(s):  
Vikrant Tiwari
Nimisha Sharma

In the absence of detailed COVID-19 epidemiological data or large benchmark studies, an effort has been made to explore and correlate parameters such as environment, economic indicators, and large-scale exposure to different prevalent diseases with COVID-19 spread and severity among the affected countries. Data on environmental, socio-economic, and other important infectious disease indicators were collected from reliable, open-source resources such as the World Health Organization and the World Bank. This large data set was then used to understand the worldwide spread of COVID-19 using simple statistical tools. Important observations made in this study include a high degree of resemblance in the pattern of temperature and humidity distribution among the cities severely affected by COVID-19. Further, it is surprising to see that, in spite of the presence of many environmental parameters considered favorable (like clean air, clean water, and a high Environmental Performance Index (EPI)), many countries are suffering from the severe consequences of this disease. Lastly, a noticeable segregation among the locations affected by different prevalent diseases (like malaria, HIV, tuberculosis, and cholera) was also observed. Among the considered environmental factors, temperature, humidity, and EPI should be important parameters in understanding and modelling COVID-19 spread. Further, contrary to intuition, countries with strong economies, good health infrastructure, and cleaner environments suffered disproportionately from the severity of this disease. Therefore, policymakers should sincerely review their countries' preparedness for potential future contagious diseases, whether natural or man-made.
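A minimal sketch of the kind of simple statistical analysis described, correlating country-level indicators with COVID-19 severity, might look as follows; the column names and synthetic data are hypothetical placeholders for the WHO and World Bank tables mentioned above.

```python
# Hedged sketch: rank correlations between country-level indicators and a
# COVID-19 severity measure. All names and data are hypothetical.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 120  # one row per country
df = pd.DataFrame({
    "mean_temperature": rng.uniform(-5, 30, n),
    "epi_score": rng.uniform(20, 90, n),
    "gdp_per_capita": rng.lognormal(9, 1, n),
})
# Toy severity loosely tied to GDP, mirroring the counterintuitive finding.
df["covid_deaths_per_million"] = (
    0.02 * df["gdp_per_capita"] + rng.normal(0, 100, n)).clip(lower=0)

for col in ["mean_temperature", "epi_score", "gdp_per_capita"]:
    r, p = spearmanr(df[col], df["covid_deaths_per_million"])
    print(f"{col}: Spearman r = {r:+.2f} (p = {p:.3g})")
```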


2021
Vol 30 (1)
pp. 479-486
Author(s):
Lingrui Bu
Hui Zhang
Haiyan Xing
Lijun Wu

Abstract. The efficient processing of large-scale data has very important practical value. In this study, a data mining platform based on the Hadoop distributed file system (HDFS) was designed, and the K-means algorithm was improved with the idea of max-min distance. On the HDFS platform, parallelization was realized with MapReduce. Finally, the data processing performance of the algorithm was analyzed on the Iris data set. The results showed that the parallel algorithm classified more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm ran longer; in the face of large data sets, the traditional algorithm ran out of memory while the parallel algorithm completed the calculation task; and the speedup of the parallel algorithm rose with the expansion of cluster size and data set size, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm in big data processing, contributing to further improvements in the efficiency of data mining.
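The two ingredients named in the abstract, max-min (farthest-first) seeding and a MapReduce-style assignment/update step, can be sketched in a single process as follows; this is an illustration of the ideas, not the Hadoop implementation.

```python
# Sketch: K-means with max-min distance seeding and MapReduce-style steps.
import numpy as np

def maxmin_init(X, k, rng):
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])   # farthest point from current centers
    return np.array(centers)

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = maxmin_init(X, k, rng)
    for _ in range(iters):
        # "map": emit (nearest-center id) for every point
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # "reduce": average the points assigned to each center
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Three well-separated synthetic clusters as a stand-in for Iris.
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
centers, labels = kmeans(X, 3)
print(np.round(centers, 2))
```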


2020
Vol 29 (05)
pp. 2050010
Author(s):
Pola Lydia Lagari
Lefteri H. Tsoukalas
Isaac E. Lagaris

Stochastic Gradient Descent (SGD) is perhaps the most frequently used method for large-scale training. A common example is training a neural network over a large data set, which amounts to minimizing the corresponding mean squared error (MSE). Since the convergence of SGD is rather slow, acceleration techniques based on the notion of “mini-batches” have been developed. All of them, however, mimicking SGD, impose diminishing step sizes as a means to inhibit large variations in the MSE objective. In this article, we introduce random sets of mini-batches instead of individual mini-batches. We employ an objective function that minimizes the average MSE and its variance over these sets, thus eliminating the need for systematic step-size reduction. This approach permits the use of state-of-the-art optimization methods, far more efficient than gradient descent, and yields a significant performance enhancement.
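A hedged sketch of the stated idea follows: draw a set of mini-batches and take a constant-step gradient step on the average of the per-batch MSEs plus their variance. The linear model, the data, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: descend on mean(batch MSEs) + Var(batch MSEs) over a set
# of mini-batches, with a constant step size throughout.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=2000)

w = np.zeros(5)
step, n_batches, batch_size = 0.05, 8, 32
for it in range(500):
    idx_sets = [rng.choice(len(X), batch_size, replace=False)
                for _ in range(n_batches)]
    losses, grads = [], []
    for idx in idx_sets:
        r = X[idx] @ w - y[idx]                      # residuals of one mini-batch
        losses.append(np.mean(r ** 2))
        grads.append(2 * X[idx].T @ r / batch_size)  # gradient of that batch's MSE
    losses, grads = np.array(losses), np.array(grads)
    mean_grad = grads.mean(axis=0)
    # grad of Var(L_i) = (2/n) * sum_i (L_i - mean L) * grad L_i
    var_grad = 2 * ((losses - losses.mean())[:, None] * grads).mean(axis=0)
    w -= step * (mean_grad + var_grad)               # constant step size
print(np.round(w - w_true, 3))                       # should be near zero
```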


2019
Author(s):
Laura H. Tung
Mingfu Shao
Carl Kingsford

Abstract. Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts, due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long-read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop into Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100–4000 more known transcripts (for 18 human datasets) or 1100–2200 more (for eight mouse datasets) than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4–4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome.


2013
Vol 7 (1)
pp. 19-24
Author(s):  
Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. At times, where the end goal is to look for relationships between (or patterns within) different subgroups or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. For example, in anthropological microarray studies, such 'dimension reduction' techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the larger whole, much as polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of the variation in a data set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological datasets. New methods of analysis using PCA are also suggested, with tentative results outlined.
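As a concrete reference point, PCA reduces to centering the data and taking its singular value decomposition; a minimal sketch follows, with synthetic "samples x markers" data standing in for a real microarray matrix.

```python
# Minimal PCA via SVD of the centered data matrix; equivalent to
# eigendecomposition of the covariance matrix. Data are synthetic.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                 # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / (len(X) - 1)       # variance along each component
    scores = Xc @ Vt[:n_components].T       # sample coordinates ("scores")
    return scores, Vt[:n_components], explained / explained.sum()

rng = np.random.default_rng(0)
# 100 "samples" x 50 "markers", with structure along one latent direction.
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 50)) + 0.2 * rng.normal(size=(100, 50))
scores, components, ratio = pca(X, 2)
print(np.round(ratio[:2], 3))               # the first component dominates
```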

