Advances in the recovery of haplotypes from the metagenome

2016 ◽  
Author(s):  
Samuel M. Nicholls ◽  
Wayne Aubrey ◽  
Kurt de Grave ◽  
Leander Schietgat ◽  
Christopher J. Creevey ◽  
...  

Abstract High-throughput DNA sequencing has enabled us to look beyond consensus reference sequences to the variation observed in sequences within organisms: their haplotypes. Recovery, or assembly, of haplotypes has proved computationally difficult, and many probabilistic heuristics exist that attempt to recover the original haplotypes for a single organism of known ploidy. However, existing approaches make simplifications or assumptions that are easily violated when investigating sequence variation within a metagenome.

We propose the metahaplome as the set of haplotypes for any particular genomic region of interest within a metagenomic data set, and present Hansel and Gretel, a data structure and algorithm that together provide a proof-of-concept framework for the recovery of true haplotypes from a metagenomic data set. The algorithm performs incremental haplotype recovery using smoothed Naive Bayes, a simple, efficient and effective method.

Hansel and Gretel offer several advantages over existing solutions: the framework is capable of recovering haplotypes from metagenomes, does not require a priori knowledge about the input data, makes no assumptions regarding the distribution of alleles at variant sites, is robust to error, and uses all available evidence from aligned reads without altering or discarding observed variation. We evaluate our approach using synthetic metahaplomes constructed from sets of real genes and show that up to 99% of SNPs on a haplotype can be correctly recovered from short reads originating from a metagenomic data set.
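The core idea lends itself to a compact illustration. The following is a minimal sketch in the spirit of the framework, not the authors' implementation: co-occurrence counts of alleles at adjacent variant sites stand in for the Hansel data structure (the real structure also pairs non-adjacent sites), and a greedy walk guided by smoothed conditional probabilities stands in for Gretel. The toy read encoding and all names are our own.

```python
from collections import Counter

def pairwise_counts(reads):
    """Count co-occurrences of alleles at adjacent variant sites.
    Each read is a string with one symbol per variant site; '-' marks
    sites the read does not cover."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - 1):
            a, b = read[i], read[i + 1]
            if a != '-' and b != '-':
                counts[(i, a, b)] += 1
    return counts

def recover_haplotype(reads, n_sites, alleles='ACGT', alpha=0.01):
    """Greedily extend a haplotype, picking at each site the allele with the
    highest smoothed conditional probability given the previous choice."""
    counts = pairwise_counts(reads)
    first = Counter(r[0] for r in reads if r[0] != '-')
    haplotype = [first.most_common(1)[0][0]]   # most frequent allele at site 0
    for i in range(1, n_sites):
        prev = haplotype[-1]
        total = sum(counts[(i - 1, prev, b)] for b in alleles)
        # Additive smoothing keeps unseen transitions possible (never zero).
        best = max(alleles, key=lambda b: (counts[(i - 1, prev, b)] + alpha)
                                          / (total + alpha * len(alleles)))
        haplotype.append(best)
    return ''.join(haplotype)

reads = ["AC-", "ACG", "-CG", "TTG", "TT-"]
print(recover_haplotype(reads, 3))   # -> 'ACG'
```

In a real metagenomic setting the reads would come from a pileup of aligned sequences and the recovered haplotype would be subtracted and the search repeated, which is what makes the recovery incremental.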

2017 ◽  
Author(s):  
Samuel M. Nicholls ◽  
Wayne Aubrey ◽  
Kurt de Grave ◽  
Leander Schietgat ◽  
Christopher J. Creevey ◽  
...  

Abstract The cryptic diversity of microbial communities represents an untapped biotechnological resource for biomining, biorefining and synthetic biology. Revealing this information requires recovering the exact sequence of DNA bases (or "haplotype") that constitutes the genes and genomes of every individual present. This is a computationally difficult problem, complicated by the need for environmental sequencing approaches (metagenomics) because the constituent organisms resist culturing in vitro.

Haplotypes are identified by their unique combinations of DNA variants. However, standard approaches for working with metagenomic data require simplifications that violate assumptions in the process of identifying such variation. Furthermore, current haplotyping methods lack objective mechanisms for choosing between alternative haplotype reconstructions from microbial communities.

To address this, we have developed a novel probabilistic approach for reconstructing haplotypes from complex microbial communities, and propose the "metahaplome" as the set of haplotypes for any particular genomic region of interest within a metagenomic dataset. Implemented in the twin software tools Hansel and Gretel, the algorithm performs incremental probabilistic haplotype recovery using Naive Bayes, an efficient and effective technique.

Our approach is capable of reconstructing the haplotypes with the highest likelihoods from metagenomic datasets without a priori knowledge of, or assumptions about, the distribution or number of variants. Additionally, the algorithm is robust to sequencing and alignment error, uses all available evidence from aligned reads, and neither alters nor discards observed variation. We validate our approach using synthetic metahaplomes constructed from sets of real genes, and demonstrate its capability using metagenomic data from a complex HIV-1 strain mix. The results show that the likelihood framework can allow recovery from microbial communities of cryptic functional isoforms of genes with 100% accuracy.
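The emphasis on choosing between alternative reconstructions can be illustrated with a likelihood score. Below is a hedged sketch, not the published Hansel/Gretel scoring: each candidate haplotype is ranked by a Naive Bayes log-likelihood assembled from adjacent-site transition counts observed in the reads (toy data throughout).

```python
import math
from collections import Counter

def pair_counts(reads):
    """Adjacent-site allele co-occurrence counts ('-' = site not covered)."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - 1):
            if '-' not in (r[i], r[i + 1]):
                c[(i, r[i], r[i + 1])] += 1
    return c

def log_likelihood(hap, counts, alleles='ACGT', alpha=0.01):
    """Sum of smoothed log transition probabilities along the haplotype."""
    ll = 0.0
    for i in range(len(hap) - 1):
        total = sum(counts[(i, hap[i], b)] for b in alleles)
        p = (counts[(i, hap[i], hap[i + 1])] + alpha) / (total + alpha * len(alleles))
        ll += math.log(p)
    return ll

counts = pair_counts(["ACG", "ACG", "TTG", "TTG", "ATG"])
candidates = ["ACG", "TTG", "ATG"]
# Rank alternative reconstructions from most to least likely.
print(sorted(candidates, key=lambda h: -log_likelihood(h, counts)))
```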


2021 ◽  
Vol 4 (1) ◽  
pp. 251524592095492
Author(s):  
Marco Del Giudice ◽  
Steven W. Gangestad

Decisions made by researchers while analyzing data (e.g., how to measure variables, how to handle outliers) are sometimes arbitrary, without an objective justification for choosing one alternative over another. Multiverse-style methods (e.g., specification curve, vibration of effects) estimate an effect across an entire set of possible specifications to expose the impact of hidden degrees of freedom and/or obtain robust, less biased estimates of the effect of interest. However, if specifications are not truly arbitrary, multiverse-style analyses can produce misleading results, potentially hiding meaningful effects within a mass of poorly justified alternatives. So far, a key question has received scant attention: How does one decide whether alternatives are arbitrary? We offer a framework and conceptual tools for doing so. We discuss three kinds of a priori nonequivalence among alternatives—measurement nonequivalence, effect nonequivalence, and power/precision nonequivalence. The criteria we review lead to three decision scenarios: Type E decisions (principled equivalence), Type N decisions (principled nonequivalence), and Type U decisions (uncertainty). In uncertain scenarios, multiverse-style analysis should be conducted in a deliberately exploratory fashion. The framework is discussed with reference to published examples and illustrated with the help of a simulated data set. Our framework will help researchers reap the benefits of multiverse-style methods while avoiding their pitfalls.
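To make the mechanics concrete, here is a toy multiverse in which two analytic choices (an outlier rule and a transformation) are crossed and the effect is re-estimated under every specification; all names and data are hypothetical. Whether pooling these estimates is legitimate is exactly the Type E / Type N / Type U question: it requires arguing that the crossed alternatives are genuinely equivalent.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)           # true effect = 0.3
y[::50] += 5                               # inject a few outliers

# Two "researcher degrees of freedom", each with two alternatives.
outlier_rules = {
    'keep_all': lambda v: np.ones_like(v, dtype=bool),
    'trim_3sd': lambda v: np.abs(v - v.mean()) < 3 * v.std(),
}
transforms = {
    'raw': lambda v: v,
    'rank': lambda v: np.argsort(np.argsort(v)).astype(float),
}

estimates = []
for (o_name, keep), (t_name, f) in itertools.product(outlier_rules.items(),
                                                     transforms.items()):
    m = keep(y)
    beta = np.polyfit(f(x[m]), f(y[m]), 1)[0]   # slope of simple regression
    estimates.append((o_name, t_name, round(beta, 3)))

for spec in estimates:
    print(spec)   # the spread across specifications is the "multiverse"
```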


Genetics ◽  
2002 ◽  
Vol 161 (3) ◽  
pp. 1333-1337
Author(s):  
Thomas I Milac ◽  
Frederick R Adler ◽  
Gerald R Smith

Abstract We have determined the marker separations (genetic distances) that maximize the probability, or power, of detecting meiotic recombination deficiency when only a limited number of meiotic progeny can be assayed. We find that the optimal marker separation is as large as 30–100 cM in many cases. Provided the appropriate marker separation is used, small reductions in recombination potential (as little as 50%) can be detected by assaying a single interval in as few as 100 progeny. If recombination is uniformly altered across the genomic region of interest, the same sensitivity can be obtained by assaying multiple independent intervals in correspondingly fewer progeny. A reduction or abolition of crossover interference, with or without a reduction of recombination proficiency, can be detected with similar sensitivity. We present a set of graphs that display the optimal marker separation and the number of meiotic progeny that must be assayed to detect a given recombination deficiency in the presence of various levels of crossover interference. These results will aid the optimal design of experiments to detect meiotic recombination deficiency in any organism.
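A back-of-the-envelope calculation illustrates why intermediate separations are optimal. The sketch below is not the authors' code and makes two stated assumptions: recombinant fractions follow Haldane's mapping function, and a 50% reduction in recombination potential is modeled as a halved map distance; power is approximated with a one-sided normal test in n progeny.

```python
import math

def recombinant_fraction(d_cM):
    """Haldane's mapping function: map distance (cM) -> recombinant fraction."""
    return 0.5 * (1 - math.exp(-2 * d_cM / 100))

def power(d_cM, reduction=0.5, n=100, alpha_level=0.05):
    p0 = recombinant_fraction(d_cM)                    # wild type
    p1 = recombinant_fraction(d_cM * (1 - reduction))  # mutant (assumption)
    z_alpha = 1.645                                    # one-sided 5% level
    se0 = math.sqrt(p0 * (1 - p0) / n)
    se1 = math.sqrt(p1 * (1 - p1) / n)
    crit = p0 - z_alpha * se0                          # reject if below this
    z = (crit - p1) / se1
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))      # Phi(z)

for d in (10, 30, 50, 100, 150):
    print(d, round(power(d), 2))   # power peaks at intermediate separations
```

Under these assumptions, power for detecting a 50% deficiency in 100 progeny rises from roughly 0.4 at 10 cM to above 0.8 near 50 cM before declining again, consistent with the 30–100 cM optimum reported here.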


2015 ◽  
Vol 8 (2) ◽  
pp. 941-963 ◽  
Author(s):  
T. Vlemmix ◽  
F. Hendrick ◽  
G. Pinardi ◽  
I. De Smedt ◽  
C. Fayt ◽  
...  

Abstract. A 4-year data set of MAX-DOAS observations in the Beijing area (2008–2012) is analysed with a focus on NO2, HCHO and aerosols. Two very different retrieval methods are applied. Method A describes the tropospheric profile with 13 layers and makes use of the optimal estimation method. Method B uses 2–4 parameters to describe the tropospheric profile and an inversion based on a least-squares fit. For each constituent (NO2, HCHO and aerosols) the retrieval outcomes are compared in terms of tropospheric column densities, surface concentrations and "characteristic profile heights" (i.e. the height below which 75% of the vertically integrated tropospheric column density resides). We find the best agreement between the two methods for tropospheric NO2 column densities, with a standard deviation of relative differences below 10%, a correlation of 0.99 and a linear regression with a slope of 1.03. For tropospheric HCHO column densities we find a similar slope, but also a systematic bias of almost 10%, likely related to differences in profile height. Aerosol optical depths (AODs) retrieved with method B are 20% higher than those of method A and agree better with AERONET measurements, which are on average only 5% lower, albeit with considerable scatter in the relative differences (standard deviation ~25%). With respect to near-surface volume mixing ratios and aerosol extinction we find considerably larger relative differences: 10 ± 30%, −23 ± 28% and −8 ± 33% for aerosols, HCHO and NO2, respectively. The frequency distributions of these near-surface concentrations nevertheless agree quite well, indicating that near-surface concentrations derived from MAX-DOAS are certainly useful in a climatological sense. A major difference between the two methods is the dynamic range of retrieved characteristic profile heights, which is larger for method B than for method A. This effect is most pronounced for HCHO, where profile shapes retrieved with method A stay very close to the a priori, and moderate for NO2 and aerosol extinction, which on average show quite good agreement for characteristic profile heights below 1.5 km. One of the main advantages of method A is its stability, even under suboptimal conditions (e.g. in the presence of clouds). Method B is generally less stable, which probably explains a substantial part of the quite large relative differences between the two methods. However, despite the relatively low precision of individual profile retrievals, seasonally averaged profile heights retrieved with method B appear less biased towards a priori assumptions than those retrieved with method A. This gives confidence in a result obtained with method B that is especially relevant for the validation of satellite retrievals: aerosol extinction profiles tend on average to reach higher altitudes than NO2 profiles in spring and summer, whereas the two appear on average to be of the same height in winter.
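The "characteristic profile height" used as a comparison metric is a simple cumulative-fraction computation. A small sketch, with made-up layer boundaries and partial columns:

```python
import numpy as np

def characteristic_height(z_top, partial_columns, fraction=0.75):
    """Height below which `fraction` of the integrated column resides.
    z_top: upper boundary of each layer (km, ascending);
    partial_columns: vertically integrated amount in each layer."""
    cum = np.cumsum(partial_columns) / np.sum(partial_columns)
    return z_top[np.searchsorted(cum, fraction)]

z_top = np.array([0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0])   # km (hypothetical)
cols = np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2])    # arbitrary units
print(characteristic_height(z_top, cols))                # -> 1.0 (km)
```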


Geophysics ◽  
2007 ◽  
Vol 72 (1) ◽  
pp. F25-F34 ◽  
Author(s):  
Benoit Tournerie ◽  
Michel Chouteau ◽  
Denis Marcotte

We present and test a new method to correct for the static shift affecting magnetotelluric (MT) apparent resistivity sounding curves. We use geostatistical analysis of apparent resistivity and phase data for selected periods. For each period, we first estimate and model the experimental variograms and the cross variogram between phase and apparent resistivity. We then use the geostatistical model to estimate, by cokriging, the corrected apparent resistivities from the measured phases and apparent resistivities. The static shift factor is obtained as the difference between the logarithms of the corrected and measured apparent resistivities. We retain as final static shift estimates those for the period displaying the best correlation with the estimates at all periods. We present a 3D synthetic case study showing that the static shift is retrieved quite precisely when the static shift factors are uniformly distributed around zero. If the static shift distribution has a nonzero mean, the best results are obtained when an apparent resistivity data subset can be identified a priori as unaffected by static shift and cokriging is done using only this subset. The method has been successfully tested on the synthetic COPROD-2S2 2D MT data set and on a 3D-survey data set from Las Cañadas Caldera (Tenerife, Canary Islands) severely affected by static shift.
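The first geostatistical step can be sketched in a few lines. This is a hedged illustration of the experimental cross variogram between log apparent resistivity and phase over station pairs, binned by separation distance; the station data are synthetic and the variogram modeling and cokriging steps are omitted.

```python
import numpy as np

def cross_variogram(xy, a, b, bins):
    """Experimental cross variogram: for each lag bin, the mean over station
    pairs of 0.5 * (a_i - a_j) * (b_i - b_j)."""
    n = len(xy)
    lags, prods = [], []
    for i in range(n):
        for j in range(i + 1, n):
            lags.append(np.linalg.norm(xy[i] - xy[j]))
            prods.append(0.5 * (a[i] - a[j]) * (b[i] - b[j]))
    lags, prods = np.array(lags), np.array(prods)
    idx = np.digitize(lags, bins)
    return [prods[idx == k].mean() for k in range(1, len(bins))]

rng = np.random.default_rng(1)
xy = rng.uniform(0, 10, size=(50, 2))              # station positions (km)
phase = rng.normal(45, 5, size=50)                 # degrees (synthetic)
log_rho = 0.02 * phase + rng.normal(0, 0.1, 50)    # correlated log resistivity
print(cross_variogram(xy, log_rho, phase, bins=np.arange(0, 12, 2)))
```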


Paleobiology ◽  
2016 ◽  
Vol 43 (1) ◽  
pp. 68-84 ◽  
Author(s):  
Bradley Deline ◽  
William I. Ausich

Abstract A priori choices in the detail and breadth of a study are important in addressing scientific hypotheses. In particular, choices in the number and type of characters can greatly influence the results of studies of morphological diversity. A new character suite was constructed to examine trends in the disparity of early Paleozoic crinoids. Character-based rarefaction analysis indicated that a small subset of these characters (~20% of the complete data set) could capture most of the properties of the entire data set in analyses of crinoids as a whole, of noncamerate crinoids, and, to a lesser extent, of camerate crinoids. This pattern may result from covariance between characters and from the characterization of rare morphologies that are not represented on the primary axes of morphospace. Shifting emphasis onto different body regions (oral system, calyx, periproct system, and pelma) also influenced estimates of relative disparity between subclasses of crinoids. Given these results, morphological studies should include a pilot analysis to better determine the amount and type of data needed to address specific scientific hypotheses.
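Character-based rarefaction of this kind can be sketched simply. The following toy version assumes a binary character matrix and measures how well a random subset of characters reproduces the full-matrix disparity structure; the real study used its own character suite and disparity metrics.

```python
import numpy as np

rng = np.random.default_rng(2)

def pairwise(matrix):
    """Pairwise taxon distances: fraction of characters that differ."""
    n = len(matrix)
    return np.array([np.mean(matrix[i] != matrix[j])
                     for i in range(n) for j in range(i + 1, n)])

def rarefaction_fidelity(matrix, k, trials=100):
    """Correlation between taxon distances from k random characters and
    distances from the full matrix, averaged over random subsets."""
    full = pairwise(matrix)
    cors = [np.corrcoef(pairwise(matrix[:, rng.choice(matrix.shape[1], k,
                                                      replace=False)]),
                        full)[0, 1]
            for _ in range(trials)]
    return float(np.mean(cors))

taxa, chars = 20, 100
matrix = rng.integers(0, 2, size=(taxa, chars))   # toy binary character matrix
for k in (5, 10, 20, 50, 100):
    print(k, round(rarefaction_fidelity(matrix, k), 2))
# With independent random characters the curve rises slowly; covariance among
# real characters is what lets ~20% of them capture most of the structure.
```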


2014 ◽  
Vol 104 (10) ◽  
pp. 1125-1129 ◽  
Author(s):  
A. H. Stobbe ◽  
W. L. Schneider ◽  
P. R. Hoyt ◽  
U. Melcher

Next-generation sequencing (NGS) is not commonly used in diagnostics, in part because of the large amount of time and computational power needed to identify the taxonomic origin of each sequence in an NGS data set. By using unassembled NGS data sets as the target for searches, pathogen-specific sequences, termed e-probes, can be used as queries to detect specific viruses or organisms in plant sample metagenomes. This method, designated the e-probe diagnostic nucleic acid assay, was first tested with mock sequence databases and then with NGS data sets generated from plants infected with a DNA virus (Bean golden yellow mosaic virus, BGYMV) or an RNA virus (Plum pox virus, PPV). In addition, the ability to detect and differentiate among strains of a single virus species, PPV, was examined by using strain-specific probe sets. The use of probe sets for multiple viruses determined that one sample was dually infected with BGYMV and Bean golden mosaic virus.
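The key efficiency trick, querying probes against raw reads rather than classifying every read, is easy to schematize. The sketch below uses naive exact matching and entirely hypothetical probe and read sequences; the actual assay uses similarity searches scored against mock databases.

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(str.maketrans('ACGT', 'TGCA'))[::-1]

def probe_hits(reads, probes):
    """Count reads containing each probe on either strand."""
    hits = {p: 0 for p in probes}
    for read in reads:
        rc = revcomp(read)
        for p in probes:
            if p in read or p in rc:
                hits[p] += 1
    return hits

reads = ["TTGACCGTAAGC", "CCGTACGGTTAA", "AAGCTTGACCGT"]   # toy NGS reads
probes = {"GACCGT": "BGYMV-like probe", "CGTACG": "PPV-like probe"}
for p, n in probe_hits(reads, probes).items():
    print(probes[p], n)
```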


2018 ◽  
Vol 4 (1) ◽  
pp. 331-335
Author(s):  
David Schote ◽  
Tim Pfeiffer ◽  
Georg Rose

Abstract Computed tomography (CT) scans are frequently used intraoperatively, for example to control the positioning of implants during an intervention. Often, a full field of view is unnecessary to provide the required information. Instead, region-of-interest (ROI) imaging can be performed, allowing a substantial reduction in the applied X-ray dose. However, ROI imaging leads to data inconsistencies caused by the truncation of the projections, and this lack of information severely impairs the quality of the reconstructed images. This study presents a proof of concept for a new approach that combines the incomplete CT data with ultrasound data and time-of-flight measurements in order to restore some of the missing information. The routine is evaluated in a simulation study using the original Shepp-Logan phantom in ROI cases with different degrees of truncation. Image quality is assessed by means of the normalized root mean square error. The proposed method significantly reduces truncation artifacts in the reconstructions and achieves considerable reductions in radiation exposure.
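The quality metric named here is standard and compact enough to write out. A minimal sketch, with placeholder arrays standing in for the Shepp-Logan phantom and its reconstruction, and normalization by the ground truth's dynamic range (one common convention):

```python
import numpy as np

def nrmse(reconstruction, ground_truth):
    """Normalized root mean square error against a ground-truth image."""
    rmse = np.sqrt(np.mean((reconstruction - ground_truth) ** 2))
    return rmse / (ground_truth.max() - ground_truth.min())

phantom = np.zeros((64, 64))
phantom[16:48, 16:48] = 1.0                      # stand-in for Shepp-Logan
recon = phantom + np.random.default_rng(3).normal(0, 0.05, phantom.shape)
print(round(float(nrmse(recon, phantom)), 4))    # ~0.05 for 5% noise
```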


2020 ◽  
Vol 2 (Supplement_4) ◽  
pp. iv3-iv14
Author(s):  
Niha Beig ◽  
Kaustav Bera ◽  
Pallavi Tiwari

Abstract Neuro-oncology largely consists of malignancies of the brain and central nervous system, including both primary and metastatic tumors. Currently, a significant clinical challenge in neuro-oncology is to tailor therapies for patients based on a priori knowledge of their survival outcome or treatment response to conventional or experimental therapies. Radiomics, or the quantitative extraction of subvisual data from conventional radiographic imaging, has recently emerged as a powerful data-driven approach to offer insights into clinically relevant questions related to diagnosis, prediction, prognosis, and the assessment of treatment response. Furthermore, radiogenomic approaches provide a mechanism to establish statistical correlations of radiomic features with point mutations and next-generation sequencing data, further leveraging the potential of routine MRI scans to serve as "virtual biopsy" maps. In this review, we provide an introduction to radiomic and radiogenomic approaches in neuro-oncology. We briefly describe the workflow, involving preprocessing, tumor segmentation, and extraction of "hand-crafted" features from the segmented region of interest, as well as the identification of radiogenomic associations that could ultimately lead to the development of reliable prognostic and predictive models in neuro-oncology applications. Lastly, we discuss the promise of radiomic and radiogenomic approaches for personalizing treatment decisions in neuro-oncology, as well as the challenges of clinical adoption, which will depend heavily on demonstrated resilience to nonstandardized imaging protocols across sites and scanners, and on reproducibility across large multi-institutional cohorts.
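To give a flavor of the "hand-crafted" feature step, here is a toy illustration of first-order radiomic features computed over a segmented region of interest. The image and mask are synthetic and the feature list is a small sample; real pipelines also extract shape and texture feature families (e.g. via pyradiomics).

```python
import numpy as np

def first_order_features(image, mask):
    """A few first-order statistics over the voxels inside the ROI mask."""
    roi = image[mask]
    hist, _ = np.histogram(roi, bins=32, density=True)
    p = hist[hist > 0] / hist[hist > 0].sum()     # intensity probabilities
    return {
        'mean': float(roi.mean()),
        'std': float(roi.std()),
        'skewness': float(((roi - roi.mean()) ** 3).mean() / roi.std() ** 3),
        'entropy': float(-(p * np.log2(p)).sum()),
    }

rng = np.random.default_rng(4)
image = rng.normal(100, 20, size=(64, 64))        # synthetic MRI slice
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                         # the segmented "tumor"
print(first_order_features(image, mask))
```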


Author(s):  
N. Seube

Abstract. This paper introduces a new method for validating the precision of an airborne or mobile LiDAR data set. The proposed method is based on a Combined Standard Measurement Uncertainty (CSMU) model, which describes the covariance matrix, and thus the uncertainty ellipsoid, of each LiDAR point. The model we consider includes timing errors and, most importantly, the incidence angle of the LiDAR beam. After describing the relationship between the beam incidence and the uncertainty of other variables (especially attitude uncertainty), we show that we can construct a CSMU model giving the covariance of each point as a function of the relative geometry between the LiDAR beam and the point normal. The validation method we propose consists of comparing the CSMU model (the predictive, a priori uncertainty) to the Standard Deviation Along the Surface Normal (SDASN) for every quasi-planar segment of the point cloud. Whenever the a posteriori (i.e. observed via the SDASN) level of uncertainty is greater than the a priori (i.e. expected) level of uncertainty, the point fails the validation test. We illustrate this approach on a dataset acquired by a Microdrones mdLiDAR1000 system.
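The validation test itself reduces to a simple comparison per segment. A sketch with synthetic data and hypothetical names: fit a plane to a quasi-planar segment, take the standard deviation of residuals along the fitted normal (the SDASN), and compare it with the a priori standard uncertainty predicted by the CSMU model for those points.

```python
import numpy as np

def sdasn(points):
    """Std of point residuals along the best-fit plane normal (via SVD)."""
    centered = points - points.mean(axis=0)
    normal = np.linalg.svd(centered, full_matrices=False)[2][-1]
    return float(np.std(centered @ normal))

rng = np.random.default_rng(5)
n = 500
pts = np.column_stack([rng.uniform(0, 10, n), rng.uniform(0, 10, n),
                       rng.normal(0, 0.02, n)])   # ~2 cm planar noise

observed = sdasn(pts)                             # a posteriori uncertainty
predicted = 0.025                                 # CSMU a priori sigma (made up)
print('pass' if observed <= predicted else 'fail', round(observed, 4))
```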

