A Phylogenomic Supertree of Birds

Rebecca T. Kimball; Carl H. Oliveros; Ning Wang; Noor D. White; F. Keith Barker; Daniel J. Field; Daniel T. Ksepka; R. Terry Chesser; Robert G. Moyle; Michael J. Braun; Robb T. Brumfield; Brant C. Faircloth; Brian Tilston Smith; Edward L. Braun

doi:10.3390/d11070109

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.

Rapture-ready darters: choice of reference genome and genotyping method (whole-genome or sequence capture) influence population genomic inference in Etheostoma

10.1101/2020.05.21.108274 ◽

2020 ◽

Author(s):

Brendan N. Reid ◽

Rachel L. Moran ◽

Christopher J. Kopack ◽

Sarah W. Fitzpatrick

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Low Cost ◽

Read Depth ◽

Model Organisms ◽

Whole Genome ◽

Reduced Representation ◽

Sequence Capture ◽

Population Genomic ◽

The Impact

Abstract. Researchers studying non-model organisms have an increasing number of methods available for generating genomic data. However, the applicability of different methods across species, as well as the effect of reference genome choice on population genomic inference, are still difficult to predict in many cases. We evaluated the impact of data type (whole-genome vs. reduced representation) and reference genome choice on data quality and on population genomic and phylogenomic inference across several species of darters (subfamily Etheostomatinae), a highly diverse radiation of freshwater fish. We generated a high-quality reference genome and developed a hybrid RADseq/sequence capture (Rapture) protocol for the Arkansas darter (Etheostoma cragini). Rapture data from 1900 individuals spanning four darter species showed recovery of most loci across darter species at high depth and consistent estimates of heterozygosity regardless of reference genome choice. Loci with baits spanning both sides of the restriction enzyme cut site performed especially well across species. For low-coverage whole-genome data, choice of reference genome affected read depth and inferred heterozygosity. For similar amounts of sequence data, Rapture performed better at identifying fine-scale genetic structure compared to whole-genome sequencing. Rapture loci also recovered an accurate phylogeny for the study species and demonstrated high phylogenetic informativeness across the evolutionary history of the genus Etheostoma. Low cost and high cross-species effectiveness regardless of reference genome suggest that Rapture and similar sequence capture methods may be worthwhile choices for studies of diverse species radiations.

Finding functional disease-associated non-coding variation using next-generation sequencing

10.1101/060285 ◽

2016 ◽

Author(s):

Paolo Devanna ◽

Xiaowei Sylvia Chen ◽

Joses Ho ◽

Dario Gajewski ◽

Alessandro Gialluisi ◽

...

Keyword(s):

Next Generation Sequencing ◽

Binding Sites ◽

Large Scale ◽

Sequence Data ◽

Whole Genome Sequence ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Whole Exome ◽

Generation Sequencing

Abstract. Next-generation sequencing has opened the way for the large-scale interrogation of cohorts at the whole-exome or whole-genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein-coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large-scale analysis, as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language-impaired children. Using whole-exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to SLI. This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. We propose that this strategy should be routinely applied to whole-exome and whole-genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
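
The abstract notes that 3′UTR-microRNA interactions rest on complementary base pairing and can therefore be predicted in silico. As a rough illustration only, the Python sketch below looks for perfect matches to a microRNA seed and checks whether a single-base substitution removes one. It assumes the commonly used seed definition (microRNA nucleotides 2-8) and strict Watson-Crick pairing. The function names and the toy 3′UTR are hypothetical, and this is not the authors' pipeline, which the abstract does not describe in detail.

```python
# Hypothetical illustration of seed-based target-site prediction.
# Assumptions (not from the abstract): seed = miRNA nucleotides 2-8,
# perfect Watson-Crick pairing only, sequences given 5'->3'.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
SEED_LENGTH = 7


def seed_site(mirna: str) -> str:
    """Return the mRNA sequence (5'->3') that pairs with miRNA nucleotides 2-8."""
    seed = mirna.upper().replace("T", "U")[1:1 + SEED_LENGTH]
    # The target site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_sites(utr: str, mirna: str) -> list[int]:
    """Return 0-based start positions of perfect seed matches in the 3'UTR."""
    utr = utr.upper().replace("T", "U")
    site = seed_site(mirna)
    last_start = len(utr) - len(site)
    return [i for i in range(last_start + 1) if utr[i:i + len(site)] == site]


def variant_disrupts_site(utr: str, mirna: str, pos: int, alt: str) -> bool:
    """True if substituting `alt` at 0-based `pos` removes a seed match covering `pos`."""
    before = find_sites(utr, mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    after = set(find_sites(mutated, mirna))
    return any(s <= pos < s + SEED_LENGTH and s not in after for s in before)


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # mature let-7a sequence
    toy_utr = "AAACUACCUCAAA"        # made-up 3'UTR fragment containing one site
    print(find_sites(toy_utr, let7a))
    print(variant_disrupts_site(toy_utr, let7a, 5, "G"))
```

Real target-prediction tools also weigh site context, conservation and non-canonical pairing, so a perfect-seed scan like this is only a first filter.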

SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

The American Journal of Human Genetics ◽

10.1016/j.ajhg.2017.05.017 ◽

2017 ◽

Vol 101 (1) ◽

pp. 115-122 ◽

Cited By ~ 5

Author(s):

Di Zhang ◽

Linhai Zhao ◽

Biao Li ◽

Zongxiao He ◽

Gao T. Wang ◽

...

Keyword(s):

Rare Variant ◽

Large Scale ◽

Sequence Data ◽

Association Studies ◽

Complete Analysis ◽

Analysis Tool ◽

Whole Genome ◽

Rare Variant Association ◽

Exome Sequence Data ◽

Exome Sequence

DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework

Computational and Mathematical Methods in Medicine ◽

10.1155/2020/7231205 ◽

2020 ◽

Vol 2020 ◽

pp. 1-7

Author(s):

Po-Jung Huang ◽

Jui-Huan Chang ◽

Hou-Hsien Lin ◽

Yu-Xuan Li ◽

Chi-Ching Lee ◽

...

Keyword(s):

Genome Analysis ◽

Genetic Variants ◽

Large Scale ◽

Sequence Data ◽

Classification Model ◽

Whole Genome Sequence ◽

Small Scale ◽

Whole Genome ◽

Gold Standard Method ◽

Computing Framework

Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK is a gold standard method for the identification of genetic variants and has been widely used in genome projects and population genetic studies for many years. More recently, the Google Brain team developed DeepVariant, a new method that uses deep neural networks to build an image classification model for identifying genetic variants. However, the superior accuracy of DeepVariant comes at the cost of computational intensity, largely constraining its applications. Accordingly, we present DeepVariant-on-Spark to optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make DeepVariant-on-Spark more accessible to everyone, we have deployed DeepVariant-on-Spark to the Google Cloud Platform (GCP). Users can deploy DeepVariant-on-Spark on the GCP within 20 minutes by following our instructions and start to analyze at least ten whole-genome sequencing datasets using free credits provided by the GCP. DeepVariant-on-Spark is freely available for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary study, while retaining the flexibility and scalability needed for large-scale sequencing projects.

Efficient Merging of Genome Profile Alignments

10.1101/309047 ◽

2018 ◽

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Data Sets ◽

Whole Genome ◽

Multiple Sequence ◽

Construction Methods ◽

Current Implementation ◽

Whole Genomes

Abstract. Motivation: Whole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results: Here, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide-and-conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial data sets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool. Availability: GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs. Contact: [email protected]

Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost efficient approach

10.1101/421768 ◽

2018 ◽

Author(s):

Yanjun Zan ◽

Thibaut Payen ◽

Mette Lillie ◽

Christa F. Honaker ◽

Paul B. Siegel ◽

...

Keyword(s):

High Resolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Genotype Imputation ◽

Whole Genome ◽

Efficient Manner ◽

Founder Line ◽

Cost Efficient ◽

Low Coverage

Abstract. Background: Experimental intercrosses between outbred founder populations are powerful resources for mapping loci contributing to complex traits (Quantitative Trait Loci or QTL). Here, we present an approach and accompanying software for high-resolution genotype imputation in such populations using whole-genome high-coverage sequence data on founder individuals (∼30×) and low-coverage sequence data on intercross individuals (∼0.4×). The method is illustrated in a large F2 pedigree between lines of chickens that have been divergently selected for 40 generations for the same trait (body weight at 8 weeks of age). Results: Described is how hundreds of individuals were whole-genome sequenced in a cost- and time-efficient manner using a Tn5-based library preparation protocol optimized for this application. In total, 7.6M markers segregated in this pedigree and 10.0 to 13.7% were informative for imputing the founder line genotypes within the F0-F2 families. The genotypes imputed from low-coverage sequence data were consistent with the founder line genotypes estimated using SNP and microsatellite markers both at individual imputed sites (92%) and across the genome of individual chickens (93%). The resolution of the recombination breakpoints was high, with 50% being resolved within <10 kb. Conclusions: A method for genotype imputation from low-coverage whole-genome sequencing in outbred intercrosses is described and evaluated. By applying it to an outbred chicken F2 cross, it is illustrated that it provides high-quality, high-resolution genotypes in a time- and cost-efficient manner.

Fast and accurate statistical inference of phylogenetic networks using large-scale genomic sequence data

10.1101/132795 ◽

2017 ◽

Cited By ~ 1

Author(s):

Hussein A. Hejase ◽

Natalie VandePol ◽

Gregory M. Bonito ◽

Kevin J. Liu

Keyword(s):

Gene Flow ◽

Large Scale ◽

Genomic Sequence ◽

State Of The Art ◽

Sequence Data ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Performance Study ◽

Art Methods

Abstract. An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organisms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow? The answer to this fundamental question not only depends upon densely sampled and divergent genomic sequence data, but also computational methods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy.

Efficient merging of genome profile alignments

Bioinformatics ◽

10.1093/bioinformatics/btz377 ◽

2019 ◽

Vol 35 (14) ◽

pp. i71-i80

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Supplementary Information ◽

Whole Genome ◽

Multiple Sequence ◽

Genome Profile ◽

Construction Methods ◽

Profile Alignment

Abstract. Motivation: Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results: Here, we present genome profile alignment (GPA), an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide-and-conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool. Availability and implementation: GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs. Supplementary information: Supplementary data are available at Bioinformatics online.
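
The abstract describes a data structure with a bidirectional mapping between individual sequence coordinates and alignment coordinates, used to bring different coordinate systems into a common one. The Python sketch below illustrates that general idea for a single collinear alignment block. The class `RowMap` and the function `transfer` are hypothetical names introduced here; this is not the SuperGenome implementation (which is in Java), and it deliberately ignores the genomic rearrangements that the real structure must handle.

```python
# Hypothetical illustration of a bidirectional sequence <-> alignment
# coordinate map for one gapped alignment row (single collinear block only).
from typing import Optional


class RowMap:
    """Maps ungapped sequence positions to alignment columns and back."""

    def __init__(self, aligned_row: str, gap: str = "-") -> None:
        # seq_to_aln[i] is the alignment column of sequence position i.
        self.seq_to_aln: list[int] = []
        # aln_to_seq[c] is the sequence position in column c, or None for a gap.
        self.aln_to_seq: list[Optional[int]] = []
        for col, char in enumerate(aligned_row):
            if char == gap:
                self.aln_to_seq.append(None)
            else:
                self.aln_to_seq.append(len(self.seq_to_aln))
                self.seq_to_aln.append(col)


def transfer(pos: int, src: RowMap, dst: RowMap) -> Optional[int]:
    """Translate a position in the source genome to the destination genome.

    Goes through the shared alignment column; returns None when the
    destination has a gap in that column.
    """
    column = src.seq_to_aln[pos]
    return dst.aln_to_seq[column]


if __name__ == "__main__":
    genome_a = RowMap("AC-GT")
    genome_b = RowMap("A-TGT")
    print(transfer(2, genome_a, genome_b))  # position of A's 'G' in genome B
    print(transfer(1, genome_a, genome_b))  # A's 'C' faces a gap in genome B
```

Storing both directions as plain arrays makes each lookup constant time, at the cost of memory proportional to the alignment length. A structure that also handles rearranged blocks would need more than this.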

A Maize Practical Haplotype Graph Leverages Diverse NAM Assemblies

10.1101/2020.08.31.268425 ◽

2020 ◽

Author(s):

Jose A. Valdes Franco ◽

Joseph L. Gage ◽

Peter J. Bradbury ◽

Lynn C. Johnson ◽

Zachary R. Miller ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Recombinant Inbred Lines ◽

Imputation Accuracy ◽

Haplotype Diversity ◽

Structural Diversity ◽

Efficient Manner ◽

Large Populations ◽

Tropical Germplasm ◽

Transposon Activity

Abstract. As a result of millions of years of transposon activity, multiple rounds of ancient polyploidization, and large populations that preserve diversity, maize has an extremely structurally diverse genome, evidenced by high-quality genome assemblies that capture substantial levels of both tropical and temperate diversity. We generated a pangenome representation (the Practical Haplotype Graph, PHG) of these assemblies in a database, representing the pangenome haplotype diversity and providing an initial estimate of structural diversity. We leveraged the pangenome to accurately impute haplotypes and genotypes of taxa using various kinds of sequence data, ranging from WGS to extremely low-coverage GBS. We imputed the genotypes of the recombinant inbred lines of the NAM population with over 99% mean accuracy, while unrelated germplasm attained a mean imputation accuracy of 92% or 95% when using GBS or WGS data, respectively. Most of the imputation errors occur in haplotypes within European or tropical germplasm, which have yet to be represented in the maize PHG database. Also, the PHG stores the imputation data in a 30,000-fold more space-efficient manner than a standard genotype file, which is a key improvement when dealing with large-scale data.

Population Genotype Calling from Low-coverage Sequencing Data

10.1101/085936 ◽

2016 ◽

Author(s):

Lin Huang ◽

Petr Danecek ◽

Sivan Bercovici ◽

Serafim Batzoglou

Keyword(s):

Large Scale ◽

Whole Genome ◽

Sequencing Data ◽

Efficient Manner ◽

Entire Cohort ◽

The Public ◽

Wide Range ◽

Scale Population ◽

Cost Efficient ◽

Low Coverage

In recent years, several large-scale whole-genome projects sequencing tens of thousands of individuals have been completed, with larger studies underway. These projects aim to provide high-quality genotypes for a large number of whole genomes in a cost-efficient manner, by sequencing each genome at low coverage and subsequently identifying alleles jointly in the entire cohort. Here we present Ref-Reveel, a novel method for large-scale population genotyping. We show that Ref-Reveel provides genotyping at a higher accuracy and higher efficiency in comparison to existing methods by applying our method to one of the largest whole-genome sequencing datasets presently available to the public. We further show that utilizing the resulting genotype panel as a reference, through the Ref-Reveel framework, greatly improves the ability to call genotypes accurately on newly sequenced genomes. In addition, we present a Ref-Reveel pipeline that is applicable to genotyping very small datasets. In summary, Ref-Reveel is an accurate and scalable method applicable to a wide range of genotyping scenarios, and it will greatly improve the quality of calling genomic alterations in current and future large-scale sequencing projects.
