FUSTr: a tool to find gene Families Under Selection in Transcriptomes


PeerJ ◽

10.7717/peerj.4234 ◽

2018 ◽

Vol 6 ◽

pp. e4234 ◽

Cited By ~ 6

Author(s):

T. Jeffrey Cole ◽

Michael S. Brewer

Keyword(s):

Molecular Evolution ◽

Positive Selection ◽

High Performance ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Strong Positive Selection ◽

Transcriptomic Data ◽

Downstream Analysis ◽

User Friendly

Background: The recent proliferation of large amounts of biodiversity transcriptomic data has resulted in an ever-expanding need for scalable and user-friendly tools capable of answering large-scale molecular evolution questions. FUSTr identifies gene families involved in the process of adaptation. The tool finds genes under strong positive selection in transcriptomic datasets and automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. Results: When applied to previously studied spider transcriptomic data as well as simulated data, FUSTr successfully grouped coding sequences into proper gene families and correctly identified those under strong positive selection in relatively little time. Conclusions: FUSTr provides a useful tool for novice bioinformaticians to characterize the molecular evolution of organisms throughout the tree of life using large transcriptomic biodiversity datasets, and it can utilize multi-processor high-performance computational facilities.


Global dissection of the BAHD acyltransferase gene family in soybean: Expression profiling, metabolic functions, and evolution

10.21203/rs.2.21482/v1 ◽

2020 ◽

Author(s):

Muhammad Zulfiqar Ahmad ◽

Xiangsheng Zeng ◽

Qiang Dong ◽

Sehrish Manan ◽

Huanan Jin ◽

...

Keyword(s):

Positive Selection ◽

Gene Family ◽

Expression Profiles ◽

Plant Defence ◽

Constitutive Expression ◽

Gene Families ◽

Strong Positive Selection ◽

Genome Wide ◽

A Genome ◽

Bahd Acyltransferase

Abstract Background: Members of the BAHD acyltransferase (ACT) family play important roles in plant defence against biotic and abiotic stresses. Previous genome-wide studies explored different acyltransferase gene families, but no study has so far reported an overall genome-wide or positive selection analysis of the BAHD family genes in Glycine max. A better understanding of the functions that specific members of this family play in stress defence can lead to better breeding strategies for stress tolerance. Results: A total of 103 genes of the BAHD family (GmACT genes) were mined from the soybean genome and could be grouped into four phylogenetic clades (I-IV). Clade III was further divided into two sub-clades (IIIA and IIIB). Within each clade, gene structures and motifs were relatively conserved. These 103 genes were distributed unevenly across all 20 chromosomes, and 16 paralogous pairs were found within the family. Positive selection analysis revealed important amino acids under strong positive selection, which suggests that the evolution of this gene family modulated soybean domestication. Expression of most ACT genes in soybean was repressed by Al3+ and fungal elicitor exposure, except for GmACT84, whose expression increased 2- and 3-fold under these conditions, respectively. The promoter region of GmACT84 contains the maximum number of stress-responsive elements among all GmACT genes and is especially enriched in MYB-related elements. Some GmACT genes were expressed only under specific conditions, while others showed constitutive expression in all soybean tissues or conditions analysed. Conclusions: This study provided a genome-wide analysis of the BAHD gene family and assessed the expression profiles of its members. We found evidence of strong positive selection on GmACT genes. Our findings will support efforts to functionally characterise ACT genes in soybean in order to discover their involvement in growth, development, and defence mechanisms.


Large genetic diversity and strong positive selection in F-box and GPCR genes among the wild isolates of Caenorhabditis elegans

Genome Biology and Evolution ◽

10.1093/gbe/evab048 ◽

2021 ◽

Author(s):

Fuqiang Ma ◽

Chun Yin Lau ◽

Chaogu Zheng

Keyword(s):

Population Structure ◽

Caenorhabditis Elegans ◽

Positive Selection ◽

Model Organism ◽

Gene Families ◽

Extended Haplotype ◽

Strong Positive Selection ◽

C Elegans ◽

Population Structure Analysis ◽

Wild Strains

Abstract The F-box and chemosensory GPCR (csGPCR) gene families are greatly expanded in nematodes, including the model organism Caenorhabditis elegans, compared to insects and vertebrates. However, the intraspecific evolution of these two gene families in nematodes remains unexamined. In this study, we analyzed the genomic sequences of 330 recently sequenced wild isolates of C. elegans using a range of population genetics approaches. We found that F-box and csGPCR genes, especially the Srw family csGPCRs, showed much more diversity than other gene families. Population structure analysis and phylogenetic analysis divided the wild strains into eight non-Hawaiian and three Hawaiian subpopulations. Some Hawaiian strains appeared to be more ancestral than all other strains. F-box and csGPCR genes maintained a large number of the ancestral variants in the Hawaiian subpopulation, and their divergence among the non-Hawaiian subpopulations contributed significantly to population structure. F-box genes are mostly located on the chromosomal arms, and a high recombination rate correlates with their large polymorphism. Moreover, using both neutrality tests and Extended Haplotype Homozygosity analysis, we identified signatures of strong positive selection in the F-box and csGPCR genes among the wild isolates, especially in the non-Hawaiian population. Accumulation of high-frequency derived alleles in these genes was found in the non-Hawaiian population, leading to divergence from the ancestral genotype. In summary, we found that F-box and csGPCR genes harbour a large pool of natural variants, which may be subjected to positive selection. These variants are mostly mapped to the substrate-recognition domains of F-box proteins and the extracellular and intracellular regions of csGPCRs, possibly resulting in advantages during adaptation by affecting protein degradation and the sensing of environmental cues, respectively.


Large genetic diversity and strong positive selection in F-box and GPCR genes among the wild isolates of Caenorhabditis elegans

10.1101/2020.07.09.194670 ◽

2020 ◽

Author(s):

Fuqiang Ma ◽

Chun Yin Lau ◽

Chaogu Zheng

Keyword(s):

Population Structure ◽

Gene Flow ◽

Caenorhabditis Elegans ◽

Positive Selection ◽

Recombination Rate ◽

Selective Sweep ◽

Gene Families ◽

Strong Positive Selection ◽

High Recombination Rate ◽

C Elegans

Abstract The F-box and chemosensory GPCR (csGPCR) gene families are greatly expanded in nematodes, including the model organism Caenorhabditis elegans, compared to insects and vertebrates. However, the intraspecific evolution of these two gene families in nematodes remains unexamined. In this study, we analyzed the genomic sequences of 330 recently sequenced wild isolates of C. elegans using a range of population genetics approaches. We found that F-box and csGPCR genes, especially the Srw family csGPCRs, showed much more diversity than other gene families. Population structure analysis and phylogenetic analysis divided the wild strains into eight non-Hawaiian and three Hawaiian subpopulations. Some Hawaiian strains appeared to be more ancestral than all other strains. F-box and csGPCR genes maintained a large number of the ancestral variants in the Hawaiian subpopulation, and their divergence among the non-Hawaiian subpopulations contributed significantly to population structure. These genes are mostly located on the chromosomal arms, and a high recombination rate correlates with their large polymorphism. Gene flow might also contribute to their diversity. Moreover, we identified signatures of strong positive selection in the F-box and csGPCR genes in the non-Hawaiian population using both neutrality tests and Extended Haplotype Homozygosity analysis. Accumulation of high-frequency derived alleles in these genes was found in the non-Hawaiian population, leading to divergence from the ancestral genotype found in Hawaiian strains. In summary, we found that F-box and csGPCR genes harbour a large pool of natural variants, which may have been subjected to positive selection during the recent selective sweep in the non-Hawaiian population. These variants are mostly mapped to the substrate-recognition domains of F-box proteins and the extracellular regions of csGPCRs, possibly resulting in advantages during adaptation by affecting protein degradation and the sensing of environmental cues, respectively.

Significance statement The small nematode Caenorhabditis elegans has emerged as an important organism in studying the genetic mechanisms of evolution. F-box and chemosensory GPCR are two of the largest gene families in C. elegans, but their intraspecific evolution within C. elegans has not been studied before. In this work, using the nonsynonymous SNV data of 330 C. elegans wild isolates, we found that F-box and chemosensory GPCR genes showed larger polymorphisms and stronger positive selection than other genes. The large diversity is likely the result of rapid gene family expansion, high recombination rate, and gene flow. Analysis of subpopulations suggests that positive selection of these genes occurred most strongly in the non-Hawaiian population, which underwent a selective sweep possibly linked to human activities.


Artificial Intelligence-Assisted Colonoscopy for Detection of Colon Polyps: a Prospective, Randomized Cohort Study

Journal of Gastrointestinal Surgery ◽

10.1007/s11605-020-04802-4 ◽

2020 ◽

Author(s):

Yuchen Luo ◽

Yi Zhang ◽

Ming Liu ◽

Yihong Lai ◽

Panpan Liu ◽

...

Keyword(s):

Artificial Intelligence ◽

Real Time ◽

High Performance ◽

Detection System ◽

Random Order ◽

Colon Polyps ◽

Clinical Environment ◽

Polyp Detection ◽

Link Type ◽

Polyp Detection Rate

Abstract Background and aims: Improving the rate of polyp detection is an important measure to prevent colorectal cancer (CRC). Real-time automatic polyp detection systems, through deep learning methods, can learn and perform specific endoscopic tasks previously performed by endoscopists. The purpose of this study was to explore whether a high-performance, real-time automatic polyp detection system could improve the polyp detection rate (PDR) in the actual clinical environment. Methods: The selected patients underwent same-day, back-to-back colonoscopies in a random order, with either traditional colonoscopy or artificial intelligence (AI)-assisted colonoscopy performed first by different experienced endoscopists (> 3000 colonoscopies). The primary outcome was the PDR. The trial was registered with clinicaltrials.gov (NCT047126265). Results: In this study, we randomized 150 patients. The AI system significantly increased the PDR (34.0% vs 38.7%, p < 0.001). In addition, AI-assisted colonoscopy increased the detection of polyps smaller than 6 mm (69 vs 91, p < 0.001), but no difference was found with regard to larger lesions. Conclusions: A real-time automatic polyp detection system can increase the PDR, primarily for diminutive polyps. However, a larger sample size is still needed in a follow-up study to further verify this conclusion. Trial Registration: clinicaltrials.gov Identifier: NCT047126265


Constructing Large-Scale Genetic Maps Using an Evolutionary Strategy Algorithm

Genetics ◽

10.1093/genetics/165.4.2269 ◽

2003 ◽

Vol 165 (4) ◽

pp. 2269-2282

Author(s):

D Mester ◽

Y Ronin ◽

D Minkov ◽

E Nevo ◽

A Korol

Keyword(s):

Discrete Optimization ◽

High Performance ◽

Large Scale ◽

Simulated Data ◽

Real Data ◽

Genetic Maps ◽

Chromosome 1 ◽

Evolutionary Strategy ◽

Group A ◽

The One

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
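The ordering criterion in the abstract above can be illustrated with a small sketch. It is not the evolution-strategy algorithm from the article: it simply enumerates every order exhaustively, which is only feasible for a handful of markers. The toy marker positions and the function names are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial


def count_orders(n):
    """Number of distinct orders of n loci when an order and its reverse are equivalent."""
    return factorial(n) // 2 if n > 1 else 1


def map_length(order, dist):
    """Total map length: sum of distances between adjacent markers in the order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_order(dist):
    """Exhaustive search for the order with minimal total length (tiny n only)."""
    best = None
    for perm in permutations(range(len(dist))):
        if perm[0] > perm[-1]:
            continue  # skip the mirror image of an order already considered
        length = map_length(perm, dist)
        if best is None or length < best[0]:
            best = (length, perm)
    return best


# Hypothetical toy data: four markers at known positions, with shuffled labels.
positions = [25, 0, 40, 10]
dist = [[abs(p - q) for q in positions] for p in positions]
length, order = best_order(dist)
print(count_orders(4), order, length)
```

Because the search space grows as n!/2, exhaustive enumeration stops being practical well before the dozens or hundreds of markers discussed in the abstract, which is the motivation for heuristic optimizers.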


Robust inference of positive selection from recombining coding sequences

Bioinformatics ◽

10.1093/bioinformatics/btl427 ◽

2006 ◽

Vol 22 (20) ◽

pp. 2493-2499 ◽

Cited By ~ 138

Author(s):

K. Scheffler ◽

D. P. Martin ◽

C. Seoighe

Keyword(s):

Positive Selection ◽

Robust Inference ◽

Coding Sequences


Unique structure and positive selection promote the rapid divergence of Drosophila Y chromosomes

10.1101/2021.08.16.456461 ◽

2021 ◽

Author(s):

Ching-Ho Chang ◽

Lauren E. Gregory ◽

Kathleen E. Gordon ◽

Colin D. Meiklejohn ◽

Amanda M. Larracuente

Keyword(s):

Positive Selection ◽

Y Chromosome ◽

Related Species ◽

De Novo ◽

Gene Families ◽

Chromosome Organization ◽

End Joining ◽

Sexual Antagonism ◽

Closely Related Species ◽

Y Chromosomes

Abstract Y chromosomes across diverse species convergently evolve a gene-poor, heterochromatic organization enriched for duplicated genes, LTR retrotransposable elements, and satellite DNA. Sexual antagonism and a loss of recombination play major roles in the degeneration of young Y chromosomes. However, the processes shaping the evolution of mature, already degenerated Y chromosomes are less well understood. Because Y chromosomes evolve rapidly, comparisons between closely related species are particularly useful. We generated de novo long-read assemblies complemented with cytological validation to reveal Y chromosome organization in three closely related species of the Drosophila simulans complex, which diverged only 250,000 years ago and share >98% sequence identity. We find these Y chromosomes are divergent in their organization and repetitive DNA composition and discover new Y-linked gene families whose evolution is driven by both positive selection and gene conversion. These Y chromosomes are also enriched for large deletions, suggesting that the repair of double-strand breaks on Y chromosomes may be biased toward microhomology-mediated end joining over canonical non-homologous end joining. We propose that this repair mechanism generally contributes to the convergent evolution of Y chromosome organization.


CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009631 ◽

2021 ◽

Vol 17 (11) ◽

pp. e1009631

Author(s):

Raquel Linheiro ◽

John Archer

Keyword(s):

De Novo ◽

Simulated Data ◽

Real Data ◽

Gene Families ◽

Classification Systems ◽

Whole Body ◽

Cdna Libraries ◽

Sequence Information ◽

Rna Seq ◽

High Quality

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of the underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs and representing two adult D. melanogaster whole-body samples, were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
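The idea of flagging contigs by whether ambiguous paths exist in the underlying graph can be sketched roughly as follows. This is a simplified yes/no check on a toy de Bruijn graph, not CStone's three-level classification, whose exact rules are not given in the abstract. The function names, the reads, and the choice of k are all assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def has_ambiguous_paths(graph):
    """True if any node has more than one successor or more than one predecessor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        if len(successors) > 1:
            return True  # the path forks at this node
        for nxt in successors:
            indegree[nxt] += 1
    return any(count > 1 for count in indegree.values())


# Hypothetical reads: the first gives a single linear path; adding the second
# introduces a fork after the shared prefix.
linear = de_bruijn(["ATGGCGT"], k=4)
forked = de_bruijn(["ATGGCGT", "ATGGCAT"], k=4)
print(has_ambiguous_paths(linear), has_ambiguous_paths(forked))
```

A contig drawn from a graph with no forks or merges has only one possible reconstruction, whereas branching nodes are where an assembler could join segments from different transcripts.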


GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene-level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that, compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
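For readers unfamiliar with the evaluation metric mentioned above, a relative Robinson-Foulds distance can be sketched as below. This is a minimal illustration, assuming rooted trees written as nested tuples and a clade-based (rooted) variant of the metric normalized by the total number of non-trivial clades. It is not code from GeneRax, and the abstract does not specify which RF variant or normalization the authors used.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) for a nested-tuple rooted tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf label
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return all_leaves, found


def relative_rf(tree_a, tree_b):
    """Fraction of non-trivial clades that appear in only one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    total = len(clades_a) + len(clades_b)
    if total == 0:
        return 0.0
    return len(clades_a ^ clades_b) / total


# Hypothetical example: the inferred tree misplaces B and C relative to the true tree.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(relative_rf(true_tree, inferred_tree))
```

A value of 0 means the two trees contain exactly the same clades, and a value of 1 means they share none beyond the root.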
