Understanding the Early Evolutionary Stages of a Tandem Drosophila melanogaster-Specific Gene Family: A Structural and Functional Population Study

Bryan D Clifton; Jamie Jimenez; Ashlyn Kimura; Zeinab Chahine; Pablo Librado; Alejandro Sánchez-Gracia; Mashya Abbassi; Francisco Carranza; Carolus Chan; Marcella Marchetti; Wanting Zhang; Mijuan Shi; Christine Vu; Shudan Yeh; Laura Fanti; Xiao-Qin Xia; Julio Rozas; José M Ranz

doi:10.1093/molbev/msaa109

Understanding the Early Evolutionary Stages of a Tandem Drosophila melanogaster-Specific Gene Family: A Structural and Functional Population Study

Molecular Biology and Evolution ◽

10.1093/molbev/msaa109 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2584-2600 ◽

Cited By ~ 3

Author(s):

Bryan D Clifton ◽

Jamie Jimenez ◽

Ashlyn Kimura ◽

Zeinab Chahine ◽

Pablo Librado ◽

...

Keyword(s):

Gene Family ◽

Sequence Similarity ◽

Gene Families ◽

Read Depth ◽

Specific Gene ◽

Protein Variant ◽

Protein Coding ◽

Expression Levels ◽

Number Variation ◽

Reference Quality

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.


A novel non-protein-coding infection-specific gene family is clustered throughout the genome of Phytophthora infestans

Microbiology ◽

10.1099/mic.0.2006/002220-0 ◽

2007 ◽

Vol 153 (3) ◽

pp. 747-759 ◽

Cited By ~ 19

Author(s):

A. O. Avrova ◽

S. C. Whisson ◽

L. Pritchard ◽

E. Venter ◽

S. De Luca ◽

...

Keyword(s):

Gene Family ◽

Phytophthora Infestans ◽

Specific Gene ◽

Protein Coding


The roles of LINEs, LTRs and SINEs in lineage-specific gene family expansions in the human and mouse genomes

10.1101/042309 ◽

2016 ◽

Cited By ~ 1

Author(s):

Václav Janoušek ◽

Christina M Laukaitis ◽

Alexey Yanchukov ◽

Robert Karn

Keyword(s):

Gene Family ◽

Gene Families ◽

Specific Gene ◽

Second Phase ◽

Open Chromatin ◽

Gene Family Expansion ◽

Family Expansion ◽

Human And Mouse ◽

Lineage Specific Gene ◽

Runaway Process

We explored genome-wide patterns of retrotransposon (RT) content surrounding lineage-specific gene family expansions in the human and mouse genomes. Our results suggest that the size of a gene family is an important predictor of the RT distribution in close proximity to the family members. The distribution differs considerably between the three most common RT classes (LINEs, LTRs and SINEs). LINEs and LTRs tend to be more abundant around genes of multi-copy gene families, whereas SINEs tend to be depleted around such genes. Detailed analysis of the distribution and diversity of LINEs and LTRs with respect to gene family size suggests that each has a distinct involvement in gene family expansion. LTRs are associated with open chromatin sites surrounding the gene families, supporting their involvement in gene regulation, whereas LINEs may play a structural role, promoting gene duplication. This suggests that gene family expansions, especially in the mouse genome, might undergo two phases. The first is characterized by elevated deposition of LTRs and their utilization in reshaping gene regulatory networks. The second is characterized by rapid gene family expansion due to continuous accumulation of LINEs, and it appears that, in some instances at least, this could become a runaway process. We provide an example in which this has happened and present a simulation supporting the possibility of the runaway process. Our observations also suggest that specific differences exist in this gene family expansion process between the human and mouse genomes.


A high-quality chromosome-level genome assembly reveals genetics for important traits in eggplant

Horticulture Research ◽

10.1038/s41438-020-00391-0 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Qingzhen Wei ◽

Jinglei Wang ◽

Wuhong Wang ◽

Tianhua Hu ◽

Haijiao Hu ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Repetitive Sequences ◽

Gene Families ◽

Specific Gene ◽

High Quality ◽

Total Size ◽

Protein Coding ◽

Fruit Length ◽

Protein Coding Genes

Abstract Eggplant (Solanum melongena L.) is an economically important vegetable crop in the Solanaceae family, with extensive diversity among landraces and close relatives. Here, we report a high-quality reference genome for the eggplant inbred line HQ-1315 (S. melongena-HQ) using a combination of Illumina, Nanopore and 10X Genomics sequencing technologies and Hi-C technology for genome assembly. The assembled genome has a total size of ~1.17 Gb and 12 chromosomes, with a contig N50 of 5.26 Mb, consisting of 36,582 protein-coding genes. Repetitive sequences comprise 70.09% (811.14 Mb) of the eggplant genome, most of which are long terminal repeat (LTR) retrotransposons (65.80%), followed by long interspersed nuclear elements (LINEs, 1.54%) and DNA transposons (0.85%). The S. melongena-HQ eggplant genome carries a total of 563 accession-specific gene families containing 1009 genes. In total, 73 expanded gene families (892 genes) and 34 contracted gene families (114 genes) were functionally annotated. Comparative analysis of different eggplant genomes identified three types of variations, including single-nucleotide polymorphisms (SNPs), insertions/deletions (indels) and structural variants (SVs). Asymmetric SV accumulation was found in potential regulatory regions of protein-coding genes among the different eggplant genomes. Furthermore, we performed QTL-seq for eggplant fruit length using the S. melongena-HQ reference genome and detected a QTL interval of 71.29–78.26 Mb on chromosome E03. The gene Smechr0301963, which belongs to the SUN gene family, is predicted to be a key candidate gene for eggplant fruit length regulation. Moreover, we anchored a total of 210 linkage markers associated with 71 traits to the eggplant chromosomes and finally obtained 26 QTL hotspots. The eggplant HQ-1315 genome assembly can be accessed at http://eggplant-hq.cn. In conclusion, the eggplant genome presented herein provides a global view of genomic divergence at the whole-genome level and powerful tools for the identification of candidate genes for important traits in eggplant.


Ab Initio Construction and Evolutionary Analysis of Protein-Coding Gene Families with Partially Homologous Relationships: Closely Related Drosophila Genomes as a Case Study

Genome Biology and Evolution ◽

10.1093/gbe/evaa041 ◽

2020 ◽

Vol 12 (3) ◽

pp. 185-202

Author(s):

Xia Han ◽

Jindan Guo ◽

Erli Pang ◽

Hongtao Song ◽

Kui Lin

Keyword(s):

Gene Family ◽

De Novo ◽

Gene Families ◽

Gene Family Evolution ◽

Evolutionary Analysis ◽

Protein Coding ◽

Protein Coding Genes ◽

Genome Phylogeny ◽

Partial Homology

Abstract How have genes evolved within a well-known genome phylogeny? Many protein-coding genes should have evolved as a whole at the gene level, and some should have evolved partly through fragments at the subgene level. To comprehensively explore such complex homologous relationships and better understand gene family evolution, here, with de novo-identified modules, the subgene units which could consecutively cover proteins within a set of closely related species, we applied a new phylogeny-based approach that considers evolutionary models with partial homology to classify all protein-coding genes in nine Drosophila genomes. Compared with two other popular methods for gene family construction, our approach improved practical gene family classifications with a more reasonable view of homology and provided a much more complete landscape of gene family evolution at the gene and subgene levels. In the case study, we found that most expanded gene families might have evolved mainly through module rearrangements rather than gene duplications and mainly generated single-module genes through partial gene duplication, suggesting that there might be pervasive subgene rearrangement in the evolution of protein-coding gene families. The use of a phylogeny-based approach with partial homology to classify and analyze protein-coding gene families may provide us with a more comprehensive landscape depicting how genes evolve within a well-known genome phylogeny.


Additional description and genome analyses of Caenorhabditis auriculariae representing the basal lineage of genus Caenorhabditis

Scientific Reports ◽

10.1038/s41598-021-85967-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Mehmet Dayi ◽

Natsumi Kanzaki ◽

Simo Sun ◽

Tatsuya Ide ◽

Ryusei Tanaka ◽

...

Keyword(s):

Phylogenetic Analyses ◽

Morphological Characteristics ◽

Gene Families ◽

Single Copy ◽

Specific Gene ◽

Protein Coding ◽

Molecular Phylogenetic ◽

Tight Association ◽

Genome Analyses ◽

Species Specific

Abstract Caenorhabditis auriculariae, which was morphologically described in 1999, was re-isolated from a Platydema mushroom-associated beetle. Based on the re-isolated materials, some morphological characteristics were re-examined and ascribed to the species. In addition, to clarify phylogenetic relationships with other Caenorhabditis species and biological features of the nematode, the whole genome was sequenced and assembled into 109.5 Mb with 16,279 predicted protein-coding genes. Molecular phylogenetic analyses based on ribosomal RNA and 269 single-copy genes revealed that the species is closely related to C. sonorae and C. monodelphis, placing them at the most basal clade of the genus. C. auriculariae has morphological characteristics that clearly differ from those two species and harbours a number of species-specific gene families, indicating its usefulness as a new outgroup species for Caenorhabditis evolutionary studies. A comparison of carbohydrate-active enzyme (CAZy) repertoires in genomes, which we found useful to speculate about the lifestyle of Caenorhabditis nematodes, suggested that C. auriculariae likely has a life cycle with tight association with insects.


A Nodule-Specific Gene Family from Alnus glutinosa Encodes Glycine- and Histidine-Rich Proteins Expressed in the Early Stages of Actinorhizal Nodule Development

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi.1997.10.5.656 ◽

1997 ◽

Vol 10 (5) ◽

pp. 656-664 ◽

Cited By ~ 31

Author(s):

Katharina Pawlowski ◽

Paul Twigg ◽

Svetlana Dobritsa ◽

Changhui Guan ◽

Beth C. Mullin

Keyword(s):

Gene Family ◽

Sequence Similarity ◽

Root Nodules ◽

Rna Stability ◽

Alnus Glutinosa ◽

Chelating Resin ◽

Cdna Libraries ◽

Nodule Development ◽

Specific Gene ◽

Signal Peptides

Two cDNAs representing different members (agNt84 and ag164) of a gene family encoding glycine- and histidine-rich proteins have been isolated from cDNA libraries from Alnus glutinosa root nodules. Expression of the corresponding genes could only be detected in nodules. With in situ hybridization, the expression in nodules was found to occur in young, infected cells of the prefixation zone (zone 2). The encoded proteins contain putative signal peptides for targeting to the endomembrane system, sharing sequence similarity with signal peptides from plant glycine-rich proteins, among them nodulin-24, a nodule-specific protein from soybean. This similarity suggests that, analogous to nodulin-24, proteins encoded by agNt84/ag164 may be located at the interface between the host plant membrane and the matrix surrounding the endosymbiont. The 3′ untranslated regions of the cDNAs contain unusual poly(AT)n stretches that may play a role in the regulation of RNA stability. The protein encoded by agNt84 cDNA was expressed in Escherichia coli as a fusion with maltose-binding protein, and was shown to have the ability to bind to a nickel-chelating resin, indicating that it may function as a metal-binding protein.


CAARS: comparative assembly and annotation of RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/bty903 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2199-2207 ◽

Cited By ~ 1

Author(s):

Carine Rey ◽

Philippe Veber ◽

Bastien Boussau ◽

Marie Sémon

Keyword(s):

Gene Family ◽

De Novo ◽

Sequence Similarity ◽

Gene Families ◽

Supplementary Information ◽

Model Organisms ◽

Difficult Case ◽

Rna Seq ◽

Comparative Analyses ◽

Family Reconstruction

Abstract Motivation: RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction. Results: We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees and uses phylogenetic information to classify the genes as orthologs and paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed CAARS assemblies are more complete and accurate than those assembled by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses. Availability and implementation: CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars. Supplementary information: Supplementary data are available at Bioinformatics online.


Identifying and removing haplotypic duplication in primary genome assemblies

10.1101/729962 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Jonathan Wood ◽

Kerstin Howe ◽

Yadong Wang ◽

...

Keyword(s):

Gene Annotation ◽

Sequence Similarity ◽

Rapid Development ◽

Relevant Information ◽

Read Depth ◽

Current Standard ◽

Long Read ◽

Reference Quality ◽

Fully Automatic ◽

Genome Assemblies

Abstract Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results: Here we present a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with the current standard, purge_haplotigs, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can be easily integrated into assembly pipelines. Availability: The source code is written in C and is available at https://github.com/dfguan/[email protected], [email protected]


Comparative analysis of de novo genomes reveals dynamic intra-species divergence of NLRs in pepper

BMC Plant Biology ◽

10.1186/s12870-021-03057-8 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Myung-Shin Kim ◽

Geun Young Chae ◽

Soohyun Oh ◽

Jihyun Kim ◽

Hyunggon Mang ◽

...

Keyword(s):

Capsicum Annuum ◽

De Novo ◽

Genomic Diversity ◽

Specific Gene ◽

Protein Coding ◽

Genomic Variations ◽

Gene Annotations ◽

Small Fruit ◽

Number Variation ◽

Genome Assemblies

Abstract Background: Peppers (Capsicum annuum L.) containing distinct capsaicinoids are the most widely cultivated spices in the world. However, extreme genomic diversity among species represents an obstacle to breeding pepper. Results: Here, we report de novo genome assemblies of Capsicum annuum ‘Early Calwonder (non-pungent, ECW)’ and ‘Small Fruit (pungent, SF)’ along with their annotations. In total, we assembled 2.9 Gb of ECW and SF genome sequences, representing over 91% of the estimated genome sizes. Structural and functional annotation of the two pepper genomes generated about 35,000 protein-coding genes each, of which 93% were assigned putative functions. Comparison between newly and publicly available pepper gene annotations revealed both shared and specific gene content. In addition, a comprehensive analysis of nucleotide-binding and leucine-rich repeat (NLR) genes through whole-genome alignment identified five significant regions of NLR copy number variation (CNV). Detailed comparisons of those regions revealed that these CNVs were generated by intra-specific genomic variations that accelerated diversification of NLRs among peppers. Conclusions: Our analyses unveil an evolutionary mechanism responsible for generating CNVs of NLRs among pepper accessions, and provide novel genomic resources for functional genomics and molecular breeding of disease resistance in Capsicum species.


New genes and functional innovation in mammals

10.1101/090860 ◽

2016 ◽

Cited By ~ 1

Author(s):

José Luis Villanueva-Cañas ◽

Jorge Ruiz-Orera ◽

M.Isabel Agea ◽

Maria Gallo ◽

David Andreu ◽

...

Keyword(s):

De Novo ◽

Gene Families ◽

Specific Gene ◽

Protein Coding ◽

Evolutionary Innovation ◽

New Genes ◽

Recent Origin ◽

Mammalian Genes ◽

Genomic Regions ◽

New Protein

Abstract The birth of genes that encode new protein sequences is a major source of evolutionary innovation. However, we still understand relatively little about how these genes come into being and which functions they are selected for. To address these questions we have obtained a large collection of mammalian-specific gene families that lack homologues in other eukaryotic groups. We have combined gene annotations and de novo transcript assemblies from 30 different mammalian species, obtaining about 6,000 gene families. In general, the proteins in mammalian-specific gene families tend to be short and depleted in aromatic and negatively charged residues. Proteins which arose early in mammalian evolution include milk and skin polypeptides, immune response components, and proteins involved in reproduction. In contrast, the functions of proteins which have a more recent origin remain largely unknown, despite the fact that these proteins also have extensive proteomics support. We identify several previously described cases of genes originated de novo from non-coding genomic regions, supporting the idea that this mechanism frequently underlies the evolution of new protein-coding genes in mammals. Finally, we show that most young mammalian genes are preferentially expressed in testis, suggesting that sexual selection plays an important role in the emergence of new functional genes.
