Whole-Genome k-mer Topic Modeling Associates Bacterial Families

Ernesto Borrayo; Isaias May-Canche; Omar Paredes; J. Alejandro Morales; Rebeca Romo-Vázquez; Hugo Vélez-Pérez

doi:10.3390/genes11020197

Genes ◽

10.3390/genes11020197 ◽

2020 ◽

Vol 11 (2) ◽

pp. 197

Author(s):

Ernesto Borrayo ◽

Isaias May-Canche ◽

Omar Paredes ◽

J. Alejandro Morales ◽

Rebeca Romo-Vázquez ◽

...

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Hierarchical Classification ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequence Comparisons ◽

Alignment Free ◽

Biological Phenomena ◽

Topic Distribution ◽

Genome Comparisons

Applying alignment-free, k-mer-based algorithms to whole-genome sequence comparisons remains an ongoing challenge. Here, we explore the possibility of using topic modeling for whole-genome comparisons between organisms. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was treated as a document and its 13-mer nucleotide representations as words. Latent Dirichlet allocation was used for probabilistic modeling of the corpus. We were able to identify the topic distribution among the analyzed genomes, which is highly consistent with traditional hierarchical classification. Topic modeling may therefore be applicable to establishing relationships between genome composition and biological phenomena.
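The document/word analogy in this abstract can be illustrated with a minimal sketch. It only builds the k-mer count matrix (genomes as documents, k-mers as words) that a topic model such as latent Dirichlet allocation would take as input; it is not the authors' pipeline. The function names, the toy sequences, and the choice to skip windows containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k=13):
    """Count overlapping k-mers ("words") in one genome ("document")."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def document_term_matrix(genomes, k=13):
    """Return (vocabulary, rows): one row of k-mer counts per genome."""
    per_genome = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*per_genome.values()))
    rows = {
        name: [counts[word] for word in vocabulary]
        for name, counts in per_genome.items()
    }
    return vocabulary, rows


# Hypothetical toy sequences, far shorter than a real genome, with k=3
# instead of 13 so the vocabulary stays readable.
toy = {"genome_a": "ACGTACGT", "genome_b": "ACGTTTNA"}
vocab, matrix = document_term_matrix(toy, k=3)
print(vocab)
print(matrix)
```

With k = 13 the vocabulary can reach 4^13 (about 67 million) distinct words, so a real analysis would likely need a sparse representation rather than the dense rows used here.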

Whole Genome Sequence Comparisons and "Full-Length" cDNA Sequences: A Combined Approach to Evaluate and Improve Arabidopsis Genome Annotation

Genome Research ◽

10.1101/gr.1515604 ◽

2004 ◽

Vol 14 (3) ◽

pp. 406-413 ◽

Cited By ~ 37

Author(s):

V. Castelli

Keyword(s):

Genome Sequence ◽

Genome Annotation ◽

Full Length ◽

Arabidopsis Genome ◽

Whole Genome Sequence ◽

Whole Genome ◽

Combined Approach ◽

Sequence Comparisons ◽

Full Length cDNA ◽

cDNA Sequences

Whole-proteome tree of life suggests a deep burst of organism diversity

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1915766117 ◽

2020 ◽

Vol 117 (7) ◽

pp. 3678-3686 ◽

Cited By ~ 5

Author(s):

JaeJin Choi ◽

Sung-Hou Kim

Keyword(s):

Information Theory ◽

Genome Sequence ◽

Tree Of Life ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequences ◽

Alignment Free ◽

Whole Transcriptome ◽

Evolutionary Progression ◽

Feature Frequency

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, the whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying an information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences, focusing on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals that 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups emerged in a “deep burst” in the earliest period near the root of the ToL: an explosive birth of life’s diversity.
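As a rough illustration of what an alignment-free, feature-frequency comparison looks like, the sketch below builds a frequency profile of short overlapping features for each sequence and compares two profiles with the Jensen-Shannon divergence. The abstract does not name the divergence measure, so its use here is an assumption, as are the function names, the feature length, and the toy sequences.

```python
import math
from collections import Counter


def feature_frequency_profile(sequence, length):
    """Relative frequencies of all overlapping features of a given length."""
    counts = Counter(
        sequence[i:i + length] for i in range(len(sequence) - length + 1)
    )
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two profiles."""
    features = set(p) | set(q)
    # Midpoint distribution of the two profiles.
    mid = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in features}

    def kl(a):
        # Kullback-Leibler divergence of profile a from the midpoint.
        return sum(pa * math.log2(pa / mid[f]) for f, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy protein fragments; real proteomes are far longer.
profile_a = feature_frequency_profile("MKVLAAGMKVLA", 2)
profile_b = feature_frequency_profile("MKVLSAGMKTLA", 2)
print(jensen_shannon_divergence(profile_a, profile_b))
```

A matrix of such pairwise divergences could then be passed to a tree-building method; that step is not shown here.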

Analysis of Epstein-Barr Virus Genomes and Expression Profiles in Gastric Adenocarcinoma

Journal of Virology ◽

10.1128/jvi.01239-17 ◽

2017 ◽

Vol 92 (2) ◽

Cited By ~ 11

Author(s):

Ivan Borozan ◽

Marc Zapatka ◽

Lori Frappier ◽

Vincent Ferretti

Keyword(s):

Epstein Barr Virus ◽

Expression Profiles ◽

Whole Genome Sequence ◽

Small Subset ◽

Whole Genome ◽

Genome Sequences ◽

Sequence Comparisons ◽

Barr Virus ◽

Epstein Barr ◽

Virus Genomes

ABSTRACT: Epstein-Barr virus (EBV) is a causative agent of a variety of lymphomas, nasopharyngeal carcinoma (NPC), and ∼9% of gastric carcinomas (GCs). An important question is whether particular EBV variants are more oncogenic than others, but conclusions are currently hampered by the lack of sequenced EBV genomes. Here, we contribute to this question by mining whole-genome sequences of 201 GCs to identify 13 EBV-positive GCs and by assembling 13 new EBV genome sequences, almost doubling the number of available GC-derived EBV genome sequences and providing the first non-Asian EBV genome sequences from GC. Whole-genome sequence comparisons of all EBV isolates sequenced to date (85 from tumors and 57 from healthy individuals) showed that most GC and NPC EBV isolates were closely related although American Caucasian GC samples were more distant, suggesting a geographical component. However, EBV GC isolates were found to contain some consistent changes in protein sequences regardless of geographical origin. In addition, transcriptome data available for eight of the EBV-positive GCs were analyzed to determine which EBV genes are expressed in GC. In addition to the expected latency proteins (EBNA1, LMP1, and LMP2A), specific subsets of lytic genes were consistently expressed that did not reflect a typical lytic or abortive lytic infection, suggesting a novel mechanism of EBV gene regulation in the context of GC. These results are consistent with a model in which a combination of specific latent and lytic EBV proteins promotes tumorigenesis.

IMPORTANCE: Epstein-Barr virus (EBV) is a widespread virus that causes cancer, including gastric carcinoma (GC), in a small subset of individuals. An important question is whether particular EBV variants are more cancer associated than others, but more EBV sequences are required to address this question. Here, we have generated 13 new EBV genome sequences from GC, almost doubling the number of EBV sequences from GC isolates and providing the first EBV sequences from non-Asian GC. We further identify sequence changes in some EBV proteins common to GC isolates. In addition, gene expression analysis of eight of the EBV-positive GCs showed consistent expression of both the expected latency proteins and a subset of lytic proteins that was not consistent with typical lytic or abortive lytic expression. These results suggest that novel mechanisms activate expression of some EBV lytic proteins and that their expression may contribute to oncogenesis.

Faculty Opinions recommendation of Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1018408.215556 ◽

2004 ◽

Author(s):

Motoaki Seki

Keyword(s):

Genome Sequence ◽

Genome Annotation ◽

Full Length ◽

Arabidopsis Genome ◽

Whole Genome Sequence ◽

Whole Genome ◽

Combined Approach ◽

Sequence Comparisons ◽

Full Length cDNA ◽

cDNA Sequences

Genome-wide simple sequence repeats (SSR) markers discovered from whole-genome sequence comparisons of multiple spinach accessions

Scientific Reports ◽

10.1038/s41598-021-89473-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Gehendra Bhattarai ◽

Ainong Shi ◽

Devi R. Kandel ◽

Nora Solís-Gracia ◽

Jorge Alberto da Silva ◽

...

Keyword(s):

SSR Markers ◽

Genome Sequence ◽

Reference Genome ◽

Whole Genome Sequence ◽

Large Set ◽

Whole Genome ◽

Genome Sequences ◽

Sequence Comparisons ◽

SSR Loci ◽

Simple Sequence

Abstract: The availability of well-assembled genome sequences and reduced sequencing costs have enabled the resequencing of many additional accessions in several crops, thus facilitating the rapid discovery and development of simple sequence repeat (SSR) markers. Although the genome sequence of inbred spinach line Sp75 is available, previous efforts have resulted in a limited number of useful SSR markers. Identification of additional polymorphic SSR markers will support genetics and breeding research in spinach. This study aimed to use the available genomic resources to mine and catalog a large number of polymorphic SSR markers. A search for SSR loci on six chromosome sequences of spinach line Sp75 using GMATA identified a total of 42,155 loci with repeat motifs of two to six nucleotides in the Sp75 reference genome. Whole-genome sequences (30x) of 21 additional accessions were aligned against the chromosome sequences of the reference genome and genotyped in silico using the HipSTR program, which compares and counts repeat-number variation across the SSR loci among the accessions. The SSR genotype data generated by HipSTR were filtered to remove monomorphic loci and loci with a high proportion of missing data, and a final set of 5986 polymorphic SSR loci was identified. The polymorphic SSR loci were present at a density of 12.9 SSRs/Mb and were physically mapped. Out of 36 randomly selected SSR loci for validation, two failed to amplify, while the remaining were all polymorphic in a set of 48 spinach accessions from 34 countries. Genetic diversity analysis performed using the SSR allele score data on the 48 spinach accessions showed three main population groups. This strategy to mine and develop polymorphic SSR markers by a comparative analysis of the genome sequences of multiple accessions and computational genotyping of the candidate SSR loci eliminates the need for laborious experimental screening. Our approach increased the efficiency of discovering a large set of novel polymorphic SSR markers, as demonstrated in this report.
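To make the idea of "loci with repeat motifs of two to six nucleotides" concrete, the following is a minimal regular-expression sketch that scans a sequence for tandem repeats of short motifs. It is not the algorithm used by GMATA or HipSTR; the function name, the `min_repeats` threshold, the exclusion of single-base motifs, and the toy sequence are all assumptions made for illustration.

```python
import re


def find_ssr_loci(sequence, min_repeats=3):
    """Return (start, motif, repeat_count) for tandem repeats of 2-6 nt motifs.

    min_repeats is a hypothetical threshold, not a value from the study.
    """
    # A lazily matched 2-6 nt motif followed by at least (min_repeats - 1)
    # further copies of itself.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    loci = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        # Assumption: runs of a single base (e.g. "AAAAAA") are not reported.
        if len(set(motif)) == 1:
            continue
        repeats = len(match.group(0)) // len(motif)
        loci.append((match.start(), motif, repeats))
    return loci


# Hypothetical toy sequence containing one (AT)4 repeat starting at index 2.
print(find_ssr_loci("GGATATATATCC"))
```

Comparing the repeat count found at the same locus across several accessions is what would distinguish a polymorphic locus from a monomorphic one; that comparison is not shown here.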

Analysis of Salmonella enterica Serovar Typhimurium Variable-Number Tandem-Repeat Data for Public Health Investigation Based on Measured Mutation Rates and Whole-Genome Sequence Comparisons

Journal of Bacteriology ◽

10.1128/jb.01820-14 ◽

2014 ◽

Vol 196 (16) ◽

pp. 3036-3044 ◽

Cited By ~ 23

Author(s):

K. Dimovski ◽

H. Cao ◽

O. L. C. Wijburg ◽

R. A. Strugnell ◽

R. K. Mantena ◽

...

Keyword(s):

Public Health ◽

Tandem Repeat ◽

Salmonella Enterica Serovar Typhimurium ◽

Variable Number Tandem Repeat ◽

Variable Number ◽

Whole Genome Sequence ◽

Mutation Rates ◽

Whole Genome ◽

Sequence Comparisons ◽

Serovar Typhimurium

Faculty Opinions recommendation of Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1018408.209473 ◽

2004 ◽

Author(s):

John McPherson

Keyword(s):

Genome Sequence ◽

Genome Annotation ◽

Full Length ◽

Arabidopsis Genome ◽

Whole Genome Sequence ◽

Whole Genome ◽

Combined Approach ◽

Sequence Comparisons ◽

Full Length cDNA ◽

cDNA Sequences

BAG OF WORDS APPROACH AND DOCUMENT-TOPIC MODELING FOR HUMAN ACTIVITY RECOGNITION FROM VIDEOS

Jurnal Muara Sains, Teknologi, Kedokteran dan Ilmu Kesehatan ◽

10.24912/jmstkik.v1i1.433 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Janson Hendryli

Keyword(s):

Logistic Regression ◽

Activity Recognition ◽

Human Activity ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Human Activity Recognition ◽

Bag Of Words ◽

Visual Words ◽

Text Document ◽

Topic Distribution

Human activity recognition from videos has many useful real-world applications, ranging from multimedia and entertainment to security. In this paper, an approach inspired by popular text document techniques, namely the bag of words and document topic modeling, is explored. Latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) are used to model the latent topic distribution in videos. The discovered distribution can then be used to transform the bag-of-words representation in order to categorize each video into one of ten daily human activities. The classification is done by feeding the transformed term frequencies of the visual words to logistic regression and SVM models. NMF achieved a higher F1-score than LDA with both the SVM and the logistic regression classifier.

Keywords: human activity recognition, bag of words, document topic modeling

Understand research hotspots surrounding COVID-19 and other coronavirus infections using topic modeling

10.1101/2020.03.26.20044164 ◽

2020 ◽

Cited By ~ 2

Author(s):

Mengying Dong ◽

Xiaojun Cao ◽

Mingbiao Liang ◽

Lijuan Li ◽

Guangjian Liu ◽

...

Keyword(s):

Epidemiological Study ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Virus Transmission ◽

Respiratory Illness ◽

Future Research ◽

Topic Distribution ◽

Virus Diagnostics ◽

Novel Coronavirus

Abstract. Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans and is currently driving a global outbreak of novel coronavirus disease (COVID-19). This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using topic modeling. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains 35,092 pieces of coronavirus-related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeling, we trained a topic model from the corpus, analyzed the semantic relationships between topics, and compared the topic distribution between COVID-19 and other CoV infections. Results: Eight topics emerged overall: clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccines research, virus diagnostics, and viral genomics. It was observed that current COVID-19 research puts more emphasis on clinical characterization, epidemiological study, and virus transmission. In contrast, topics about diagnostics, therapeutics, vaccines, genomics and pathogenesis each account for less than 10%, or even less than 4%, of all the COVID-19 publications, much lower than their shares for other CoV infections. Conclusions: These results identified knowledge gaps in the area of COVID-19 and offered directions for future research.

Whole-genome sequence comparisons reveal the evolution of Vibrio cholerae O1

Trends in Microbiology ◽

10.1016/j.tim.2015.03.010 ◽

2015 ◽

Vol 23 (8) ◽

pp. 479-489 ◽

Cited By ~ 44

Author(s):

Eun Jin Kim ◽

Chan Hee Lee ◽

G. Balakrish Nair ◽

Dong Wook Kim

Keyword(s):

Vibrio Cholerae ◽

Genome Sequence ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequence Comparisons ◽

Vibrio Cholerae O1
