A simple analytical formula to compute the residual Mutual Information between pairs of data vectors

2016 ◽  
Author(s):  
Jens Kleinjung ◽  
Ton C.C. Coolen

Abstract Summary: The Mutual Information of pairs of data vectors, for example sequence alignment positions or gene expression profiles, is a quantitative measure of the interdependence between the data. However, data vectors based on a finite number of samples retain non-zero Mutual Information values even for completely random data, which is referred to as background or residual Mutual Information. Estimates of the residual Mutual Information have so far been obtained through heuristic or numerical approximations. Here we introduce a simple analytical formula for the computation of the residual Mutual Information that yields precise values and does not require the joint probabilities between the vector elements as input. Availability and Implementation: A C program arMI is available at http://mathbio.crick.ac.uk/wiki/Software#arMI. Using an input alignment in FASTA format or alternatively an internally created random alignment of specified length and depth, the program computes three types of Mutual Information: (i) Shannon’s Mutual Information between all pairs of alignment columns; (ii) the numerical residual Mutual Information, obtained by applying the same formula to the randomised (shuffled) data; (iii) the analytical residual Mutual Information introduced here. The package depends on the GNU Scientific Library, which is used for vector and matrix operations, factorial expressions and random number generation (Galassi et al., 2009). Reference alignments and result data are included in the program package in the folder ‘tests’. The R environment was used for statistics and plotting (R Core Team, 2014). Contact: [email protected] Supplementary Material: A detailed derivation of the analytical formula is given in the Supplementary Material.
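As an illustration of quantities (i) and (ii) above, the following is a minimal Python sketch, not the arMI implementation. It computes the plug-in Shannon Mutual Information between two columns and estimates the residual Mutual Information numerically by shuffling one column. The function names, the use of natural logarithms, and the number of shuffles are assumptions made for this example; the analytical formula (iii) is not stated in the abstract and is therefore not reproduced here.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in Shannon Mutual Information (in nats) of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    count_x = Counter(x)           # marginal counts of the first column
    count_y = Counter(y)           # marginal counts of the second column
    count_xy = Counter(zip(x, y))  # joint counts of symbol pairs
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log(p(a,b) / (p(a) * p(b))), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


def shuffled_residual_mi(x, y, n_shuffles=200, seed=0):
    """Mean Mutual Information after randomly permuting y (numerical residual)."""
    rng = random.Random(seed)
    y_perm = list(y)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(y_perm)  # destroys any real dependence between x and y
        total += mutual_information(x, y_perm)
    return total / n_shuffles
```

For short columns the shuffled estimate stays above zero even though the shuffled data carry no real dependence, which is the finite-sample effect the abstract describes.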

Author(s):  
Justine Dardaillon ◽  
Delphine Dauga ◽  
Paul Simion ◽  
Emmanuel Faure ◽  
Takeshi A Onuma ◽  
...  

Abstract ANISEED (https://www.aniseed.cnrs.fr) is the main model organism database for the worldwide community of scientists working on tunicates, the vertebrate sister-group. Information provided for each species includes functionally annotated gene and transcript models with orthology relationships within tunicates, and with echinoderms, cephalochordates and vertebrates. Beyond genes the system describes other genetic elements, including repeated elements and cis-regulatory modules. Gene expression profiles for several thousand genes are formalized in both wild-type and experimentally manipulated conditions, using formal anatomical ontologies. These data can be explored through three complementary types of browsers, each offering a different viewpoint. A developmental browser summarizes the information in a gene- or territory-centric manner. Advanced genomic browsers integrate the genetic features surrounding genes or gene sets within a species. A Genomicus synteny browser explores the conservation of local gene order across deuterostomes. This new release covers an extended taxonomic range of 14 species, including for the first time a non-ascidian species, the appendicularian Oikopleura dioica. Functional annotations, provided for each species, were enhanced through a combination of manual curation of gene models and the development of an improved orthology detection pipeline. Finally, gene expression profiles and anatomical territories can be explored in 4D online through the newly developed Morphonet morphogenetic browser.


Symmetry ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 1812
Author(s):  
Sergii Babichev ◽  
Lyudmyla Yasinska-Damri ◽  
Igor Liakh ◽  
Bohdan Durnyak

The reconstruction of gene regulatory networks (GRN) and the creation of effective disease diagnostic systems based on gene expression data are among the current directions of modern bioinformatics. In this manuscript, we present the results of research evaluating the effectiveness of the most widely used metrics for estimating the proximity of gene expression profiles, which can be used to extract groups of informative gene expression profiles while taking into account the states of the investigated samples. Symmetry is very important in the field of gene and/or protein interaction, since it undergirds essentially all interactions between molecular components in the GRN and the extraction of gene expression profiles, which allows us to identify how the investigated biological objects (disease, state of patients, etc.) contribute to the further reconstruction of the GRN, in terms of both symmetry and understanding the mechanism of molecular element interaction in a biological organism. Within the framework of our research, we have investigated the following metrics: Mutual information maximization (MIM) using various methods of Shannon entropy calculation, Pearson’s χ2 test and correlation distance. The classification accuracy of the investigated samples was used as the main quality criterion for evaluating the effectiveness of each metric. A random forest (RF) classifier was used during the simulation process. The results show that the various methods of Shannon entropy calculation used within the MIM metric disagree with each other. As a result, we have proposed the modified mutual information maximization (MMIM) proximity metric based on the joint use of various methods of Shannon entropy calculation and the Harrington desirability function.
The simulation results also show that the correlation proximity metric is less effective than both the MMIM metric and Pearson’s χ2 test. Finally, we propose a hybrid proximity metric (HPM) that considers both the MMIM metric and Pearson’s χ2 test. The proposed metric was investigated within the framework of one-cluster structure effectiveness evaluation. In our view, the main benefit of the proposed HPM is that it increases the objectivity of extracting mutually similar gene expression profiles through the joint use of several effective proximity metrics that can contradict each other when used alone.
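Two of the proximity metrics named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the equal-width binning, the default of three bins, and all function names are assumptions made for the example, and the MMIM and HPM combinations are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter


def correlation_distance(x, y):
    """Correlation distance 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - sxy / math.sqrt(sxx * syy)


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def chi2_statistic(x, y, n_bins=3):
    """Pearson chi-square statistic of the contingency table of two binned profiles."""
    bx = equal_width_bins(x, n_bins)
    by = equal_width_bins(y, n_bins)
    n = len(bx)
    rows = Counter(bx)           # row totals
    cols = Counter(by)           # column totals
    cells = Counter(zip(bx, by)) # observed cell counts
    stat = 0.0
    for r, row_total in rows.items():
        for c, col_total in cols.items():
            expected = row_total * col_total / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    return stat
```

A small correlation distance or a large chi-square statistic both indicate closely related profiles, so the two measures point in opposite directions and would need rescaling before being combined.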


Cephalalgia ◽  
2019 ◽  
Vol 39 (11) ◽  
pp. 1435-1444 ◽  
Author(s):  
Lisette JA Kogelman ◽  
Katrine Falkenberg ◽  
Gisli H Halldorsson ◽  
Lau U Poulsen ◽  
Jacob Worm ◽  
...  

Background Migraine mechanisms are only partly known. Some studies have previously described genes differentially expressed between blood from migraineurs and controls. The objective of this study was to describe gene expression in subtypes of migraine outside of attack and in healthy controls. Methods We extensively phenotyped 17 migraine without aura and nine migraine with aura female patients, and 20 age-matched female controls. Cubital venous blood was RNA sequenced. Genes differentially expressed between migraineurs (migraine without aura and migraine with aura) and controls, and between migraine without aura and migraine with aura were identified using a case-control design. A co-expression network was constructed to investigate the difference between migraineurs and healthy controls at the network level. Results We found two differentially expressed genes: NMNAT2 and RETN. Both were differentially expressed between migraine with aura and controls, but they could not be replicated in an independent cohort. Co-expression network analysis resulted in one cluster of highly interconnected genes that was nominally significantly associated with migraine; however, no pathways or gene ontology terms were detected. Conclusions We showed no clear distinct difference in gene expression profiles of peripheral blood of migraineurs and controls and were not able to replicate findings from previous studies. A larger sample size may be needed to detect minor differences.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Sven Borchmann

Abstract Background Host tissue infections by bacteria and viruses can cause cancer. Known viral carcinogenic mechanisms are disruption of the host genome via genomic integration and expression of oncogenic viral proteins. An important bacterial carcinogenic mechanism is chronic inflammation. Massively parallel sequencing now routinely generates datasets large enough to contain detectable traces of bacterial and viral nucleic acids of taxa that colonize the examined tissue or are integrated into the host genome. However, this hidden resource has not been comprehensively studied in large patient cohorts. Methods In the present study, 3025 whole genome sequencing datasets and, where available, corresponding RNA-seq datasets are leveraged to gain insight into novel links between viruses, bacteria, and cancer. Datasets were obtained from multiple International Cancer Genome Consortium studies, with additional controls added from the 1000 Genomes Project. A customized pipeline based on KRAKEN was developed and validated to identify bacterial and viral sequences in the datasets. Raw results were stringently filtered to reduce false positives and remove likely contaminants. Results The resulting map confirms known links and expands current knowledge by identifying novel associations. Moreover, the detection of certain bacteria or viruses is associated with profound differences in patient and tumor phenotypes, such as patient age, tumor stage, survival, and somatic mutations in cancer genes or gene expression profiles. Conclusions Overall, these results provide a detailed, unprecedented map of links between viruses, bacteria, and cancer that can serve as a reference for future studies and further experimental validation.


2019 ◽  
Author(s):  
Sven Borchmann

Abstract Host tissue infections by bacteria and viruses can cause cancer. Massively parallel sequencing now routinely generates datasets large enough to contain detectable traces of bacterial and viral nucleic acids of taxa that colonize the examined tissue or are integrated into the host genome. However, this hidden resource has not been comprehensively studied in large patient cohorts. In the present study, 3000 whole genome sequencing datasets are leveraged to gain insight into novel links between viruses, bacteria and cancer. The resulting map confirms known links and expands current knowledge by identifying novel associations. Moreover, the detection of certain bacteria or viruses is associated with profound differences in patient and tumor phenotypes, such as patient age, tumor stage, survival, somatic mutations in cancer genes or gene expression profiles. Overall, these results provide a detailed, unprecedented map of links between viruses, bacteria and cancer that can serve as a reference for future studies.


2004 ◽  
Vol 171 (4S) ◽  
pp. 349-350
Author(s):  
Gaelle Fromont ◽  
Michel Vidaud ◽  
Alain Latil ◽  
Guy Vallancien ◽  
Pierre Validire ◽  
...  
