Linking Phenotypes and Genotypes with Matrix Factorizations (Preprint)

Mapping Intimacies ◽

10.2196/preprints.13526 ◽

2019 ◽

Author(s):

Jianqiang Li ◽

Yu Guan ◽

Xi Xu ◽

Zerui Ma ◽

Faheem Akhtar ◽

...

Keyword(s):

Genomic Medicine ◽

Real World Data ◽

Computational Framework ◽

Data Set ◽

Disease Symptoms ◽

Similarity Network ◽

A Cell ◽

Genetic Makeup ◽

Highly Correlated ◽

Phenotype Similarity

BACKGROUND Background: Phenotype is defined as the composite of an organism’s observable characteristics or traits, such as human’s eye colors, behaviors and disease symptoms. Genotype is the genetic makeup of a cell, an organism, or an individual usually with reference to a specific characteristic under consideration. Thus phenotype can be regarded as the macroscopic description of an organism while genotype is its microscopic expression. OBJECTIVE Objective: Identification of phenotype-genotype associations is the primary step explaining the pathogenesis of human complex diseases. It is also of key importance for the development of Genomic medicine, sometimes also known as personalized medicine, which is a way to customize medical care to an individual body’s unique genetic makeup. METHODS Methods: In this paper, we propose a unified computational framework, called PheGe , to bridge phenotypes and genotypes. PheGe utilizes phenotype similarity network, genotype similarity network and known phenotype-genotype associations to explore the potential associations among other unlinked phenotypes and genotypes. RESULTS Results: As by-products, PheGe can also discover the phenotype and genotype groups, such that the phenotypes or genotypes within the same group are highly correlated with each other. We also validate the effectiveness of PheGe on a real-world data set, where we discover some interesting phenotype-genotype associations and phenotype/genotype groups. CONCLUSIONS Conclusions: Our method can reveal potential phenotype clusters and genotype clusters and their unknown associations through a variety of phenotype similarities, genotype similarities, as well as known phenotype-genotype associations.

Download Full-text

Exploring fake news identification using word and sentence embeddings

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189865 ◽

2021 ◽

pp. 1-8

Author(s):

V.T Priyanga ◽

J.P Sanjanasri ◽

Vijay Krishna Menon ◽

E.A Gopalakrishnan ◽

K.P Soman

Keyword(s):

Machine Learning ◽

Social Media ◽

Network Analysis ◽

Supervised Machine Learning ◽

Breeding Ground ◽

Fake News ◽

Data Set ◽

Highly Correlated ◽

Use Of Social Media ◽

The Liar

The widespread use of social media like Facebook, Twitter, Whatsapp, etc. has changed the way News is created and published; accessing news has become easy and inexpensive. However, the scale of usage and inability to moderate the content has made social media, a breeding ground for the circulation of fake news. Fake news is deliberately created either to increase the readership or disrupt the order in the society for political and commercial benefits. It is of paramount importance to identify and filter out fake news especially in democratic societies. Most existing methods for detecting fake news involve traditional supervised machine learning which has been quite ineffective. In this paper, we are analyzing word embedding features that can tell apart fake news from true news. We use the LIAR and ISOT data set. We churn out highly correlated news data from the entire data set by using cosine similarity and other such metrices, in order to distinguish their domains based on central topics. We then employ auto-encoders to detect and differentiate between true and fake news while also exploring their separability through network analysis.

Download Full-text

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Abstract Dealing with the shear size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes a.k.a. stragglers can significantly slow down computation as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop novel mathematical understanding for this framework demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.

Download Full-text

Auto-sharing parameters for transfer learning based on multi-objective optimization

Integrated Computer-Aided Engineering ◽

10.3233/ica-210655 ◽

2021 ◽

pp. 1-13

Author(s):

Hailin Liu ◽

Fangqing Gu ◽

Zixian Lin

Keyword(s):

Transfer Learning ◽

Optimization Problem ◽

Data Sets ◽

Multi Objective Optimization ◽

Particle Swarm Optimizer ◽

Real World Data ◽

Data Set ◽

Target Task ◽

Main Research ◽

Multi Objective

Transfer learning methods exploit similarities between different datasets to improve the performance of the target task by transferring knowledge from source tasks to the target task. “What to transfer” is a main research issue in transfer learning. The existing transfer learning method generally needs to acquire the shared parameters by integrating human knowledge. However, in many real applications, an understanding of which parameters can be shared is unknown beforehand. Transfer learning model is essentially a special multi-objective optimization problem. Consequently, this paper proposes a novel auto-sharing parameter technique for transfer learning based on multi-objective optimization and solves the optimization problem by using a multi-swarm particle swarm optimizer. Each task objective is simultaneously optimized by a sub-swarm. The current best particle from the sub-swarm of the target task is used to guide the search of particles of the source tasks and vice versa. The target task and source task are jointly solved by sharing the information of the best particle, which works as an inductive bias. Experiments are carried out to evaluate the proposed algorithm on several synthetic data sets and two real-world data sets of a school data set and a landmine data set, which show that the proposed algorithm is effective.

Download Full-text

Empirical evaluation of feature subset selection based on a real-world data set

Engineering Applications of Artificial Intelligence ◽

10.1016/j.engappai.2004.03.005 ◽

2004 ◽

Vol 17 (3) ◽

pp. 285-288 ◽

Cited By ~ 5

Author(s):

Petra Perner ◽

Chid Apte

Keyword(s):

Real World ◽

Empirical Evaluation ◽

Subset Selection ◽

Feature Subset Selection ◽

Feature Subset ◽

Real World Data ◽

Data Set ◽

World Data

Download Full-text

Time-ResNeXt for epilepsy recognition based on EEG signals in wireless networks

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-020-01810-5 ◽

2020 ◽

Vol 2020 (1) ◽

Cited By ~ 1

Author(s):

Shaoqiang Wang ◽

Shudong Wang ◽

Song Zhang ◽

Yifan Wang

Keyword(s):

Deep Learning ◽

Network Structure ◽

Signal Recognition ◽

Eeg Signals ◽

Real World Data ◽

Data Set ◽

Practical Applications ◽

Epilepsy Diagnosis ◽

Single Data ◽

Electroencephalogram Eeg

Abstract To automatically detect dynamic EEG signals to reduce the time cost of epilepsy diagnosis. In the signal recognition of electroencephalogram (EEG) of epilepsy, traditional machine learning and statistical methods require manual feature labeling engineering in order to show excellent results on a single data set. And the artificially selected features may carry a bias, and cannot guarantee the validity and expansibility in real-world data. In practical applications, deep learning methods can release people from feature engineering to a certain extent. As long as the focus is on the expansion of data quality and quantity, the algorithm model can learn automatically to get better improvements. In addition, the deep learning method can also extract many features that are difficult for humans to perceive, thereby making the algorithm more robust. Based on the design idea of ResNeXt deep neural network, this paper designs a Time-ResNeXt network structure suitable for time series EEG epilepsy detection to identify EEG signals. The accuracy rate of Time-ResNeXt in the detection of EEG epilepsy can reach 91.50%. The Time-ResNeXt network structure produces extremely advanced performance on the benchmark dataset (Berne-Barcelona dataset) and has great potential for improving clinical practice.

Download Full-text

Rapid video assessment for monitoring testing facility fraud

International Journal of Quality & Reliability Management ◽

10.1108/ijqrm-01-2017-0022 ◽

2018 ◽

Vol 35 (8) ◽

pp. 1508-1518

Author(s):

Rosembergue Pereira Souza ◽

Luiz Fernando Rust da Costa Carmo ◽

Luci Pirmez

Keyword(s):

Processing Time ◽

Threshold Value ◽

Real World Data ◽

Data Set ◽

Content Type ◽

Video Assessment ◽

Technical Requirements ◽

Testing Facility ◽

Rapid Processing ◽

Temporal Differencing

Purpose The purpose of this paper is to present a procedure for finding unusual patterns in accredited tests using a rapid processing method for analyzing video records. The procedure uses the temporal differencing technique for object tracking and considers only frames not identified as statistically redundant. Design/methodology/approach An accreditation organization is responsible for accrediting facilities to undertake testing and calibration activities. Periodically, such organizations evaluate accredited testing facilities. These evaluations could use video records and photographs of the tests performed by the facility to judge their conformity to technical requirements. To validate the proposed procedure, a real-world data set with video records from accredited testing facilities in the field of vehicle safety in Brazil was used. The processing time of this proposed procedure was compared with the time needed to process the video records in a traditional fashion. Findings With an appropriate threshold value, the proposed procedure could successfully identify video records of fraudulent services. Processing time was faster than when a traditional method was employed. Originality/value Manually evaluating video records is time consuming and tedious. This paper proposes a procedure to rapidly find unusual patterns in videos of accredited tests with a minimum of manual effort.

Download Full-text

Predicting lean meat yield in beef cattle using ultrasonic muscle depth and width measurements

Canadian Journal of Animal Science ◽

10.4141/a02-093 ◽

2003 ◽

Vol 83 (3) ◽

pp. 429-434 ◽

Cited By ~ 5

Author(s):

R. Bergen ◽

D. H. Crews ◽

Jr., S. P. Miller ◽

J. J. McKinnon

Keyword(s):

Beef Cattle ◽

Carcass Composition ◽

Ultrasound Measurement ◽

Independent Data ◽

Data Set ◽

Meat Yield ◽

Lean Meat ◽

Ultrasound Measurements ◽

Width Measurements ◽

Highly Correlated

The value of live ultrasound longissimus dorsi depth and width measurements as predictors of estimated carcass lean meat yield of steers (CARLEAN-S) and bulls (CARLEAN-B) was studied. In trial 1, equations were developed to predict estimated lean meat yield of steers (n = 116) from carcass weight (Eq. 1) or liveweight (Eq. 2), fat depth and l. dorsi area or liveweight, fat depth and l. dorsi depth × width (Eq. 3). Equation 1 was most precise (RSD = 25.6 g kg-1), followed by Eq. 2 (RSD = 27.8g kg-1) and Eq. 3 (RSD = 30.2g kg-1). Equations 2 and 3 predicted CARLEAN-S with similar accuracy (SEP = 23.8 vs. 24.9 g kg-1, respectively) and were highly correlated with each other (r = 0.89) in an independent data set (n = 118). Repeatability and accuracy of pre-slaughter l. dorsi depth and width measurements were studied in yearling bulls (trial 2; n = 191). When ultrasound measurements were expressed as a percentage of the average ultrasound measurement, repeatabilities of l. dorsi depth (SER = 6.2 to 7.8%) and width (SER = 4.2 to 6.1%) measurements were similar to fat depth and l. dorsi area measurements (SER = 17.9 and 4.5%, respectively). When ultrasound measurements were compared to the corresponding carcass measurements, l. dorsi depth (SEP = 10.3 to 13.9%) and width (SEP = 6.7 to 8.5%) measurements were as accurate as fat depth and l. dorsi area measurements (SEP = 32.9 and 8.4%, respectively). Equations were developed to predict CARLEAN-B of yearling bulls (n = 82) from liveweight, 12th rib ultrasound fat depth and either l. dorsi depth × width measurements (Eqs. 4 and 5) or two l. dorsi depth measurements (Eq. 6). All equations had similar precision (RSD = 19.4 to 19.5 g kg-1) and predicted CARLEAN-B similarly (SEP = 25.0, 24.6 and 26.1g kg-1 for Eqs. 4, 5 and 6, respectively) in an independent data set (n = 109). All equations were highly correlated (r ≥0.97) with an equation using ultrasound fat depth and l. dorsi area in the independent data set. Longissimus muscle depth and width measurements were as valuable as l. dorsi area for predicting carcass composition of yearling beef bulls in the present study. Key words: Ultrasound, beef cattle, carcass traits

Download Full-text

CRISPR enriches for cells with mutations in a p53-related interactome, and this can be inhibited

10.1101/2021.03.10.434760 ◽

2021 ◽

Author(s):

Long Jiang ◽

Katrine Ingelshed ◽

Yunbing Shen ◽

Sanjaykumar V. Boddul ◽

Vaishnavi Srinivasan Iyer ◽

...

Keyword(s):

Cellular Response ◽

Human Cancer ◽

Genetic Alterations ◽

Dna Breaks ◽

Data Set ◽

Human Cancer Cell Lines ◽

Cycle Arrest ◽

Double Stranded Dna ◽

A Cell

CRISPR/Cas9 can be used to inactivate or modify genes by inducing double-stranded DNA breaks1–3. As a protective cellular response, DNA breaks result in p53-mediated cell cycle arrest and activation of cell death programs4,5. Inactivating p53 mutations are the most commonly found genetic alterations in cancer, highlighting the important role of the gene6–8. Here, we show that cells deficient in p53, as well as in genes of a core CRISPR-p53 tumor suppressor interactome, are enriched in a cell population when CRISPR is applied. Such enrichment could pose a challenge for clinical CRISPR use. Importantly, we identify that transient p53 inhibition suppresses the enrichment of cells with these mutations. Furthermore, in a data set of >800 human cancer cell lines, we identify parameters influencing the enrichment of p53 mutated cells, including strong baseline CDKN1A expression as a predictor for an active CRISPR-p53 axis. Taken together, our data identify strategies enabling safe CRISPR use.

Download Full-text

Advancing data science in drug development through an innovative computational framework for data sharing and statistical analysis

BMC Medical Research Methodology ◽

10.1186/s12874-021-01409-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Ann-Marie Mallon ◽

Dieter A. Häring ◽

Frank Dahlke ◽

Piet Aarden ◽

Soroosh Afyouni ◽

...

Keyword(s):

Machine Learning ◽

Clinical Trial ◽

Drug Development ◽

Phase Ii ◽

Data Science ◽

Clinical Trial Data ◽

Trial Data ◽

Computational Framework ◽

Data Set ◽

Collaborative Development

Abstract Background Novartis and the University of Oxford’s Big Data Institute (BDI) have established a research alliance with the aim to improve health care and drug development by making it more efficient and targeted. Using a combination of the latest statistical machine learning technology with an innovative IT platform developed to manage large volumes of anonymised data from numerous data sources and types we plan to identify novel patterns with clinical relevance which cannot be detected by humans alone to identify phenotypes and early predictors of patient disease activity and progression. Method The collaboration focuses on highly complex autoimmune diseases and develops a computational framework to assemble a research-ready dataset across numerous modalities. For the Multiple Sclerosis (MS) project, the collaboration has anonymised and integrated phase II to phase IV clinical and imaging trial data from ≈35,000 patients across all clinical phenotypes and collected in more than 2200 centres worldwide. For the “IL-17” project, the collaboration has anonymised and integrated clinical and imaging data from over 30 phase II and III Cosentyx clinical trials including more than 15,000 patients, suffering from four autoimmune disorders (Psoriasis, Axial Spondyloarthritis, Psoriatic arthritis (PsA) and Rheumatoid arthritis (RA)). Results A fundamental component of successful data analysis and the collaborative development of novel machine learning methods on these rich data sets has been the construction of a research informatics framework that can capture the data at regular intervals where images could be anonymised and integrated with the de-identified clinical data, quality controlled and compiled into a research-ready relational database which would then be available to multi-disciplinary analysts. The collaborative development from a group of software developers, data wranglers, statisticians, clinicians, and domain scientists across both organisations has been key. This framework is innovative, as it facilitates collaborative data management and makes a complicated clinical trial data set from a pharmaceutical company available to academic researchers who become associated with the project. Conclusions An informatics framework has been developed to capture clinical trial data into a pipeline of anonymisation, quality control, data exploration, and subsequent integration into a database. Establishing this framework has been integral to the development of analytical tools.

Download Full-text

Analysis of Co-Expression Module of Liver Metastasis Genes in Colorectal Cancer and Mining of Potential High-Expression Biomarkers

10.21203/rs.3.rs-516058/v1 ◽

2021 ◽

Author(s):

Xuehu Wang ◽

Nie Li ◽

Yun Wang ◽

Xiaoping Yin ◽

Yongchang Zheng

Keyword(s):

Colorectal Cancer ◽

Colon Cancer ◽

Liver Metastasis ◽

High Expression ◽

Data Set ◽

Colorectal Cancer Liver Metastasis ◽

Functional Annotations ◽

Public Data ◽

Highly Correlated ◽

Expression Module

Abstract AimsThe Hub genes highly related to the disease were found from the gene co-expression module, and the potential high expression genes were analyzed to predict the liver metastasis of colorectal cancer, so as to provide reference for subsequent targeted therapy.MethodsIn this study, we used the public data set of GEO database (GSE50760) to analyze the gene co-expression of liver metastasis of colon cancer, primary colon cancer and normal colon tissue (54 cases) and 50 cases of clinical cases. The functional annotations based on GO database are enriched, and the functional annotations of five gene modules are obtained through the enrichment of biological processes. Then the data mining is carried out to find the sub-networks with high adjacency in the gene co-expression network. At the same time, these sub-networks are annotated to find oncogenes related to liver metastasis of colorectal cancer.ResultsThis experiment found that KRAS, APC, FBXW7, PIK3CA, TP53 were highly correlated with liver metastasis of colorectal cancer. Finally, two protein genes STAT1 and MAPK1 were found by MCODE, which may be highly correlated with liver metastasis of colorectal cancer. Two new genes with high expression proteins found in this experiment have potential cancer, which has not been reflected in previous studies.ConclusionAccording to clinical data, KRAS, APC, FBXW7, PIK3CA, TP53 are related to colorectal cancer liver metastasis, and the analysis of the data set shows that STAT1 and MAPK1 are not only related to colorectal cancer liver metastasis oncogene but also related to clinically obtained genes.

Download Full-text