A Model-Free Approach for Detecting Genomic Regions of Deep Divergence Using the Distribution of Haplotype Distances

2017 ◽  
Author(s):  
Mats E. Pettersson ◽  
Marcin Kierczak ◽  
Markus Sällman Almén ◽  
Sangeet Lamichhaney ◽  
Leif Andersson

Abstract Recent advances in comparative genomics have revealed that divergence between populations is not necessarily uniform across all parts of the genome. There are examples of regions with divergent haplotypes that are substantially more different from each other than the genomic average. Typically, these regions are of interest, as their persistence over long periods of time may reflect balancing selection. However, they are hard to detect unless the divergent sub-populations are known prior to analysis. Here, we introduce HaploDistScan, an R package implementing model-free detection of deep-divergence genomic regions based on the distribution of pairwise haplotype distances, and show that it can detect such regions without the use of a priori information about population sub-division. We apply the method to real-world data sets, from the ruff and Darwin's finches, and show that we are able to recover known instances of balancing selection – originally identified in studies reliant on detailed phenotyping – using only genotype data. Furthermore, in addition to replicating previously known divergent haplotypes as a proof of concept, we identify novel regions of interest in the Darwin's finch genome and propose a plausible, data-driven evolutionary history for each novel locus. In conclusion, HaploDistScan requires neither phenotypic nor demographic input data, thus filling a gap in the existing set of methods for genome scanning, and provides a useful tool for the identification of regions under balancing selection or similar evolutionary processes.
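The core idea – flagging windows where the pairwise haplotype distance distribution has unusually heavy spread – can be sketched in a few lines. This is a minimal illustration of the principle only, not HaploDistScan's actual statistic; the function names and the coefficient-of-variation summary are choices made here for clarity:

```python
from itertools import combinations

def pairwise_hamming(haplotypes):
    """All pairwise Hamming distances among equal-length haplotypes."""
    return [sum(a != b for a, b in zip(h1, h2))
            for h1, h2 in combinations(haplotypes, 2)]

def divergence_signal(haplotypes):
    """Coefficient of variation of the pairwise-distance distribution.

    Two deeply diverged haplotype groups produce a bimodal distance
    distribution, which inflates the spread relative to the mean.
    (Illustrative summary statistic, not HaploDistScan's.)
    """
    d = pairwise_hamming(haplotypes)
    mean = sum(d) / len(d)
    if mean == 0:
        return 0.0
    var = sum((x - mean) ** 2 for x in d) / len(d)
    return var ** 0.5 / mean

# A window with two diverged haplotype groups vs. a homogeneous window
diverged = ["AAAAAAAA", "AAAAAAAT", "TTTTTTTT", "TTTTTTTA"]
uniform = ["AAAAAAAA", "AAAAAAAT", "AAAAATAA", "AAAATAAA"]
```

In an actual genome scan, such a signal would be computed per window and compared against the genome-wide distribution to call outlier regions.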

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lélia Polit ◽  
Gwenneg Kerdivel ◽  
Sebastian Gregoricchio ◽  
Michela Esposito ◽  
Christel Guillouf ◽  
...  

Abstract
Background: Multiple studies rely on ChIP-seq experiments to assess the effect of gene modulation and drug treatments on protein binding and chromatin structure. However, most methods commonly used for the normalization of ChIP-seq binding intensity signals across conditions, e.g., normalization to the same number of reads, either assume a constant signal-to-noise ratio across conditions or base the estimates of correction factors on genomic regions with intrinsically different signals between conditions. Inaccurate normalization of ChIP-seq signal may, in turn, lead to erroneous biological conclusions.
Results: We developed a new R package, CHIPIN, that allows normalizing ChIP-seq signals across different conditions/samples when spike-in information is not available but gene expression data are at hand. Our normalization technique is based on the assumption that, on average, no differences in ChIP-seq signal should be observed in the regulatory regions of genes whose expression levels are constant across samples/conditions. In addition to normalizing ChIP-seq signals, CHIPIN outputs a number of graphs and statistics allowing the user to assess the efficiency of the normalization and qualify the specificity of the antibody used. Beyond ChIP-seq, CHIPIN can be used without restriction on open-chromatin ATAC-seq or DNase hypersensitivity data. We validated the CHIPIN method on several ChIP-seq data sets and documented its superior performance in comparison to several commonly used normalization techniques.
Conclusions: The CHIPIN method provides a new way to normalize ChIP-seq signal across conditions when spike-in experiments are not available. The method is implemented in a user-friendly R package available on GitHub: https://github.com/BoevaLab/CHIPIN
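The normalization assumption – equal average ChIP signal over the regulatory regions of constantly expressed genes – suggests a simple scaling-factor sketch. This illustrates the principle only; CHIPIN's actual procedure and interface differ, and the function name is ours:

```python
def constant_gene_scale_factor(signal_a, signal_b, expr_a, expr_b, tol=0.1):
    """Scale factor for condition B so that the mean ChIP-seq signal over
    the regulatory regions of constantly expressed genes matches condition A.

    signal_*: per-gene ChIP-seq signal in the gene's regulatory region
    expr_*:   matched per-gene expression values
    Genes whose relative expression change exceeds `tol` are excluded,
    leaving only "constant" genes to anchor the normalization.
    """
    pairs = [(sa, sb)
             for sa, sb, ea, eb in zip(signal_a, signal_b, expr_a, expr_b)
             if ea > 0 and abs(ea - eb) / ea <= tol]
    if not pairs:
        raise ValueError("no constantly expressed genes found")
    mean_a = sum(sa for sa, _ in pairs) / len(pairs)
    mean_b = sum(sb for _, sb in pairs) / len(pairs)
    return mean_a / mean_b

# Condition B was sequenced twice as deep: a factor of 0.5 restores parity.
factor = constant_gene_scale_factor(
    signal_a=[4.0, 6.0, 2.0, 9.0],
    signal_b=[8.0, 12.0, 4.0, 100.0],  # last gene changes expression too
    expr_a=[10.0, 10.0, 5.0, 100.0],
    expr_b=[10.0, 10.5, 5.0, 50.0],    # last gene excluded by `tol`
)
```

Multiplying condition B's signal track by the returned factor aligns the two conditions without assuming a constant signal-to-noise ratio genome-wide.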


2018 ◽  
Vol 63 ◽  
pp. 1-49 ◽  
Author(s):  
Matthew Gombolay ◽  
Reed Jensen ◽  
Jessica Stigile ◽  
Toni Golen ◽  
Neel Shah ◽  
...  

Coordinating agents to complete a set of tasks with intercoupled temporal and resource constraints is computationally challenging, yet human domain experts can solve these difficult scheduling problems using paradigms learned through years of apprenticeship. A process for manually codifying this domain knowledge within a computational framework is necessary to scale beyond the "single-expert, single-trainee" apprenticeship model. However, human domain experts often have difficulty describing their decision-making processes. We propose a new approach for capturing this decision-making process through counterfactual reasoning in pairwise comparisons. Our approach is model-free and does not require iterating through the state space. We demonstrate that this approach accurately learns multifaceted heuristics on synthetic and real-world data sets. We also demonstrate that policies learned from human scheduling demonstrations via apprenticeship learning can substantially improve the efficiency of schedule optimization. We employ this human-machine collaborative optimization technique on a variant of the weapon-to-target assignment problem. We demonstrate that this technique generates optimal solutions up to 9.5 times faster than a state-of-the-art optimization algorithm.
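The pairwise-comparison idea – each expert decision implies the scheduled task was preferred over every unscheduled one – can be illustrated with a small ranking perceptron. This is a hedged sketch of the general technique, not the authors' implementation:

```python
def rank_train(demonstrations, n_features, epochs=50, lr=0.1):
    """Perceptron-style ranking from expert demonstrations.

    Each demonstration is (chosen, others): the feature vector of the task
    the expert scheduled next, and those of the tasks it was preferred over.
    Learns weights w such that w . chosen > w . other for every pair.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, others in demonstrations:
            for other in others:
                diff = [c - o for c, o in zip(chosen, other)]
                if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def pick_next(w, candidates):
    """Index of the candidate task the learned policy would schedule next."""
    return max(range(len(candidates)),
               key=lambda i: sum(wi * x for wi, x in zip(w, candidates[i])))

# Toy expert who always schedules the task with the earliest deadline;
# the single feature is the negated deadline.
demos = [([-1.0], [[-3.0], [-5.0]]), ([-2.0], [[-4.0]])]
w = rank_train(demos, n_features=1)
```

Because only pairwise preferences are needed, no model of the state space is required – the classifier reduces scheduling to "which task ranks highest now?".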


2020 ◽  
Vol 4 (2) ◽  
pp. 165-177 ◽  
Author(s):  
Jean-Michel Hily ◽  
Nils Poulicard ◽  
Thierry Candresse ◽  
Emmanuelle Vigne ◽  
Monique Beuve ◽  
...  

Grapevine Pinot gris virus (GPGV), a recently described member of the genus Trichovirus, has now been detected in most grape-growing countries. While it has been associated with severe mottling and deformation symptoms under some circumstances, it has generally been detected in asymptomatic infections. The cause(s) underlying this variable association with symptoms remain(s) a matter of speculation. GPGV genetic diversity has been studied using short genomic regions amplified by RT-PCR, but not so far at the pan-genomic level. In an attempt to gain insight into GPGV diversity and evolutionary history, a systematic datamining effort was performed on our own high-throughput sequencing (HTS) data as well as on publicly available sequence read archive files. One hundred new complete or near-complete GPGV genomic sequences were thus obtained, together with 69 new complete genomes for the other grapevine-infecting Trichovirus, grapevine berry inner necrosis virus (GINV). Phylogenetic and diversity analyses revealed that both viruses likely have their origin in Asia and that China is the most probable country of origin of GPGV. However, despite their common taxonomy, origin, and host, these two trichoviruses display very distinct genetic features and evolutionary traits. GINV shows substantial overall genetic diversity and is likely evolving under balancing selection in a very restricted region of the world. In contrast, GPGV shows a worldwide distribution with modest genetic diversity and presents a strong selective-sweep pattern. Taken together, these results show how two closely related trichoviruses differ drastically in their evolutionary history and epidemiological success. Possible causes for these differences are discussed. Copyright © 2020 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1521 ◽  
Author(s):  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
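Summarizing transcript-level estimates to the gene level can be sketched as follows. This is a simplified illustration of the kind of aggregation tximport performs (summed counts plus an abundance-weighted effective length, from which offsets for count-based inference are derived); names and details are ours, not the package's API:

```python
def gene_level(tx_counts, tx_length, tx2gene):
    """Summarize transcript-level estimates to the gene level.

    Returns per-gene summed counts and an abundance-weighted average
    transcript length, the quantity from which per-sample offsets for
    count-based inference are derived.
    """
    counts = {}
    for tx, gene in tx2gene.items():
        counts[gene] = counts.get(gene, 0.0) + tx_counts[tx]
    eff_len = {}
    for tx, gene in tx2gene.items():
        w = tx_counts[tx] / counts[gene] if counts[gene] > 0 else 0.0
        eff_len[gene] = eff_len.get(gene, 0.0) + w * tx_length[tx]
    return counts, eff_len

counts, eff_len = gene_level(
    tx_counts={"tx1": 30.0, "tx2": 10.0, "tx3": 5.0},
    tx_length={"tx1": 100.0, "tx2": 300.0, "tx3": 200.0},
    tx2gene={"tx1": "G1", "tx2": "G1", "tx3": "G2"},
)
```

If isoform usage shifts between conditions, the effective length changes even when the summed count does not – which is exactly why such length terms are fed to inference engines as offsets rather than ignored.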


Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 337
Author(s):  
Shaw-Hwa Lo ◽  
Yiqiao Yin

The field of explainable artificial intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional neural networks (CNNs) have been successful in making predictions, especially in image classification. These popular and well-documented successes use extremely deep CNNs such as VGG16, DenseNet121, and Xception. However, these well-known deep learning models use tens of millions of parameters based on a large number of pretrained filters that have been repurposed from previous data sets. Among these identified filters, a large portion contain no information yet remain as input features. Thus far, there is no effective method to omit these noisy features from a data set, and their existence negatively impacts prediction performance. In this paper, a novel interaction-based convolutional neural network (ICNN) is introduced that does not make assumptions about the relevance of local information. Instead, a model-free influence score (I-score) is proposed to directly extract the influential information from images to form important variable modules. This innovative technique replaces all pretrained filters found by trial-and-error with explainable, influential, and predictive variable sets (modules) determined by the I-score. In other words, future researchers need not rely on pretrained filters; the suggested algorithm identifies only the variables or pixels with high I-score values that are extremely predictive and important. The proposed method and algorithm were tested on a real-world data set, and a state-of-the-art prediction performance of 99.8% was achieved without sacrificing the explanatory power of the model. This proposed design can efficiently screen patients infected by COVID-19 before human diagnosis and can be a benchmark for addressing future XAI problems in large-scale data sets.
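One published form of the influence score partitions the samples by the joint values of a variable subset and sums squared cell-mean deviations weighted by squared cell sizes. The sketch below illustrates that form, normalized by the total sum of squares; it is not the authors' ICNN code:

```python
def i_score(x_cols, y):
    """Influence (I-)score of a set of discrete variables w.r.t. outcome y.

    Samples are partitioned by their joint value on x_cols; the score sums
    n_j^2 * (mean_j - mean)^2 over partition cells, normalized here by the
    total sum of squares of y. Informative partitions separate the cell
    means and yield a large score; pure noise yields a score near zero.
    """
    n = len(y)
    ybar = sum(y) / n
    cells = {}
    for i in range(n):
        cells.setdefault(tuple(col[i] for col in x_cols), []).append(y[i])
    tss = sum((yi - ybar) ** 2 for yi in y)
    num = sum(len(v) ** 2 * (sum(v) / len(v) - ybar) ** 2
              for v in cells.values())
    return num / tss if tss > 0 else 0.0

y = [0, 0, 0, 0, 1, 1, 1, 1]
x_informative = [0, 0, 0, 0, 1, 1, 1, 1]  # perfectly predicts y
x_noise = [0, 1, 0, 1, 0, 1, 0, 1]        # independent of y
```

Because the score needs only cell counts and means, it is model-free: no classifier is fit when screening pixel sets, which is what lets it replace trial-and-error filter selection.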


2016 ◽  
Author(s):  
Mathieu Gautier ◽  
Alexander Klassmann ◽  
Renaud Vitalis

Abstract Identifying genomic regions with unusually high local haplotype homozygosity represents a powerful strategy to characterize candidate genes responding to natural or artificial positive selection. To that end, statistics measuring the extent of haplotype homozygosity within (e.g., EHH, iHS) and between (Rsb or XP-EHH) populations have been proposed in the literature. The rehh package for R was previously developed to facilitate genome-wide scans for selection based on the analysis of long-range haplotypes. However, its performance was not sufficient to cope with the growing size of available data sets. Here we propose a major upgrade of the rehh package, which includes improved processing of the input files, a faster algorithm to enumerate haplotypes, and multi-threading. As illustrated with the analysis of large human haplotype data sets, these improvements decrease the computation time by more than an order of magnitude. This new version of rehh will thus allow iHS-, Rsb- or XP-EHH-based scans to be performed on large data sets. The package rehh 2.0 is available from the CRAN repository (http://cran.r-project.org/web/packages/rehh/index.html) together with help files and a detailed manual.
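For reference, EHH at a marker some distance from a core SNP is the probability that two randomly drawn haplotypes are identical over the whole stretch between them. A minimal illustration (assuming the haplotypes have already been restricted to carriers of one core allele; rehh's real interface differs):

```python
from math import comb

def ehh(haplotypes, core, pos):
    """Extended haplotype homozygosity at marker index `pos`.

    Assumes `haplotypes` is already restricted to carriers of one allele
    at the core marker. EHH is the probability that two randomly drawn
    haplotypes are identical over the stretch from `core` to `pos`:
    sum over groups of identical stretches of C(n_h, 2) / C(n, 2).
    """
    lo, hi = min(core, pos), max(core, pos)
    groups = {}
    for h in haplotypes:
        key = h[lo:hi + 1]
        groups[key] = groups.get(key, 0) + 1
    n = len(haplotypes)
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)

haps = ["AAAA", "AAAT", "AATT", "ATTT"]  # core marker at index 0
```

EHH starts at 1 at the core and decays toward 0 with distance; iHS and XP-EHH summarize this decay within or between populations, and enumerating the haplotype groups efficiently is exactly the step the rehh 2.0 upgrade accelerates.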


2015 ◽  
Vol 97 ◽  
Author(s):  
LISETTE GRAAE ◽  
SILVIA PADDOCK ◽  
ANDREA CARMINE BELIN

Summary Studies of complex genetic diseases have revealed many risk factors of small effect, but the combined amount of heritability explained is still low. Genome-wide association studies are often underpowered to identify true effects because of the very large number of parallel tests. There is, therefore, a great need to generate data sets that are enriched for markers that have an increased a priori chance of being functional, such as markers in genomic regions involved in gene regulation. ReMo-SNPs is a computational program developed to aid researchers in selecting functional SNPs for association analyses in user-specified regions and/or motifs genome-wide. Automatic selection of genotyped markers in the user-provided material makes the output data ready for use in a subsequent association study. In this article we describe the program and its functions. We also validate the program with an example study on three different transcription factors and results from an association study on two psychiatric phenotypes. The flexibility of ReMo-SNPs enables the user to study any region or sequence of interest, without limitation to transcription factor binding regions and motifs. The program is freely available at: http://www.neuro.ki.se/ReMo-SNPs/
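The central selection step – keeping genotyped SNPs that fall inside user-specified regions – reduces to interval filtering. A minimal sketch (names hypothetical, not ReMo-SNPs' actual interface):

```python
def snps_in_regions(snps, regions):
    """Keep genotyped SNPs located inside any user-specified region.

    snps:    iterable of (name, chrom, pos)
    regions: iterable of (chrom, start, end), inclusive bounds
    """
    return [name
            for name, chrom, pos in snps
            if any(chrom == rc and start <= pos <= end
                   for rc, start, end in regions)]

selected = snps_in_regions(
    snps=[("rs1", "1", 100), ("rs2", "1", 900), ("rs3", "2", 150)],
    regions=[("1", 50, 200), ("2", 100, 200)],
)
```

Restricting association tests to such an enriched marker set reduces the multiple-testing burden that underpowers genome-wide scans.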


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of SVD++, a state-of-the-art recommendation algorithm that inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than ten other counterparts, and can better handle the data-sparsity and cold-start issues.
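The TrustSVD prediction rule extends SVD++: the user factor is augmented by implicit-feedback factors of the items the user rated and by the factors of the users they trust, each set scaled by the inverse square root of its size. A sketch of the predictor only (training by regularized gradient descent is omitted, and the parameter names are ours):

```python
def trustsvd_predict(mu, b_u, b_i, p_u, q_i, y_rated, w_trusted):
    """TrustSVD-style rating prediction for user u and item i.

    Global mean and bias terms, plus the dot product of the item factors
    q_i with the user factor p_u augmented by (a) implicit-feedback
    factors of the rated items and (b) factors of the trusted users,
    each set scaled by the inverse square root of its size.
    """
    vec = list(p_u)
    for factors in (y_rated, w_trusted):
        if not factors:
            continue
        coef = len(factors) ** -0.5
        for f in factors:
            for k in range(len(vec)):
                vec[k] += coef * f[k]
    return mu + b_u + b_i + sum(q * v for q, v in zip(q_i, vec))

r_hat = trustsvd_predict(mu=3.0, b_u=0.1, b_i=-0.2,
                         p_u=[1.0, 0.0], q_i=[0.5, 0.5],
                         y_rated=[[1.0, 1.0]], w_trusted=[])
```

Because trusted users contribute even when the active user has rated nothing, the trust term is what gives the model traction on cold-start users.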


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 507
Author(s):  
Piotr Białczak ◽  
Wojciech Mazurczyk

Malicious software utilizes HTTP protocol for communication purposes, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To this end, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from the parts of the request such as URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets, and we also compared Hfinger with the most related and popular existing tools, such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, in default mode Hfinger does not introduce collisions between malware and benign applications, while increasing the number of fingerprints by at most a factor of three. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
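A request fingerprint of this kind condenses structural features – method, URI shape, ordered header names, payload presence – into a short, human-readable string. The sketch below is in the spirit of Hfinger but uses a feature set and encoding chosen here for illustration; the real tool's format differs:

```python
def http_fingerprint(method, uri, headers, payload=b""):
    """Condense an HTTP request into a short, readable fingerprint:
    method | URI path depth | extension | ordered header names | payload flag.
    (Illustrative feature set; Hfinger's real format differs.)
    """
    path = uri.split("?")[0]
    trimmed = path.strip("/")
    depth = trimmed.count("/") + 1 if trimmed else 0
    last = path.rsplit("/", 1)[-1]
    ext = last.rsplit(".", 1)[1] if "." in last else "-"
    names = ",".join(h.lower() for h in headers)
    return f"{method}|{depth}|{ext}|{names}|{int(bool(payload))}"

fp_get = http_fingerprint("GET", "/a/b/mal.php?id=1", ["Host", "User-Agent"])
fp_post = http_fingerprint("POST", "/upload", ["Host"], b"data")
```

Requests from the same malware family tend to share one such fingerprint, while structurally different requests diverge, which is what makes collision rates between families a meaningful quality metric.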

