Beware the Jaccard: the choice of metric is important and non-trivial in genomic colocalisation analysis

Mapping Intimacies ◽

10.1101/479253 ◽

2018 ◽

Author(s):

Stefania Salvatore ◽

Knut Dagestad Rand ◽

Ivar Grytten ◽

Egil Ferkingstad ◽

Diana Domanska ◽

...

Keyword(s):

Genomic Data ◽

Real Data ◽

Jaccard Index ◽

Phenotypic Traits ◽

Wide Binary ◽

Genome Wide ◽

Dataset Size ◽

Genome Wide Data ◽

Modelling Assumptions ◽

Systematic Collection

Abstract Background: The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation, and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of metrics have been proposed for this problem in other fields like ecology. However, while several of these metrics have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. Results: We show that the choice of metric may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly affected by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less affected by dataset size, but one should be aware of increased variance for small datasets. Availability: All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure


Beware the Jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis

Briefings in Bioinformatics ◽

10.1093/bib/bbz083 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1523-1530 ◽

Cited By ~ 5

Author(s):

Stefania Salvatore ◽

Knut Dagestad Rand ◽

Ivar Grytten ◽

Egil Ferkingstad ◽

Diana Domanska ◽

...

Keyword(s):

Similarity Measure ◽

Similarity Measures ◽

Real Data ◽

Phenotypic Traits ◽

Wide Binary ◽

Genome Wide ◽

Dataset Size ◽

Genome Wide Data ◽

Modelling Assumptions ◽

Systematic Collection

Abstract The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure.
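The two simplest measures named in the abstract can be illustrated on toy binary vectors. The following is a minimal sketch, not the authors' implementation: the function names (`contingency`, `jaccard`, `forbes`) and the toy tracks are assumptions made for illustration, and the tetrachoric correlation is left out because it requires numerical estimation. The formulas are the standard definitions, Jaccard = a / (a + b + c) and Forbes = n·a / ((a + b)(a + c)), where a counts positions covered by both tracks and n is the vector length.

```python
def contingency(x, y):
    """Count the cells of the 2x2 table for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)        # covered by both
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)    # covered by x only
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)    # covered by y only
    d = len(x) - a - b - c                                 # covered by neither
    return a, b, c, d


def jaccard(x, y):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b, c, _ = contingency(x, y)
    union = a + b + c
    return a / union if union else float("nan")


def forbes(x, y):
    """Forbes coefficient: observed overlap divided by the overlap expected
    if the two tracks were independent (a fold change)."""
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    denom = (a + b) * (a + c)
    return n * a / denom if denom else float("nan")


# Toy tracks over 100 positions (hypothetical data). Track x covers the first
# half; the two y tracks overlap x exactly as expected under independence but
# differ in size.
n = 100
x = [1 if i < 50 else 0 for i in range(n)]
y_small = [1 if i % 10 == 0 else 0 for i in range(n)]   # 10 covered positions
y_large = [1 if i % 2 == 0 else 0 for i in range(n)]    # 50 covered positions

print(jaccard(x, y_small), forbes(x, y_small))   # Jaccard 5/55, Forbes 1.0
print(jaccard(x, y_large), forbes(x, y_large))   # Jaccard 25/75, Forbes 1.0
```

In this constructed case the Forbes coefficient stays at 1 (no enrichment) for both track sizes, while the Jaccard index rises from about 0.09 to about 0.33 purely because the second track is larger. It is a single hand-built example rather than evidence, but it shows the kind of size dependence the abstract warns about.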


Ancient genome-wide DNA from France highlights the complexity of interactions between Mesolithic hunter-gatherers and Neolithic farmers

Science Advances ◽

10.1126/sciadv.aaz5344 ◽

2020 ◽

Vol 6 (22) ◽

pp. eaaz5344 ◽

Cited By ~ 3

Author(s):

Maïté Rivollat ◽

Choongwon Jeong ◽

Stephan Schiffels ◽

İşil Küçükkalıpçı ◽

Marie-Hélène Pemonge ◽

...

Keyword(s):

Genomic Data ◽

Biological Interactions ◽

Genetic Affinity ◽

Hunter Gatherers ◽

Near Eastern ◽

Genome Wide ◽

Cultural Pattern ◽

Genome Wide Data ◽

Genetic Substructure ◽

Neolithic Expansion

Starting from 12,000 years ago in the Middle East, the Neolithic lifestyle spread across Europe via separate continental and Mediterranean routes. Genomes from early European farmers have shown a clear Near Eastern/Anatolian genetic affinity with limited contribution from hunter-gatherers. However, no genomic data are available from modern-day France, where both routes converged, as evidenced by a mosaic cultural pattern. Here, we present genome-wide data from 101 individuals from 12 sites covering today’s France and Germany from the Mesolithic (N = 3) to the Neolithic (N = 98) (7000–3000 BCE). Using the genetic substructure observed in European hunter-gatherers, we characterize diverse patterns of admixture in different regions, consistent with both routes of expansion. Early western European farmers show a higher proportion of distinctly western hunter-gatherer ancestry compared to central/southeastern farmers. Our data highlight the complexity of the biological interactions during the Neolithic expansion by revealing major regional variations.


A performance assessment of relatedness inference methods using genome-wide data from thousands of relatives

10.1101/106013 ◽

2017 ◽

Author(s):

Monica D. Ramstetter ◽

Thomas D. Dyer ◽

Donna M. Lehman ◽

Joanne E. Curran ◽

Ravindranath Duggirala ◽

...

Keyword(s):

State Of The Art ◽

Association Studies ◽

Genetic Association Studies ◽

Real Data ◽

New Methods ◽

Genome Wide ◽

Genome Wide Data ◽

Inference Methods ◽

Multiple Samples ◽

Combining Information

Abstract Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a dataset with 2,485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (~92%–99%) when detecting first and second degree relationships, but their accuracy dwindles to less than 43% for seventh degree relationships. However, most IBD segment-based methods inferred seventh degree relatives correct to within one relatedness degree for more than 76% of relative pairs. Overall, the most accurate methods are ERSA and approaches that compute total IBD sharing using the output from GERMLINE and Refined IBD to infer relatedness. Combining information from the most accurate methods provides little accuracy improvement, indicating that novel approaches, such as new methods that leverage relatedness signals from multiple samples, are needed to achieve a sizeable jump in performance.


M3C: Monte Carlo reference-based consensus clustering

10.1101/377002 ◽

2018 ◽

Cited By ~ 4

Author(s):

Christopher R. John ◽

David Watson ◽

Dominic Russ ◽

Katriona Goldmann ◽

Michael Ehrenstein ◽

...

Keyword(s):

Monte Carlo ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

The Cancer Genome Atlas ◽

Consensus Clustering ◽

Null Distributions ◽

Genome Wide ◽

Genome Wide Data ◽

Cancer Genome Atlas

Abstract Genome-wide data is used to stratify patients into classes for precision medicine using clustering algorithms. A common problem in this area is selection of the number of clusters (K). The Monti consensus clustering algorithm is a widely used method which uses stability selection to estimate K. However, the method has bias towards higher values of K and yields high numbers of false positives. As a solution, we developed Monte Carlo reference-based consensus clustering (M3C), which is based on this algorithm. M3C simulates null distributions of stability scores for a range of K values, thus enabling a comparison with real data to remove bias and statistically test for the presence of structure. M3C corrects the inherent bias of consensus clustering as demonstrated on simulated and real expression data from The Cancer Genome Atlas (TCGA). For testing M3C, we developed clusterlab, a new method for simulating multivariate Gaussian clusters.


Population Replacement in Early Neolithic Britain

10.1101/267443 ◽

2018 ◽

Cited By ~ 10

Author(s):

Selina Brace ◽

Yoan Diekmann ◽

Thomas J. Booth ◽

Zuzana Faltyskova ◽

Nadin Rohland ◽

...

Keyword(s):

Ancient Dna ◽

Genomic Data ◽

Early Neolithic ◽

Genetic Affinity ◽

Hunter Gatherers ◽

Population Replacement ◽

Neolithic Transition ◽

Genome Wide ◽

Genome Wide Data ◽

Neolithic Cultures

The roles of migration, admixture and acculturation in the European transition to farming have been debated for over 100 years. Genome-wide ancient DNA studies indicate predominantly Anatolian ancestry for continental Neolithic farmers, but also variable admixture with local Mesolithic hunter-gatherers [1–9]. Neolithic cultures first appear in Britain c. 6000 years ago (kBP), a millennium after they appear in adjacent areas of northwestern continental Europe. However, the pattern and process of the British Neolithic transition remains unclear [10–15]. We assembled genome-wide data from six Mesolithic and 67 Neolithic individuals found in Britain, dating from 10.5–4.5 kBP, a dataset that includes 22 newly reported individuals and the first genomic data from British Mesolithic hunter-gatherers. Our analyses reveal persistent genetic affinities between Mesolithic British and Western European hunter-gatherers over a period spanning Britain’s separation from continental Europe. We find overwhelming support for agriculture being introduced by incoming continental farmers, with small and geographically structured levels of additional hunter-gatherer introgression. We find genetic affinity between British and Iberian Neolithic populations, indicating that British Neolithic people derived much of their ancestry from Anatolian farmers who originally followed the Mediterranean route of dispersal and likely entered Britain from northwestern mainland Europe.


Genetic analysis of amyotrophic lateral sclerosis identifies contributing pathways and cell types

Science Advances ◽

10.1126/sciadv.abd9036 ◽

2021 ◽

Vol 7 (3) ◽

pp. eabd9036

Author(s):

Sara Saez-Atienzar ◽

Sara Bandres-Ciga ◽

Rebekah G. Langston ◽

Jonggeol J. Kim ◽

Shing Wan Choi ◽

...

Keyword(s):

Amyotrophic Lateral Sclerosis ◽

Membrane Trafficking ◽

Molecular Mechanisms ◽

Cell Types ◽

Polygenic Risk Score ◽

Genome Wide ◽

Genome Wide Data ◽

Data Driven Approach ◽

Single Nucleus ◽

Lateral Sclerosis

Despite the considerable progress in unraveling the genetic causes of amyotrophic lateral sclerosis (ALS), we do not fully understand the molecular mechanisms underlying the disease. We analyzed genome-wide data involving 78,500 individuals using a polygenic risk score approach to identify the biological pathways and cell types involved in ALS. This data-driven approach identified multiple aspects of the biology underlying the disease that resolved into broader themes, namely, neuron projection morphogenesis, membrane trafficking, and signal transduction mediated by ribonucleotides. We also found that genomic risk in ALS maps consistently to GABAergic interneurons and oligodendrocytes, as confirmed in human single-nucleus RNA-seq data. Using two-sample Mendelian randomization, we nominated six differentially expressed genes (ATG16L2, ACSL5, MAP1LC3A, MAPKAPK3, PLXNB2, and SCFD1) within the significant pathways as relevant to ALS. We conclude that the disparate genetic etiologies of this fatal neurological disease converge on a smaller number of final common pathways and cell types.


Ancient genomic time transect from the Central Asian Steppe unravels the history of the Scythians

Science Advances ◽

10.1126/sciadv.abe4414 ◽

2021 ◽

Vol 7 (13) ◽

pp. eabe4414

Author(s):

Guido Alberto Gnecchi-Ruscone ◽

Elmira Khussainova ◽

Nurzhibek Kahbatkyzy ◽

Lyazzat Musralina ◽

Maria A. Spyrou ◽

...

Keyword(s):

Bronze Age ◽

Iron Age ◽

Gene Pools ◽

Social Rules ◽

Eurasian Steppe ◽

Central Asian ◽

Genome Wide ◽

Genome Wide Data ◽

History Of ◽

First Millennium

The Scythians were a multitude of horse-warrior nomad cultures dwelling in the Eurasian steppe during the first millennium BCE. Because of the lack of first-hand written records, little is known about the origins and relations among the different cultures. To address these questions, we produced genome-wide data for 111 ancient individuals retrieved from 39 archaeological sites from the first millennia BCE and CE across the Central Asian Steppe. We uncovered major admixture events in the Late Bronze Age forming the genetic substratum for two main Iron Age gene-pools emerging around the Altai and the Urals respectively. Their demise was mirrored by new genetic turnovers, linked to the spread of the eastern nomad empires in the first centuries CE. Compared to the high genetic heterogeneity of the past, the homogenization of the present-day Kazakhs gene pool is notable, likely a result of 400 years of strict exogamous social rules.


Genome diversity in Ukraine

GigaScience ◽

10.1093/gigascience/giaa159 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Taras K Oleksyk ◽

Walter W Wolfsberger ◽

Alexandra M Weber ◽

Khrystyna Shchubelka ◽

Olga T Oleksyk ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genomic Variation ◽

High Coverage ◽

Genome Data ◽

New Information ◽

Genome Wide ◽

Public Data ◽

Genome Wide Data ◽

Multiple Samples

Abstract Background: The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. BGISEQ-500 sequence data and genotypes by an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results: The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucleotide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions: Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information bringing forth a wealth of novel, endemic and medically related alleles.


Initial Upper Palaeolithic humans in Europe had recent Neanderthal ancestry

Nature ◽

10.1038/s41586-021-03335-3 ◽

2021 ◽

Vol 592 (7853) ◽

pp. 253-257 ◽

Cited By ~ 3

Author(s):

Mateja Hajdinjak ◽

Fabrizio Mafessoni ◽

Laurits Skov ◽

Benjamin Vernot ◽

Alexander Hübner ◽

...

Keyword(s):

Family History ◽

East Asia ◽

Late Pleistocene ◽

Modern Human ◽

Human Migration ◽

Upper Palaeolithic ◽

Modern Humans ◽

Genome Wide ◽

Genome Wide Data

Abstract Modern humans appeared in Europe by at least 45,000 years ago [1–5], but the extent of their interactions with Neanderthals, who disappeared by about 40,000 years ago [6], and their relationship to the broader expansion of modern humans outside Africa are poorly understood. Here we present genome-wide data from three individuals dated to between 45,930 and 42,580 years ago from Bacho Kiro Cave, Bulgaria [1,2]. They are the earliest Late Pleistocene modern humans known to have been recovered in Europe so far, and were found in association with an Initial Upper Palaeolithic artefact assemblage. Unlike two previously studied individuals of similar ages from Romania [7] and Siberia [8] who did not contribute detectably to later populations, these individuals are more closely related to present-day and ancient populations in East Asia and the Americas than to later west Eurasian populations. This indicates that they belonged to a modern human migration into Europe that was not previously known from the genetic record, and provides evidence that there was at least some continuity between the earliest modern humans in Europe and later people in Eurasia. Moreover, we find that all three individuals had Neanderthal ancestors a few generations back in their family history, confirming that the first European modern humans mixed with Neanderthals and suggesting that such mixing could have been common.


Genome-Wide Patterns of Homozygosity Reveal the Conservation Status in Five Italian Goat Populations

Animals ◽

10.3390/ani11061510 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1510

Author(s):

Salvatore Mastrangelo ◽

Rosalia Di Gerlando ◽

Maria Teresa Sardina ◽

Anna Maria Sutera ◽

Angelo Moscarelli ◽

...

Keyword(s):

Conservation Status ◽

Phenotypic Traits ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Local Populations ◽

Genomic Technologies ◽

Fitness Traits ◽

Genome Wide ◽

Breeding Schemes ◽

Genomic Inbreeding

The application of genomic technologies has facilitated the assessment of genomic inbreeding based on single nucleotide polymorphisms (SNPs). In this study, we computed several runs of homozygosity (ROH) parameters to investigate the patterns of homozygosity using Illumina Goat SNP50 in five Italian local populations: Argentata dell’Etna (N = 48), Derivata di Siria (N = 32), Girgentana (N = 59), Maltese (N = 16) and Messinese (N = 22). The ROH results showed well-defined differences among the populations. A total of 3687 ROH segments >2 Mb were detected in the whole sample. The Argentata dell’Etna and Messinese were the populations with the lowest mean number of ROH and inbreeding coefficient values, which reflect admixture and gene flow. In the Girgentana, we identified an ROH pattern related to recent inbreeding that can endanger the viability of the breed due to reduced population size. The genomes of Derivata di Siria and Maltese breeds showed the presence of long ROH (>16 Mb) that could seriously impact the overall biological fitness of these breeds. Moreover, the results confirmed that ROH parameters are in agreement with the known demography of these populations and highlighted the different selection histories and breeding schemes of these goat populations. In the analysis of ROH islands, we detected genes involved in important traits, such as milk yield, reproduction, and immune response, which are consistent with the phenotypic traits of the studied goat populations. Finally, the results of this study can be used for implementing conservation programs for these local populations in order to avoid further loss of genetic diversity and to preserve the production and fitness traits. In view of this, the availability of genomic data is a fundamental resource.
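To make the ROH terminology concrete, the sketch below finds runs of consecutive homozygous SNPs on one chromosome and derives an ROH-based inbreeding coefficient. It is a deliberately simplified illustration, not the pipeline used in the study: the function names (`find_roh`, `f_roh`), the genotype coding (0 and 2 homozygous, 1 heterozygous) and the toy coordinates are assumptions, and dedicated ROH tools additionally handle missing calls, tolerated heterozygotes and SNP density. The coefficient uses the common definition of F_ROH as the total ROH length divided by the genome length covered; whether the study used exactly this definition is also an assumption.

```python
def find_roh(positions, genotypes, min_length_bp=2_000_000):
    """Return (start, end) coordinates of maximal runs of homozygous SNPs
    longer than min_length_bp.

    positions: sorted base-pair coordinates of SNPs on one chromosome
    genotypes: one call per SNP, 0 or 2 = homozygous, 1 = heterozygous
    """
    runs = []
    start = None   # first SNP of the current homozygous run
    last = None    # most recent homozygous SNP in the current run
    for pos, g in zip(positions, genotypes):
        if g != 1:
            if start is None:
                start = pos
            last = pos
        else:
            # A heterozygous call ends the current run.
            if start is not None and last - start > min_length_bp:
                runs.append((start, last))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and last - start > min_length_bp:
        runs.append((start, last))
    return runs


def f_roh(runs, genome_length_bp):
    """Fraction of the covered genome that lies inside ROH segments."""
    return sum(end - start for start, end in runs) / genome_length_bp


# Hypothetical toy chromosome of 10 Mb with seven SNPs.
positions = [0, 1_000_000, 2_500_000, 3_000_000, 4_000_000, 5_000_000, 8_000_000]
genotypes = [0, 2, 0, 1, 2, 2, 0]

runs = find_roh(positions, genotypes)
print(runs)                       # two runs, 2.5 Mb and 4 Mb long
print(f_roh(runs, 10_000_000))    # about 0.65
```

The default threshold mirrors the ">2 Mb" cut-off quoted in the abstract; passing `min_length_bp=16_000_000` would restrict the output to the long segments the abstract associates with reduced fitness.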
