Effective normalization for copy number variation in Hi-C data

2017 ◽  
Author(s):  
N. Servant ◽  
N. Varoquaux ◽  
E. Heard ◽  
JP. Vert ◽  
E. Barillot

Abstract: Normalization is essential to ensure accurate analysis and proper interpretation of sequencing data. Chromosome conformation data, such as Hi-C, are no different. The most widely used type of normalization of Hi-C data casts the estimation of unwanted effects as a matrix balancing problem, relying on the assumption that all genomic regions interact as much as any other. Here, we show that these approaches, while very effective on fully haploid or diploid genomes, fail to correct for unwanted effects in the presence of copy number variations. We propose a simple extension to matrix balancing methods that properly models the copy-number variation effects. Our approach can either retain the copy-number variation effects or remove them. We show that this leads to better downstream analysis of the three-dimensional organization of rearranged genomes.
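The matrix balancing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or their copy-number extension; it is a generic symmetric, Sinkhorn-style iteration, and the function name `balance_matrix` and its parameters are hypothetical. It assumes a symmetric, non-negative contact matrix and rescales it until every non-empty bin has the same total number of contacts, which is the "equal visibility" assumption the abstract describes.

```python
import numpy as np


def balance_matrix(counts, max_iter=1000, tol=1e-8):
    """Balance a symmetric contact matrix so all non-empty rows have equal sums.

    Hypothetical helper for illustration only. Returns the balanced matrix
    and the per-bin bias vector, with balanced = counts / outer(bias, bias).
    """
    balanced = np.array(counts, dtype=float)  # work on a float copy
    if balanced.ndim != 2 or balanced.shape[0] != balanced.shape[1]:
        raise ValueError("counts must be a square matrix")

    n_bins = balanced.shape[0]
    bias = np.ones(n_bins)

    for _ in range(max_iter):
        row_sums = balanced.sum(axis=1)
        nonzero = row_sums > 0  # empty bins are left untouched

        # Square root of the relative row sum: a damped update that is
        # applied to both rows and columns, which keeps the matrix symmetric.
        scale = np.ones(n_bins)
        scale[nonzero] = np.sqrt(row_sums[nonzero] / row_sums[nonzero].mean())

        balanced /= np.outer(scale, scale)
        bias *= scale

        if np.abs(scale - 1.0).max() < tol:
            break

    return balanced, bias
```

Under this sketch, a bin covering an amplified region would simply be scaled down like any other high-coverage bin, which is why, as the abstract argues, plain balancing cannot separate copy-number effects from other biases.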

2020 ◽  
Vol 160 (11-12) ◽  
pp. 634-642
Author(s):  
Shiqiang Luo ◽  
Xingyuan Chen ◽  
Tizhen Yan ◽  
Jiaolian Ya ◽  
Zehui Xu ◽  
...  

High-throughput sequencing based on copy number variation (CNV-seq) is commonly used to detect chromosomal abnormalities. This study identifies chromosomal abnormalities in aborted embryos/fetuses in early and middle pregnancy and explores the application value of CNV-seq in determining the causes of pregnancy termination. High-throughput sequencing was used to detect chromosome copy number variations (CNVs) in 116 aborted embryos in early and middle pregnancy. The detection data were compared with the Database of Genomic Variants (DGV), the Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER), and the Online Mendelian Inheritance in Man (OMIM) database to determine the CNV type and the clinical significance. High-throughput sequencing results were successfully obtained in 109 out of 116 specimens, with a detection success rate of 93.97%. In brief, there were 64 cases with abnormal chromosome numbers and 23 cases with CNVs, in which 10 were pathogenic mutations and 13 were variants of uncertain significance. An abnormal chromosome number is the most important reason for embryo termination in early and middle pregnancy, followed by pathogenic chromosome CNVs. CNV-seq can quickly and accurately detect chromosome abnormalities and identify microdeletion and microduplication CNVs that cannot be detected by conventional chromosome analysis, which is convenient and efficient for genetic etiology diagnosis in miscarriage.


2018 ◽  
Vol 33 (4) ◽  
pp. 540-544 ◽  
Author(s):  
Samanta Salvi ◽  
Valentina Casadio ◽  
Filippo Martignano ◽  
Giorgia Gurioli ◽  
Maria Maddalena Tumedei ◽  
...  

Background: We report a case of prostatic carcinosarcoma, a rare variant of prostatic cancer, which is composed of a mixture of epithelial and mesenchymal components with a generally poor outcome. Aims and methods: We aim to identify molecular alterations, in particular copy number variations of AR and c-MYC genes, methylation and expression of glutathione S-transferase P1 (GSTP1), programmed death-ligand 1 (PD-L1), AR, and phosphorylated AR expression. Results: We found a distinct molecular pattern between adenocarcinoma and carcinosarcoma, which was characterized by high AR copy number variation gain; positive expression of PD-L1, AR, and phosphorylated AR; and low expression of GSTP1 in the epithelial component. The sarcomatoid component had a lower gain of the AR gene, and no expression of PD-L1, AR, phosphorylated AR, or GSTP1. Both components had a gain of c-MYC copy number variation. Conclusions: Our findings suggest that carcinosarcoma has specific molecular characteristics that could be indicative for early diagnosis and treatment selection.


2020 ◽  
Author(s):  
Christopher W. Whelan ◽  
Robert E. Handsaker ◽  
Giulio Genovese ◽  
Seva Kashin ◽  
Monkol Lek ◽  
...  

Abstract: Two intriguing forms of genome structural variation (SV) – dispersed duplications, and de novo rearrangements of complex, multi-allelic loci – have long escaped genomic analysis. We describe a new way to find and characterize such variation by utilizing identity-by-descent (IBD) relationships between siblings together with high-precision measurements of segmental copy number. Analyzing whole-genome sequence data from 706 families, we find hundreds of “IBD-discordant” (IBDD) CNVs: loci at which siblings’ CNV measurements and IBD states are mathematically inconsistent. We found that commonly-IBDD CNVs identify dispersed duplications; we mapped 95 of these common dispersed duplications to their true genomic locations through family-based linkage and population linkage disequilibrium (LD), and found several to be in strong LD with genome-wide association (GWAS) signals for common diseases or gene expression variation at their revealed genomic locations. Other CNVs that were IBDD in a single family appear to involve de novo mutations in complex and multi-allelic loci; we identified 26 de novo structural mutations that had not been previously detected in earlier analyses of the same families by diverse SV analysis methods. These included a de novo mutation of the amylase gene locus and multiple de novo mutations at chromosome 15q14. Combining these complex mutations with more-conventional CNVs, we estimate that segmental mutations larger than 1kb arise in about one per 22 human meioses. These methods are complementary to previous techniques in that they interrogate genomic regions that are home to segmental duplication, high CNV allele frequencies, and multi-allelic CNVs.

Author Summary: Copy number variation is an important form of genetic variation in which individuals differ in the number of copies of segments of their genomes. Certain aspects of copy number variation have traditionally been difficult to study using short-read sequencing data. For example, standard analyses often cannot tell whether the duplicated copies of a segment are located near the original copy or are dispersed to other regions of the genome. Another aspect of copy number variation that has been difficult to study is the detection of mutations in the copy number of DNA segments passed down from parents to their children, particularly when the mutations affect genome segments which already display common copy number variation in the population. We develop an analytical approach to solving these problems when sequencing data is available for all members of families with at least two children. This method is based on determining the number of parental haplotypes the two siblings share at each location in their genome, and using that information to determine the possible inheritance patterns that might explain the copy numbers we observe in each family member. We show that dispersed duplications and mutations can be identified by looking for copy number variants that do not follow these expected inheritance patterns. We use this approach to determine the location of 95 common duplications which are dispersed to distant regions of the genome, and demonstrate that these duplications are linked to genetic variants that affect disease risk or gene expression levels. We also identify a set of copy number mutations not detected by previous analyses of sequencing data from a large cohort of families, and show that repetitive and complex regions of the genome undergo frequent mutations in copy number.
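The consistency check described in the summary can be sketched by brute force. This is a simplified illustration and not the authors' method: it assumes exact integer copy numbers for a father, a mother and two siblings, no measurement noise, and no de novo mutation, and the function names `haplotype_splits` and `is_ibd_consistent` are hypothetical. It enumerates every way the parental copies could be distributed over their two haplotypes and asks whether any inheritance pattern with the stated IBD state reproduces the siblings' copy numbers.

```python
from itertools import product


def haplotype_splits(total):
    """All ways to split a diploid copy number across two haplotypes."""
    return [(a, total - a) for a in range(total + 1)]


def is_ibd_consistent(father_cn, mother_cn, sib1_cn, sib2_cn, ibd_state):
    """Return True if some inheritance pattern explains the observed copy numbers.

    ibd_state is the number of parental haplotypes the siblings share at the
    locus (0, 1 or 2). Hypothetical helper for illustration only.
    """
    if ibd_state not in (0, 1, 2):
        raise ValueError("ibd_state must be 0, 1 or 2")

    for f_haps, m_haps in product(
        haplotype_splits(father_cn), haplotype_splits(mother_cn)
    ):
        # i, j: which paternal / maternal haplotype sibling 1 inherits;
        # k, l: the same choices for sibling 2.
        for i, j, k, l in product((0, 1), repeat=4):
            shared = (i == k) + (j == l)
            if shared != ibd_state:
                continue
            if (
                f_haps[i] + m_haps[j] == sib1_cn
                and f_haps[k] + m_haps[l] == sib2_cn
            ):
                return True
    return False
```

In this simplified model, a locus for which the function returns False would be "IBD-discordant": for example, siblings who share both parental haplotypes (IBD state 2) but show different copy numbers cannot be explained without a dispersed duplication, a de novo mutation, or a measurement error.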


2020 ◽  
Author(s):  
Getiria Onsongo ◽  
Ham Ching Lam ◽  
Matthew Bower ◽  
Bharat Thyagarajan

Abstract. Objective: Detection of small copy number variations (CNVs) in clinically relevant genes is routinely being used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting small clinically relevant CNVs. CNV-RF was designed for small gene panels and did not scale well to large gene panels. On large gene panels, CNV-RF routinely failed due to memory limitations. When successful, it took about 2 days to complete a single analysis, making it impractical for routinely analyzing large gene panels. We need a reliable tool capable of detecting CNVs in the clinic that scales well to large gene panels. Results: We have developed Hadoop-CNV-RF, a scalable implementation of CNV-RF. Hadoop-CNV-RF is a freely available tool capable of rapidly analyzing large gene panels. It takes advantage of Hadoop, a big data framework developed to analyze large amounts of data. Preliminary results show it reduces analysis time from about 2 days to less than 4 hours and can seamlessly scale to large gene panels. Hadoop-CNV-RF has been clinically validated for targeted capture data and is currently being used in a CLIA molecular diagnostics laboratory. The tool and its usage instructions are publicly available at: https://github.com/getiria-onsongo/hadoop-cnvrf-public


2019 ◽  
Vol 133 (3) ◽  
pp. 951-966 ◽  
Author(s):  
Maria Kyriakidou ◽  
Sai Reddy Achakkagari ◽  
José Héctor Gálvez López ◽  
Xinyi Zhu ◽  
Chen Yu Tang ◽  
...  

Key message: Twelve potato accessions were selected to represent two principal views on potato taxonomy. The genomes were sequenced and analyzed for structural variation (copy number variation) against three published potato genomes. Abstract: The common potato (Solanum tuberosum L.) is an important staple crop with a highly heterozygous and complex tetraploid genome. The other taxa of cultivated potato contain varying ploidy levels (2X–5X), and structural variations are common in the genomes of these species, likely contributing to the diversification or agronomic traits during domestication. Increased understanding of the genomes and genomic variation will aid in the exploration of novel agronomic traits. Thus, sequencing data from twelve potato landraces, representing the four ploidy levels, were used to identify structural genomic variation compared to the two currently available reference genomes, a double monoploid potato genome and a diploid inbred clone of S. chacoense. The results of a copy number variation analysis showed that in the majority of the genomes, while the number of deletions is greater than the number of duplications, the number of duplicated genes is greater than the number of deleted ones. Specific regions in the twelve potato genomes have a high density of CNV events. Further, the auxin-induced SAUR genes (involved in abiotic stress), disease resistance genes and the 2-oxoglutarate/Fe(II)-dependent oxygenase superfamily proteins, among others, had increased copy numbers in these sequenced genomes relative to the references.


2019 ◽  
Author(s):  
Matthew Aguirre ◽  
Manuel Rivas ◽  
James Priest

Abstract: Copy number variations (CNVs) represent a significant proportion of the genetic differences between individuals, and many CNVs associate causally with syndromic disease and clinical outcomes. Here, we characterize the landscape of copy number variation and its phenome-wide effects in a sample of 472,228 array-genotyped individuals from the UK Biobank. In addition to population-level selection effects against genic loci conferring high mortality, we describe genetic burden from syndromic and previously uncharacterized CNV loci across nearly 2,000 quantitative and dichotomous traits, with separate analyses for common and rare classes of variation. Specifically, we highlight the effects of CNVs at two well-known syndromic loci 16p11.2 and 22q11.2, as well as novel associations at 9p23, in the context of acute coronary artery disease and high body mass index. Our data constitute a deeply contextualized portrait of population-wide burden of copy number variation, as well as a series of known and novel dosage-mediated genic associations across the medical phenome.


2017 ◽  
Author(s):  
Yuchao Jiang ◽  
Rujin Wang ◽  
Eugene Urrutia ◽  
Ioannis N. Anastopoulos ◽  
Katherine L. Nathanson ◽  
...  

Abstract: High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods, but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.


2020 ◽  
Author(s):  
Yihang Shen ◽  
Carl Kingsford

Abstract: Three-dimensional chromosomal structure plays an important role in gene regulation. Chromosome conformation capture techniques, especially the high-throughput, sequencing-based technique Hi-C, provide new insights on spatial architectures of chromosomes. However, Hi-C data contains artifacts and systemic biases that substantially influence subsequent analysis. Computational models have been developed to address these biases explicitly; however, it is difficult to enumerate and eliminate all the biases in models. Other models are designed to correct biases implicitly, but they are also invalid in some situations, such as in the presence of copy number variations. We characterize a new kind of artifact in Hi-C data. We find that this artifact is caused by incorrect alignment of Hi-C reads against approximate repeat regions and can lead to erroneous chromatin contact signals. The artifact cannot be corrected by current Hi-C correction methods. We design a probabilistic method and develop a new Hi-C processing pipeline by integrating our probabilistic method with the HiC-Pro pipeline. We find that the new pipeline can remove this new artifact effectively, while preserving important features of the original Hi-C matrices.

