ntEdit: scalable genome assembly polishing

2019 ◽  
Author(s):  
René L Warren ◽  
Lauren Coombe ◽  
Hamid Mohamadi ◽  
Jessica Zhang ◽  
Barry Jaquish ◽  
...  

Abstract: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled E. coli and C. elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20X), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14s and <3m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20X coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40m on those sequences. We show how ntEdit ran in <2h20m to improve upon long and linked read human genome assemblies of NA12878, using high coverage (54X) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gbp interior and white spruce genomes in <4 and <5h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability: https://github.com/bcgsc/ntedit Supplemental material: available online.

2019 ◽  
Vol 35 (21) ◽  
pp. 4430-4432 ◽  
Author(s):  
René L Warren ◽  
Lauren Coombe ◽  
Hamid Mohamadi ◽  
Jessica Zhang ◽  
Barry Jaquish ◽  
...  

Abstract. Motivation: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation: https://github.com/bcgsc/ntedit Supplementary information: Supplementary data are available at Bioinformatics online.
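
The abstract describes ntEdit only as "Bloom filter-based", so the following is a minimal Python sketch of the general idea rather than ntEdit's actual implementation: k-mers from sequencing reads are stored in a Bloom filter, and draft-assembly k-mers missing from the filter are flagged as candidate error sites. The class `BloomFilter`, the helpers `kmers`, `build_filter` and `unsupported_positions`, the filter size, the hash scheme and the choice of k are all hypothetical, chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (illustrative; not ntEdit's data structure)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_filter(reads, k):
    """Insert every k-mer of every read into a fresh Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def unsupported_positions(draft, bf, k):
    """Start positions of draft k-mers that are absent from the read filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bf]


if __name__ == "__main__":
    reads = ["ACGTTGCAAGGCTA"]          # assumed error-free read
    draft = "ACGTTGCTAGGCTA"            # same sequence with one substitution
    bf = build_filter(reads, k=5)
    print(unsupported_positions(draft, bf, k=5))
```

A run of consecutive unsupported k-mers points to the base they share, which is where a polishing tool would try alternative bases or indels. Because a Bloom filter stores only bit patterns and not the k-mers themselves, memory stays bounded as read sets grow, at the cost of occasional false positives.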


2017 ◽  
Vol 114 (27) ◽  
pp. E5379-E5388 ◽  
Author(s):  
Jaebum Kim ◽  
Marta Farré ◽  
Loretta Auvil ◽  
Boris Capitanu ◽  
Denis M. Larkin ◽  
...  

Whole-genome assemblies of 19 placental mammals and two outgroup species were used to reconstruct the order and orientation of syntenic fragments in chromosomes of the eutherian ancestor and six other descendant ancestors leading to human. For ancestral chromosome reconstructions, we developed an algorithm (DESCHRAMBLER) that probabilistically determines the adjacencies of syntenic fragments using chromosome-scale and fragmented genome assemblies. The reconstructed chromosomes of the eutherian, boreoeutherian, and euarchontoglires ancestor each included >80% of the entire length of the human genome, whereas reconstructed chromosomes of the most recent common ancestor of simians, catarrhini, great apes, and humans and chimpanzees included >90% of human genome sequence. These high-coverage reconstructions permitted reliable identification of chromosomal rearrangements over ∼105 My of eutherian evolution. Orangutan was found to have eight chromosomes that were completely conserved in homologous sequence order and orientation with the eutherian ancestor, the largest number for any species. Ruminant artiodactyls had the highest frequency of intrachromosomal rearrangements, and interchromosomal rearrangements dominated in murid rodents. A total of 162 chromosomal breakpoints in evolution of the eutherian ancestral genome to the human genome were identified; however, the rate of rearrangements was significantly lower (0.80/My) during the first ∼60 My of eutherian evolution, then increased to greater than 2.0/My along the five primate lineages studied. Our results significantly expand knowledge of eutherian genome evolution and will facilitate greater understanding of the role of chromosome rearrangements in adaptation, speciation, and the etiology of inherited and spontaneously occurring diseases.


2020 ◽  
Vol 8 (11) ◽  
pp. 1648
Author(s):  
Toni L. Poole ◽  
Wayne D. Schlosser ◽  
Robin C. Anderson ◽  
Keri N. Norman ◽  
Ross C. Beier ◽  
...  

Aeromonas hydrophila are ubiquitous in the environment and widely distributed in aquatic habitats. They have long been known as fish pathogens but are also opportunistic human pathogens. Aeromonas spp. have persisted through food-processing safeguards and have been isolated from fresh grocery vegetables, dairy, beef, pork, poultry products and packaged ready-to-eat meats, thus providing an avenue to foodborne illness. A beta-hemolytic, putative Escherichia coli strain collected from diarrheic neonatal pigs in Oklahoma was subsequently identified as A. hydrophila and designated CVM861. Here we report the whole-genome sequence of A. hydrophila CVM861 (SRA accession number SRR12574563; BioSample number SAMN1590692; GenBank accession number SRX9061579). The sequence data for CVM861 revealed four Aeromonas-specific virulence genes: lipase (lip), hemolysin (hlyA), cytotonic enterotoxin (ast) and phospholipid-cholesterol acyltransferase (GCAT). There were no alignments to any virulence genes in VirulenceFinder. CVM861 contained an E. coli resistance plasmid identified as IncQ1_1__M28829. There were five aminoglycoside, three beta-lactam, and one each of macrolide, phenicol, sulfonamide, tetracycline and trimethoprim resistance genes, all with over 95% identity to genes in the ResFinder database. Additionally, there were 36 alignments to mobile genetic elements using MobileElementFinder. This shows that an aquatic pathogen, rarely considered in human disease, contributes to the resistome reservoir and may be capable of transferring resistance and virulence genes to other more prevalent foodborne strains such as E. coli or Salmonella in swine or other food production systems.


Blood ◽  
2010 ◽  
Vol 116 (21) ◽  
pp. SCI-16-SCI-16
Author(s):  
Eric D. Green

Abstract SCI-16. The Human Genome Project's completion of the human genome sequence in 2003 was a landmark scientific achievement of historic significance. It also signified a critical transition for the field of genomics, as the new foundation of genomic knowledge started to be used in powerful ways by researchers and clinicians to tackle increasingly complex problems in biomedicine. To exploit the opportunities provided by the human genome sequence and to ensure the productive growth of genomics as one of the most vital biomedical disciplines of the 21st century, the National Human Genome Research Institute (NHGRI) is pursuing a broad vision for genomics research beyond the Human Genome Project. This vision includes facilitating and supporting the highest-priority research areas that interconnect genomics to biology, to health, and to society. Current efforts in genomics research are focused on using genomic data, technologies, and insights to acquire a deeper understanding of biology and to uncover the genetic basis of human disease. Some of the most profound advances are being catalyzed by revolutionary new DNA sequencing technologies; these methods are already producing prodigious amounts of DNA sequence data, including from large numbers of individual patients. Such a capability, coupled with better associations between genetic diseases and specific regions of the human genome, is accelerating our understanding of the genetic basis for complex genetic disorders and for drug response. Together, these developments will usher in the era of genomic medicine. Disclosures: No relevant conflicts of interest to declare.


Yeast ◽  
2000 ◽  
Vol 1 (1) ◽  
pp. 43-47

Alan Coulson has two main roles at the Sanger Centre, revolving around the worm and the human genome projects. Although the worm sequence is essentially finished, the tidying-up of that and the physical map is ongoing. There is also a continuous need for communication with the worm field with regard to information and materials relating to the sequence project. For example, the cosmids and YACs of the physical map continue to be, as they have been for many years now, an extremely powerful resource, and the Sanger Centre distributes in the order of 500 clones per month to the community. Alan is team leader of the worm functional genomics group, which is currently small but will be expanding shortly. Patricia Kuwabara is a member of the team and a description of their activities can be found below. The Human Genome Project is sequencing mapped PAC and BAC clones. Alan's primary involvement is with the team that is responsible for subcloning the 10 000 or so clones that will be required to complete the one-third of the genome sequence to be contributed by the Sanger Centre. Patricia Kuwabara has been using Caenorhabditis elegans as a model for understanding how protein–protein interactions regulate cell-to-cell signalling. Her research has focused on understanding the molecular mechanisms underlying the genetics of C. elegans sex determination. This work has led into a study of regulated proteolysis involving calpains and also into the roles of the multiple C. elegans Patched proteins, which in other organisms have been shown to be receptors for the Hedgehog morphogen. In addition, the group is taking advantage of the completion of the C. elegans genome sequence to develop whole genome DNA microarrays for expression profiling. At the Sanger Centre, DNA microarrays are providing opportunities to examine how development and physiology are regulated globally, because most nematode genes have now been identified at the sequence level. The group are being assisted in this endeavour by Dr Stuart Kim (Stanford, CA).


2009 ◽  
Vol 10 (9) ◽  
pp. R94 ◽  
Author(s):  
Scott DiGuistini ◽  
Nancy Y Liao ◽  
Darren Platt ◽  
Gordon Robertson ◽  
Michael Seidel ◽  
...  

2021 ◽  
Vol 26 ◽  
pp. e983
Author(s):  
Susanne Hollmann ◽  
Babette Regierer ◽  
Teresa K Attwood ◽  
Andreas Gisel ◽  
Jacques Van Helden ◽  
...  

The completion of the human genome sequence triggered worldwide efforts to unravel the secrets hidden in its deceptively simple code. Numerous bioinformatics projects were undertaken to hunt for genes, predict their protein products, function and post-translational modifications, analyse protein-protein interactions, etc. Many novel analytic and predictive computer programmes fully optimised for manipulating human genome sequence data have been developed, whereas considerably less effort has been invested in exploring the many thousands of other available genomes, from unicellular organisms to plants and non-human animals. Nevertheless, a detailed understanding of these organisms can have a significant impact on human health and well-being. New advances in genome sequencing technologies, bioinformatics, automation, artificial intelligence, etc., enable us to extend the reach of genomic research to all organisms. To this end, gathering, developing and implementing new bioinformatics solutions (usually in the form of software) is pivotal. A helpful model, often used by the bioinformatics community, is the so-called hackathon: an event at which stakeholders from different disciplines work together creatively to solve a problem. During its runtime, the consortium of the EU-funded project AllBio - Broadening the Bioinformatics Infrastructure to cellular, animal and plant science - conducted many successful hackathons with researchers from different Life Science areas. Based on this experience, the authors present a step-by-step, standardised workflow explaining how to organise a bioinformatics hackathon to develop software solutions to biological problems.

