A massively parallel algorithm for finding non-existing sequences in genomes

2019 ◽  
Author(s):  
Marco Falda

Abstract: We discuss a method for producing a set of absent words in a reference genome with a guaranteed Hamming distance along all positions, together with additional information about the number of mismatches, their location, and the position of the best match. We implemented it by exploiting the massive parallelism of modern GPU hardware; the code is available at https://bitbucket.org/mfalda/cuda_keeseek/.
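
As a rough illustration of the idea in this abstract, the brute-force Python sketch below enumerates every word of length k and keeps those whose best match anywhere in a genome still has at least a chosen number of mismatches. The function names (`hamming`, `best_match`, `absent_words`) and the result layout are assumptions made for illustration; this is not the cuda_keeseek implementation, which runs on GPUs and is far more efficient.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(word, genome):
    """Return (fewest mismatches, position of that window) for `word` in `genome`."""
    k = len(word)
    best = (k + 1, -1)  # sentinel: worse than any real window
    for i in range(len(genome) - k + 1):
        d = hamming(word, genome[i:i + k])
        if d < best[0]:
            best = (d, i)
    return best


def absent_words(genome, k, min_dist, alphabet="ACGT"):
    """List (word, mismatches, best position) for words at least `min_dist` from every window."""
    found = []
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        dist, pos = best_match(word, genome)
        if dist >= min_dist:
            found.append((word, dist, pos))
    return found
```

For example, `absent_words("AAAAAA", 3, 3)` returns the 27 three-letter words that contain no `A`, each with three mismatches against the best window. The cost grows as 4^k times the genome length, which is why a massively parallel implementation is attractive.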

2016 ◽  
Author(s):  
Logan Kistler ◽  
Stephen M. Johnson ◽  
Mitchell T. Irwin ◽  
Edward E. Louis ◽  
Aakrosh Ratan ◽  
...  

Abstract: Short tandem repeat (STR, or microsatellite) variants are highly polymorphic markers that facilitate powerful, high-precision population genetic analyses. STRs are especially valuable in conservation and ecological genetic research, yielding detailed information on population structure and short-term demographic flux. However, STR marker development and analysis by conventional PCR-based methods impose a workflow bottleneck and are suboptimal for noninvasive sampling strategies such as fecal DNA recovery. While massively parallel sequencing has not previously been leveraged for scalable, efficient STR recovery, here we present a pipeline for developing STR markers directly from high-throughput shotgun sequencing data without requiring a reference genome assembly, and a methodological approach for highly parallel recovery of enriched STR loci. We first employed our approach to design and capture a panel of 5,000 STR loci from a test group of diademed sifakas (Propithecus diadema, n=3), endangered Malagasy rainforest lemurs, and we report extremely efficient recovery of targeted loci: 97.3-99.6% of STRs were characterized with ≥10x non-redundant coverage. Second, we tested our STR capture strategy on a P. diadema fecal DNA preparation, and report robust initial results and methodological suggestions for future implementations. In addition to STR targets, this approach also generates large, genome-wide single nucleotide polymorphism (SNP) panels from regions flanking the STR loci. Our method provides a cost-effective and highly scalable solution for rapid recovery of large STR and SNP datasets in any species without need for a reference genome, and can be used even with suboptimal DNA, which is more easily acquired in conservation and ecological genetic studies.
Data Deposition: Raw sequencing data are available under Study Accession numbers SRP073167 (genomic shotgun data for Oberon and Tatiana) and SRP076225 (targeted re-sequencing data) from the NCBI Sequence Read Archive. BaitSTR software is available on GitHub (core BaitSTR programs: https://github.com/aakrosh/BaitSTR; BaitSTR_type.pl companion script for genotyping and block manipulation: https://github.com/lkistler/BaitSTR_type).
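
To make the notion of locating STRs directly in unassembled reads concrete, here is a minimal Python sketch that scans a single read for tandem repeats with a regular expression. The motif length range (2 to 6 bases), the minimum of four copies, and the function name `find_strs` are assumptions chosen for illustration; this is not how BaitSTR works internally.

```python
import re

# A motif of 2-6 bases followed by at least three further copies of itself.
# These thresholds are illustrative assumptions, not values from the paper.
STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{3,}")


def find_strs(read):
    """Return (start, motif, copies) for each tandem repeat found in `read`."""
    hits = []
    for m in STR_PATTERN.finditer(read.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        copies = len(m.group(0)) // len(motif)
        hits.append((m.start(), motif, copies))
    return hits
```

Calling `find_strs("TT" + "AC" * 5 + "GG")` reports one repeat starting at offset 2 with motif `AC` and five copies. The sequence on either side of such a hit is the flanking region from which, as the abstract notes, SNPs can also be recovered.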


HortScience ◽  
1998 ◽  
Vol 33 (3) ◽  
pp. 552e-552
Author(s):  
James L. Green

In 1997, the ASHS Board of Directors established ASHS HortBase as a Standing Committee of the Society. The ASHS HortBase Committee, a six-member Standing Committee and Chair, is charged to implement and maintain ASHS HortBase. The members of the ASHS HortBase Committee will be the chair and chair-elect of each of the three HortBase Task Forces: 1) Finance and Marketing; 2) Standards (authoring, reviewing, and publishing); and 3) Technology. ASHS HortBase is a dispersed, dynamic horticultural information system (network) on the WWW composed of peer-reviewed, concise, interlinked information modules to meet the information needs of instructors and students, gardeners and growers. A strong advantage and distinguishing characteristic of ASHS HortBase is our dynamic pool of potential authors, reviewers, and users (ASHS Extension, Industry, and Teaching membership) who continually evolve and update the peer-reviewed information in HortBase. We have the international scholastic standing to provide peer review and validation of the information and to give recognition to the authors, coupled with the marketing to stimulate wide use of their information modules. ASHS HortBase is a dispersed system: information files are developed, updated, and delivered on the respective authors' own servers, which spreads the major development and server costs of the HortBase information system across contributors. Additional information on ASHS HortBase and the papers presented at the 4-h Colloquium on HortBase at ASHS-97 can be found at http://[email protected] or by contacting me ([email protected], phone 541.737.5452, fax 541.737.3479).


Author(s):  
Jorg Keller ◽  
Gabriele Spenger ◽  
Steffen Wendzel

We present and motivate a parallel algorithm to compute promising candidate states for modifying the state space of a pseudo-random number generator in order to increase its cycle length. This is important for generators in low-power devices, where enlarging the state space to achieve longer cycles is not an option. The runtime of the parallel algorithm is improved by an analogy to ant colony behavior: if two paths meet, the resulting path is followed at accelerated speed, just as ants tend to reinforce paths that have been used by other ants. We evaluate our algorithm with simulations and demonstrate high parallel efficiency that makes the algorithm well-suited even for massively parallel systems like GPUs. Furthermore, the accelerated path variant of the algorithm achieves a runtime improvement of up to 4% over the straightforward implementation.
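
The quantity at stake here, the cycle length reached from a given state, can be made concrete with a small sequential sketch. The Python code below uses Brent's cycle-detection algorithm, a standard textbook technique, on a toy state-transition function; it is not the parallel algorithm of the paper, and the toy generators in the usage note are assumptions picked only because their cycles are easy to check by hand.

```python
def brent(f, x0):
    """Return (cycle_length, tail_length) of the sequence x0, f(x0), f(f(x0)), ..."""
    # Phase 1: find the cycle length using exponentially growing search windows.
    power = lam = 1
    tortoise = x0
    hare = f(x0)
    while tortoise != hare:
        if power == lam:      # start a new window, twice as long
            tortoise = hare
            power *= 2
            lam = 0
        hare = f(hare)
        lam += 1

    # Phase 2: find the tail length (steps taken before entering the cycle).
    tortoise = hare = x0
    for _ in range(lam):
        hare = f(hare)
    mu = 0
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(hare)
        mu += 1
    return lam, mu
```

With the toy generator `lambda x: (5 * x + 3) % 16` and start state 0, `brent` reports a cycle of length 16 and no tail, because this map visits every state. With `lambda x: (x * x) % 10` and start state 2, the path 2, 4, 6, 6, ... gives a cycle of length 1 after a tail of 2. Short cycles of the second kind are what a modification of the state space would aim to avoid.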


2021 ◽  
Author(s):  
Michał Stolarczyk ◽  
Bingjie Xue ◽  
Nathan C. Sheffield

Genome analysis relies on reference data like sequences, feature annotations, and aligner indexes. These data can be found in many versions from many sources, making it challenging to identify and assess compatibility among them. For example, how can you determine which indexes are derived from identical raw sequence files, or which annotations share a compatible coordinate system? Here, we describe a novel approach to establish identity and compatibility of reference genome resources. We approach this with three advances: First, we derive unique identifiers for each resource; second, we record parent-child relationships among resources; and third, we describe recursive identifiers that determine identity as well as compatibility of coordinate systems and sequence names. These advances facilitate portability, reproducibility, and re-use of genome reference data.
Availability: https://refgenie.databio.org
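
One way to picture content-derived, recursive identifiers is sketched below in Python: each sequence is hashed, and the per-sequence digests are hashed again to identify the whole collection, while a separate digest over names and lengths captures the coordinate system. The hash function (truncated SHA-256), the string layout, and the names `sequence_digest` and `genome_identifiers` are assumptions for illustration and do not reproduce the scheme used by refgenie.

```python
import hashlib


def digest(text):
    """Truncated SHA-256 hex digest of a string (an illustrative choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:24]


def sequence_digest(seq):
    """Identifier derived from sequence content alone, ignoring letter case."""
    return digest(seq.upper())


def genome_identifiers(genome):
    """Compute two digests for a genome given as a dict of name -> sequence.

    'sequences' depends only on sequence content, so renaming chromosomes
    leaves it unchanged. 'coordinates' depends only on names and lengths,
    so two genomes that share it use a compatible coordinate system.
    """
    seq_digests = sorted(sequence_digest(s) for s in genome.values())
    coords = sorted(f"{name}:{len(seq)}" for name, seq in genome.items())
    return {
        "sequences": digest(",".join(seq_digests)),
        "coordinates": digest(",".join(coords)),
    }
```

Under this sketch, a genome with chromosomes named `chr1` and `chr2` and the same genome with chromosomes named `1` and `2` share the `sequences` digest but differ in `coordinates`, which is the kind of distinction the abstract describes between identity and coordinate-system compatibility.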


2009 ◽  
Vol 16A (6) ◽  
pp. 411-418 ◽  
Author(s):  
Luong Van Huynh ◽  
Cheol-Hong Kim ◽  
Jong-Myon Kim

2020 ◽  
Author(s):  
Yuxuan Yuan ◽  
Philipp E. Bayer ◽  
Robyn Anderson ◽  
HueyTyng Lee ◽  
Chon-Kit Kenneth Chan ◽  
...  

Abstract: Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads that can span repetitive regions. However, the overlap-based assembly methods routinely used for these data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel, followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250 CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA
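
The first two steps described in this abstract, binning a reference and selecting k-mers unique to each bin, can be sketched in a few lines of Python. The fixed-size binning, the function names, and the decision to ignore k-mers that span a bin boundary are simplifying assumptions for illustration; the actual RefKA pipeline may differ on all of these points.

```python
from collections import Counter


def bin_reference(ref, bin_size):
    """Split a reference sequence into consecutive, non-overlapping bins."""
    return [ref[i:i + bin_size] for i in range(0, len(ref), bin_size)]


def kmer_set(seq, k):
    """All distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_bin(ref, bin_size, k):
    """For each bin, return the k-mers that occur in that bin and in no other."""
    per_bin = [kmer_set(b, k) for b in bin_reference(ref, bin_size)]
    # Count in how many bins each k-mer appears.
    bins_containing = Counter(kmer for kmers in per_bin for kmer in kmers)
    return [{kmer for kmer in kmers if bins_containing[kmer] == 1}
            for kmers in per_bin]
```

For the toy reference `"ACGTTTACGAAA"` with `bin_size=6` and `k=3`, the k-mer `ACG` occurs in both bins and is discarded, leaving `{"CGT", "GTT", "TTT"}` for the first bin and `{"CGA", "GAA", "AAA"}` for the second. A read carrying one of these k-mers can then be assigned to a single bin, which is what allows bins to be assembled independently and in parallel.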

