fasta file
Recently Published Documents


TOTAL DOCUMENTS

20
(FIVE YEARS 5)

H-INDEX

2
(FIVE YEARS 0)

2021 ◽  
Author(s):  
David A Eccles

This protocol describes a semi-manual method for read demultiplexing, as used after my presentation "Sequencing DNA with Linux Cores and Nanopores" to work out the number of reads captured by different barcodes. Input: reads as a FASTQ file; barcode sequences as a FASTA file. Output: reads split into a single FASTQ file per target [barcode]. Note: barcode / adapter sequences are not trimmed by this protocol.
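The final splitting step of such a workflow can be sketched in Python as follows. This is a minimal illustration, not the protocol's own commands: it assumes the read-to-barcode assignment has already been worked out by some other means and is available as a dictionary. The names `read_fastq`, `split_by_barcode` and `assignment` are hypothetical. As in the protocol, sequences are passed through untrimmed.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        # The read ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0]
        yield read_id, header + seq + plus + qual


def split_by_barcode(fastq_handle, assignment):
    """Group FASTQ records by barcode; unassigned reads are skipped.

    `assignment` maps read IDs to barcode names and is assumed to have
    been produced beforehand (hypothetical input for this sketch).
    """
    groups = defaultdict(list)
    for read_id, record in read_fastq(fastq_handle):
        barcode = assignment.get(read_id)
        if barcode is not None:
            groups[barcode].append(record)
    return dict(groups)


# Small in-memory example; a real run would open files instead.
reads = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2 extra\nTTGA\n+\nIIII\n"
    "@r3\nGGCC\n+\nIIII\n"
)
assignment = {"r1": "BC01", "r2": "BC02", "r3": "BC01"}
groups = split_by_barcode(reads, assignment)
for barcode, records in sorted(groups.items()):
    print(barcode, len(records))
```

Each group could then be written to its own file, for example one FASTQ file named after the barcode, which gives the per-barcode read counts as a by-product.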


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Youri Hoogstrate ◽  
Guido W. Jenster ◽  
Harmen J. G. van de Werken

Abstract. Background: The FASTA file format, used to store polymeric sequence data, has become a bioinformatics file standard used for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and to provide random access. Multiple compressors have been developed to archive FASTA files back and forth, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible to FASTA files, resulting in limited software integration. Results: We designed a Linux-based toolkit that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem by using Filesystem in Userspace (FUSE). This guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd). The toolkit, FASTAFS, can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. The file compression ratios were comparable but not superior to other state-of-the-art archival tools, despite the innovative random-access feature implemented in FASTAFS. Conclusions: FASTAFS is a user-friendly and easy-to-deploy, backwards-compatible, generic purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers format conversion without the need to rewrite code into different programming languages while preserving compatibility.
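To make concrete what the "additional files" for random access look like, here is a minimal Python sketch of a companion index in the spirit of the widely used `.fai` layout (sequence length, byte offset, bases per line, bytes per line). It is not FASTAFS code and says nothing about how FASTAFS stores data internally. The function names `build_index` and `fetch` are hypothetical, and the sketch assumes a well-formed FASTA file with uniform line lengths within each sequence.

```python
import io


def build_index(handle):
    """Return {name: (length, offset, line_bases, line_width)}.

    `handle` must be a binary file-like object. `offset` is the byte
    position of the first base of the sequence.
    """
    index = {}
    name = None
    offset = 0  # byte position of the start of the current line
    for line in handle:
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = [0, offset + len(line), 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # The first sequence line defines the line geometry.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        offset += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) by seeking, without a full scan."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Map base coordinates to byte positions, accounting for newlines.
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_width + (end - 1) % line_bases
    handle.seek(first)
    chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


data = io.BytesIO(b">chr1 test\nACGTAC\nGTTTGA\nCC\n>chr2\nGGGG\n")
index = build_index(data)
print(index)
print(fetch(data, index, "chr1", 4, 9))
```

An index like this must be regenerated whenever the FASTA file changes, which is the synchronisation burden that virtualised, in-sync metadata files are meant to remove.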


2021 ◽  
Author(s):  
David A Eccles

This protocol demonstrates how to assemble reads from plasmid DNA and generate a circularised and non-repetitive consensus sequence. At the moment, this protocol uses Canu to de novo assemble high-quality single-cut reads. Input(s): demultiplexed FASTQ files (see protocol "Demultiplexing Nanopore reads with LAST"). I've noticed that the default demultiplexing carried out by Guppy (at least up to v4.2.2, as used in the first version of this protocol) has issues with chimeric reads, which can affect assembly. Output(s): consensus sequence per barcode as a FASTA file.


2021 ◽  
Author(s):  
Yann Spöri ◽  
Fabio Stoch ◽  
Simon Dellicour ◽  
C. William Birky ◽  
Jean-François Flot

K/θ is a method to delineate species that rests on the calculation of the ratio between the average distance K separating two putative species-level clades and the genetic diversity θ of these clades. Although this method is explicitly rooted in population genetic theory, it was never benchmarked due to the absence of a program allowing automated analyses. For the same reason, its application by hand was limited to small datasets of a few tens of sequences. We present an automatic implementation of the K/θ method, dubbed KoT (short for "K over Theta"), that takes as input a FASTA file, builds a neighbour-joining tree, and returns putative species boundaries based on a user-specified K/θ threshold. This automatic implementation avoids errors and makes it possible to apply the method to datasets comprising many sequences, as well as to easily test the impact of choosing different K/θ threshold ratios. KoT is implemented in Haxe, with a JavaScript webserver interface freely available at https://eeg-ebe.github.io/KoT/ .
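The ratio itself can be illustrated with a deliberately simplified Python sketch. It is not KoT's implementation and skips the neighbour-joining tree entirely. It assumes aligned sequences of equal length, uses uncorrected p-distances, takes K as the mean pairwise distance between the two clades, and uses the larger of the two within-clade mean pairwise distances as a crude stand-in for θ. All function names are hypothetical, and the estimators actually used by KoT may differ.

```python
from itertools import combinations, product


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_within(clade):
    """Mean pairwise distance inside one clade (0.0 for a single sequence)."""
    pairs = list(combinations(clade, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(clade_1, clade_2):
    """Mean pairwise distance across two clades (assumed proxy for K)."""
    pairs = list(product(clade_1, clade_2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def k_over_theta(clade_1, clade_2):
    """Ratio of between-clade distance to within-clade diversity.

    Assumption of this sketch: theta is approximated by the larger of
    the two within-clade mean pairwise distances.
    """
    k = mean_between(clade_1, clade_2)
    theta = max(mean_within(clade_1), mean_within(clade_2))
    if theta == 0:
        return float("inf")  # no within-clade variation observed
    return k / theta


clade_a = ["AAAAAAAAAA", "AAAAAAAAAT"]
clade_b = ["TTTTTTTTAA", "TTTTTTTTTA"]
ratio = k_over_theta(clade_a, clade_b)
threshold = 4.0  # arbitrary user-specified value for this example
print(ratio, ratio >= threshold)
```

Re-running the last lines with different `threshold` values mirrors the kind of sensitivity test the abstract describes, here on a toy pair of clades rather than a full tree.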


2020 ◽  
Author(s):  
Youri Hoogstrate ◽  
Guido Jenster ◽  
Harmen J. G. van de Werken

Abstract. Background: The FASTA file format, used to store polymeric sequence data, has become a bioinformatics file standard used for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and provide random access. Currently, multiple compressors have been developed to archive FASTA files back and forth, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible to FASTA files, resulting in limited software integration. Results: We designed a Linux-based toolkit using Filesystem in Userspace (FUSE) that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem. This guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstandard (zstd). The toolkit, FASTAFS, can track all system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. Conclusions: FASTAFS is a user-friendly and easy-to-deploy, backwards-compatible, generic purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers the possibility to design format conversion without the need to rewrite code into different languages while preserving compatibility. Code availability: https://github.com/yhoogstrate/fastafs


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10150
Author(s):  
Benjamin Istace ◽  
Caroline Belser ◽  
Jean-Marc Aury

Motivation: Long-read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct the structure of entire chromosomes or chromosome arms. However, the existing tools are often too conservative, and the organization of contigs into scaffolds is not always optimal. Results: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes files generated during a Bionano scaffolding in order to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads, and significantly improved the contiguity and quality of the assemblies. BiSCoT generates a FASTA file of the assembly as well as an AGP file which describes the new organization of the input assembly. Availability: BiSCoT and improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and PyPI at https://pypi.org/project/biscot/.


2020 ◽  
Vol 21 (19) ◽  
pp. 7281
Author(s):  
A. J. Preto ◽  
Irina S. Moreira

Protein Hot-Spots (HS) are experimentally determined amino acids that are key to small ligand binding and tend to be structural landmarks in protein–protein interactions. As such, they have been extensively approached by structure-based Machine Learning (ML) prediction methods. However, the availability of a much larger array of protein sequences in comparison to determined three-dimensional structures indicates that a sequence-based HS predictor has the potential to be more useful for the scientific community. Herein, we present SPOTONE, a new ML predictor able to accurately classify protein HS via sequence-only features. This algorithm shows accuracy, AUROC, precision, recall and F1-score of 0.82, 0.83, 0.91, 0.82 and 0.85, respectively, on an independent testing set. The algorithm is deployed within a free-to-use webserver, only requiring the user to submit a FASTA file with one or more protein sequences.


2020 ◽  
Author(s):  
Jana K. Schniete ◽  
Nelly Selem-Mojica ◽  
Anna S. Birke ◽  
Pablo Cruz-Morales ◽  
Iain S. Hunter ◽  
...  

Abstract: Actinobacteria are a large and diverse phylum of bacteria that contains medically and ecologically relevant organisms. Many members are valuable sources of bioactive natural products and chemical precursors that are exploited in the clinic. These are made using the enzyme pathways encoded in their complex genomes. Whilst the number of sequenced genomes has increased rapidly in the last twenty years, the large size and complexity of many Actinobacterial genomes means that the sequences remain incomplete and consist of large numbers of contigs with poor annotation, which hinders large scale comparative genomics and evolutionary studies. To enable greater understanding and exploitation of Actinobacterial genomes, specialist genomic databases must be linked to high-quality genome sequences. Here we provide a curated database of 612 high-quality actinobacterial genomes from 80 genera, chosen to represent a broad phylogenetic group with equivalent genome reannotation. Utilising this database will provide researchers with a framework for evolutionary and metabolic studies, to enable a foundation for genome and metabolic engineering, to facilitate discovery of novel bioactive therapeutics and studies on gene family evolution.

Significance as a bioresource to the community: The Actinobacteria are a large diverse phylum of bacteria, often with large, complex genomes with a high G+C content. Sequence databases have great variation in the quality of sequences, equivalence of annotation and phylogenetic representation, which makes it challenging to undertake evolutionary and phylogenetic studies. To address this, we have assembled a curated, taxa-specific, non-redundant database to aid detailed comparative analysis of Actinobacteria. ActDES constitutes a novel resource for the community of Actinobacterial researchers that will be useful primarily for two types of analyses: (i) comparative genomic studies – facilitated by reliable identification of orthologs across a set of defined, phylogenetically-representative genomes, and (ii) phylogenomic studies which will be improved by identification of gene subsets at specified taxonomic level. These analyses can then act as a springboard for the studies of the evolution of virulence genes, the evolution of metabolism and identification of targets for metabolic engineering.

Data summary:
All genome sequences used in this study can be found in the NCBI taxonomy browser https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/www.tax.cgi and are summarised along with Accession numbers in Table S1.
All other data are available on Figshare https://doi.org/10.6084/m9.figshare.12167529 and https://doi.org/10.5281/zenodo.3830391
Perl script files available on GitHub https://github.com/nselem/ActDES including details of how to batch annotate genomes in RAST from the terminal https://github.com/nselem/myrast
Supp. Table S1 List of genomes from NCBI (Actinobacteria database.xlsx) https://doi.org/10.6084/m9.figshare.12167529
CVS genome annotation files including the FASTA files of nucleotide and amino acids sequences (individual .cvs files) https://doi.org/10.6084/m9.figshare.12167880
BLAST nucleotide database (.fasta file) https://doi.org/10.6084/m9.figshare.12167724
BLAST protein database (.fasta file) https://doi.org/10.6084/m9.figshare.12167724
Supp. Table S2 Expansion table genus level (Expansion table.xlsx Tab Genus level) https://doi.org/10.6084/m9.figshare.12167529
Supp. Table S2 Expansion table species level (Expansion table.xlsx Tab species level) https://doi.org/10.6084/m9.figshare.12167529
All GlcP and Glk data – blast hits from ActDES database, MUSCLE Alignment files and .nwk tree files can be found at https://doi.org/10.6084/m9.figshare.12167529
Interactive trees in Microreact for Glk tree https://microreact.org/project/w_KDfn1xA/90e6759e and associated files can be found at https://doi.org/10.6084/m9.figshare.12326441.v1
Interactive trees in Microreact for GlcP tree https://microreact.org/project/VBUdiQ5_k/0fc4622b and associated files can be found at https://doi.org/10.6084/m9.figshare.12326441.v1


BioTechniques ◽  
2019 ◽  
Vol 67 (2) ◽  
pp. 50-54 ◽  
Author(s):  
Gabriel Foley ◽  
Leander Sützl ◽  
Stephlina A D'Cunha ◽  
Elizabeth MJ Gillam ◽  
Mikael Bodén