Seed Selection Strategies for Overlap Detection
Abstract
Current state-of-the-art assemblers of long, error-prone reads rely on detecting all-vs-all overlaps within the set of reads, with overlaps represented by a sparse selection of short subsequences, or “seeds”. Although the quality of seed selection can affect both the accuracy and the speed of overlap detection, existing algorithms do little more than ignore over-represented seeds. Here we propose several more informed seed selection strategies to improve the precision and recall of overlap detection. We evaluate these strategies on real long-read data sets across a range of fixed seed sizes, and show that they substantially improve the utility of individual seeds over uninformed selection.
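
For orientation, the baseline that the abstract describes (discarding over-represented seeds and then pairing reads that share the remaining seeds) might be sketched as follows. This is a minimal illustration, not the method of any particular assembler: the function names, the use of fixed-length k-mers as seeds, and the `max_count` cutoff are all assumptions introduced here.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(read, k):
    """Yield every k-length substring of a read (hypothetical helper)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def select_seeds(reads, k, max_count):
    """Return the k-mers whose total count across all reads is at most max_count.

    max_count is an assumed cutoff for "over-represented"; nothing else
    about a seed is considered, which is what makes this selection uninformed.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c <= max_count}


def candidate_overlaps(reads, k, max_count):
    """Count the distinct retained seeds shared by each pair of reads."""
    seeds = select_seeds(reads, k, max_count)

    # Map each retained seed to the set of reads that contain it.
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for km in kmers(read, k):
            if km in seeds:
                index[km].add(rid)

    # Every pair of reads sharing a seed is a candidate overlap.
    shared = Counter()
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1
    return shared
```

Calling `candidate_overlaps(reads, k, max_count)` on a list of read strings returns a counter keyed by read-index pairs. The informed strategies proposed here would replace `select_seeds`, leaving the indexing and pairing steps unchanged.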