THiCweed: fast, sensitive detection of sequence features by clustering big data sets
AbstractWe present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin-immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions based on sequence similarity using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. ThiCweed is specially geared towards data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30,000 peaks in 1-2 hours, on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate or more accurate than all other tested programs.THiCweed performs best with large “window” sizes (≥ 50bp), much longer than typical binding sites (7-15 base pairs). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs, and secondary motifs even when they occur in < 5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq data sets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif-finding to give new insights into genomic TF binding complexity.