Recently, I have been re-analyzing bisulfite sequencing data from a fungus with significant DNA methylation around its centromeres, but with limited mappability due to high sequence similarity among its transposons. I ‘discovered’ that different bisulfite mapping tools produced drastically different summary methylation statistics.
After about a week of comparing the output of various mapping tools, I was faced with another discovery: I didn’t quite grasp how bisulfite mapping software works at the conceptual level. What an embarrassment! A… sixth-year postdoc?… who doesn’t understand his own bread-and-butter methods. In any case, my initial realization stemmed from reading the Perl script of our lab’s ‘in-house’ (i.e. not published and not well documented) bisulfite mapper, which works a lot like the published tools “Bismark”, “BS-seeker”, and others (however, it is quite different from BSMAP, Pash, and others… perhaps more on that later).
With software like Bismark, all sequencing reads and the genome itself are C-to-T converted in silico, whereupon mapping is carried out. The alignments are later resolved, using read ID matching back to the unconverted reads, to identify Cs that did not convert in the reads and were therefore methylated. However, a G-to-A converted genome is also generated. This has long bothered me… why do we need a G-to-A converted genome as well? My understanding was that we need C-to-T alone because of the directionality strategy built into essentially all library preparation methods. Of course one needs directionality for RNA-seq, since you care very much about the strand that gave rise to a particular single-stranded RNA.
But, to review, why directionality for bisulfite libraries? To make the libraries, adapters are ligated to the fragmented gDNA and the protolibraries are then denatured. That means the 5′-end of your protolibrary top strand will have the same adapter sequence as the 5′-end of your bottom strand. For our NextSeq flowcell, these adapters serve two critical functions: 1.) as the complementary strand that anneals the library to the flowcell, and 2.) as a sequencing primer binding site.
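To make the convert, map, then resolve idea described above for Bismark-style mappers a bit more concrete, here is a minimal Python sketch. It is not how Bismark or our in-house script is actually implemented: exact string matching stands in for a real aligner, only the forward (top) strand is considered, and the toy genome, the read ID `read_1`, and the function names `c_to_t`, `find_hits`, and `call_methylation` are all hypothetical names I made up for illustration.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def find_hits(converted_read, converted_genome):
    """Return every start position where the converted read matches exactly.

    Assumption: exact string search stands in for a real aligner.
    """
    hits = []
    start = converted_genome.find(converted_read)
    while start != -1:
        hits.append(start)
        start = converted_genome.find(converted_read, start + 1)
    return hits


def call_methylation(original_read, genome, start):
    """Compare the unconverted read to the unconverted genome at a hit.

    At each genomic C: a C in the read means protected (methylated),
    a T in the read means converted (unmethylated).
    """
    calls = []
    for offset, read_base in enumerate(original_read):
        if genome[start + offset] != "C":
            continue
        if read_base == "C":
            calls.append((start + offset, True))
        elif read_base == "T":
            calls.append((start + offset, False))
    return calls


# Toy data (hypothetical): the read covers genome[2:11], where the C at
# position 8 was unmethylated and therefore reads as T.
genome = "TTACGGATCCGATTGCA"
reads = {"read_1": "ACGGATTCG"}

# Step 1: convert the genome and the reads, keeping the originals by read ID.
converted_genome = c_to_t(genome)
converted_reads = {read_id: c_to_t(seq) for read_id, seq in reads.items()}

# Step 2: map converted reads; step 3: look the original read up by ID
# and call methylation against the unconverted genome.
results = {}
for read_id, converted in converted_reads.items():
    hits = find_hits(converted, converted_genome)
    if len(hits) == 1:  # keep uniquely mapping reads only
        results[read_id] = call_methylation(reads[read_id], genome, hits[0])

print(results)
```

Once everything is C-to-T converted, the read and the genome agree regardless of methylation state, so the methylation information has to come from going back to the original read afterwards. The conversion also shrinks the alphabet to three letters, which is presumably one reason mappability suffers in repeat-rich regions.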
Following bridge amplification of the libraries on-flowcell, which generates a complementary strand to each original strand you loaded, extension from the read 1 sequencing primer re-creates the “original top” and “original bottom” strands; such is the beauty of directionality in the bisulfite context. The sequence you get from read 1-based extension exactly replicates the ‘original’ bisulfite-converted genomic DNA. On this strand, C converts to T when it’s not protected by methylation (or one of its derivatives), and that is why we map these reads to a C-to-T converted genome.
OK… so why the G-to-A converted genome also?? G-to-A seems relevant, but only because the template strand covalently bound to the flowcell represents your data in that transformation… ah! But what about paired-end libraries!!! Duh. You may have seen this coming; if so, my apologies.
The punchline to this diatribe is: as long as you are using directional library prep methods (basically any normal kit will produce directional libraries, for instance those from Tecan/Nugen, NEB, Illumina, Swift, etc.), paired-end bisulfite sequencing requires G-to-A converted read and genome sequences to map read 2. Every unmethylated C in the ‘original strand’ (top or bottom) will register as an A, where an unconverted base would read G, on the ‘complementary to original’ strand, which is synthesized from the read 2 primer (in our case, built in to the NextSeq reagents). The Bismark GitHub page has some documentation on this, but I had really not appreciated the importance of directionality in these libraries, how that translates to single-end mapping requiring only the C-to-T converted genome and reads, and likewise the relevance of G-to-A converted genomes for read 2 in paired-end libraries.
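The read 1 versus read 2 argument above can be checked on a toy fragment. This is only a sketch under simplifying assumptions: the fragment comes from the original top strand of a directional library, both reads span the whole fragment, and there are no sequencing errors. The sequence, the methylated position, and the helper names (`bisulfite_convert`, `reverse_complement`, `c_to_t`, `g_to_a`) are hypothetical and not taken from any mapper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def c_to_t(seq):
    """In-silico conversion used for read 1: every C becomes T."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """In-silico conversion used for read 2: every G becomes A."""
    return seq.replace("G", "A")


def bisulfite_convert(strand, methylated):
    """Simulate the chemistry: C becomes T unless its index is methylated."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(strand)
    )


# Toy fragment from the original top strand; only the C at index 1 is methylated.
top = "ACGGATCCG"
methylated = {1}

converted = bisulfite_convert(top, methylated)

# Directional library: read 1 reproduces the converted original strand,
# read 2 is its reverse complement (the 'complementary to original' strand).
read1 = converted
read2 = reverse_complement(converted)

# Which base changes does each read show relative to unconverted genomic sequence?
read1_changes = sorted({(ref, obs) for ref, obs in zip(top, read1) if ref != obs})
bottom_view = reverse_complement(top)
read2_changes = sorted({(ref, obs) for ref, obs in zip(bottom_view, read2) if ref != obs})

print("read 1:", read1, "changes:", read1_changes)
print("read 2:", read2, "changes:", read2_changes)

# Full in-silico conversion makes each read match the correspondingly
# converted reference, whatever the methylation state was.
print(c_to_t(read1) == c_to_t(top))
print(g_to_a(read2) == g_to_a(bottom_view))
```

In this toy case read 1 differs from the genome only by C-to-T changes and read 2 only by G-to-A changes, and C-to-T converting read 2 would not rescue the match, because the remaining G from the methylated C still disagrees with the converted reference.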