understanding bisulfite sequencing — C-to-T vs G-to-A

Recently, I have been re-analyzing bisulfite sequencing data from a fungus with significant DNA methylation around its centromeres, but with limited mappability due to high transposon similarity. I 'discovered' that different bisulfite mapping software produced drastically different summary methylation statistics.

After about a week of comparing the output of various mapping programs, I was faced with another discovery: I didn't quite grasp how bisulfite mapping software works at the conceptual level. What an embarrassment! A… sixth-year postdoc?… who doesn't understand his own bread-and-butter methods. In any case, my initial realization stemmed from reading the perl script of our lab's 'in-house' (i.e. not published and not well documented) bisulfite mapper, which works a lot like the published tools "Bismark" and "BS-Seeker", among others (it is quite different from BSMAP, Pash, and others… perhaps more on that later).

With software like Bismark, all sequencing reads and the genome itself are C-to-T converted, mapping is carried out, and the alignments are later resolved by read-ID matching back to the original reads, so that any C that did not convert (and was therefore methylated) can be counted. However, a G-to-A converted genome is also generated. This has long bothered me… why do we need a G-to-A converted genome as well? My understanding was that we need C-to-T alone because of the directionality strategy built into essentially all library preparation methods.

Of course one needs directionality for RNA-seq, since you care very much about the strand that gave rise to a particular single-stranded RNA. But, to review, why directionality for bisulfite libraries? To make the libraries, adapters are ligated to the fragmented gDNA and the protolibraries are then denatured. That means the 5′ end of your protolibrary top strand will have the same adapter sequence as the 5′ end of your bottom strand. For our NextSeq flowcell, these adapters serve two critical functions: 1) as the complementary strand that anneals the library to the flowcell, and 2) as a sequencing-primer binding site. Following bridge amplification of the libraries on-flowcell, which generates a complementary strand to the originals you added, extension from the read 1 sequencing primer re-creates the 'original top' and 'original bottom' strands. Such is the beauty of directionality in the bisulfite context: the sequence you get from read 1 exactly replicates the 'original' bisulfite-converted genomic DNA. On this strand, C converts to T when it is not protected by methylation (or its derivatives), and that is why we map these reads to a C-to-T converted genome.

OK… so why the G-to-A converted genome too?? G-to-A seems relevant, but only because the template strand covalently bound to the flowcell represents your data in that transformation… ah! But what about paired-end libraries!!! Duh. You may have seen this coming; if so, my apologies. The punchline to this diatribe: as long as you are using directional library prep methods (basically any normal kit will produce directional libraries, for instance from Tecan/NuGEN, NEB, Illumina, Swift, etc.), paired-end bisulfite sequencing requires the G-to-A converted reads and genome sequence to map read 2. All of the unmethylated Cs in the 'original' strand (top or bottom) will register as A on the complementary strand, the 'complementary to original' strand, which is synthesized from the read 2 primer (in our case, built into the NextSeq reagents). The Bismark github page has some documentation on this, but I had really not appreciated the importance of directionality in these libraries, how it translates to single-end mapping requiring only the C-to-T converted genome and reads, and, likewise, the relevance of the G-to-A converted genome for read 2 in paired-end libraries.
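To make the three-letter scheme concrete, here is a minimal sketch of the in-silico conversion step that Bismark-style mappers perform. The file names are placeholders, and a real mapper also keeps the original read sequences keyed by read ID so that unconverted (methylated) Cs can be called after mapping.

# genome: C-to-T for read 1 alignments, G-to-A for read 2 alignments
sed '/^>/!y/Cc/Tt/' genome.fa > genome.CtoT.fa
sed '/^>/!y/Gg/Aa/' genome.fa > genome.GtoA.fa

# reads: convert only the sequence line (line 2 of each 4-line FASTQ record)
awk 'NR % 4 == 2 {gsub(/C/, "T")} {print}' read1.fastq > read1.CtoT.fastq
awk 'NR % 4 == 2 {gsub(/G/, "A")} {print}' read2.fastq > read2.GtoA.fastq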

xml update

generate a bigwig (bw) file from a bed (remove non-nuclear genomes first)

for i in ./*bed; do perl -wnl -e '/Chr\d/ and print;' $i | bedtools genomecov -bg -i - -g ~/bsgenome_tair9.Chr.txt > ${i%%bed}bg; done
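Note that bedtools genomecov expects the input bed to be grouped by chromosome; if your beds aren't already position-sorted, something like this first (a sort-in-place sketch, assuming no other .tmp files are in the way):

for i in ./*bed; do sort -k1,1 -k2,2n $i > $i.tmp && mv $i.tmp $i; done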

then

for i in ./*bg; do bedGraphToBigWig $i ~/bsgenome_tair9.Chr.txt ${i%%bg}bw; done
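If you don't already have the two-column chromosome-sizes file both commands rely on, one way to generate it (assuming samtools is available; the fasta name is a placeholder):

samtools faidx tair9.fa
cut -f1,2 tair9.fa.fai > ~/bsgenome_tair9.Chr.txt

The first two columns of the .fai index are sequence name and length, which is exactly the format genomecov and bedGraphToBigWig expect.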

collect the names of these bw

ls *.bw > file_names.txt

perl -wpl -e 's|^(.*)$|<Resource name="$1" path="http://yourServerHere/igvdata/chip_seq_jic/$1"/>|g;' file_names.txt > xml.tmp

  1. paste the <Resource> lines from this new file into the body of a new igvdata xml with appropriately descriptive Global and Category headers (a sketch follows this list),
  2. add this xml to your igv_registry.txt in your /var/www/html/igv directory,
  3. and import the bw files to an appropriately named directory in /var/www/html/igvdata
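For reference, the resulting igvdata xml might look something like this (the Global and Category names below are placeholders for your own descriptive headers):

<?xml version="1.0" encoding="UTF-8"?>
<Global name="my_lab_data" version="1">
  <Category name="chip_seq_jic">
    <Resource name="sample1.bw" path="http://yourServerHere/igvdata/chip_seq_jic/sample1.bw"/>
  </Category>
</Global>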

calculate mappability of a genome

I couldn't find a simple explanation of how to run gem-mappability as was carried out in Catoni et al. 2017 EMBO (out of the Paszkowski lab). I found enough information here to get me going; below is the simplified version for those who simply want a bigwig or bedgraph of these values and don't care about comparing a bajillion different k-mer sizes.

  1. download the compressed file (your specific version may be different)
  2. tar -xvjf GEM-binaries-Linux-x86_64-core_i3-20130406-045632.tbz2
  3. cd GEM-binaries-Linux-x86_64-core_i3-20130406-045632/
  4. cp all the executables in ./bin to a directory in your path, such as /usr/local/bin
  5. then run the following – change the names of your files accordingly:

gem-indexer -i /path_to_your_genome_fasta -o output_name_you_pick

gem-mappability -I output_from_above -l [you decide how big kmer should be] -o pick_a_gem_output_name

gem-2-wig -I gem_indexer_output -i gem_mappability_output -o your_wig

then convert your wig to bigWig using Jim Kent/UCSC tools
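Putting it all together, a worked example with placeholder names (50-mers on a tair9 fasta), assuming the usual GEM output extensions (.gem from gem-indexer, .mappability from gem-mappability, and a .wig plus matching .sizes file from gem-2-wig):

gem-indexer -i tair9.fa -o tair9_index
gem-mappability -I tair9_index.gem -l 50 -o tair9_50mer
gem-2-wig -I tair9_index.gem -i tair9_50mer.mappability -o tair9_50mer
wigToBigWig tair9_50mer.wig tair9_50mer.sizes tair9_50mer.bw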

genotyping in 96-well format

had to split the plates to get them to fit into a small thermocycler :-<

Generating Arabidopsis triple and quadruple mutants (and beyond) requires stamina. The mind-numbing process can be made less time-consuming by doing on-plate DNA extraction with an SDS-free extraction method, as put forth in a Ronan O'Malley et al. paper ("guide to the t-dna insertion collection", approximately). They use steel balls and a paint-can shaker to achieve tissue disruption, whereas I use a PCR plate with zirconia beads after deep-freezing the leaf sample (partially submerge the plate, on a metal plate rack, in ethanol + dry ice).

I then add 300 ul TE with 100 mM NaCl (roughly per O'Malley), vortex a bit, and spin down the plate at 4000 RPM in an Eppendorf 5804. The trick post-spin is to take only the very top of the supernatant to avoid contaminating debris. I took ~50 ul here, adding it to a new plate into which I'd crudely pipetman'd 200 ul of cold isopropanol. Once done with this, I store the plate overnight in the freezer, spin down at high speed as before the next day, and then decant as much isopropanol as possible. From here, I do a 70% EtOH wash and spin as before, then dry the samples out. I found ramming air into the plate was the only way that worked for me… I tilted the plate at one of our PCR vents and the remaining traces of alcohol slowly evaporated. It took a long time to dry. I resuspend in 50 ul EB and proceed with PCR using multichannel pipets, and life is good.

NB: I use adhesive foil for PCR plates to cover the samples for all these steps. The foil dents heavily during the vortex, but doesn’t break open.

Once ready to run the samples on a DNA gel, you can cast the gel with an appropriately sized comb to further streamline the process. I use a 26-well format from Bio-Rad in their Sub-Cell electrophoresis get-up, then load WT in even positions and mutant in odd positions (something I realize the O'Malley paper also recommends!).