download only the SRAs you want, from cmd line

November 3, 2020November 3, 2020 | biolyons

go to the NCBI SRA database site, click the record you are interested in and then click the tab ‘data access’ and copy-paste the https URL into a text file

place this text file of your SRA URLs into the directory where you wish to download the data (no spaces in the list or any trailing characters, etc)

enter the command

for i in `cat srr.list` ; do wget $i ; done

et voila, go have a snack

fastq trimming – Illumina truseq2 v truseq3

January 8, 2019June 27, 2019 | biolyons

the 5′ end (13 nt) are identical, with some similarity further 3′ as well

When I’m not sure which adaptors were used for the construction of a sequencing library, but I know they were Illumina, I take the top ~100k reads and run trimmomatic using more-or-less default settings against the 2 different Illumina truseq adaptors that ‘ship’ with that software. Then compare how the two trim-logs look — the one that trimmed off more crap is the one to proceed with. Something like this (these are set for small RNA libraries where the inserts are very short!):

java -jar trimmomatic-0.36.jar SE -threads 5 -trimlog trimlog_test test.fastq out.test.fastq ILLUMINACLIP:TruSeq2-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:1

awk ‘$6>0 {print}’ trimlog_test | wc

make a freeNAS file server with old parts: I

October 24, 2018October 25, 2018 | biolyons

when my lab departed for the John Innes Centre about a year ago, there was a large quantity of raw computer parts left behind for the salvage heap. I wanted a dedicated place to back up my laptop and home computers remotely, and had heard about freely-available network attached storage (NAS) systems, but that was where my knowledge ended. Our lab had been using the somewhat pricey synology NAS system, and I wanted to learn if there was an open source alternative.

FreeBSD-based FreeNAS fit the bill. There is a TON of info and support for how FreeNAS works and how to get it going. What I envisioned was taking some old HDDs laying around, reformatting them, and using them as the storage core of the system. I would boot to a usb drive on which I’ve installed FreeNAS, and format things from there. The amazing thing is… it basically worked and now I have a server for a lot of previously unsecure stuff. Booting off of a USB loaded with the OS is actually the recommended way to go: see this simple instructional video from FreeNAS

One detail that is worth noting is that FreeNAS storage is memory intensive. I am not 100% clear on why, but I believe it has to do with the ZFS-based architecture (NB: zfs does not stand for anything!). For each TB of hard drive, they recommend 1GB of RAM, starting with at least 4GB RAM as a base. Luckily I had a bunch of DDR3 RAM laying around and a capable Asus X99 motherboard with an i7 cpu. Once I got the thing running, it had 64GB RAM supporting 8TB of hard storage; I’ve gone about 2TB into this thing after about a year.

the web interface for FreeNAS with some basic system specs.

Make an IGV server out of your computer

October 19, 2018November 12, 2018 | biolyons

this makes it possible to remotely view your data on any integrated genome viewer (IGV) session with an internet connection. Make sure your computer’s web server enabled — for me on ubuntu 14 and 16, it meant installing the so-called LAMP stack (I used the slick little installer package called tacksel as outlined here)

become loosely familiar with xml format – you use xml to write the structure of your tracks and features, how they’re colored, and lots of other details
there is a hierarchy that you will adhere to: a.) a registry at the top (information about where your xml files are) , b.) the xml files stating where your data files are (I put them in the directory ./igvdata/, and c.) a directory full of those data in their precise location as specified by xml files in b (put these below the xmls within ./igvdata)
how i did it: in /var/www/html/igv, make your registry e.g. igv_registry.txt. it’s a simple list (of so-called infolinks) comprised of http://your_server_url/igvdata/_some_xml_file.xml …. This registry file points to the location of your secondary xml files; my files are in subdirectories within /var/www/html/igvdata/ so my registry file reads like: http://__ip___/igvdata/some_cool_set_of_experiments.xml
and the xml file “some_cool_set_of_experiments.xml” reads like:

<?xml version="1.0" encoding="UTF-8"?>
<Global name="bsseq_sex_cells"  infolink="http://ip_addr/igv/" version="1">
    <Category name="bsseq_sperm">
        <Category name="wt.sp.f4">
         <Resource name="wt.sp.CG.f4"
               color="0,100,0"
                  path="http://ip_addr/igvdata/bsseq.sperm/wt.f4.sp.all.cg.gff.clean.CG-methyl.tdf" />
        <Resource name="wt.sp.CHG.f4"
               color="0,100,0"
                  path="http://ip_addr/igvdata/bsseq.sperm/wt.f4.sp.all.cg.gff.clean.CHG-methyl.tdf" />
         <Resource name="wt.sp.CHH.f4"
               color="0,100,0"
                  path="http://ip_addr/igvdata/bsseq.sperm/wt.f4.sp.all.cg.gff.clean.CHH-methyl.tdf" />

        </Category>

here I have specified the “global name” as bsseq_sex_cells, so this will be an expandable menu item in IGV, with the CG, CHG, and CHH tracks available as items within that menu.

As you can see above in the xml snippet, under ‘Resource’ the path shows my tracks for this set of experiments are all within a sub-directory in ./igvdata called bsseq.sperm — it seemed a good way for me to organize my data. Also cool is that each sub-directory corresponds to a different menu heading, like so

a menu of my data sets after all the xml’ing and set up

6. to enable this, set your IGV session to look for your server at your IP address:

click ‘Edit Server properties’ and enter the appropriate URL; change name of registry accordingly

use awk to split file on unique identifier

October 18, 2018October 20, 2018 | biolyons

if you want to split your file on unique identifiers in say col. 2, and your identifiers are, for ex., CGA, CGC, CGG, and CGT, running this one-liner will generate 4 files, named as your identifiers:

awk ‘{print >> $2 }’ input (don’t specify output)

& done – parallelize the bash for loop

October 18, 2018October 18, 2018 | biolyons

perhaps I would have finished my last paper a month earlier if I had known about this simple trick: when running a simple for loop in bash, take advantage of a multicore processor and use an “&” rather than “;” to close the loop. for instance:

bashhub – mind the bash

October 18, 2018October 20, 2018 | biolyons

just discovered bashhub – after years of using the built-in tool ‘history’ then piping through grep to find past commands. I don’t get all the options available, but the basic bashhub features allow you to do this sort of search in one step, or enter interactive mode (sorta like htop but for your command history).

David B. Lyons

biologist at MSU Bozeman — chromatin, epigenetics, devbio

tech for biology