Humanity has never lived in better times

It is easy to be disillusioned and pessimistic about the world we live in. Bad news seems to be followed by worse news. But humanity has come a long way from the disease-ridden, impoverished, war-torn lives of our forefathers. Here we look at a few data-driven graphs to convince ourselves of the progress we have made over time in various aspects of life. Slow progress never makes headlines.

It may seem like the world is descending into total chaos, violence, and destruction: war in Syria, Ukraine and Yemen, the Islamic State, the migrant crisis, Ebola, plane crashes, earthquakes, tsunamis and whatnot. The more news you watch, the more worried you will be. This is because news outlets tend to focus on spectacularly negative events. Violence, atrocities, and hatred are thrown into the spotlight and into the lives of common people. With ever-increasing digital connectivity, it is easy to disseminate and absorb information at an unprecedented level. Relatively small incidents have a larger voice. As Ray Kurzweil said, “The world isn’t getting worse, our information is getting better”. To appreciate the world we live in, we have to put things into a wider context.

The fact is that humanity has never lived in a better time than now in pretty much every aspect you look at: war, violence, disease and poverty are all at the lowest levels they have ever been. Of course, there is still a long way to go, but this is the best it has been since the beginning of humankind. To prove my point, here we evaluate human progress using some real data and simple time-series plots. Most of the data and information were obtained from OurWorldInData.


Read counts of RNA-Seq Spike-ins using STAR and QoRTs

A short tutorial on quantifying spike-ins used in an RNA-Seq experiment.

In RNA-Seq analyses, adding a pre-determined quantity of synthetic RNA sequences (spike-ins) to samples is a popular way to verify the experimental pipeline, determine quantification accuracy and normalise differential expression. The most commonly used spike-ins are the ERCC spike-ins.

This post will cover the bioinformatic steps involved in obtaining read counts of spike-ins from a FASTQ file sequenced with spike-ins. The steps are: creating a custom FASTA genome build incorporating the spike-in sequences, creating a custom GTF file, mapping the reads to the custom genome, read counting and visualisation. This post will not cover the wet lab part of adding spike-ins. I have a FASTQ data file (sample01.fq.gz) of single-cell, 50 bp, single-end Illumina reads with spike-ins that I am using for this workflow.
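As a rough illustration of the custom GTF step, here is a minimal Python sketch. It is not the code from the full post. It assumes each spike-in sequence can be described as a single exon spanning its whole length, with the sequence name reused as both gene_id and transcript_id. The function names, the "spikein" source label and the toy sequence names are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {sequence name: sequence}."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {n: "".join(parts) for n, parts in sequences.items()}


def spikein_gtf_lines(sequences, source="spikein"):
    """Build one GTF exon line per spike-in, covering the full sequence.

    Assumption: every spike-in is treated as a single-exon gene on the
    plus strand, with its name used as gene_id and transcript_id.
    """
    lines = []
    for name, seq in sequences.items():
        attributes = f'gene_id "{name}"; transcript_id "{name}";'
        fields = [name, source, "exon", "1", str(len(seq)),
                  ".", "+", ".", attributes]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # Toy spike-in FASTA with made-up names and sequences.
    toy_fasta = ">SPIKE-1 toy example\nACGT\nAC\n>SPIKE-2\nGGG\n"
    for gtf_line in spikein_gtf_lines(parse_fasta(toy_fasta)):
        print(gtf_line)
```

The resulting lines would be appended to the genome annotation GTF, and the spike-in FASTA records appended to the genome FASTA, before building the mapping index.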


Which file compression to use on Linux?

Seven different compression formats (7z, bzip2, gzip, lrzip, lz4, xz and zip) are tested using ten different compression commands (7za, bzip2, lbzip2, lrzip, lz4, pbzip2, gzip, pigz, xz and zip) on five different file types (fastq, mp3 tar archive, mp4 movie file, random text file and a tiff stack) for compression ratio and time. bzip2 compression using the commands lbzip2 and pbzip2 comes out as the winner due to its high compression ratio, speed and multi-threading capabilities.

This is a quick comparison of some of the data compression and decompression formats on Linux. The idea is to compare compression/decompression time and compressed file size using seven compression formats on five different file types.

Five different data files were tested: a fastq text file, an mp3 tar archive, an mp4 movie file, a randomly generated text file and a tiff image stack. Some properties of the files: fastq file (403 MB, 1.56 million reads), mp3 tar archive (390 MB, a tar archive composed of four tar archives each with 6 mp3 tracks of size 10 MB to 32 MB), mp4 file (340 MB), text file (400 MB, created using base64 /dev/urandom | head -c 419430400 > text.txt) and tiff stack (404 MB, 1380 frames, 640 x 480 px, sequence of zebrafish larvae swimming in a microtitre plate). For clarity, fastq files are text files containing next generation sequencing data and tiff stacks are used for image analysis using ImageJ, for example.

Seven different compression formats were tested: 7z, bzip2, gzip, lrzip, lz4, xz and zip using ten different compression commands: 7za, bzip2, lbzip2, pbzip2, gzip, pigz, lrzip, lz4, xz and zip. For decompression, the same commands were used except for zip where unzip was used. The 7za command by default compresses to the 7z format but also allows exporting to bzip2, gzip and zip. lbzip2 and pbzip2 are multi-threaded versions of bzip2. Similarly, pigz is the multi-threaded version of gzip.
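The kind of measurement described here can be sketched in a few lines of Python. This is only an illustration and not the benchmark used in this post: it covers just the three formats available in the Python standard library (gzip, bzip2 and xz), runs single-threaded in memory rather than calling the command-line tools, and uses a small random base64 text instead of the 400 MB file. The function names are hypothetical, and the generated text has no line breaks, unlike the output of the base64 command.

```python
import base64
import bz2
import gzip
import lzma
import os
import time


def make_random_text(n_bytes):
    """Return n_bytes of base64 text built from random bytes."""
    raw = os.urandom(n_bytes * 3 // 4 + 3)
    return base64.b64encode(raw)[:n_bytes]


def benchmark(data):
    """Compress and decompress data with each codec and record the results."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = []
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_time = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_time = time.perf_counter() - start

        # Sanity check: the round trip must give back the original data.
        assert unpacked == data
        results.append({
            "format": name,
            "ratio": len(data) / len(packed),  # original size / compressed size
            "compress_s": compress_time,
            "decompress_s": decompress_time,
        })
    return results


if __name__ == "__main__":
    text = make_random_text(1_000_000)
    for row in benchmark(text):
        print(f"{row['format']:6s} ratio={row['ratio']:.2f} "
              f"compress={row['compress_s']:.3f}s "
              f"decompress={row['decompress_s']:.3f}s")
```

Timings from a sketch like this are not comparable to the command-line results, since the multi-threaded tools (lbzip2, pbzip2, pigz) are not represented.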


Structure ‘Sort by Q’ explained.

STRUCTURE is a popular software program used by biologists to infer the population structure of organisms using genetic markers. Barplots in STRUCTURE have an option to sort individuals by Q. We explore the ‘Sort by Q’ option using R and Excel to figure out what it does.

STRUCTURE is a popular software program used by biologists to infer the population structure of organisms using genetic markers. Barplots in STRUCTURE have an option to sort individuals by Q. We are going to figure out what this means and how it is done.


A guide to elegant tiled heatmaps in R

A step-by-step guide to data preparation and plotting of simple, neat and elegant heatmaps in R using base graphics and ggplot2.

This was inspired by the disease incidence rate in the US featured in the Wall Street Journal, which I mentioned in one of the previous posts. The disease incidence dataset was originally used in this article in the New England Journal of Medicine. Here, I use the measles level 1 incidence (cases per 100,000 people) dataset obtained as a .csv file from Project Tycho. Download the .csv file here or head over to Project Tycho for other datasets.

In this post, we will look into creating a neat, clean and elegant heatmap in R. No clustering, no dendrograms, no trace lines, no bullshit. We will go through some basic data cleanup, reformatting and finally plotting, step by step. For the whole code with minimal explanations, scroll to the bottom of the page.
