In this post, we are going to explore how to adjust various ggplot plot elements. What can be adjusted, what they are called and how they can be adjusted.

## Customising ggplot2

Customising a ggplot2 plot using the theme function.

Customising a ggplot2 plot using the theme function.

In this post, we are going to explore how to adjust various ggplot plot elements. What can be adjusted, what they are called and how they can be adjusted.

Create an elegant photo calendar completely using R.

I had this idea of using some of my travel photos to create a photo calendar. I would normally go about it using Adobe Photoshop or Adobe Illustrator. But, that would involve a lot of manual work placing dates and days for each month. I would also like to mark some public holidays and friend’s birthdays. So, I wondered if it might be possible to do it with R. After fiddling about with it over the weekend, I managed to make it work. It went better than I expected. And here I am recreating the calendar using some stock photos. All stock photos are royalty-free from Pexels. For the impatient ones, the whole code and images are available at this Github repository. For detailed guide, keep reading.

We scrap Instagram for basic public data using R to help us pick optimal hashtags.

If you are an Instagram user, at some point, you care going to be interested in the various metrics such as followers, number of posts by a certain user etc. You might want to compare these metrics between different users or to find out the number of posts with a certain hashtag etc. The casual way to do it is to go the relevant Instagram page and look at the metric and write it down somewhere, and go to next and so on. Clearly this is not ideal strategy if you want to look at a few hundred pages. It would be neat to get this data in an automated manner.

A quick tutorial on running ImageJ on a Linux cluster.

I use ImageJ for many of my image analysis needs. My desktop computer runs Windows 7 and it has pretty solid specs with Core i7 processor and 16GB RAM. I recently had to handle some large tiff stacks (4-5gb) and it simply wouldn’t work on my desktop as I constantly ran into ‘out of memory’ errors. So I decided to run them on a computing cluster instead since I have access to one. Running on a cluster might be useful when handling data with large memory requirements or to perform computations on numerous files in parallel by distributing load to multiple cores. It took me a while to figure out how to get things to work, so I thought I would make a record of it. And this might hopefully be useful to others.

In the advent of big data such as genomics, running numerous statistical tests is unavoidable. But long comes strange statistical problems. This post investigates issues with multiple statistical testing and its solutions along with simulated data.

In a standard statistical test, one assumes a null hypothesis, performs a statistical test and computes a p-value. The estimated p-value is compared to a predetermined threshold (usually 0.05). If the estimated p-value is greater than 0.05 (say 0.2), it means that there is a 20% chance of obtaining the current result if the null hypothesis is true. Since we decided our threshold as 5%, the 20% is too high to reject the null hypothesis and we accept the null hypothesis. Now, if the estimated p-value was less than 0.05 (say 0.02), there is a 2% probability of obtaining the observed result if the null hypothesis is true. Since 2% is a very low probability and it is below our threshold of 5%, we reject the null hypothesis and accept an alternative hypothesis.

The 5% threshold, although giving us high confidence, is an arbitrary value and does not absolutely guarantee an outcome. There is still the possibility that we are wrong 5% of the time. This is known as the probability of a Type I error. A Type I error occurs when a researcher falsely concludes that an observed difference is real, when in fact, there is no difference.

That was the story of a single statistical test. With large data, it is common for data analysts to do multiple statistical tests on the same data. Similar to a single test, each test in a multiple test has the 5% Type 1 error rate. And this accumulates for the number of tests.