Close

Structure ‘Sort by Q’ explained.

STRUCTURE is a popular software used by biologists to infer the population structure of organisms using genetic markers. Barplots in STRUCTURE have an option to sort individuals by Q. We explore the ‘Sort by Q’ option using R and Excel to figure out what it does.

STRUCTURE is a popular software used by biologists to infer the population structure of organisms using genetic markers. Barplots in STRUCTURE have an option to sort individuals by Q. We are going to figure out what this means and how it is done.

I currently use STRUCTURE 2.3.4 on Windows. A typical assignment output file for K=2 looks like below.

structure-file

Fig 1: A typical Structure assignment output.

We can use the plotting functionality within STRUCTURE to view the assignment results as a barplot. See fig below. The individuals are ordered in the same order as the input file when selecting the ‘Original order’ option. There is another option to sort individuals called ‘Sort by Q’. What does this actually do?

bp-structure-unsorted

FIg 2: Barplot in STRUCTURE software showing original order of individuals (top) and ‘Sort by Q’ order (bottom).

One might reasonably assume that the individuals are sorted by one of the assignment clusters. But, that is not the case. We will try to plot the data manually and investigate this option. The structure output file used can be downloaded here.
We use the R package pophelper to convert structure files to R dataframe, ggplot package for plotting and reshape2 package for data restructuring. The data is read into R as a dataframe with two columns Cluster1 and Cluster2 with assignment values.

#install pophelper library
#library(devtools)
#install_github('royfrancis/pophelper')

#load packages
library(ggplot2)
library(reshape2)
library(pophelper)

#read data to dataframe
df <- runsToDfStructure("structure-file.txt")
head(df)
> head(df)
Cluster1 Cluster2
1 0.965 0.035
2 0.977 0.023
3 0.961 0.039
4 0.975 0.025
5 0.974 0.026
6 0.982 0.018

Now we create a function to create the plot.

#create function to generate plots
plotfn <- function(df=NULL,filename=NULL)
{
#reshape to long format
df$num <- 1:nrow(df)
df1 <- reshape2::melt(df,id.vars = "num")
#reversing order for cosmetic reasons
df1 <- df1[rev(1:nrow(df1)),]

#plot
p <- ggplot(df1,aes(x=num,y=value,fill=variable))+
geom_bar(stat="identity",position="fill",width = 1, space = 0)+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = c(0, 0))+
labs(x = NULL, y = NULL)+
theme_grey(base_size=7)+
theme(legend.position = "none",
axis.ticks = element_blank(),
axis.text.x = element_blank())

ggsave(filename=filename,plot = p,height=4,width=12,dpi=150,units = "cm")
}

#plot unsorted plot
plotfn(df=df,filename="bp-r-unsorted.png")

Here is the assignment barplot in the original order.

bp-r-unsorted

Fig 3: Assignment barplot recreated in R. Individuals are in original order.

Now we create two plots. One figure where the table is sorted by Cluster1 and second figure where the table is sorted by Cluster2.

#sort table by cluster1
df_c1 <- df[order(df[,1]),]
plotfn(df=df_c1,filename="bp-r-sortedc1.png")

#sort table by cluster2
df_c2 <- df[order(df[,2]),]
plotfn(df=df_c2,filename="bp-r-sortedc2.png")
bp-r-sortedc1

Fig 4: Assignment barplot in R sorted by cluster1.

bp-r-sortedc2

Fig 5: Assignment barplot in R sorted by cluster2.

Both of these plots do not resemble the ‘Sort by Q’ option in Structure software. They look like mirror images only because it’s K=2. For K>2, they would look quite different. Anyway, the ‘Sort by Q’ option does a bit more. For each individual, the max assignment value is picked to create a new column called ‘max’. The cluster number with the max assignment is created as a new column called ‘match’. The the whole table is sorted ascending by ‘match’ and descending by ‘max’. Here is the R code.

#pick max cluster, match max to cluster
maxval <- apply(df,1,max)
matchval <- vector(length=nrow(df))
for(j in 1:nrow(df)) matchval[j] <- match(maxval[j],df[j,])

#add max and match to df
df_q <- df
df_q$maxval <- maxval
df_q$matchval <- matchval

#order dataframe ascending match and decending max
df_q <- df_q[with(df_q, order(matchval,-maxval)), ]

#remove max and match
df_q$maxval <- NULL
df_q$matchval <- NULL

#plot
plotfn(df=df_q,filename="bp-r-sortedq.png")

And that gives us the plot we are looking for. The same plot created in the STRUCTURE software.

bp-r-sortedq

Fig 6: Assignment barplot in R sorted by Q.

Here is also an Excel file with the calculations, if R is not your thing.

structure-excel

Fig 7: Assignment barplot and ‘Sort by Q’ calculation in Excel.

You can always verify by checking the individual number (#) with the individual numbers in the STRUCTURE software (set to ‘Plot in multiple lines’).

That’s all for now. I hope this was useful for all those who were as confused as I was.

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: