Boulder Workshop 2021

MAGMA Tutorial

In this practical we will run through the three basic steps of performing a (MAGMA) gene-set analysis: 
annotation of SNPs to genes, gene analysis and subsequent gene-set analysis. Additionally, we will also 
perform a number of more advanced analyses using the generalized gene-set analysis framework that 
MAGMA provides, including conditional and joint gene-set analysis, gene-set interaction analysis, as 
well as analysis of tissue-specific gene expression levels. Input files are provided alongside this 
instruction, and consist of:

*	the MAGMA manual (manual_v1.09a.pdf)
*	a PLINK data set containing simulated GWAS data of a small but more or less realistic size 
	(practical.bed, practical.bim, practical.fam)
*	a file containing 10 PCs to correct for population stratification, to be included as covariates in 
	the gene analysis (practical.cov)
*	a gene-set file containing 1013 Reactome gene sets (reactome.sets)
*	a gene covariate file containing tissue-specific gene expression levels per gene for 11 tissues, 
	simulated based on real expression data (tissue_gex.cov)
*	a gene definition file (NCBI37.3.gene.loc)
*	two additional auxiliary files (step3a.signif, step6a.partitioned.sets)

All the input files except practical.bed are plain-text files, so if you want to inspect their content 
you can do so with a text editor such as vi or vim or use commands such as more or head and tail.
You can also download the files to your local computer and inspect them with Word or Wordpad 
rather than Notepad however, since the files have been generated on a Linux system and will 
therefore not display properly in Notepad. 


Notes
*	To determine significance in this practical, we will use the traditional significance threshold  
	of 0.05. We will use Bonferroni correction to account for multiple testing. As a reminder, a 
	test is significant after Bonferroni correction when its p-value is smaller than 0.05/K, where 
	K is the number of tests you want to correct for.


#### Step 0: set-up ####
Log in to the remote computer. Create a new folder for today’s tutorial on gene- and pathway analysis called friday2
and copy the following files into that new folder. Use the following commands:

mkdir friday2
cd friday2
rsync -av /faculty/christiaan/Boulder2021/magma_session.zip ./
unzip magma_session.zip
cd magma_session

#### Step 1: annotation ####
In this step we will annotate the SNPs in the data to genes. To do so, use the command:

./magma --annotate window=1,0.5 --snp-loc practical.bim --gene-loc NCBI37.3.gene.loc --out step1

This tells MAGMA to perform annotation, mapping SNPs to genes based on the transcription region of 
each gene. A SNP is mapped to a gene if it is located either inside the transcription region of the gene, 
or in a window around it. In this case we specify the window to reach up to 1 kilobase upstream of the 
transcription start site, and 0.5 kilobases downstream of the transcription stop site.
       The --snp-loc flag specifies which file to use to read the SNP locations from, and the --gene-loc 
flag specifies the file that defines the gene locations. The latter contains one row per gene, with the 
values: gene ID, chromosome, transcription start and stop site (in base-pair position), genomic strand 
(this relates to the direction in which the gene is transcribed: “front to back” or “back to front”; for 
genes on the negative strand, the transcription start site is the higher of the two base-pair values), and 
official gene symbol.

Running this command will create the file step1.genes.annot, containing the mapping of SNPs to 
genes. This will be used as an input file for the gene analysis. Each row in the file corresponds to a 
gene, containing: the gene ID, the mapping region (chromosome:start:stop), and then the list of SNP IDs 
mapped to that gene. A step1.log file will also be created, containing the output that was also 
printed to the screen. It provides you with useful information about the steps that were performed 
(such as the number of values read from input files or printed to output files), as well as any warnings 
and errors that occurred during execution.

*	Questions:
	How many gene definitions were in the gene location file and how many genes have ended 
	up in the .genes.annot file? What caused this difference, and how do you think this could affect the 
	gene-set analysis?


#### Step 2: gene analysis ####
In this step we will run a gene analysis, which will perform a test of association for each gene and 
create the input file needed for all subsequent gene-set analyses. We will do so using the command:

./magma --bfile practical --covar file=practical.cov --gene-annot step1.genes.annot --out step2

The --bfile flag specifies the prefix of a binary PLINK file set (.bed, .bim and .fam) that will be used for 
the analysis. Because we are using raw genotype data as input, the default principal components linear 
regression model will be used for the analysis. It will use the phenotype embedded in the .fam file as 
the dependent variable, and will also include the variables in the practical.cov file specified using the -
-covar flag as additional covariates. For genes on the X chromosome, gender will also be included as a 
covariate. The --gene-annot flag tells MAGMA which SNP-to-gene mapping file to use to determine 
which genes to analyse and what SNPs they contain.

This command will create two output files: step2.genes.out and step2.genes.raw. The .genes.raw file 
contains all the information MAGMA needs to perform the analyses in the next steps. It is a plain- 
text file and you can inspect the content if you want, but it is not really designed to be read by people 
and all the relevant output is also in the .genes.out file.
       The step2.genes.out file is the main gene analysis output file, and contains the following 
information: the gene ID (GENE), gene mapping region (CHR, START and STOP), number of valid SNPs 
mapped to the gene (NSNPS), number of principal components extracted from those SNPs (NPARAM), 
sample size for that gene (N; in this case it is the same for all genes, but it can vary if there are missing 
values in the data), the test statistic and corresponding p-value (ZSTAT and P), and the r-squared and adjusted
r-squared values (RSQ and RSQ_ADJ; these reflect the proportion of variance in the phenotype explained by 
the SNPs in that gene).

*	Questions:
	How many genes are significant after Bonferroni correction (correcting for the total 
	number of genes)? What percentage of the genes has a p-value below 0.05? How would you 
	interpret that, does this indicate a lot of genetic signal in the data to you? 


	Note: To answer this, you can either load the output file into R and manipulate the resulting dataframe, 
	or you can use Linux commands instead. Remember that google is your friend here. Most coding problems you 
	encounter have been solved by many others before. Dedicated forums usually contain all the answers to your 
	questions. 

	Here are two stack overflow pages that can help you with this question: 
	- print values below a threshold: https://stackoverflow.com/questions/6848606/print-only-values-smaller-than-certain-threshold-in-bash 
	- count lines in output: https://stackoverflow.com/questions/12457457/count-number-of-lines-in-terminal-output


#### Step 3a: basic competitive gene-set analysis ####
Having completed the gene analysis step, we will now perform a competitive gene-set analysis:

./magma --gene-results step2.genes.raw --set-annot reactome.sets --out step3a

The --gene-results flag specifies which gene analysis .genes.raw output file to use to perform the 
analysis, and the --set-annot flag which gene-set definition file. In this case we use the reactome.sets 
file, which contains 1013 gene sets. These are almost all real gene sets taken from various databases, 
reflecting known biological pathways. A few additional sets were added for the purpose of this 
practical. The gene sets are stored by row, with each row containing the name of the gene set followed 
by the list of gene IDs of genes that belong in that set.
       With this command MAGMA will analyse each gene set in the reactome.sets file, one at a time, 
using the linear regression framework explained in the lecture. As you will see in the output log, a 
number of data-level properties of genes (eg. number of SNPs mapped to a gene) are automatically 
included as covariates in the analyses. In practice not all the genes mapped to a gene set in the 
reactome.sets will actually be included when analysing that set, because they are not present in the
.genes.raw file. This could be because those genes were not included in the gene definition file during 
annotation or had no SNPs mapped to them; it could also be because all of the SNPs mapped to that 
gene were either missing from the genotype data, or were invalid (eg. because they had too many 
missing values).

This  command  will  produce  three  output  files:  step3a.gsa.out,  step3a.gsa.genes.out  and
step3a.gsa.sets.genes.out.  The  step3a.gsa.out contains  the   analysis results for all the gene sets, and 
has the following information: the name of the gene set (VARIABLE and FULL_NAME; the VARIABLE 
column is a truncated version of the full name, this is intended to make the file easier to read when there 
are very long variable names), the variable type (TYPE; in this case, all are gene set variables), the 
number of genes included in the gene set for the analysis (NGENES), and the linear regression 
parameters (BETA, BETA_STD, SE) and corresponding p-value (P). The BETA value is the actual model 
parameter as discussed in the lecture (with SE its standard error). BETA_STD is a standardized 
coefficient, dividing BETA by the standard deviation of the gene set (generally larger for larger gene 
sets). This can be useful for comparing the effect size of different gene sets.
       The step3a.gsa.genes.out file contains information per gene for all the genes used in the 
analysis, and is very similar to the .genes.out file from step 2. You won’t need it for this practical. The 
step3a.gsa.sets.genes.out file contains information per gene for significant gene sets (determined 
using Bonferroni correction for the total number of gene sets analyzed). It contains mostly the same 
columns as the step2.genes.out file, in separate blocks for each of the significant gene sets. This is 
useful for better understanding the genes and associations of those genes in a significant set.


*	Questions: 
	how many gene sets are significant in the gene-set analysis (after Bonferroni correction for 
	the total number of analyzed sets)? How do you interpret a significant result for a gene set in a 
	competitive analysis like this, what do you conclude from the fact that for example 
	SIGNALING_BY_NOTCH1_T is significant?

  	Inspect the gene analysis results for the SIGNALING_BY_NOTCH1_T set in the .gsa.sets.genes.out.txt file. 
	Are any of the genes significant at the genome-wide level (i.e. Bonferroni- corrected for the total number of 
	genes in the data)? What percentage of the genes has a p-value below 0.05? Is this higher than the 
	percentage you find for the data set as a whole in step 2? Do you think the genes with p-value greater than 
	0.05 still contribute to the gene-set association?


#### Step 3b: conditional gene-set analysis ####
The reactome.sets file contains a very strongly associated gene-set helpfully labelled 
CRITICAL_PATHWAY. Gene sets often overlap with each other, and it is possible that some gene sets 
are significant simply because they overlap with this CRITICAL_PATHWAY. We will therefore run a 
conditional gene-set analysis to test whether this is the case here for any of the other significant gene 
sets. The command to do so is:

./magma --gene-results step2.genes.raw --set-annot reactome.sets --model analyse=file,step3a.signif condition=CRITICAL_PATHWAY --out step3b


The ‘analyse’ option of the --model flag tells MAGMA to only analyse a selection of gene sets, in this 
case all the gene sets listed in the step3a.signif file. This file lists all the significant gene sets from step 
3a, for convenience this has already been created for you. With the ‘condition’ option we tell MAGMA 
that CRITICAL_PATHWAY should be included as an additional covariate in the gene-set analysis. As 
such, for each of the gene sets to be analysed (ie. those listed in step3a.signif), MAGMA will use a linear 
regression model containing two gene set variables: the gene set to be analysed, and the 
CRITICAL_PATHWAY gene set.

The output files from this step are of the same kind as in step3a, the only difference is that now the
.gsa.out file contains an additional MODEL column. Each row still corresponds to the results of a single 
gene set, so this MODEL column tells you which rows belong together in the same regression model. 
The parameter estimates and p-value therefore reflect the strength of the gene set effect when the 
other gene sets in the same model are taken into account. So for example, in this case model 1 will 
contain both CRITICAL_PATHWAY and SIGNALING_BY_NOTCH1_T, and the results for the 
SIGNALING_BY_NOTCH1_T reflect its effect conditional on CRITICAL_PATHWAY. Keep in mind that 
these multi-variable models are symmetrical: the CRITICAL_PATHWAY result for model 1 thus reflects 
the effect of CRITICAL_PATHWAY conditional on SIGNALING_BY_NOTCH1_T.

When interpreting results from a conditional analysis, it is always useful to compare the conditional 
association of a gene set with its marginal association (ie. the association that the variable had before 
conditioning on the other gene set). This tells you how much of that marginal association could be
explained by the other gene set. To do so we could go back to the step 3a results file, but we can also 
just rerun that analysis with only the gene sets of interest included:

./magma --gene-results step2.genes.raw --set-annot reactome.sets --model analyse=file,step3a.signif --out step3c

*	Questions: 
	how does conditioning on the CRITICAL_PATHWAY gene set affect the associations of the 
	other gene sets? How many of those gene sets remain significant (at the original Bonferroni-corrected 
	threshold) when the CRITICAL_PATHWAY effect is taken into account? What would you conclude about 
	the gene sets that are no longer significant? Does the CRITICAL_PATHWAY remain significant in all 
	cases? How do you interpret the results from models in which it does not?


#### Step 4a: basic tissue expression analysis ####
Continuous gene properties can be analyzed in much the same way as gene 
sets. In this tutorial we will analyse gene expression values (on a log2(RPKM) scale, higher values 
mean stronger expression) for different tissue types, which can provide insight into the tissue- 
specificity of our genetic associations. This analysis is run as follows:

./magma --gene-results step2.genes.raw --gene-covar tissue_gex.cov --model direction-covar=positive --out step4a

The --gene-covar flag is used to specify a file containing continuous gene properties, in this case the 
tissue_gex.cov file containing gene expression values. Each row in the file corresponds to a gene, with 
the gene ID listed in the first column followed by the all the gene expression variables in subsequent 
columns. The file contains expression variables for eleven different tissues, as well as a twelfth variable 
containing the mean expression across all the tissues.
       As when analysing the gene sets, this command will analyse each of the expression variables 
one at a time. The ‘direction-covar’ option sets the direction of the test that is performed. In this case 
we are testing whether the effect of the expression variable is positive.

The command will generate a step4a.gsa.out output file, which has all the same columns as the 
.gsa.out.txt file from step 3a. Because the variables are continuous, the NGENES column is set to the 
total number of genes in the analysis. You will see that this is actually a few hundred genes less than 
before, this is because for some of the genes no gene expression data was available (this is quite 
common with such data). Those genes were therefore discarded from the analysis.

*	Questions: 
	how do you interpret a significant result for a continuous gene property in an analysis like 
	this, what do you conclude from the fact that for example BRAIN_EXPR is significant? We performed a 
	one-sided test for positive association, do you think testing for negative associations would also be 
	useful? How would you interpret a significant negative association for one of these tissue expression 
	variables?

       	How many tissue expression levels are significantly (positively) associated with the genetic 
	associations (after Bonferroni correction for all tissue variables)? Taking all the results together, do you 
	think they are very informative about the phenotype?


#### Step 4b: conditional tissue expression analysis ####
As with the gene-set analysis, conditional analysis can again be used to obtain more specificity in our 
results. In this case, the significance of AVERAGE_EXPR in the previous step show us that in general, 
more strongly expressed genes also tend to be more strongly associated with our phenotype. This 
means that associations found for the specific tissue expression levels may simply reflect this general 
relation, rather than saying anything specifically about the expression in that tissue type. In this step 
we will therefore condition on the average gene expression level per gene to obtain associations that 
are specific to individual tissue expression levels:

./magma --gene-results step2.genes.raw --gene-covar tissue_gex.cov --model direction-covar=positive condition-hide=AVERAGE_EXPR --out step4b

In this case we are using the ‘condition-hide’ option rather than ‘condition’, this suppresses output for 
AVERAGE_EXPR in the step4b.gsa.out file. This does not otherwise affect the results of the analysis, 
but since we are not very interested here in the output for AVERAGE_EXPR itself in these models for 
now it is helpful for making the output file easier to read.

*	Questions: 
	how many tissue expression levels remain significant (at the original threshold) now that 
	we have accounted for the overall average effect of gene expression? Taken together, what do you 
	conclude from the results of step 4a and 4b?


#### Step 5: joint analysis of gene sets and tissue expression levels ####
Confounding and overlap of association signals can of course also happen between gene sets and 
continuous gene properties, and with gene expression variables it is not uncommon for this to happen. 
For example, for psychiatric phenotypes like schizophrenia there is often a strong association with 
brain-specific expression levels. Any gene set that happens to contain many strongly brain-expressed 
genes is therefore more likely to be significant as well, even if the underlying pathway or biological 
process has nothing to do with schizophrenia. We can again use conditional analysis to account for 
this.
       For our example data we will do so in two steps, first correcting for the effect of average 
expression, then additionally correcting for the effect of brain-specific expression as well.

./magma --gene-results step2.genes.raw --set-annot reactome.sets --gene-covar tissue_gex.cov --model analyse=file,step3a.signif condition-hide=AVERAGE_EXPR --out step5a

./magma --gene-results step2.genes.raw --set-annot reactome.sets --gene-covar tissue_gex.cov --model analyse=file,step3a.signif condition-hide=AVERAGE_EXPR,BRAIN_EXPR --out step5b


We are only reanalyzing the gene sets that were previously significant, as the aim is only to investigate 
whether those significant associations may have been the result of confounding caused by gene 
expression effects. We are again using the ‘condition-hide’ option to make the output files a bit tidier, 
since we are only interested now in what happens to the associations of the gene sets, not those of 
the expression variables.

*	Questions: how strongly are the gene-set p-values affected by conditioning on the general gene 
	expression levels? And when you also condition on the brain-specific expression? How would you 
	interpret these results?


#### Step 6: tissue by gene set interaction analysis ####
In a normal gene set analysis we implicitly assume that most or all genes in a particular gene set are 
relevant to the phenotype we are investigating. In practice however, this may not be the case. For 
example, the gene sets we use typically reflect general biological functions and processes whereas 
most phenotypes exhibit tissue-specific involvement of genes. Even if a particular biological function 
is relevant to that phenotype, this may be restricted to genes that are more strongly expressed in a 
specific tissue. Statistically, this would likely manifest as an interaction between expression in that 
tissue and the gene set.
       In our example, brain expression was significantly associated, so we will test for interactions 
between brain expression and all large gene sets (interactions between continuous variables and gene 
sets are often unstable for small gene sets; we will leave the cut-off at the default value of at least 100 
genes in the gene set).

./magma --gene-results step2.genes.raw --set-annot reactome.sets --gene-covar tissue_gex.cov --model analyse=sets direction-interaction=positive interaction-each=BRAIN_EXPR --out step6a

With the ‘analyse=sets’ option we can specify that we are only interested in analysing the gene sets, 
and not in any of the other tissue expression variables in the tissue_gex.cov file. We are also setting 
the testing direction to test for positive interaction effects. With the ‘interaction-each’ option set, 
MAGMA will now perform an interaction analysis with BRAIN_EXPR for each of the gene sets (provided 
they contain at least 100 genes). Each interaction model will be a linear regression containing three 
variables: the BRAIN_EXPR tissue expression variable, a gene set variable, and the interaction of 
BRAIN_EXPR with that gene set.

A step6a.gsa.out file will be created containing the analysis results, with the same structure as in step 
3b. The file again contains a MODEL column which tells you which rows go together in the same model; 
for this analysis it will be three rows per model, two for the main effects and one for the interaction. In 
this case we are only interested in the p-values for the interaction terms, denoted by INTER-SC in the 
TYPE column. 

*	Questions: 
	how many significant interactions are there in the output (Bonferroni-corrected for the number of interaction terms
	tested)? How do you interpret each significant result, what do you conclude from it about the effect gene set 
	involved in the interaction? Was that gene set significant in the earlier gene-set analysis?


With a significant tissue by gene set interaction, it can be useful to perform a follow-up analysis of the 
gene set partitioned by expression in the tissue. To do so, the genes in the gene set are first sorted 
from lowest to highest expression. The set is then subdivided into smaller sets based on this ordering. 
This can make the results of the analysis somewhat easier to interpret, and can help protect against 
possible outlier effects (which the interaction analysis can be sensitive to).
       In this case a step6a.partitioned.sets has already been provided for this for the significant 
interaction from step 6a. It contains four subsets corresponding to the partitioned gene set, from 
lowest expression in Q1 to highest expression in Q4. The full gene set is also included in this file, for 
comparison. We will analyse these partitions as normal gene sets, conditioning on the brain 
expression.


./magma --gene-results step2.genes.raw --set-annot step6a.partitioned.sets --gene-covar tissue_gex.cov --model analyse=sets condition-hide=BRAIN_EXPR --out step6b


*	Questions: 
	how do you interpret the results of this additional analysis? Does it further clarify the
	significant interaction and explain why the gene set wasn’t significant on its own?


Prologue
The MAGMA software and auxiliary files, as well as the manual and reference to the accompanying 
papers, can be found on the MAGMA website at http://ctglab.nl/software/magma. For further 
questions and suggestions you can also email Christiaan de Leeuw at c.a.de.leeuw@vu.nl.