Chapter 3 DESeq2 Normalization and Steps
3.1 Normalization
Normalization needs to account for two sources of bias between samples:
- Different library sizes (sequencing depth)
- RNA composition bias
Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.
To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user-end there is only one step, but on the back-end there are multiple steps involved, as described below.
In Short:
Take the geometric mean of each gene's counts across all samples
Divide each gene's count in a sample by that gene's geometric mean
Take the median of these ratios within each sample -> that sample's normalization factor (the sample's gene counts are divided by it)
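Written as a formula, the three steps look roughly as follows. The symbols are assumptions introduced here for illustration: $K_{ij}$ is the raw count of gene $i$ in sample $j$, $m$ is the number of samples, $K^{R}_i$ is the gene's geometric mean, and $s_j$ is the normalization factor of sample $j$. Genes whose geometric mean is zero are left out of the median.

$$
K^{R}_i = \Big(\prod_{j=1}^{m} K_{ij}\Big)^{1/m}, \qquad
s_j = \operatorname*{median}_{i:\,K^{R}_i > 0} \frac{K_{ij}}{K^{R}_i}
$$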
In Detail:
| Gene | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| Gene 1 | 0 | 10 | 4 |
| Gene 2 | 2 | 6 | 12 |
| Gene 3 | 33 | 55 | 200 |
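Step 1: Take the natural log (base e) of the raw counts.
Step 2: Average the log counts for each gene across all samples. This average is the log of the gene's geometric mean.
Step 3: Filter out genes with a count of 0 in any sample. log(0) is negative infinity, so these genes have no finite average (Gene 1 in the table above is removed).
Step 4: For each gene, subtract the average log from the log of the raw count in each sample: log(raw count) - average log = log(raw count / geometric mean). This is the log of the ratio of the gene's count in a sample to its geometric mean across all samples.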
Step 5: Calculate the median of these log ratios for each sample (across all remaining genes).
Using the median reduces the influence of genes with extreme expression, so that a few very highly expressed genes do not skew the factor. The estimate is instead driven by moderately expressed genes and housekeeping genes.
Step 6: Convert each median from the log scale back to the linear scale to get the sample's scaling factor: scaling factor = e^median
Step 7: Divide each sample's original read counts by that sample's scaling factor.
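The steps can be traced on the small table above with a short Python sketch. It is an illustration of the median of ratios idea using only the standard library, not DESeq2's own code. The function name `size_factors` and the dictionary layout are assumptions made for this example. Genes with a zero count in any sample are dropped, and the median is taken per sample across the remaining genes.

```python
import math
import statistics

# Raw counts from the table above: one row per gene, columns are Samples 1-3.
counts = {
    "Gene 1": [0, 10, 4],
    "Gene 2": [2, 6, 12],
    "Gene 3": [33, 55, 200],
}


def size_factors(counts):
    """Return one median-of-ratios scaling factor per sample (hypothetical helper)."""
    n_samples = len(next(iter(counts.values())))
    # Step 3 (done up front): drop genes with a zero in any sample, since log(0) is undefined.
    kept = {gene: row for gene, row in counts.items() if all(c > 0 for c in row)}
    log_ratios = [[] for _ in range(n_samples)]
    for row in kept.values():
        logs = [math.log(c) for c in row]      # Step 1: natural log of raw counts
        avg = statistics.mean(logs)            # Step 2: per-gene average of the logs
        for j, value in enumerate(logs):
            log_ratios[j].append(value - avg)  # Step 4: log(count) - average log
    # Step 5: median per sample; Step 6: back to the linear scale.
    return [math.exp(statistics.median(r)) for r in log_ratios]


factors = size_factors(counts)
# Step 7: divide every count by its sample's scaling factor.
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in counts.items()
}
print(factors)
print(normalized)
```

Gene 1 is excluded when the factors are computed, but its counts are still divided by them in the last step.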
3.2 Dispersion
When comparing gene expression levels between groups, it is important to account for within-group variability. This variability is difficult to estimate reliably for a single gene when there are only a few replicates per group. Solution: pool information across genes that are expressed at a similar level. This assumes that genes of similar average expression strength have similar dispersion.
Estimates gene-wise dispersions by maximum likelihood
Fits a curve to capture the dependence of these estimates on the average expression strength
Shrinks the gene-wise values towards the curve using an empirical Bayes approach
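To get a feel for what a dispersion value means, the sketch below computes a crude per-gene estimate from replicate counts. It is deliberately simpler than the procedure listed above: it uses a method-of-moments estimate instead of maximum likelihood, and it does no curve fitting or shrinkage. It assumes the negative binomial mean-variance relationship variance = mean + dispersion * mean^2. The function name `moment_dispersion` and the example counts are made up for illustration.

```python
import statistics


def moment_dispersion(normalized_counts):
    """Crude method-of-moments dispersion for one gene (not the DESeq2 estimator).

    Assumes variance = mean + dispersion * mean**2, so
    dispersion = (variance - mean) / mean**2, floored at 0.
    """
    mean = statistics.mean(normalized_counts)
    if mean == 0:
        return 0.0
    variance = statistics.variance(normalized_counts)  # sample variance (n - 1)
    return max((variance - mean) / mean**2, 0.0)


# Hypothetical normalized counts for one gene in three replicates of one group.
replicates = [10, 20, 30]
print(moment_dispersion(replicates))
```

With only three replicates the estimate is very noisy, which is the motivation for pooling information across genes with similar expression.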
3.3 Generalized Linear Model
DESeq2 fits a generalized linear model for each gene, with the read counts assumed to follow a negative binomial distribution.
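In symbols, the model can be summarized as below. The notation is an assumption added here for illustration: $K_{ij}$ is the count of gene $i$ in sample $j$, $\mu_{ij}$ its fitted mean, $\alpha_i$ the gene's dispersion (Section 3.2), $s_j$ the sample's normalization factor (Section 3.1), $x_{jr}$ the entries of the design matrix, and $\beta_{ir}$ the log2 fold change coefficients.

$$
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}
$$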
3.3.1 Why negative binomial distribution for analysing RNA-seq data
Explained quite nicely here
3.3.2 Statistical Significance and Multiple testing correction
Wald test for significance
Benjamini-Hochberg correction for multiple testing
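Both ideas fit in a few lines of Python. This is a minimal sketch using only the standard library, not DESeq2's implementation. The Wald statistic is taken to be the estimated coefficient divided by its standard error and compared to a standard normal distribution, and the Benjamini-Hochberg step converts raw p-values into adjusted p-values. The function names and the example numbers are hypothetical.

```python
import math


def wald_pvalue(estimate, standard_error):
    """Two-sided p-value for z = estimate / standard_error under a standard normal."""
    z = estimate / standard_error
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2 fold changes and standard errors for four genes.
estimates = [2.0, 0.5, -1.5, 0.1]
standard_errors = [0.5, 0.4, 0.6, 0.3]
pvalues = [wald_pvalue(b, se) for b, se in zip(estimates, standard_errors)]
print(benjamini_hochberg(pvalues))
```

Genes are then called significant when the adjusted p-value falls below the chosen false discovery rate threshold.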