Chapter 3 DESeq2 Normalization and Steps

3.1 Normalization

Different library sizes or sequencing depth
RNA composition bias

Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.

To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user-end there is only one step, but on the back-end there are multiple steps involved, as described below.

In Short:

Take the geometric mean of each gene's counts across all samples
Divide the gene's count in each sample by that geometric mean
Take the median of these ratios within each sample -> the sample's normalization factor (applied to its gene counts)
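The same summary can be written as a formula. This is a sketch in notation that does not appear in the text above: K_{ij} is assumed to be the raw count of gene i in sample j, m the number of samples, and s_j the normalization factor of sample j. The median is taken over genes whose geometric mean is not zero.

```latex
s_j = \operatorname*{median}_{i \,:\, \prod_{v} K_{iv} \neq 0}
      \frac{K_{ij}}{\left( \prod_{v=1}^{m} K_{iv} \right)^{1/m}}
```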

In Details:

	Sample 1	Sample 2	Sample 3
Gene 1	0	10	4
Gene 2	2	6	12
Gene 3	33	55	200

Step 1: Take the natural log (base e) of the raw counts

Step 2: Average the logs for each gene across all samples. This average is the log of the gene's geometric mean.

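Step 3: Filter out genes with a count of 0 in any sample. log(0) is negative infinity, so the average log for such a gene is also negative infinity and the gene cannot be used (Gene 1 in the table above).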

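Step 4: For each gene in each sample, subtract the gene's average log from the log of its raw count: log(raw count) - average log. This is the log of the ratio between the gene's count in that sample and its geometric mean across all samples.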

Step 5: Calculate the median of these log ratios for each sample, taken across all remaining genes

Using the median reduces the influence of genes with extreme expression, such as a few very highly expressed genes, so the scaling factor is driven by moderately expressed genes and housekeeping genes.

Step 6: Convert each sample's median back from the log scale by exponentiating it. The result is the sample's scaling factor: scaling factor = e^median

Step 7: Divide the original read counts in each sample by that sample's scaling factor
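The seven steps can be traced on the small table above. What follows is a minimal sketch in plain Python, not DESeq2's own code; the names `counts`, `size_factors`, `factors` and `normalized` are made up for illustration. It assumes that a gene with a zero count in any sample is dropped, since log(0) is not defined.

```python
import math
from statistics import median

# Raw counts from the example table: one row per gene, one column per sample.
counts = {
    "Gene 1": [0, 10, 4],
    "Gene 2": [2, 6, 12],
    "Gene 3": [33, 55, 200],
}


def size_factors(counts):
    """Median-of-ratios scaling factor for each sample (illustrative sketch)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]  # one list per sample

    for row in counts.values():
        # Step 3: skip genes with a zero count in any sample.
        if any(c == 0 for c in row):
            continue
        # Step 1: natural log of the raw counts.
        logs = [math.log(c) for c in row]
        # Step 2: average log across samples (log of the geometric mean).
        mean_log = sum(logs) / len(logs)
        # Step 4: log(raw count) - average log, per sample.
        for j, value in enumerate(logs):
            log_ratios[j].append(value - mean_log)

    # Step 5: median per sample; Step 6: back to the normal scale with e^median.
    # statistics.median raises an error if every gene was filtered out.
    return [math.exp(median(sample)) for sample in log_ratios]


factors = size_factors(counts)

# Step 7: divide each raw count by the scaling factor of its sample.
normalized = {
    gene: [c / s for c, s in zip(row, factors)]
    for gene, row in counts.items()
}

print([round(s, 3) for s in factors])
```

For this table the scaling factors come out at roughly 0.42, 0.94 and 2.53 for Samples 1 to 3. Gene 1 is excluded when the factors are computed, but its counts are still divided by them in Step 7.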

3.2 Dispersion

When comparing gene expression levels between groups, it is important to account for within-group variability. It is difficult to estimate within-group variability for each gene on its own. The solution is to pool information across genes that are expressed at a similar level in the replicates. This assumes that genes of similar average expression strength have similar dispersion.

Estimates the dispersion of each gene by maximum likelihood
Fits a curve to capture the dependence of these estimates on the average expression strength
Shrinks the gene-wise values towards the curve using an empirical Bayes approach
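To make the term concrete, here is a sketch in notation that does not appear in the text above: \mu is assumed to be a gene's mean count, \alpha its dispersion, and a_0, a_1 the coefficients of the fitted curve. The first line is the negative binomial mean-variance relationship. The second is the form of the curve under DESeq2's default parametric fit; other fit types give a curve of a different form.

```latex
\operatorname{Var}(K) = \mu + \alpha \mu^{2}
\qquad
\alpha_{\mathrm{tr}}(\mu) = \frac{a_1}{\mu} + a_0
```

With \alpha = 0 the variance equals the mean, as in a Poisson model. A larger \alpha means more variability between replicates than sampling noise alone would give.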

3.3 Generalized Linear Model

Read counts are modeled as following a negative binomial distribution
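Written out, the model has the following general shape. The notation is not from the text above and is introduced here as an assumption: K_{ij} is the count of gene i in sample j, s_j the sample's scaling factor from normalization, \alpha_i the gene's dispersion, x_{jr} the entries of the design matrix, and \beta_{ir} the coefficients (log2 fold changes) that the model estimates.

```latex
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\log_2 q_{ij} = \sum_{r} x_{jr} \, \beta_{ir}
```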

3.3.1 Why negative binomial distribution for analysing RNAseq data

Explained quite nicely here

3.3.2 Statistical Significance and Multiple testing correction

Wald Test for significance

Benjamini-Hochberg
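Both ideas fit in a few lines of plain Python. This is a minimal sketch, not DESeq2's implementation; the function names `wald_pvalue` and `benjamini_hochberg` and the example numbers are hypothetical. It assumes the Wald statistic is the estimated log fold change divided by its standard error and compared against a standard normal distribution, and it leaves out any filtering of genes that DESeq2 may apply before the adjustment.

```python
import math


def wald_pvalue(beta, se):
    """Two-sided p-value for a Wald statistic z = beta / se."""
    z = beta / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes, for illustration only.
example_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example_pvalues))
```

A gene is then called significant when its adjusted p-value falls below the chosen false discovery rate threshold, for example 0.05.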

Memoirs of a bioinformatician