Project Overview

This project consists of a differential expression analysis across four feature types: genes, exons, transcripts and junctions. The main goal of the study was to explore the effects of smoking and nicotine exposure on the developing brain of mouse pups. As secondary objectives, this work evaluated the genes affected by each substance in the adult brain, in order to compare pup and adult results, and the effects of smoking on adult blood and brain, in order to search for biomarkers that overlap between the two tissues.

1. Data preparation

1.1 Data transformation

The aim of normalization is to remove systematic technical effects from the data so that technical bias has minimal impact on the results [1]. Raw read counts do not necessarily reflect true gene expression, because experimental variability and technical effects (differences that are not due to the biological conditions of interest) prevent read count data from accurately reflecting expression levels.

Within-sample effects, such as gene length and GC content, affect the comparison of read counts between different genes in the same sample. Between-sample effects, such as sequencing depth, affect the comparison of read counts for the same gene across different samples. Experimental variability, such as variability in the total number of molecules sequenced, can lead to different total read counts in different samples; this is referred to as a difference in sequencing depth, and the total number of reads in a sample is the library size of that sample.
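To make the sequencing-depth effect concrete, here is a minimal Python sketch (not the code used in this project) that computes library sizes and counts per million (CPM) for a toy count matrix. The gene names and count values are hypothetical and chosen only for illustration.

```python
# Toy count matrix: one entry per gene, one column per sample.
# All names and values are hypothetical, for illustration only.
counts = {
    "GeneA": [10, 20],
    "GeneB": [30, 60],
    "GeneC": [60, 120],
}


def library_sizes(counts):
    """Total number of reads per sample (column sums)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def counts_per_million(counts):
    """Divide each count by its sample's library size and scale to one million."""
    sizes = library_sizes(counts)
    return {
        gene: [1e6 * c / size for c, size in zip(row, sizes)]
        for gene, row in counts.items()
    }


lib_sizes = library_sizes(counts)
cpm = counts_per_million(counts)
print(lib_sizes)
print(cpm["GeneA"])
```

In this toy example the second sample was sequenced twice as deeply as the first, so every raw count doubles even though the relative expression is unchanged; after scaling by library size the two samples give the same CPM for each gene.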

After data normalization, we'd expect the read counts to follow an approximately normal distribution.

Trimmed Mean of M-Values (TMM)

TMM normalization is the default normalization method of the edgeR package and assumes that most genes are not differentially expressed.

Since the proportion of zeros in the read counts of the junction (jxn) datasets is high, TMMwsp was the most suitable normalization method for them. TMMwsp stands for "TMM with singleton pairing"; it is a variant of TMM that is intended to perform better for data with a high proportion of zeros.

In the TMM method, genes that have zero count in either library are ignored when comparing pairs of libraries. In the TMMwsp method, the positive counts from such genes are reused to increase the number of features by which the libraries are compared. The singleton positive counts are paired up between the libraries in decreasing order of size and then a slightly modified TMM method is applied to the re-ordered libraries [2].
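The core idea of plain TMM can be sketched in a few lines of Python. This is a deliberately simplified illustration, not edgeR's implementation: it uses an unweighted trimmed mean, whereas edgeR applies precision weights and rescales the factors so that they multiply to one across libraries, and it does not implement the singleton pairing of TMMwsp. The function name `simple_tmm_factor`, the trimming fractions used as defaults and the toy counts are assumptions made for this example.

```python
import math


def _ranks(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def simple_tmm_factor(obs, ref, trim_m=0.3, trim_a=0.05):
    """Simplified, unweighted TMM-style scaling factor of `obs` relative to `ref`."""
    n_obs, n_ref = sum(obs), sum(ref)
    m_vals, a_vals = [], []
    for y_obs, y_ref in zip(obs, ref):
        if y_obs == 0 or y_ref == 0:
            continue  # plain TMM ignores genes with a zero count in either library
        p_obs, p_ref = y_obs / n_obs, y_ref / n_ref
        m_vals.append(math.log2(p_obs / p_ref))        # M: log fold change
        a_vals.append(0.5 * math.log2(p_obs * p_ref))  # A: average log abundance

    n = len(m_vals)
    lo_m, hi_m = int(n * trim_m), n - int(n * trim_m)
    lo_a, hi_a = int(n * trim_a), n - int(n * trim_a)
    ranks_m, ranks_a = _ranks(m_vals), _ranks(a_vals)

    # Keep genes that survive trimming on both the M and the A values.
    kept = [
        m for m, rm, ra in zip(m_vals, ranks_m, ranks_a)
        if lo_m <= rm < hi_m and lo_a <= ra < hi_a
    ]
    if not kept:
        raise ValueError("no genes left after trimming")
    return 2 ** (sum(kept) / len(kept))


# Hypothetical libraries: identical except for one strongly up-regulated gene.
ref = [100] * 10
obs = [100] * 9 + [1000]
factor = simple_tmm_factor(obs, ref)
effective_size = sum(obs) * factor
print(factor, effective_size)
```

In the toy data a single highly expressed gene inflates the library size of `obs`. Because that gene is trimmed away, the factor is driven by the unchanged majority of genes, and the effective library size (library size times factor) comes out close to that of the reference library.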

Then I computed QC metrics from the original gene counts for the subsequent sample filtering. It is important to do this step before feature filtering, because the original counts are needed to compute the library size, the mitochondrial and ribosomal proportions, and the number of detected genes of each sample. General QC statistics for genes and samples were calculated with the function perCellQCMetrics(), which is provided by the scuttle package (formerly part of scater). It can also calculate the proportion of counts for specific gene subsets, so we first need to define which genes are mitochondrial and which are ribosomal.
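The per-sample metrics described above can be illustrated with a small Python sketch. It is not a reimplementation of perCellQCMetrics(); the function name `per_sample_qc`, the symbol prefixes used to flag mitochondrial ("mt-") and ribosomal ("Rpl", "Rps") genes, and the toy counts are assumptions made for this example.

```python
def per_sample_qc(counts, gene_names, mito_prefix="mt-", ribo_prefixes=("Rpl", "Rps")):
    """Library size, detected genes and subset proportions for each sample.

    `counts` maps a sample name to a list of raw counts aligned with `gene_names`.
    The prefixes are an assumed naming convention for mouse gene symbols.
    """
    is_mito = [name.startswith(mito_prefix) for name in gene_names]
    is_ribo = [name.startswith(ribo_prefixes) for name in gene_names]

    metrics = {}
    for sample, row in counts.items():
        lib_size = sum(row)
        mito = sum(c for c, flag in zip(row, is_mito) if flag)
        ribo = sum(c for c, flag in zip(row, is_ribo) if flag)
        metrics[sample] = {
            "library_size": lib_size,
            "detected": sum(1 for c in row if c > 0),  # genes with at least one read
            # Proportions are reported as 0.0 for an empty library to avoid dividing by zero.
            "mito_prop": mito / lib_size if lib_size else 0.0,
            "ribo_prop": ribo / lib_size if lib_size else 0.0,
        }
    return metrics


# Hypothetical gene symbols and raw counts for two samples.
gene_names = ["mt-Nd1", "Rpl3", "Rps6", "Actb", "Gapdh"]
counts = {
    "S1": [10, 20, 20, 50, 0],
    "S2": [5, 0, 5, 30, 10],
}
qc = per_sample_qc(counts, gene_names)
print(qc["S1"])
```

Computing these values on the unfiltered counts matters because removing lowly expressed features first would change the library sizes and the number of detected genes that the sample filtering relies on.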

More info:

  1. Robinson, M.D., Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11, R25 (2010). https://doi.org/10.1186/gb-2010-11-3-r25
  2. https://rdrr.io/bioc/edgeR/man/calcNormFactors.html