Contents

Introduction

Single cell orientation tracing (SOT) is an analysis framework for single cell RNAseq data. SOT discovers cellular heterogeneity based on the core biological functions embedding in low dimensions. SOT contains several steps:
1. Gene filter;
2. Grouping co-expressed genes;
3. Low-dimensional representation of cellular heterogeneity.

Load required packages

library(SOT)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyr)
library(SingleCellExperiment)
library(clusterProfiler)
library(ReactomePA)
library(destiny)
library(Matrix)

Load data

We load C1-OSK data for illustration. The data contains 912 cells of reprogramming. In brief, The mouse embryonic fibroblasts (MEF) were induced to pluripotent cells (iPSC) by overexpressing Oct4, Sox2 and Klf4. We use Fluidigm C1 to capture single cells from 11 time points or cell lines, including MEF (initial cells), D0-D8 (inducing cells), iPSC (terminal cells) and ESC (pluripotent cells).

The input data should be normlized elsewhere. The SCnorm, scran are recommended for dataset with less cells. For large dataset, Seurat or pagoda2 are more efficient.

file_dir = system.file("extdata", "c1_osk.csv.gz", package = "SOT")
exprs = read.csv(file_dir, row.names = 1,check.names = FALSE)

The expression profile contains genes in row and cells in columns.

exprs[1:5,1:5]
##       esc-b_c01 esc-b_c02 esc-b_c04 esc-b_c05 esc-b_c07
## Gnai3     53.85     75.73    649.34    629.47    600.02
## Cdc45    244.22    324.17      0.00      0.00    375.36
## H19        0.00      0.00      0.00      0.00      0.00
## Scml2      0.00      0.00      0.00     20.88      0.00
## Narf       0.00      0.00      0.00      0.00      0.00
dim(exprs)
## [1] 20273   912

The data manipulation of SOT is based on SingleCellExperiment object, which provides an easy way to store gene expression profiles, cell annotation, gene annotation, dimension reduction, metadata and etc.

sce = BuildSCE(normcounts=exprs, trim = 5.1/ncol(exprs))

Annotate days.

day = c("mef", "osk_d0", "osk_d1", "osk_d2", "osk_d3", "osk_d4", "osk_d5", "osk_d7", "osk_d8", "ips", "esc")
sce$Day = factor(gsub("(mef|osk_d0|osk_d1|osk_d2|osk_d3|osk_d4|osk_d5|osk_d7|osk_d8|ips|esc).*", "\\1", colnames(sce)),levels=day)
metadata(sce)$`Day color` = c("mef" = "#BEAED4",
                              "osk_d0" = "#fdac86",
                              "osk_d1" = "#FFFF99",
                              "osk_d2" = "#386CB0",
                              "osk_d3" = "#F0027F",
                              "osk_d5" = "#BF5B17",
                              "osk_d7" = "#666666",
                              "osk_d8" = "#ffd56f",
                              "ips" = "#7FBC41",
                              "esc" = "#A6CEE3")

We can browser the information stored in sce object.

sce
## class: SingleCellExperiment 
## dim: 20273 912 
## metadata(1): Day color
## assays(1): normcounts
## rownames(20273): Gnai3 Cdc45 ... Flt3l Rnf223
## rowData names(0):
## colnames(912): esc-b_c01 esc-b_c02 ... osk_d8_c95 osk_d8_c96
## colData names(1): Day
## reducedDimNames(0):
## spikeNames(0):

1 Gene filter

1.1 Filter low expression gene

The filtering results can be accessed in rowData(sce)$"filter.low".

sce = FilterLow(sce, minexp = 10, mincells = 10, datatype = "normcounts")
assay(sce, "logcounts") = Matrix(log2(assay(sce, "normcounts")+1), sparse = TRUE)

1.2 Find high variable genes

The high variable genes can be accessed by rowData(sce)$find.hvgs.

sce = FindHVGs(sce, datatype = "logcounts", thr.bio = 0, thr.FDR = 0.1)

sum(rowData(sce)$genes.use)
## [1] 5615

1.3 Perform distribution test

Some genes with similar distribution across multiple time points reflect stable biological processes (eg. housekeeping genes), which have little information for dissecting heterogeneity of intermediate state of reprogramming. Therefore, we test if gene expression is drawn from a population that follows a particular distribution using Anderson-Darling test, the genes with pvalue less then 1e-3 are excluded.
We use scipy.stats.anderson_ksamp to perform ADtest because this python implementation is much faster than ad.test of R package kSamples. One can force using R implementation by setting r.implement = FALSE.
The genes with different distribution can be accessed by rowData(sce)$"ad.test".

sce = ADtest(sce, 
             datatype = "logcounts",
             condition = "Day", 
             genes.use = "find.hvgs",
             useLevels = c("osk_d1","osk_d2","osk_d3","osk_d5","osk_d7","osk_d8"), 
             ncore = 2,
             thr.padj = 1e-3)
## Perform Aderson-Darling test for 583 cells of <osk_d1 osk_d2 osk_d3 osk_d5 osk_d7 osk_d8>
## Use 2 cores to perform Anderson-Darling test...
## *Use python implement of anderson_ksamp*

The genes number used for downstream analysis.

sum(rowData(sce)$genes.use)
## [1] 3460

2 Grouping co-expressed genes

SOT uses affinity propagation (AP) clustering to find co-expressed gene groups. The advantage of AP clustering is that it can determined the number of clusters automatically. SOT scores each cluster across cell by calculate the first principle component, which provides a denoised view of the data. The cluster scores are stored in reducedDim(sce, "lv1.score").

sce = ap.cluster(sce, datatype = "logcounts", genes.use = "genes.use", ncore = 2)
## Perform AP clustering on 3460 genes...
## Find 92 clusters
## Calculate eigengenes using 2 cores...
## Use metadata(sce)$p.cluster.size to visualize cluster size.

We can see the gene number. The symbols are the exemplars of each cluster.

metadata(sce)$p.cluster.size

Heatmap of lv1.score.

plot_embed(sce, usedEmbed = "lv1.score", anno_col = "Day", color_scale = metadata(sce)$`Day color`, fontsize=5)

The gene clusters are too many and it’s hard to interpret their biological functions since some clusters only contain few genes. To get a more concise view of the cellular heterogeneity, SOT performs AP clustering again on the eigengenes of gene clusters and assigns genes into several larger gene groups.

sce = reduce.cluster(sce)
## Perform AP clustering on 92 clusters...
## Calculate eigengenes using 2 cores...
## Find 9 groups
## Use metadata(sce)$p.group.size to visualize group size.

Similarly, the score of each gene group is represented by the first principle component. The group scores are stored in reducedDim(sce, "lv2.score"). The lv2.score is normalized by remove mean and standard deviation.

metadata(sce)$p.group.size

Heatmap of lv2.score.

plot_embed(sce, usedEmbed = "lv2.score", anno_col = "Day", cluster_row = FALSE, color_scale = metadata(sce)$`Day color`, fontsize=8)

It is interesting to research the pathway of gene groups. It helps us understand the source of reprogramming heterogeneity. Here we perform GO enrichment.

gene_anno = rowData(sce)
gene_anno = gene_anno[gene_anno$genes.use, ]
grlabs = split(gene_anno$symbol, gene_anno$lv2.labels)
gcSample = lapply(grlabs, function(gr) as.numeric(bitr(gr, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Mm.eg.db")$ENTREZID))
xx.mus.go <- compareCluster(gcSample, OrgDb='org.Mm.eg.db', fun='enrichGO', pvalueCutoff = 0.1, qvalueCutoff = 0.1, ont = "BP", readable=T)
EnrichResGO = xx.mus.go@compareClusterResult
dotplot(xx.mus.go, title = paste0("Mouse Gene Ontology"))

datatable(EnrichResGO, filter="top", options=list(pageLength = 10)) 

We can assess divergence of each gene group in each day.

lv2.score = as.data.frame(reducedDim(sce, "lv2.score"))
group.std = lv2.score %>% 
  cbind(Day = sce$Day) %>%
  group_by(Day) %>% 
  summarise_all(funs(sd)) %>%
  gather(Group, Std, -Day)

ggplot(data=group.std, aes(x=Day, y=Std, group=Group)) +
  geom_line(aes(color=Group))+
  geom_point() + facet_wrap(~Group) + theme_bw() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 10)) + 
  labs(title = "Divergence of gene groups in each day")

The group divergence reflects the heterogenous contribution of each gene group for each day. For example, group8 (annotated for keratinization) has low standard deviation (std) in MEF, iPS and ESC, but has high std in the intermediate stage, suggesting a divergent cell state emerged during reprogramming. The group7 has low std across all time points indicate a synchronistical biological process.

3 Low-dimensional representation of cellular heterogeneity

Previouse study had reported that cell cycle effect could mask the true developmental process when analyzing single-cell RNA-seq data. Notably, after performming co-expression analysis, we find that the group2 enriched for cell cycle, which may blur the dissection of successed and failed reprogramming branches. So we exclude group1 and then perform diffusion map. Diffusion map is dimensional reduction algorithm that discovers continuous manifold structure by constructing Markovian transition of cells. We found that cosine distance is more stable than euclidean distance when calculating diffusion map.

dm = DiffusionMap(reducedDim(sce, "lv2.score")[,-c(1)], distance ="cosine")
plot(eigenvalues(dm), ylim = c(0.8,1), pch = 20,main="Scree plot", xlab = "Diffusion component (DC)", ylab = "Eigenvalue")

The top diffusion components are selected and stored in reducedDim(sce) namely “dcs”.

top_dcs = dm[[paste0("DC", 1:7)]]
reducedDim(sce, "dcs") = top_dcs

SOT integrates some methods for visualizing high dimensional data in 2-d, e.g. FR (Fruchterman-Reingold layout), largeVis, umap(Uniform Manifold Approximation and Projection).

Once the specific type of 2-d layout was computed, it will not be computed again when usedEmbed contains one of FR, largeVis and umap in the string. To force re-calculating the layout, one should remove the coordinate in reducedDims(sce), for example, reducedDim(sce, "dcs_FR")=NULL.

sce = layout_2d(sce, k = 35, seed = 13, usedEmbed = "dcs", layout_method = "FR", colour_condition = "Day", color_scale = metadata(sce)$`Day color`)
## Use metadata(sce)$p.dcs_FR.Day to visualize embeddings.

The layout function returns sce with ggplot2 object in metadata(sce).

metadata(sce)$p.dcs_FR.Day

We can see distinct branches of reprogramming. Gene group scores can be showed in the layout which help us examine the biological pathways in different stage or branch of reprogramming.

sce = layout_2d(sce, usedEmbed = "dcs_FR", colour_embed = "lv2.score")
## Use metadata(sce)$p.dcs_FR.lv2.score to visualize embeddings.
metadata(sce)$p.dcs_FR.lv2.score

We can also see the expression of marker genes.

genes = c("Dppa5a", "Nanog", "Sall4", "Klk10", "Fxyd5", "Cd34","Gbp2", "Gbp5", "Ifng")
sce = layout_2d(sce, datatype = "logcounts", usedEmbed = "dcs_FR", colour_gene = genes)
## Use metadata(sce)$p.dcs_FR.gene to visualize embeddings.
metadata(sce)$p.dcs_FR.gene