Space Ranger 2.0, printed on 11/21/2024
The spaceranger count pipeline outputs several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Analysis View in the Web Summary.
From Space Ranger 2.0 onwards, the prefix gene_expression_ is added to subfolder names in the analysis directory.
Before clustering, Principal Component Analysis (PCA) is run on the normalized filtered feature-barcode matrix to reduce the number of feature dimensions. Gene expression features are used as PCA features. PCA produces five output files.
analysis/pca
└── gene_expression_10_components
    ├── components.csv
    ├── dispersion.csv
    ├── features_selected.csv
    ├── projection.csv
    └── variance.csv
The projection.csv file contains the projection of each spot onto the first N principal components. By default N=10.
$ cd /home/jdoe/runs/sample345/outs
$ column -s, -t < analysis/pca/gene_expression_10_components/projection.csv | less -S
Barcode  PC-1  PC-2  PC-3  PC-4  PC-5  PC-6  PC-7  PC-8  PC-9  PC-10
AACACTTGGCAAGGAA-1  4.346822844040275  -9.073988527954281  -3.9348855477667715  4.4143616349096835  0.570902992727801  5.916871998370152  2.480636375841689  -0.06798408872536493  -0.19559617312320177  1.8447556106163412
AACAGGATTCATAGTT-1  -1.615594647200382  -1.4893042055593746  7.5739700779328665  -3.594441916107372  -0.34089358717427726  2.111157673540723  0.7226241085802059  -3.9462479306752436  0.7109160992468775  0.2148672225802757
AACAGGTTATTGCACC-1  11.032392266516446  -8.48766121740853  -3.061209741692746  -1.0179508777455186  -0.3086495689242125  -1.7476955635612388  -4.667269353092443  3.0867661655728873  3.177976646698517  3.4325955564744035
AACAGGTTCACCGAAG-1  0.02261690362615809  -1.1836459670547157  -0.4219683969014265  -0.9704969551004782  0.042818261398003474  0.7016418174052369  0.5984518607384657  0.4370020158231471  5.6108084569945715  -0.5928326084763261
AACAGTCAGGCTCCGC-1  23.551530490594487  1.485566122772231  -4.061849114221165  -3.572810445316029  0.7253401628543874  8.335238428414028  -0.27411229186554853  -1.419600005890016  8.151194312679634  -0.4650714219420635
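As a rough illustration of how a projection file might be consumed downstream, the sketch below parses a projection.csv-style table with Python's standard csv module and builds a barcode-to-coordinates mapping. The inline excerpt (three spots, first two PCs only) stands in for the real file, and the function name load_projection is hypothetical, not part of Space Ranger.

```python
import csv
import io

# Excerpt of projection.csv copied from the listing above
# (three spots, first two PCs only); a real file has N PC columns.
SAMPLE = """Barcode,PC-1,PC-2
AACACTTGGCAAGGAA-1,4.346822844040275,-9.073988527954281
AACAGGATTCATAGTT-1,-1.615594647200382,-1.4893042055593746
AACAGGTTATTGCACC-1,11.032392266516446,-8.48766121740853
"""


def load_projection(handle):
    """Return (axis_names, {barcode: [coordinates]}) from a projection CSV.

    Hypothetical helper: assumes the first column is the barcode and
    every remaining column is a numeric coordinate.
    """
    reader = csv.reader(handle)
    header = next(reader)  # e.g. ["Barcode", "PC-1", "PC-2"]
    axes = header[1:]
    coords = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return axes, coords


axes, coords = load_projection(io.StringIO(SAMPLE))
print(axes)
print(coords["AACAGGATTCATAGTT-1"])
```

To read a real run, the StringIO object could be replaced with an open file handle, for example open("analysis/pca/gene_expression_10_components/projection.csv", newline="").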
The components.csv file is a components matrix which indicates how much each feature contributed (the loadings) to each principal component. Features that were not included in the PCA analysis have all of their loading values set to zero.
$ head -2 analysis/pca/gene_expression_10_components/components.csv
PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104
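One way to read the components matrix is to rank features by the absolute size of their loading on a given component. The sketch below does this for the columns visible in the listing above; the elided columns are simply left out, and top_loadings is a hypothetical helper name.

```python
import csv
import io

# Only the columns visible in the listing above; the columns hidden
# behind "..." are omitted, so this is not a complete components.csv.
SAMPLE = """PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,-0.0104
"""


def top_loadings(handle, pc, n=3):
    """Return the n (feature, loading) pairs with the largest |loading| on `pc`."""
    reader = csv.reader(handle)
    features = next(reader)[1:]  # header row: "PC" followed by feature IDs
    for row in reader:
        if row and int(row[0]) == pc:
            loadings = [float(v) for v in row[1:]]
            ranked = sorted(
                zip(features, loadings),
                key=lambda fl: abs(fl[1]),
                reverse=True,
            )
            return ranked[:n]
    raise KeyError(f"component {pc} not found")


top = top_loadings(io.StringIO(SAMPLE), pc=1)
for feature, loading in top:
    print(feature, loading)
```

Because features excluded from PCA have all loadings set to zero, they sort to the bottom of such a ranking.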
The features_selected.csv file contains the Ensembl IDs of the features with the highest dispersion that were selected for use in the principal component calculations.
$ column -s, -t < analysis/pca/gene_expression_10_components/features_selected.csv | less -S
Feature
1  ENSMUSG00000114038
2  ENSMUSG00000058063
3  ENSMUSG00000087216
4  ENSMUSG00000085244
5  ENSMUSG00000021604
The variance.csv file records the proportion of total variance explained by each principal component.
When choosing how many principal components are significant, it is useful to look at the plot of variance explained as a function of PC rank: once the values start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.
$ column -s, -t < analysis/pca/gene_expression_10_components/variance.csv | less -S
PC  Proportion.Variance.Explained
1   0.006020454455283148
2   0.0014744138318528535
3   0.0012400447266735174
4   0.0009462466452900335
5   0.0009012382233475119
6   0.0008795663315577918
7   0.0008772635528060896
8   0.0008770449415125795
9   0.0008671600964701859
10  0.0008598483035027898
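The flattening of the variance curve can also be checked numerically. The sketch below is one possible heuristic, not something Space Ranger itself computes: it reports the first PC whose successor explains less than a relative tolerance (here an arbitrary 5%) less variance. The names read_variance and flattening_rank and the tol parameter are assumptions made for illustration.

```python
import csv
import io
from itertools import accumulate

# Contents of variance.csv as shown in the listing above.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.006020454455283148
2,0.0014744138318528535
3,0.0012400447266735174
4,0.0009462466452900335
5,0.0009012382233475119
6,0.0008795663315577918
7,0.0008772635528060896
8,0.0008770449415125795
9,0.0008671600964701859
10,0.0008598483035027898
"""


def read_variance(handle):
    """Return the per-PC proportions of variance, ordered by PC rank."""
    rows = sorted(csv.DictReader(handle), key=lambda r: int(r["PC"]))
    return [float(r["Proportion.Variance.Explained"]) for r in rows]


def flattening_rank(variances, tol=0.05):
    """Return the 1-based rank of the first PC after which the relative
    drop in explained variance falls below `tol`, or None if it never does.

    The 5% default is an arbitrary illustrative threshold.
    """
    for rank, (cur, nxt) in enumerate(zip(variances, variances[1:]), start=1):
        if (cur - nxt) / cur < tol:
            return rank
    return None


variances = read_variance(io.StringIO(SAMPLE))
cumulative = list(accumulate(variances))  # running total of variance explained
rank = flattening_rank(variances)
print(f"variance explained by all {len(variances)} PCs: {cumulative[-1]:.4f}")
print(f"curve flattens after PC {rank}")
```

With the ten values above and the 5% tolerance, the heuristic reports PC 4; a different tolerance would move that point, so the plot remains the better guide.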
The dispersion.csv file lists the normalized dispersion of each feature, after binning features by their mean expression across the dataset. This provides a useful measure of variability of each feature.
$ column -s, -t < analysis/pca/gene_expression_10_components/dispersion.csv | less -S
Feature          Normalized.Dispersion
ENSG00000187634  0.6831683505253648
ENSG00000188976  -0.14721475503619233
ENSG00000187961  2.2333235330589933
ENSG00000187583  -0.1377803092462445
ENSG00000187642  -0.4131854711145404
ENSG00000188290  -0.6689923111662834
ENSG00000187608  -1.0069025521553716
ENSG00000188157  0.1691687357833229
ENSG00000237330  2.0109141055507394
ENSG00000131591  -1.4170406794742954
ENSG00000162571  2.501396789174146
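Since the features used for PCA are those with the highest dispersion, sorting this file is a quick way to see which features are most variable. The following sketch ranks the rows shown above; most_dispersed and the cutoff n=3 are hypothetical, and the number of features Space Ranger actually selects is not assumed here.

```python
import csv
import io

# Rows of dispersion.csv as shown in the listing above.
SAMPLE = """Feature,Normalized.Dispersion
ENSG00000187634,0.6831683505253648
ENSG00000188976,-0.14721475503619233
ENSG00000187961,2.2333235330589933
ENSG00000187583,-0.1377803092462445
ENSG00000187642,-0.4131854711145404
ENSG00000188290,-0.6689923111662834
ENSG00000187608,-1.0069025521553716
ENSG00000188157,0.1691687357833229
ENSG00000237330,2.0109141055507394
ENSG00000131591,-1.4170406794742954
ENSG00000162571,2.501396789174146
"""


def most_dispersed(handle, n=3):
    """Return the n (feature, normalized dispersion) pairs with the highest dispersion."""
    rows = [
        (r["Feature"], float(r["Normalized.Dispersion"]))
        for r in csv.DictReader(handle)
    ]
    return sorted(rows, key=lambda fd: fd[1], reverse=True)[:n]


top_features = most_dispersed(io.StringIO(SAMPLE))
for feature, dispersion in top_features:
    print(feature, round(dispersion, 3))
```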
For gene expression, after running PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize spots in a 2-D space.
$ column -s, -t < analysis/tsne/gene_expression_2_components/projection.csv | less -S
Barcode             TSNE-1               TSNE-2
AACACTTGGCAAGGAA-1  1.2672117740192608   25.047625819665186
AACAGGATTCATAGTT-1  0.04778171834588573  3.116509598383599
AACAGGTTATTGCACC-1  18.80364109918134    18.684080610445474
AACAGGTTCACCGAAG-1  1.99715394789933     -9.208697881938745
AACAGTCAGGCTCCGC-1  38.15012452500775    2.0611329330125514
AACAGTCCACGCGGTG-1  -1.9209290038167077  -32.80566322209981
AACATAGTCTATCTAC-1  24.641739427754395   4.132453609694308
AACATCTTAAGGCTCA-1  22.693280619738776   -4.616978161185022
AACCAATCTGGTTGGC-1  5.883220436323025    -20.80497990643471
AACCACTGCCATAGCC-1  -8.471808255953594   -12.06184466119581
AACCAGAATCAGACGT-1  11.670881660483042   4.385137546311761
For gene expression, after running PCA, Uniform Manifold Approximation and Projection (UMAP) is run to visualize spots in a 2-D space.
$ column -s, -t < analysis/umap/gene_expression_2_components/projection.csv | less -S
Barcode             UMAP-1              UMAP-2
AACACTTGGCAAGGAA-1  10.310660096919259  7.813228392659608
AACAGGATTCATAGTT-1  9.20511225151223    5.568023946107357
AACAGGTTATTGCACC-1  12.291062284438889  6.940462987013961
AACAGGTTCACCGAAG-1  9.032031636927861   7.064727855092599
AACAGTCAGGCTCCGC-1  13.326524472133555  4.742776277209383
AACAGTCCACGCGGTG-1  9.27174981223149    0.7703902647873845
AACATAGTCTATCTAC-1  11.73081758571688   4.510761419587083
AACATCTTAAGGCTCA-1  11.816231548622202  3.0618744238318683
AACCAATCTGGTTGGC-1  8.917922202723922   1.723589437141921
AACCACTGCCATAGCC-1  8.090373623491763   3.0793685741491017
AACCAGAATCAGACGT-1  12.501168063179637  5.342741923918538
AACCGCCAGACTACTT-1  8.042049094337923   2.931341403074622
Clustering is then run to group spots that have similar expression profiles together, based on their projection into PCA space for gene expression features.
Graph-based clustering (under graphclust) is run once, as it does not require a pre-specified number of clusters. K-means (under kmeans) is run for each K=2,...,N, where K is the number of clusters and N=10 by default. The results for each K are written to their own directory.
clustering
├── gene_expression_graphclust
├── gene_expression_kmeans_10_clusters
├── gene_expression_kmeans_2_clusters
├── gene_expression_kmeans_3_clusters
├── gene_expression_kmeans_4_clusters
├── gene_expression_kmeans_5_clusters
├── gene_expression_kmeans_6_clusters
├── gene_expression_kmeans_7_clusters
├── gene_expression_kmeans_8_clusters
└── gene_expression_kmeans_9_clusters
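When scripting over these results, the directory names can be generated from the naming pattern in the listing above rather than typed by hand. The sketch below is a minimal example; clustering_dirs is a hypothetical helper, and the prefix and max_k defaults simply mirror the values described in this section.

```python
from pathlib import PurePosixPath


def clustering_dirs(max_k=10, prefix="gene_expression_"):
    """Return the expected clustering subdirectories under analysis/clustering.

    Assumes the naming pattern shown in the listing: one graphclust
    directory plus one kmeans directory for each K from 2 to max_k.
    """
    base = PurePosixPath("analysis/clustering")
    names = [f"{prefix}graphclust"]
    names += [f"{prefix}kmeans_{k}_clusters" for k in range(2, max_k + 1)]
    return [base / name for name in names]


dirs = clustering_dirs()
for d in dirs:
    print(d / "clusters.csv")
```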
For each clustering, spaceranger produces cluster assignments for each spot.
$ column -s, -t < analysis/clustering/gene_expression_kmeans_6_clusters/clusters.csv | less -S
Barcode             Cluster
AACACTTGGCAAGGAA-1  2
AACAGGATTCATAGTT-1  5
AACAGGTTATTGCACC-1  3
AACAGGTTCACCGAAG-1  2
AACAGTCAGGCTCCGC-1  6
AACAGTCCACGCGGTG-1  4
AACATAGTCTATCTAC-1  6
AACATCTTAAGGCTCA-1  4
AACCAATCTGGTTGGC-1  4
AACCACTGCCATAGCC-1  2
AACCAGAATCAGACGT-1  3
AACCGCCAGACTACTT-1  2
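Cluster assignments become more useful when joined to a 2-D projection by barcode. The sketch below joins a clusters.csv excerpt with a UMAP projection excerpt and reports the spot count and centroid of each cluster. It assumes both listings in this section come from the same run, which their matching barcodes suggest but the text does not state; cluster_centroids is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# First six rows of clusters.csv (K=6) and of the UMAP projection.csv,
# copied from the listings in this section.
CLUSTERS = """Barcode,Cluster
AACACTTGGCAAGGAA-1,2
AACAGGATTCATAGTT-1,5
AACAGGTTATTGCACC-1,3
AACAGGTTCACCGAAG-1,2
AACAGTCAGGCTCCGC-1,6
AACAGTCCACGCGGTG-1,4
"""

UMAP = """Barcode,UMAP-1,UMAP-2
AACACTTGGCAAGGAA-1,10.310660096919259,7.813228392659608
AACAGGATTCATAGTT-1,9.20511225151223,5.568023946107357
AACAGGTTATTGCACC-1,12.291062284438889,6.940462987013961
AACAGGTTCACCGAAG-1,9.032031636927861,7.064727855092599
AACAGTCAGGCTCCGC-1,13.326524472133555,4.742776277209383
AACAGTCCACGCGGTG-1,9.27174981223149,0.7703902647873845
"""


def cluster_centroids(clusters_handle, projection_handle):
    """Return {cluster: (n_spots, mean_x, mean_y)} for spots present in both files."""
    assignment = {
        r["Barcode"]: int(r["Cluster"]) for r in csv.DictReader(clusters_handle)
    }
    members = defaultdict(list)
    reader = csv.reader(projection_handle)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 3 or row[0] not in assignment:
            continue  # ignore blank lines and barcodes without a cluster
        members[assignment[row[0]]].append((float(row[1]), float(row[2])))
    return {
        cluster: (
            len(points),
            sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points),
        )
        for cluster, points in sorted(members.items())
    }


centroids = cluster_centroids(io.StringIO(CLUSTERS), io.StringIO(UMAP))
for cluster, (n_spots, x, y) in centroids.items():
    print(cluster, n_spots, round(x, 3), round(y, 3))
```

The same join works with the t-SNE projection, since both projection files share the Barcode plus two-coordinate layout.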
spaceranger also produces a table indicating which features are differentially expressed in each cluster relative to all other clusters. For each feature, three values are computed per cluster: the mean counts, the log2 fold change, and the adjusted p-value.
For details on the mean expression normalization and statistical test, see algorithms.
The table is written to the diffexp directory rather than alongside the clustering results, but follows the same structure, with each clustering separated into its own directory.
$ column -s, -t < analysis/diffexp/gene_expression_kmeans_6_clusters/differential_expression.csv | less -S
Feature ID  Feature Name  Cluster 1 Mean Counts  Cluster 1 Log2 fold change  Cluster 1 Adjusted p value  Cluster 2 Mean Counts  Cluster 2 Log2 fold change  Cluster 2 Adjusted p value  Cluster 3 Mean Counts  Cluster 3 Log2 fold change  Cluster 3 Adjusted p value  Cluster 4 Mean Counts  Cluster 4 Log2 fold change  Cluster 4 Adjusted p value  Cluster 5 Mean Counts  Cluster 5 Log2 fold change  Cluster 5 Adjusted p value  Cluster 6 Mean Counts  Cluster 6 Log2 fold change  Cluster 6 Adjusted p value
ENSG00000187634  SAMD11  0.029518663633325348  -0.6095322929009148  0.15858746256099554  0.03872574484664482  0.22461461249795533  0.9833486026864531  0.05434796766793789  0.7717892513698335  0.5539781031134247  0.04715646443259539  0.5514286743311745  0.5353925121590467  0.06172687981624832  1.220250418944373  0.993462403034926  0  4.571576338266795  1
ENSG00000188976  NOC2L  0.1467335512646852  -0.0075967135760453  1  0.14740767393239  0.014213063751402633  1  0.1698373989623059  0.24460101767787146  0.7911469739812328  0.1365911383564832  -0.10554919914683936  0.9108210732531846  0.14402938623791275  0.13278757769403393  1  0  2.494819139282757  1
ENSG00000187961  KLHL17  0.04556764580290029  -0.39403627583048184  0.32049989245048643  0.03872574484664482  -0.39782259364186334  0.8812215004145008  0.08831544746039907  0.9220513967762671  0.33577683887500076  0.0682955691782416  0.5366894741993495  0.45292552912400735  0.12345375963249663  1.5026501496450981  0.9360648785130347  0  4.037427627100276  1
ENSG00000187583  PLEKHN1  0.05617143759351231  -0.12087959886173749  0.796151857954772  0.04247339757373948  -0.47499045416432306  0.7185491283353644  0.06114146362643013  0.15608687925874287  0.9615028067775023  0.08618250396301916  0.6946685676302531  0.1601888957255227  0.04115125321083221  0.05866127667012577  1  0  3.8393486631072515  1
ENSG00000187642  PERM1  0.014902626300319593  -0.5455256804536939  0.3788718406454424  0.022485916362567963  0.53064508050511  0.8934787587812238  0.01698373989623059  0.22672844861260533  1  0.02113910474564621  0.43327732261291274  0.8209540486632679  0.04115125321083221  1.8213325850288067  0.9735819926883299  0  5.579481536352142  1
ENSG00000188290  HES4  0.18055104940771813  -0.030843629087912827  0.9373501986428783  0.1486568915080882  -0.3279264105308233  0.6634182075223123  0.24116910652647439  0.450053144644428  0.4526318219460056  0.2065127925151591  0.21783766191823117  0.6562367163059909  0.14402938623791275  -0.17663973362421137  1  0  2.1879192536277263  1
ENSG00000187608  ISG15  0.17969128250577662  -0.03771907517883433  0.916923410853115  0.16614593756786328  -0.139088325241695  0.9612349344804666  0.18002764290004428  0.010740899242568158  1  0.2113910474564621  0.2617530372733241  0.5640702024367799  0.16460501284332885  -0.0006145326454807254  1  0  2.192458413242894  1
ENSG00000188157  AGRN  0.9044747808424738  0.12651962483140314  0.4707216413731767  0.8494679514747897  -0.05772812832491778  1  0.8389967508737912  -0.07019980279983279  0.9809786650957779  0.8162946601780305  -0.11988533997079429  0.8073796843234022  0.576117544951651  -0.5929361291791735  1  0  -0.08452611213214536  1
ENSG00000237330  RNF223  0.03267114227377757  -0.5519448013479709  0.20914145009837526  0.036227309695248386  -0.029958214587642473  1  0.06793495958492236  0.9765323481080515  0.3717962630568738  0.05040863439346404  0.519007196638797  0.5830747542768697  0.08230250642166442  1.4267012964117995  0.9579008199206683  0  4.450651556007188  1
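Because the table is wide, it is often easier to pull out one cluster's columns by name and filter on them. The sketch below does this for a trimmed excerpt of the listing (four features, clusters 1 and 4 only). The helper name enriched_features and both thresholds are assumptions chosen so the small excerpt returns something; they are not recommended significance cutoffs.

```python
import csv
import io

# Trimmed excerpt of differential_expression.csv from the listing above:
# four features, with only the Cluster 1 and Cluster 4 column groups kept.
SAMPLE = """Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 4 Mean Counts,Cluster 4 Log2 fold change,Cluster 4 Adjusted p value
ENSG00000187634,SAMD11,0.029518663633325348,-0.6095322929009148,0.15858746256099554,0.04715646443259539,0.5514286743311745,0.5353925121590467
ENSG00000187961,KLHL17,0.04556764580290029,-0.39403627583048184,0.32049989245048643,0.0682955691782416,0.5366894741993495,0.45292552912400735
ENSG00000187583,PLEKHN1,0.05617143759351231,-0.12087959886173749,0.796151857954772,0.08618250396301916,0.6946685676302531,0.1601888957255227
ENSG00000188157,AGRN,0.9044747808424738,0.12651962483140314,0.4707216413731767,0.8162946601780305,-0.11988533997079429,0.8073796843234022
"""


def enriched_features(handle, cluster, min_log2fc=0.5, max_adj_p=0.2):
    """Return (name, log2fc, adj_p) for features up-regulated in `cluster`.

    Both thresholds are illustrative assumptions, not Space Ranger defaults.
    Results are sorted by descending log2 fold change.
    """
    fc_col = f"Cluster {cluster} Log2 fold change"
    p_col = f"Cluster {cluster} Adjusted p value"
    hits = []
    for row in csv.DictReader(handle):
        log2fc = float(row[fc_col])
        adj_p = float(row[p_col])
        if log2fc >= min_log2fc and adj_p <= max_adj_p:
            hits.append((row["Feature Name"], log2fc, adj_p))
    return sorted(hits, key=lambda h: h[1], reverse=True)


hits_c4 = enriched_features(io.StringIO(SAMPLE), cluster=4)
hits_c1 = enriched_features(io.StringIO(SAMPLE), cluster=1)
print("cluster 4:", [name for name, _, _ in hits_c4])
print("cluster 1:", [name for name, _, _ in hits_c1])
```

Looking columns up by their header text keeps the code independent of how many clusters the chosen clustering contains.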
Data structures produced by Visium can be analyzed and visualized in R or Python. For suggestions on downstream analysis with third-party R and Python tools, see the 10x Genomics Analysis Guides resource.