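Today we walk through the analysis principles behind CoNGA. What we care about most is how its scores are computed and how they are used.
First, the TCR analysis (starting from the 10x result files)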
Clonotype data from 10x Genomics is first converted into a TCRdist 'clones file', and the matrix of TCRdist distances is computed (TCRdist is a tool we have mentioned many times before).
1. Filtering the clonotype data.
2. Kernel principal components analysis, as implemented in scikit-learn's KernelPCA class, is then used to extract the top 50 components of variation from this distance matrix (this method can now also be run from the Seurat package).
3. These kernel PCs can be directly incorporated into the standard single-cell workflows for clustering and dimensionality reduction in place of the principal components extracted from the gene expression counts matrix (in other words, downstream analysis proceeds just as in a single-cell transcriptome analysis). UMAP is used for dimensionality reduction and Louvain for clustering.
4. To annotate the Louvain clusters in CoNGA visualizations, the most frequent V segment in each cluster is identified and appended to the cluster name if it is present in at least 50% of the clustered TCRs, uppercased if present in at least 75% of the TCRs (clusters are initially named with consecutive integers, starting at 0 with the largest cluster).
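Steps 2 and 4 above can be sketched in a few lines. This is only an illustrative sketch, not CoNGA's actual code: the function names `tcr_kernel_pcs` and `name_cluster` are hypothetical, the conversion from distances to a similarity kernel (`1 - D / D.max()`) is an assumption, and so are the `_` separator and the lowercase form used between 50% and 75%.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import KernelPCA


def tcr_kernel_pcs(dist_matrix, n_components=50):
    """Extract kernel PCs from a square TCR distance matrix (hypothetical helper)."""
    D = np.asarray(dist_matrix, dtype=float)
    # Assumption: rescale distances into a similarity ("gram") matrix in [0, 1].
    gram = 1.0 - D / D.max()
    # Cannot ask for more components than there are clonotypes.
    n_components = min(n_components, D.shape[0] - 1)
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    return kpca.fit_transform(gram)


def name_cluster(cluster_id, v_genes):
    """Append the dominant V segment to a cluster name (hypothetical helper)."""
    gene, count = Counter(v_genes).most_common(1)[0]
    frac = count / len(v_genes)
    if frac >= 0.75:
        return f"{cluster_id}_{gene.upper()}"  # dominant in at least 75% of TCRs
    if frac >= 0.5:
        return f"{cluster_id}_{gene.lower()}"  # present in 50% to 75% of TCRs
    return str(cluster_id)  # no V segment reaches 50%
```

The resulting array has one row per clonotype and can be handed to a neighbor-graph, UMAP, and Louvain workflow wherever gene expression PCs would normally go.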
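Second, TCR sequence features
1. For each clonotype, CoNGA calculates a set of TCR sequence-based scores for use in graph-vs-feature analysis and for annotating graph-vs-graph cluster pairs. First, a set of 28 different amino acid properties is computed over the α and β chain CDR3 loops (excluding the first 4 and last 4 residues of each CDR3, where the full CDR3 sequence is defined as starting at the conserved cysteine and ending with, and including, the phenylalanine that precedes the GXG motif in the J region). These scores include a set compiled from the original sources by the authors of the VDJtools package, together with the five Atchley factors. Seven additional sequence scores are also calculated:
- 'alphadist', which measures the ordinal distance between the alpha-chain V and J genes (TRAV and TRAJ) when the full set of gene segments is ordered by genomic location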
- 'imhc', the iMHC score
- 'cd8', a simple CD8-versus-CD4 preference score calculated from the TCR V and J gene usage, CDR3 length, and CDR3 amino acid composition, based on frequency differences between flow-sorted CD8+ and CD4+ TCR sequence repertoires
- 'cdr3len', total CDR3 length
- 'mait', which assigns a score of 1 to TCRs with an alpha chain using the TRAV1-2 and TRAJ33/TRAJ20/TRAJ12 segments (TRAV1 and TRAJ33 in mouse) and a CDR3 length of 12, and 0 to all other TCRs (this score is used in the case study)
- 'inkt', which assigns a score of 1 to TCRs with the TRAV10/TRAJ18/TRBV25 gene combination and a CDR3 length of 14, 15, or 16 (TRAV11/TRAJ18 and length 15 for mouse)
- 'nndists_tcr', which measures the density of TCR sequences nearby the scored clonotype by calculating the average TCR distance to the nearest 1% of clonotypes
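The rule-based scores in the list above ('cdr3len', 'mait', 'inkt') are simple enough to sketch directly. The function names below are hypothetical, and two points are assumptions rather than facts from the text: that the length checks for 'mait' and 'inkt' apply to the alpha-chain CDR3, and that gene names may carry an allele suffix such as `*01` that should be ignored.

```python
def _gene(name):
    """Strip an allele suffix such as '*01' (assumed naming convention)."""
    return name.split("*")[0]


def cdr3len_score(cdr3a, cdr3b):
    """Total CDR3 length over both chains."""
    return len(cdr3a) + len(cdr3b)


def mait_score(va, ja, cdr3a, organism="human"):
    """1 for MAIT-like alpha chains, 0 otherwise."""
    va, ja = _gene(va), _gene(ja)
    if organism == "human":
        genes_ok = va == "TRAV1-2" and ja in {"TRAJ33", "TRAJ20", "TRAJ12"}
    else:  # mouse
        genes_ok = va == "TRAV1" and ja == "TRAJ33"
    # Assumption: the length of 12 refers to the alpha-chain CDR3.
    return int(genes_ok and len(cdr3a) == 12)


def inkt_score(va, ja, vb, cdr3a, organism="human"):
    """1 for iNKT-like gene combinations, 0 otherwise."""
    va, ja, vb = _gene(va), _gene(ja), _gene(vb)
    if organism == "human":
        genes_ok = va == "TRAV10" and ja == "TRAJ18" and vb.startswith("TRBV25")
        lengths = {14, 15, 16}
    else:  # mouse: TRAV11/TRAJ18 and length 15
        genes_ok = va.startswith("TRAV11") and ja == "TRAJ18"
        lengths = {15}
    # Assumption: the length check refers to the alpha-chain CDR3.
    return int(genes_ok and len(cdr3a) in lengths)
```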
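As for the definition of the iMHC score: the score is a weighted linear combination of TCR sequence features.
Next comes the gene expression analysis. The early steps are the same as in a standard workflow, up through PCA.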
These gene expression PCs are used to select a single representative cell for each clonotype: the cell with the smallest average Euclidean distance in PC space to the other cells in the clonotype. Once the dataset has been reduced to a single cell per clonotype, the UMAP and Louvain clustering tools are applied to the PCA matrix to generate a gene expression landscape and a set of gene expression clonotype clusters.
DEGs in clonotype groupings (for example the set of CoNGA hits in a cluster pair) are identified using the sc.tl.rank_genes_groups routine with the 'wilcoxon' method (this is scanpy's approach, and it takes a little effort to understand).
Of course, multi-sample analyses still need some batch-effect removal. As it was not immediately obvious how to recover the processed gene expression components from the publicly available data, and as a test of CoNGA's robustness to alternative neighbor graphs, we elected to use the provided 3D UMAP coordinates in lieu of gene expression PCs for the CoNGA GEX neighbor calculations described below. We also directly borrowed the GEX clusters from the original paper rather than reclustering the dataset.
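The representative-cell selection described in this section can be sketched with NumPy alone. This is a minimal illustration, not CoNGA's implementation; the function name `pick_representatives` and its inputs (a cells-by-PCs array plus one clonotype id per cell) are hypothetical.

```python
import numpy as np


def pick_representatives(pcs, clone_ids):
    """Return {clonotype id: index of its representative cell}."""
    pcs = np.asarray(pcs, dtype=float)
    # Group cell indices by clonotype id.
    groups = {}
    for i, clone in enumerate(clone_ids):
        groups.setdefault(clone, []).append(i)

    reps = {}
    for clone, idx in groups.items():
        if len(idx) == 1:
            reps[clone] = idx[0]  # a singleton clonotype represents itself
            continue
        sub = pcs[idx]
        # Pairwise Euclidean distances between the cells of this clonotype.
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        # Average distance to the other cells (the zero self-distance drops out).
        avg = d.sum(axis=1) / (len(idx) - 1)
        reps[clone] = idx[int(np.argmin(avg))]
    return reps
```

Subsetting the PCA matrix to the returned indices gives the one-cell-per-clonotype dataset on which UMAP and Louvain clustering are then run.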
Next, key point 1: Graph-vs-graph correlation analysis
In CoNGA graph-vs-graph correlation analysis, similarity graphs defined by gene expression and by TCR sequence are compared to identify vertices (clonotypes) whose neighbor sets in the two graphs overlap significantly.
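The CoNGA score assigned to a clonotype equals the probability of seeing an equal or greater overlap between its GEX and TCR neighborhoods by chance, multiplied by the total number of clonotypes to correct for multiple testing. The hypergeometric distribution is used to estimate this probability, as implemented in the scipy.stats module.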
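As a rough sketch of this score for a single clonotype and a single pair of graphs: the function name `conga_score` is hypothetical, and drawing neighbors from all other clonotypes (population size equal to the total minus one) is an assumption about how the hypergeometric test is set up.

```python
from scipy.stats import hypergeom


def conga_score(overlap, n_gex_nbrs, n_tcr_nbrs, n_clonotypes):
    """Multiple-testing-corrected chance probability of the observed overlap."""
    # Assumption: neighbors are drawn from every clonotype except the one scored.
    population = n_clonotypes - 1
    # P(X >= overlap): sf(k) gives P(X > k), so evaluate it at overlap - 1.
    p_value = hypergeom.sf(overlap - 1, population, n_gex_nbrs, n_tcr_nbrs)
    # Multiply by the number of clonotypes to correct for multiple testing.
    return float(p_value * n_clonotypes)
```

Because the probability is multiplied by the number of clonotypes, the score behaves like an expected number of false positives: it can exceed 1, and smaller values indicate stronger GEX/TCR agreement.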
Two types of similarity graphs can be used in CoNGA: K nearest neighbor (KNN) graphs, in which each clonotype is connected to its K nearest neighbors in gene expression or TCR space; and cluster graphs, in which each clonotype is connected to all the clonotypes in the same (GEX or TCR) cluster.
The neighbor number K for constructing KNN graphs is specified as a fraction of the total number of clones; for the calculations reported here, neighbor fractions of 0.01 and 0.1 were used.
The CoNGA score assigned to a clonotype is the minimum score over all graph comparisons, of which there were 6 combinations in the calculations reported here (GEX_KNN vs TCR_KNN, GEX_KNN vs TCR_cluster, and GEX_cluster vs TCR_KNN, for both the 0.01 and 0.1 KNN neighbor fractions). (This part is a bit hard.) This may reflect correlation between neighborhoods of nearby clonotypes, which reduces the effective multiple-testing burden.
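Both graph types can be sketched from a distance matrix and a vector of cluster labels. The function names are hypothetical, and truncating `nbr_frac * n` to an integer (with a minimum of one neighbor) is an assumption about how K is derived from the neighbor fraction.

```python
import numpy as np


def knn_graph(dist_matrix, nbr_frac):
    """Indices of the K nearest neighbors of each clonotype (self excluded)."""
    D = np.array(dist_matrix, dtype=float)  # copy so the input stays untouched
    n = D.shape[0]
    # Assumption: K is the neighbor fraction times n, truncated, at least 1.
    k = min(n - 1, max(1, int(nbr_frac * n)))
    np.fill_diagonal(D, np.inf)  # a clonotype is never its own neighbor
    return np.argsort(D, axis=1, kind="stable")[:, :k]


def cluster_graph(labels):
    """For each clonotype, the indices of all other clonotypes in its cluster."""
    labels = np.asarray(labels)
    everyone = np.arange(len(labels))
    return [
        np.flatnonzero((labels == lab) & (everyone != i))
        for i, lab in enumerate(labels)
    ]
```

Calling `knn_graph` once with 0.01 and once with 0.1, on both the GEX and the TCR distance matrices, would give the KNN graphs used in the comparisons listed in the text.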
Key point 2: Graph-vs-feature correlation analysis
In CoNGA graph-vs-feature correlation analysis, numerical features defined on the basis of one property (GEX or TCR) are mapped onto similarity graphs defined by the other property, and graph neighborhoods with biased score distributions are identified.
As GEX properties we consider the expression levels of all the individual genes as well as a feature ('nndists_gex') that captures the density of nearby clonotypes by calculating the average distance in GEX space to the nearest 1% of the clonotypes. The corresponding TCR features were introduced above.
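The nearest-neighbor density feature can be sketched as follows. The function name `nndists` is hypothetical; the same calculation would apply to a TCR distance matrix for 'nndists_tcr'. Rounding the 1% down to an integer, with a minimum of one neighbor, is an assumption.

```python
import numpy as np


def nndists(dist_matrix, nbr_frac=0.01):
    """Average distance from each clonotype to its nearest fraction of clonotypes."""
    D = np.array(dist_matrix, dtype=float)  # copy so the input stays untouched
    n = D.shape[0]
    # Assumption: the neighbor count is nbr_frac * n, truncated, at least 1.
    k = min(n - 1, max(1, int(nbr_frac * n)))
    np.fill_diagonal(D, np.inf)  # ignore the zero self-distance
    nearest = np.sort(D, axis=1)[:, :k]
    return nearest.mean(axis=1)
```

Small values mark clonotypes that sit in a dense region of the space, large values mark isolated ones.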
As this analysis involves a large number of differential expression calculations (roughly the number of clonotypes times the number of different similarity graphs times the number of features), we use a two-step procedure that combines a pre-filter with the t-test followed by the more time-intensive Mann-Whitney-Wilcoxon (MWW) calculation for the top 100 hits per clonotype and graph that pass a t-test significance threshold ten times higher than the target threshold. The final significance score assigned to a detected association equals the raw MWW P-value multiplied by the product of the number of clonotypes and the number of features, to correct for multiple testing (that is a rather heavy amount of computation).
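For one clonotype's neighborhood in one graph, the two-step procedure might look like the sketch below. It is an interpretation, not CoNGA's code: the function name and arguments are hypothetical, the default target threshold of 1.0 is an assumption, and so is applying the t-test pre-filter to P-values already multiplied by the number of tests.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind


def graph_vs_feature_hits(features, nbr_mask, n_clonotypes,
                          threshold=1.0, max_hits=100):
    """Return {feature index: corrected MWW score} for one neighborhood."""
    features = np.asarray(features, dtype=float)  # clonotypes x features
    nbr_mask = np.asarray(nbr_mask, dtype=bool)   # True inside the neighborhood
    n_tests = n_clonotypes * features.shape[1]    # multiple-testing factor
    inside, outside = features[nbr_mask], features[~nbr_mask]

    # Step 1: fast t-test pre-filter across all features at once.
    _, t_p = ttest_ind(inside, outside, axis=0)
    t_adj = t_p * n_tests
    # Assumption: keep features passing a threshold 10x looser than the target.
    candidates = np.flatnonzero(t_adj <= 10 * threshold)
    candidates = candidates[np.argsort(t_adj[candidates])][:max_hits]

    # Step 2: slower Mann-Whitney-Wilcoxon test on the surviving candidates.
    hits = {}
    for j in candidates:
        _, mww_p = mannwhitneyu(inside[:, j], outside[:, j],
                                alternative="two-sided")
        score = mww_p * n_tests
        if score <= threshold:
            hits[int(j)] = float(score)
    return hits
```

Running the cheap vectorized t-test first means the rank-based test is only evaluated on a small candidate set, which is what keeps the full calculation tractable.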
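That is all for the methods. They are a bit difficult and may not sink in fully on a first read; in the next post we will share the code.
Life is good, and better with you.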