10X单细胞(10X空间转录组)TCR数据分析之TCRdist3(5)

这里接上一篇,继续分享TCR的分析代码

定义TCR相似度的半径(以便于寻找motif,大家注意下面的代码)

""" 
Example of TCR radii defined for each TCR in an 
antigen enriched repertoire, and logo-motif report.
"""
import os
import numpy as np
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.sample import _default_sampler
from tcrdist.background import get_stratified_gene_usage_frequency
from tcrdist.centers import calc_radii
from tcrdist.public import _neighbors_sparse_variable_radius, _neighbors_variable_radius
from tcrdist.public import TCRpublic
from tcrdist.ecdf import _plot_manuscript_ecdfs
import matplotlib.pyplot as plt
    # ANTIGEN ENRICHED REPERTOIRE
    # Load all TCRs tetramer-sorted for the epitope influenza PA epitope
df = pd.read_csv("dash.csv").query('epitope == "PA"').\
    reset_index(drop = True)
# Load <df> into a TCRrep instance, to infer CDR1, CDR2, and CDR2.5 region of each clone
tr = TCRrep(cell_df = df.copy(), 
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = True)
    # UN-ENRICHED REPERTOIRE
    # For illustration we pull a default sampler for mouse beta chains. 
    # This is used to estimate the gene usage 
    # probabilities P(TRBV = V, TRBJ = J)
ts = _default_sampler(organism = "mouse", chain = "beta")()
ts = get_stratified_gene_usage_frequency(ts = ts, replace = True) 
    # Then we synthesize a background using Olga (Sethna et al. 2019), 
    # using the P(TRBV = V, TRBJ = J) for inverse probability weighting.
df_vj_background = tr.synthesize_vj_matched_background(ts = ts, chain = 'beta')
    # Load <df_vj_background> into a TCRrep instance, to infer CDR1,CDR2,CDR2.5
trb = TCRrep(cell_df = df_vj_background.copy(), 
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)
    # Take advantage of multiple CPUs
tr.cpus = 4
    # Compute radii for each TCR that controls neighbor-discovery in the background at 
    # estimate of 1/10^5 inverse probability weighted TCRs.
    # Note we are set <use_sparse> to True, which allows us to take advantage of 
    # multiple cpus and only store distance less than or equal to <max_radius>
radii, thresholds, ecdfs = \
    calc_radii(tr = tr, 
        tr_bkgd = trb, 
        chain = 'beta', 
        ctrl_bkgd = 10**-5, 
        use_sparse = True, 
        max_radius=50)
    #  Optional, set a maximum radius
tr.clone_df['radius'] = radii
tr.clone_df['radius'][tr.clone_df['radius'] > 26] = 26
    # Tabulate index of neighboring clones in the ANTIGEN ENRICHED REPERTOIRE,
    # at each TCR-specific radius   
tr.clone_df['neighbors'] = _neighbors_variable_radius(
    pwmat = tr.pw_beta, 
    radius_list = tr.clone_df['radius'])
    # Tabulate neighboring sequences in background
tr.clone_df['background_neighbors'] = _neighbors_sparse_variable_radius(
    csrmat = tr.rw_beta, 
    radius_list = tr.clone_df['radius'])
    # Tabulate number of unique subjects
tr.clone_df['nsubject']             = tr.clone_df['neighbors'].\
        apply(lambda x: tr.clone_df['subject'].iloc[x].nunique())
    # Score Quasi(Publicity) : True (Quasi-Public), False (private)
tr.clone_df['qpublic']              = tr.clone_df['nsubject'].\
        apply(lambda x: x > 1)
    # OPTIONAL: HTML Report 
    # Note: you can call TCRpublic() with fixed radius or directly 
    # after tr.clone_df['radius'] is defined. 
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones.html")
tp.fixed_radius = False
    # Generates the HTML report
rp = tp.report()
    # OPTIONAL: ECDF Figure, against reference
f1 = _plot_manuscript_ecdfs(
    thresholds = thresholds, 
    ecdf_mat = ecdfs, 
    ylab= 'Proportion of Background TCRs', 
    cdr3_len=tr.clone_df.cdr3_b_aa.str.len(), 
    min_freq=1E-10)
f1.savefig(os.path.join("", "PA1.png"))
from tcrdist.ecdf import distance_ecdf
tresholds, antigen_enriched_ecdf = distance_ecdf(   pwrect = tr.pw_beta,
    thresholds= thresholds,
    weights=None,
    pseudo_count=0,
    skip_diag=False,
    absolute_weight=True)
# It is straightforward to make a ECDF between antigen enriched TCRs as well:
antigen_enriched_ecdf[antigen_enriched_ecdf == antigen_enriched_ecdf.min()] = 1E-10
f2 = _plot_manuscript_ecdfs(
    thresholds = thresholds, 
    ecdf_mat = antigen_enriched_ecdf, 
    ylab= 'Proportion of Antigen Enriched PA TCRs', 
    cdr3_len=tr.clone_df.cdr3_b_aa.str.len(), 
    min_freq=1E-10)
f2.savefig(os.path.join("", "PA2.png"))
图片.png

这里对半径的定义个性化程度很高,大家可以自行调节。

(Quasi)Public Clones

Public TCRs are shared clonotypes found in multiple individuals, arising from VDJ recombination biases and common selection pressures. Repertoire analyses often focuses on public clones; however finding public antigen-specific TCRs is not always possible because TCR repertoires are characterized by extreme diversity. As a consequence, only a small fraction of the repertoire can be assayed in a single sample, making it difficult to reproducibly sample TCR clonotypes from an individual, let alone reliably detect shared clonotypes in a population.
Finding similar receptors from multiple individuals provides stronger evidence of shared epitope recognition and reveals mechanistic basis for CDR-peptide-MHC binding.(这个作用当然很好,但是识别起来有困难).
Moreover, meta-clonotypes are by definition more abundant than exact clonotype and thus can be more reliably be detected in a single bulk unenriched sample, facilitating more robust function comparisons across populations.(主要是不能精准查找TCR序列,所以定义半径就显得尤为重要)。

默认参数

For instance, you may want find all the (quasi)public collections of TCRs within a fixed radius <= 18(默认的半径,这个需要根据前面的分析结果进行调整) TCRdist units of each TCR in the antigen enriched input data.
"""
tcrdist3 is particularly useful for finding 
what we term quasi-public meta-clonotypes, 
collections of biochemically similar TCRs 
recognizing the same peptide-MHC. 

The easist way to define meta-clonotypes
is to compute pairwise distances between 
TCRs found in an antigen-enriched 
subrepertoire, abbreviated below as 
<aesr>
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

    # <aesr_fn> antigen-enriched subrepertoire
fn = os.path.join('tcrdist', 'data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv')
    # <aesr_df> antigen-enriched subrepertoire
df = pd.read_csv(fn)
    # <tr> TCR repertoire
tr = TCRrep(
    cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa']].copy(), 
    organism = 'human', 
    chains = ['beta'], 
    db_file = 'alphabeta_gammadelta_db.tsv', 
    compute_distances = True)

    # <tp> TCRpublic class for reporting publicities, fixed radius 18, 'nsubject > 3'
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones.html")

    # by calling, .report() an html report is made
public = tp.report()
    
    # Also, the following datafarme are available
    # <clone_df> pd.DataFrame clone_df from tr.clone_df 
    # with neighbors and summary measures appended
public['clone_df']
    # <nn_summary> pd.DataFrame with just summary measures
public['nn_summary']
    # <quasi_public_df> Non-redundant groups of quasipublic clones
public['quasi_public_df']
In addition to the summary DataFrames returned, a HTML quasi-publicity report is generated, allowing for the inspection of logo-motifs formed from highly similar antigen-enriched TCR sequences found in multiple subjects.(感兴趣的大家可以尝试一下)。
图片.png

调整一个参数

例如,汇总同类群组信息并将其添加到报告中。
    

import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

# <aesr_fn> antigen-enriched subrepertoire
aesr_fn = os.path.join(
    'tcrdist',
    'data',
    'covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv')

# <aesr_df> antigen-enriched subrepertoire
aesr_df = pd.read_csv(aesr_fn)
# <tr> TCR repertoire
tr = TCRrep(
    cell_df = aesr_df[[
        'cohort',
        'subject',
        'v_b_gene', 
        'j_b_gene',
        'cdr3_b_aa']].copy(), 
    organism = 'human', 
    chains = ['beta'], 
    db_file = 'alphabeta_gammadelta_db.tsv', 
    compute_distances = True)
# <tp> TCRpublic class for reporting publicities 
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones2.html")
# set to True, if we want a universal radius
tp.fixed_radius = True
# must then specify maximum distance for finding similar TCRs
tp.radius = 18
# set criteria for being quasi-public
tp.query_str = 'nsubject > 6'
# Add additional columns to be summarized in the report
tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
# Add cohort.summary to the labels column so it shows up in the report
tp.labels.append("cohort.summary")
# by calling, .report() an html report is made
public = tp.report()
报告有一个新的列来总结来自研究中每个队列的 TCR 的百分比,并且元克隆型的数量较少,因为只有那些来自超过 8 个受试者的 TCR 被报告。
图片.png

特定于序列的搜索半径

应用于每个质心的半径可以在 clone_df 的列中指定。
"""
Instead of enforcing a fixed radius, 
use a radius specific to each
centroid, specified in an additional 
column.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic

fn = os.path.join(
    'tcrdist',
    'data',
    'covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')

df = pd.read_csv(fn)

tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones3.html")

# set to True, if we want a universal radius
tp.fixed_radius = False
# must then specify maximum distance for finding similar TCRs
tp.radius = None
# set criteria for being quasi-public
tp.query_str = 'nsubject > 5'
# Add additional columns to be summarized in the report
tp.kargs_member_summ['addl_cols'] = ['subject', 'cohort']
# Add cohort.summary to the labels column so it shows up in the report
tp.labels.append("cohort.summary")
tp.labels.append("cdr3s")
# Change number of subjects to display
tp.kargs_member_summ['addl_n'] = 10
# by calling, .report() an html report is made
public = tp.report()
图片.png
Working from neighbor_diff output
"""
Use values from neighborhood_diff
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')
df = pd.read_csv(fn)
tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

from tcrdist.rep_diff import neighborhood_diff
ndif = neighborhood_diff(   clone_df= tr.clone_df, 
                                pwmat = tr.pw_beta, 
                                count_col = 'count', 
                                x_cols = ['cohort'], 
                                knn_radius = 25, 
                                test_method = "chi2")
# Add neighbors and other columns of interest 
# from neighbor_diff result to the clone_df
tr.clone_df = pd.concat([tr.clone_df, ndif[['neighbors', 'K_neighbors','val_0','ct_0','pvalue']] ], axis = 1)
# Because neighors and K_neighbor are already added to the clone_df 
# TCRpublic.report() uses those instead of finding new ones.
tp = TCRpublic(
    tcrrep = tr, 
    output_html_name = "quasi_public_clones_with_ndif.html")
# Add any columns neighbor_diff columns 
#that you want to display in the final report
tp.labels.append('val_0')
tp.labels.append('ct_0')
tp.labels.append('pvalue')
# chagne sort to be pvalue not publicity
tp.sort_columns = ['pvalue']
# because you are sorting by pvalue, change to True
tp.sort_ascending = True
tp.report()
其他的一些选项
just want to quickly find neighbors and (quasi)public clones
"""
Instead of enforcing a fixed radius, 
use a radius specific to each
centroid, specified in an additional 
column.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.radius.csv')
df = pd.read_csv(fn)
tr = TCRrep(cell_df = df[['cohort','subject','v_b_gene', 'j_b_gene','cdr3_b_aa', 'radius']], 
            organism = "human", 
            chains = ["beta"])

# NEIGHBORS
from tcrdist.public import _neighbors_fixed_radius
from tcrdist.public import _neighbors_variable_radius
# returns lists of lists of all neighbors at fixed of variable radii
_neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18)
_neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius)

# returns the number (K) neighbors at fixed or vriable radii
from tcrdist.public import _K_neighbors_fixed_radius
from tcrdist.public import _K_neighbors_variable_radius
_K_neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18)
_K_neighbors_variable_radius(pwmat = tr.pw_beta, radius_list = tr.clone_df.radius)

# First find neighbors by your favorite method 
tr.clone_df['neighbors'] = _neighbors_variable_radius(
    pwmat = tr.pw_beta, 
    radius_list = tr.clone_df.radius)
# Once neighbors are added to a clone_df you can easily determine publicity. 
tr.clone_df['nsubject']   = tr.clone_df['neighbors'].\
    apply(lambda x: tr.clone_df['subject'].iloc[x].nunique())
tr.clone_df['qpublic']   = tr.clone_df['nsubject'].\
    apply(lambda x: x > 1)
I have neighbors and radii already, I want logos
"""
Report of meta-clonotypes using two dataframes.
<df>  has all TCRS
<df2> has a subset of TCRS in <df>, specifiyint which 
are to be used as centroids.
"""
import os
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from tcrdist.public import TCRpublic  
from tcrdist.tree import _default_sampler_olga
from progress.bar import IncrementalBar
from palmotif import compute_pal_motif, svg_logo
from tcrdist.public import make_motif_logo

output_html_name = "custom_report.html"
# <fn> filename for all TCRs in an antigen-enriched repertoire
fn = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv')
df = pd.read_csv(fn, sep = ",")
df = df[['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'pgen', 'max_radi']].\
    rename(columns= {'max_radi':'radius'}).copy()

# <fn>2 filename for priority TCRs
fn2 = os.path.join('tcrdist','data','covid19',
    'mira_epitope_55_524_ALRKVPTDNYITTY_KVPTDNYITTY.tcrdist3.csv.bE5ctrl.centers.csv.ranked_centers.tsv')
df2 = pd.read_csv(fn2, sep = "\t").\
    rename(columns= {'max_radi':'radius'}).copy()

# Compute distances between all TCRs
tr = TCRrep(cell_df = df, 
    organism = 'human',
    chains = ['beta'], 
    compute_distances = True)

# Initialize a tcrsampler, this will be used to make background motifs
tcrsampler = _default_sampler_olga(organism = "human", chain = "beta")()

# Iterate through each row of the df2, making a logo for each.
svgs = list()
svgs_raw = list()
bar = IncrementalBar("Making Logos", max = df2.shape[0])
for i,r in df2.iterrows():
    bar.next()
    svg,svg_raw=make_motif_logo(tcrsampler = tcrsampler,
                        clone_df = tr.clone_df,
                        pwmat = tr.pw_beta,
                        centroid = r['cdr3_b_aa'],
                        v_gene = r['v_b_gene'],
                        radius = r['radius'],
                        pwmat_str = 'pw_beta',
                        cdr3_name = 'cdr3_b_aa',
                        v_name = 'v_b_gene',
                        gene_names = ['v_b_gene','j_b_gene'])
    svgs.append(svg)
    svgs_raw .append(svg_raw)
bar.next(); bar.finish()
df2['svg'] = svgs
df2['svg_raw'] = svgs_raw

def shrink(s):
    """reduce size of svg graphic"""
    s = s.replace('height="100%"', 'height="20%"')
    s = s.replace('width="100%"', 'width="20%"')
    return s

# Choose columns to include in the report
labels = [  'cdr3_b_aa', 
            'v_b_gene',
            'j_b_gene',
            'radius',
            'regex', 
            'target_hits',
            'nsubject',
            'chi2joint']

with open(output_html_name, 'w') as output_handle:
    for i,r in df2.iterrows():
        #import pdb; pdb.set_trace()
        svg, svg_raw = r['svg'],r['svg_raw']
        output_handle.write("<br></br>")
        output_handle.write(shrink(svg))
        output_handle.write(shrink(svg_raw))
        output_handle.write("<br></br>")
        output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
        output_handle.write("<br></br>")
会生成一个网页报告
Will this work with sparse matrix options?
tcrdist3 has a memory efficient options for larger datasets that produce scipy.sparse rather than dense representations of distance relationships.
Currently you can’t call TCRpublic() on this sparse representation. However, here is an example of how you can achieve similar results via a script, reporting (quasi)Public meta-clonotypes from a sparse format.(平民化了)。
"""
Making a meta-clonotype report from a 
scipy.sparse TCRdist matrix.
"""
import numpy as np
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.public import _neighbors_sparse_fixed_radius, _neighbors_sparse_variable_radius
from tcrdist.summarize import test_for_subsets
from tcrdist.tree import _default_sampler_olga
from tcrdist.public import make_motif_logo_from_index

df = pd.read_csv("dash.csv").query('epitope == "PA"')
tr = TCRrep(cell_df = df,               #(2)
            organism = 'mouse', 
            chains = ['beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = False)
    # When setting the radius to 50, the sparse matrix 
    # will convert any value > 50 to 0. True zeros are 
    # repressented as -1.
radius = 50
tr.cpus = 1
    # Notice that we called .compute_sparse_rect_distances instead of .compute_distances
tr.compute_sparse_rect_distances(df = tr.clone_df, radius = radius)

    # There are two functions for finding neighbors from a sparse TCRdist matrix. 
    # For 1 fixed radius: _neighbors_sparse_fixed_radius()
    # For a radius per row: _neighbors_sparse_variable_radius()
tr.clone_df['radius'] = 12 
tr.clone_df['neighbors'] = \
    _neighbors_sparse_variable_radius(
        csrmat = tr.rw_beta, 
        #radius = 12)
        radius_list = tr.clone_df['radius'].to_list())

    # <K_neighbors>the number of neighbors per TCR
tr.clone_df['K_neighbors'] = tr.clone_df['neighbors'].apply(lambda x: len(x))
    # <nsubject> the number of subject (nsubject) neighboring the TCR (
tr.clone_df['nsubject'] = tr.clone_df['neighbors'].apply(lambda x: len(tr.clone_df['subject'][x].unique()))
    # nsubject > 1 implies quasi-publicity)
tr.clone_df['qpublic'] = tr.clone_df['nsubject'].apply(lambda x: x >1 )

    # e.g., For the report, focus on TCRs with more than 5 neighboring subjects 
quasi_public_df = tr.clone_df.query('nsubject > 5').copy().\
    sort_values('nsubject', ascending = False)
    # test_for_subsets()> allows us to remove TCRs with identical neighborhoods
quasi_public_df['unique']  = test_for_subsets(quasi_public_df['neighbors'])
quasi_public_df = quasi_public_df[quasi_public_df['unique'] == 1].copy()
    # declare a sampler for generating a backgrond comparison
ts = _default_sampler_olga(organism = 'mouse', chain = 'beta')()

    # make a background-subtracted logo <svg> and raw log <svg_raw> for each TCR
svgs = list()
svgs_raw = list()
for i,r in quasi_public_df.iterrows():
    svg, svg_raw  = make_motif_logo_from_index(tcrsampler = ts,
                                               ind = r['neighbors'],
                                               centroid = r['cdr3_b_aa'],
                                               clone_df = tr.clone_df,
                                               cdr3_name = 'cdr3_b_aa',
                                               v_name = 'v_b_gene',
                                               gene_names = ['v_b_gene','j_b_gene'])
    svgs.append(svg)
    svgs_raw.append(svg_raw)

    # Output a html report
output_html_name = 'quasi_public_df_report.html'
quasi_public_df['svg'] = svgs
quasi_public_df['svg_raw'] = svgs_raw
    # Specific columns to include in the report
labels = [  'cdr3_b_aa', 
            'v_b_gene',
            'j_b_gene',
            'radius', 
            'K_neighbors',
            'nsubject']

def shrink(s):
    """reduce size of svg graphic"""
    s = s.replace('height="100%"', 'height="20%"')
    s = s.replace('width="100%"', 'width="20%"')
    return s

with open(output_html_name, 'w') as output_handle:
    for i,r in quasi_public_df.iterrows():
        #import pdb; pdb.set_trace()
        svg, svg_raw = r['svg'],r['svg_raw']
        output_handle.write("<br></br>")
        output_handle.write(shrink(svg))
        output_handle.write(shrink(svg_raw))
        output_handle.write("<br></br>")
        output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
        output_handle.write("<br></br>")

Probability of Generation

人的样本

生成概率是另一种过滤 TCR 和 TCR 邻域的方法,这些邻域基于它们在免疫发育的早期阶段通过重组产生的概率而比预期的更频繁。
"""
How to add pgen estimates to human alpha/beta CDR3s
"""
import pandas as pd
from tcrdist.pgen import OlgaModel
from tcrdist import mappers 
from tcrdist.repertoire import TCRrep
from tcrdist.setup_tests import download_and_extract_zip_file

df = pd.read_csv("dash_human.csv")

tr = TCRrep(cell_df = df.sample(5, random_state = 3), 
            organism = 'human', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            store_all_cdr = False)

olga_beta  = OlgaModel(chain_folder = "human_T_beta", recomb_type="VDJ")
olga_alpha = OlgaModel(chain_folder = "human_T_alpha", recomb_type="VJ")

tr.clone_df['pgen_cdr3_b_aa'] = olga_beta.compute_aa_cdr3_pgens(
    CDR3_seq = tr.clone_df.cdr3_b_aa)

tr.clone_df['pgen_cdr3_a_aa'] = olga_alpha.compute_aa_cdr3_pgens(
    CDR3_seq = tr.clone_df.cdr3_a_aa)

tr.clone_df[['cdr3_b_aa', 'pgen_cdr3_b_aa', 'cdr3_a_aa','pgen_cdr3_a_aa']]
"""
            cdr3_b_aa  pgen_cdr3_b_aa          cdr3_a_aa  pgen_cdr3_a_aa
0    CASSETSGRSPYEQYF    1.922623e-09    CAVRPGYSSASKIIF    1.028741e-06
1  CASSPGLASPYSYNEQFF    1.554569e-11    CAVRVLMEYGNKLVF    1.128865e-08
2       CASSSRSTDTQYF    5.184760e-07     CAGAGGGSQGNLIF    2.512580e-06
3        CASSIGVYGYTF    8.576919e-08  CAFMSNAGGTSYGKLTF    1.832054e-07
4  CASSLLVSGVSSTDTQYF    2.143508e-11       CAVIGEGGKLTF    5.698448e-12
"""

鼠的样本

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.pgen import OlgaModel
import numpy as np

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',
            compute_distances = True)

# Load OLGA model as a python object
olga_beta  = OlgaModel(chain_folder = "mouse_T_beta", recomb_type="VDJ")
olga_alpha = OlgaModel(chain_folder = "mouse_T_alpha", recomb_type="VJ")

# An example computing a single Pgen
olga_beta.compute_aa_cdr3_pgen(tr.clone_df['cdr3_b_aa'][0])
olga_alpha.compute_aa_cdr3_pgen(tr.clone_df['cdr3_a_aa'][0])

# An example computing multiple Pgens 
olga_beta.compute_aa_cdr3_pgens(tr.clone_df['cdr3_b_aa'][0:5])
olga_alpha.compute_aa_cdr3_pgens(tr.clone_df['cdr3_a_aa'][0:5])

# An example computing 1920 Pgens more quickly with multiple cpus
import parmap
tr.clone_df['pgen_cdr3_b_aa'] = \
    parmap.map(
        olga_beta.compute_aa_cdr3_pgen, 
        tr.clone_df['cdr3_b_aa'], 
        pm_pbar=True, 
        pm_processes = 2)

tr.clone_df['pgen_cdr3_a_aa'] = \
    parmap.map(
        olga_alpha.compute_aa_cdr3_pgen, 
        tr.clone_df['cdr3_a_aa'], 
        pm_pbar=True, 
        pm_processes = 2)

"""
We can do something else useful. We've tweaked the original 
generative code in OLGA, so that you can generate CDRs,
given a specific TRV and TRJ. 

Note that unfortunately not all genes are recognized in default OLGA models, 
but many are. This gives you an idea of what you can do. Here are 10
CDR3s generated at random given a particular V,J usage combination
"""
np.random.seed(1)
olga_beta.gen_cdr3s(V = 'TRBV14*01', J = 'TRBJ2-5*01', n =10) 
olga_alpha.gen_cdr3s(V ='TRAV4-3*02', J ='TRAJ31*01',  n =10)

"""
Using this approach, we can synthesize an 100K background, 
with similar gene usage frequency to our actual repertoire. 
Note, however, that given data availability, 
this is currently likely the most reliable for human beta chain.

After OLGA's publication, a default mouse alpha model (mouse_T_alpha) 
was added to the OLGA GitHub repository. We've included that here
but it should be used with caution as it is missing
a number of commonly seen V genes.
"""
np.random.seed(1)
tr.synthesize_vj_matched_background(chain = 'beta')
"""
          v_b_gene    j_b_gene            cdr3_b_aa        pV        pJ       pVJ   weights      source
0        TRBV14*01  TRBJ2-3*01        CASSLASAETLYF  0.033721  0.092039  0.002989  0.065742  vj_matched
1      TRBV13-2*01  TRBJ2-3*01    CASGDAPDRTGAETLYF  0.118785  0.092039  0.010331  0.271309  vj_matched
2      TRBV13-3*01  TRBJ1-1*01  CASSDGFSRTGGVNTEVFF  0.074051  0.106146  0.006923  1.009124  vj_matched
3      TRBV13-3*01  TRBJ2-1*01       CASSDVQGGAEQFF  0.074051  0.117684  0.008915  1.021244  vj_matched
4      TRBV13-3*01  TRBJ2-7*01     CASSSGTGGYIYEQYF  0.074051  0.204898  0.015366  1.670224  vj_matched
...            ...         ...                  ...       ...       ...       ...       ...         ...
99995    TRBV14*01  TRBJ2-3*01  CASSPTGGAPYASAETLYF  0.033721  0.092039  0.002989  0.065742  vj_matched
99996    TRBV17*01  TRBJ2-5*01       CASSRDPTQDTQYF  0.028110  0.124712  0.004930  0.650360  vj_matched
99997    TRBV14*01  TRBJ2-3*01       CASSSTGGAETLYF  0.033721  0.092039  0.002989  0.065742  vj_matched
99998  TRBV13-1*01  TRBJ2-1*01      CASSDWGKDYAEQFF  0.106042  0.117684  0.013373  2.622194  vj_matched
99999     TRBV4*01  TRBJ2-3*01      CASSYDRGSAETLYF  0.040749  0.092039  0.002989  0.068343  vj_matched
"""
np.random.seed(1)
tr.synthesize_vj_matched_background(chain = 'alpha')
"""
           v_a_gene   j_a_gene        cdr3_a_aa        pV        pJ       pVJ   weights      source
0      TRAV12N-3*01  TRAJ34*02     CAIASNTNKVVF  0.000438  0.000088  0.000088  0.006059  vj_matched
1       TRAV3D-3*02  TRAJ33*01  CAVSAGADSNYQLIW  0.000088  0.000088  0.000088  0.005122  vj_matched
2        TRAV3-3*01  TRAJ27*01     CAVSTNTGKLTF  0.014029  0.042964  0.000877  0.277471  vj_matched
3        TRAV3-3*01  TRAJ26*01    CAVSHNYAQGLTF  0.014029  0.040947  0.001052  0.009155  vj_matched
4        TRAV3-3*01  TRAJ26*01   CAVSARNYAQGLTF  0.014029  0.040947  0.001052  0.009155  vj_matched
...             ...        ...              ...       ...       ...       ...       ...         ...
99995   TRAV3D-3*02  TRAJ21*01    CAVSVSNYNVLYF  0.000088  0.039982  0.000088  0.003758  vj_matched
99996    TRAV3-3*01  TRAJ43*01    CAVSENNNNAPRF  0.014029  0.022271  0.000526  0.071093  vj_matched
99997   TRAV3D-3*02  TRAJ26*01    CAVSGNYAQGLTF  0.000088  0.040947  0.000088  0.000296  vj_matched
99998    TRAV3-3*01  TRAJ26*01   CAVKGNNYAQGLTF  0.014029  0.040947  0.001052  0.009155  vj_matched
99999   TRAV9N-2*01  TRAJ15*01      CTYQGGRALIF  0.000088  0.043840  0.000088  0.020438  vj_matched
"""

""""
tcrdist3's integration of Pgen estimates makes it very easy to look for
PUBLIC clusters of TCRs (i.e. high number of neighbors) with unlikely V(D)J 
recombinations.
"""
from tcrdist.public import _neighbors_fixed_radius
from tcrdist.public import _K_neighbors_fixed_radius
tr.clone_df['neighbors'] = _neighbors_fixed_radius(pwmat = tr.pw_beta, radius = 18)
tr.clone_df['K_neighbors'] = _K_neighbors_fixed_radius(pwmat = tr.pw_beta , radius = 18)
tr.clone_df['pgen_cdr3_b_aa_nlog10'] = tr.clone_df['pgen_cdr3_b_aa'].apply(lambda x : -1*np.log10(x))
tr.clone_df['nsubject'] = tr.clone_df['neighbors'].apply(lambda x: len(tr.clone_df['subject'][x].unique()))
# nsubject > 1 implies quasi-publicity
tr.clone_df['qpublic'] = tr.clone_df['nsubject'].apply(lambda x: x >1)

# Note one can find neighbors based on paired-chain distances.
from tcrdist.public import _neighbors_fixed_radius
from tcrdist.public import _K_neighbors_fixed_radius
tr.clone_df['neighbors'] = _neighbors_fixed_radius(pwmat = tr.pw_beta + tr.pw_alpha, radius = 50)
tr.clone_df['K_neighbors'] = _K_neighbors_fixed_radius(pwmat = tr.pw_beta + tr.pw_alpha , radius = 50)
tr.clone_df['pgen_cdr3_b_aa_nlog10'] = tr.clone_df['pgen_cdr3_b_aa'].apply(lambda x : -1*np.log10(x))
tr.clone_df['nsubject'] = tr.clone_df['neighbors'].apply(lambda x: len(tr.clone_df['subject'][x].unique()))
# nsubject > 1 implies quasi-publicity)
tr.clone_df['qpublic'] = tr.clone_df['nsubject'].apply(lambda x: x >1 )
"""
#Code for plotting pgen vs. neighbors 
import plotnine
from plotnine import ggplot, geom_point, aes, ylab, xlab
(ggplot(tr.clone_df, 
    aes(x= 'pgen_cdr3_b_aa_nlog10', 
        y ='K_neighbors',
        color = 'nsubject')) + 
    geom_point(size = .1) + 
    plotnine.scales.xlim(0,20) + 
    plotnine.facets.facet_wrap('epitope', scales = "free_y")+ ylab('Neighbors') + xlab('-Log10 Pgen'))
"""
图片.png

Meta-Clonotypes(其实就是特异性克隆)

To ensure you have the necessary background reference files in your environment(背景就是我们正常的样本).

Meta-Clonotype Discovery

"""
2020-12-18
Seattle, WA
HLA CLASS I restricted meta-clonotype discovery.

This functions encapsulates a complete work flow for finding meta-clonotypes 
in antigen-enriched data.

prepare docker container:
apt-get install unzip
python3 -c "from tcrsampler.setup_db import install_all_next_gen; install_all_next_gen(dry_run = False)"
"""
import sys
import os
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
from tcrdist.paths import path_to_base
from tcrdist.repertoire import TCRrep
from tcrdist.automate import auto_pgen
from tcrsampler.sampler import TCRsampler
from tcrdist.background import get_stratified_gene_usage_frequency
from tcrdist.background import sample_britanova
from tcrdist.background import make_gene_usage_counter, get_gene_frequencies
from tcrdist.background import calculate_adjustment, make_gene_usage_counter
from tcrdist.background import make_vj_matched_background, make_flat_vj_background
from tcrdist.ecdf import distance_ecdf, _plot_manuscript_ecdfs
from tcrdist.centers import calc_radii
from tcrdist.public import _neighbors_variable_radius
from tcrdist.public import _neighbors_sparse_variable_radius
from tcrdist.regex import _index_to_regex_str

def find_metaclonotypes(
    project_path = "tutorial48",
    source_path = os.path.join(path_to_base,'tcrdist','data','covid19'),
    antigen_enriched_file = 'mira_epitope_48_610_YLQPRTFL_YLQPRTFLL_YYVGYLQPRTF.tcrdist3.csv',
    ncpus = 4, 
    seed = 3434):
    """
    This functions encapsulates a complete 
    workflow for finding meta-clonotypes in antigen-enriched data.
    """
    np.random.seed(seed)
    if not os.path.isdir(project_path):
        os.mkdir(project_path)
    ############################################################################
    # Step 1: Select and load a antigen-enriched (sub)repertoire.           ####
    ############################################################################
    print(f"INITIATING A TCRrep() with {antigen_enriched_file}")
    assert os.path.isfile(os.path.join(source_path, antigen_enriched_file))
        # Read file into a Pandas DataFrame <df>
    df = pd.read_csv(os.path.join(source_path, antigen_enriched_file))
        # Drop cells without any gene usage information
    df = df[( df['v_b_gene'].notna() ) & (df['j_b_gene'].notna()) ]
        # Initialize a TCRrep class, using ONLY columns that are complete and unique define a a clone.
        # Class provides a 'count' column if non is present
        # Counts of identical subject:VCDR3 'clones' will be aggregated into a TCRrep.clone_df.
    from tcrdist.repertoire import TCRrep
    tr = TCRrep(cell_df = df[['subject','cell_type','v_b_gene', 'j_b_gene', 'cdr3_b_aa']], 
                organism = "human", 
                chains = ['beta'], 
                compute_distances = True)
    tr.cpus = ncpus
    ############################################################################
    # Step 1.1: Estimate Probability of Generation                          ####
    ############################################################################
    ### It will be useful later to know the pgen of each
    from tcrdist.automate import auto_pgen
    print(f"COMPUTING PGEN WITH OLGA (Sethna et al 2018)")
    print("FOR ANTIGEN-ENRICHED CLONES TO BE USED FOR SUBSEQUENT ANALYSES")
    auto_pgen(tr)

    # Tip: Users of tcrdist3 should be aware that by default a <TCRrep.clone_df> 
    # DataFrame is created out of non-redundant cells in the cell_df, and 
    # pairwise distance matrices automatically computed.
    # Notice that attributes <tr.clone_df>  and  <tr.pw_beta> , <tr.pw_cdr3_b_aa>, 
    # are immediately accessible.
    # Attributes <tr.pw_pmhc_b_aa>, <tr.pw_cdr2_b_aa>, and <tr.pw_cdr1_b_aa>  
    # are also available if <TCRrep.store_all_cdr> is set to True.
    # For large datasets, i.e., >15,000 clones, this approach may consume too much 
    # memory so <TCRrep.compute_distances> is automatically set to False. 
                                    
    ############################################################################
    # Step 2: Synthesize an Inverse Probability Weighted VJ Matched Background #
    ############################################################################
    # Generating an appropriate set of unenriched reference TCRs is important; for
    # each set of antigen-associated TCRs, discovered by MIRA, we created a two part
    # background. One part consists of 100,000 synthetic TCRs whose V-gene and J-gene
    # frequencies match those in the antigen-enriched repertoire, using the software
    # OLGA (Sethna et al. 2019; Marcou et al. 2018). The other part consists of
    # 100,000 umbilical cord blood TCRs sampled uniformly from 8 subjects (Britanova
    # et al., 2017). This mix balances dense sampling of sequences near the
    # biochemical neighborhoods of interest with broad sampling of TCRs from an
    # antigen-naive repertoire. Importantly, we adjust for the biased sampling by
    # using the V- and J-gene frequencies observed in the cord-blood data (see
    # Methods for details about inverse probability weighting adjustment). Using this
    # approach we are able to estimate the abundance of TCRs similar to a centroid
    # TCR in an unenriched background repertoire of ~1,000,000 TCRs, using a
    # comparatively modest background dataset of 200,000 TCRs. While this estimate
    # may underestimate the true specificity, since some of the neighborhood TCRs in
    # the unenriched background repertoire may in fact recognize the antigen of
    # interest, it is useful for prioritizing neighborhoods and selecting a radius
    # for each neighborhood that balances sensitivity and specificity.
    # Initialize a TCRsampler -- human, beta, umbilical cord blood from 8 people.
    print(f"USING tcrsampler TO CONSTRUCT A CUSTOM V-J MATCHED BACKGROUND")
    from tcrsampler.sampler import TCRsampler
    ts = TCRsampler(default_background = 'britanova_human_beta_t_cb.tsv.sampler.tsv')
    # Stratify sample so that each subject contributes similarly to estimate of 
    # gene usage frequency
    from tcrdist.background import get_stratified_gene_usage_frequency
    ts = get_stratified_gene_usage_frequency(ts = ts, replace = True) 
    # Synthesize an inverse probability weighted V,J gene background that matches 
    # usage in your enriched repertoire 
    df_vj_background = tr.synthesize_vj_matched_background(ts = ts, chain = 'beta')
    # Get a randomly drawn stratified sampler of beta, cord blood from 
    # Britanova et al. 2016 
    # Dynamics of Individual T Cell Repertoires: From Cord Blood to Centenarians
    from tcrdist.background import  sample_britanova
    df_britanova_100K = sample_britanova(size = 100000)
    # Append frequency columns using, using sampler above
    df_britanova_100K = get_gene_frequencies(ts = ts, df = df_britanova_100K)
    df_britanova_100K['weights'] = 1
    df_britanova_100K['source'] = "stratified_random"
    # Combine the two parts of the background into a single DataFrame
    df_bkgd = pd.concat([df_vj_background.copy(), df_britanova_100K.copy()], axis = 0).\
        reset_index(drop = True)                                              
    # Assert that the backgrounds have the expected number of rows.
    assert df_bkgd.shape[0] == 200000
    # Save the background for future use
    background_outfile = os.path.join(project_path, f"{antigen_enriched_file}.olga100K_brit100K_bkgd.csv")
    print(f'WRITING {background_outfile}')
    df_bkgd.to_csv(background_outfile, index = False)
    # Load the background to a TCRrep without computing pairwise distances 
    # (i.e., compute_distances = False)
    tr_bkgd = TCRrep(
        cell_df = df_bkgd,
        organism = "human", 
        chains = ['beta'], 
        compute_distances = False)
    # Compute rectangular distances. Those are, distances between each clone in 
    # the antigen-enriched repertoire and each TCR in the background.
    # With a single 1 CPU and < 10GB RAM, 5E2x2E5 = 100 million pairwise distances, 
    # across CDR1, CDR2, CDR2.5, and CDR3 
    # 1min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
    # %timeit -r 1 tr.compute_rect_distances(df = tr.clone_df, df2 = tr_bkdg.clone_df, store = False)
    ############################################################################
    # Step 4: Calculate Distances                                          #####
    ############################################################################
    print(f"COMPUTING RECTANGULARE DISTANCE")
    tr.compute_sparse_rect_distances(
        df = tr.clone_df, 
        df2 = tr_bkgd.clone_df,
        radius=50,
        chunk_size = 100)
    scipy.sparse.save_npz(os.path.join(project_path, f"{antigen_enriched_file}.rw_beta.npz"), tr.rw_beta)
        # Tip: For larger dataset you can use a sparse implementation: 
        # 30.8 s ± 0 ns per loop ; tr.cpus = 6
        # %timeit -r tr.compute_sparse_rect_distances(df = tr.clone_df, df2 = tr_bkdg.clone_df,radius=50, chunk_size=85)
    ############################################################################
    # Step 5: Examine Density ECDFS                                        #####
    ############################################################################
        # Investigate the density of neighbors to each TCR, based on expanding 
        # distance radius.
    from tcrdist.ecdf import distance_ecdf, _plot_manuscript_ecdfs
    import matplotlib.pyplot as plt
        # Compute empirical cumulative density function (ecdf)
        # Compare Antigen Enriched TCRs (against itself).
    thresholds, antigen_enriched_ecdf = distance_ecdf(
        tr.pw_beta,
        thresholds=range(0,50,2))
        # Compute empirical cumulative density function (ecdf)
        # Compare Antigen Enriched TCRs (against) 200K probability 
        # inverse weighted background
    thresholds, background_ecdf = distance_ecdf(
        tr.rw_beta,
        thresholds=range(0,50,2),
        weights= tr_bkgd.clone_df['weights'], 
        absolute_weight = True)
        # plot_ecdf similar to tcrdist3 manuscript #
    antigen_enriched_ecdf[antigen_enriched_ecdf == antigen_enriched_ecdf.min()] = 1E-10
    f1 = _plot_manuscript_ecdfs(
        thresholds, 
        antigen_enriched_ecdf, 
        ylab= 'Proportion of Antigen Enriched TCRs', 
        cdr3_len=tr.clone_df.cdr3_b_aa.str.len(), 
        min_freq=1E-10)
    f1.savefig(os.path.join(project_path, f'{antigen_enriched_file}.ecdf_AER_plot.png'))
    f2 = _plot_manuscript_ecdfs(
        thresholds,
        background_ecdf,
        ylab= 'Proportion of Reference TCRs',
        cdr3_len=tr.clone_df.cdr3_b_aa.str.len(),
        min_freq=1E-10)
    f2.savefig(os.path.join(project_path, f'{antigen_enriched_file}.ecdf_BUR_plot.png'))
    ############################################################################
    # Step 6: Find optimal radii  (theta = 1E5                             #####
    ############################################################################
    # To ascertain which meta-clonotypes are likely to be most specific, 
    # take advantage of an existing function <bkgd_cntrl_nn2>.                                                                                                                                  
    #  d888   .d8888b.  8888888888     888888888  
    # d8888  d88P  Y88b 888            888        
    #   888  888    888 888            888        
    #   888  888    888 8888888        8888888b.  
    #   888  888    888 888                 "Y88b 
    #   888  888    888 888      888888       888 
    #   888  Y88b  d88P 888            Y88b  d88P 
    # 8888888 "Y8888P"  8888888888      "Y8888P"                                         
   
    level_tag = '1E5'
    from tcrdist.neighbors import bkgd_cntl_nn2
    centers_df  = bkgd_cntl_nn2(
        tr               = tr,
        tr_background    = tr_bkgd,
        weights          = tr_bkgd.clone_df.weights,
        ctrl_bkgd        = 10**-5, 
        col              = 'cdr3_b_aa',
        add_cols         = ['v_b_gene', 'j_b_gene'],
        ncpus            = 4,
        include_seq_info = True,
        thresholds       = [x for x in range(0,50,2)],
        generate_regex   = True,
        test_regex       = True,
        forced_max_radius = 36)

    ############################################################################
    # Step 6.2: (theta = 1E5) ALL meta-clonotypes .tsv file                   ##
    ############################################################################
    # save center to project_path for future use
    centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )
    
    # Many of meta-clonotypes contain redundant information. 
    # We can winnow down to less-redundant list. We do this 
    # by ranking clonotypes from most to least specific. 
        # <min_nsubject> is minimum publicity of the meta-clonotype,  
        # <min_nr> is minimum non-redundancy
    # Add neighbors, K_neighbors, and nsubject columns
    from tcrdist.public import _neighbors_variable_radius, _neighbors_sparse_variable_radius
    centers_df['neighbors'] = _neighbors_variable_radius(pwmat=tr.pw_beta, radius_list = centers_df['radius'])
    centers_df['K_neighbors'] = centers_df['neighbors'].apply(lambda x : len(x))
    # We determine how many <nsubjects> are in the set of neighbors 
    centers_df['nsubject']  = centers_df['neighbors'].\
            apply(lambda x: tr.clone_df['subject'].iloc[x].nunique())
    centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )

    from tcrdist.centers import rank_centers
    ranked_centers_df = rank_centers(
        centers_df = centers_df, 
        rank_column = 'chi2joint', 
        min_nsubject = 2, 
        min_nr = 1)
    ############################################################################
    # Step 6.3:  (theta = 1E5) NR meta-clonotypes .tsv file                  ###
    ############################################################################
    # Output, ready to search bulk data.
    ranked_centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.ranked_centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )
    ############################################################################
    # Step 6.4: (theta = 1E5) Output Meta-Clonotypes HTML Summary            ###
    ############################################################################
    # Here we can make a svg logo for each NR meta-clonotype
    if ranked_centers_df.shape[0] > 0:
        from progress.bar import IncrementalBar
        from tcrdist.public import make_motif_logo
        cdr3_name = 'cdr3_b_aa'
        v_gene_name = 'v_b_gene'
        svgs = list()
        svgs_raw = list()
        bar = IncrementalBar('Processing', max = ranked_centers_df.shape[0])
        for i,r in ranked_centers_df.iterrows():
            bar.next()
            centroid = r[cdr3_name]
            v_gene   = r[v_gene_name]
            svg, svg_raw = make_motif_logo( tcrsampler = ts, 
                                            pwmat = tr.pw_beta,
                                            clone_df = tr.clone_df,
                                            centroid = centroid ,
                                            v_gene = v_gene ,
                                            radius = r['radius'],
                                            pwmat_str = 'pw_beta',
                                            cdr3_name = 'cdr3_b_aa',
                                            v_name = 'v_b_gene',
                                            gene_names = ['v_b_gene','j_b_gene'])
            svgs.append(svg)
            svgs_raw.append(svg_raw)
        bar.next();bar.finish()
        ranked_centers_df['svg']      = svgs
        ranked_centers_df['svg_raw'] = svgs_raw

        def shrink(s):
            return s.replace('height="100%"', 'height="20%"').replace('width="100%"', 'width="20%"')
        labels =['cdr3_b_aa','v_b_gene', 'j_b_gene', 'pgen',
                'radius', 'regex','nsubject','K_neighbors', 
                'bkgd_hits_weighted','chi2dist','chi2re','chi2joint']
        
        output_html_name = os.path.join(project_path, f'{antigen_enriched_file}.ranked_centers_bkgd_ctlr_{level_tag}.html')
        # 888    888 88888888888 888b     d888 888      
        # 888    888     888     8888b   d8888 888      
        # 888    888     888     88888b.d88888 888      
        # 8888888888     888     888Y88888P888 888      
        # 888    888     888     888 Y888P 888 888      
        # 888    888     888     888  Y8P  888 888      
        # 888    888     888     888   "   888 888      
        # 888    888     888     888       888 88888888
        with open(output_html_name, 'w') as output_handle:
            for i,r in ranked_centers_df.iterrows():
                #import pdb; pdb.set_trace()
                svg, svg_raw = r['svg'],r['svg_raw']
                output_handle.write("<br></br>")
                output_handle.write(shrink(svg))
                output_handle.write(shrink(svg_raw))
                output_handle.write("<br></br>")
                output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
                output_handle.write("<br></br>")
    # To ascertain which meta-clonotypes are likely to be most specific, 
    # take advantage of an existing function <bkgd_cntrl_nn2>.       
    #  d888   .d8888b.  8888888888       .d8888b.  
    # d8888  d88P  Y88b 888             d88P  Y88b 
    #   888  888    888 888             888        
    #   888  888    888 8888888         888d888b.  
    #   888  888    888 888             888P "Y88b 
    #   888  888    888 888      888888 888    888 
    #   888  Y88b  d88P 888             Y88b  d88P 
    # 8888888 "Y8888P"  8888888888       "Y8888P" 
    ############################################################################
    # Step 6.5: Find optimal radii  (theta = 1E6)                            ###
    ############################################################################
    level_tag = '1E6'
    from tcrdist.neighbors import bkgd_cntl_nn2
    centers_df  = bkgd_cntl_nn2(
        tr               = tr,
        tr_background    = tr_bkgd,
        weights          = tr_bkgd.clone_df.weights,
        ctrl_bkgd        = 10**-6, 
        col              = 'cdr3_b_aa',
        add_cols         = ['v_b_gene', 'j_b_gene'],
        ncpus            = 4,
        include_seq_info = True,
        thresholds       = [x for x in range(0,50,2)],
        generate_regex   = True,
        test_regex       = True,
        forced_max_radius = 36)
    ############################################################################
    # Step 6.6: (theta = 1E6) ALL meta-clonotypes .tsv file                   ##
    ############################################################################
    # save center to project_path for future use
    centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )
    
    # Many of meta-clonotypes contain redundant information. 
    # We can winnow down to less-redundant list. We do this 
    # by ranking clonotypes from most to least specific. 
        # <min_nsubject> is minimum publicity of the meta-clonotype,  
        # <min_nr> is minimum non-redundancy
    # Add neighbors, K_neighbors, and nsubject columns
    from tcrdist.public import _neighbors_variable_radius, _neighbors_sparse_variable_radius
    centers_df['neighbors'] = _neighbors_variable_radius(pwmat=tr.pw_beta, radius_list = centers_df['radius'])
    centers_df['K_neighbors'] = centers_df['neighbors'].apply(lambda x : len(x))
    # We determine how many <nsubjects> are in the set of neighbors 
    centers_df['nsubject']  = centers_df['neighbors'].\
            apply(lambda x: tr.clone_df['subject'].iloc[x].nunique())
    centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )

    from tcrdist.centers import rank_centers
    ranked_centers_df = rank_centers(
        centers_df = centers_df, 
        rank_column = 'chi2joint', 
        min_nsubject = 2, 
        min_nr = 1)
    ############################################################################
    # Step 6.7:  (theta = 1E6) NR meta-clonotypes .tsv file                  ###
    ############################################################################
    # Output, ready to search bulk data.
    ranked_centers_df.to_csv( os.path.join(project_path, f'{antigen_enriched_file}.ranked_centers_bkgd_ctlr_{level_tag}.tsv'), sep = "\t" )

    ############################################################################
    # Step 6.8: (theta = 1E6) Output Meta-Clonotypes HTML Summary            ###
    ############################################################################
    # Here we can make a svg logo for each meta-clonotype
    from progress.bar import IncrementalBar
    from tcrdist.public import make_motif_logo
    if ranked_centers_df.shape[0] > 0:
        cdr3_name = 'cdr3_b_aa'
        v_gene_name = 'v_b_gene'
        svgs = list()
        svgs_raw = list()
        bar = IncrementalBar('Processing', max = ranked_centers_df.shape[0])
        for i,r in ranked_centers_df.iterrows():
            bar.next()
            centroid = r[cdr3_name]
            v_gene   = r[v_gene_name]
            svg, svg_raw = make_motif_logo( tcrsampler = ts, 
                                            pwmat = tr.pw_beta,
                                            clone_df = tr.clone_df,
                                            centroid = centroid ,
                                            v_gene = v_gene ,
                                            radius = r['radius'],
                                            pwmat_str = 'pw_beta',
                                            cdr3_name = 'cdr3_b_aa',
                                            v_name = 'v_b_gene',
                                            gene_names = ['v_b_gene','j_b_gene'])
            svgs.append(svg)
            svgs_raw.append(svg_raw)
        bar.next();bar.finish()
        ranked_centers_df['svg']      = svgs
        ranked_centers_df['svg_raw'] = svgs_raw

        def shrink(s):
            return s.replace('height="100%"', 'height="20%"').replace('width="100%"', 'width="20%"')
        labels =['cdr3_b_aa', 'v_b_gene', 'j_b_gene', 'pgen', 'radius', 'regex','nsubject','K_neighbors', 'bkgd_hits_weighted','chi2dist','chi2re','chi2joint']
        
        output_html_name = os.path.join(project_path, f'{antigen_enriched_file}.ranked_centers_bkgd_ctlr_{level_tag}.html')
        # 888    888 88888888888 888b     d888 888      
        # 888    888     888     8888b   d8888 888      
        # 888    888     888     88888b.d88888 888      
        # 8888888888     888     888Y88888P888 888      
        # 888    888     888     888 Y888P 888 888      
        # 888    888     888     888  Y8P  888 888      
        # 888    888     888     888   "   888 888      
        # 888    888     888     888       888 88888888     
        with open(output_html_name, 'w') as output_handle:
            for i,r in ranked_centers_df.iterrows():
                #import pdb; pdb.set_trace()
                svg, svg_raw = r['svg'],r['svg_raw']
                output_handle.write("<br></br>")
                output_handle.write(shrink(svg))
                output_handle.write(shrink(svg_raw))
                output_handle.write("<br></br>")
                output_handle.write(pd.DataFrame(r[labels]).transpose().to_html())
                output_handle.write("<br></br>")




if __name__ == "__main__":
    # 888      888                               888             
    # 888      888                               888             
    # 888      888                               888             
    # 88888b.  888  8888b.     .d8888b   .d88b.  888888 .d8888b  
    # 888 "88b 888     "88b    88K      d8P  Y8b 888    88K      
    # 888  888 888 .d888888    "Y8888b. 88888888 888    "Y8888b. 
    # 888  888 888 888  888         X88 Y8b.     Y88b.       X88 
    # 888  888 888 "Y888888     88888P'  "Y8888   "Y888  88888P' 
    # December 2020 
    # Seattle, WA
    
    # If run as a script, this will find meta-clonotypes for the 18
    # SARS-CoV-2 MIRA sets with the most prior evidence of restriction
    # by as specific HLA-alleles.

    from collections import defaultdict 
    import os 
    import pandas as pd
    import re
    # <path> where files reside
    path = os.path.join(path_to_base,'tcrdist','data','covid19')
    # <all_files> list of all files
    all_files = [f for f in os.listdir(path) if f.endswith('.tcrdist3.csv')]
    # <restriction> list of tuples from Supporting Table S5
    restriction = \
    [('m_55_524_ALR','A*01'),
    ('m_1_8260_HTT','A*01'),
    ('m_45_689_SYF','A*01'),
    ('m_10_2274_LSP','B*07'),
    ('m_155_59_RAR','B*07'),
    ('m_133_102_NQK','B*15'),
    ('m_48_610_YLQ','A*02'),
    ('m_111_146_AEI','A*11'),
    ('m_53_532_NYL','A*24'),
    ('m_90_216_GYQ','C*07'),
    ('m_140_92_NSS','A*01'),
    ('m_55_524_ALR','B*08'),
    ('m_183_39_RIR','A*03'),
    ('m_10_2274_LSP','C*07'),
    ('m_99_191_QEC','B*40'),
    ('m_155_59_RAR','C*07'),
    ('m_185_39_ASQ','B*15'),
    ('m_147_73_DLF','B*08'),
    ('m_110_148_ELI','B*44'),
    ('m_51_546_AYK','A*03'),
    ('m_44_697_FPP','B*35'),
    ('m_118_136_ALN','A*11'),
    ('m_176_44_SST','A*11'),
    ('m_30_1064_KAY','B*57'),
    ('m_192_31_FQP','B*15'),
    ('m_70_345_DTD','A*01')]
    # <restrictions_dict> convert list to dictionary
    restrictions_dict = defaultdict(list)
    for k,v in restriction:
        restrictions_dict[k].append(v)
    # Loop through all files to construct a Dataframe of only those with strongest evidence of HLA-restriction
    cache = list()
    for f in all_files:
        rgs = re.search(pattern ='(mira_epitope)_([0-9]{1,3})_([0-9]{1,6})_([A-Z]{1,3})', 
            string = f).groups()    
        rgs4 = re.search(pattern ='(mira_epitope)_([0-9]{1,3})_([0-9]{1,6})_([A-Z]{1,4})', 
            string = f).groups()
        key3 = f'm_{rgs[1]}_{rgs[2]}_{rgs[3]}'
        key4 = f'm_{rgs4[1]}_{rgs4[2]}_{rgs4[3]}'
        setkey = f"M_{rgs[1]}"
        include = key3 in restrictions_dict.keys()
        alleles = restrictions_dict.get(key3)
        cache.append((setkey, key3, key4, f, int(rgs[1]), int(rgs[2]), include, alleles))
    # <hla_df>
    hla_df = pd.DataFrame(cache, columns = ['set', 'key3','key4','filename','set_rank','clones','hla_restricted','alleles']).\
        sort_values(['hla_restricted','clones'], ascending = True).\
        query('hla_restricted == True').reset_index(drop = True)
    
    from tcrdist.paths import path_to_base
    for ind,row in hla_df.iterrows():
        file = row['filename']
        print(file, ind)
        print(row)
        find_metaclonotypes(project_path = "hla_restricted_meta_clonotypes",
                            source_path = os.path.join(path_to_base,'tcrdist','data','covid19'),
                            antigen_enriched_file = file,
                            ncpus = 6, 
                            seed = 3434)


# 8888888888 888b    888 8888888b.  
# 888        8888b   888 888  "Y88b 
# 888        88888b  888 888    888 
# 8888888    888Y88b 888 888    888 
# 888        888 Y88b888 888    888 
# 888        888  Y88888 888    888 
# 888        888   Y8888 888  .d88P 
# 8888888888 888    Y888 8888888P"  
For each run, the function will write the following outputs:

project_folder/
├── .centers_bkgd_ctlr_1E5.tsv
├── .centers_bkgd_ctlr_1E6.tsv
├── .ecdf_AER_plot.png
├── .ecdf_BUR_plot.png
├── .olga100K_brit100K_bkgd.csv
├── .ranked_centers_bkgd_ctlr_1E5.html
├── .ranked_centers_bkgd_ctlr_1E5.tsv
├── .ranked_centers_bkgd_ctlr_1E6.html
├── .ranked_centers_bkgd_ctlr_1E6.tsv
└── .rw_beta.npz

图片.png
图片.png

Meta-Clonotype Tabulation

The .ranked_centers_bkgd_ctlr_1E6.tsv files produced above can be used to directly search for the presence of meta-clonotypes is bulk data. How do we do this? Before showing how to do this for many files consider doing it for a single bulk repertoire:
In this example:
  • We download a raw ImmunoSEQ file.
  • Format it for use in tcrdist3.
  • Search it with one of the meta-clonotypes file we made from above.
import os
import pandas as pd 
from tcrdist.adpt_funcs import import_adaptive_file
from tcrdist.repertoire import TCRrep
from tcrdist.tabulate import tabulate
import math
import time
"""
You can download the file with wget or follow the link
!wget https://www.dropbox.com/s/1zbcf1ep8kgmlzy/KHBR20-00107_TCRB.tsv
"""
bulk_file = 'KHBR20-00107_TCRB.tsv'
df_bulk = import_adaptive_file(bulk_file)
df_bulk = df_bulk[['cdr3_b_aa',
                    'v_b_gene',
                    'j_b_gene',
                    'templates',
                    'productive_frequency',
                    'valid_cdr3']].\
        rename(columns = {'templates':'count'})

df_bulk = df_bulk[(df_bulk['v_b_gene'].notna()) & (df_bulk['j_b_gene'].notna())].reset_index()
tr_bulk = TCRrep(cell_df = df_bulk,
                 organism = 'human',
                 chains = ['beta'],
                 db_file = 'alphabeta_gammadelta_db.tsv',
                 compute_distances = False)

search_file = os.path.join(
    'tcrdist',
    'data',
    'covid19',
    'mira_epitope_48_610_YLQPRTFL_YLQPRTFLL_YYVGYLQPRTF.tcrdist3.csv.ranked_centers_bkgd_ctlr_1E6.tsv')

df_search = pd.read_csv(search_file, sep = '\t')
df_search = df_search[['cdr3_b_aa','v_b_gene','j_b_gene','pgen','radius','regex']]
tr_search = TCRrep(cell_df = df_search,
                   organism = 'human',
                   chains = ['beta'],
                   db_file = 'alphabeta_gammadelta_db.tsv',
                   compute_distances = False)
tr_search.cpus = 1
tic = time.perf_counter()
tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = 50, radius = 50) 
results = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
toc = time.perf_counter()
print(f"TABULATED IN {toc - tic:0.4f} seconds")
结果文件包括

cdr3_b_aa - meta-clonotype centroid CDR3
v_b_gene - meta-clonotype centroid TRBV
j_b_gene - meta-clonotype centroid TRBJ
pgen. - meta-clonotype centroid TRBV, CDR3 Pgen (Probability of Generation, estimated with OGLA)
radius - meta-clonotype maximum TCRdist to find a neighbor
regex - pattern of conserved positions learned from the all those sequences with <radius> of the centroid in the antigen enriched dataset.
cdr1_b_aa - meta-clonotype centroid CDR1
cdr2_b_aa - meta-clonotype centroid CDR2
pmhc_b_aa - meta-clonotype centroid CDR2.5
bulk_sum_freq - In the bulk sample, sum of frequencies of TCRs within the <radius> of the centroid (* RADIUS)
bulk_sum_counts - In the bulk sample, total template TCRs (counts) within the <radius> of the centroid (* RADIUS)
bulk_seqs - In the bulk sample, CDR3 seqs of TCRs within the <radius> of the centroid
bulk_v_genes - In the bulk sample, TRBVs of TCRs within the <radius> of the centroid
bulk_j_genes - In the bulk sample, TRBJs of TCRs within the <radius> of the centroid
bulk_distances - In the bulk sample, distances of TCRs from centroid, for those within the <radius> of the centroid
bulk_counts - In the bulk sample, individual counts of each TCR within radius of the centroid
bulk_freqs - In the bulk sample, individual frequencies of each TCR within radius of the centroid
bulk_regex_match - In the bulk sample, boolean for each CDR3 within the <radius> whether it matched the regex motif pattern
bulk_sum_freqs_regex_adj - In the bulk sample, sum of frequencies of TCRs within the <radius> of the centroid and matching regex (RADIUS + MOTIF)
bulk_sum_counts_regex_adj - In the bulk sample, sum of counts of TCRs within the <radius> of the centroid and matching regex (RADIUS + MOTIF)
bulk_sum_freqs_tcrdist0 - In the bulk sample, sum of frequencies of TCRs within the <radius> = 0 of the centroid (EXACT)
bulk_sum_counts_tcrdist0 - In the bulk sample, sum of counts of TCRs within the <radius> = 0 of the centroid (EXACT)

Tabulation Against Many

下面是一个完整的示例,用于将 694 个批量样本中每个元克隆型的频率和计数制成表格。 注意必须提供指向环境中所有批量文件所在目录的有效路径。
# December 2020 
# Seattle, WA
import pandas as pd 
from tcrdist.repertoire import TCRrep
from tcrdist.tabulate import tabulate
import os
import numpy as np
from tcrdist.paths import path_to_base
import time
import math
import parmap

def get_safe_chunk(search_clones, bulk_clones,target = 10**7):
    """
    This function help pick a chunk size that prevents excessive memory use,
    With two CPU, 10*7 should keep total overall memory demand below 1GB
    """
    ideal_divisor = (search_clones * bulk_clones) / target
    if ideal_divisor < 1:
        ideal_chunk_size = search_clones
        print(ideal_chunk_size)
    else:
        ideal_chunk_size = math.ceil((search_clones)/ ideal_divisor)
        print(ideal_chunk_size)
    return ideal_chunk_size

def do_search2(file, df_search, dest, tag):
    sample_name = file.replace('.tsv.tcrdist3.v_max.tsv','')
    tic = time.perf_counter()
    
    # <tr_search> tcrdist.repertoire.TCRrep object for computing distances
    tr_search = TCRrep(cell_df = df_search,
                    organism = 'human',
                    chains = ['beta'],
                    db_file = 'alphabeta_gammadelta_db.tsv',
                    compute_distances = False)
    # set cpus according to parameter above
    tr_search.cpus = 1
    df_bulk   = pd.read_csv(os.path.join(path, file), sep = '\t')
    df_bulk = df_bulk[['cdr3_b_aa','v_b_gene','j_b_gene','templates','productive_frequency']].rename(columns = {'templates':'count'})

    tr_bulk = TCRrep(   cell_df = df_bulk,                 
                        organism = 'human', 
                        chains = ['beta'], 
                        db_file = 'alphabeta_gammadelta_db.tsv',
                        compute_distances = False)
    
    #lines_per_file.append(tr_bulk.clone_df.shape[0]) 
    
    search_clones = tr_search.clone_df.shape[0]
    bulk_clones   = tr_bulk.clone_df.shape[0]
    # To avoid memory pressure on the system we set a target that tcrdist doesn't do more than 10M comparisons per process
    ideal_chunk_size = get_safe_chunk(tr_search.clone_df.shape[0], tr_bulk.clone_df.shape[0],target = 10**8)
    tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = ideal_chunk_size) #(5)
    r1 = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
    outfile = os.path.join(dest, f"{sample_name}.{tag}.bulk_tabulation.tsv")
    print(f"WRITING: {outfile}")
    r1.to_csv(outfile, sep = '\t')
    toc = time.perf_counter()
    print(f"TABULATED IN {toc - tic:0.4f} seconds")
    del(tr_search)
    del(tr_bulk)
    return(f"{toc - tic:0.4f}s")


if __name__ == "__main__":
    # 888      888                               888             
    # 888      888                               888             
    # 888      888                               888             
    # 88888b.  888  8888b.     .d8888b   .d88b.  888888 .d8888b  
    # 888 "88b 888     "88b    88K      d8P  Y8b 888    88K      
    # 888  888 888 .d888888    "Y8888b. 88888888 888    "Y8888b. 
    # 888  888 888 888  888         X88 Y8b.     Y88b.       X88 
    # 888  888 888 "Y888888     88888P'  "Y8888   "Y888  88888P' 
    # December 18, 2020 
    # Seattle, WA
    from collections import defaultdict 
    import os 
    import pandas as pd
    import re
    # <path> path where bulk files can be found 
    path_bulkfiles = INSERT_HERE # <--------------------------

    # <path> where files reside
    path = os.path.join(path_to_base,'tcrdist','data','covid19')#os.path.join('data-raw','2020-08-31-mira_tcr_by_epitope/')
    # <all_files> list of all files
    all_files = [f for f in os.listdir(path) if f.endswith('.tcrdist3.csv')]
    # <restriction> list of tuples from Supporting Table S5: https://docs.google.com/spreadsheets/d/1WAmze6lir-v11odO-nYYbCiYVh8WhQh_vy2d1UPPKb0/edit#gid=942295061
    restriction = \
    [('m_55_524_ALR','A*01'),
    ('m_1_8260_HTT','A*01'),
    ('m_45_689_SYF','A*01'),
    ('m_10_2274_LSP','B*07'),
    ('m_155_59_RAR','B*07'),
    ('m_133_102_NQK','B*15'),
    ('m_48_610_YLQ','A*02'),
    ('m_111_146_AEI','A*11'),
    ('m_53_532_NYL','A*24'),
    ('m_90_216_GYQ','C*07'),
    ('m_140_92_NSS','A*01'),
    ('m_55_524_ALR','B*08'),
    ('m_183_39_RIR','A*03'),
    ('m_10_2274_LSP','C*07'),
    ('m_99_191_QEC','B*40'),
    ('m_155_59_RAR','C*07'),
    ('m_185_39_ASQ','B*15'),
    ('m_147_73_DLF','B*08'),
    ('m_110_148_ELI','B*44'),
    ('m_51_546_AYK','A*03'),
    ('m_44_697_FPP','B*35'),
    ('m_118_136_ALN','A*11'),
    ('m_176_44_SST','A*11'),
    ('m_30_1064_KAY','B*57'),
    ('m_192_31_FQP','B*15'),
    ('m_70_345_DTD','A*01')]
    # <restrictions_dict> convert list to dictionary
    restrictions_dict = defaultdict(list)
    for k,v in restriction:
        restrictions_dict[k].append(v)
    # Loop through all files to construct a Dataframe of only those with strongest evidence of HLA-restriction
    cache = list()
    for f in all_files:
        rgs = re.search(pattern ='(mira_epitope)_([0-9]{1,3})_([0-9]{1,6})_([A-Z]{1,3})', 
            string = f).groups()    
        rgs4 = re.search(pattern ='(mira_epitope)_([0-9]{1,3})_([0-9]{1,6})_([A-Z]{1,4})', 
            string = f).groups()
        key3 = f'm_{rgs[1]}_{rgs[2]}_{rgs[3]}'
        key4 = f'm_{rgs4[1]}_{rgs4[2]}_{rgs4[3]}'
        setkey = f"M_{rgs[1]}"
        include = key3 in restrictions_dict.keys()
        alleles = restrictions_dict.get(key3)
        cache.append((setkey, key3, key4, f, int(rgs[1]), int(rgs[2]), include, alleles))

    # <hla_df>
    hla_df = pd.DataFrame(cache, columns = ['set', 'key3','key4','filename','set_rank','clones','hla_restricted','alleles']).\
        sort_values(['hla_restricted','clones'], ascending = True).\
        query('hla_restricted == True').reset_index(drop = True)

    for ind, row in hla_df.iterrows():
        print(ind)
        print(row)
        tag_level = '1E6'
        # <tag> the results with this symbol
        tag = f"{row['set']}_{tag_level}"
        ranked_centers_fn = f"{row['filename']}.ranked_centers_bkgd_ctlr_{tag_level}.tsv"
        benchmark_fn      = f"{row['filename']}.ranked_centers_bkgd_ctlr_{tag_level}.tsv.benchmark_tabulation.tsv"
        # <dest> place where .tsv files shall be saved
        dest = f'/fh/fast/gilbert_p/fg_data/tcrdist/t3/{tag}'
        if not os.path.isdir(dest):
            os.mkdir(dest)

        # <path> path where bulk files can be found 
        path = path_bulkfiles
        # <covid_smaples> list of all samples to consider>
        covid_samples = ["BS-EQ-09-T1-replacement_TCRB", "BS-EQ-0022-T0-replacement_TCRB", "860011133_TCRB", "BS-GIGI_86-replacement_TCRB", "BS-EQ-12-T1-replacement_TCRB", "BS-GIGI_44-replacement_TCRB", "KH20-09697_TCRB", "BS-GN-0007-T0-replacement_TCRB", "860011216_TCRB", "BS-EQ-0002-T1-replacement_TCRB", "KH20-09988_TCRB", "BS-GIGI_21-replacement_TCRB", "INCOV018-AC-5_TCRB", "860011205_TCRB", "BS-GN-06-T0-replacement_TCRB", "KHBR20-00107_TCRB", "550043156_TCRB", "550041451_TCRB", "KH20-09700_TCRB", "BS-EQ-25-T0-replacement_TCRB", "860011320_TCRB", "BS-GN-0008-T0-replacement_TCRB", "BS-EQ-0002-T2-replacement_TCRB", "860011277_TCRB", "KH20-11306_TCRB", "BS-EQ-10-T1-replacement_TCRB", "BS-EQ-18-T1-replacement_TCRB", "KH20-11638_TCRB", "860011493_TCRB", "BS-GN-0010-T0-replacement_TCRB", "860011220_TCRB", "KH20-09681_TCRB", "550041430_TCRB", "INCOV033-AC-3_TCRB", "KHBR20-00093_TCRB", "860011221_TCRB", "BS-GIGI_28-replacement_TCRB", "BS-GIGI_46-replacement_TCRB", "KH20-09665_TCRB", "BS-EQ-0018-T0-replacement_TCRB", "860011287_TCRB", "BS-GIGI_42-replacement_TCRB", "INCOV004-BL-5_TCRB", "KH20-11809_TCRB", "BS-EQ-0001-T1-replacement_TCRB", "KH20-09671_TCRB", "KHBR20-00111_TCRB", "BS-EQ-0027-T1-replacement_TCRB", "BS-EQ-0024-T0-replacement_TCRB", "BS-GN-13-T0-replacement_TCRB", "BS-EQ-11-T2-replacement_TCRB", "BS-GIGI_32-replacement_TCRB", "BS-EQ-31-T0-replacement_TCRB", "BS-EQ-0027-T2-replacement_TCRB", "BS-GIGI_06-replacement_TCRB", "KH20-09679_TCRB", "KH20-09676_TCRB", "860011462_TCRB", "BS-GIGI_56-replacement_TCRB", "KH20-09722_TCRB", "KH20-09751_TCRB", "550040014_TCRB", "BS-GIGI_96-replacement_TCRB", "BS-GN-0012-T0-replacement_TCRB", "KHBR20-00085_TCRB", "KH20-09667_TCRB", "INCOV086-AC-3_TCRB", "KH20-11307_TCRB", "BS-GIGI_49-replacement_TCRB", "INCOV020-AC-5_TCRB", "BS-GIGI_85-replacement_TCRB", "KH20-11650_TCRB", "BS-GIGI_15-replacement_TCRB", "BS-EQ-43-T0-replacement_TCRB", "BS-GIGI_50-replacement_TCRB", "INCOV013-AC-5_TCRB", "BS-EQ-0011-T1-replacement_TCRB", "BS-GIGI_66-replacement_TCRB", "860011140_TCRB", "BS-EQ-12-T2-replacement_TCRB", "KHBR20-00089_TCRB", "860011235_TCRB", "860011123_TCRB", "KHBR20-00121_TCRB", "KHBR20-00180_TCRB", "BS-GIGI_18-replacement_TCRB", "BS-GIGI_64-replacement_TCRB", "BS-EQ-0024-T1-replacement_TCRB", "550042607_TCRB", "KH20-11683_TCRB", "KHBR20-00179_TCRB", "BS-GN-0002-T0-replacement_TCRB", "550043906_TCRB", "BS-EQ-0021-T0-replacement_TCRB", "BS-EQ-07-T1-replacement_TCRB", "860011306_TCRB", "550043943_TCRB", "BS-GIGI_20-replacement_TCRB", "BS-GIGI_52-replacement_TCRB", "KH20-09970_TCRB", "860011129_TCRB", "KHBR20-00098_TCRB", "860011124_TCRB", "860011110_TCRB", "BS-GIGI_01-replacement_TCRB", "KH20-09670_TCRB", "BS-GIGI_72-replacement_TCRB", "BS-EQ-11-T3-replacement_TCRB", "860011120_TCRB", "INCOV008-BL-5_TCRB", "BS-GN-0004-T0-replacement_TCRB", "INCOV028-AC-3_TCRB", "KH20-11295_TCRB", "860011214_TCRB", "BS-GIGI_13-replacement_TCRB", "860011204_TCRB", "BS-GIGI_87-replacement_TCRB", "860011215_TCRB", "860011202_TCRB", "INCOV005-BL-5_TCRB", "BS-GIGI_10-replacement_TCRB", "860011489_TCRB", "550043905_TCRB", "BS-GN-01-T0-replacement_TCRB", "550042578_TCRB", "KHBR20-00118_TCRB", "KH20-09655_TCRB", "BS-EQ-0016-T1-replacement_TCRB", "KHBR20-00106_TCRB", "BS-GIGI_62-replacement_TCRB", "BS-EQ-37-T0-replacement_TCRB", "860011244_TCRB", "KHBR20-00157_TCRB", "BS-GIGI_80-replacement_TCRB", "KHBR20-00182_TCRB", "KHBR20-00206_TCRB", "860011318_TCRB", "KH20-11645_TCRB", "KH20-09673_TCRB", "550040140_TCRB", "BS-GIGI_36-replacement_TCRB", "KHBR20-00171_TCRB", "BS-GIGI_81-replacement_TCRB", "860011225_TCRB", "550043911_TCRB", "BS-GIGI_03-replacement_TCRB", "KH20-09698_TCRB", "BS-EQ-29-T0-replacement_TCRB", "860011239_TCRB", "KH20-11303_TCRB", "550041293_TCRB", "INCOV028-BL-3_TCRB", "INCOV004-AC-5_TCRB", "110047437_TCRB", "860011201_TCRB", "BS-EQ-23-T1-replacement_TCRB", "KHBR20-00119_TCRB", "KHBR20-00149_TCRB", "BS-EQ-0017-T0-replacement_TCRB", "KH20-11810_TCRB", "BS-GIGI_45-replacement_TCRB", "550042567_TCRB", "860011109_TCRB", "KH20-09752_TCRB", "KH20-09675_TCRB", "KH20-09987_TCRB", "BS-GIGI_41-replacement_TCRB", "BS-EQ-0008-T1-replacement_TCRB", "550042420_TCRB", "KH20-09747_TCRB", "KHBR20-00144_TCRB", "KH20-11675_TCRB", "INCOV031-BL-3_TCRB", "KH20-11697_TCRB", "KHBR20-00141_TCRB", "KH20-09948_TCRB", "INCOV036-AC-3_TCRB", "550041421_TCRB", "BS-EQ-0026-T0-replacement_TCRB", "BS-GIGI_08-replacement_TCRB", "BS-EQ-29-T1-replacement_TCRB", "KHBR20-00160_TCRB", "860011322_TCRB", "KHBR20-00165_TCRB", "860011327_TCRB", "KH20-09674_TCRB", "KH20-09661_TCRB", "BS-GIGI_16-replacement_TCRB", "BS-EQ-43-T1_BS-GIGI-137-replacement_TCRB", "860011117_TCRB", "INCOV033-BL-3_TCRB", "KHBR20-00148_TCRB", "BS-GIGI_79-replacement_TCRB", "550042515_TCRB", "550042526_TCRB", "INCOV003-BL-5_TCRB", "BS-EQ-35-T0-replacement_TCRB", "550042550_TCRB", "BS-GN-0011-T0-replacement_TCRB", "BS-EQ-07-T2-replacement_TCRB", "KHBR20-00103_TCRB", "INCOV030-AC-3_TCRB", "KH20-09997_TCRB", "BS-GIGI_61-replacement_TCRB", "KHBR20-00161_TCRB", "KH20-09996_TCRB", "INCOV030-BL-3_TCRB", "KHBR20-00156_TCRB", "KHBR20-00137_TCRB", "550041125_TCRB", "KHBR20-00130_TCRB", "860011498_TCRB", "KH20-11698_TCRB", "BS-EQ-17-T1b-replacement_TCRB", "KHBR20-00105_TCRB", "550043986_TCRB", "INCOV071-BL-3_TCRB", "KH20-11304_TCRB", "BS-EQ-0023-T0-replacement_TCRB", "KH20-11692_TCRB", "550042377_TCRB", "KH20-09720_TCRB", "BS-GIGI_35-replacement_TCRB", "KHBR20-00113_TCRB", "860011105_TCRB", "KH20-11294_TCRB", "KHBR20-00167_TCRB", "INCOV031-AC-3_TCRB", "KH20-09666_TCRB", "KHBR20-00081_TCRB", "BS-GIGI_70-replacement_TCRB", "BS-GIGI_55-replacement_TCRB", "INCOV009-BL-5_TCRB", "BS-EQ-34-T0-replacement_TCRB", "BS-GIGI_34-replacement_TCRB", "KHBR20-00185_TCRB", "BS-EQ-33-T0-replacement_TCRB", "BS-GIGI_19-replacement_TCRB", "KHBR20-00126_TCRB", "550042351_TCRB", "KH20-11647_TCRB", "BS-GIGI_74-replacement_TCRB", "860011229_TCRB", "KH20-11640_TCRB", "BS-GN-0005-T0-replacement_TCRB", "KHBR20-00083_TCRB", "KH20-09735_TCRB", "550042361_TCRB", "KH20-09963_TCRB", "KHBR20-00166_TCRB", "KH20-11695_TCRB", "KHBR20-00096_TCRB", "BS-GIGI_07-replacement_TCRB", "KH20-11811_TCRB", "INCOV069-BL-3_TCRB", "BS-GIGI_05-replacement_TCRB", "BS-GIGI_40-replacement_TCRB", "INCOV053-AC-3_TCRB", "860011116_TCRB", "INCOV016-BL-5_TCRB", "860011112_TCRB", "INCOV034-BL-3_TCRB", "INCOV032-BL-3_TCRB", "KH20-09664_TCRB", "550041314_TCRB", "860011206_TCRB", "550042640_TCRB", "INCOV017-BL-5_TCRB", "BS-EQ-10-T2-replacement_TCRB", "KHBR20-00172_TCRB", "KH20-11813_TCRB", "550043940_TCRB", "550041390_TCRB", "INCOV015-BL-5_TCRB", "KH20-09684_TCRB", "BS-EQ-36-T0-replacement_TCRB", "KH20-11637_TCRB", "BS-EQ-39-T0-replacement_TCRB", "BS-EQ-16-T2-replacement_TCRB", "KHBR20-00153_TCRB", "BS-GIGI_92-replacement_TCRB", "BS-GIGI_91-replacement_TCRB", "860011492_TCRB", "KHBR20-00117_TCRB", "550041499_TCRB", "KHBR20-00181_TCRB", "KHBR20-00173_TCRB", "KHBR20-00168_TCRB", "KHBR20-00123_TCRB", "KH20-09945_TCRB", "BS-GN-0014-T0-replacement_TCRB", "550043912_TCRB", "KH20-09955_TCRB", "KH20-11298_TCRB", "550043995_TCRB", "KHBR20-00170_TCRB", "INCOV070-BL-3_TCRB", "KH20-11305_TCRB", "KHBR20-00203_TCRB", "KHBR20-00188_TCRB", "BS-GIGI_53-replacement_TCRB", "INCOV036-BL-3_TCRB", "KH20-09959_TCRB", "KHBR20-00145_TCRB", "BS-EQ-33-T1-replacement_TCRB", "KH20-09677_TCRB", "INCOV087-BL-3_TCRB", "INCOV084-BL-3_TCRB", "KH20-11642_TCRB", "KH20-09687_TCRB", "KH20-09724_TCRB", "KH20-11687_TCRB", "BS-EQ-14-T3-replacement_TCRB", "KH20-11688_TCRB", "INCOV016-AC-5_TCRB", "BS-EQ-38-T1_BS-GIGI-117-replacement_TCRB", "INCOV075-AC-3_TCRB", "KHBR20-00162_TCRB", "550043973_TCRB", "550042259_TCRB", "550042183_TCRB", "BS-EQ-18-T2-replacement_TCRB", "KH20-11672_TCRB", "KH20-11643_TCRB", "KH20-11639_TCRB", "BS-GIGI_26-replacement_TCRB", "KH20-09696_TCRB", "KH20-09952_TCRB", "860011325_TCRB", "550043955_TCRB", "KHBR20-00177_TCRB", "BS-EQ-15-T1-replacement_TCRB", "BS-EQ-34-T1-replacement_TCRB", "860011312_TCRB", "KH20-11671_TCRB", "KHBR20-00196_TCRB", "KH20-09695_TCRB", "860011490_TCRB", "INCOV005-AC-5_TCRB", "550043980_TCRB", "KH20-09734_TCRB", "KH20-09750_TCRB", "KHBR20-00094_TCRB", "860011243_TCRB", "KHBR20-00143_TCRB", "BS-EQ-28-T1-replacement_TCRB", "KH20-09947_TCRB", "860011309_TCRB", "860011450_TCRB", "INCOV049-AC-3_TCRB", "860011310_TCRB", "860011113_TCRB", "KH20-09964_TCRB", "KH20-09991_TCRB", "KH20-11292_TCRB", "860011131_TCRB", "KHBR20-00200_TCRB", "KHBR20-00078_TCRB", "KHBR20-00186_TCRB", "BS-GIGI_27-replacement_TCRB", "550041250_TCRB", "KH20-09962_TCRB", "860011127_TCRB", "KH20-09999_TCRB", "860011338_TCRB", "KHBR20-00076_TCRB", "860011137_TCRB", "KHBR20-00204_TCRB", "550042523_TCRB", "INCOV017-AC-5_TCRB", "550043908_TCRB", "KHBR20-00205_TCRB", "KHBR20-00158_TCRB", "KH20-09708_TCRB", "860011218_TCRB", "860011382_TCRB", "BS-EQ-36-T1-replacement_TCRB", "860011125_TCRB", "BS-EQ-41-T1-replacement_TCRB", "BS-EQ-33-T2-replacement_TCRB", "860011132_TCRB", "550043971_TCRB", "860011246_TCRB", "INCOV006-BL-5_TCRB", "KHBR20-00134_TCRB", "550041349_TCRB", "KH20-10000_TCRB", "KH20-09718_TCRB", "KH20-09694_TCRB", "550042609_TCRB", "860011383_TCRB", "KH20-09741_TCRB", "550043927_TCRB", "KH20-11807_TCRB", "KH20-09668_TCRB", "KH20-11682_TCRB", "550042599_TCRB", "KHBR20-00127_TCRB", "550042631_TCRB", "KHBR20-00102_TCRB", "KH20-11814_TCRB", "KH20-09992_TCRB", "KHBR20-00178_TCRB", "KH20-09701_TCRB", "860011240_TCRB", "BS-EQ-41-T0-replacement_TCRB", "INCOV014-BL-5_TCRB", "KH20-11659_TCRB", "INCOV007-BL-5_TCRB", "KH20-09685_TCRB", "BS-EQ-28-T2-replacement_TCRB", "INCOV035-AC-3_TCRB", "INCOV011-AC-5_TCRB", "BS-GN-0016-T0-replacement_TCRB", "KHBR20-00152_TCRB", "BS-GN-0009-T0-replacement_TCRB", "INCOV025-AC-3_TCRB", "KHBR20-00128_TCRB", "550043167_TCRB", "INCOV051-AC-3_TCRB", "860011314_TCRB", "KHBR20-00108_TCRB", "KH20-09704_TCRB", "KH20-09749_TCRB", "KH20-09956_TCRB", "550042447_TCRB", "860011212_TCRB", "KH20-11670_TCRB", "BS-EQ-0028-T0-replacement_TCRB", "550040030_TCRB", "KH20-09746_TCRB", "KH20-11296_TCRB", "860011233_TCRB", "860011139_TCRB", "BS-GN-0003-T0-replacement_TCRB", "INCOV073-BL-3_TCRB", "860011203_TCRB", "KH20-09971_TCRB", "INCOV007-AC-5_TCRB", "KHBR20-00135_TCRB", "KH20-11291_TCRB", "860011241_TCRB", "KH20-09736_TCRB", "KHBR20-00201_TCRB", "BS-EQ-26-T1-replacement_TCRB", "KH20-11657_TCRB", "860011231_TCRB", "860011227_TCRB", "KHBR20-00082_TCRB", "KH20-09657_TCRB", "KH20-09966_TCRB", "KH20-09725_TCRB", "KH20-11681_TCRB", "550042442_TCRB", "KHBR20-00109_TCRB", "860011210_TCRB", "KH20-09985_TCRB", "860011238_TCRB", "550044028_TCRB", "860011445_TCRB", "550042400_TCRB", "BS-EQ-30-T2-replacement_TCRB", "KHBR20-00084_TCRB", "550042395_TCRB", "KHBR20-00136_TCRB", "KH20-11641_TCRB", "KH20-11302_TCRB", "KH20-11676_TCRB", "BS-EQ-0014-T1-replacement_TCRB", "KH20-11673_TCRB", "860011111_TCRB", "KH20-09967_TCRB", "KH20-09984_TCRB", "KH20-11300_TCRB", "KH20-11297_TCRB", "860011228_TCRB", "KH20-11301_TCRB", "KH20-11646_TCRB", "KHBR20-00151_TCRB", "KH20-11661_TCRB", "KH20-11299_TCRB", "KHBR20-00131_TCRB", "KH20-09709_TCRB", "860011496_TCRB", "KH20-09946_TCRB", "KHBR20-00163_TCRB", "550041393_TCRB", "KH20-09690_TCRB", "KHBR20-00080_TCRB", "KH20-09753_TCRB", "KH20-09669_TCRB", "KHBR20-00139_TCRB", "KH20-09691_TCRB", "INCOV019-AC-5_TCRB", "860011217_TCRB", "KH20-09980_TCRB", "INCOV034-AC-3_TCRB", "KH20-09951_TCRB", "BS-EQ-0003-T1-replacement_TCRB", "KH20-09978_TCRB", "KH20-11660_TCRB", "KH20-09961_TCRB", "INCOV035-BL-3_TCRB", "KH20-09990_TCRB", "860011303_TCRB", "KHBR20-00184_TCRB", "KH20-09954_TCRB", "INCOV044-AC-3_TCRB", "KHBR20-00154_TCRB", "KHBR20-00183_TCRB", "KHBR20-00122_TCRB", "KH20-09993_TCRB", "KH20-09748_TCRB", "550042547_TCRB", "KH20-09689_TCRB", "KH20-09727_TCRB", "BS-EQ-0014-T2-replacement_TCRB", "860011323_TCRB", "KH20-11812_TCRB", "KHBR20-00091_TCRB", "INCOV048-BL-3_TCRB", "KH20-09977_TCRB", "KH20-09986_TCRB", "KH20-09745_TCRB", "860011115_TCRB", "KHBR20-00087_TCRB", "550043914_TCRB", "550043981_TCRB", "860011499_TCRB", "INCOV054-BL-3_TCRB", "KHBR20-00100_TCRB", "550042239_TCRB", "KH20-11649_TCRB", "KH20-09953_TCRB", "860011304_TCRB", "KHBR20-00133_TCRB", "550042210_TCRB", "INCOV051-BL-3_TCRB", "860011497_TCRB", "KHBR20-00116_TCRB", "KH20-09715_TCRB", "BS-EQ-30-T1-replacement_TCRB", "860011348_TCRB", "860011118_TCRB", "KH20-09975_TCRB", "550043928_TCRB", "KH20-09714_TCRB", "860011242_TCRB", "KHBR20-00079_TCRB", "550041441_TCRB", "KH20-09960_TCRB", "860011326_TCRB", "KHBR20-00191_TCRB", "BS-GIGI_75-replacement_TCRB", "KH20-09968_TCRB", "KH20-09723_TCRB", "KH20-09979_TCRB", "KH20-11658_TCRB", "INCOV047-BL-3_TCRB", "INCOV002-AC-5_TCRB", "KH20-09731_TCRB", "INCOV029-AC-3_TCRB", "KH20-11674_TCRB", "110047542_TCRB", "KH20-09743_TCRB", "KH20-11678_TCRB", "KH20-11689_TCRB", "KH20-09699_TCRB", "KH20-09726_TCRB", "860011336_TCRB", "860011256_TCRB", "KHBR20-00097_TCRB", "INCOV006-AC-5_TCRB", "KHBR20-00086_TCRB", "KH20-11651_TCRB", "KHBR20-00125_TCRB", "KH20-09755_TCRB", "KH20-11655_TCRB", "KH20-09683_TCRB", "KH20-09976_TCRB", "860011280_TCRB", "INCOV047-AC-3_TCRB", "KH20-11684_TCRB", "860011219_TCRB", "KH20-09654_TCRB", "INCOV044-BL-3_TCRB", "KH20-09738_TCRB", "INCOV015-AC-5_TCRB", "INCOV050-AC-3_TCRB", "860011494_TCRB", "KH20-11662_TCRB", "KH20-11665_TCRB", "INCOV062-BL-3_TCRB", "860011226_TCRB", "KH20-09702_TCRB", "860011354_TCRB", "860011236_TCRB", "KHBR20-00095_TCRB", "860011209_TCRB", "860011317_TCRB", "KH20-09711_TCRB", "KH20-11699_TCRB", "KH20-09754_TCRB", "KHBR20-00190_TCRB", "860011388_TCRB", "860011356_TCRB", "550043920_TCRB", "KHBR20-00099_TCRB", "KHBR20-00129_TCRB", "KH20-11696_TCRB", "KH20-11686_TCRB", "KH20-11664_TCRB", "INCOV026-BL-3_TCRB", "KH20-11667_TCRB", "KH20-11666_TCRB", "860011347_TCRB", "KH20-11648_TCRB", "KH20-09739_TCRB", "550044000_TCRB", "KH20-09972_TCRB", "KH20-11668_TCRB", "860011138_TCRB", "860011208_TCRB", "550041239-1_TCRB", "550042542_TCRB", "860011223_TCRB", "KHBR20-00092_TCRB", "INCOV052-BL-3_TCRB", "INCOV050-BL-3_TCRB", "860011311_TCRB", "550044029_TCRB", "INCOV003-AC-5_TCRB", "KHBR20-00088_TCRB", "860011264_TCRB", "INCOV012-AC-5_TCRB", "INCOV001-AC-5_TCRB", "INCOV002-BL-5_TCRB", "KH20-11654_TCRB", "KHBR20-00115_TCRB", "KH20-09707_TCRB", "860011392_TCRB", "KH20-11656_TCRB", "KH20-09717_TCRB", "860011284_TCRB", "KHBR20-00174_TCRB", "KH20-11653_TCRB", "KH20-09719_TCRB", "KH20-09958_TCRB", "INCOV026-AC-3_TCRB", "860011119_TCRB", "KH20-09981_TCRB", "KH20-11652_TCRB", "860011250_TCRB", "KH20-09995_TCRB", "INCOV059-AC-3_TCRB", "860011282_TCRB", "KH20-11669_TCRB", "550041173_TCRB", "KH20-09973_TCRB", "KH20-09706_TCRB", "860011122_TCRB", "KH20-09969_TCRB", "KH20-11293_TCRB", "INCOV062-AC-3_TCRB", "860011340_TCRB", "INCOV055-BL-3_TCRB", "860011224_TCRB", "KH20-09658_TCRB", "KHBR20-00104_TCRB", "860011211_TCRB", "KH20-11679_TCRB", "860011130_TCRB", "860011495_TCRB", "INCOV027-BL-3_TCRB", "KH20-09693_TCRB", "KHBR20-00146_TCRB", "860011106_TCRB", "KH20-09721_TCRB", "INCOV077-BL-3_TCRB", "KH20-09716_TCRB", "INCOV027-AC-3_TCRB", "860011121_TCRB", "KH20-11644_TCRB", "INCOV045-BL-3_TCRB", "860011330_TCRB", "INCOV023-AC-3_TCRB", "860011260_TCRB", "KH20-09994_TCRB", "KH20-09744_TCRB", "KH20-09712_TCRB", "860011488_TCRB", "550042451_TCRB", "INCOV078-BL-3_TCRB", "KHBR20-00164_TCRB"]
        # <max number of cpus to use>
        ncpus = 2
        # <df_search> dataframe containing metaclonotypes
        df_search = pd.read_csv(os.path.join('hla_restricted_meta_clonotypes', ranked_centers_fn), sep = "\t")
        if df_search.shape[0] > 0:
            df_search = df_search[['cdr3_b_aa','v_b_gene','j_b_gene','pgen','radius','regex']]
            # <all_files> list of all files in the project <path>
            all_files = [f for f in os.listdir(path) if f.endswith('v_max.tsv')]
            # add file size
            all_files = [ (f, f.replace('.tsv.tcrdist3.v_max.tsv',''), os.stat(os.path.join(path,f)).st_size) for f in all_files ]
            # sort ascending by file size, smaller files first
            all_files = sorted(all_files, key=lambda x: x[2])
            # <all_files_ms> is <all_files> subset to only the samples with 30 days of COVID19 diagnosis, and not including ADPATIVE-COVID19 training samples
            all_files_ms = [(x,y,z) for x,y,z in all_files if y in covid_samples] 
            ts_iterable = [f for f,sn,_  in all_files_ms]
            import parmap
            timings = parmap.map(do_search2, ts_iterable, df_search = df_search.copy(), dest = dest, tag = tag, pm_processes=6, pm_pbar = True)
            pd.DataFrame({'file':ts_iterable, 'seconds':timings}).to_csv(os.path.join('hla_restricted_meta_clonotypes', benchmark_fn), sep = "\t")


# 8888888888 888b    888 8888888b.  
# 888        8888b   888 888  "Y88b 
# 888        88888b  888 888    888 
# 8888888    888Y88b 888 888    888 
# 888        888 Y88b888 888    888 
# 888        888  Y88888 888    888 
# 888        888   Y8888 888  .d88P 
# 8888888888 888    Y888 8888888P"

Tabulating Meta-Clonotypes

这部分提供了如何使用 tcrdist.repertoire.TCRrep.compute_sparse_rect_distances() 和 tcrdist.join.join_by_dist() 在批量 TCRb Repertoire 中制表元克隆型一致克隆的分步说明。
Meta-clonotypes can be learned from antigen-associated TCR data collected via tetramer sorting or activation marker enrichment. However, a single TCR may be conformant multiple meta-clonotypes, which should be a consideration when applying a set of meta-clonotypes together. For instance, tallying conformant TCRs in a repertoire should avoid counting a TCR more than once. This example illustrates an efficient approach to tabulating meta-clonotypes conformant sequences in bulk repertoires.(不知道大家晕了没??)

Determine the appropriate number of CPUS based on your system

ncpus = min(multiprocessing.cpu_count(), 6)
完整的代码
import multiprocessing
import numpy as np
import os
import pandas as pd
from tcrdist.setup_tests import download_and_extract_zip_file
from tcrdist.repertoire import TCRrep
from tcrdist.breadth import get_safe_chunk
from tcrdist.join import join_by_dist
import re

ncpus = min(multiprocessing.cpu_count(), 6)
files = ['1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv']
if not np.all([os.path.isfile(f) for f in files]):
    download_and_extract_zip_file("ImmunoSeq_MIRA_matched_tcrdist3_ready_2_files.zip", source = "dropbox", dest = ".")
if not os.path.isfile("bioRxiv_v2_metaclonotypes.tsv.zip"):
    download_and_extract_zip_file('bioRxiv_v2_metaclonotypes.tsv.zip', source = "dropbox", dest = ".")

df_search = pd.read_csv("bioRxiv_v2_metaclonotypes.tsv", sep = "\t")
f = '1588BW_20200417_PBMC_unsorted_cc1000000_ImmunRACE_050820_008_gDNA_TCRB.tsv.tcrdist3.tsv'
df_bulk = pd.read_csv(f, sep = "\t")
# When one want to track each clone indivually regardless of identical TRBV-CDR3-TRBJ
df_bulk = df_bulk.sort_values('count').reset_index(drop = True)

df_bulk['rank'] = df_bulk.index.to_list()

from tcrdist.repertoire import TCRrep
tr = TCRrep(
    cell_df = df_search,
    organism = "human",
    chains = ['beta'],
    compute_distances= False)
tr.cpus = ncpus

tr_bulk = TCRrep(
    cell_df = df_bulk,
    organism = "human",
    chains = ['beta'],
    compute_distances= False)

chunk_size = get_safe_chunk(tr.clone_df.shape[0], tr_bulk.clone_df.shape[0])
tr.compute_sparse_rect_distances(
    df = tr.clone_df,
    df2 = tr_bulk.clone_df,
    radius = 36,
    chunk_size = chunk_size)

df_join = join_by_dist(
    how = 'inner',
    csrmat = tr.rw_beta,
    left_df = tr.clone_df,
    right_df = tr_bulk.clone_df,
    left_cols  = tr.clone_df.columns.to_list(),
    right_cols = tr_bulk.clone_df.columns.to_list(),
    left_suffix = '_search',
    right_suffix = '_bulk',
    max_n= 1000,
    radius = 36)

df_join['RADIUS'] = df_join.apply(lambda x: x['dist'] <= x['radius_search'], axis = 1)
import re
df_join['MOTIF'] = df_join.apply(lambda x: re.search(string = x['cdr3_b_aa_bulk'],
    pattern = x['regex_search']) is not None, axis = 1)

df_join['RADIUSANDMOTIF'] =  df_join['RADIUS'] & df_join['MOTIF']
df_join['unique_clones'] = 1

df_join.query('RADIUSANDMOTIF').\
    sort_values('dist', ascending = True).\
    groupby(['rank_bulk']).\
    head(1).\
    groupby('protein_search').\
    sum()[['count_bulk', 'templates_bulk','unique_clones']]

绘制树形结构(CD3 motif)

"""
All imports are provided here, and are repeated 
step-wise below, for clarity, and for
module cut-and-paste. This example
performs paired alpha-beta analysis,
but code blocks can be used for single
chain analysis as well.
"""
import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.rep_diff import hcluster_diff, member_summ
from tcrsampler.sampler import TCRsampler
from tcrdist.adpt_funcs import get_centroid_seq
from tcrdist.summarize import _select
from palmotif import compute_pal_motif, svg_logo
from hierdiff import plot_hclust_props
"""
Load a subset of data that contains paired alpha-beta
chain mouse TCR receptors that recognized 
the PA or PB1 epitopes (present in mouse influenza). 
"""
import pandas as pd
df = pd.read_csv("dash.csv")
conditional = df['epitope'].apply( lambda x: x in ['PA','PB1'])
"""
For illustrative/testing purposes, randomly subset the data to include 
only 100 clones. Increase for more informative plot.
"""
df = df[conditional].\
    reset_index(drop = True).\
    sample(100, random_state = 3).\
    reset_index(drop = True).\
    copy()
"""
Load DataFrame into TCRrep instance, 
which automatically computes attributes:
1. .clone_df DataFrame
2. .pw_beta nd.array 
3. .pw_alpha nd.array 
"""
from tcrdist.repertoire import TCRrep
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['beta','alpha'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

"""
Apply hcluster_diff, which hierarchically clusters.

Note
----
pwmat could easily be tr.pw_beta or tr.pw_alpha if 
clustering should be done on a single chain.
"""
from tcrdist.rep_diff import hcluster_diff
tr.hcluster_df, tr.Z =\
    hcluster_diff(clone_df = tr.clone_df, 
                  pwmat    = tr.pw_beta + tr.pw_alpha,
                  x_cols = ['epitope'], 
                  count_col = 'count')

"""
Load a custom background, mouse appropriate dataset to sample CDR3s 
according to the V and J gene usage frequencies observed in each node.
See the tcrsampler package for more details 
(https://github.com/kmayerb/tcrsampler/blob/master/docs/getting_default_backgrounds.md)
"""
from tcrsampler.sampler import TCRsampler

t = TCRsampler()
t.download_background_file("ruggiero_mouse_sampler.zip")
tcrsampler_beta = TCRsampler(default_background = 'ruggiero_mouse_beta_t.tsv.sampler.tsv')
tcrsampler_alpha = TCRsampler(default_background = 'ruggiero_mouse_alpha_t.tsv.sampler.tsv')

"""
Add an SVG graphic to every node of the tree 
aligned to the cluster centroid.
"""
from tcrdist.adpt_funcs import get_centroid_seq
from tcrdist.summarize import _select
from palmotif import compute_pal_motif, svg_logo

"""Beta Chain"""
svgs_beta = list()
for i,r in tr.hcluster_df.iterrows():

    dfnode = tr.clone_df.iloc[r['neighbors_i'],]
    if dfnode.shape[0] > 2:
        centroid, *_ = get_centroid_seq(df = dfnode)
    else:
        centroid = dfnode['cdr3_b_aa'].to_list()[0]
    print(f"BETA-CHAIN: {centroid}")

    gene_usage_beta = dfnode.groupby(['v_b_gene','j_b_gene']).size()
    sampled_rep = tcrsampler_beta.sample( gene_usage_beta.reset_index().to_dict('split')['data'],
                    flatten = True, depth = 10)
    sampled_rep  = [x for x in sampled_rep if x is not None]
    motif, stat = compute_pal_motif(
                    seqs = _select(df = tr.clone_df, 
                                   iloc_rows = r['neighbors_i'], 
                                   col = 'cdr3_b_aa'),
                    refs = sampled_rep, 
                    centroid = centroid)
    
    svgs_beta.append(svg_logo(motif, return_str= True))

"""Add Beta SVG graphics to hcluster_df"""
tr.hcluster_df['svg_beta'] = svgs_beta


"""Alpha Chain"""
svgs_alpha = list()
for i,r in tr.hcluster_df.iterrows():

    dfnode = tr.clone_df.iloc[r['neighbors_i'],]
    if dfnode.shape[0] > 2:
        centroid, *_ = get_centroid_seq(df = dfnode)
    else:
        centroid = dfnode['cdr3_a_aa'].to_list()[0]
    print(f"ALPHA-CHAIN: {centroid}")
    gene_usage_alpha = dfnode.groupby(['v_a_gene','j_a_gene']).size()
    sampled_rep = tcrsampler_alpha.sample( gene_usage_alpha.reset_index().to_dict('split')['data'], 
                    flatten = True, depth = 10)
    
    sampled_rep  = [x for x in sampled_rep if x is not None]
    motif, stat = compute_pal_motif(
                    seqs = _select(df = tr.clone_df, 
                                   iloc_rows = r['neighbors_i'], 
                                   col = 'cdr3_a_aa'),
                    refs = sampled_rep, 
                    centroid = centroid)

    svgs_alpha.append(svg_logo(motif, return_str= True))

"""Add Alpha SVG graphics to hcluster_df"""
tr.hcluster_df['svg_alpha'] = svgs_alpha
"""
Produce summary information for tooltips. 
For instance, describe percentage of TCRs with 
a given epitope at a given node.
"""
res_summary = member_summ(  res_df = tr.hcluster_df,
                            clone_df = tr.clone_df, 
                            addl_cols=['epitope'])

tr.hcluster_df_detailed = \
    pd.concat([tr.hcluster_df, res_summary], axis = 1)
"""
Write D3 html for interactive denogram graphic. 
Specify desired tooltips.
"""
from hierdiff import plot_hclust_props
html = plot_hclust_props(tr.Z,
            title='PA Epitope Example',
            res=tr.hcluster_df_detailed,
            tooltip_cols=['cdr3_b_aa','v_b_gene', 'j_b_gene','svg_alpha','svg_beta'],
            alpha=0.00001, colors = ['blue','gray'],
            alpha_col='pvalue')

with open('hierdiff_example_PA_v_PB1.html', 'w') as fh:
    fh.write(html)
图片.png
移动到图片的相应位置可以看具体的信息。

最后可视化,就一个函数tcrdist.gene_pairing_plot.plot_pairings

图片.png
import tcrdist as td
import pandas as pd
import numpy as np
import IPython
from tcrdist import plotting

np.random.seed(110820)
n = 50
df = pd.DataFrame({'v_a_gene':np.random.choice(['TRAV14', 'TRAV12', 'TRAV3',
                                'TRAV23', 'TRAV11', 'TRAV6', 'TRAV89'], n),
                   'j_a_gene':np.random.choice(['TRAJ4', 'TRAJ2', 'TRAJ3',
                                'TRAJ5', 'TRAJ21', 'TRAJ13'], n),
                   'v_b_gene':np.random.choice(['TRBV14', 'TRBV12',
                                'TRBV3'], n),
                   'j_b_gene':np.random.choice(['TRBJ4', 'TRBJ2', 'TRBJ3',
                                'TRBJ5', 'TRBJ21', 'TRBJ13'], n)})

df = df.assign(count=1)
df.loc[:10, 'count'] = 10 # Sim expansion of the genes used in the first 10 rows

svg = plotting.plot_pairings(cell_df = df,
                                cols = ['j_a_gene', 'v_a_gene',
                                        'v_b_gene', 'j_b_gene'],
                                count_col='count')
IPython.display.SVG(data=svg)
图片.png
svg = td.plotting.plot_pairings(cell_df = df,
                               cols = ['v_a_gene', 'j_a_gene'],
                               count_col='count')
IPython.display.SVG(data=svg)
图片.png
import tcrdist as td
import pandas as pd
import numpy as np
import IPython
from tcrdist import plotting

pd_df = pd.read_csv("vdjDB_PMID28636592.tsv", sep = "\t")       # 1
t_df = td.mappers.vdjdb_to_tcrdist2(pd_df = pd_df)              # 2

organism =  "MusMusculus"                                       # 3
epitope  =  "PA"

ind_org   = t_df.organism == organism
ind_epi   = t_df.epitope  == epitope

t_df_spec = t_df.loc[(ind_org)&(ind_epi),:]

svg = plotting.plot_pairings(cell_df = t_df_spec,
                                cols = ['j_a_gene', 'v_a_gene',
                                        'v_b_gene', 'j_b_gene'],
                                count_col='count',
                                other_frequency_threshold = 0.01)
IPython.display.SVG(data=svg)
图片.png

真的非常多,但这还没到我们这个专题的一半,看来难度有点高

生活很好,有你更好

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,657评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,662评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,143评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,732评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,837评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,036评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,126评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,868评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,315评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,641评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,773评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,470评论 4 333
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,126评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,859评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,095评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,584评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,676评论 2 351