This directory contains the scripts used to generate the figures and results for the paper "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets".
The core MetaTCR Python package and usage instructions are available at https://github.com/deepomicslab/MetaTCR/tree/main.
The scripts are organized by the section of the paper they correspond to. Below is a description of each directory and the experiments they contain.
Corresponds to Figure 1
These scripts quantify the baseline technical divergence between datasets.
- dataset_feature_stats/:
  - Shannon Entropy: Calculates and compares repertoire diversity across cohorts (healthy_datasets_shannon.py, melanoma_datasets_shannon.py).
  - k-mer Distribution: Performs PCA on CDR3 k-mer counts to visualize batch-driven clustering (*_kmer_distribution2x2.py).
- cdr3_analysis/:
  - Clonotype Overlap: Analyzes the fraction of shared clonotypes within vs. across studies to demonstrate sparsity (plot_shared_cdr3_by_datasets_heatmap.py).
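As an illustration of the diversity measure named above, the following is a minimal sketch of Shannon entropy computed over clonotype frequencies. It is not the code in healthy_datasets_shannon.py; the function name `shannon_entropy` and the toy CDR3 strings are hypothetical, and the choice of base-2 logarithms is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(clonotypes):
    """Shannon entropy (in bits) of a repertoire, given one CDR3 string per read."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy repertoires, not taken from any dataset in the paper.
even_repertoire = ["CASSL", "CASSP", "CASSQ", "CASSR"]
skewed_repertoire = ["CASSL", "CASSL", "CASSL", "CASSP"]

print(shannon_entropy(even_repertoire))    # four equally frequent clonotypes
print(shannon_entropy(skewed_repertoire))  # one dominant clonotype lowers the entropy
```

A repertoire dominated by a few expanded clones scores lower than an evenly distributed one, which is why the measure is sensitive to sequencing depth and protocol differences between cohorts.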
Scripts for constructing the "Referenced TCR Space" and determining the optimal granularity.
- determin_optimal_cluster_num_by_antigen.py: Benchmarks different cluster numbers ($k$) using antigen specificity and epitope purity metrics.
- spectral_robustness_k96.py: Evaluates the stability and robustness of the chosen $k=96$ functional clusters.
- function_cluster_by_spectral.py: Implementation of the spectral clustering step for reference construction.
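To make the idea of an epitope purity metric concrete, here is a minimal sketch under one common definition: a cluster's purity is the fraction of its labeled TCRs that share the cluster's most frequent epitope. The exact metric used in determin_optimal_cluster_num_by_antigen.py may differ; the function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, epitope_labels):
    """Return (overall purity, per-cluster purity) for epitope-labeled TCRs."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, epitope_labels):
        members[cid].append(label)

    per_cluster = {}
    majority_total = 0
    for cid, labels in members.items():
        # Size of the largest epitope group inside this cluster.
        top_count = Counter(labels).most_common(1)[0][1]
        per_cluster[cid] = top_count / len(labels)
        majority_total += top_count

    # Overall purity weights each cluster by its size.
    return majority_total / len(cluster_ids), per_cluster


# Hypothetical cluster assignments and epitope labels for six TCRs.
cluster_ids = [0, 0, 0, 1, 1, 1]
epitopes = ["NLV", "NLV", "GIL", "GIL", "GIL", "GIL"]

overall, per_cluster = cluster_purity(cluster_ids, epitopes)
print(overall, per_cluster)
```

Purity alone increases as $k$ grows, since smaller clusters are trivially purer, so a benchmark over $k$ would normally pair it with a second criterion.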
Corresponds to Figure 3
Validates that the Meta-vector representation retains biological signals while exposing technical noise.
- biological_rep/: Analysis of the Genolet2023 dataset to quantify technical noise between biological replicates across batches/platforms (Genolet2023_data_analysis_abundance.py).
- robustness_Sherwood2015/: Longitudinal analysis of the Sherwood2015 dataset to demonstrate intra-individual stability of meta-vectors (meta_vec_individual_robunstness.py).
- feature_compare/: Comparison of MetaTCR against other feature encoding methods (e.g., k-mers, V/J usage).
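For readers unfamiliar with the k-mer baseline mentioned in the feature comparison, the sketch below encodes a repertoire as normalized CDR3 k-mer frequencies. It is an illustrative assumption about what such an encoding looks like, not the code in feature_compare/; the function name `kmer_frequencies`, the default `k=3`, and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_frequencies(cdr3_sequences, k=3):
    """Encode a repertoire as relative frequencies of overlapping amino-acid k-mers."""
    counts = Counter()
    for seq in cdr3_sequences:
        # Slide a window of width k across each CDR3 sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy repertoire.
freqs = kmer_frequencies(["CASSL", "CASSP"], k=3)
print(freqs)
```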
Corresponds to Figure 4
A systematic evaluation of metrics (kBET, JSD, iLISI, MMD) to identify the best tool for quantifying batch effects.
- scenario1_simu_by_three_methods.py: Task 1 (Sensitivity) - Tests metric correlation with simulated batch effect magnitudes.
- scenario2_data_sampling_robustness.py: Task 2 (Stability) - Evaluates metric robustness to sampling size and stochasticity.
- scenario3_dataset_pairs_classification_auc_multi_class.py: Task 3 (Discrimination) - Assesses ability to distinguish real-world technical batches from biological variation.
- benchmark_metrics_by_ranking_scaled.py: Generates the final ranking of metrics (Fig 4e).
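Of the metrics listed above, JSD (Jensen-Shannon divergence) is the simplest to state. The following is a minimal sketch of the standard definition for two discrete distributions, such as two normalized abundance vectors of equal length. How the benchmark scripts aggregate samples into distributions is not shown here; the function name and the toy vectors are hypothetical.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions of equal length."""
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical distributions over two categories.
identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)
```

With base-2 logarithms the value is bounded between 0 (identical distributions) and 1 (no shared support), which makes scores comparable across dataset pairs.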
Corresponds to Figure 5
Evaluates integration algorithms (Covariance Matching, Harmony, MNN, Scanorama) on MetaTCR data.
- 2.simu_celltype_test_integration_tools.py: Task 4 (Archetype Simulation) - Tests preservation of biological structure (Bio-Silhouette) vs. batch mixing (kBET) in simulated repertoires.
- 3.integration_same_label.py: Task 5 (Real-world Integration) - Integration of healthy/melanoma cohorts from different studies.
- 1.domain_shift_on_cmv_clf_with_val_baseline.py: Task 6 (Generalizability) - Cross-study transfer learning (CMV serostatus prediction) to measure improvement in model performance after integration.
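The Bio-Silhouette score mentioned in Task 4 is assumed here to be a silhouette coefficient computed with biological labels as the grouping. The sketch below implements the standard mean silhouette with Euclidean distance in plain Python; it is not the repository's implementation, and the function name and toy points are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    n = len(points)
    scores = []
    for i in range(n):
        # Collect distances from point i to every other point, grouped by label.
        dists = {}
        for j in range(n):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(points[i], points[j]))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            # Singleton cluster or only one label present: score taken as 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())       # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Hypothetical 2-D embeddings of four samples.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette_score(points, ["A", "A", "B", "B"])  # labels match the geometry
mixed = silhouette_score(points, ["A", "B", "A", "B"])      # labels cut across the geometry
print(separated, mixed)
```

Under this reading, a good integration method keeps the score high for biological labels while a batch-mixing metric such as kBET improves.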
Corresponds to Figure 6
Application of the framework to the Wang2022 dataset to detect and correct latent batch effects.
- wang2022_data_segmentation.py: Unsupervised segmentation algorithm to discover hidden technical subgroups (latent batches).
- wang2022_group_comparison.py: Comparison of the identified latent batches.
- Vgene/:
  - case_vdj_fold_change_Wang2022_meta.py: Differential V/J gene usage analysis before and after batch correction (Fig 6c-f).
  - case_vdj_distribution_Wang2022_sample_level.py: Visualization of specific gene distributions.
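A differential gene usage analysis of this kind typically reports a log2 fold change per gene between two groups. The sketch below shows that calculation on mean usage frequencies; it is an assumption about the general approach rather than the code in Vgene/, and the function name, the pseudocount value, and the usage numbers are hypothetical.

```python
import math


def log2_fold_change(usage_a, usage_b, pseudocount=1e-6):
    """Per-gene log2 ratio of usage frequency in group A relative to group B."""
    genes = sorted(set(usage_a) | set(usage_b))
    return {
        # The pseudocount avoids division by zero for genes absent from one group.
        g: math.log2((usage_a.get(g, 0.0) + pseudocount) / (usage_b.get(g, 0.0) + pseudocount))
        for g in genes
    }


# Hypothetical mean V gene usage frequencies for two groups of samples.
group_a = {"TRBV5-1": 0.2, "TRBV7-9": 0.1}
group_b = {"TRBV5-1": 0.1, "TRBV7-9": 0.1}

lfc = log2_fold_change(group_a, group_b)
print(lfc)
```

Running the same comparison before and after batch correction shows which apparent differences are attributable to the latent batches.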
Corresponds to Discussion Section
Exploratory analysis of TCRs that do not fit into the static reference clusters ("Novel TCRs").
- novel_groups_emerson2017_new.py: Identification of novel TCR groups in the Emerson2017 dataset.
- outline_melanoma_dataset_distribution.py: Distribution of outlier TCRs across melanoma datasets.