deepomicslab/MetaTCR_exp_scripts

Scripts for Reproducing MetaTCR Paper Results

This directory contains the scripts used to generate the figures and results for the paper "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets".

The core MetaTCR Python package and its usage instructions are available at https://github.com/deepomicslab/MetaTCR/tree/main.

Directory Organization

The scripts are organized by the section of the paper they correspond to. Each directory and the experiments it contains are described below.

1. Analysis of Pervasive Batch Effects (dataset_feature_stats/, cdr3_analysis/)

Corresponds to Figure 1

These scripts quantify the baseline technical divergence between datasets.

  • dataset_feature_stats/:
    • Shannon Entropy: Calculates and compares repertoire diversity across cohorts (healthy_datasets_shannon.py, melanoma_datasets_shannon.py).
    • k-mer Distribution: Performs PCA on CDR3 k-mer counts to visualize batch-driven clustering (*_kmer_distribution2x2.py).
  • cdr3_analysis/:
    • Clonotype Overlap: Analyzes the fraction of shared clonotypes within vs. across studies to demonstrate sparsity (plot_shared_cdr3_by_datasets_heatmap.py).
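
The three quantities above can be illustrated with a minimal, standard-library sketch. It is not taken from the scripts: the function names, the log base (bits), the default k-mer length of 3, and the use of the Jaccard index as the "fraction of shared clonotypes" are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(clone_counts):
    """Shannon entropy (in bits) of a repertoire, given per-clonotype counts."""
    total = sum(clone_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in clone_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def kmer_counts(cdr3_sequences, k=3):
    """Count overlapping k-mers across a list of CDR3 amino-acid strings.

    k=3 is an assumed default; the README does not state the k used.
    """
    counts = Counter()
    for seq in cdr3_sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def shared_fraction(repertoire_a, repertoire_b):
    """Jaccard index of two clonotype sets (an assumed overlap definition)."""
    a, b = set(repertoire_a), set(repertoire_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A k-mer count table built this way (one row per sample) is the kind of matrix on which a PCA could then be run; the PCA step itself is left out to keep the sketch dependency-free.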

2. MetaTCR Framework Optimization (functional_cluster_num/)

Scripts for constructing the "Referenced TCR Space" and determining the optimal granularity.

  • determin_optimal_cluster_num_by_antigen.py: Benchmarks different cluster numbers ($k$) using antigen specificity and epitope purity metrics.
  • spectral_robustness_k96.py: Evaluates the stability and robustness of the chosen $k=96$ functional clusters.
  • function_cluster_by_spectral.py: Implementation of the spectral clustering step for reference construction.
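
As a rough illustration of what an epitope purity metric can look like, the sketch below scores a clustering by the share of TCRs whose epitope label matches the majority label of their cluster. This is the textbook definition of cluster purity and is only an assumption about the benchmark; `epitope_purity` and its arguments are hypothetical names, not functions from the repository.

```python
from collections import Counter, defaultdict


def epitope_purity(cluster_ids, epitopes):
    """Fraction of TCRs carrying the majority epitope label of their cluster.

    cluster_ids and epitopes are parallel sequences, one entry per TCR.
    """
    if len(cluster_ids) != len(epitopes):
        raise ValueError("cluster_ids and epitopes must have the same length")
    if not cluster_ids:
        raise ValueError("at least one TCR is required")

    labels_by_cluster = defaultdict(list)
    for cluster, epitope in zip(cluster_ids, epitopes):
        labels_by_cluster[cluster].append(epitope)

    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(cluster_ids)
```

Evaluating such a score over a range of candidate $k$ values is one plausible way to compare granularities; purity tends to rise as $k$ grows, so it is usually read alongside a second criterion rather than maximized on its own.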

3. Meta-vector Profiling Evaluation (metavec_evaluation/)

Corresponds to Figure 3

Validates that the Meta-vector representation retains biological signals while exposing technical noise.

  • biological_rep/: Analysis of the Genolet2023 dataset to quantify technical noise between biological replicates across batches/platforms (Genolet2023_data_analysis_abundance.py).
  • robustness_Sherwood2015/: Longitudinal analysis of the Sherwood2015 dataset to demonstrate intra-individual stability of meta-vectors (meta_vec_individual_robunstness.py).
  • feature_compare/: Comparison of MetaTCR against other feature encoding methods (e.g., k-mers, V/J usage).
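
The README does not say how intra-individual stability is scored. As a hedged sketch, assuming a meta-vector can be treated as a plain list of floats and that stability is measured with cosine similarity between time points (both assumptions, with a hypothetical function name), the comparison could look like this:

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Under this assumption, samples from the same individual at different time points would be expected to score higher against each other than against samples from other individuals.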

4. Benchmarking Batch Dissimilarity Metrics (metric_benchmarking/)

Corresponds to Figure 4

A systematic evaluation of metrics (kBET, JSD, iLISI, MMD) to identify the best tool for quantifying batch effects.

  • scenario1_simu_by_three_methods.py: Task 1 (Sensitivity) - Tests metric correlation with simulated batch effect magnitudes.
  • scenario2_data_sampling_robustness.py: Task 2 (Stability) - Evaluates metric robustness to sampling size and stochasticity.
  • scenario3_dataset_pairs_classification_auc_multi_class.py: Task 3 (Discrimination) - Assesses each metric's ability to distinguish real-world technical batches from biological variation.
  • benchmark_metrics_by_ranking_scaled.py: Generates the final ranking of metrics (Fig 4e).
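
Of the four metrics listed, JSD (Jensen-Shannon divergence) is the simplest to write down. The sketch below is a generic base-2 implementation, not the benchmark code: how the scripts summarize each batch into a distribution is not stated here, so the inputs are assumed to be two non-negative vectors of equal length (for example, averaged cluster abundances), and the function names are illustrative.

```python
import math


def _normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    return [v / total for v in values]


def _kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p, q = _normalize(p), _normalize(q)
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
```

With base-2 logarithms the value is 0 for identical distributions and 1 for distributions with disjoint support, which makes scores comparable across dataset pairs.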

5. Integration Methods Benchmarking (integration_benchmarking/)

Corresponds to Figure 5

Evaluates integration algorithms (Covariance Matching, Harmony, MNN, Scanorama) on MetaTCR data.

  • 2.simu_celltype_test_integration_tools.py: Task 4 (Archetype Simulation) - Tests preservation of biological structure (Bio-Silhouette) vs. batch mixing (kBET) in simulated repertoires.
  • 3.integration_same_label.py: Task 5 (Real-world Integration) - Integration of healthy/melanoma cohorts from different studies.
  • 1.domain_shift_on_cmv_clf_with_val_baseline.py: Task 6 (Generalizability) - Cross-study transfer learning (CMV serostatus prediction) to measure improvement in model performance after integration.
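
To make the biological-structure side of Task 4 concrete, here is a dependency-free sketch of the mean silhouette coefficient. It assumes that "Bio-Silhouette" means the standard silhouette score computed with biological labels (rather than batch labels) as the grouping, and that Euclidean distance is used; neither is confirmed by this README, and the function name is hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    points: sequence of equal-length numeric tuples (one per sample).
    labels: group label per sample (here assumed to be the biological label).
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if len(set(labels)) < 2:
        raise ValueError("silhouette needs at least two distinct labels")

    scores = []
    for i, point in enumerate(points):
        # Collect distances from sample i to every other sample, grouped by label.
        dists = {}
        for j, other in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(point, other))

        own = dists.get(labels[i])
        if not own:
            # A sample that is alone in its group scores 0 by convention.
            scores.append(0.0)
            continue

        a = sum(own) / len(own)  # mean distance within the sample's own group
        b = min(                 # mean distance to the nearest other group
            sum(d) / len(d) for lab, d in dists.items() if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Read this way, a good integration method keeps the silhouette on biological labels high while batch-mixing metrics such as kBET indicate that batches are well interleaved.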

6. Gastric Cancer Case Study (case_study/)

Corresponds to Figure 6

Application of the framework to the Wang2022 dataset to detect and correct latent batch effects.

  • wang2022_data_segmentation.py: Unsupervised segmentation algorithm to discover hidden technical subgroups (latent batches).
  • wang2022_group_comparison.py: Comparison of the identified latent batches.
  • Vgene/:
    • case_vdj_fold_change_Wang2022_meta.py: Differential V/J gene usage analysis before and after batch correction (Fig 6c-f).
    • case_vdj_distribution_Wang2022_sample_level.py: Visualization of specific gene distributions.
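
Differential gene usage is commonly summarized as a log2 fold change of usage frequencies between two groups. The sketch below shows that calculation in isolation; the pseudocount, the function names, and the pooling of genes into a single list per group are assumptions for illustration and may differ from what the Vgene/ scripts do.

```python
import math
from collections import Counter


def gene_usage(gene_calls):
    """Relative usage frequency of each gene in a list of gene calls."""
    counts = Counter(gene_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: count / total for gene, count in counts.items()}


def log2_fold_change(usage_a, usage_b, pseudocount=1e-6):
    """Per-gene log2(usage_a / usage_b).

    The pseudocount (an assumed value) avoids division by zero for genes
    that are absent from one of the two groups.
    """
    genes = set(usage_a) | set(usage_b)
    return {
        gene: math.log2(
            (usage_a.get(gene, 0.0) + pseudocount)
            / (usage_b.get(gene, 0.0) + pseudocount)
        )
        for gene in genes
    }
```

Computing these fold changes once on the raw data and once after batch correction gives a simple before/after view of which gene-level differences are driven by the latent batches.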

7. Novel TCR Analysis (find_outlier_tcrs/)

Corresponds to the Discussion section

Exploratory analysis of TCRs that do not fit into the static reference clusters ("Novel TCRs").

  • novel_groups_emerson2017_new.py: Identification of novel TCR groups in the Emerson2017 dataset.
  • outline_melanoma_dataset_distribution.py: Distribution of outlier TCRs across melanoma datasets.
