TeloQuest

TeloQuest is a machine learning pipeline on a mission to uncover tumor status by analyzing telomere content variation!

This repository contains scripts and instructions for downloading BAM files for all TCGA projects from the GDC data portal, analyzing telomere content using qmotif, saving output files with telomere data, extracting variant information for 15 telomere-related genes, and building a machine learning model using all these features.
Each step in the pipeline is outlined below, along with requirements and execution instructions.

We utilized the GDC Data Portal to access the BAM files for all the TCGA projects.

Figure 1. TeloQuest Pipeline: A schematic overview of the TeloQuest pipeline used to predict tumor status based on phenotypic data, telomeric read content, and genomic variants. The pipeline consists of the following key steps: (1) Downloading phenotypic data for all 33 cancer types from The Cancer Genome Atlas (TCGA) and compiling it into a CSV file; (2) Downloading whole-genome sequencing (WGS) BAM files, slicing the BAM files according to telomeric regions, using the qmotif pipeline to quantify the telomeric reads, which are then compiled into a CSV file; (3) Re-downloading the same WGS BAM files, slicing them at the loci of 15 telomere length-associated genes, and identifying variants using bcftools “mpileup” command, followed by integration of variants data into the existing CSV file containing phenotypic and telomeric read information. The final integrated CSV file then serves as the input for training the machine learning model for tumor prediction. [Figure created with BioRender.com].

Requirements

To run this pipeline, you'll need:

Controlled-data Access Authorization: Follow the steps on GDC to get access to the controlled data (aka the BAM files).
GDC Download Token: The token is required to download the controlled access data. To download controlled-access data, ensure that the authentication token is stored in the same folder as the scripts. The token is valid for 30 days from the date of download and can only be used once. For instance, if you use a token to download BAM files for the TCGA-ACC project, you will need a new token to download BAM files for the TCGA-GBM project. Additionally, you can only download controlled-access files for one project at a time—requesting a new token will immediately invalidate the previous one. However, if multiple team members have access to the portal, each person can generate and use their own individual tokens at the same time. Please be responsible with the tokens and do not share them with anyone, as this is controlled-access data.
samtools: Required to generate BAM Index (BAI) files for the corresponding BAM files
qmotif: Download and install qmotif (Documentation here).

Figure 2. Workflow for Telomere Content Variation Pipeline. Schematic representation of the qmotif-based pipeline used to estimate telomere content variation across samples from TCGA data. The workflow involves extracting and quantifying telomeric reads from whole-genome sequencing data using qmotif v1.0. The tool operates through a two-stage matching system: in Stage 1, a simple string match is used to identify canonical telomeric repeats, while in Stage 2, a more complex regular expression is applied to detect variant telomeric sequences. At the end of Stage 2, a tally of all identified motifs is done, and the final number is recorded. [Figure created with BioRender.com].

Java: Required to run qmotif
bcftools: Required to obtain variants for the 15 telomere-related genes
Anaconda for Jupyter notebook: Required to build the machine learning model

Additional Things to Prepare:

I.) If you're only interested in certain chromosomal coordinates from the BAM file, get the specific coordinates using UCSC's Genome Browser. TCGA data on the GDC Portal has been harmonized and mapped to the GRCh38 human reference genome build, so please be aware of this when you get the coordinates and ensure they are from the correct build.
II.) Refer to the sample script Kidney_TCGA_KICH_loop.sh which utilizes the BAM files for UUIDs mentioned in the TSV file - Kidney_TCGA_KICH_curl.tsv
III.) Generate a similar TSV file using "Cohort Builder" and "Repository" on the GDC Data Portal.

/// Steps to Download a Sample Sheet with UUIDs from the GDC Portal:

Open the Cohort Builder
- Click on Cohort Builder and select the program (e.g., TCGA).
- Choose the project (e.g., TCGA-KICH).
Select Features for Your Cohort
- Customize your cohort by selecting relevant features such as:
  - Disease type
  - Primary diagnosis
  - Primary site
  - Tissue or organ of origin
  - Gender, race, ethnicity
  - Vital status
  - Site of biopsy
  - Prior malignancy
  - Tissue type
  - Data format, etc.
Save Your Cohort
- Once all features are selected, click Save the Cohort and assign it a name.
Access the Repository
- Now, navigate to the Repository and refine your selection by specifying:
  - Experimental strategy
  - Data type
  - Data format
  - Type of access
  - Tissue type
  - Specimen type
Download the Sample Sheet
- Click the Download Sample Sheet button to generate a file containing the UUIDs for all selected files from your chosen project.
Verify the Sample Sheet
- Review the downloaded file to ensure it includes all the selected features.
- If working on multiple projects, give each sample sheet a unique name for better organization.

IV.) Reference Genome FASTA (GRCh38.d1.vd1.fa) and FASTA index files (GRCh38.d1.vd1.fa.fai) were obtained from NCI's website.
v.) More information on BAM slicing using GDC Data Portal is available here.

Gene symbols, names, and genomic coordinates of the 15 telomere-related genes for which variants have been captured from the paper by Burren et al. (2024).

No. Gene Symbol - Gene Full Name - Genomic coordinates

DCLRE1B - DNA cross-link repair 1B - chr1:113905326-113914086
NAF1 - Nuclear assembly factor 1 ribonucleoprotein - chr4:163128669-163166890
TERT - Telomerase reverse transcriptase - chr5:1253167-1295068
G3BP1 - G3BP stress granule assembly factor 1 - chr5:151771954-151812785
ZNF451 - Zinc finger protein 451 - chr6:57090188-57170305
POT1 - Protection of telomeres 1 - chr7:124822386-124929825
TERF1 - Telomeric repeat binding factor 1 - chr8:73008864-73048123
STN1 - STN1 subunit of CST complex - chr10:103877569-103918184
ATM - ATM serine/threonine kinase - chr11:108223067-108369102
TINF2 - TERF1 interacting nuclear factor 2 - chr14:24239643-24242623
PARN - Poly(A)-specific ribonuclease - chr16:14435701-14630260
CTC1 - CST telomere replication complex component 1 - chr17:8224815-8248056
BRIP1 - BRCA1 interacting DNA helicase 1 - chr17:61679139-61863528
SAMHD1 - SAM and HD domain containing deoxynucleoside triphosphate triphosphohydrolase 1 - chr20:36890229-36951708
RTEL1 - Regulator of telomere elongation helicase 1 - chr20:63658312-63696245

File Descriptions

Kidney_TCGA_KICH_curl.tsv: A TSV file containing all the UUIDs for the TCGA project of interest. This file specifically includes UUIDs for both normal and tumor BAM files from the TCGA-KICH project.
Kidney_TCGA_KICH_loop.sh: A Bash script that performs curl requests to download and slice BAM files based on specified genomic regions. The regions used here are telomeric coordinates obtained using UCSC's LiftOver tool.
bamindex.sh: A Bash script that uses samtools to generate corresponding BAI index files for the downloaded BAM files.
Kidney_TCGA_KICH.py: A Python script that runs qmotif on the BAM and BAI files to analyze telomere content.
Kidney_TCGA_KICH_gene_loop.sh: A Bash script that performs curl requests to download and slice BAM files based on specified genomic regions. The regions used here are for 15 telomere-related genes.
runbcftools.sh: A Bash script that runs the bcftools mpileup command on all BAM files to call variants for 15 telomere-related genes.
variantstxt.sh: A Bash script that loops over all generated VCF files and formats the variant data into TXT files. Each line is sorted in the CHROM\tPOS\tREF\tALT\t%GT\n format for the corresponding BAM file.
allgenotype.py: A Python script that performs genotype encoding, mapping: "0/0" → 0, "0/1" → 1, "1/1" → 2, "./." → None
Note: The GDC Portal allows you to download BAM files and other controlled-access files for only one project at a time. If you generate multiple tokens simultaneously, the system will automatically invalidate the second-to-last token, even if it has not been used. Additionally, once a token has been used to download files, it will immediately expire. Therefore, every time you need to download new files, you must obtain a new token.

Usage

Sequence of Execution

To use the pipeline, run the scripts in the following order:

Kidney_TCGA_KICH_loop.sh (requires: Kidney_TCGA_KICH_curl.tsv, generates: sliced BAM files according to telomeric regions)
bamindex.sh (requires: samtools, BAM files, generates: BAI files)
Kidney_TCGA_KICH.py (requires: qmotif, BAM and BAI files, generates: log.txt and output.txt files)
Kidney_TCGA_KICH_gene_loop.sh (requires: Kidney_TCGA_KICH_curl.tsv, generates: sliced BAM files according to 15 telomere-related genes)
runbcftools.sh (requires: GRCh38 reference fasta file, bcftools, sliced BAM files according to 15 telomere-related genes, generates: VCF files)
variantstxt.sh (requires: VCF files, generates: Plain TXT files with variant data)
allgenotype.py (requires: Folder of TXT files with variant data, generates: summary CSV file of mutation counts per file)

Download Recommendations: For optimal performance when downloading these files, we recommend using a Linux-based system or macOS. Due to the large file sizes, these operating systems tend to handle extensive downloads more reliably and efficiently than some alternatives. Additionally, we suggest ensuring a stable internet connection to minimize interruptions during the download process.
Note: When working with long-running processes, such as data analysis scripts or large data transfers, it’s often helpful to use tools like tmux and nohup to keep the process running even if your session disconnects. Documentation on tmux and nohup available here.

Detailed Steps

Download the (sliced) BAM files for telomeric regions

Run Kidney_TCGA_KICH_loop.sh to download the sliced BAM files from the GDC Data Portal. Make sure to:
Include regions that you want the sliced BAM files to have, and that both the token and Kidney_TCGA_KICH_curl.tsv are in the same folder.

chmod +x Kidney_TCGA_KICH_loop.sh

./Kidney_TCGA_KICH_loop.sh

OR

bash Kidney_TCGA_KICH_loop.sh

Get the BAM Index files for each BAM file

Before running bamindex.sh, ensure samtools is installed, and that all the BAM files are located in the same directory.

chmod +x bamindex.sh

./bamindex.sh

OR

bash bamindex.sh

Run qmotif with Kidney_TCGA_KICH.py

Before running this script, make sure the following are in place:
1. qmotif is installed and accessible from your system's path.
2. You’ve set the correct paths to your qmotif executable, as well as your BAM and BAI input files.
This script expects input file names without extensions (i.e., no .bam or .bai).
A quick way to prepare the list: I copied the names of all the BAI files generated at the end of the previous script (bamindex.sh), pasted them into a Google Doc, and used the Find and Replace feature (Cmd + F on Mac or Ctrl + F on Windows). I searched for “.bai” and replaced it with an empty string using the Replace All feature.

python3 Kidney_TCGA_KICH.py

Download the (sliced) BAM files for 15 telomere-related genes regions

Run Kidney_TCGA_KICH_gene_loop.sh to download the sliced BAM files from the GDC Data Portal. Make sure to:
Include regions that you want the sliced BAM files to have, and that both the token and Kidney_TCGA_KICH_curl.tsv are in the same folder.

chmod +x Kidney_TCGA_KICH_gene_loop.sh

./Kidney_TCGA_KICH_gene_loop.sh

OR

bash Kidney_TCGA_KICH_gene_loop.sh

////// 4. Parse qmotif Log Files

Use stage2.py to parse log files created by runqmotif.py. This script will output telomere read counts for each chromosome in a file named {sequence_name}_stage2_coverage.txt.

python3 stage2.py

Generate Chromosome-Level Tally of Telomeric reads

Run realcoverage.sh to tally telomeric reads for each chromosome. This script uses the chrnames file, so make sure it’s in the same folder.
Note: The chrnames file only has chromosome numbers for autosomes; sex chromosomes are not included.

bash realcoverage.sh

Extract Scaled Telomeric Reads for all the samples

Run scaledgenomic.sh file to extract scaled telomeric reads data from output files created by runqmotif.py. This generates a file named ScaledGenomicOutput.txt.

bash scaledgenomic.sh

Outputs

{ProjectID}_${CaseID}_${SampleType}.bam: BAM files sliced according to telomeric regions (output of Kidney_TCGA_KICH_loop.sh)
{filename}.bai: BAM index (BAI) files for the sliced BAM files (output of bamindex.sh)
{sequence_name}_terminal_output.txt: Chromosome-specific telomeric read counts (output of Kidney_TCGA_KICH.py)
{ProjectID}_${CaseID}_${SampleType}.bam: BAM files sliced according to the telomere-related genes (output of Kidney_TCGA_KICH_gene_loop.sh)
{base_name}_variants.vcf.gz: VCF file containing the genomic variants info for each file (output of runbcftools.sh)
{basename}.txt: Generates a TXT file from the VCF file in the CHROM\tPOS\tREF\tALT\t%GT\n format (output of variantstxt.sh)
{ProjectID}_mutation_summary.csv: A CSV file that has the mutation summary of all samples for that TCGA Project (output of allgenotype.py)

Citation

If you use TeloQuest, please cite our paper

Shah, P., & Sethuraman, A. (2025a). A novel machine learning approach for tumor detection based on telomeric signatures. Biology Methods and Protocols, 10(1). https://doi.org/10.1093/biomethods/bpaf069

Contact

If you’d like to discuss this project or get in touch for other inquiries, please email me at priyanshishah213@gmail.com or connect with me on LinkedIn.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TeloQuest

TeloQuest is a machine learning pipeline on a mission to uncover tumor status by analyzing telomere content variation!

Table of Contents

Requirements

Additional Things to Prepare:

Gene symbols, names, and genomic coordinates of the 15 telomere-related genes for which variants have been captured from the paper by Burren et al. (2024).

File Descriptions

Usage

Sequence of Execution

Detailed Steps

Outputs

Citation

Contact

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
Kidney_TCGA_KICH.py		Kidney_TCGA_KICH.py
Kidney_TCGA_KICH_curl.tsv		Kidney_TCGA_KICH_curl.tsv
Kidney_TCGA_KICH_gene_loop.sh		Kidney_TCGA_KICH_gene_loop.sh
Kidney_TCGA_KICH_loop.sh		Kidney_TCGA_KICH_loop.sh
README.md		README.md
allgenotype.py		allgenotype.py
bamindex.sh		bamindex.sh
random_forest.ipynb		random_forest.ipynb
runbcftools.sh		runbcftools.sh
teloquest.png		teloquest.png
telotales.png		telotales.png
variantstxt.sh		variantstxt.sh

Folders and files

Latest commit

History

Repository files navigation

TeloQuest

TeloQuest is a machine learning pipeline on a mission to uncover tumor status by analyzing telomere content variation!

Table of Contents

Requirements

Additional Things to Prepare:

Gene symbols, names, and genomic coordinates of the 15 telomere-related genes for which variants have been captured from the paper by Burren et al. (2024).

File Descriptions

Usage

Sequence of Execution

Detailed Steps

Outputs

Citation

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages