Barcode calling
Spl-IsoQuant includes a built-in barcode calling module (splisoquant_detect_barcodes.py) that extracts
barcodes and UMIs from raw long reads. This module supports multiple single-cell and
spatial transcriptomics platforms and can also be used as a standalone tool.
See single-cell and spatial transcriptomics for supported platforms, pipeline integration, MDF format, and UMI deduplication.
How it works
The barcode calling module processes each read by:
- Searching for known constant sequences (linkers, primers, TSO) on the read
- Locating the polyT tail to determine read orientation
- Extracting the barcode region based on its expected position relative to the anchoring constant sequences
- Matching the extracted barcode against a whitelist
- Extracting the UMI sequence from its expected position
Each read is assigned:
- A barcode (cell/spot identity) corrected against the whitelist, or * if no match is found
- A UMI (unique molecular identifier), or * if not detected
- A strand orientation (+, -, or . if unknown)
- Platform-specific features (polyT position, linker positions, etc.)
Integration with Spl-IsoQuant
When running Spl-IsoQuant with any single-cell or spatial mode, barcode calling is performed automatically as the first pipeline step. The barcode detection output is a TSV file with one line per read, which is then used throughout the rest of the pipeline for grouping and UMI deduplication.
If barcodes have already been called externally (e.g., by Cell Ranger),
use --barcoded_reads to skip the barcode calling step.
Standalone usage
splisoquant_detect_barcodes.py can be run independently:
python splsplisoquant_detect_barcodes.py \
--input reads.fastq.gz \
--barcodes barcode_whitelist.txt \
--mode tenX_v3 \
--output output_prefix \
--threads 16
Command line options
--input or -i (required)
One or more input read files in FASTQ, FASTA or BAM format. Gzipped FASTQ/FASTA files are supported.
--output or -o (required)
Output prefix. The barcode calling results will be written to <prefix>.barcoded_reads.tsv.
When multiple input files are provided, outputs are numbered: <prefix>_0.barcoded_reads.tsv, etc.
--barcodes or -b
One or more barcode whitelist files. The number of files depends on the mode:
- 1 file:
tenX_v3,visium_5prime,stereoseq,stereoseq_nosplit,curio - 2 files:
visium_hd - Not needed for
custom_sc(barcodes defined in MDF file)
File should contain one barcode sequence per line. More than 1 tab-separated column is allowed, but only the first will be used. Supports plain text and gzipped files.
Note: barcode calling is performed much better if the whitelist contains a small number of barcodes. If you have a subset of barcodes from short-read data, provide them instead of the full whitelist.
--mode
Barcode calling mode. Available modes: tenX_v3, curio, stereoseq, stereoseq_nosplit,
visium_5prime, visium_hd, custom_sc.
--molecule
Path to a molecule description format (MDF) file for custom_sc mode.
This file defines the structure of the sequencing molecule (barcodes, UMIs, linkers, polyT, cDNA)ю
See MDF format for the format specification.
--threads or -t
Number of threads for parallel processing (default: 16).
--tmp_dir
Folder for temporary files during parallel processing.
Example commands
10x Genomics:
python splsplisoquant_detect_barcodes.py \
-i reads.fastq.gz \
-b 3M-february-2018.txt.gz \
--mode tenX_v3 \
-o results/my_sample \
-t 16
Stereo-seq:
python splsplisoquant_detect_barcodes.py \
-i reads.fastq.gz \
-b stereo_barcodes.txt \
--mode stereoseq \
-o results/stereo_sample \
-t 16
Custom molecule:
python splsplisoquant_detect_barcodes.py \
-i reads.fastq.gz \
--mode custom_sc \
--molecule my_platform.mdf \
-o results/custom_sample
Output
The main output is a TSV file (*.barcoded_reads.tsv) with a header line followed by one line per read.
The columns depend on the platform mode.
Common columns (all modes)
| Column | Description |
|---|---|
| read_id | Read identifier |
| barcode | Corrected barcode sequence, or * if not found |
| UMI | UMI sequence, or * if not detected |
| BC_score | Barcode alignment score (-1 if not matched) |
| strand | Read orientation: +, -, or . |
Additional columns by mode
10x Genomics (tenX_v3, visium_5prime, visium_hd):
| Column | Description |
|---|---|
| polyT | Position of polyT tail start (-1 if not found) |
| R1 | Position of R1 primer end (-1 if not found) |
Curio (curio):
| Column | Description |
|---|---|
| polyT | PolyT tail position |
| primer | PCR primer position |
| linker_start | Linker start position |
| linker_end | Linker end position |
Stereo-seq (stereoseq_nosplit, stereoseq):
| Column | Description |
|---|---|
| polyT | PolyT tail position |
| primer | Primer position |
| linker_start | Linker start position |
| linker_end | Linker end position |
| TSO | TSO position (-1 if not found) |
Custom (custom_sc): columns depend on the elements defined in the MDF file.
Each variable element produces a column with its detected sequence and score.
Statistics file
A statistics file (*.barcoded_reads.tsv.stats) is generated alongside the TSV output,
reporting the total number of reads processed, barcodes detected, and UMIs found.
Supported platforms
10x Genomics (tenX_v3, visium_5prime)
Molecule structure (3' to 5'):
Requires 1 barcode whitelist file (e.g., the 10x 3M-february-2018.txt.gz).
Visium HD (visium_hd)
Molecule structure (3' to 5'):
Requires 2 barcode whitelist files (part 1 and part 2). The final barcode is the concatenation of both parts.
Curio Bioscience (curio)
Molecule structure (3' to 5'):
The linker sequence anchors the barcode detection. Accepts 1 combined whitelist file (14bp barcodes).
Stereo-seq (stereoseq, stereoseq_nosplit)
Molecule structure (3' to 5'):
stereoseqmode: splits concatenated reads at TSO boundaries, producing a new FASTA with individual subreadsstereoseq_nosplitmode: processes reads without splitting
Custom platform (custom_sc)
Uses a molecule definition file (MDF) to describe arbitrary molecule structures. See MDF format.
The universal extractor:
1. Tries both forward and reverse complement orientations
2. Detects polyT tail to determine strand
3. Finds constant elements (primers, linkers) on the read
4. Extracts variable elements (barcodes, UMIs) from their expected positions
5. Corrects barcodes against whitelists when provided (VAR_FILE or VAR_LIST)
6. Handles linked elements: concatenated (|) parts are joined then corrected; duplicated (/) parts use majority vote