File Formats

GIFTwrap uses specific file formats and structures for its inputs and outputs. Below is a summary of the key file formats, followed by supplementary files that GIFTwrap also works with.

Input Files

Required Input Files

FastQs: Raw sequencing data files in FASTQ format. These files may be either compressed or uncompressed. The R1 and R2 files should follow typical conventions for 10X Genomics experiments. GIFTwrap can handle multiple FastQ files for the same sample, so there is no need to manually concatenate them.
Probe Definition File: A spreadsheet containing the custom probe definitions used in your GIFT-seq experiment. This file may be a .csv, .tsv, or .xlsx file. The columns are described in the following table:

Column Name	Description	Required?
`name`	The name of the probe. By convention, this should follow the format `gene_name HGVSc`, for example `TP53 c.215G>A`.	Yes
`lhs_probe`	The left-hand side sequence of the probe. This is the reverse complement of the right side of your gene sequence of interest.	Yes
`rhs_probe`	The right-hand side sequence of the probe. This is the reverse complement of the left side of your gene sequence of interest.	Yes
`gap_probe_sequence`	The expected "mutant" sequence of the sequenced gapfill (i.e. the reverse complement of the region of interest). This is only used to annotate outputs and is not involved in the default analysis.	No
`original_gap_probe_sequence`	The "wild-type" sequence of the gapfill probe (i.e. the reverse complement of the region of interest). This is only used to annotate outputs and is not involved in the default analysis.	No
`gene`	The gene name that is associated with the probe. If not provided, GIFTwrap will attempt to infer this from the `name` column.	No
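
The orientation rules in the table can be easy to get backwards, so here is a minimal sketch of deriving the probe columns from a sense-strand target sequence. The helper names (`reverse_complement`, `make_probe_row`), the probe name, and the toy sequences are all hypothetical and are not part of GIFTwrap; only the column names come from the table above.

```python
import csv
import io

# Complement lookup for unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def make_probe_row(name, left_flank, gap, right_flank):
    """Build one probe-definition row from the target (sense-strand) sequence.

    Per the column descriptions: lhs_probe is the reverse complement of the
    RIGHT flank, rhs_probe is the reverse complement of the LEFT flank, and
    gap_probe_sequence is the reverse complement of the region of interest.
    """
    return {
        "name": name,
        "lhs_probe": reverse_complement(right_flank),
        "rhs_probe": reverse_complement(left_flank),
        "gap_probe_sequence": reverse_complement(gap),
    }


# Made-up sequences for illustration only; not a real probe design.
row = make_probe_row(
    "GENEX c.10G>A", left_flank="ACGTTGAA", gap="GA", right_flank="CCGGATTA"
)

# Write the row as CSV text (one of the accepted spreadsheet formats).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Real probe designs involve more than orientation (length, melting temperature, specificity), none of which this sketch addresses.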

Recommended Input Files

These are not essential for running GIFTwrap, but they can improve the output quality and analysis:

Cell Ranger WTA Outputs: To refine cell calling and QC of GIFT-seq data, we recommend first running Cell Ranger on your standard whole-transcriptome (WTA) library (i.e. standard 10X Flex). You can then provide the resulting sample_filtered_feature_bc_matrix directory or filtered_feature_bc_matrix.h5 file to GIFTwrap, which uses it to refine the cell barcode whitelist and perform additional filtering/QC steps so that only cells with captured transcriptomes are used in the analysis.

Optional Input Files

These are files that are not typically used for working with GIFTwrap, but may be useful in some cases:

Technology Definition File: A Python file describing your custom sequencing experiment. Review the Custom Technology Definition Tutorial for more information.

Output Files

Output Directory Format

Following processing of GIFT-seq data, a typical run will produce an output directory with the following structure:

output/
├── counts.1.filtered.h5  (if applicable)
├── counts.1.h5
├── counts.1.summary.tsv
├── counts.1.summary.pdf
├── fastq_metrics.tsv
├── manifest.tsv
├── barcodes.tsv.gz
├── probe_reads.tsv.gz
├── probe_reads.tsv.bak.gapfill
├── probe_reads.tsv.bak.umi
├── unmapped_reads_R1.fastq.gz  (if applicable)
├── unmapped_reads_R2.fastq.gz  (if applicable)
└── steps/
    ├── COUNT_GAPFILLS
    ├── CORRECT_UMIS
    ├── CORRECT_GAPFILLS
    └── COLLECT_COUNTS
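
Because the entries under steps/ are empty lock files named after the pipeline steps (see Supplementary Files), a quick way to see how far an interrupted run got is to check which of them exist. The following is a minimal sketch; the function name `step_status` is hypothetical and this is not how GIFTwrap itself implements resuming.

```python
from pathlib import Path

# Step names as they appear in the output directory listing.
PIPELINE_STEPS = [
    "COUNT_GAPFILLS",
    "CORRECT_UMIS",
    "CORRECT_GAPFILLS",
    "COLLECT_COUNTS",
]


def step_status(output_dir):
    """Map each step name to True if its lock file exists under steps/."""
    steps_dir = Path(output_dir) / "steps"
    return {step: (steps_dir / step).is_file() for step in PIPELINE_STEPS}


if __name__ == "__main__":
    # "output" is a placeholder path for a GIFTwrap output directory.
    for step, done in step_status("output").items():
        print(f"{step}: {'done' if done else 'pending'}")
```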

Commonly Used Output Files

The final collected output is a counts file in a custom HDF5 format inspired by the 10X Genomics sample_filtered_feature_bc_matrix.h5 file. There are two versions of this file: filtered (produced only if WTA outputs were provided) and unfiltered. Additionally, if the sample was multiplexed, the file names carry a suffix indicating the sample number (e.g. counts.N.filtered.h5, where N is the probe barcode number).

counts.N.filtered.h5: The final output file containing the counts of each probe for each cell, after filtering and quality control. See below for details on the structure of the h5 file.
counts.N.h5: The unfiltered counts file containing all probes and cells, without any quality control filtering applied.
fastq_metrics.tsv: Summary statistics from parsing the given FASTQ files.
- TOTAL_READS: Total number of reads processed by the count step.
- PROBE_CONTAINING_READS: Total number of reads that contained a valid probe (including UMI/cell barcode).
- POSSIBLE_PROBES: The total number of probes defined.
- PROBES_ENCOUNTERED: The number of probes that were encountered in the fastq files.
- EXACT: The number of reads that contained probes and required no error correction.
- CORRECTED_BARCODE: The number of reads that required cell barcode correction.
- CORRECTED_LHS: The number of reads that required left-hand side probe correction.
- CORRECTED_RHS: The number of reads that required right-hand side probe correction.
- FILTERED_NO_CELL_BARCODE: The number of reads that were filtered out due to no valid cell barcode.
- FILTERED_NO_PROBE_BARCODE: The number of reads that were filtered out due to no valid probe barcode. Only applicable for multiplex runs.
- FILTERED_NO_LHS: The number of reads that were filtered out due to no valid left-hand side probe.
- FILTERED_NO_RHS: The number of reads that were filtered out due to no valid right-hand side probe.
- FILTERED_NO_CONSTANT: The number of reads that were filtered out due to no valid constant sequence region. Only applicable for Flex.
counts.N.summary.tsv: Summary statistics about the final (filtered if available) output of the pipeline.
- TOTAL_CELLS: The total number of cells in the output.
- GAPFILL_CONTAINING_CELLS: The number of cells that contained at least one gapfill read.
- UMIS_PER_CELL_MEAN: The mean number of UMIs per cell.
- UMIS_PER_CELL_MEDIAN: The median number of UMIs per cell.
- UMIS_PER_CELL_STD: The standard deviation of the number of UMIs per cell.
- UMIS_PER_CELL_MIN: The minimum number of UMIs per cell.
- UMIS_PER_CELL_MIN_EXCLUDING_ZERO: The minimum number of UMIs per cell excluding cells with zero UMIs.
- UMIS_PER_CELL_MAX: The maximum number of UMIs per cell.
- CELLS_PER_GAPFILL_MEAN: The mean number of cells with gapfills per probe.
- CELLS_PER_GAPFILL_MEDIAN: The median number of cells with gapfills per probe.
- CELLS_PER_GAPFILL_STD: The standard deviation of the number of cells with gapfills per probe.
- CELLS_PER_GAPFILL_MIN: The minimum number of cells with gapfills per probe.
- CELLS_PER_GAPFILL_MAX: The maximum number of cells with gapfills per probe.
counts.N.summary.pdf: Basic analysis of the final (filtered if available) output of the pipeline. The report includes various figures with descriptions inline.
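
To make the UMIS_PER_CELL_* definitions concrete, here is a minimal sketch that computes the same kinds of statistics from a list of per-cell UMI totals. The function name `umis_per_cell_summary` is hypothetical, and using the population standard deviation is an assumption; the source does not state which estimator GIFTwrap uses.

```python
import statistics


def umis_per_cell_summary(umis_per_cell):
    """Summarize per-cell UMI totals (requires at least one cell)."""
    nonzero = [n for n in umis_per_cell if n > 0]
    return {
        "UMIS_PER_CELL_MEAN": statistics.mean(umis_per_cell),
        "UMIS_PER_CELL_MEDIAN": statistics.median(umis_per_cell),
        # Assumption: population standard deviation; GIFTwrap may differ.
        "UMIS_PER_CELL_STD": statistics.pstdev(umis_per_cell),
        "UMIS_PER_CELL_MIN": min(umis_per_cell),
        # None when every cell has zero UMIs.
        "UMIS_PER_CELL_MIN_EXCLUDING_ZERO": min(nonzero) if nonzero else None,
        "UMIS_PER_CELL_MAX": max(umis_per_cell),
    }


# Toy input: four cells, one of which captured no UMIs.
summary = umis_per_cell_summary([0, 2, 4, 6])
for key, value in summary.items():
    print(key, value)
```

The zero-excluding minimum is the informative one when many cells have no gapfill reads, since the plain minimum is then always 0.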

Supplementary Files

These files are generated during processing of GIFT-seq data to allow for automatic resumes and to retain intermediate filtering steps. They are not typically used for downstream analysis, but may be useful for debugging or understanding the processing steps.

manifest.tsv: A manifest file containing the list of probes and their associated metadata. This file is used to track the probes that were processed in the run. This also defines the index of each probe represented in the raw flat reads files generated while processing.
barcodes.tsv.gz: A gzipped file containing the list of cell barcodes that were encountered in the run. This list is used to map barcode strings to integer indices for efficient storage and processing.
probe_reads.tsv.gz: A gzipped file containing the raw counts of each probe for each cell post-gapfill correction. This file is used to store the raw counts before any filtering or quality control steps are applied. Note that probes and cells are represented as integer indices, not strings. Therefore, analysis requires joining this file with the manifest.tsv and barcodes.tsv.gz files to get the actual probe and cell names.
unmapped_reads_R{1,2}.fastq.gz: If storing unmapped reads was requested, these files contain the unmapped reads from the R1 and R2 fastq files, respectively. They are useful for debugging or further analysis of unmapped reads.
probe_reads.tsv.bak.gapfill: A gzipped backup file containing the raw data of mapped reads prior to gapfill correction.
probe_reads.tsv.bak.umi: A gzipped backup file containing the raw data of mapped reads prior to UMI correction.
steps/: A directory containing empty lock files for each step of the pipeline. These files are used to track the progress of the pipeline and allow for automatic resuming of the pipeline if it is interrupted. The lock files are named after the steps in the pipeline, such as COUNT_GAPFILLS, CORRECT_UMIS, CORRECT_GAPFILLS, and COLLECT_COUNTS.
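
Since probe_reads.tsv.gz stores probes and cells as integer indices, any analysis of it starts by mapping those indices back to names. The sketch below shows the idea on in-memory data. Everything about the layout here is an assumption made for illustration: the field names `cell_idx` and `probe_idx`, zero-based indexing, and one barcode per line in barcodes.tsv.gz. Inspect your own files before relying on any of it.

```python
import gzip


def read_lines_gz(path):
    """Read a gzipped text file into a list of lines without newlines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def resolve_indices(read_rows, barcodes, probe_names):
    """Attach barcode and probe names to rows that carry integer indices.

    Assumes (hypothetically) zero-based 'cell_idx' and 'probe_idx' fields.
    """
    return [
        {
            **row,
            "cell": barcodes[row["cell_idx"]],
            "probe": probe_names[row["probe_idx"]],
        }
        for row in read_rows
    ]


# Toy data standing in for barcodes.tsv.gz, manifest.tsv and probe_reads.tsv.gz.
barcodes = ["AAAC", "AAAG"]
probe_names = ["GENEX c.10G>A"]
rows = [{"cell_idx": 1, "probe_idx": 0, "count": 3}]

for resolved in resolve_indices(rows, barcodes, probe_names):
    print(resolved["cell"], resolved["probe"], resolved["count"])
```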

Custom HDF5 Counts File Format

The custom HDF5 counts file format is designed to be compatible with the 10X Genomics filtered feature barcode matrix format. It contains the following nested structure:

matrix/  # The counts matrix data
├── barcode  # String array of cell barcodes
├── probe  # String array of probe/gapfill pairs (i.e. the features)
├── data/  # The counts data, a sparse 2D array of shape (num_cells, num_probes)
├── total_reads/  # A sparse 2D array of shape (num_cells, num_probes) containing the total number of reads captured for each cell and probe
└── percent_supporting/  # A sparse 2D array of shape (num_cells, num_probes) containing the percentage of reads supporting each probe for each cell

cell_metadata/  # Metadata about the cells
├── columns  # String array of column names
└── ...  # Each column as an array as defined by the columns list

probe_metadata/  # Metadata about the probes/features
├── name  # String array of probe names
├── gene  # (Optional) String array of gene names associated with the probes
├── lhs_probe  # String array of left-hand side sequences of the probes
├── rhs_probe  # String array of right-hand side sequences of the probes
├── gap_probe_sequence  # String array of the expected "mutant" sequence of the gapfill probes
├── original_gap_probe_sequence  # String array of the "wild-type" sequence of the gapfill probes
└── index  # Integer array of the index of the probe in the original probe definition file

attributes/  # HDF5 attributes
├── plex  # The multiplexing index of the sample
├── project  # Name of the sequencing data
├── created_date  # The timestamp of when the file was created
├── n_cells  # The number of cells in the dataset
├── n_probes  # The number of probes in the dataset
└── n_probe_gapfill_combinations  # The number of probe-gapfill combinations in the dataset
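
A minimal sketch for exploring a counts file with h5py follows. It lists every group and dataset and collects the attributes attached to the file root. The function name `describe_counts_file` is hypothetical, and reading the attributes from the file root is an assumption: the layout above does not say which HDF5 object they are attached to. The internal layout of the sparse arrays is likewise not assumed here, so the sketch only reports what it finds.

```python
import h5py


def describe_counts_file(path):
    """Return (entries, attrs) for an HDF5 file.

    entries: list of (name, kind, shape) for every group and dataset.
    attrs:   dict of attributes attached to the file root (an assumption
             about where GIFTwrap stores them).
    """
    entries = []
    with h5py.File(path, "r") as handle:

        def record(name, obj):
            # visititems calls this once per group/dataset below the root.
            if isinstance(obj, h5py.Dataset):
                entries.append((name, "dataset", obj.shape))
            else:
                entries.append((name, "group", None))

        handle.visititems(record)
        attrs = {key: handle.attrs[key] for key in handle.attrs}
    return entries, attrs


if __name__ == "__main__":
    # "counts.1.h5" is a placeholder path taken from the directory listing.
    found, root_attrs = describe_counts_file("counts.1.h5")
    for name, kind, shape in found:
        print(kind, name, shape if shape is not None else "")
    print(root_attrs)
```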