Processing AIRR-seq data with the ASC-pipeline

Annotation and genotyping with the allele-based method can be run in a Docker container. The image is based on the Immcantation Docker image and contains the pipeline for processing your IGH repertoire sequences.

IGH repertoire processing pipeline

The pipeline is assembled from several steps as described below:

IgBLAST (Ye et al. 2013) alignment against the allele similarity clusters reference set
- The IGH allele similarity cluster reference set includes the functional alleles, under modified names that represent the new clusters.
- Each amplicon length, S1 (full length of the V region) and S2 (BIOMED-2 primer-like V region length), has its own reference set. The current reference set version can be found here for S1, and here for S2.
Novel allele inference using the TIgGER package, followed by re-alignment of the sequences against the reference set extended with the inferred alleles.
If the repertoire is not naive, clonal inference is performed using Change-O. For each clone, a single representative with the fewest mutations is chosen.
- This step reduces the effect of SHM and clonal expansion on the genotype inference.
The allele-based genotype is inferred for the V alleles. For the D and J genes, the genotype is inferred using the TIgGER package. The repertoire is then aligned once more against the personal germline set.
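
The clonal representative step above can be illustrated with a small shell sketch. It is not the pipeline's own code. It assumes a tab-separated table with the AIRR `clone_id` field and a hypothetical `mu_count` column holding each sequence's mutation count, and it keeps, for every clone, the row with the fewest mutations (the first one seen on ties).

```shell
# Keep one representative per clone: the row with the fewest mutations.
# Assumptions of this sketch: tab-separated input with a header line,
# a `clone_id` column, and a hypothetical `mu_count` column.
pick_representatives() {
    awk -F'\t' '
        NR == 1 {
            # Locate columns by header name, then echo the header.
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        {
            c = $(col["clone_id"])
            m = $(col["mu_count"]) + 0
            if (!(c in best)) order[++n] = c      # remember first-seen order
            if (!(c in best) || m < best[c]) {    # strictly fewer mutations wins
                best[c] = m
                row[c] = $0
            }
        }
        END { for (k = 1; k <= n; k++) print row[order[k]] }
    ' "$1"
}
```

Calling `pick_representatives clones.tsv` (a hypothetical file name) prints the header followed by one row per clone.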

Run output structure

The pipeline writes its output to a nested folder named after the sample, containing the intermediate files of each step and the final analysis files.

Breakdown of the files and folders

Sample.tsv.gz - The repertoire aligned against the personal germline set, in AIRR TSV format.

Sample_V_genotype.tsv - The allele-based genotype inference of the V alleles.

Columns:
- subject - the sample name
- gene - the allele cluster
- alleles - the alleles present in the repertoire
- imgt_alleles - the IMGT nomenclature of the alleles
- counts - the number of reads for each allele
- absolute_fraction - the absolute fraction of the alleles
- absolute_threshold - the population-driven allele thresholds for genotype presence
- genotyped_alleles - the alleles which entered the genotype
- genotype_imgt_alleles - the IMGT nomenclature of the genotyped alleles
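
To inspect the V genotype table from the command line, a helper along these lines may be convenient. It is a generic sketch, not part of the pipeline: `select_columns` is a hypothetical function that prints chosen columns of a tab-separated file by header name, and it makes no assumption about how several alleles are delimited inside a cell.

```shell
# Print selected columns of a tab-separated file, chosen by header name.
# Usage: select_columns FILE NAME [NAME ...]
select_columns() {
    local file=$1
    shift
    awk -F'\t' -v OFS='\t' -v names="$*" '
        NR == 1 {
            # Map header names to column positions and validate the request.
            n = split(names, want, " ")
            for (i = 1; i <= NF; i++) col[$i] = i
            for (k = 1; k <= n; k++)
                if (!(want[k] in col)) {
                    print "missing column: " want[k] > "/dev/stderr"
                    exit 1
                }
        }
        {
            # The header line passes through here too, so names are printed first.
            line = $(col[want[1]])
            for (k = 2; k <= n; k++) line = line OFS $(col[want[k]])
            print line
        }
    ' "$file"
}
```

For example, `select_columns P1_I1_S1_V_genotype.tsv gene genotyped_alleles genotype_imgt_alleles` would list the genotyped alleles per allele cluster, assuming the file name prefix follows the sample name.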

Sample_geno.tsv - The genotype inference of the D and J genes.
source_germline - folder containing the initial germline set
novel_germline - if found, the germline set including the inferred novel alleles
personal_germline - the personal germline set for V, D, and J based on the genotype
analysis_files - the intermediate files.
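
After a run, the layout described above can be checked with a short script. This is a sketch under two assumptions: that "Sample" in the file names stands for the sample name, and that all items sit directly inside the sample's output folder. `check_outputs` is a hypothetical helper name.

```shell
# Report which expected outputs exist under a sample's output folder.
# Usage: check_outputs OUTPUT_DIR SAMPLE_NAME
check_outputs() {
    local dir=$1 sample=$2 status=0 item
    for item in "${sample}.tsv.gz" "${sample}_V_genotype.tsv" "${sample}_geno.tsv" \
                source_germline personal_germline analysis_files; do
        if [ -e "$dir/$item" ]; then
            echo "found    $item"
        else
            echo "missing  $item"
            status=1
        fi
    done
    # novel_germline is only present when novel alleles were inferred.
    [ -e "$dir/novel_germline" ] && echo "found    novel_germline (optional)"
    return "$status"
}
```

A call such as `check_outputs ~/P1/P1_I1_S1 P1_I1_S1` returns a non-zero status if any required item is missing.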

Running the docker

The processing pipeline for allele-based genotype inference is available on Docker Hub as peresay/suite.

ASC-based pipeline

To process your AIRR-seq IGH dataset, first pull the Docker image:

docker pull peresay/suite

Then you can run the group-pipeline, which infers genotypes using the ASC-based method.

Several parameters can be modified. For the full list, see the help message within the container.

# Arguments
DATA_DIR=~/P1
FASTA_FILE=/data/P1_I1_S1.fasta
SAMPLE_NAME=P1_I1_S1
NPROC=4

# Show the help message listing all available parameters
docker run peresay/suite -h

# Run the pipeline on a FASTA file mounted into the container
docker run -v "$DATA_DIR":/data:z peresay/suite \
    group-pipeline -f "$FASTA_FILE" -s "$SAMPLE_NAME" -t "$NPROC"

AIRR_FILE=/data/P1_I1_S1.tsv
# Run the pipeline on an AIRR-format TSV file mounted into the container
docker run -v "$DATA_DIR":/data:z peresay/suite \
    group-pipeline -f "$AIRR_FILE" -s "$SAMPLE_NAME" -t "$NPROC"
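
When a data directory holds several FASTA files, the same command can be repeated in a loop. The sketch below uses only the flags shown above (-f, -s, -t). Taking the sample name from the file name is an assumption, and `DOCKER` is a hypothetical variable added here so the commands can be previewed without running them.

```shell
# Run group-pipeline once per FASTA file in a data directory.
# Usage: run_all_samples DATA_DIR [NPROC]
run_all_samples() {
    local data_dir=$1 nproc=${2:-4} fasta sample
    for fasta in "$data_dir"/*.fasta; do
        [ -e "$fasta" ] || continue               # no matching files
        sample=$(basename "$fasta" .fasta)        # sample name from file name
        # DOCKER is a hypothetical override; it defaults to the docker CLI.
        ${DOCKER:-docker} run -v "$data_dir:/data:z" peresay/suite \
            group-pipeline -f "/data/$sample.fasta" -s "$sample" -t "$nproc"
    done
}
```

`run_all_samples ~/P1 4` processes every FASTA file in ~/P1, while `DOCKER=echo run_all_samples ~/P1 4` only prints the commands that would be run.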

Ye, Jian, Ning Ma, Thomas L. Madden, and James M. Ostell. 2013. “IgBLAST: An Immunoglobulin Variable Domain Sequence Analysis Tool.” Nucleic Acids Research 41 (W1): W34–40. https://doi.org/10.1093/nar/gkt382.