Processing AIRR-seq data with the ASC-pipeline
Annotation and genotyping with the allele-based method can be run in a Docker container. The image is based on the Immcantation Docker image and contains the pipeline for processing your IGH repertoire sequences.
IGH repertoire processing pipeline
The pipeline is assembled from several steps as described below:
IgBLAST (Ye et al. 2013) alignment against the allele similarity clusters reference set
- The IGH allele similarity cluster reference set includes the functional alleles under modified names that represent the new clusters.
- Each amplicon length, S1 (full length of the V region) and S2 (BIOMED-2 primer-like V region length), has its own reference set. The current reference set version can be found here for S1, and here for S2.
Inferring novel alleles using the TIgGER package and re-aligning the sequences against the additional alleles.
If the repertoire is not naive, clonal inference is performed using Change-O. For each clone, a single representative with the fewest mutations is chosen.
- This step reduces the effect of SHM and clonal expansion on the genotype inference.
The allele-based genotype is inferred for the V alleles; for D and J, the genotype is inferred using the TIgGER package. The repertoire is then aligned once more against the personal germline set.
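The clonal representative step can be illustrated with a small shell sketch. This is not the pipeline's own code: it assumes a tab-separated table with a header row, a `clone_id` column, and a per-sequence mutation count in a column called `mu_count` (a hypothetical name; use whatever column holds the mutation count in your data). The function name `pick_clone_representatives` is likewise only illustrative.

```shell
# Keep one row per clone: the row with the fewest mutations.
# Assumes a tab-separated table whose header contains clone_id and mu_count
# (mu_count is a hypothetical column name for the mutation count).
pick_clone_representatives() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map header names to positions
      if (!("clone_id" in col) || !("mu_count" in col)) {
        print "missing clone_id or mu_count column" > "/dev/stderr"
        exit 1
      }
      print                                   # pass the header through
      next
    }
    {
      id = $(col["clone_id"])
      mu = $(col["mu_count"]) + 0
      if (!(id in best)) order[++n] = id      # remember first-seen clone order
      if (!(id in best) || mu < best[id]) {   # ties keep the first row seen
        best[id] = mu
        row[id] = $0
      }
    }
    END { for (j = 1; j <= n; j++) print row[order[j]] }
  ' "$1"
}
```

Calling `pick_clone_representatives clones.tsv` prints the header followed by one row per clone, in the order the clones first appear in the input.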
Run output structure
The pipeline writes its output to a nested folder named after the sample, containing the intermediate files of each step and the final analysis files.
Breakdown of the files and folders
Sample.tsv.gz - The aligned repertoire with the personal germline in AIRR-tsv format
Sample_V_genotype.tsv - The allele-based genotype inference of the V alleles.
- subject: sample name
- gene: allele cluster
- alleles: the alleles present in the repertoire
- imgt_alleles: the IMGT nomenclature of the alleles
- counts: the number of reads for each allele
- absolute_fraction: the absolute fraction of the alleles
- absolute_threshold: the population-driven allele thresholds for genotype presence
- genotyped_alleles: the alleles which entered the genotype
- genotype_imgt_alleles: the IMGT nomenclature of the alleles
Sample_geno.tsv - The genotype inference of the D and J genes.
source_germline - folder containing the initial germline set
novel_germline - if found, the germline set including the inferred novel alleles
personal_germline - the personal germline set for V, D, and J based on the genotype
analysis_files - the intermediate files.
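To check a finished run, the V genotype table can be summarized from the shell. The sketch below assumes the file is tab-separated with a header row containing the `gene` and `genotyped_alleles` column names described above. The function name `genotype_summary` and the example path are hypothetical; adjust them to your own output folder.

```shell
# Print the gene and genotyped_alleles columns of a V genotype table.
# Assumes a tab-separated file whose header contains these column names.
genotype_summary() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map header names to positions
      if (!("gene" in col) || !("genotyped_alleles" in col)) {
        print "missing expected columns" > "/dev/stderr"
        exit 1
      }
      next
    }
    { print $(col["gene"]) "\t" $(col["genotyped_alleles"]) }
  ' "$1"
}

# Hypothetical location; change it to where your run wrote its output.
GENOTYPE_FILE=~/P1/P1_I1_S1/P1_I1_S1_V_genotype.tsv
if [ -f "$GENOTYPE_FILE" ]; then
  genotype_summary "$GENOTYPE_FILE"
fi
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between versions.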
Running the docker
The processing pipeline for allele-based genotype inference is available on Docker Hub as peresay/suite.
ASC-based pipeline
To process your AIRR-seq IGH dataset, first pull the Docker image:
docker pull peresay/suite
Then you can run the group-pipeline, which infers genotypes using the ASC-based method.
Several parameters can be modified. For the full list, see the help guide within the Docker image.
# Arguments
DATA_DIR=~/P1
FASTA_FILE=/data/P1_I1_S1.fasta
SAMPLE_NAME=P1_I1_S1
NPROC=4
# Show the help guide listing all available parameters
docker run peresay/suite -h
# Run pipeline for fasta file in docker image
docker run -v $DATA_DIR:/data:z peresay/suite \
-f $FASTA_FILE -s $SAMPLE_NAME -t $NPROC group-pipeline
AIRR_FILE=/data/P1_I1_S1.tsv
# Run pipeline for airr format file in docker image.
docker run -v $DATA_DIR:/data:z peresay/suite \
-f $AIRR_FILE -s $SAMPLE_NAME -t $NPROC group-pipeline
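When a folder holds several FASTA files, the same command can be repeated per sample. The following is a minimal sketch under a few assumptions: every input file ends in `.fasta`, the sample name is the file name without that extension, and the `-f`, `-s`, `-t` flags and `group-pipeline` are used exactly as in the commands above. The `run_sample` function and the `DRY_RUN` switch are illustrative helpers, not part of the pipeline. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
# Print (DRY_RUN=1) or execute (DRY_RUN=0) the pipeline command for every
# FASTA file in DATA_DIR. Sample names are derived from the file names.
DATA_DIR=${DATA_DIR:-$HOME/P1}
NPROC=${NPROC:-4}
DRY_RUN=${DRY_RUN:-1}

run_sample() {
  fasta=$1
  sample=$(basename "$fasta" .fasta)
  # Build the command as positional parameters so quoting is preserved.
  set -- docker run -v "$DATA_DIR:/data:z" peresay/suite \
    -f "/data/$(basename "$fasta")" -s "$sample" -t "$NPROC" group-pipeline
  if [ "$DRY_RUN" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

for fasta in "$DATA_DIR"/*.fasta; do
  [ -e "$fasta" ] || continue   # skip when no FASTA files are present
  run_sample "$fasta"
done
```

Review the printed commands first, then rerun with `DRY_RUN=0` to start the containers.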