Allele Similarity Cluster (ASC)

The term allele similarity cluster (ASC) describes a group of alleles whose germline sequences are close to one another. Proximity is defined as the Levenshtein distance between the coding regions of the alleles’ germline sequences. A distance matrix of the Levenshtein distances between all allele pairs is constructed, and a hierarchical clustering tree is computed from it. The tree leaves are then clustered at 95% similarity, which creates the allele clusters.
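The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind this reference book: the text does not state the linkage method or how similarity is normalized, so the sketch assumes complete linkage and a distance normalized by the length of the longer sequence. The function names (`levenshtein`, `distance_matrix`, `cluster`) are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(seqs):
    """Pairwise distances, normalized by the longer sequence (an assumption)."""
    dist = {}
    for x, y in combinations(seqs, 2):
        d = levenshtein(seqs[x], seqs[y])
        dist[frozenset((x, y))] = d / max(len(seqs[x]), len(seqs[y]))
    return dist


def cluster(seqs, similarity=0.95):
    """Agglomerative clustering (complete linkage, assumed), cut at `similarity`."""
    cutoff = 1.0 - similarity
    dist = distance_matrix(seqs)
    clusters = [{name} for name in seqs]

    def linkage(c1, c2):
        # Complete linkage: the largest distance between members of two clusters.
        return max(dist[frozenset((x, y))] for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > cutoff:
            break  # the closest pair is already below the similarity threshold
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

Calling `cluster` on a dictionary that maps allele names to sequences returns a list of sets, one per cluster. For a full reference set, a dedicated hierarchical clustering library would be considerably faster than this quadratic pure-Python loop.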

This IGH reference book uses the currently available IMGT IGHV reference. We selected the functional alleles and trimmed the sequences at position 318. We discarded alleles that are truncated in their 5’ region, as well as those that are shorter than 318 nucleotides.
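The filtering steps above might look as follows. This is a sketch under stated assumptions: each record is taken to be a tuple of allele name, functionality label, and IMGT-gapped sequence; functional alleles are assumed to carry the label "F"; and a 5’ truncation is assumed to appear as leading "." gap characters. The function name `prepare_reference` is hypothetical.

```python
V_END = 318  # last retained position, as stated in the text


def prepare_reference(records):
    """Filter and trim (allele_name, functionality, gapped_sequence) records."""
    kept = {}
    for name, functionality, seq in records:
        if functionality != "F":      # keep functional alleles only (assumed label)
            continue
        if len(seq) < V_END:          # shorter than 318 nucleotides
            continue
        trimmed = seq[:V_END]         # trim at position 318
        if trimmed.startswith("."):   # assumed marker of a 5' truncation
            continue
        kept[name] = trimmed
    return kept
```

The returned dictionary maps each retained allele name to its trimmed sequence.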

Library amplicon length

Ideally, every repertoire dataset would cover the entire V region, but this is not always the case. Hence, we adapted our protocols to fit libraries with partial V coverage. As a starting point, we chose two partial library amplicon lengths: the BIOMED-2 primer coverage and the Adaptive region coverage. The table below summarizes the naming of each amplicon length; see Figure 1 for an illustration of the coverage:

Library amplicon length | Coverage | Similar known protocol
S1 | Full length, 1 to 318 (IMGT numbering) | 5’ RACE
S2 | Starting within the framework 1 region | BIOMED-2
S3 | End of the V region | Adaptive

Figure 1: V library amplicon length. Each row is a different V coverage: S1 for full length, S2 for BIOMED-2 primers, and S3 for Adaptive coverage. The colors indicate the V regions according to IMGT numbering, and dark gray represents the IMGT gaps.
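Partial coverage means that alleles differing only outside the sequenced window become identical. The sketch below shows one way to detect such alleles for a given window. Only the S1 window (1 to 318) is given in the text; the start position in the usage note is a placeholder, not the real S2 or S3 primer site, and the function name `indistinguishable` is hypothetical.

```python
from collections import defaultdict


def indistinguishable(seqs, start, end=318):
    """Group alleles that are identical within positions start..end (1-based, inclusive)."""
    by_window = defaultdict(list)
    for name, seq in seqs.items():
        by_window[seq[start - 1:end]].append(name)
    # Only windows shared by more than one allele are ambiguous.
    return [sorted(names) for names in by_window.values() if len(names) > 1]
```

For example, two alleles that differ only at position 10 are reported as one ambiguous group for a window starting at a placeholder position such as 50, but not for the full-length window starting at position 1.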

Genes to allele clusters

To illustrate the relationship between the alleles as named by the International Union of Immunological Societies (IUIS) and IMGT (Giudicelli and Lefranc 1999) and the allele clusters, we created Figure 2.

Figure 2: Hierarchical clustering of the functional IGH germline set. The inner layer shows a dendrogram of the clustering; the dotted circles indicate sequence similarities of 75% (orange) and 95% (blue). The dendrogram branches are colored by the 75% sequence similarity. The first colored circle shows the clusters and alleles for the library amplicon length S1, the second circle for S2, and the third for S3. White indicates alleles that cannot be distinguished in the library’s germline set.

Alleles’ group summary

Length | Family | Group | Alleles
IUIS/IMGT | 7 | 54 | 286
S1 | 8 | 47 | 279
S2 | | 44 | 247
S3 | | 11 | 130

Creating ASC based on reference set

The application below demonstrates the creation of ASCs from a given reference set. Either use the example human IGH reference set within the app, or upload a reference set of your own. The application includes four tabs:

1. The ASC allele table, with the following columns:
   - imgt_allele - the original IUIS/IMGT allele name
   - alleles_family - the ASC family cluster number
   - alleles_group - the ASC group cluster number
   - new_allele - the allele name given by the ASC
   - duplicated - whether the allele has a duplicate within the set
2. An MDS plot of the input germline reference set. The two axes are the first two components. Each dot in the plot is an allele. The dots are colored by the original genes (if present in the allele name).
3. ASC cluster plot - shows the hierarchical clustering of the input germline reference set. The outermost circle shows the allele names. The second layer shows the ASC groups; each group is labeled and colored. The third circle is the clustering dendrogram, with branches colored by the ASC families. The blue and orange dashed lines mark the 95% and 75% similarity ASC thresholds, respectively.
4. ASC match plot - shows the matching between the IUIS/IMGT genes and the ASC clusters.

References

Giudicelli, Veronique, and Marie-Paule Lefranc. 1999. “Ontology for Immunogenetics: The IMGT-ONTOLOGY.” Bioinformatics 15 (12): 1047–54.