~

~

Introduction

SNP-based imputation approaches for human leukocyte antigen (HLA) typing take advantage of the extended haplotype structure within the major histocompatibility complex (MHC) to predict HLA classical alleles using dense SNP genotypes, such as those available on array-based panels in genome-wide association study (GWAS). These methods enable HLA analyses of classical alleles on existing SNP datasets genotyped in GWAS at no extra cost. Here, we describe the workflow of HIBAG, an imputation method with attribute bagging, for obtaining a sample’s HLA classical alleles using SNP data.

In this exercise, we will download a pre-built classifier from the HIBAG website, utilize it on a SNP dataset, and evaluate its predictive performance.

~

Figure 1: Computational workflow for HLA imputation using SNP data: the imputation with the model parameters, where the pre-built models are downloadable.

~

~

Imputation with pre-built models

1. Download the data package

Go to the HIBAG website (https://hibag.s3.amazonaws.com/2024_tutorial_for_mimb/index.html) and download the ZIP file “data_package.zip”, which contains the 1000 Genomes SNP data and HLA classical alleles. The SNP data in the exercise were customized for the Illumina ImmunoChip platform.

# download the data using command line
wget https://hibag.s3.amazonaws.com/2024_tutorial_for_mimb/data_package.zip
unzip data_package.zip

2. Format conversion

After downloading and uncompressing the ZIP file, use the SeqArray R package to convert the VCF file to a GDS file, since the HIBAG package can import the SNP data in a GDS file.

library(SeqArray)
## Loading required package: gdsfmt
# output a GDS file
seqVCF2GDS("IMMPUTE_KG_SNP.vcf.gz", "IMMPUTE_KG_SNP.gds")
## Mon Dec  4 15:22:58 2023
## Variant Call Format (VCF) Import:
##     file:
##         IMMPUTE_KG_SNP.vcf.gz (2.4M)
##     file format: VCFv4.2
##     genome reference: <unknown>
##     # of sets of chromosomes (ploidy): 2
##     # of samples: 930
##     genotype field: GT
##     genotype storage: bit2
##     compression method: LZMA_RA
##     # of samples: 930
## Output:
##     IMMPUTE_KG_SNP.gds
## Parsing 'IMMPUTE_KG_SNP.vcf.gz':
## + genotype/data   { Bit2 2x930x10268 LZMA_ra(6.72%), 361.7K }
## Digests:
##     sample.id  [md5: 43c31274f2366076adb93c6261f80d29]
##     variant.id  [md5: c92adf16d3e1ed625f4a57b3e75fefa8]
##     position  [md5: d46128ecb0825026ca4d69b365166eac]
##     chromosome  [md5: 4cb221272aa4b35790380139618a6abf]
##     allele  [md5: 1d448d1f16c2b07ebbe2b5f4486b961f]
##     genotype  [md5: e79fa2f38e2b1bc0ccc578d34a861be4]
##     phase  [md5: 6f5231f35977661f04054fba095d7c25]
##     annotation/id  [md5: 99a5b8b638328234edeca22f175b4021]
##     annotation/qual  [md5: 4d4c0b7c7303a227dde9cf486ddb5302]
##     annotation/filter  [md5: 57c09fbbdeb9b88ea387851a71a7d1f9]
##     annotation/info/PR  [md5: b5e1395323491a93d4b6444abb2073a5]
## Done.
## Mon Dec  4 15:23:00 2023
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'IMMPUTE_KG_SNP.gds' (449.3K)
##     # of fragments: 115
##     save to 'IMMPUTE_KG_SNP.gds.tmp'
##     rename 'IMMPUTE_KG_SNP.gds.tmp' (448.6K, reduced: 708B)
##     # of fragments: 56
## Mon Dec  4 15:23:00 2023

3. Download the pre-built models

Go to the website of HIBAG platform-specific models (https://hibag.s3.amazonaws.com/hlares_index.html) and download the two-field pre-built classifiers of Illumina ImmunoChip on the hg19 human reference genome “ImmunoChip-Broad-HLARES-HLA4-hg19.RData”. These models were built with multi-ethnic GSK HLARES samples, and can be used to impute two-field HLA types at the HLA–A, –B, –C, –DRB1, –DQA1, –DQB1 and –DPB1 loci. This file is also available in the data package “data_package.zip”.

4. Load models and SNP data

Start an R session and import the models and the SNP dataset of 930 individuals:

library(HIBAG)
## HIBAG (HLA Genotype Imputation with Attribute Bagging)
## Kernel Version: v1.5 (64-bit)
# load the pre-built models
model_lst <- get(load("ImmunoChip-Broad-HLARES-HLA4-hg19.RData"))
model_lst
## $A
## Gene: HLA-A
## Training dataset: 2901 samples X 1023 SNPs
##     # of HLA alleles: 83
##     # of individual classifiers: 500
##     total # of SNPs used: 1023
##     avg. # of SNPs in an individual classifier: 61.56
##         (sd: 5.65, min: 39, max: 78, median: 62.00)
##     avg. # of haplotypes in an individual classifier: 953.97
##         (sd: 151.59, min: 541, max: 1507, median: 949.00)
##     avg. out-of-bag accuracy: 95.82%
##         (sd: 0.45%, min: 94.28%, max: 97.06%, median: 95.84%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 8.238770e-08 9.695627e-08 1.078673e-06 4.675610e-05 2.301877e-04 9.584155e-04 
##         Max.         Mean           SD 
## 1.096047e-01 1.013148e-03 2.911058e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $B
## Gene: HLA-B
## Training dataset: 3886 samples X 964 SNPs
##     # of HLA alleles: 142
##     # of individual classifiers: 500
##     total # of SNPs used: 964
##     avg. # of SNPs in an individual classifier: 56.41
##         (sd: 4.05, min: 34, max: 68, median: 56.00)
##     avg. # of haplotypes in an individual classifier: 2010.78
##         (sd: 248.92, min: 1343, max: 3053, median: 1998.50)
##     avg. out-of-bag accuracy: 91.83%
##         (sd: 0.54%, min: 90.11%, max: 93.54%, median: 91.81%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 6.551217e-09 4.287264e-08 1.375979e-07 4.960921e-06 3.505774e-05 2.326189e-04 
##         Max.         Mean           SD 
## 2.504231e-01 7.186401e-04 8.469832e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $C
## Gene: HLA-C
## Training dataset: 2916 samples X 1002 SNPs
##     # of HLA alleles: 49
##     # of individual classifiers: 500
##     total # of SNPs used: 1002
##     avg. # of SNPs in an individual classifier: 53.87
##         (sd: 4.40, min: 37, max: 66, median: 54.00)
##     avg. # of haplotypes in an individual classifier: 1011.51
##         (sd: 167.48, min: 549, max: 1554, median: 998.00)
##     avg. out-of-bag accuracy: 98.09%
##         (sd: 0.30%, min: 97.08%, max: 98.91%, median: 98.10%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 9.960546e-08 1.425722e-07 7.710465e-07 2.115832e-05 9.684796e-05 4.163268e-04 
##         Max.         Mean           SD 
## 3.666371e-01 1.024207e-03 1.277583e-02 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $DRB1
## Gene: HLA-DRB1
## Training dataset: 3713 samples X 883 SNPs
##     # of HLA alleles: 79
##     # of individual classifiers: 500
##     total # of SNPs used: 883
##     avg. # of SNPs in an individual classifier: 54.90
##         (sd: 3.63, min: 40, max: 63, median: 55.00)
##     avg. # of haplotypes in an individual classifier: 2019.67
##         (sd: 369.52, min: 1187, max: 3214, median: 1999.00)
##     avg. out-of-bag accuracy: 90.48%
##         (sd: 0.65%, min: 88.45%, max: 92.57%, median: 90.49%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 2.245261e-08 6.716674e-08 1.393374e-07 4.955265e-06 3.684551e-05 2.117110e-04 
##         Max.         Mean           SD 
## 1.205615e-01 4.348458e-04 3.912193e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $DQA1
## Gene: HLA-DQA1
## Training dataset: 2727 samples X 850 SNPs
##     # of HLA alleles: 19
##     # of individual classifiers: 500
##     total # of SNPs used: 850
##     avg. # of SNPs in an individual classifier: 47.96
##         (sd: 3.83, min: 33, max: 59, median: 48.00)
##     avg. # of haplotypes in an individual classifier: 800.41
##         (sd: 146.20, min: 456, max: 1412, median: 793.50)
##     avg. out-of-bag accuracy: 95.60%
##         (sd: 0.43%, min: 94.18%, max: 96.83%, median: 95.63%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 1.543247e-07 3.541336e-07 9.058163e-07 5.912263e-05 2.757513e-04 9.854377e-04 
##         Max.         Mean           SD 
## 2.495200e-01 1.175183e-03 8.615668e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $DQB1
## Gene: HLA-DQB1
## Training dataset: 2985 samples X 932 SNPs
##     # of HLA alleles: 27
##     # of individual classifiers: 500
##     total # of SNPs used: 932
##     avg. # of SNPs in an individual classifier: 49.53
##         (sd: 4.24, min: 34, max: 62, median: 50.00)
##     avg. # of haplotypes in an individual classifier: 943.40
##         (sd: 183.06, min: 599, max: 1640, median: 933.00)
##     avg. out-of-bag accuracy: 98.51%
##         (sd: 0.28%, min: 97.56%, max: 99.23%, median: 98.51%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 1.459782e-07 3.004167e-07 8.940742e-07 4.167288e-05 1.937619e-04 6.543891e-04 
##         Max.         Mean           SD 
## 1.782826e-01 7.630119e-04 5.270624e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution) 
## 
## $DPB1
## Gene: HLA-DPB1
## Training dataset: 2489 samples X 600 SNPs
##     # of HLA alleles: 49
##     # of individual classifiers: 500
##     total # of SNPs used: 600
##     avg. # of SNPs in an individual classifier: 40.04
##         (sd: 3.07, min: 30, max: 51, median: 40.00)
##     avg. # of haplotypes in an individual classifier: 691.18
##         (sd: 122.57, min: 402, max: 1051, median: 677.00)
##     avg. out-of-bag accuracy: 93.89%
##         (sd: 0.52%, min: 92.29%, max: 95.17%, median: 93.88%)
## Matching proportion:
##         Min.     0.1% Qu.       1% Qu.      1st Qu.       Median      3rd Qu. 
## 3.948463e-08 2.293549e-07 1.141199e-06 1.224186e-04 7.264846e-04 2.838569e-03 
##         Max.         Mean           SD 
## 3.006500e-01 2.108680e-03 7.148050e-03 
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip 
## Information: Training set -- GSK HLARES (4-digit resolution)
# load the SNP genotypes
geno <- hlaGDS2Geno("IMMPUTE_KG_SNP.gds", assembly="hg19")
## Open 'IMMPUTE_KG_SNP.gds'
## Import 8110 SNPs within the xMHC region on chromosome 6
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed, 1s
geno
## SNP genotypes: 
##     930 samples X 8110 SNPs
##     SNPs range from 28695149bp to 34097304bp on hg19
## Missing rate per SNP:
##     min: 0, max: 0, mean: 0, median: 0, sd: 0
## Missing rate per sample:
##     min: 0, max: 0, mean: 0, median: 0, sd: 0
## Minor allele frequency:
##     min: 0, max: 0.5, mean: 0.198875, median: 0.170968, sd: 0.14291
## Allelic information:
##        A/G        T/C        C/T        G/A        A/C        T/G        C/G 
##       1755       1652       1244       1207        366        354        320 
##        G/T        C/A        G/C        A/T        T/A       GT/G        0/C 
##        302        275        273        179        173          2          1 
##        0/G C/CAAATTAT      CAA/C       CT/C       GA/G      GCA/G    TAAAC/T 
##          1          1          1          1          1          1          1

5. Run the imputation for HLA-A

Run the HIBAG prediction algorithm for HLA–A:

# load the HLA-A model from an R object
model <- hlaModelFromObj(model_lst$A)

# run prediction with a single core
# hla_a <- hlaPredict(model, geno)
# this process takes ~15 minutes to run, so we use multiple cores to speed up the calculation:

# run with 8 cores via multithreading
pred <- hlaPredict(model, geno, cl=8)
## HIBAG model for HLA-A:
##     500 individual classifiers
##     1023 SNPs
##     83 unique HLA alleles: 01:01, 01:02, 01:25, ...
## Prediction:
##     based on the averaged posterior probabilities
## Model assembly: hg19, SNP assembly: hg19
## Matching the SNPs between the model and the test data:
##     match.type="--"   missing SNPs #                
##            Position         0 (0.0%) *being used [1]
##          Pos+Allele      507 (49.6%)             [2]
##     RefSNP+Position         0 (0.0%)                
##              RefSNP         0 (0.0%)                
##       [1]: useful if ambiguous strands on array-based platforms
##       [2]: suggested if the model and test data have been matched to the same reference genome
##     Model platform: Illumina 1M Duo / ImmunoChip
## # of SNP loci with flipped alleles: 509
## # of SNP loci with strand ambiguity (e.g., C/G): 67 (comparing allele frequencies)
## # of samples: 930
## CPU flags: 64-bit
## # of threads: 8
## Predicting (2023-12-04 15:23:10) 0%
## Predicting (2023-12-04 15:23:21) 10%
## Predicting (2023-12-04 15:23:33) 20%
## Predicting (2023-12-04 15:23:44) 30%
## Predicting (2023-12-04 15:23:56) 40%
## Predicting (2023-12-04 15:24:09) 50%
## Predicting (2023-12-04 15:24:21) 60%
## Predicting (2023-12-04 15:24:32) 70%
## Predicting (2023-12-04 15:24:43) 80%
## Predicting (2023-12-04 15:24:55) 90%
## Predicting (2023-12-04 15:25:07) 100%
summary(pred)
## Gene: HLA-A
## Range: [29910247bp, 29913661bp] on hg19
## # of samples: 930
## # of unique HLA alleles: 38
## # of unique HLA genotypes: 218
## Posterior probability:
##    [0,0.25)  [0.25,0.5)  [0.5,0.75)    [0.75,1] 
##    1 (0.1%)   36 (3.9%) 114 (12.3%) 779 (83.8%) 
## Matching proportion of SNP haplotype:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.175e-05 9.659e-05 5.542e-04 4.379e-04 7.793e-03 
## Dosages:
## $dosage - num [1:83, 1:930] 1.00 6.83e-05 1.37e-05 1.33e-17 6.09e-33 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:83] "01:01" "01:02" "01:25" "02:01" ...
##   ..$ : chr [1:930] "HG00096" "HG00097" "HG00099" "HG00100" ...
# data.frame for best-guess alleles
head(pred$value)
##   sample.id allele1 allele2      prob     matching
## 1   HG00096   01:01   29:02 0.9999180 0.0012829705
## 2   HG00097   03:01   24:02 0.9491243 0.0004386126
## 3   HG00099   01:01   68:01 0.9983373 0.0008045432
## 4   HG00100   01:01   01:01 0.9751905 0.0018377117
## 5   HG00101   11:01   32:01 0.9969865 0.0001925689
## 6   HG00102   01:01   01:01 0.9961324 0.0008196929
# write the imputation results to a csv file
write.csv(pred$value, file="result.csv", row.names=FALSE)

# or output to a VCF file for best-guess alleles and estimated allele dosages
hlaAlleleToVCF(pred, "result.vcf")
## Convert to dosage VCF format:
##     # of samples: 930
##     output: 'result.vcf'
##     # of unique HLA-A alleles: 38
##     [ 01:01,01:02,02:01,02:02,02:05,02:06,02:07,02:11,03:01,03:02,11:01,11:02,23:01,24:02,24:04,25:01,26:01,29:01,29:02,30:01,30:02,30:04,30:10,31:01,32:01,33:01,33:03,68:01,34:02,36:01,66:01,66:02,68:02,68:17,69:01,74:01,74:03,80:01 ]

The output VCF file looks like:

## ##fileformat=VCFv4.0
## ##fileDate=20231204
## ##source=HIBAG_v1.36.4
## ##reference=hg19
## ##contig=<ID=6,length=171115067>
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=DS,Number=1,Type=Float,Description="Dosage of HLA allele">
## #CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG ...
## 6    29911954    HLA-A*01:01 A   P_0101  .   PASS    .   GT:DS   1/0:9.9992e-01  0/0:1.5387e-18  1/0 ...
## 6    29911954    HLA-A*01:02 A   P_0102  .   PASS    .   GT:DS   0/0:6.8251e-05  0/0:4.9487e-25  0/0 ...
## 6    29911954    HLA-A*02:01 A   P_0201  .   PASS    .   GT:DS   0/0:1.3328e-17  0/0:1.7710e-20  0/0 ...
## 6    29911954    HLA-A*02:02 A   P_0202  .   PASS    .   GT:DS   0/0:6.0871e-33  0/0:3.1643e-21  0/0 ...
## 6    29911954    HLA-A*02:05 A   P_0205  .   PASS    .   GT:DS   0/0:5.3106e-33  0/0:1.1394e-24  0/0 ...
## 6    29911954    HLA-A*02:06 A   P_0206  .   PASS    .   GT:DS   0/0:1.8076e-20  0/0:8.9196e-21  0/0 ...
## 6    29911954    HLA-A*02:07 A   P_0207  .   PASS    .   GT:DS   0/0:1.0875e-25  0/0:1.2243e-29  0/0 ...

6. Confidence scores

The posterior probabilities (the column “prob”, i.e., pred$value$prob) calculated in the imputation process are used as confidence scores, and the call threshold is suggested to be 0.5. The call rate (the fraction of individuals successfully imputed) is 96% in this example according to the posterior probabilities greater than 0.5. The column “matching” is a measure or proportion describing how the SNP profile matches the SNP haplotypes observed in the training set, i.e., the likelihood of SNP profile in a random-mating population consisting of training haplotypes. Matching proportion is not directly related to confidence score, but a very low value of “matching” indicates that it is underrepresented in the training set.

7. Prediction accuracies

The data package contains the actual HLA–A, –B, –C, –DRB1 and –DQB1 types of 930 individuals in the 1000 Genomes project. We use these HLA classical alleles to evaluate the HIBAG imputation at the HLA–A locus:

# read the actual HLA classical alleles
df_hla <- read.csv("IMMPUTE_KG_HLA_2field.csv")

# create an object of HLA genotypes with true alleles
true_h <- hlaAllele(df_hla$SampleID, df_hla$A.1, df_hla$A.2, locus="A")
## using the default genome assembly (assembly="hg19")
# evaludate the prediction without and with a call threshold
ct0 <- hlaCompareAllele(true_h, pred, call.threshold=0)
ct5 <- hlaCompareAllele(true_h, pred, call.threshold=0.5)

The function hlaCompareAllele() returns the details of comparison between the imputed alleles and the true ones, including overall prediction accuracies at the individual and allele levels, a confusion matrix, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for each allele. For example, the overall allele-level accuracies are 94.8% and 93.4% with and without call threshold respectively, i.e., the number of correctly imputed alleles over the total number of alleles:

ct0$overall$acc.haplo
## [1] 0.9344086
ct5$overall$acc.haplo
## [1] 0.9479283

8. Markdown report

The HIBAG package provides utilities to create a report with tables for the prediction evaluation in the text, html, tex or markdown format. The function hlaReport() is used to generate an output file “allele.md”. The columns in the table are the allele, accuracy, sensitivity, specificity, positive predictive value and negative predictive value, etc.

hlaReport(ct0, export.fn="allele.md", type="markdown")

The output markdown file looks like:

Overall accuracy: 93.4%, Call rate: 100.0%

Allele # Valid. Freq. Valid. CR (%) ACC (%) SEN (%) SPE (%) PPV (%) NPV (%) Miscall (%)
01:01 111 0.0597 100.0 99.6 97.3 99.8 96.4 99.8 03:01 (67)
01:02 5 0.0027 100.0 99.8 20.0 100.0 100.0 99.8 01:01 (100)
02:01 362 0.1946 100.0 97.4 96.4 97.7 90.9 99.1 02:07 (92)
02:02 22 0.0118 100.0 99.6 68.2 100.0 100.0 99.6 02:05 (100)
02:03 17 0.0091 100.0 99.1 0.0 100.0 99.1 02:01 (100)
02:04 1 0.0005 100.0 99.9 0.0 100.0 99.9 02:01 (100)
02:05 15 0.0081 100.0 99.5 100.0 99.5 62.5 100.0
02:06 41 0.0220 100.0 99.9 100.0 99.9 95.3 100.0
02:07 35 0.0188 100.0 99.1 85.7 99.3 71.4 99.7 02:01 (100)
02:10 2 0.0011 100.0 99.9 0.0 100.0 99.9 02:06 (100)
02:11 5 0.0027 100.0 99.6 0.0 99.9 0.0 99.7 02:01 (100)
02:14 1 0.0005 100.0 99.9 0.0 100.0 99.9 02:05 (100)
02:17 1 0.0005 100.0 99.9 0.0 100.0 99.9 02:01 (100)
02:20 1 0.0005 100.0 99.9 0.0 100.0 99.9 02:01 (100)
02:22 2 0.0011 100.0 99.9 0.0 100.0 99.9 02:01 (50)
02:24 1 0.0005 100.0 99.9 0.0 100.0 99.9 02:01 (100)
03:01 176 0.0946 100.0 99.6 99.4 99.6 96.7 99.9 31:01 (100)
03:02 5 0.0027 100.0 99.8 20.0 100.0 100.0 99.8 03:01 (100)
11:02 10 0.0054 100.0 99.7 60.0 99.9 85.7 99.8 11:01 (100)
24:04 2 0.0011 100.0 99.9 50.0 100.0 100.0 99.9 24:02 (100)
24:20 1 0.0005 100.0 99.9 0.0 100.0 99.9 24:02 (100)
24:24 1 0.0005 100.0 99.9 0.0 100.0 99.9 23:01 (100)
26:01 45 0.0242 100.0 98.5 93.3 98.7 63.2 99.8 25:01 (67)
31:01 66 0.0355 100.0 99.9 100.0 99.9 98.5 100.0
33:01 11 0.0059 100.0 100.0 100.0 100.0 100.0 100.0
11:01 162 0.0871 100.0 99.7 99.4 99.8 97.6 99.9 11:02 (100)
23:01 47 0.0253 100.0 99.9 97.9 99.9 97.9 99.9 26:01 (50)
24:02 234 0.1258 100.0 99.7 100.0 99.6 97.5 100.0
24:03 3 0.0016 100.0 99.8 0.0 100.0 99.8 24:02 (100)
24:08 1 0.0005 100.0 99.9 0.0 100.0 99.9 24:02 (100)
25:01 15 0.0081 100.0 99.8 86.7 99.9 86.7 99.9 26:01 (100)
26:02 1 0.0005 100.0 99.9 0.0 100.0 99.9 26:01 (100)
26:03 6 0.0032 100.0 99.7 0.0 100.0 99.7 26:01 (100)
26:08 3 0.0016 100.0 99.8 0.0 100.0 99.8 26:01 (100)
29:01 4 0.0022 100.0 99.8 100.0 99.8 61.5 100.0
29:02 54 0.0290 100.0 99.8 96.3 99.9 98.1 99.9 29:01 (100)
30:01 62 0.0333 100.0 100.0 100.0 100.0 100.0 100.0
30:02 51 0.0274 100.0 100.0 100.0 100.0 100.0 100.0
30:04 3 0.0016 100.0 100.0 100.0 100.0 100.0 100.0
30:10 1 0.0005 100.0 100.0 100.0 100.0 100.0 100.0
31:04 4 0.0022 100.0 99.8 0.0 100.0 99.8 26:01 (75)
32:01 41 0.0220 100.0 100.0 100.0 100.0 100.0 100.0
33:03 61 0.0328 100.0 99.8 100.0 99.8 96.1 100.0
34:02 14 0.0075 100.0 99.8 71.4 100.0 100.0 99.8 02:01 (50)
36:01 15 0.0081 100.0 100.0 100.0 100.0 100.0 100.0
66:01 11 0.0059 100.0 99.5 18.2 100.0 100.0 99.5 26:01 (100)
66:02 9 0.0048 100.0 99.9 77.8 100.0 100.0 99.9 68:01 (100)
68:01 51 0.0274 100.0 99.7 94.1 99.9 95.0 99.8 … (100)
68:02 35 0.0188 100.0 100.0 100.0 100.0 100.0 100.0
68:03 1 0.0005 100.0 99.9 0.0 100.0 99.9 … (100)
69:01 1 0.0005 100.0 100.0 100.0 100.0 100.0 100.0
74:01 25 0.0134 100.0 100.0 100.0 100.0 100.0 100.0
74:03 2 0.0011 100.0 99.9 50.0 100.0 100.0 99.9 02:05 (100)
80:01 4 0.0022 100.0 100.0 100.0 100.0 100.0 100.0

9. Call rate and threshold

The relationships among imputation accuracy, call rate and call threshold are shown in the following figures. The prediction accuracy increases with a higher call threshold. The function hlaReportPlot() generates a figure for call rate and call threshold respectively:

library(ggplot2)

hlaReportPlot(pred, true_h, fig="call.threshold")

hlaReportPlot(pred, true_h, fig="call.rate")

10. Repeat the steps for other genes

Repeat Step 5 to 9 for HLA–B, –C, –DRB1 and –DQB1 respectively. The overall accuracies of five HLA genes are shown in Table 1, ranging from 90% to 97%, with and without call threshold.

~

~

Table 1: Summary of the two-field prediction accuracies and call rates for the HLARES pre-built classifiers, using 1000 genomes HLA data as validation samples.

HLA-A HLA-B HLA-C HLA-DRB1 HLA-DQB1
# of validation samples 930 930 930 930 930
Accuracy, no call threshold
93.4% 91.1% 97.2% 90.8% 91.2%
Accuracy, call threshold=0.5
94.8% 96.1% 97.7% 96.4% 91.6%
Call rate (%) 96.0 84.8 97.7 79.5 98.4

~

~

Session information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.4.4   HIBAG_1.36.4    SeqArray_1.41.4 gdsfmt_1.36.1  
## [5] setwidth_1.0-4 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.7              utf8_1.2.4              generics_0.1.3         
##  [4] bitops_1.0-7            digest_0.6.33           magrittr_2.0.3         
##  [7] evaluate_0.22           grid_4.3.2              fastmap_1.1.1          
## [10] jsonlite_1.8.7          GenomeInfoDb_1.36.4     fansi_1.0.5            
## [13] scales_1.2.1            Biostrings_2.68.1       textshaping_0.3.7      
## [16] jquerylib_0.1.4         cli_3.6.1               rlang_1.1.1            
## [19] crayon_1.5.2            XVector_0.40.0          munsell_0.5.0          
## [22] withr_2.5.2             cachem_1.0.8            yaml_2.3.7             
## [25] tools_4.3.2             parallel_4.3.2          dplyr_1.1.3            
## [28] colorspace_2.1-0        GenomeInfoDbData_1.2.10 BiocGenerics_0.46.0    
## [31] vctrs_0.6.4             R6_2.5.1                stats4_4.3.2           
## [34] lifecycle_1.0.3         zlibbioc_1.46.0         S4Vectors_0.38.2       
## [37] IRanges_2.34.1          ragg_1.2.6              pkgconfig_2.0.3        
## [40] RcppParallel_5.1.7      bslib_0.5.1             pillar_1.9.0           
## [43] gtable_0.3.4            glue_1.6.2              systemfonts_1.0.5      
## [46] highr_0.10              xfun_0.40               tibble_3.2.1           
## [49] GenomicRanges_1.52.1    tidyselect_1.2.0        rstudioapi_0.15.0      
## [52] knitr_1.45              farver_2.1.1            htmltools_0.5.6.1      
## [55] rmarkdown_2.25          labeling_0.4.3          compiler_4.3.2         
## [58] RCurl_1.98-1.12