~
~
SNP-based imputation approaches for human leukocyte antigen (HLA) typing take advantage of the extended haplotype structure within the major histocompatibility complex (MHC) to predict HLA classical alleles using dense SNP genotypes, such as those available on array-based panels in genome-wide association study (GWAS). These methods enable HLA analyses of classical alleles on existing SNP datasets genotyped in GWAS at no extra cost. Here, we describe the workflow of HIBAG, an imputation method with attribute bagging, for obtaining a sample’s HLA classical alleles using SNP data.
In this exercise, we will download a pre-built classifier from the HIBAG website, utilize it on a SNP dataset, and evaluate its predictive performance.
~
Figure 1: Computational workflow for HLA imputation using SNP data: the imputation with the model parameters, where the pre-built models are downloadable.
~
~
Go to the HIBAG website (https://hibag.s3.amazonaws.com/2024_tutorial_for_mimb/index.html) and download the ZIP file “data_package.zip”, which contains the 1000 Genomes SNP data and HLA classical alleles. The SNP data in the exercise were customized for the Illumina ImmunoChip platform.
# download the data using command line
wget https://hibag.s3.amazonaws.com/2024_tutorial_for_mimb/data_package.zip
unzip data_package.zip
After downloading and uncompressing the ZIP file, use the SeqArray R package to convert the VCF file to a GDS file, since the HIBAG package can import the SNP data in a GDS file.
library(SeqArray)
## Loading required package: gdsfmt
# output a GDS file
seqVCF2GDS("IMMPUTE_KG_SNP.vcf.gz", "IMMPUTE_KG_SNP.gds")
## Mon Dec 4 15:22:58 2023
## Variant Call Format (VCF) Import:
## file:
## IMMPUTE_KG_SNP.vcf.gz (2.4M)
## file format: VCFv4.2
## genome reference: <unknown>
## # of sets of chromosomes (ploidy): 2
## # of samples: 930
## genotype field: GT
## genotype storage: bit2
## compression method: LZMA_RA
## # of samples: 930
## Output:
## IMMPUTE_KG_SNP.gds
## Parsing 'IMMPUTE_KG_SNP.vcf.gz':
## + genotype/data { Bit2 2x930x10268 LZMA_ra(6.72%), 361.7K }
## Digests:
## sample.id [md5: 43c31274f2366076adb93c6261f80d29]
## variant.id [md5: c92adf16d3e1ed625f4a57b3e75fefa8]
## position [md5: d46128ecb0825026ca4d69b365166eac]
## chromosome [md5: 4cb221272aa4b35790380139618a6abf]
## allele [md5: 1d448d1f16c2b07ebbe2b5f4486b961f]
## genotype [md5: e79fa2f38e2b1bc0ccc578d34a861be4]
## phase [md5: 6f5231f35977661f04054fba095d7c25]
## annotation/id [md5: 99a5b8b638328234edeca22f175b4021]
## annotation/qual [md5: 4d4c0b7c7303a227dde9cf486ddb5302]
## annotation/filter [md5: 57c09fbbdeb9b88ea387851a71a7d1f9]
## annotation/info/PR [md5: b5e1395323491a93d4b6444abb2073a5]
## Done.
## Mon Dec 4 15:23:00 2023
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'IMMPUTE_KG_SNP.gds' (449.3K)
## # of fragments: 115
## save to 'IMMPUTE_KG_SNP.gds.tmp'
## rename 'IMMPUTE_KG_SNP.gds.tmp' (448.6K, reduced: 708B)
## # of fragments: 56
## Mon Dec 4 15:23:00 2023
Go to the website of HIBAG platform-specific models (https://hibag.s3.amazonaws.com/hlares_index.html) and download the two-field pre-built classifiers of Illumina ImmunoChip on the hg19 human reference genome “ImmunoChip-Broad-HLARES-HLA4-hg19.RData”. These models were built with multi-ethnic GSK HLARES samples, and can be used to impute two-field HLA types at the HLA–A, –B, –C, –DRB1, –DQA1, –DQB1 and –DPB1 loci. This file is also available in the data package “data_package.zip”.
Start an R session and import the models and the SNP dataset of 930 individuals:
library(HIBAG)
## HIBAG (HLA Genotype Imputation with Attribute Bagging)
## Kernel Version: v1.5 (64-bit)
# load the pre-built models
model_lst <- get(load("ImmunoChip-Broad-HLARES-HLA4-hg19.RData"))
model_lst
## $A
## Gene: HLA-A
## Training dataset: 2901 samples X 1023 SNPs
## # of HLA alleles: 83
## # of individual classifiers: 500
## total # of SNPs used: 1023
## avg. # of SNPs in an individual classifier: 61.56
## (sd: 5.65, min: 39, max: 78, median: 62.00)
## avg. # of haplotypes in an individual classifier: 953.97
## (sd: 151.59, min: 541, max: 1507, median: 949.00)
## avg. out-of-bag accuracy: 95.82%
## (sd: 0.45%, min: 94.28%, max: 97.06%, median: 95.84%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 8.238770e-08 9.695627e-08 1.078673e-06 4.675610e-05 2.301877e-04 9.584155e-04
## Max. Mean SD
## 1.096047e-01 1.013148e-03 2.911058e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $B
## Gene: HLA-B
## Training dataset: 3886 samples X 964 SNPs
## # of HLA alleles: 142
## # of individual classifiers: 500
## total # of SNPs used: 964
## avg. # of SNPs in an individual classifier: 56.41
## (sd: 4.05, min: 34, max: 68, median: 56.00)
## avg. # of haplotypes in an individual classifier: 2010.78
## (sd: 248.92, min: 1343, max: 3053, median: 1998.50)
## avg. out-of-bag accuracy: 91.83%
## (sd: 0.54%, min: 90.11%, max: 93.54%, median: 91.81%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 6.551217e-09 4.287264e-08 1.375979e-07 4.960921e-06 3.505774e-05 2.326189e-04
## Max. Mean SD
## 2.504231e-01 7.186401e-04 8.469832e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $C
## Gene: HLA-C
## Training dataset: 2916 samples X 1002 SNPs
## # of HLA alleles: 49
## # of individual classifiers: 500
## total # of SNPs used: 1002
## avg. # of SNPs in an individual classifier: 53.87
## (sd: 4.40, min: 37, max: 66, median: 54.00)
## avg. # of haplotypes in an individual classifier: 1011.51
## (sd: 167.48, min: 549, max: 1554, median: 998.00)
## avg. out-of-bag accuracy: 98.09%
## (sd: 0.30%, min: 97.08%, max: 98.91%, median: 98.10%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 9.960546e-08 1.425722e-07 7.710465e-07 2.115832e-05 9.684796e-05 4.163268e-04
## Max. Mean SD
## 3.666371e-01 1.024207e-03 1.277583e-02
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $DRB1
## Gene: HLA-DRB1
## Training dataset: 3713 samples X 883 SNPs
## # of HLA alleles: 79
## # of individual classifiers: 500
## total # of SNPs used: 883
## avg. # of SNPs in an individual classifier: 54.90
## (sd: 3.63, min: 40, max: 63, median: 55.00)
## avg. # of haplotypes in an individual classifier: 2019.67
## (sd: 369.52, min: 1187, max: 3214, median: 1999.00)
## avg. out-of-bag accuracy: 90.48%
## (sd: 0.65%, min: 88.45%, max: 92.57%, median: 90.49%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 2.245261e-08 6.716674e-08 1.393374e-07 4.955265e-06 3.684551e-05 2.117110e-04
## Max. Mean SD
## 1.205615e-01 4.348458e-04 3.912193e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $DQA1
## Gene: HLA-DQA1
## Training dataset: 2727 samples X 850 SNPs
## # of HLA alleles: 19
## # of individual classifiers: 500
## total # of SNPs used: 850
## avg. # of SNPs in an individual classifier: 47.96
## (sd: 3.83, min: 33, max: 59, median: 48.00)
## avg. # of haplotypes in an individual classifier: 800.41
## (sd: 146.20, min: 456, max: 1412, median: 793.50)
## avg. out-of-bag accuracy: 95.60%
## (sd: 0.43%, min: 94.18%, max: 96.83%, median: 95.63%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 1.543247e-07 3.541336e-07 9.058163e-07 5.912263e-05 2.757513e-04 9.854377e-04
## Max. Mean SD
## 2.495200e-01 1.175183e-03 8.615668e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $DQB1
## Gene: HLA-DQB1
## Training dataset: 2985 samples X 932 SNPs
## # of HLA alleles: 27
## # of individual classifiers: 500
## total # of SNPs used: 932
## avg. # of SNPs in an individual classifier: 49.53
## (sd: 4.24, min: 34, max: 62, median: 50.00)
## avg. # of haplotypes in an individual classifier: 943.40
## (sd: 183.06, min: 599, max: 1640, median: 933.00)
## avg. out-of-bag accuracy: 98.51%
## (sd: 0.28%, min: 97.56%, max: 99.23%, median: 98.51%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 1.459782e-07 3.004167e-07 8.940742e-07 4.167288e-05 1.937619e-04 6.543891e-04
## Max. Mean SD
## 1.782826e-01 7.630119e-04 5.270624e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
##
## $DPB1
## Gene: HLA-DPB1
## Training dataset: 2489 samples X 600 SNPs
## # of HLA alleles: 49
## # of individual classifiers: 500
## total # of SNPs used: 600
## avg. # of SNPs in an individual classifier: 40.04
## (sd: 3.07, min: 30, max: 51, median: 40.00)
## avg. # of haplotypes in an individual classifier: 691.18
## (sd: 122.57, min: 402, max: 1051, median: 677.00)
## avg. out-of-bag accuracy: 93.89%
## (sd: 0.52%, min: 92.29%, max: 95.17%, median: 93.88%)
## Matching proportion:
## Min. 0.1% Qu. 1% Qu. 1st Qu. Median 3rd Qu.
## 3.948463e-08 2.293549e-07 1.141199e-06 1.224186e-04 7.264846e-04 2.838569e-03
## Max. Mean SD
## 3.006500e-01 2.108680e-03 7.148050e-03
## Genome assembly: hg19
## Platform: Illumina 1M Duo / ImmunoChip
## Information: Training set -- GSK HLARES (4-digit resolution)
# load the SNP genotypes
geno <- hlaGDS2Geno("IMMPUTE_KG_SNP.gds", assembly="hg19")
## Open 'IMMPUTE_KG_SNP.gds'
## Import 8110 SNPs within the xMHC region on chromosome 6
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 1s
geno
## SNP genotypes:
## 930 samples X 8110 SNPs
## SNPs range from 28695149bp to 34097304bp on hg19
## Missing rate per SNP:
## min: 0, max: 0, mean: 0, median: 0, sd: 0
## Missing rate per sample:
## min: 0, max: 0, mean: 0, median: 0, sd: 0
## Minor allele frequency:
## min: 0, max: 0.5, mean: 0.198875, median: 0.170968, sd: 0.14291
## Allelic information:
## A/G T/C C/T G/A A/C T/G C/G
## 1755 1652 1244 1207 366 354 320
## G/T C/A G/C A/T T/A GT/G 0/C
## 302 275 273 179 173 2 1
## 0/G C/CAAATTAT CAA/C CT/C GA/G GCA/G TAAAC/T
## 1 1 1 1 1 1 1
Run the HIBAG prediction algorithm for HLA–A:
# load the HLA-A model from an R object
model <- hlaModelFromObj(model_lst$A)
# run prediction with a single core
# hla_a <- hlaPredict(model, geno)
# this process takes ~15 minutes to run, so we use multiple cores to speed up the calculation:
# run with 8 cores via multithreading
pred <- hlaPredict(model, geno, cl=8)
## HIBAG model for HLA-A:
## 500 individual classifiers
## 1023 SNPs
## 83 unique HLA alleles: 01:01, 01:02, 01:25, ...
## Prediction:
## based on the averaged posterior probabilities
## Model assembly: hg19, SNP assembly: hg19
## Matching the SNPs between the model and the test data:
## match.type="--" missing SNPs #
## Position 0 (0.0%) *being used [1]
## Pos+Allele 507 (49.6%) [2]
## RefSNP+Position 0 (0.0%)
## RefSNP 0 (0.0%)
## [1]: useful if ambiguous strands on array-based platforms
## [2]: suggested if the model and test data have been matched to the same reference genome
## Model platform: Illumina 1M Duo / ImmunoChip
## # of SNP loci with flipped alleles: 509
## # of SNP loci with strand ambiguity (e.g., C/G): 67 (comparing allele frequencies)
## # of samples: 930
## CPU flags: 64-bit
## # of threads: 8
## Predicting (2023-12-04 15:23:10) 0%
## Predicting (2023-12-04 15:23:21) 10%
## Predicting (2023-12-04 15:23:33) 20%
## Predicting (2023-12-04 15:23:44) 30%
## Predicting (2023-12-04 15:23:56) 40%
## Predicting (2023-12-04 15:24:09) 50%
## Predicting (2023-12-04 15:24:21) 60%
## Predicting (2023-12-04 15:24:32) 70%
## Predicting (2023-12-04 15:24:43) 80%
## Predicting (2023-12-04 15:24:55) 90%
## Predicting (2023-12-04 15:25:07) 100%
summary(pred)
## Gene: HLA-A
## Range: [29910247bp, 29913661bp] on hg19
## # of samples: 930
## # of unique HLA alleles: 38
## # of unique HLA genotypes: 218
## Posterior probability:
## [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
## 1 (0.1%) 36 (3.9%) 114 (12.3%) 779 (83.8%)
## Matching proportion of SNP haplotype:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.175e-05 9.659e-05 5.542e-04 4.379e-04 7.793e-03
## Dosages:
## $dosage - num [1:83, 1:930] 1.00 6.83e-05 1.37e-05 1.33e-17 6.09e-33 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:83] "01:01" "01:02" "01:25" "02:01" ...
## ..$ : chr [1:930] "HG00096" "HG00097" "HG00099" "HG00100" ...
# data.frame for best-guess alleles
head(pred$value)
## sample.id allele1 allele2 prob matching
## 1 HG00096 01:01 29:02 0.9999180 0.0012829705
## 2 HG00097 03:01 24:02 0.9491243 0.0004386126
## 3 HG00099 01:01 68:01 0.9983373 0.0008045432
## 4 HG00100 01:01 01:01 0.9751905 0.0018377117
## 5 HG00101 11:01 32:01 0.9969865 0.0001925689
## 6 HG00102 01:01 01:01 0.9961324 0.0008196929
# write the imputation results to a csv file
write.csv(pred$value, file="result.csv", row.names=FALSE)
# or output to a VCF file for best-guess alleles and estimated allele dosages
hlaAlleleToVCF(pred, "result.vcf")
## Convert to dosage VCF format:
## # of samples: 930
## output: 'result.vcf'
## # of unique HLA-A alleles: 38
## [ 01:01,01:02,02:01,02:02,02:05,02:06,02:07,02:11,03:01,03:02,11:01,11:02,23:01,24:02,24:04,25:01,26:01,29:01,29:02,30:01,30:02,30:04,30:10,31:01,32:01,33:01,33:03,68:01,34:02,36:01,66:01,66:02,68:02,68:17,69:01,74:01,74:03,80:01 ]
The output VCF file looks like:
## ##fileformat=VCFv4.0
## ##fileDate=20231204
## ##source=HIBAG_v1.36.4
## ##reference=hg19
## ##contig=<ID=6,length=171115067>
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=DS,Number=1,Type=Float,Description="Dosage of HLA allele">
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG ...
## 6 29911954 HLA-A*01:01 A P_0101 . PASS . GT:DS 1/0:9.9992e-01 0/0:1.5387e-18 1/0 ...
## 6 29911954 HLA-A*01:02 A P_0102 . PASS . GT:DS 0/0:6.8251e-05 0/0:4.9487e-25 0/0 ...
## 6 29911954 HLA-A*02:01 A P_0201 . PASS . GT:DS 0/0:1.3328e-17 0/0:1.7710e-20 0/0 ...
## 6 29911954 HLA-A*02:02 A P_0202 . PASS . GT:DS 0/0:6.0871e-33 0/0:3.1643e-21 0/0 ...
## 6 29911954 HLA-A*02:05 A P_0205 . PASS . GT:DS 0/0:5.3106e-33 0/0:1.1394e-24 0/0 ...
## 6 29911954 HLA-A*02:06 A P_0206 . PASS . GT:DS 0/0:1.8076e-20 0/0:8.9196e-21 0/0 ...
## 6 29911954 HLA-A*02:07 A P_0207 . PASS . GT:DS 0/0:1.0875e-25 0/0:1.2243e-29 0/0 ...
The posterior probabilities (the column “prob”, i.e.,
pred$value$prob
) calculated in the imputation process are
used as confidence scores, and the call threshold is suggested to be
0.5. The call rate (the fraction of individuals successfully imputed) is
96% in this example according to the posterior probabilities greater
than 0.5. The column “matching” is a measure or proportion describing
how the SNP profile matches the SNP haplotypes observed in the training
set, i.e., the likelihood of SNP profile in a random-mating population
consisting of training haplotypes. Matching proportion is not directly
related to confidence score, but a very low value of “matching”
indicates that it is underrepresented in the training set.
The data package contains the actual HLA–A, –B, –C, –DRB1 and –DQB1 types of 930 individuals in the 1000 Genomes project. We use these HLA classical alleles to evaluate the HIBAG imputation at the HLA–A locus:
# read the actual HLA classical alleles
df_hla <- read.csv("IMMPUTE_KG_HLA_2field.csv")
# create an object of HLA genotypes with true alleles
true_h <- hlaAllele(df_hla$SampleID, df_hla$A.1, df_hla$A.2, locus="A")
## using the default genome assembly (assembly="hg19")
# evaludate the prediction without and with a call threshold
ct0 <- hlaCompareAllele(true_h, pred, call.threshold=0)
ct5 <- hlaCompareAllele(true_h, pred, call.threshold=0.5)
The function hlaCompareAllele()
returns the details of
comparison between the imputed alleles and the true ones, including
overall prediction accuracies at the individual and allele levels, a
confusion matrix, sensitivity, specificity, positive predictive value
(PPV) and negative predictive value (NPV) for each allele. For example,
the overall allele-level accuracies are 94.8% and 93.4% with and without
call threshold respectively, i.e., the number of correctly imputed
alleles over the total number of alleles:
ct0$overall$acc.haplo
## [1] 0.9344086
ct5$overall$acc.haplo
## [1] 0.9479283
The HIBAG package provides utilities to create a report with tables
for the prediction evaluation in the text, html, tex or markdown format.
The function hlaReport()
is used to generate an output file
“allele.md”. The columns in the table are the allele, accuracy,
sensitivity, specificity, positive predictive value and negative
predictive value, etc.
hlaReport(ct0, export.fn="allele.md", type="markdown")
The output markdown file looks like:
Overall accuracy: 93.4%, Call rate: 100.0%
Allele | # Valid. | Freq. Valid. | CR (%) | ACC (%) | SEN (%) | SPE (%) | PPV (%) | NPV (%) | Miscall (%) |
---|---|---|---|---|---|---|---|---|---|
01:01 | 111 | 0.0597 | 100.0 | 99.6 | 97.3 | 99.8 | 96.4 | 99.8 | 03:01 (67) |
01:02 | 5 | 0.0027 | 100.0 | 99.8 | 20.0 | 100.0 | 100.0 | 99.8 | 01:01 (100) |
02:01 | 362 | 0.1946 | 100.0 | 97.4 | 96.4 | 97.7 | 90.9 | 99.1 | 02:07 (92) |
02:02 | 22 | 0.0118 | 100.0 | 99.6 | 68.2 | 100.0 | 100.0 | 99.6 | 02:05 (100) |
02:03 | 17 | 0.0091 | 100.0 | 99.1 | 0.0 | 100.0 | – | 99.1 | 02:01 (100) |
02:04 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:01 (100) |
02:05 | 15 | 0.0081 | 100.0 | 99.5 | 100.0 | 99.5 | 62.5 | 100.0 | – |
02:06 | 41 | 0.0220 | 100.0 | 99.9 | 100.0 | 99.9 | 95.3 | 100.0 | – |
02:07 | 35 | 0.0188 | 100.0 | 99.1 | 85.7 | 99.3 | 71.4 | 99.7 | 02:01 (100) |
02:10 | 2 | 0.0011 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:06 (100) |
02:11 | 5 | 0.0027 | 100.0 | 99.6 | 0.0 | 99.9 | 0.0 | 99.7 | 02:01 (100) |
02:14 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:05 (100) |
02:17 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:01 (100) |
02:20 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:01 (100) |
02:22 | 2 | 0.0011 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:01 (50) |
02:24 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 02:01 (100) |
03:01 | 176 | 0.0946 | 100.0 | 99.6 | 99.4 | 99.6 | 96.7 | 99.9 | 31:01 (100) |
03:02 | 5 | 0.0027 | 100.0 | 99.8 | 20.0 | 100.0 | 100.0 | 99.8 | 03:01 (100) |
11:02 | 10 | 0.0054 | 100.0 | 99.7 | 60.0 | 99.9 | 85.7 | 99.8 | 11:01 (100) |
24:04 | 2 | 0.0011 | 100.0 | 99.9 | 50.0 | 100.0 | 100.0 | 99.9 | 24:02 (100) |
24:20 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 24:02 (100) |
24:24 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 23:01 (100) |
26:01 | 45 | 0.0242 | 100.0 | 98.5 | 93.3 | 98.7 | 63.2 | 99.8 | 25:01 (67) |
31:01 | 66 | 0.0355 | 100.0 | 99.9 | 100.0 | 99.9 | 98.5 | 100.0 | – |
33:01 | 11 | 0.0059 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
11:01 | 162 | 0.0871 | 100.0 | 99.7 | 99.4 | 99.8 | 97.6 | 99.9 | 11:02 (100) |
23:01 | 47 | 0.0253 | 100.0 | 99.9 | 97.9 | 99.9 | 97.9 | 99.9 | 26:01 (50) |
24:02 | 234 | 0.1258 | 100.0 | 99.7 | 100.0 | 99.6 | 97.5 | 100.0 | – |
24:03 | 3 | 0.0016 | 100.0 | 99.8 | 0.0 | 100.0 | – | 99.8 | 24:02 (100) |
24:08 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 24:02 (100) |
25:01 | 15 | 0.0081 | 100.0 | 99.8 | 86.7 | 99.9 | 86.7 | 99.9 | 26:01 (100) |
26:02 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | 26:01 (100) |
26:03 | 6 | 0.0032 | 100.0 | 99.7 | 0.0 | 100.0 | – | 99.7 | 26:01 (100) |
26:08 | 3 | 0.0016 | 100.0 | 99.8 | 0.0 | 100.0 | – | 99.8 | 26:01 (100) |
29:01 | 4 | 0.0022 | 100.0 | 99.8 | 100.0 | 99.8 | 61.5 | 100.0 | – |
29:02 | 54 | 0.0290 | 100.0 | 99.8 | 96.3 | 99.9 | 98.1 | 99.9 | 29:01 (100) |
30:01 | 62 | 0.0333 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
30:02 | 51 | 0.0274 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
30:04 | 3 | 0.0016 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
30:10 | 1 | 0.0005 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
31:04 | 4 | 0.0022 | 100.0 | 99.8 | 0.0 | 100.0 | – | 99.8 | 26:01 (75) |
32:01 | 41 | 0.0220 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
33:03 | 61 | 0.0328 | 100.0 | 99.8 | 100.0 | 99.8 | 96.1 | 100.0 | – |
34:02 | 14 | 0.0075 | 100.0 | 99.8 | 71.4 | 100.0 | 100.0 | 99.8 | 02:01 (50) |
36:01 | 15 | 0.0081 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
66:01 | 11 | 0.0059 | 100.0 | 99.5 | 18.2 | 100.0 | 100.0 | 99.5 | 26:01 (100) |
66:02 | 9 | 0.0048 | 100.0 | 99.9 | 77.8 | 100.0 | 100.0 | 99.9 | 68:01 (100) |
68:01 | 51 | 0.0274 | 100.0 | 99.7 | 94.1 | 99.9 | 95.0 | 99.8 | … (100) |
68:02 | 35 | 0.0188 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
68:03 | 1 | 0.0005 | 100.0 | 99.9 | 0.0 | 100.0 | – | 99.9 | … (100) |
69:01 | 1 | 0.0005 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
74:01 | 25 | 0.0134 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
74:03 | 2 | 0.0011 | 100.0 | 99.9 | 50.0 | 100.0 | 100.0 | 99.9 | 02:05 (100) |
80:01 | 4 | 0.0022 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | – |
The relationships among imputation accuracy, call rate and call
threshold are shown in the following figures. The prediction accuracy
increases with a higher call threshold. The function
hlaReportPlot()
generates a figure for call rate and call
threshold respectively:
library(ggplot2)
hlaReportPlot(pred, true_h, fig="call.threshold")
hlaReportPlot(pred, true_h, fig="call.rate")
Repeat Step 5 to 9 for HLA–B, –C, –DRB1 and –DQB1 respectively. The overall accuracies of five HLA genes are shown in Table 1, ranging from 90% to 97%, with and without call threshold.
~
~
Table 1: Summary of the two-field prediction accuracies and call rates for the HLARES pre-built classifiers, using 1000 genomes HLA data as validation samples.
HLA-A | HLA-B | HLA-C | HLA-DRB1 | HLA-DQB1 | |
---|---|---|---|---|---|
# of validation samples | 930 | 930 | 930 | 930 | 930 |
Accuracy, no call threshold | |||||
93.4% | 91.1% | 97.2% | 90.8% | 91.2% | |
Accuracy, call threshold=0.5 | |||||
94.8% | 96.1% | 97.7% | 96.4% | 91.6% | |
Call rate (%) | 96.0 | 84.8 | 97.7 | 79.5 | 98.4 |
~
~
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.4 HIBAG_1.36.4 SeqArray_1.41.4 gdsfmt_1.36.1
## [5] setwidth_1.0-4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.4 generics_0.1.3
## [4] bitops_1.0-7 digest_0.6.33 magrittr_2.0.3
## [7] evaluate_0.22 grid_4.3.2 fastmap_1.1.1
## [10] jsonlite_1.8.7 GenomeInfoDb_1.36.4 fansi_1.0.5
## [13] scales_1.2.1 Biostrings_2.68.1 textshaping_0.3.7
## [16] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1
## [19] crayon_1.5.2 XVector_0.40.0 munsell_0.5.0
## [22] withr_2.5.2 cachem_1.0.8 yaml_2.3.7
## [25] tools_4.3.2 parallel_4.3.2 dplyr_1.1.3
## [28] colorspace_2.1-0 GenomeInfoDbData_1.2.10 BiocGenerics_0.46.0
## [31] vctrs_0.6.4 R6_2.5.1 stats4_4.3.2
## [34] lifecycle_1.0.3 zlibbioc_1.46.0 S4Vectors_0.38.2
## [37] IRanges_2.34.1 ragg_1.2.6 pkgconfig_2.0.3
## [40] RcppParallel_5.1.7 bslib_0.5.1 pillar_1.9.0
## [43] gtable_0.3.4 glue_1.6.2 systemfonts_1.0.5
## [46] highr_0.10 xfun_0.40 tibble_3.2.1
## [49] GenomicRanges_1.52.1 tidyselect_1.2.0 rstudioapi_0.15.0
## [52] knitr_1.45 farver_2.1.1 htmltools_0.5.6.1
## [55] rmarkdown_2.25 labeling_0.4.3 compiler_4.3.2
## [58] RCurl_1.98-1.12