A reliability analysis: human trafficking curriculum assessment tool (HT-CAT) for health care provider human trafficking trainings

Setting
HEAL Trafficking, the international health care nonprofit focused on trafficking as a public health issue, created a compendium of available human trafficking training resources for health care professionals between 2015 and 2019. The compendium was based on multiple literature reviews [18] as well as repeated open calls to the HEAL network for members to add relevant trainings. All trainings were web-based media, such as online modules and webinars, rather than written literature such as review articles or other print educational materials. This process yielded a compendium of 58 English-language trainings intended to inform, educate, or offer guidance to health care professionals on human trafficking.
Assessment
HT-CAT was designed to evaluate basic, introductory ("101"-level) human trafficking trainings for health professionals, so exclusion criteria were applied to the compendium. From the initial pool of 58 trainings, trainings were excluded for a specialized focus (for example, on domestic violence, sexual assault response, or trauma), for being educational resources aimed primarily at non-health audiences, for technical issues (such as defunct links), for financial barriers to access, and for redundancy. This yielded 24 human trafficking trainings for analysis. The researchers created a Qualtrics version of HT-CAT containing its key domains (Table 1) and items (Additional File A). The HT-CAT domains include “Design,” “Overview,” “Health Impact,” “Identification & Assessment,” and “Response & Followup.” After familiarizing themselves with HT-CAT, reviewers evaluated the 24 trainings; each training was reviewed by three or more reviewers, with an average of four reviews per training (range 3–6). The reviews took place between June 2019 and December 2019. The results of these reviews were exported from Qualtrics into a spreadsheet for analysis. The individual trainings are presented in a de-identified fashion using arbitrary identifiers (e.g., “Training 1”). The list of trainings reviewed can be found in Additional File B.
Statistical analysis
The overarching purpose of this analysis was to examine interrater reliability: the extent to which different individuals extract similar information when reviewing the same training curriculum. The focus was on describing variation across trainings (i.e., did raters of a specific training agree, and did this vary across trainings?) as well as across specific items within each training (i.e., on which items were raters most and least likely to agree?). A variety of statistical tools for assessing interrater reliability were employed; each is informed by the level of measurement of the quantities being assessed, and they are similar to those used in comparable efforts [21, 22].
For each training, the intraclass correlation (ICC) was used to assess agreement between raters on the basis of summed domain scores. This metric is commonly used to assess agreement between multiple coders on numeric variables, with values in excess of 0.75 considered an indicator of excellent agreement [23]. The ICC was calculated with the function icc from the R package irr, specifying a two-way model [24]. To compare interrater reliability between trainings on the basis of scores for the individual assessment items, we used Fleiss’ kappa, computed for each training via irr::kappam.fleiss [22]. This statistic is appropriate because each item is scored as binary and each training has multiple raters. Kappa values in excess of 0.4, 0.6, and 0.8 suggest moderate, substantial, and almost perfect agreement, respectively [25].
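For illustration, a minimal sketch of these per-training calculations (not the study’s analysis code) is shown below, assuming ratings and item_ratings are matrices for a single training with raters in columns; the object names are hypothetical.

```r
library(irr)

# ratings: summed domain scores for one training (rows = domains, cols = raters)
icc_fit <- icc(ratings, model = "twoway")   # two-way ICC model, as described above
icc_fit$value                               # ICC point estimate

# item_ratings: binary item scores for one training (rows = items, cols = raters)
kappa_fit <- kappam.fleiss(item_ratings)    # Fleiss' kappa for multiple raters
kappa_fit$value                             # kappa point estimate
```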
To assess reliability for specific items (i.e., across all trainings, for a specific item and any given set of raters, to what extent was there agreement on how that item should be scored), we split the data into 34 item-specific matrices, in which each row represented a specific training and the scores from the designated raters were spread across columns. We used Krippendorff’s alpha, calculated via the R package krippendorffsalpha, to examine variation in agreement between specific items [26]. This statistic accommodates the varying number of raters across trainings; values in excess of 0.8 suggest satisfactory agreement [27].
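A sketch of the item-level calculation follows, assuming item_matrix is one of the 34 item-specific matrices (rows = trainings, columns = rater slots, with NA where a slot is unused); the exact arguments shown for krippendorffs.alpha (e.g., level = "nominal" for binary items) reflect our assumptions about the package interface rather than the study code.

```r
library(krippendorffsalpha)

# item_matrix: rows = trainings, columns = rater slots, NA for unused slots
alpha_fit <- krippendorffs.alpha(item_matrix, level = "nominal")
summary(alpha_fit)   # point estimate and uncertainty for this item
```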
For each analysis, we present the relevant IRR statistic and its uncertainty for each training or for each assessment item. These quantities are then pooled using a DerSimonian-Laird random effects meta-analysis to obtain an overall estimate of reliability, executed with the function rma with method = “DL” in the R package metafor [28]. Practically, the DerSimonian-Laird random effects model requires both the effect size and the variance of the effect size for each reliability statistic. For Fleiss’ kappa and the ICC, the variance was estimated via the bootstrap, a resampling method that repeatedly samples the data with replacement and re-estimates the reliability metric on each replicate to estimate the standard error. This procedure was performed with the R package boot (Canty and Ripley, 2024) [29]. Bootstrapped standard errors were necessary primarily because irr::kappam.fleiss, which estimates Fleiss’ kappa, does not provide an analytic standard error. More generally, the bootstrap offers a robust mechanism for computing uncertainties regardless of the underlying distribution [30].
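The bootstrap and pooling steps might be sketched as follows, assuming per-training kappa estimates and their bootstrapped standard errors are collected in the hypothetical vectors kappa_estimates and kappa_ses; the resampling scheme (items resampled with replacement within a training) is likewise an assumption for illustration.

```r
library(boot)
library(metafor)

# Bootstrap SE of Fleiss' kappa for one training: resample items (rows) with replacement
kappa_stat <- function(dat, idx) irr::kappam.fleiss(dat[idx, ])$value
boot_out   <- boot(item_ratings, statistic = kappa_stat, R = 1000)
kappa_se   <- sd(boot_out$t)                 # bootstrap standard error

# Pool per-training estimates with a DerSimonian-Laird random-effects model
pooled <- rma(yi = kappa_estimates, sei = kappa_ses, method = "DL")
summary(pooled)                              # overall reliability estimate
```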
For the DerSimonian-Laird random effects model applied to Krippendorff’s alpha, the analytic uncertainty reported by the R package krippendorffsalpha was used; specifically, the confidence intervals the package provides were converted into standard errors for the analysis. Although consistency in variance estimation across the different reliability metrics would have been desirable, the data configuration used to produce Krippendorff’s alpha (i.e., an item-specific dataset in which every row contains the ratings for a specific item from a specific training) had an unequal number of observations across columns because the number of raters varied. The resulting missing observations made bootstrapping infeasible for estimating the standard error in this context. Instead, the analytic standard errors provided by the package were used to fit the DerSimonian-Laird random effects model for Krippendorff’s alpha.
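The conversion from a reported 95% confidence interval to a standard error, under a normal approximation, could be sketched as below; the vectors alpha_estimates, ci_lower, and ci_upper are hypothetical placeholders for the item-level output.

```r
# Approximate SE from a 95% CI assuming normality: half-width divided by z(0.975)
alpha_ses <- (ci_upper - ci_lower) / (2 * qnorm(0.975))

# Pool item-level alpha estimates with the same DerSimonian-Laird model
pooled_alpha <- metafor::rma(yi = alpha_estimates, sei = alpha_ses, method = "DL")
summary(pooled_alpha)
```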