ABSTRACT
Background: Artificial intelligence could provide more accurate magnetic resonance imaging (MRI) predictors of successful clinical outcomes in targeted spine care.
Objective: To analyze the level of agreement between lumbar MRI reports created by a deep learning neural network (RadBot) and the radiologists' MRI reading.
Methods: The compressive pathology definitions were extracted from the radiologist lumbar MRI reports from 65 patients with a total of 383 levels for the central canal: (0) no disc bulge/protrusion/canal stenosis, (1) disc bulge without canal stenosis, (2) disc bulge resulting in canal stenosis, and (3) disc herniation/protrusion/extrusion resulting in canal stenosis. Both neural foramina were assessed as either (0) neural foraminal stenosis absent or (1) neural foraminal stenosis present. Reporting criteria for the pathologies at each disc level and, when available, the grading of severity were extracted, and the natural language processing model was used to generate a verbal and written report. The RadBot report was analyzed in the same way as the radiologist's MRI report. MRI reports were investigated by dichotomizing the data into 2 categories: normal and stenosis. The quality of the RadBot test was assessed by determining its sensitivity, specificity, and positive and negative predictive value as well as its reliability with the calculation of the Cronbach alpha and Cohen kappa using the radiologist MRI report as a gold standard.
Results: The authors found a RadBot sensitivity of 73.3%, a specificity of 88.4%, a positive predictive value of 80.3%, and a negative predictive value of 83.7%. The reliability analysis revealed the Cronbach alpha as 0.772. The highest individual values of the Cronbach alpha were 0.629 and 0.681 when compared to the MRI report by the radiologist, rendering values of 0.566 and 0.688, respectively. Analysis of interobserver reliability rendered an overall kappa for the RadBot of 0.627. Analysis of receiver operating characteristics (ROC) showed a value of 0.808 for the area under the ROC curve.
Conclusions: Deep learning algorithms, when used for routine reporting in lumbar spine MRI, showed excellent quality as a diagnostic test that can distinguish the presence of neural element compression (stenosis) at a statistically significant level (P < .0001) from a random event distribution. This research should be extended to validated and directly visualized pain generators to improve the accuracy and prognostic value of the routine lumbar MRI scan for favorable clinical outcomes with intervention and surgery.
Level of Evidence: 3.
Clinical Relevance: Validity, clinical teaching, and evaluation study.
- artificial intelligence
- deep neural network learning
- magnetic resonance imaging
- spinal pathologies
- reliability analysis
INTRODUCTION
Minimally invasive and endoscopic transforaminal decompression techniques have become popular in spinal surgery due to technological advances.1–6 There has been a substantial increase in the number of these types of procedures being carried out in ambulatory surgery centers.7 The advantages of endoscopic transforaminal decompression are fewer postoperative complications, a shorter interval for return to work and social reintegration,2,8–11 faster postoperative narcotic independence, and an overall reduced utilization of painkillers.2,12 The latter problem is of significance in light of the opiate abuse epidemic in the United States,13 more rigorous medical necessity assessment,14 and a demand for value-based health care measures to serve the aging baby-boomer population.13–15 In this context, a conclusive preoperative diagnostic work-up of lumbar radiculopathy is crucial, as decompression is often limited to a small area of 1 affected neuroforamen and lateral recess.16–18
In this article, the authors report on the feasibility of using a deep learning algorithm for routine reporting in spine magnetic resonance imaging (MRI). The ultimate objective of this research is to improve the accuracy and predictive value of the MRI scan when applied to the preoperative planning of targeted minimally invasive and endoscopic spinal surgeries. These targeted procedures often ignore the majority of pathologies reported on routine lumbar MRI scans of patients with injuries or degenerative conditions of the spine and focus treatment only on the validated painful pathologies. The preoperative MRI scan is an integral part of the diagnostic work-up besides history, physical examination, electrodiagnostic studies, and confirmative diagnostic spinal injections.15–19 The need to improve the diagnostic accuracy of the routine MRI scan has been well recognized by surgeons who reported on the correlation between intraoperatively observed findings as gold standard references and reflected on the use of the MRI scan as a predictor of the need for appropriate treatment and its clinical outcomes.20–24 The MRI scan, in many respects, has become the ultimate gatekeeping test in the medical necessity determination of many spinal surgeries. Diagnostic inaccuracies related to false-negative diagnoses, therefore, have a significant impact on patient care and often lead to overutilization in other subspecialties of spine care, such as pain management. From a cost-benefit point of view, these inappropriate points-of-care interactions often translate into wasted treatments if considered ineffective by patients who continue to look for care but should be treated definitively by addressing the structural problems associated with their primary spinal pain generator. Therefore, improving the value of the MRI scan as a predictor of clinical outcomes with appropriate surgical treatments is not only central but also critical to applying the value-based approach to spine care. 
In this study, the authors report on the results of the sensitivity, specificity, and positive and negative predictive value; Cronbach alpha reliability; and interobserver Cohen kappa analysis of MRI reports produced by deep learning neural network algorithms when compared to routine reporting provided by the radiologist.
MATERIALS AND METHODS
The premise of this research and development is the ability of deep learning neural network models to identify features in MRI data that represent varying intensities or severities of degenerative pathologies or injuries in patients. The feasibility of this artificial intelligence (AI) approach was demonstrated in another study included in this journal's special focus issue. In this investigation, the same team of authors reports the statistics of the accuracy and reliability analysis of the AI approach to lumbar MRI reporting, with the radiologist's MRI report considered the gold standard for the comparison analysis. All patients in this consecutive case series provided informed consent, and institutional review board approval was obtained (CEIFUS 106-19). Written informed consent was obtained from the patients for publication of this report and any accompanying images.
Patients and Training Data
The deep learning neural network models analyzed 65 lumbar MRI scans from the same number of patients, comprising a total of 383 levels. The DICOM data were ordered by the first author and were obtained from 1 MRI center in patients with painful lumbar degenerative spine disease or injuries. The data set included the disc levels T12–L1, L1–L2, L2–L3, L3–L4, L4–L5, and L5–S1 for each patient. The average age of the 65 patients was 42.2 years with a standard deviation of 11.8 years. Of the patients, 51.5% were male and 48.5% were female. The MRI center provided radiology reports prepared and approved by board-certified radiologists. Each radiologist was required to present a reading for the presence or absence of annular bulging25 (circumferential, paracentral, posterior), disc herniation26 (extrusion, protrusion, sequestration, fragmentation), central canal stenosis27–29 (compromise of the thecal sac with presence or absence of ventral epidural fat), and foraminal stenosis30 (compromise of the left, right, or both neural foramina and nerve roots) for each intervertebral level.
Extraction of MRI Data
For each disc location, the following classes were extracted from the radiologist report for the central canal: (0) no disc bulge/protrusion/canal stenosis, (1) disc bulge without canal stenosis, (2) disc bulge resulting in canal stenosis, and (3) disc herniation/protrusion/extrusion resulting in canal stenosis. One of the following classes was also extracted for each of the left and right neural foramina: (0) neural foraminal stenosis absent or (1) neural foraminal stenosis present. An example is shown in Table 1, where at the L3–L4 location in the side-by-side comparison, the radiologist read was converted to class (3) for the central canal, (0) for the left neural foramen, and (1) for the right neural foramen and matched by the algorithm model. For the purpose of the reliability analysis, these findings were dichotomized into 2 simple categories: normal and stenosis.
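The class extraction and dichotomization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `dichotomize_level` is hypothetical, and the rule that a level counts as "stenosis" when the canal class is 2 or 3 or either foramen is class 1 is an assumption, since the text does not spell out how canal class (1) or mixed findings were collapsed.

```python
# Assumption: canal classes 2 and 3 (those "resulting in canal stenosis")
# and foraminal class 1 count as stenosis; canal classes 0 and 1 do not.
CANAL_CLASSES = {0, 1, 2, 3}
FORAMEN_CLASSES = {0, 1}
CANAL_STENOSIS_CLASSES = {2, 3}


def dichotomize_level(canal: int, left_foramen: int, right_foramen: int) -> str:
    """Collapse the per-level classes into 'normal' or 'stenosis'."""
    if canal not in CANAL_CLASSES:
        raise ValueError(f"unknown central canal class: {canal}")
    for side in (left_foramen, right_foramen):
        if side not in FORAMEN_CLASSES:
            raise ValueError(f"unknown foraminal class: {side}")
    stenotic = (
        canal in CANAL_STENOSIS_CLASSES
        or left_foramen == 1
        or right_foramen == 1
    )
    return "stenosis" if stenotic else "normal"


# The L3-L4 example from the text: canal class 3, left foramen 0, right foramen 1.
l3_l4 = dichotomize_level(canal=3, left_foramen=0, right_foramen=1)
print(l3_l4)
```

A per-site dichotomization (canal, left foramen, and right foramen each scored separately) would be an equally plausible reading of the text; only the collapsing rule would change.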
Statistical Analysis
For the clinical outcome analysis, descriptive statistics (mean and standard deviation), cross-tabulation statistics of sensitivity, specificity, positive and negative predictive value, and measures of association were computed for 2-way tables using IBM SPSS Statistics software (version 27.0). The Pearson χ2 and the likelihood-ratio χ2 tests were used as statistical measures of association. The Multus RadBot MRI sensitivity of accurately grading and detecting symptomatic nerve root compression (true positive rate) (TP) was calculated on the basis of the grading by the board-certified radiologist as the percentage of patients (MRI positives) among the stenosis patients who were correctly identified by the Multus RadBot as having symptomatic neural compression confirmed by a board-certified radiologist. False negatives (FN) were patients with neural compression identified by the radiologist whose Multus RadBot MRI grading was negative for stenosis (MRI negatives). Therefore, diagnostic Multus RadBot MRI sensitivity for predicting a successful clinical outcome from endoscopic transforaminal decompression procedure was calculated as follows:
Sensitivity (%) = TP / (TP + FN) × 100
The Multus RadBot MRI specificity (true negative [TN] rate) of accurately detecting the absence of symptomatic nerve root compression as demonstrated by the radiologist's MRI reading was calculated as the percentage of patients correctly identified as not having symptomatic neural compression. False positives (FP) were defined as Multus RadBot MRI positives without the radiologist having identified the neural compression. Therefore, diagnostic Multus RadBot MRI specificity of predicting a neural element compression was calculated as follows:
Specificity (%) = TN / (TN + FP) × 100
The positive and negative predictive values of the Multus RadBot reading of the lumbar MRI scan for agreeing with the reading of the board-certified radiologist with the presence or absence of compressive pathology (normal or stenosis) were calculated as follows:
Positive predictive value (%) = TP / (TP + FP) × 100
Negative predictive value (%) = TN / (TN + FN) × 100
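The four cross-tabulation measures defined in this section can be computed from the cells of a 2-way table as sketched below. This is an illustration only, not the SPSS procedure used in the study; the names `DiagnosticMetrics` and `diagnostic_metrics` are hypothetical, and the example counts are invented for demonstration and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> DiagnosticMetrics:
    """Compute test-quality measures from a 2x2 table.

    The reference reading (here, the radiologist) defines which levels are
    truly positive or negative; the index test (here, the algorithm) is
    scored against it. All denominators are assumed to be non-zero.
    """
    return DiagnosticMetrics(
        sensitivity=tp / (tp + fn),  # true positive rate
        specificity=tn / (tn + fp),  # true negative rate
        ppv=tp / (tp + fp),          # positive predictive value
        npv=tn / (tn + fn),          # negative predictive value
    )


# Hypothetical counts for illustration only (NOT the study's data).
example = diagnostic_metrics(tp=30, fn=10, tn=50, fp=10)
print(example)
```

Multiplying each value by 100 gives the percentages in which such results are usually reported.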
Interobserver reliability between the reading provided by the radiologist and the neural network deep learning algorithm (Multus RadBot) was assessed by Cronbach alpha computation and Cohen kappa analysis as a measure of agreement between the radiologist's grading of the lumbar MRI scan and the Multus RadBot's assessment of foraminal and central stenosis. The Cohen kappa was calculated from the observed and expected frequencies on the diagonal of a square contingency table. The overall quality of the Multus RadBot algorithm as a diagnostic test was assessed with the receiver operating characteristics (ROC) with determination of the area under the curve employing the left-upper-corner method using a dichotomization protocol of classifying MRI scan readings per intervertebral disc level as either normal or stenotic.31–34 The confidence intervals for the likelihood ratios were calculated using the “log method.”35,36
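The two agreement statistics named above can be sketched as follows. This is a minimal illustration using the textbook definitions (Cohen kappa from the observed and chance-expected proportions on the diagonal of a square table; Cronbach alpha with the two raters treated as items), not a reproduction of the SPSS output. The function names and the example ratings are hypothetical and are not the study's data.

```python
from statistics import variance


def cohen_kappa(table: list[list[int]]) -> float:
    """Cohen kappa from a square contingency table (rows: rater A, cols: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Expected agreement by chance: product of marginals on the diagonal.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)


def cronbach_alpha(ratings: list[list[int]]) -> float:
    """Cronbach alpha, one inner list per rater, one score per disc level."""
    k = len(ratings)
    item_variance_sum = sum(variance(r) for r in ratings)
    totals = [sum(scores) for scores in zip(*ratings)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 2x2 agreement table (normal / stenosis), illustration only.
kappa = cohen_kappa([[20, 5], [10, 15]])

# Hypothetical dichotomized readings (0 = normal, 1 = stenosis), illustration only.
radiologist = [1, 1, 0, 0, 1, 0, 1, 1]
algorithm = [1, 0, 0, 0, 1, 0, 1, 1]
alpha = cronbach_alpha([radiologist, algorithm])
print(kappa, alpha)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is typically lower than the simple percentage of matching readings.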
RESULTS
The level frequency distribution observed in the 65 patients is summarized in Table 1. The radiologist detected the presence of neural element compressive pathology (stenosis) in 60.6% of scanned levels, whereas the Multus RadBot AI algorithm determined the presence of stenosis in 64.2% of scanned levels (Tables 2 and 3). As listed in Table 4, the most common levels reported as stenotic by the radiologist were L2–L3 (79.7%), L3–L4 (79.7%), and L4–L5 (77.8%). The frequency distribution read out by the Multus RadBot (Table 5) was similar, with some variation at L2–L3 (59.4%), L3–L4 (87.5%), and L4–L5 (93.8%), suggesting that pathology at the L2–L3 level was underdiagnosed versus overdiagnosed at the L4–L5 level. These differences were statistically significant (P < .0001). The ROC analysis showed a value of 0.808 for the area under the ROC curve (AUC), indicating that the Multus RadBot is an excellent diagnostic test that can detect the presence of neural element compression (stenosis) at a statistically significant level (P < .0001) from a random event distribution (Figure).
The cross tabulation between the Multus RadBot and radiologist's readings of the lumbar MRI scan using the radiologist's report as a gold standard revealed a Multus RadBot sensitivity of 73.3%, a specificity of 88.4% (Table 6), a positive predictive value of 80.3%, and a negative predictive value of 83.7% (Table 7), with all the differences in these 2 cross tabulations being statistically significant. The reliability analysis revealed the Cronbach alpha as 0.772. When cross tabulated by intervertebral disc level, differences in reliability by level were found (Table 8). Through a process of elimination, it was determined that Multus RadBot's performance was most reliable at the L2–L3 and L3–L4 levels with the highest individual values for the Cronbach alpha of 0.629 and 0.681 when compared to the MRI report by the radiologist, rendering values of 0.566 and 0.688, respectively (Table 8). Kappa analysis of interobserver reliability rendered an overall kappa for the Multus RadBot of 0.627, suggesting that the Multus RadBot AI algorithm performed at a high reliability level (Table 9). Again, the diagnostic recognition of the Multus RadBot was the most reliable at the L2–L3 and L3–L4 levels on kappa analysis, showing kappa values of 0.738 and 0.606, respectively.
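As a rough consistency check on the reported figures, the sketch below relates the sensitivity and specificity to the area under the ROC curve. It assumes the ROC curve is built from a single dichotomous operating point (normal versus stenosis), which the text does not state explicitly; under that assumption the trapezoidal area reduces to the mean of sensitivity and specificity. The function name `single_point_auc` is hypothetical.

```python
def single_point_auc(sensitivity: float, specificity: float) -> float:
    """Trapezoidal AUC for an ROC curve through (0, 0), (FPR, TPR), and (1, 1)."""
    fpr = 1.0 - specificity
    tpr = sensitivity
    # Triangle under the segment from (0, 0) to (FPR, TPR).
    left_area = 0.5 * fpr * tpr
    # Trapezoid under the segment from (FPR, TPR) to (1, 1).
    right_area = 0.5 * (1.0 - fpr) * (tpr + 1.0)
    return left_area + right_area


# Sensitivity and specificity as reported in the text (73.3% and 88.4%).
auc = single_point_auc(0.733, 0.884)
print(round(auc, 4))
```

With the rounded inputs this gives about 0.8085, in line with the reported AUC of 0.808.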
DISCUSSION
The results of this study highlighted a small “difference in opinion” in the interpretation of routine lumbar MRI scans between the radiologist's report and the AI deep neural network learning algorithm. While it is unclear whether the observed discrepancies arose out of the AI or the radiologist's reporting, it is easy to see how such reporting discrepancies may impact patient selection for targeted spinal procedures, such as endoscopic transforaminal surgery. As the determination of medical necessity in injured patients and in patients with painful degenerative conditions of the spine today hinges frequently on the exact verbatim reading in the MRI report, revisiting the accuracy of the MRI scan is of high relevance to patients and their physicians alike. False-positive readings may subject the patient to unwanted or unneeded treatments at high expense, and false-negative interpretations may deny justified care. The consequences of this diagnostic dilemma play out every day, affecting individualized spine care of those patients with an estimated 2.06 million episodes of low back injury per year in the United States.37
The authors purposely chose a simplified way of analyzing the level of agreement between our AI and the radiologist's MRI reading by applying the following assumptions: (1) the MRI report by the radiologist was employed as the gold standard in this reliability and accuracy analysis, and (2) the authors categorized the MRI findings in a straightforward 2-category manner (normal anatomy or stenosis present) to facilitate the study of the AI algorithm's performance as a diagnostic test by employing accepted statistical methods of chi-square testing to determine the sensitivity, specificity, positive and negative predictive value, the overall test reliability with the ROC and AUC method or the calculations of the Cronbach alpha and Cohen kappa. The numbers obtained with these methods suggest that our AI deep learning network as a diagnostic tool has excellent performance characteristics. Typically, Cohen kappa values of 0.6 and alpha over 0.7 and ROC values higher than 0.8 are considered the hallmarks of a highly useful diagnostic test.38,39 It is not entirely clear to the authors why our AI deep learning neural network was most accurate at the L2–L3 and L3–L4 levels. The most reasonable explanation is that pathologies at the other levels, but particularly at the L4–L5 level, are much more common, thus contributing to more significant variability in how these pathologies are read by the radiologist or interpreted by the Multus RadBot.
The authors are entirely aware of the limitation of their simplified statistical analysis by assuming that the MRI report provided by the reading radiologist was flawless. The authors could have chosen to have the radiologist's report reread by another 1 or 2 radiologists to incorporate that in the reliability discussion. However, the authors purposely decided against it so as not to create an artificial scenario that does not exist in the “real world,” where routine lumbar MRI scans are read by 1 board-certified radiologist with little additional scrutiny. Clinical decisions affecting individual patients' lives are made like that every day. Therefore, the authors did not want to deviate from their simple side-by-side, Multus RadBot versus radiologist analysis approach. It goes without saying, though, that MRI raters on all sides of the medical necessity equation may use different radiological classification systems during the preoperative and diagnostic decision algorithms.15,29,40,41 The first author has demonstrated this clinical dilemma affecting hundreds of his patients who were classified by the radiologist as false negatives but ultimately underwent successful transforaminal endoscopic decompression with excellent and good Macnab outcomes in over 88.3% of patients.23 In his study of 1839 patients, the first author found a diagnostic gap of approximately 18% (330 patients),24 which initially led to the denial of appropriate spine care by the patients' medical insurance.
However, patients who persevered eventually underwent seemingly inappropriate endoscopic surgical decompression for their sciatica, back, and leg pain with a 94.6% success rate.23 This type of spine care, deemed as medically not necessary based on traditional image-based clinical decision criteria done in patients responsive to successful endoscopic decompression, stimulated the authors of this study to look further into improving the preoperative diagnostic process in patients with sciatica due to herniated disc or stenosis leading up to targeted surgical decompression. Interestingly, this 18% diagnostic gap is commensurate with the Multus RadBot's percentage gain in reporting consistency in terms of sensitivity, specificity, and positive predictive value of the lumbar MRI scan with intervention reported by clinical studies where numbers are in the 60%–70% range.18
While the authors are encouraged by the excellent diagnostic performance parameters of the Multus RadBot's self-learning deep neural network models, they are also keenly aware of the limitation of their study because of the reporting bias inherent to the MRI reporting provided by the radiologists. Affective (unconscious emotional reaction) and cognitive (distortions of thinking) biases in the clinical diagnostic decision-making process may have impacted the radiologist's choice of words when dictating the findings he saw on the individual axial and sagittal MRI scan images.42 Cognitive biases, such as hindsight or outcome bias, are virtually unavoidable in a retrospective reclassification of clinical parameters, as knowledge of the outcome by the stakeholders in the patient care equation has been recognized to inflate the predictability of an event after it happened.43,44 Hindsight cognitive biases may also have impacted the extent of disagreement in preoperative lumbar MRI grading by the radiologist.45 Intuition bias may have played a role in the radiologist's wording of the MRI report while loosely adhering to radiographic stenosis classification systems.45 The Multus RadBot is not subject to these biases, for which reason the authors expect higher reliability numbers in correctly identifying painful spinal pathology with further refinements of the technology when directly visualized intraoperative observations of painful spinal pathologies are used as a gold standard rather than a radiologist report of another imaging modality. The first author has successfully used this approach in a prior study of the positive predictive value of the routine lumbar MRI scan.
CONCLUSIONS
This study set out to better understand how to utilize the lumbar MRI scan as a prognosticator of favorable clinical outcomes when selecting patients for targeted spine care, such as the endoscopic transforaminal decompression procedure, aiming to cure patients of the predominant pain generator causing pain and disability in the functional context at the time when the spine care is delivered. To employ the routine lumbar MRI scan as a more accurate prognosticator for successful spine care with high patient satisfaction, this AI deep learning neural network, in the authors' opinion, needs to be further refined by focusing the segmentation models on MRI image findings of intraoperatively verified and validated pain generators responsive to treatment. The authors are in the process of completing a pilot study on this very problem. Surgical translational research on intraoperatively visualized spinal pathology should focus on analyzing the effectiveness of MRI prognosticators with spine surgical interventions, such as endoscopy, using state-of-the-art measures of central, lateral recess, and neural foraminal stenosis on MRI to further determine how they impact the prognosis of surgical treatment for neurogenic claudication and lumbar radiculopathy.
Footnotes
Disclosures and COI: The views expressed in this article represent those of the authors and no other entity or organization. The first author has no direct (employment, stock ownership, grants, patents) or indirect conflicts of interest (honoraria, consultancies to sponsoring organizations, mutual fund ownership, paid expert testimony). He is not currently affiliated with or under any consulting agreement with any MRI vendor that the clinical research data conclusion could directly enrich. This manuscript is not meant for or intended to push any agenda other than reporting the research data related to automated recognition of common painful spine pathologies by deep neural network learning. The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
- This manuscript is generously published free of charge by ISASS, the International Society for the Advancement of Spine Surgery. Copyright © 2020 ISASS