Abstract
Background The treating surgeon's visual assessment of axial MRI images to ascertain the degree of stenosis has a critical impact on surgical decision-making. The purpose of this study was to prospectively analyze the impact of surgeon experience on the inter-observer and intra-observer reliability of assessing the severity of spinal stenosis on MRIs by spine surgeons directly involved in surgical decision-making.
Methods Seven fellowship-trained spine surgeons reviewed MRI studies of 30 symptomatic patients with lumbar stenosis and graded the stenosis in the central canal, the lateral recess and the foramen at T12-L1 to L5-S1 as none, mild, moderate or severe. No specific instructions were provided as to what constituted mild, moderate, or severe stenosis. Two surgeons were “senior” (>fifteen years of practice experience), two were “intermediate” (>four years of practice experience), and three were “junior” (<one year of practice experience). The concordance correlation coefficient (CCC) was calculated to assess inter-observer reliability. Seven MRI studies were duplicated and randomly re-read to evaluate intra-observer reliability.
Results Surgeon experience was found to be a strong predictor of inter-observer reliability. Inter-observer reliability in the “senior” group was significantly higher than in the “junior” group for assessing central (p<0.001), foraminal (p=0.005) and lateral recess (p=0.001) stenosis. The “senior” group also showed significantly higher inter-observer reliability than the “intermediate” group for assessing foraminal stenosis (p=0.036). The intra-observer reliability results were contrary to those found for inter-observer reliability.
Conclusion Inter-observer reliability of assessing stenosis on MRIs increases with surgeon experience. Lower intra-observer reliability values among the senior group, although not clearly explained, may be due to the small number of MRIs evaluated and quality of MRI images.
Level of evidence: Level 3.
- lumbar spinal stenosis
- radiological grading
- inter-observer reliability
- intra-observer reliability
- surgeon experience
Introduction
Lumbar stenosis is most commonly diagnosed in patients who are older than 65 years. In our aging population, this demographic is predicted to grow by 59% by 2025. Lumbar spinal stenosis is known to be the most common indication for spine surgery in these older adults.1 In recent decades, surgery for spinal stenosis was also found to be the most rapidly increasing type of lumbar spine surgery.2 Chen et al. reported that 21% of patients with lumbar spinal stenosis underwent surgery within three years of the initial diagnosis.3
Magnetic resonance imaging (MRI) of the lumbar spine is the imaging modality of choice to diagnose lumbar spinal stenosis. MRIs are used to determine the extent and location of neural compression at the predicted symptomatic level and any degenerative changes at adjacent levels that may affect clinical decision-making and potential treatments. Treatment decisions are often based on interpretation of both clinical symptoms and severity of disease on imaging. Although lumbar spinal stenosis is a common diagnosis on MRI, there are no universally accepted diagnostic criteria or grading system for it.
Schizas et al. described a 7-grade classification system based on the morphology of the dural sac as seen on T2-weighted axial MRI images.4 A dural cross-sectional area of less than 70 mm² has previously been suggested to represent critical stenosis5 and has been used in multiple studies. However, these morphologic measurements are unlikely to be used to diagnose and grade spinal stenosis in a surgical clinic, because a significant number of asymptomatic subjects have spinal stenosis on MRIs.6 In addition, multiple previous studies have failed to establish a correlation between any morphometric grading of stenosis on MRI and clinical disability.7-10 Surgeons usually rely only on a visual assessment of axial MRI images to ascertain the degree of stenosis. This MRI assessment of the severity of lumbar spinal stenosis by the treating surgeon has a critical impact on surgical decision-making. For this reason, multiple previous studies have evaluated the reliability of assessing the severity of lumbar spinal stenosis on MRI. Most of these were retrospective studies with MRI evaluation done by a group of radiologists and spine surgeons who were not involved in the decision-making regarding surgery.11-13 There is also a paucity of data regarding the impact of surgeon experience on the assessment of the severity of lumbar spinal stenosis.
Previous studies have demonstrated that surgeon experience is critical with respect to technical performance during surgical procedures.14-15 While MRI is typically the preferred imaging modality in the evaluation of a patient with lumbar neurogenic claudication, little is known about how surgeon experience affects the ability of surgeons to assess the extent of stenosis. To the best of our knowledge, there has not been a prospective study evaluating inter-observer and intra-observer reliability of MRI studies for lumbar spinal stenosis based on surgeon experience. The purpose of our study was to prospectively analyze the impact of the surgeon’s experience on inter-observer and intra-observer reliability of assessing the severity of spinal stenosis on MRIs by spine surgeons directly involved in surgical decision-making and subsequent surgery.
Materials and Methods
There were no funding sources for this study and no potential conflicts of interest. Following IRB approval, thirty patients who consecutively presented to our academic tertiary care referral center with symptoms of lumbar spinal stenosis gave consent and were prospectively enrolled in the study between July 2009 and March 2010. All the patients had exhibited symptoms of lumbar spinal stenosis for more than six weeks. Each patient either presented to the office with a previously performed MRI, or had one performed shortly after the first evaluation. No standard imaging protocol was used, and previously performed MRIs were used “as is”.
The MRI studies were compiled, stripped of any identifying information, and distributed to seven fellowship-trained spine surgeons for review. Seven MRI studies were duplicated and randomly re-read to evaluate intra-observer reliability. Therefore, each surgeon read a total of 37 MRI studies. The evaluating surgeons viewed the blinded MRI images and completed a “lumbar spine stenosis response sheet”. Each surgeon was asked to grade the severity of stenosis in each region of the lumbar spinal canal (the central canal, the lateral recess and the foramen) at each level from T12-L1 to L5-S1 as none, mild, moderate or severe. No specific instructions were provided as to what constituted mild, moderate or severe stenosis. No clinical information was given to the evaluating surgeons. Two surgeons were considered “senior” with greater than fifteen years of clinical experience (BEF, MP), two were “intermediate” with at least four years of clinical experience (RAT, WFL), and three were “junior” with less than one year of clinical experience (IAM, UM, SM). Because working together in a practice or having similar training may affect inter-observer reliability, it is important to clarify that the two senior authors had not worked together. However, the two intermediate authors were working together in the same practice, and the three junior authors had trained as fellows/residents with the intermediate authors.
Statistical Methods
The concordance correlation coefficient (CCC) was calculated to assess the agreement among different surgeons. The CCC is closely related to, and sometimes identical to, another commonly used measure of agreement, the intraclass correlation coefficient (ICC). Variance component models were applied to account for all sources of variation, particularly the within-subject correlation. The CCC was estimated from repeated measures following the approach of Carrasco et al.16 The CCCs for “junior”, “intermediate” and “senior” surgeons were estimated separately and the differences between them were evaluated by a Wald-type test. Both independence and compound symmetry (CS) within-subject correlation structures yielded similar results, and only the estimates using the CS within-subject correlation structure are reported here. Intra-observer agreement was assessed using the weighted kappa (κ) statistic. All analyses were performed in SAS 9.4 (SAS Institute Inc., Cary, NC).
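To illustrate what the CCC measures, the following is a minimal sketch of Lin's pairwise CCC for two raters. It is not the variance-component estimator for repeated measures used in this study, and the integer coding of grades (none=0 through severe=3), the function name `lin_ccc`, and the example readings are assumptions made purely for illustration.

```python
from statistics import fmean

# Assumed ordinal coding for this sketch: none=0, mild=1, moderate=2, severe=3.


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for two raters.

    Illustrative only: a simple two-rater form, not the repeated-measures
    variance-component estimator described in the text.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 readings")
    mean_x, mean_y = fmean(x), fmean(y)
    # Population (divide-by-n) moments.
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov_xy = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both raters give one constant grade")
    return 2 * cov_xy / denom


# Hypothetical readings: rater B grades every level one step higher than rater A.
rater_a = [0, 1, 2]
rater_b = [1, 2, 3]
shifted_ccc = lin_ccc(rater_a, rater_b)  # 4/7, although Pearson r would be 1
```

Unlike a plain correlation, the CCC penalizes a systematic shift between raters, which is why it suits the question of whether two surgeons assign the same grade rather than merely the same ranking.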
Agreement values were interpreted using the criteria of Landis and Koch, with values between 0.81 and 1.00 indicating almost perfect agreement, values between 0.61 and 0.80 substantial agreement, values between 0.41 and 0.60 moderate agreement, values between 0.21 and 0.40 fair agreement, and values of 0.20 and below poor agreement.17
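As a concrete illustration of the intra-observer analysis, the sketch below computes a weighted kappa for one reader's first and second readings and maps a value onto the agreement bands listed above. The text does not state which weighting scheme was used, so linear weights are an assumption here, as are the integer grade coding, the function names, the treatment of band edges (for example, any value above 0.80 counts as almost perfect), and the example readings.

```python
def weighted_kappa(first, second, n_categories=4):
    """Cohen's weighted kappa with linear disagreement weights (assumed scheme).

    Grades are integers 0..n_categories-1 (none=0 ... severe=3 in this sketch).
    """
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("need two non-empty, equal-length sequences")
    k = n_categories
    # Observed joint proportions of (first reading, second reading).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weight |i - j| / (k - 1).
    obs_dis = sum(abs(i - j) / (k - 1) * observed[i][j]
                  for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) / (k - 1) * row[i] * col[j]
                  for i in range(k) for j in range(k))
    if exp_dis == 0:
        raise ValueError("kappa is undefined when all readings share one grade")
    return 1.0 - obs_dis / exp_dis


def agreement_band(value):
    """Map an agreement statistic to the bands used in this paper."""
    if value > 0.80:
        return "almost perfect"
    if value > 0.60:
        return "substantial"
    if value > 0.40:
        return "moderate"
    if value > 0.20:
        return "fair"
    return "poor"


# Hypothetical re-read of four levels by one surgeon.
kappa = weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2])
band = agreement_band(kappa)
```

With only seven duplicated studies per reader, a single discordant re-read can shift a value across a band boundary, so the band labels are best read alongside the confidence intervals.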
Results
There were 16 women and 14 men enrolled, with an age range of 43 to 86 years and a mean age of 62.8 years. All MRI studies were determined to be of adequate quality. Inter-observer and intra-observer reliability data of assessing stenosis in central, foraminal and lateral recess regions grouped according to surgeon clinical experience are shown in Table 1 and Table 2, respectively.
Surgeon experience was found to be a strong predictor of inter-observer reliability. The “senior” group had significantly higher inter-observer reliability for assessing central (CCC = 0.731 with 95% CI = (0.652, 0.795); p<0.001), foraminal (CCC = 0.511 with 95% CI = (0.404, 0.605); p=0.005), and lateral recess stenosis (CCC = 0.511 with 95% CI = (0.404, 0.605); p=0.001) as compared to the “junior” group (central: CCC = 0.341 with 95% CI = (0.259, 0.419); foraminal: CCC = 0.334 with 95% CI = (0.245, 0.418); lateral recess: CCC = 0.410 with 95% CI = (0.315, 0.498)). The “senior” group also showed significantly higher inter-observer reliability in assessing foraminal stenosis (p=0.036) as compared to the “intermediate” group (CCC = 0.368 with 95% CI = (0.244, 0.479)), but there was no significant difference between the “senior” and the “intermediate” groups in assessment of central and lateral recess stenosis. On overall grading of stenosis, inter-observer reliability was significantly higher in both the “senior” (CCC = 0.598 with 95% CI = (0.538, 0.653); p<0.001) and the “intermediate” group of surgeons (CCC = 0.485 with 95% CI = (0.422, 0.544); p=0.002) when compared to the “junior” group (CCC = 0.362 with 95% CI = (0.306, 0.416)). Additionally, the overall grading of stenosis inter-observer reliability in the “senior” group was also significantly higher than the “intermediate” group (p=0.004).
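The group comparisons above rest on a Wald-type test of the difference between two CCC estimates. The sketch below shows the general form of such a test under strong simplifying assumptions that are not taken from the study: the standard errors are backed out of the reported 95% confidence intervals as if they were symmetric normal intervals, and the two estimates are treated as independent. The function names are hypothetical, and the result is a rough plausibility check rather than a reproduction of the SAS analysis.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def se_from_ci(lower, upper, level=0.95):
    """Approximate a standard error from a CI, assuming a symmetric normal interval."""
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def wald_difference(est_a, ci_a, est_b, ci_b):
    """Two-sided Wald z-test for est_a - est_b, assuming independent estimates."""
    se_a = se_from_ci(*ci_a)
    se_b = se_from_ci(*ci_b)
    z = (est_a - est_b) / sqrt(se_a ** 2 + se_b ** 2)
    p = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p


# Central stenosis, "senior" vs "junior", using the values reported above.
z_central, p_central = wald_difference(
    0.731, (0.652, 0.795),
    0.341, (0.259, 0.419),
)
```

Under these assumptions the central stenosis comparison yields a large positive z statistic, in line with the reported p<0.001. Exact agreement with the published p-values should not be expected, because the intervals for a CCC are typically not symmetric.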
Surgeon experience was also found to be a strong predictor of intra-observer reliability in assessing stenosis (Table 2). However, the results were contrary to those found for inter-observer reliability. In terms of intra-observer reliability, we found substantial agreement among “junior” and “intermediate” surgeons in assessing stenosis in all three regions of the spinal canal as well as in the overall assessment of stenosis. In the “senior” group, there was fair agreement for assessing central stenosis (κ=0.265 with 95% CI = (0.056, 0.474)), and moderate agreement for assessing lateral recess stenosis (κ=0.443 with 95% CI = (0.267, 0.618)), foraminal stenosis (κ=0.539 with 95% CI = (0.361, 0.717)) and overall stenosis (κ=0.459 with 95% CI = (0.347, 0.571)). Intra-observer reliability of assessing central stenosis, lateral recess stenosis and overall stenosis was significantly better in the “junior” (central: κ=0.664 with 95% CI = (0.542, 0.784), p<0.001; lateral recess: κ=0.679 with 95% CI = (0.573, 0.786), p=0.012; overall: κ=0.695 with 95% CI = (0.629, 0.761), p<0.001) and the “intermediate” groups (central: κ=0.591 with 95% CI = (0.365, 0.818), p=0.019; lateral recess: κ=0.715 with 95% CI = (0.599, 0.831), p=0.006; overall: κ=0.716 with 95% CI = (0.640, 0.791), p<0.001) as compared to the “senior” group.
Among the “senior” surgeons, one surgeon reported data that showed substantial agreement assessing stenosis in all three regions of the spinal canal as well as overall assessment of stenosis with values similar to “junior” and “intermediate” surgeons. The other “senior” surgeon showed poor agreement in assessment of central stenosis, moderate agreement for assessment of lateral recess stenosis, and fair agreement for assessment of foraminal stenosis and overall stenosis. This reduced the intra-observer reliability values for the “senior” surgeons as a group.
Discussion
In this study, we found that inter-observer reliability of the interpretation of severity of stenosis in all three regions of the lumbar spinal canal, without clear criteria being provided for such grading, was higher among surgeons with more clinical experience than among surgeons with less clinical experience (senior > intermediate > junior surgeons). In contrast, intra-observer reliability was higher among “junior” and “intermediate” surgeons than among the “senior” surgeons. We cannot clearly reconcile the difference between inter-observer and intra-observer reliability values in the “senior” group. There were a small number of readings with substantial disagreement when MRIs were re-read by one of the “senior” surgeons. With only 7 MRI studies being re-read for assessing intra-observer reliability, this disagreement reduced intra-observer reliability values considerably in the “senior” group. The quality of the MRIs may also have played a role in causing disagreements in certain cases. Despite this discrepancy, we believe the data support the important conclusion that inter-observer reliability of assessing stenosis on MRIs increases with surgeon experience. It is possible that with increasing experience in spine patient evaluation, MRI reading, surgery and observed patient outcomes, surgeons develop a more consistent and comparable assessment of what may be classified as mild, moderate or severe stenosis. Additionally, this was an exploratory study intended to examine potential trends. The sample size was determined mainly from a practical point of view and only allowed us to detect large effects and correlations. At the significance level of 0.05, we had 80% power to detect an effect size of 1.05 using a two-sample t-test and r=0.48 using Fisher’s Z test. A larger sample size might have provided a more consistent outcome in terms of both inter- and intra-observer reliability.
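The power statement above can be sanity-checked with a normal-approximation sketch. The text does not state the sample sizes behind the calculation, so the values used below (two groups of 15 for the two-sample comparison and 30 observations for the correlation) are placeholders chosen only because 30 patients were enrolled. The function names are hypothetical, and the normal approximation slightly overstates the power of an actual t-test.

```python
from math import atanh, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation).

    effect_size is the standardized mean difference (Cohen's d).
    The negligible contribution of the opposite tail is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(noncentrality - z_crit)


def power_correlation(r, n, alpha=0.05):
    """Approximate power to detect correlation r with Fisher's z transformation."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = atanh(r) * sqrt(n - 3)
    return _STD_NORMAL.cdf(noncentrality - z_crit)


# Placeholder sample sizes (assumptions, not stated in the text).
power_d = power_two_sample(1.05, n_per_group=15)
power_r = power_correlation(0.48, n=30)
```

Under these placeholder sample sizes both approximations land close to 80%, which is consistent with the stated design and underlines that only large effects were detectable.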
Speciale et al.11 looked at the reliability of seven evaluators from different specialties (two orthopedic spine surgeons, two neurosurgeons, and three radiologists) reading 15 MRIs of patients with lumbar spinal stenosis and comparing these to the cross-sectional spinal canal area. They concluded that the inter-observer reliability was overall fair, and was highest among radiologists (k=0.40), followed by neurosurgeons (k=0.21), and finally orthopedists (k=0.15). They also found that the determination of central stenosis was highly predictive of decreased spinal canal area. The effect of evaluators’ experience on reliability of assessment was not reported.
Lurie et al.13 investigated the reliability of assessment of 58 lumbar spine MRI studies randomly sampled from the Spine Patient Outcome Research Trial (SPORT), looking for evidence of spinal stenosis. These were evaluated by three radiologists and one orthopedic surgeon. They found higher inter-observer reliability in determining central stenosis (k=0.73) and worse results with lateral recess stenosis (k=0.49). Lonne et al.12 reported analysis of MRI images of 84 patients who underwent surgery for lumbar spinal stenosis. The severity of stenosis was morphologically graded from A-D by two neuroradiologists. They found an inter-observer agreement of 0.65 and intra-observer agreement of 0.77 on morphological grading.
The reliability scores in our study were grouped by surgeon experience, which has not been done previously. We believe that the results of this study are clinically important because all the surgeons had similar clinical training, which standardizes terms, interpretations, clinical decision-making and assessment of outcomes among surgeons. Our study focused solely on surgeons involved in the evaluation and treatment of the patients whose MRIs were being evaluated, and not on the reliability of radiologists, who read the MRIs but do not treat the patients.
The advantages of our study are that we prospectively evaluated a consecutively collected group of patients with a limited range of spinal pathology and symptoms of stenosis. The studies were read by clinicians from a single specialty with similar clinical training but differing clinical experience. These clinicians were ultimately involved in patient care, which makes the interpretations more clinically meaningful. The disadvantages are the limited number of patients and MRIs included in the study and the variety of locations and types of MRIs gathered, and therefore the variable quality and detail of the MRIs. In addition, the MRIs came from the practice of the “intermediate” surgeons, which raises the possibility that their readings were confounded by knowledge of the patient presentation. However, given the large volume of patients seen in their clinics, many of them non-surgical, it is unlikely that individual patients were recognized.
The lack of definite prior instruction about what constituted mild, moderate or severe stenosis may be considered either a disadvantage or an advantage of this study. We relied on the ‘a priori’ opinions of the evaluators to determine the degree of stenosis based on their clinical experience. In a critique of the study by Carrino et al.,18 Jarvik and Deyo19 discussed the issue of overly dichotomizing interpretations, namely that when overly specific categories are analyzed as differences, variability increases and reliability decreases. To counter this, we allowed each evaluator to interpret each level and area of stenosis based on what they considered mild, moderate or severe stenosis. We believe this more closely approximates their assessment of the MRI in a real clinical situation based on their overall experience and training. As patients were enrolled in the study, we discovered that many of them already had MRIs performed at outside facilities. The degree of detail of these studies varied. At the time of the initial evaluation, it was determined whether the MRIs were of sufficient quality to be interpreted in the context of stenosis. All of the MRIs used for this study were considered adequate.
In conclusion, the results from this study bear clinical importance, as the interpretation of MRI is an important part of surgical decision-making for patients with symptomatic lumbar stenosis. Our study suggests that more experienced (“senior”) surgeons can more reliably evaluate MRI studies, suggesting the process of critically looking at imaging studies may be a vital step in improving clinical outcomes. Further studies are required to determine if this difference in reliability in the assessment of MRIs based on surgeon experience leads to a clinically better outcome.
Disclosures & COI
There were no sources of funding for this study. There were no grants for this paper, and no authors have any personal or institutional financial interest in drugs, materials, or devices described in this submission. IRB approval was provided by SUNY Upstate Medical University.
- Copyright © 2017 ISASS - This manuscript is generously published free of charge by ISASS, the International Society for the Advancement of Spine Surgery