ABSTRACT
Background: A validated classification remains the key to an appropriate treatment algorithm of traumatic thoracolumbar fractures. Considering the development of many classifications, it is remarkable that consensus about treatment is still lacking. We conducted a systematic review to investigate which classification can be used best for treatment decision making in thoracolumbar fractures.
Methods: A comprehensive search was conducted using PubMed, Embase, CINAHL, and Cochrane using the following search terms: classification (mesh), spinal fractures (mesh), and corresponding synonyms. All hits were viewed by 2 independent researchers. Papers were included if analyzing the reliability (kappa values) and clinical usefulness (specificity or sensitivity of an algorithm) of currently most used classifications (Magerl/AO, thoracolumbar injury classification and severity score [TLICS] or thoracolumbar injury severity score, and the new AO spine).
Results: Twenty articles are included. The presented kappa values indicate moderate to substantial agreement for all 3 classifications. Regarding the clinical usefulness, > 90% agreement between actual treatment and classification recommendation is reported for most fractures. However, it appears that over 50% of the patients with a stable burst fracture (TLICS 2, AO-A3/A4) in daily practice are operated, so in these cases treatment decision is not primarily based on classification.
Conclusion: AO, TLICS, and new AO spine classifications have acceptable accuracy (kappa > 0.4), but are limited in clinical usefulness since the treatment recommendation is not always implemented in clinical practice. Differences in treatment decision making arise from several causes, such as surgeon and patient preferences and prognostic factors that are not included in classifications yet. The recently validated thoracolumbar AO spine injury score seems promising for use in clinical practice, because of inclusion of patient-specific modifiers. Future research should prove its definite value in treatment decision making.
Level of Evidence: 2.
Clinical Relevance: Without the appropriate treatment, the impact of traumatic thoracolumbar fractures can be devastating. Therefore it is important to achieve consensus in the treatment of thoracolumbar fractures.
INTRODUCTION
Thoracolumbar fractures are common injuries. Without appropriate treatment, their outcome can be devastating. Commonly, treatment decision is based upon accurate radiological diagnosis and concomitant use of a fracture classification system. Several classifications have been introduced during the past years. With the improvement of imaging (eg, computed tomography [CT], magnetic resonance imaging [MRI]), it has become possible to better understand the pathology of the thoracolumbar spine fractures, and to recognize fracture patterns. These fracture patterns give insight into fracture morphology, trauma mechanism, and determination of stability, and have led to various classification systems. Classifications aim to create a common language with standardization and optimization of treatment. Currently the most used classifications are Magerl/AO and the thoraco-lumbar injury classification and severity score (TLICS).1–3
The AO classification is primarily based on the pathomorphological characteristics of the injury. It is often used for fracture classification, but does not include a reliable estimation of prognosis for the determination of the best treatment.1,2 In 2005, Vaccaro et al initially developed the thoraco-lumbar injury severity score (TLISS),3 which was slightly modified to the TLICS in 2007.4 As the name already states, this classification system includes a scoring system based on 3 variables, with subsequent treatment algorithms. Recently, the new AO spine classification has been published, which tries to simplify the comprehensive Magerl/AO classification and incorporates features of both TLICS and Magerl/AO classifications.5,6 Table 1 shows a description of the Magerl/AO, TLICS, and new AO spine classification systems.
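To make the idea of a 3-variable score with an attached treatment algorithm concrete, the following is a minimal sketch. The point values and cut-offs are not stated in this text; they are assumed from the commonly cited published TLICS description and should be verified against Table 1 before any use. All function and dictionary names are hypothetical.

```python
# Illustrative sketch only. Point values and thresholds are ASSUMED from the
# commonly cited TLICS description, not taken from this review.
MORPHOLOGY_POINTS = {
    "compression": 1,
    "burst": 2,
    "translation_rotation": 3,
    "distraction": 4,
}
NEUROLOGY_POINTS = {
    "intact": 0,
    "nerve_root": 2,
    "complete_cord": 2,
    "incomplete_cord": 3,
    "cauda_equina": 3,
}
PLC_POINTS = {
    "intact": 0,
    "indeterminate": 2,
    "injured": 3,
}


def tlics_score(morphology: str, neurology: str, plc: str) -> int:
    """Sum the points of the 3 variables (morphology, neurology, PLC)."""
    return (
        MORPHOLOGY_POINTS[morphology]
        + NEUROLOGY_POINTS[neurology]
        + PLC_POINTS[plc]
    )


def tlics_recommendation(score: int) -> str:
    """Map a total score to the assumed treatment recommendation."""
    if score <= 3:
        return "nonoperative"
    if score == 4:
        return "surgeon's choice"
    return "operative"


# A burst fracture without neurological deficit and with an intact PLC.
stable_burst = tlics_score("burst", "intact", "intact")
print(stable_burst, tlics_recommendation(stable_burst))
```

Under these assumed values, a burst fracture with intact neurology and intact PLC scores 2 points, which matches the "stable burst fracture (TLICS 2)" category discussed throughout this review.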
Considering the existence of various classification systems, and the quantity of research that has been done to classify thoracolumbar fractures, it is remarkable that consensus about treatment is still lacking. A validated classification of fractures remains the key to an appropriate treatment algorithm. In an attempt to achieve worldwide consensus in the treatment decision of traumatic thoracolumbar fractures, there is need for a classification that should have 2 important characteristics: (1) it needs to create a worldwide common language concerning the recognition of injury types (accuracy) and (2) the treatment recommendation by the classification should be highly correlated to the actual treatment (clinical usefulness).
For this reason, we performed a systematic review with the following research question: “In traumatic thoracolumbar spine fractures, which classification can be used best for treatment decision making?” This question considered participants, intervention, comparisons, outcomes, and study design (PICOS). We looked for participants who were treated for traumatic thoracolumbar spine fractures. We compared the Magerl/AO, new AO spine, and TLICS classifications with respect to treatment decision making. We defined the outcome parameters as accuracy (expressed in interobserver and intraobserver reliability as kappa values) and clinical usefulness (expressed in the sensitivity or specificity of an algorithm); the study design was a systematic review.
MATERIALS AND METHODS
A systematic literature review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. The checklist of the PRISMA guidelines is attached in Appendix 1.
Search
With help from the medical research librarian (M.H.), a comprehensive search of the English literature was conducted using PubMed, Embase, CINAHL, and the Cochrane Database. The literature was searched without any date limitations. Search terms included “classification (mesh),” with subsequent corresponding synonyms (ao spine, ao classification, tlics, tliss, classification*, systematics, taxonom*); and “spinal fractures (mesh)” with subsequent synonyms (spinal fracture*, spine fracture*, thoracolumbar fracture*, thoracic fracture*, lumbar fracture*, vertebra fracture*, vertebral fracture*). The full search process is shown in Appendix 2.
Study Selection
All hits (PubMed: 1128, Embase: 2775, CINAHL: 279, Cochrane: 134; in total 4312 hits) were imported to Refworks. Two independent researchers (IC and MS) viewed all references and included full text papers. In case of a difference of opinion, a third author (PW) was consulted.
Literature was included or excluded based on the following criteria. Inclusion: thoracolumbar fractures; English language; analysis of AO, new AO spine, TLICS or TLISS classification; measurements: intra- and/or interobserver validity (kappa values) or clinical usefulness expressed in specificity/sensitivity of an algorithm, or any other way the applicability was scored. Exclusion: congress papers, instructional course lectures, reviews, cadaver studies, cervical spine fractures, all other classification methods except those mentioned in the inclusion, children, osteoporotic or other pathological fractures, the expression of the clinical usefulness associated with treatment-related outcome.
The PRISMA evidence-based medicine checklist for diagnostic articles was used for the qualitative analysis. See Table 2 for the qualitative analysis of the included literature.
Outcome Parameters
Accuracy is defined as the interobserver and intraobserver reliability of the classification systems. The reliability is expressed in kappa values, which are commonly used and accepted for the measurement of data collection accuracy.24 As a summary measure for the kappa coefficients, mean kappa values were used. Some studies showed different kappa values depending on level of expertise and function of the observer. In that case we chose the kappa values represented by the attending spine surgeons, as these were most representative for clinical decision making.
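As an illustration of how a kappa value corrects raw agreement for chance, the sketch below computes Cohen's kappa for two observers and a simple mean over several kappa values. This is an assumption for illustration: the included studies may have used other kappa variants (for example, for multiple observers), and the function names and example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty case list")
    n = len(ratings_a)
    # Observed proportion of cases on which the two observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each observer's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both observers used one and the same category throughout
    return (observed - expected) / (1 - expected)


def mean_kappa(kappas):
    """Arithmetic mean of several kappa values, used as a summary measure."""
    return sum(kappas) / len(kappas)


# Hypothetical fracture types assigned to 4 cases by 2 observers.
observer_1 = ["A", "A", "B", "B"]
observer_2 = ["A", "A", "B", "A"]
print(cohen_kappa(observer_1, observer_2))
```

In this hypothetical example the observers agree on 3 of 4 cases (75%), but half of that agreement is expected by chance, so the kappa value is lower than the raw agreement.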
For clinical usefulness, we decided to focus on the applicability of the current classifications. This is quantified by the percentage of agreement between classification recommendation and actual treatment, and shows the correlation between classification recommendations and decision making.
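The agreement measure described here can be sketched as follows. The example treats the actual treatment as the reference and a surgical recommendation as the "positive" result, which is an assumption made for illustration; the function name and the patient data are hypothetical.

```python
def agreement_stats(recommended, actual, positive="surgical"):
    """Compare classification recommendations with actual treatments.

    The actual treatment is taken as the reference standard (an assumption).
    """
    pairs = list(zip(recommended, actual))
    tp = sum(r == positive and a == positive for r, a in pairs)
    tn = sum(r != positive and a != positive for r, a in pairs)
    fp = sum(r == positive and a != positive for r, a in pairs)
    fn = sum(r != positive and a == positive for r, a in pairs)
    return {
        # Share of patients whose treatment matched the recommendation.
        "agreement": (tp + tn) / len(pairs),
        # Share of operated patients for whom surgery was recommended.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # Share of conservatively treated patients with a conservative recommendation.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical cohort of 5 patients.
recommended = ["surgical", "surgical", "conservative", "conservative", "conservative"]
actual = ["surgical", "surgical", "surgical", "conservative", "conservative"]
print(agreement_stats(recommended, actual))
```

In this hypothetical cohort, one patient was operated on despite a conservative recommendation, which lowers both the overall agreement and the sensitivity while leaving the specificity unaffected.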
Statistical Methods
Pooling of data could not be performed, because case cohorts and number of observers were too variable in the included studies. As raw data were not available, it was not possible to perform a meta-analysis of kappa values.
RESULTS
Twenty-eight articles were selected for full text rating. Four more were selected by cross reference. After full text screening, the following 12 articles were excluded. The full text article of Yacoub et al25 was not available. One article by Mirza et al26 was a review of previous literature. Five articles were not applicable to the research question of this review: Salgado et al,27 Pizones et al,28 and Winklhofer et al29 studied the influence of MRI on the classification system, instead of analyzing the reliability of the classifications themselves; Shen et al30 investigated the prognostic factors of failure of conservatively treated burst fractures; and Pneumoticos et al31 compared TLICS 1-3 and TLICS 4 conservatively treated thoracolumbar spine fractures. Five articles contained subgroup analysis of the same cohort published earlier (Joaquim et al [2014]32 presented the same cohort as Joaquim et al [2013]22; Ratliff et al33 and Raja Rampersaud et al34 performed subgroup analysis of the same cohort published by Harrop et al3; and Sadiqi et al35 and Schroeder et al36 had subgroup analysis of the cohort studied by Kepler et al17).
Seventeen papers were eligible for the first subquestion regarding the interobserver and intraobserver reliability. Eight articles could be used for answering the second question, considering the clinical usefulness of the classification methods. The inclusion and exclusion processes are summarized in the PRISMA flow diagram (Figure 1).
Reliability
Seventeen articles described the interobserver and/or intraobserver validity of at least one of the classification systems. The data for the interobserver and intraobserver kappa values are shown in Tables 3 and 4 respectively. Regarding the proposed guidelines of Landis and Koch,37 the kappa values indicated moderate to substantial agreement (kappa > 0.4) for all 3 classification methods.
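For reference, the Landis and Koch bands used to interpret kappa values can be written as a small lookup. This is a minimal sketch; the function name is hypothetical and the band labels follow the commonly cited Landis and Koch scale.

```python
def landis_koch_label(kappa: float) -> str:
    """Return the Landis and Koch agreement band for a kappa value."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values taken from ranges quoted elsewhere in this review.
for value in (0.36, 0.51, 0.77, 0.88):
    print(value, landis_koch_label(value))
```

On this scale, the threshold of kappa > 0.4 mentioned above corresponds to the lower bound of the "moderate" band.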
A wide range of kappa values has been described in the literature. These values were influenced by various factors, including the number of observers, cases and options, prevalence, blinding, and work-up bias. When taking that into account, all current classifications had acceptable reliability. Kappa values of the new AO spine classification seem slightly better than those of the Magerl/AO classification. In Figure 2 the mean interobserver kappa values of the total TLICS score are plotted against the number of observers. The larger the number of observers (and cases), the lower the kappa value. The study of Patel et al14 shows an outlier with a mean interobserver kappa value of 0.51 in 21 observers. This is higher than expected, but could be explained by the fact that the attending observers had been involved in the development of the TLISS system and that they had trained the remaining observers on the use of the system. The kappa values of the validity studies were slightly higher than the kappa values of the independent prospective cohort studies.
Clinical Usefulness
Eight articles were included with data concerning the clinical usefulness of the classification methods. Pishnamaz et al9 used the AO classification and illustrated the treatment strategy depending on AO fracture type. There was a large difference between German and Dutch spine surgeons regarding treatment of burst fractures (AO type A3). Whereas in Germany 96.2% of the A3 fractures were treated surgically, in the Netherlands only 41.2% of these burst fractures were operated. They stated that despite the internationally used classification systems, there is insufficient evidence to install a standard treatment algorithm for fractures of the thoracolumbar spine.
Rajasekaran et al23 published their results concerning the usefulness of the new AO spine classification in 2016. Forty-one AO Spine members classified 30 sets of images of patients with thoracolumbar spine trauma of varying severity. Cases were assessed independently and the reviewers were asked to answer questions regarding fracture classification, type of treatment, and need for further investigations. The presented kappa values were not correlated with the observers, but with the diagnostics, as they were measured in plain radiographs, CT, and MRI. Hence, these data could not be included in the aforementioned reliability section. However, they also looked for the decision on fracture management. After the first assessment with plain radiographs, 72% of the patients were indicated for surgical treatment. This percentage increased significantly to 81.7% with CT images. Additional MRI, however, did not alter treatment strategy.
Six papers debated the applicability of the treatment algorithm of the TLICS/TLISS system. Vaccaro et al,18 who proposed the TLISS, showed in their study in 2006 a > 96% agreement on the treatment recommendation of the TLISS within a group of 5 observers scoring 71 clinical cases. Harrop et al,3 who were also part of the development team for the TLISS classification, published the results of 48 observers who assessed 56 cases. They reported an agreement of > 90% among the surgeons on the preferred management of the fracture and the TLISS-graded management. In 2007, Whang et al4 presented validity data of the assessment of 25 cases by 5 observers. They distinguished between TLICS and TLISS, but did not report any significant differences between these 2 (almost) equal classifications. A correct prediction was achieved in > 90% of the cases, with a sensitivity of 89% and a specificity of 95%. In 2010, Andrei et al21 collected the data of a retrospective surgical cohort and presented the safety and applicability of the TLICS system. Forty-nine patients were included. In 47 of the 49 patients (95.9%), the TLICS accurately matched surgical decision making. There were 2 patients with a TLICS score of 2 points who underwent surgical treatment. Both patients were diagnosed with an L1 burst fracture without neurological injury. Operative treatment was recommended by the surgeon because of concerns about the comminution and the possibility of progressive deformity. In addition to this study, in 2013 Joaquim et al22 performed an analysis of a large retrospective cohort (N = 458), in which 310 patients were treated conservatively; 99% of these patients had a TLICS < 4. There were 9 failures, defined as patients who received surgical treatment in a second stage. Three missed B-type fractures required surgery because of progressive deformity and severe pain. One patient needed surgery after 6 months because of severe L5 radiculopathy (unknown if this was related to the fracture). Five patients with burst fractures underwent surgery because of persistent pain or progressive kyphosis. Only 2 of these had pain improvement postoperatively. Furthermore, the author stated that of the 125 patients with burst fractures without neurological deficit (TLICS 2), 96% were successfully treated without surgery. The second group consisted of 148 patients who all received surgical treatment. Twenty-four complications (16.2%) were reported, varying from instrumental removal and urinary infection to death (N = 1). Surgical treatment matched the TLICS recommendation only in 46.6% of the cases. The 53.4% mismatches were all stable burst fractures (TLICS 2). No details about complications or other clinical implications in the subgroup of surgically treated patients with stable burst fractures were described.
Recently, Park et al15 described a modified TLICS score, and measured the clinical usefulness of this modified TLICS and the original TLICS classification. The analysis was performed on 134 fractures, and images were independently interpreted by 2 observers. Thirty-one patients were treated surgically. Two of these patients had a TLICS < 4 (6%) and 58% (n = 18) of the surgically treated patients scored a TLICS 4. Of the 103 conservatively treated patients, only one scored TLICS 5 by both observers.
In summary, literature concerning the clinical usefulness of the classification methods is sparse. Joaquim et al22 reported the largest series, in which it appeared that over 50% of patients suffering a stable burst fracture (TLICS 2, AO type A3/A4) were surgically treated, and treatment decision was based on other patient- and fracture-related factors, mainly persistent pain and progressive kyphosis.
DISCUSSION
With the available literature, we would postulate that the accuracy of all 3 reviewed classifications is sufficient for use in clinical practice.2–5,8–20 Although kappa values are in favor of the TLICS, we also believe that the accuracy of the AO spine classification is sufficient for use in clinical practice. Since these classifications have a different design, and the kappa values are calculated from different numbers of variables (more fracture morphology options in the AO spine classification), it is difficult to directly compare the kappa values of the TLICS versus the AO. With interobserver kappa values of 0.36–0.77 for the new AO spine, and 0.24–0.88 for TLICS management, they do not all reach the kappa value of > 0.55, which is necessary for a classification system to be clinically reliable, according to Sanders et al.38 But in answer to that criterion, Oner et al39 stated that this is too stringent for assessing the reliability of a spinal fracture classification system. As kappa values depend on the number of options (fracture types), observers, and cases, one could state that in studies with many observers and many fracture types a kappa value of < 0.55 may be deemed acceptable. Blinding of observers regarding treatment decision and outcome is important to have the lowest risk of bias in kappa values. Therefore, study design is paramount. The absolute kappa values found in the literature should therefore always be seen in relation to the quality of the studies, and numbers of cases and observers.
Although the research question regarding clinical usefulness seems very important to understand which classification can best be used in clinical practice, it is by far the hardest one. Regarding the clinical usefulness, scientific evidence remains poor. TLICS is the only current classification system that contains a point allocation with treatment recommendation in practice.
Very often, treatment decisions are not based on the classification alone. Current literature shows > 90% agreement for the quite obvious treatment decision in simple compression fractures (conservative) and clearly unstable B and C type fractures (surgical).3,4,18,21,22 However, in only 50% of the cases regarding stable burst fractures (AO A3/A4 and TLICS 2), treatment recommendation of the TLICS classification is followed by the surgeons, as shown by Joaquim et al.22 Despite the evidence considering the safety of conservative treatment40,41 and well-known negative clinical implications of surgery (eg, complications, limitation in spinal movement), in most patients with stable burst fractures an operation was performed.
In 2016, Bakhsheshian et al42 published a review of evidence-based management of stable thoracolumbar burst fractures. They concluded that a high level of evidence demonstrated similar functional outcomes with conservative management when compared with open surgical operative management. However, some burst fractures treated conservatively had a poor outcome with progressive kyphosis and persistent pain, which could be the reason for uncertainty in the clinical management of burst fractures.
For an appropriate treatment algorithm, the prognostic factors responsible for worse outcome of these burst fractures should be elucidated, which we will discuss in the following sections.
However, next to prognostic factors, we should be aware that differences in treatment strategies also arise from other causes. These differences probably start with the lack of a worldwide, uniformly accepted definition of “right” or “optimal” care, as these definitions are mainly opinion based. These opinions are often formed by the surgeon's culture and skills and institutional possibilities. Additionally, patient preferences and individual risk factors may play an important role.
The Uncertainties Concerning Posterior-Ligament-Complex Integrity and Burst Fractures
Burst fractures and the posterior-ligament-complex (PLC) integrity remain the most important uncertainties. Clarification regarding these parameters would improve uniformity in decision making. Schnake et al43 and Leferink et al44 showed in their studies that worse outcomes may be due to the fact that some burst fractures are actually missed B-type fractures. The difference between burst and B-type fractures relies on the integrity of the PLC. Reliable assessment of the PLC integrity is therefore crucial in these cases.
Unfortunately, evidence about the role of standard MRI in addition to plain radiographs and CT is contradictory. Oner et al,8 Salgado et al,27 Pizones et al,28 and Winklhofer et al29 presented results in which MRI seemed to improve reliability and influence treatment strategies compared to CT. But Rajasekaran et al23 performed a study with similar sensitivity for CT and MRI, and reported no change in treatment decision after additional MRI. This would indicate that plain radiographs and CT suffice for classification and treatment decision. Literature regarding the reliability of MRI in PLC status resulted in fair to moderate kappa values (kappa approximately 0.4),45 and demonstrated relatively high negative predictive values and relatively low positive predictive values for PLC injuries.46
Despite the controversial evidence regarding MRI and PLC, agreement on the PLC status is important, especially in burst fractures. We suggest that routine MRI could be of additional value in burst fractures. If MRI shows no edema in the PLC, its integrity is established. But without an additional MRI, it is probably safer to value the PLC as undetermined in most burst fractures, leading to recommendation of surgical treatment.
Prognostic Patient- and Fracture-Related Parameters
In addition to the PLC status, anterior comminution remains an important risk factor for worse outcome in burst fractures, although definite evidence concerning its role is still lacking. Recently, Spiegl et al47 discussed a key role for the intervertebral disc in determining the long-term clinical and radiological outcome of burst fractures. Incorporation of the intervertebral disc pathology into the existing classification systems might be a valuable prognostic factor. In addition to these factors, previous studies also stated that several other parameters might influence outcome in thoracolumbar fractures. Shen et al30 published results of a radiological and binary logistic regression analysis. They showed that visual analog scale pain scores and interpedicular distance could be significant risk factors for failure of nonoperative treatment of burst fractures. Furthermore, lower bone quality and bone regeneration (eg, osteoporosis), higher age, and fracture localization at the thoracolumbar junction seem to be responsible for worse radiological outcome.30,48
Clinical usefulness of the current classifications is still limited as outcome is influenced by the abovementioned patient- and fracture-related parameters. Including these parameters in future classification systems may enhance prognostic value and thus clinical usefulness of such classifications.
As an extension of the new AO spine classification, Vaccaro et al20 recently introduced the thoracolumbar AO spine injury score. This score contains a treatment algorithm, not only based on the classification of the fracture morphology, but including a point allocation for neurological status and patient-specific modifiers (eg, PLC status and ankylosing spondylitis). In 2016, Kepler et al49 and Vaccaro et al50 presented a validation study of this AO spine injury score. In these studies, input from surgeons worldwide was used to determine the initial treatment recommendation.
In this respect, by adding patient-specific modifiers, the thoracolumbar AO spine injury score acknowledges that other patient- and fracture-related parameters are important in the search for a worldwide applicable and accepted classification and treatment algorithm for thoracolumbar spine fractures.
CONCLUSION
Current TLICS and new AO spine classifications have acceptable accuracy regarding their reproducibility, but are limited in clinical usefulness since the treatment recommendation is not always implemented in clinical practice, mainly in burst fractures. Differences in treatment decision making arise from several causes, such as surgeon and patient preferences, culture, and prognostic factors that are not included in classifications yet.
The recently validated thoracolumbar AO spine injury score including patient-specific modifiers seems promising for use in clinical practice. However, we would suggest further evaluation of the clinical usefulness of this score, and consider adding more relevant parameters associated with a worse outcome.
Appendix
APPENDIX 2. SEARCH PROCESS
Searches were conducted March 16, 2017, on PubMed, Embase, Cochrane Database, and CINAHL.
PubMed: 1128 hits. See Table A1.
Embase: 2775 hits.
Cochrane: 134 hits.
CINAHL: 279 hits.
Footnotes
Disclosures and COI: The authors have no disclosures, and did not receive any funding. The manuscript submitted does not contain information about medical devices or drugs.
- This manuscript is generously published free of charge by ISASS, the International Society for the Advancement of Spine Surgery. Copyright © 2020 ISASS