Abstract
The pace at which the US Food and Drug Administration approves medical devices that incorporate artificial intelligence (AI) or machine learning is accelerating. As of September 2021, 350 such devices had been approved for commercial sale in the United States. Just as AI has become ubiquitous in our lives (keeping our cars between the lines on the highway, converting speech to text on the fly, recommending movies, books, or restaurants, and so much more), it also appears destined to become a routine aspect of daily spine surgery. Neural network types of AI programs have achieved extraordinary pattern recognition and predictive abilities, far surpassing human capabilities, and thus appear well suited to back pain and spine surgery diagnostic and treatment pattern recognition and prediction tasks. These AI programs are also data hungry. As luck would have it, surgery generates an estimated 80 MB per patient per year collected in a variety of datasets. When aggregated, this represents a 200+ billion patient record data ocean of diagnostic and treatment patterns. Such Big Data, when combined with a new generation of convolutional neural network (CNN) AI, sets the stage for a cognitive revolution in spine surgery. However, there are important issues and concerns. Spine surgery is a mission-critical task. Because AI programs lack explainability and are absolutely reliant on correlative, not causative, data relationships, the emerging role of AI and Big Data in spine surgery will likely come first in productivity tools and later in narrowly defined spine surgery tasks. The purpose of this article is to review the emergence of AI in spine surgery applications and examine spine surgery heuristics and “expert” decision models within the context of AI and Big Data.
- Artificial intelligence
- big data
- spine surgery heuristics
- machine learning
- FDA AI/ML guidelines
- approved AI/ML devices
Introduction
On 11 March 1997, the US Food and Drug Administration (FDA) approved the first medical device (a sleep monitoring system) that used artificial intelligence (AI) or machine learning (ML). Over the next 13 years, only 7 more were approved. After 2010, however, the pace accelerated rapidly (Figure 1). As of the most recent published information, 22 September 2021, the US FDA has approved for commercial sale in the United States 350 devices that incorporate AI/ML software.1
Coincident with the rise of AI/ML submissions, the FDA issued guidance in a proposed regulatory framework on 2 April 2019, in which it described its potential approach to premarket review for AI and ML-driven software modifications.2 Most recently, on 12 January 2021, the FDA issued an AI/ML action plan related to medical devices, which described a multipronged approach to advance the agency’s oversight of AI/ML-based medical software.3 AI is also playing a role in accelerating pharmaceutical and biological product development, improving practice and payer management and, likely, the scope and quality of future clinical research.4
AI is a category of software that autonomously learns and processes, indeed requires, large volumes of data. These are probabilistic algorithms that process significant quantities of multimodal data and provide, in effect, P values for surgeon-defined or clinically defined outcomes. When combined with Big Data, AI is an interesting and potentially transformational innovation for medicine. It could also deliver a cognitive revolution in medicine.5
While most discussions of AI and Big Data are grounded in the engineering of these decision support tools, the purpose of this article is to review the emergence of AI-based analysis and decision tools for spine surgery applications and, in the process, examine spine surgery heuristics and “expert” decision models within the context of AI and Big Data.
Artificial vs Human Intelligence
In the earliest days, in the 1950s, the vision for the AI category of software was machine “common sense”—essentially embedding the ability to perceive, understand, and judge in a manner that is shared by nearly all humans—into machines.6 That vision later changed to AI—implying fabricated, synthetic, and reproducible-at-scale intelligence.
As these software algorithms, operating on increasingly powerful processors, surpassed human memory capacities and processing speeds, terminology changed and pointedly differentiated between the human ability to think, reason, and apply common sense and the machine’s ability to process inputs, learn from those inputs, and deliver probabilistic outcome data. The terminology changed to carve out the space for human intelligence and a different space for machine intelligence.
DeepMind and the CNN Moment
The ancient board game Go was considered to be incomputable due to the number of possible moves, which is 10 to the 172nd power. AI took a significant leap forward in 2016 when AlphaGo, developed by DeepMind (founded in 2010 by Demis Hassabis, Shane Legg, and Mustafa Suleyman), defeated Lee Sedol, the world champion Go player. AlphaGo beat the world’s best Go player by innovating moves that had not occurred to humans in the 2,500-year history of the game. This event fundamentally altered the course of AI technology investment.7 Since AlphaGo, China in particular has invested more than $18 billion in AI, filed more than 21,000 AI patents (US inventors filed 30,000), and declared that global AI leadership was a new national priority.
Since then, DeepMind’s CNN AI programs have been applied to recognizing protein structure patterns, acquiring computer programming skills, analyzing and correcting ancient Greek texts, and solving knot theory, among other high-end scientific problems. Here, for example, are DeepMind’s achievements after 2016:
2018: AlphaFold accurately predicted 25 out of 43 proteins, winning the 13th Critical Assessment of Techniques for Protein Structure Prediction contest.
2022: AlphaCode autonomously coded programs at a rate comparable with that of an average programmer.
2022: Ithaca predicted and restored missing text of damaged ancient Greek inscriptions, identified their original location, and established the date they were created. It achieved 62% accuracy in restoring damaged texts (vs 25% for humans), achieved 71% accuracy in identifying original text location, and dated texts to within 30 years of their ground-truth date ranges. Ithaca is causing historians to re-evaluate significant periods in Greek history.8
2022: Knot Theory AI used invariants, one of the properties of the mathematics of knots (of which there are millions), to “teach” the neural network. Then, using saliency maps, an AI technique, it uncovered heretofore unknown knot mathematical relationships and properties.9
DeepMind employed a CNN type of AI along with a range of mathematical weighting systems to achieve these remarkable pattern recognition and prediction outcomes. The diagnostic, treatment, and rehabilitation patterns that make up the heuristics of spine surgery are well suited to a DeepMind type of probabilistic, CNN AI system. What previously unknown back pain diagnostic pattern or treatment plan could these tools uncover?
Spine Surgery Heuristics
The heuristics, or operating rules, of spine surgery developed from years of study, training, and practice can be summarized as 4 basic decision points:
whether to operate or not
choosing between surgical approaches and techniques
avoiding complications and risk
adjusting the entire enterprise for individual patient comorbidities
These heuristics are unique to each surgeon’s judgment, ability to reason deductively, and aptitude for applying values and experiences.
Surgeons encounter patients linearly, 1 at a time. Assuming a 35-year career from residency to retirement, the average spine surgeon encounters approximately 40,000 back pain patients and operates on approximately 7000 of them. In psychological terms, surgeons view the “black box” of spine disease using “spotlight vision”—1 patient at a time, biased for regional patient characteristics and informed by their personal educational and lived experiences. No individual spine surgeon treats a representative sample of spine surgery patients.
Surgeons learn through the patients they treat to make complex diagnostic judgments, employ personal surgical heuristic models, and then, under duress of time, cost, workload, and incomplete information (inaccessible prior patient records, language barriers, etc), deliver advanced, complex treatment. Ask 10 spine surgeons how they would treat a single difficult case, and multiple answers will emerge, again based on each surgeon’s unique spotlight view of treating back pain surgically. Saravi et al described this phenomenon in their recent article on AI-driven prediction modeling10:
Interestingly, the increase in performed surgeries is not directly proportional to improved patient outcomes. Impaired quality of life, persistent pain, and functional problems are reported in up to 40% of patients undergoing low back pain surgery and 20–24% undergoing revision surgeries. Indications influencing the decision as to whether a patient should undergo surgery are not entirely based on guidelines but rather on discussions between the surgeon and patient, as well as the expertise and skills of the surgeon. Furthermore, there are no clear guidelines on surgical techniques for treating degenerative spinal diseases; as such, it remains unclear as to whether one treatment approach might perform better in particular cases than another.
How AI Plus Big Data Changes Spine Surgery Heuristics
AI plus Big Data (which opens the door to the law of large numbers) is like a lantern, illuminating 360° of a patient’s back pain pathology. The law of large numbers is a theorem upon which several industries (eg, life insurance, auto insurance, casinos) are based. It postulates that “the average of the results obtained from a large number of trials should be close to the expected value and tends to become ever closer to the expected value as more trials are performed.” In other words, the larger the n, the closer the actual outcomes approximate expected outcomes. This phenomenon, particularly when applied to biological processes (eg, mortality), requires a very large n.
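The convergence described by the law of large numbers can be illustrated with a small simulation. The sketch below is illustrative only: the 10% event rate, the sample sizes, and the function names are hypothetical assumptions chosen for the example, not figures from any spine surgery dataset.

```python
import random


def observed_rate(n_patients, true_rate, rng):
    """Fraction of n simulated patients who experience the event."""
    events = sum(1 for _ in range(n_patients) if rng.random() < true_rate)
    return events / n_patients


def simulate(true_rate=0.10, sizes=(10, 100, 1_000, 10_000, 100_000), seed=0):
    """Observed event rate for increasingly large simulated cohorts.

    true_rate is a hypothetical expected value (an assumption for this
    sketch), standing in for something like a complication rate.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same draws
    return {n: observed_rate(n, true_rate, rng) for n in sizes}


if __name__ == "__main__":
    for n, rate in simulate().items():
        print(f"n={n:>7}: observed rate {rate:.4f} (expected 0.1000)")
```

Small cohorts can land far from the expected rate, while the largest cohort lands very close to it, which is the sense in which a single surgeon's case series differs from an aggregated data ocean.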
When AI scientists describe the coming cognitive revolution in medicine, that is essentially what they are referring to—the ability of surgeons who use these tools to see unexpected, previously unknown, and clinically relevant diagnostic and treatment patterns. For the practicing spine surgeon 10 years from now, for example, it may well be possible to use probabilistic AI-based models and P value generators, which access integrated networks of 200+ billion patient records—including genetic records—and thereby function as the surgeon’s expert virtual assistant, residing perhaps on a smartphone, and feeding P values to the surgeon from triage to surgery and on through rehabilitation.
Drawing on a law of large numbers analogy, AI plus Big Data puts every spine surgeon in the position of being the “house” in the casino of spine surgery—odds accruing to the surgeon’s (and by extension the clinic’s, hospital’s, and patient’s) favor. (Casinos and life and health insurance companies are examples of commercial enterprises based on the law of large numbers which—I am suggesting—is also the emerging future of spine care.) In other words, in the next decade, the comment “my patient did not have an optimal outcome due to unforeseen and underlying comorbidities” could be nearly extinct.
Big Data and Predictive Value
These new AI forms of neural networks and deep learning algorithms require significant quantities of data, and their effectiveness is correlated with the amount of data they can access for training and pattern discernment. One of the reasons these tools appear well suited for spine surgery is that the process of treating back pain generates vast streams of data, more than 80 MB of information per patient yearly.11 Where do these data go?
I identified 7 existing and prospective databases that together hold approximately 200 billion patient records. If ever combined, cleaned, and relevantly tagged, they would form a data mining ocean for these new AI forms and other systems (Table 1).
Currently available AI algorithms (notably, but not exclusively, CNNs) can mine this data ocean and apply powerful abilities to identify patterns and predict outcomes, much as DeepMind did for the game Go, which had 10 to the 172nd power decision points. Importantly, these new tools are able to process multi-input/mixed data streams. Furthermore, their predictive value is grounded in the law of large numbers. More data and more training iterations produce more clinically relevant predictive values. At this Gutenberg moment for spine surgery, what is possible?
Saravi et al, in their 2022 article in the Journal of Personalized Medicine, found answers to that question in a review of the literature. Using the terms “spine surgery” and “machine learning” or “artificial intelligence” and searching PubMed, Web of Science, and Google Scholar, the research group found 64 AI-based predictive studies where, in broad terms, either neural network prediction models or ML prediction models were used. The most commonly sought prediction models by the 5 dozen research groups were patient-reported outcome measures, complications, discharge dispositions, and length of stay predictions. The least common was future fracture odds (Figure 2).
A total of 28 different AI or statistical analysis algorithms were employed in the prediction studies. The 5 most common were artificial neural network, logistic regression, random forest, decision tree learning, and support vector machine (Figure 3). The median number of patients in these prediction studies was 1053. The largest n was 1,106,234, and the lowest was 27.
In a 2019 study, Ogink et al12 used the American College of Surgeons National Surgical Quality Improvement Program (n = 28,600) to predict nonhome discharge rates after lumbar spinal stenosis surgery. Using a CNN type of AI, Ogink et al reported a 0.74 accuracy measure, which was confirmed by another study validating the model. That is encouraging, and given that the program used is a CNN type of AI, a larger n and more time to learn should push accuracy rates higher.
Kim et al13 applied neural network algorithms to predict complications following posterior lumbar fusion surgery (n = 22,629). Their conclusions were that such neural network types of AI algorithms are capable of outperforming the American Society of Anesthesiologists classification and logistic regression system for predicting several types of lumbar fusion surgery complications.
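Head-to-head comparisons of prediction models are commonly scored with a discrimination statistic such as the area under the receiver operating characteristic curve (AUC). The studies cited here are not described in enough detail to say which metric each one used, so the sketch below simply assumes AUC for illustration. The labels and scores are hypothetical toy values, not data from any of the studies.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = complication occurred, 0 = no complication.
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]  # scores from an informative model
model_b = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # scores from an uninformative model

if __name__ == "__main__":
    print(f"model A AUC: {auc(labels, model_a):.3f}")
    print(f"model B AUC: {auc(labels, model_b):.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so one model "outperforming" another on this statistic means it ranks patients with the outcome above patients without it more often.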
Overall, in the absence of a data ocean, much like the one hypothesized above, the ns found in this review are the best that are currently available. For the complex biological and mechanical problems that bear on the process of spine surgery, these ns are interesting, instructive, and offer tantalizing clues as to where AI plus Big Data might be directed initially.
AI as a Productivity Tool
Two productivity tools are available (or soon will be) for US spine surgeons. One represents the first practical conversion, at scale, of unstructured, colloquial patient/physician conversations into coded electronic medical records data. The other represents the first practical step to autonomous diagnostics in the clinic.
AI medical scribing
AI triage
AI Digital Medical Scribing
The Health Information Technology for Economic and Clinical Health Act, which was part of the American Recovery and Reinvestment Act of 2009, incentivized physicians, clinics, and hospitals to employ electronic health record systems.
Over the next decade, about 72% of office-based physicians and 96% of nonfederal acute care hospitals14 adopted a certified electronic health record. As a result, physician time required for patient documentation increased significantly, patient workflow was often adversely affected, and physician stress and burnout increased.
Most hospitals and clinics currently employ medical scribes, typically unlicensed medical assistants (also premedical students and registered or licensed nurses), to ease the physician documentation workload by transcribing patient-physician interactions. Approximately 100,000 medical scribes are currently employed in the United States to perform those functions. The annual cost of a human medical scribe ranges from $30,000 to $50,000, not including an estimated $6000 in training costs.15 Thus far, 3 companies are seeking to disintermediate the $4 billion medical scribe market with AI-based digital scribe technologies (Table 2).
Each currently available AI scribing system provides voice-enabled ambient recording (audio, video, or both) at the point of care (office or telehealth). Using multifactor authentication and data encryption algorithms, the system then transcribes those unstructured, colloquial, and imprecise natural language conversations into electronic medical record (EMR)–ready data in a manner compliant with the Health Insurance Portability and Accountability Act.
There are numerous digital scribing systems on the market, but these are the first true AI-based systems that not only convert conversation information into data but also autonomously link to EMR systems. Additionally, 2 of the systems (Robin Healthcare and DeepScribe) tag the data with relevant reimbursement codes.
The form of AI these systems employ is a blend of artificial narrow intelligence and artificial general intelligence (AGI) and uses CNN engines. In use, these systems have demonstrated the ability to reduce human scribing workloads. The key challenges for these new systems include the following:
Standardizing the physician documentation process without losing important diagnostic nuances. In other words, how closely can these systems match the physician’s ability to discern a patient’s true condition?
Error-free links to existing EMR systems.
Validated coding accuracy.
AI Triage Bots
AI Triage Bots, software that operates as the surgeon’s agent to perform patient assessment without specific human direction, are arguably the first step to autonomous back pain diagnostics.
So far, diagnostic Bot performance is a mixed bag, but lessons are emerging and better AI Triage Bots for the spine are moving from development to market. A small number of early published studies offer intriguing insights into both the promise and the unexpected difficulties of AI-based Triage Bots.
Lath et al19 compared the accuracy and safety of the MayaMD AI-powered triage and diagnostic system with human doctors. In their study design, Lath et al presented 5 different clinical vignettes with 4 triage options to 12 human health care providers, which resulted in 60 clinical vignettes. The same 60 clinical vignettes were fed into the MayaMD AI triage tool.
MayaMD’s library and core algorithm were taught using accepted evidence-based knowledge, specifically, more than 7000 diagnoses, 8500 initial inputs (symptoms, physical signs, and laboratory results), 40,000 inferences, and 2220 medications and interventions. Finally, a panel of 3 physicians (2 internal medicine specialists and 1 surgeon) reviewed the 60 clinical vignettes and provided their holistic view of the most appropriate triage option. MayaMD accurately matched physician consensus in 55 out of 60 case vignettes (91.7%). The individual health care providers matched physician consensus in 45 of 60 case vignettes (75%). The research team concluded that MayaMD performed significantly better than individual clinicians when determining a triage decision for a clinical vignette (91.7% vs 75%, P = 0.04).
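A comparison of two match rates like this one can be checked with a standard two-proportion test. The sketch below is illustrative only: the statistical test behind the reported P = 0.04 is not specified here, so this unadjusted pooled z-test is an assumption and will not necessarily reproduce the published value. It also treats all 60 responses as independent, which is a simplification because each provider answered several vignettes.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions.

    Returns (rate_a, rate_b, z, p_value).
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Counts reported in the text: 55 of 60 (AI tool) vs 45 of 60 (clinicians).
rate_ai, rate_human, z_stat, p_val = two_proportion_z_test(55, 60, 45, 60)

if __name__ == "__main__":
    print(f"AI tool match rate:   {rate_ai:.1%}")
    print(f"Clinician match rate: {rate_human:.1%}")
    print(f"z = {z_stat:.2f}, two-sided P = {p_val:.3f}")
```

Under these assumptions the difference also falls below the conventional 0.05 threshold, which agrees in direction with the conclusion reported by the research team.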
Fan et al20 analyzed a widely deployed self-diagnostic AI bot in China. The team compiled data for 47,684 chatbot sessions by 16,519 users over 6 months. They found that the AI chatbot was used by all age groups, including older adults. Users consulted the chatbot for a wide range of medical conditions. Two prominent issues emerged from this study: first, a considerable number of users dropped out in the middle of the consulting sessions, and second, some users pretended to have health concerns and used the chatbot for extraneous therapeutic purposes.
Studies of non-AI chatbots21 have also found that generalized health chatbots are well used and reduce the urgency of some cases but, overall, have not had a measurable effect on physician workload or patient outcomes. More targeted AI bots, supported by well-validated, evidence-based diagnostic systems and a significant n of diagnostic and patient experience records, are currently in beta use at several spine and neurosurgery clinics. One system, which was tested at OrthoCarolina and other locations around the United States, used data from each patient’s electronic health record combined with answers to a self-administered questionnaire and then, using more than 1 validated, evidence-based diagnostic tool, performed autonomous triage of prospective spine surgery patients. The initial results have been encouraging.
Categories of AI and the Law of Large Numbers
Current and future AI tools can be organized into 3 conceptual categories:
Artificial narrow intelligence (weak AI)
- Task-specific, such as chess games, accounting functions, speech recognition, patient triage, and consumer behavior weighting
- Virtual assistants such as Siri, Alexa, etc
- Can deliver strong P values with limited data for specific tasks
Artificial general intelligence (AGI; strong AI)
- Not task-specific, such as DeepMind CNN AI applied to the game Go, autonomous computer programming, digital image analysis, diagnostic pattern analysis, and autonomous rule generation
- CNN (deep learning)
- Requires large datasets to deliver reliable P values
- Demonstrated ability to uncover new knowledge
Artificial super intelligence (strongest AI)
- Surpasses human intelligence and ability in all areas
- Requires access to nearly all data
- No examples exist
- Self-learning
No hard boundaries exist between or within these categories. Specific AI applications may have elements of both artificial narrow intelligence and AGI, for example.
Two Existential Issues: Explainability and Causal vs Correlational Decision Models
AI has become ubiquitous in our lives—keeping our cars between the lines on the highway, converting speech to text on the fly, recommending movies, books, or restaurants, and so much more. One of AI’s greatest strengths—the ability to discern patterns faster and more accurately than humans—is also a weakness to the extent that the patterns it “reads” are the basis for recommendations. Could an AI program’s output be the result of a coincident, temporary anomaly, which is often referred to as “noise”? Who would or could know? These AI systems are not built to understand context, look for causality, or even apply clear scientific principles to the data collected.
The principal subject of this article has been a new approach to decision making in spine surgery. Initially, this approach will augment existing surgeon-specific heuristic models. Eventually, by linking specific patient data to integrated multimodal data oceans and then employing AI neural networks to optimize individual patient diagnosis and treatment plans, it will transform, and possibly upend, those models.
As discussed earlier, AI began as a mission to imbue machines with “common sense” (that is, the ability to perceive, understand, and judge in a manner that is shared by nearly all humans). In a way, this implies the ability to contextualize information, explain, and even answer the question “Why?”
Humans, as we know, typically start asking “Why?” at 2 or 3 years of age and, often, never stop. AI programs, despite processing inputs at incredible speeds and power, incorporating dynamic weighting, autonomous learning, and even self-writing software, cannot explain how they arrived at their decisions or P values. That is a problem in such a mission-critical function as spine surgery.
Holzinger et al, in their 2019 study, stated the issue this way22:
Currently, explanations of why predictions are made, or how model parameters capture underlying biological mechanisms are elusive. A further constraint is that humans are limited to visual assessment or review of explanations for a (large) number of axioms. This results in one of the main questions: Can we deduce properties without experiments—directly from pure observations? (Peters et al., 2017).
Understanding, interpreting, or explaining are often used synonymously in the context of explainable-AI (Doran, Schulz, & Besold, 2017), and various techniques of interpretation have been applied in the past.... In the context of explainable-AI the term “understanding” usually means a functional understanding of the model, in contrast to a low-level algorithmic understanding of it, that is, to seek to characterize the model’s black-box behavior, without trying to elucidate its inner workings or its internal representations....
We argue that in medicine explainable AI is urgently needed for many purposes including medical education, research and clinical decision making (Holzinger, 2018). If medical professionals are complemented by sophisticated AI systems and in some cases future AI systems even play a huge part in the decision making process, human experts must still have the means—on demand—to understand and to retrace the machine decision process.
Computer scientists are working on both post hoc and ante hoc systems that might be layered over AI models in order to deliver some form of explanation to the surgeon. But it would seem at this still early phase in the development of AI and Big Data systems that these expert decision models need to also become comprehensible, understandable, and explainable.
For now, in order to become trusted decision models for spine surgeons, AI and Big Data will, I would expect, enter the spine surgeon’s practice initially as expert advisers, not autonomous decision-makers, and will find the widest use in narrowly defined spine surgery tasks.
Footnotes
Funding The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests The author reports no conflicts of interest in this work.
Disclosures Robin Young is the CEO and Founder of Pearl Diver. He reports consulting fees from Globus Medical, Robert Reid Ltd, LifeNet Foundation, and Camber Spine, and he has received support for attending meetings/travel from Castellvi Spine Meeting.
- This manuscript is generously published free of charge by ISASS, the International Society for the Advancement of Spine Surgery. Copyright © 2023 ISASS. To see more or order reprints or permissions, see http://ijssurgery.com.