New research proposes a system for determining the relative accuracy of predictive AI in a hypothetical medical setting, and when the system should defer to a human clinician
Artificial intelligence (AI) has great potential to improve the way people work in a range of industries. But to integrate AI tools into the workplace in a safe and responsible way, we need to develop more robust methods for understanding when they can be most useful.
So when is artificial intelligence more accurate, and when is a human? This question is particularly important in healthcare, where predictive AI is increasingly used in high-stakes tasks to assist clinicians.
Today, in Nature Medicine, we have published our joint paper with Google Research, which proposes CoDoC (Complementarity-driven Deferral-to-Clinical Workflow), an AI system that learns when to rely on predictive AI tools and when to defer to a clinician for the most accurate interpretation of medical images.
CoDoC explores how we might harness human-AI collaboration in hypothetical medical settings to deliver the best outcomes. In one example scenario, CoDoC reduced the number of false positives by 25% on a large, de-identified UK mammography dataset, compared with commonly used clinical workflows – without missing any true positives.
This work is a collaboration with several healthcare organizations, including the United Nations Office for Project Services' Stop TB Partnership. To help researchers build on our work to improve the transparency and safety of AI models in the real world, we have also open-sourced CoDoC's code on GitHub.
CoDoC: An add-on tool for human-AI collaboration
Building more reliable AI models typically requires re-engineering the complex inner workings of predictive AI models. However, for many healthcare providers, redesigning a predictive AI model is simply not possible. CoDoC can potentially help improve predictive AI tools for their users without requiring them to modify the underlying AI tool itself.
When developing CoDoC, we had three criteria:
- Non-machine-learning experts, such as healthcare providers, should be able to deploy the system and run it on a single computer.
- Training would require a relatively small amount of data – typically, only a few hundred examples.
- The system would be compatible with any proprietary AI model and would not need access to the model's inner workings or the data it was trained on.
Determining when predictive AI or a clinician is more accurate
With CoDoC, we propose a simple and usable AI system to improve reliability by helping predictive AI systems “know when they don’t know”. We looked at scenarios where a clinician might have access to an AI tool designed to help interpret an image, for example, examining a chest X-ray to determine whether a tuberculosis (TB) test is needed.
For any given clinical setting, CoDoC requires only three inputs for each case in the training dataset.
- The predictive AI's confidence score, between 0 (certain no disease is present) and 1 (certain disease is present).
- The clinician's interpretation of the medical image.
- The ground truth of whether disease was present, as established, for example, through biopsy or other clinical follow-up.
Note: CoDoC does not require access to medical images.
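To make these three inputs concrete, here is a minimal sketch of what one training case might look like in code. The class and field names are hypothetical, chosen for illustration only, and are not CoDoC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TrainingCase:
    """One training example for a CoDoC-style deferral system.

    Field names are illustrative, not CoDoC's actual schema.
    """
    ai_confidence: float    # predictive AI's score in [0, 1]
    clinician_opinion: int  # clinician's reading: 0 = no disease, 1 = disease
    ground_truth: int       # established label, e.g. from biopsy or follow-up

# Note that no medical image is stored -- only the three scalar inputs.
cases = [
    TrainingCase(ai_confidence=0.92, clinician_opinion=1, ground_truth=1),
    TrainingCase(ai_confidence=0.31, clinician_opinion=1, ground_truth=0),
    TrainingCase(ai_confidence=0.07, clinician_opinion=0, ground_truth=0),
]
```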
CoDoC learns to establish the relative accuracy of the predictive AI model compared with clinicians' interpretations, and how that relationship varies with the predictive AI's confidence scores.
Once trained, CoDoC could be inserted into a hypothetical future clinical workflow involving both an AI and a clinician. When a new patient image is evaluated by the predictive AI model, its associated confidence score is fed into CoDoC, which then assesses whether accepting the AI's decision or deferring to a clinician will ultimately result in the most accurate interpretation.
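As a rough illustration of how such a deferral policy could work, the sketch below bins the AI's confidence scores, checks on the training cases whether the AI or the clinician was more often correct within each bin, and at inference defers whenever the clinician was historically more accurate in the relevant bin. This is a deliberately simplified stand-in for CoDoC's learned policy, reusing the hypothetical `TrainingCase` records from the sketch above:

```python
def fit_deferral_rule(cases, n_bins=10):
    """Learn, per confidence bin, whether the clinician's reading was
    historically more accurate than the AI's own call (thresholded at 0.5)."""
    defer = [False] * n_bins
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [c for c in cases
                  if lo <= c.ai_confidence < hi
                  or (b == n_bins - 1 and c.ai_confidence == 1.0)]
        if not in_bin:
            continue  # no training data in this bin: trust the AI by default
        ai_correct = sum((c.ai_confidence >= 0.5) == bool(c.ground_truth)
                         for c in in_bin)
        clin_correct = sum(c.clinician_opinion == c.ground_truth
                           for c in in_bin)
        defer[b] = clin_correct > ai_correct
    return defer

def decide(ai_confidence, defer_table, n_bins=10):
    """At inference, accept the AI's call or defer, using the learned table."""
    b = min(int(ai_confidence * n_bins), n_bins - 1)
    if defer_table[b]:
        return "defer to clinician"
    return "disease" if ai_confidence >= 0.5 else "no disease"

defer_table = fit_deferral_rule(cases)
print(decide(0.92, defer_table))
```

A real system would need to handle sparse bins, calibration, and uncertainty far more carefully; this sketch only conveys the overall shape of the decision rule.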
Increased accuracy and efficiency
Our comprehensive testing of CoDoC with multiple real-world datasets – including only historical and de-identified data – showed that combining the best of human expertise and predictive AI results in greater accuracy than either alone.
In addition to achieving a 25% reduction in false positives for a mammography dataset, in hypothetical simulations where an AI was allowed to act autonomously on certain cases, CoDoC reduced the number of cases that needed to be read by a clinician by two thirds. We also showed how CoDoC could hypothetically improve the triage of chest X-rays for onward TB testing.
Responsible AI development for healthcare
Although this work is theoretical, it demonstrates our AI system's adaptability: CoDoC was able to improve performance in medical imaging interpretation across varied demographic populations, clinical settings, medical imaging equipment, and disease types.
CoDoC is a promising example of how we can harness the benefits of AI in combination with human strengths and expertise. We are working with external partners to rigorously evaluate our research and the system's potential benefits. To bring technology like CoDoC safely into real-world medical settings, healthcare providers and manufacturers will also have to understand how clinicians interact differently with AI, and validate systems with specific medical AI tools and settings.
Learn more about CoDoC: