Biases in medical AI products often fly under the FDA's radar

While artificial intelligence is entering health care with great promise, clinical AI tools are prone to bias and real-world underperformance at every stage from inception to deployment, including acquiring datasets, labeling or annotating them, and training and validating algorithms. These biases may reinforce existing disparities in diagnosis and treatment.

To explore the extent to which biases are identified in the FDA’s review process, we analyzed virtually all healthcare AI products approved between 1997 and October 2022. Our audit of data submitted to the FDA to release clinical AI products to market reveals major flaws in how this technology is regulated.

Our analysis

The FDA approved 521 AI products between 1997 and October 2022: 500 under the 510(k) pathway, meaning the new algorithm is substantially equivalent to technology already on the market; 18 under the de novo pathway, meaning the algorithm has no existing predicate but comes with controls intended to ensure its safety; and three under premarket approval. Because the FDA publishes summaries only for the first two pathways, we examined the rigor of the submission data underlying those 518 approvals to understand how well the submissions accounted for how bias might play out.


In FDA submissions, companies are often asked to share performance data demonstrating the effectiveness of their AI product. One of the biggest challenges for the industry is that the 510(k) process is far from formulaic, and the FDA's ambiguous position must be deciphered on a case-by-case basis. Historically, the agency has not explicitly required supporting datasets; in fact, some 510(k)-cleared products provide no data at all on possible sources of bias.

We see four areas where bias can creep into an algorithm used in medicine. This is based on best practices in computer science for training any type of algorithm, and on the realization that it is important to consider the degree of medical training possessed by the people creating or translating the raw data into something that can train an algorithm (the data annotators, in AI parlance). These four areas that can skew the performance of any clinical algorithm – patient cohorts, medical devices, clinical sites, and the annotators themselves – are not routinely considered (see table below).


Percentages of 518 FDA-approved AI products that submitted data covering sources of bias

Source of bias | Aggregate reports | Stratified reports
Patient cohort | Less than 2% performed multiracial/multi-gender validation | Less than 1% of approvals reported performance figures by gender and race
Medical device | 8% performed multi-vendor validation | Less than 2% reported performance numbers across manufacturers
Clinical site | Less than 2% performed multi-site validation | Less than 1% of approvals reported performance numbers across sites
Annotators | Less than 2% reported annotator/reader profiles | Less than 1% reported performance numbers across annotators/readers

Aggregated performance means that a vendor reports having tested across different variables but provides only a single overall performance figure, not the performance for each variable. Stratified performance is more informative: the vendor reports performance separately for each variable (cohort, device, site, or annotator group).
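To make the distinction concrete, here is a minimal sketch of the difference between the two reporting styles. The subgroup labels and data are hypothetical, not drawn from any FDA submission; the point is that an acceptable aggregate number can hide a subgroup where the algorithm fails badly.

```python
# Aggregate vs. stratified performance reporting (illustrative data only).
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    return sum(p == y for p, y in pairs) / len(pairs)

def stratified_accuracy(records):
    """records: list of (subgroup, prediction, label). Returns accuracy per subgroup."""
    by_group = defaultdict(list)
    for group, pred, label in records:
        by_group[group].append((pred, label))
    return {g: accuracy(pairs) for g, pairs in by_group.items()}

# Hypothetical validation results from two clinical sites.
records = [
    ("site_A", 1, 1), ("site_A", 1, 1), ("site_A", 0, 0), ("site_A", 1, 1),
    ("site_B", 0, 1), ("site_B", 1, 0), ("site_B", 1, 1), ("site_B", 0, 1),
]

overall = accuracy([(p, y) for _, p, y in records])  # aggregate: 0.625 overall
per_site = stratified_accuracy(records)              # stratified: 1.0 at A, 0.25 at B
```

The aggregate figure alone would suggest middling but uniform performance; the stratified view reveals that the algorithm works only at one of the two sites.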

In fact, a clinical AI product whose effectiveness is confirmed by such data is the rare exception, not the rule.

A proposal for basic submission criteria

We are proposing new mandatory transparency minimums that must be included for the FDA to review an algorithm. These cover the clinical sites from which datasets and patient populations are drawn; performance measures across patient cohorts, including race, age, sex, and comorbidities; and the different devices the AI will run on. This granularity must be provided for both training and validation datasets. Results on the reproducibility of an algorithm under conceptually identical conditions, using externally validated patient cohorts, should also be provided.

It’s also important to know who is labeling the data, and with what tools. Basic qualification and demographic information about the annotators – are they certified physicians, medical students, certified foreign physicians, or non-medical professionals employed by a private data-labeling company? – must also be included as part of a submission.

Proposing a basic performance standard is a profoundly complex task. The intended use of each algorithm determines the threshold level of performance required – high-risk situations demand a higher standard – so a single standard is difficult to generalize. As the industry works toward a better understanding of performance standards, AI developers need to be transparent about the assumptions made in their data.

Beyond recommendations: technology platforms and broader industry conversations

It takes up to 15 years to develop a drug and five years to develop a medical device; in our experience, it takes six months to develop an algorithm, which is designed to go through many iterations not only in those six months but throughout its whole lifecycle. In other words, algorithm development falls far short of the rigorous traceability and auditability required in drug and medical device development.

If an AI tool is to be used in decision-making, it should be held to standards similar to those for physicians, who undergo not only initial training and certification but also continuing education, recertification, and quality assurance throughout their medical practice.

The Coalition for Health AI (CHAI) recommendations raise awareness of the issues of clinical bias and AI effectiveness, but technology is needed to enforce them. Identifying and overcoming the four buckets of bias requires a platform approach with visibility and rigor at scale – thousands of algorithms are piling up at the FDA for review – one that can compare and contrast submissions against their predicates as well as evaluate de novo applications. Spreadsheet-based reports will not help with versioning data, models, and annotations.

What might such an approach look like? Consider the evolution of software design. In the 1980s, it took considerable expertise to create a graphical user interface (the visual representation of software), and doing so was a siloed, isolated experience. Today, platforms like Figma encapsulate the knowledge needed to build an interface and, just as importantly, connect the stakeholder ecosystem so everyone can see and understand what’s going on.

Doctors and regulators should not be expected to learn to code; instead, they should be given a platform that makes it easy to open, inspect, and test the various ingredients that make up an algorithm. It should be easy to evaluate algorithmic performance on local data and to retrain in place if necessary.

CHAI draws attention to the need to look inside the black box of AI through some sort of metadata nutrition label that lists essential facts, so clinicians can make informed decisions about using a particular algorithm without being experts in machine learning. That might make it easier to know what to watch for, but it doesn’t account for the inherent evolution – or involution – of an algorithm. Doctors need more than a snapshot of how it worked when it was first developed: they need ongoing human oversight supplemented by automated controls, even after a product reaches the market. A Figma-like platform should make it easier for humans to manually review performance. The platform could also automate some of this, comparing doctors’ diagnoses with what the algorithm predicts.
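One such automated control could be as simple as tracking how often the algorithm agrees with clinicians on recent cases and flagging it for human review when agreement slips. The sketch below is hypothetical: the labels, function names, and 90% threshold are illustrative choices, not any platform's actual design.

```python
# Hypothetical post-deployment check: flag an algorithm for human review
# when its agreement with clinician diagnoses drops below a threshold.

def agreement_rate(clinician_labels, model_predictions):
    """Fraction of cases where the model's prediction matches the clinician's."""
    assert len(clinician_labels) == len(model_predictions)
    matches = sum(c == m for c, m in zip(clinician_labels, model_predictions))
    return matches / len(clinician_labels)

def needs_review(clinician_labels, model_predictions, threshold=0.9):
    """True if agreement falls below the review threshold (illustrative: 90%)."""
    return agreement_rate(clinician_labels, model_predictions) < threshold

# Illustrative batch of recent cases.
clinicians = ["benign", "malignant", "benign", "benign", "malignant"]
model      = ["benign", "malignant", "malignant", "benign", "malignant"]

flagged = needs_review(clinicians, model)  # agreement is 4/5, below 0.9
```

In practice a platform would run such checks continuously and route flagged batches to manual review, closing the loop between automated monitoring and human oversight.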

In technical terms, what we are describing is a machine learning operations (MLOps) platform. Platforms in other areas, such as Snowflake, have shown the power of this approach and how it works in practice.

Ultimately, this discussion of biases in clinical AI tools needs to encompass not just large tech companies and elite academic medical centers, but also community and rural hospitals, veterans’ hospitals, startups, groups advocating for underrepresented communities, professional health associations, and the FDA’s international counterparts.

No voice is more important than the others. All stakeholders must work together to ensure the fairness, safety, and effectiveness of clinical AI. The first step toward this goal is to improve transparency and approval standards.

Enes Hosgor is the founder and CEO of Gesund, a company that promotes fairness, safety and transparency in clinical AI. Oguz Akin is a radiologist and director of Body MRI at Memorial Sloan Kettering in New York and a professor of radiology at Weill Cornell Medical College.



