The AI App Store Dilemma: When Medicine's Two-Tier AI Reality Meets Zero (Or Not Much) Oversight
- Stuart Akerman, MD
- 4 hours ago
- 8 min read

I had an unsettling thought while scrolling through a website for yet another AI healthcare platform promising to "revolutionize" my practice. As an early adopter - and I'll admit this won't surprise anyone who reads my content or speaks with me regularly, I'm a bit of a fanboy when it comes to new AI tools - I've tested scribes, workflow assistants, and diagnostic aids with genuine enthusiasm. But something has been nagging at me for months, and it finally clicked.
We've quietly abandoned evidence-based medicine when it comes to AI tools, and hardly anyone's talking about it.
Think about your smartphone for a second. Doesn't matter if you're Team Android or Team iPhone - you know you can download pretty much any app you want from your marketplace. Some are brilliant. Some are hot garbage. Some start great and turn into ad-riddled nightmares after a few updates. The quality varies wildly, and we've all learned to navigate that as consumers.
Now imagine applying that same marketplace model to the tools we use in clinical practice.
Here's what really got me the more I thought about it: Most of the AI tools I use every single day require absolutely no regulatory oversight whatsoever. Zero. None.
The Two-Tier Reality Nobody Explains
Let's start with what most physicians don't realize. The FDA has approved over 1,000 AI/ML-enabled medical devices. These are diagnostic and treatment tools—AI that reads CTs, interprets EKGs, analyzes pathology slides. In GI, we're seeing AI for polyp detection, Barrett's esophagus screening, and adenoma characterization. These tools go through FDA clearance pathways like 510(k) or De Novo classification.
But here's the thing—most of the AI tools we actually use daily don't fall into that category:
AI clinical scribes that document our patient encounters
Revenue cycle management tools that handle our coding and billing
Prior authorization automation platforms
Scheduling and patient communication assistants
Workflow optimization software
None of these require FDA approval. They're not classified as medical devices because they don't diagnose or treat. They're just software. And they're being marketed to us through emerging "AI marketplaces" with the same casual "try it free for 30 days" approach we use for Netflix.
So we've got two completely different worlds:
Tier 1: FDA-regulated diagnostic/treatment AI with some oversight
Tier 2: Unregulated administrative/workflow AI with zero mandatory oversight
Both show up in the same marketplaces. Both get pitched to us the same way. And most of us don't know which category our tools fall into—or why it should matter.
Why "FDA Approved" Doesn't Mean What You Think
Even for those 1,000+ FDA-approved AI devices, we need to be clear about what that actually means. Most get cleared through the 510(k) pathway, which just requires showing "substantial equivalence" to something already on the market. It's not proving the tool is better—just that it's reasonably similar and safe.
FDA clearance tells you the device probably won't harm anyone and it works about as well as whatever came before it. It doesn't tell you it's more accurate than competing tools, that it's been validated across diverse patient populations, or that it's the best option for your specific use case.
But at least there's some regulatory framework. Some validation requirements. Some post-market surveillance.
For Tier 2—the scribes and workflow tools—there's literally nothing. No required validation. No performance benchmarks. No adverse event reporting. An AI scribe could launch tomorrow with zero testing of how often it mischaracterizes clinical information, and as long as it doesn't claim to diagnose anything, it's fair game. Try a PubMed search for AI scribes and see how many high-quality hits you get for randomized controlled trials.
"It's Just Documentation—What's the Big Deal?"
I can already hear the pushback: "Come on, it's just a documentation tool. It's not making clinical decisions."
Except it kind of is.
My AI scribe isn't just transcribing words—it's interpreting what I say, deciding what's clinically relevant, structuring information, and making editorial choices. That becomes my legal medical record. It gets used by consulting physicians, becomes evidence in malpractice cases, and determines billing codes.
If my scribe consistently misses certain clinical details, or struggles with specific terminology, or has trouble with how I document exam findings—how would I even know? I'm not checking every word against what I actually said. That defeats the whole purpose. To be clear, this is no different from human scribes or transcription services; the concern isn't unique to AI. We just tend to hold computers to higher expectations.
And workflow AI influences clinical decisions more than we admit. Tools that prioritize my schedule, flag "high-risk" patients, or automate prior auth aren't just administrative conveniences. They're shaping which patients I see, when, and with what level of urgency. Those are clinical decisions, even if nobody calls them that.
The line between "administrative" and "clinical" is a lot blurrier than regulatory categories suggest.
The Marketplace Problem
We're already seeing AI marketplaces for physicians—Epic's App Orchard, Google Health platforms, Microsoft Azure clinical tools. The pitch is attractive: democratized access to cutting-edge AI. Browse, compare, download, deploy.
It's exactly how we manage apps on our phones. And that's the problem.
Marketplaces are built for consumer choice and convenience, not clinical rigor. And that creates predictable issues:
Popularity doesn't equal quality. The AI scribe with 50,000 users might just have better marketing than the one with 5,000. It doesn't mean it's more accurate.
Five-star reviews measure the wrong things. Physicians might love a tool because it's fast and user-friendly, even if it misses critical details occasionally. Those errors don't show up in app store ratings—they show up in patient charts.
Updates change everything. When my documentation AI updates its language model, that could alter how it interprets clinical information. Does anyone validate new versions before they go live in my workflow?
"Seamless integration" is oversold. I've watched colleagues spend more time wrestling with poorly designed interfaces than they save from automation. The tool technically works, just not well in the chaos of actual practice.
The Evidence Standard We're Ignoring
Here's what bothers me most. In GI, if I want to change how I manage Barrett's esophagus surveillance, I'm looking at guidelines based on large cohort studies, long-term outcome data, and society recommendations. If I'm considering a new medication for IBD, I'm reading Phase III trial data, reviewing safety profiles, and comparing it to existing options.
We demand evidence for everything. We even have a mantra built around it: evidence-based medicine (EBM).
But we often deploy AI tools in our practices based on a sales demo, a colleague's recommendation, and a free trial period. We don't ask for validation data. We don't compare accuracy metrics across competing products. We rarely even look for peer-reviewed publications on the specific tool.
And honestly? Most of that data doesn't exist or isn't publicly available.
The question isn't whether AI is useful—it clearly is. The question is: How do we make sure we're using the best AI tools when there's no standardized framework for evaluation?
What We Actually Need
I think we're missing something fundamental: a physician-driven evaluation system for clinical AI that works across both regulatory tiers.
Imagine something like Consumer Reports for medical AI, but with actual clinical validation instead of just user reviews. What would that look like?
Real-world performance data that's actually accessible. Not cherry-picked metrics from the company's marketing materials. Actual accuracy rates, error patterns, and performance across different practice settings. For AI scribes: what's the documentation accuracy? How often does it miss critical information? For diagnostic AI: what are real-world sensitivity and specificity rates across different hospitals? Maybe a tool really excels in one population while falling short in another - you won't know unless the numbers are broken out (see the sketch after this list for the kind of metrics I mean).
Transparent head-to-head comparisons. If there are five AI scribes available, show me which one performs best for gastroenterology documentation. Which integrates most smoothly with Epic? Which has the lowest error rate for complex cases? If there are any GI fellows out there reading this article - this is low-hanging fruit for a publishable research paper.
Ongoing post-market monitoring. If an AI tool starts degrading in accuracy after a software update, I want to know before I stake clinical decisions on it. Make performance tracking public and continuous.
Peer reviews that matter. Not "5 stars, easy to use" but structured assessments from other physicians: Does it actually save time or just shift work around? Does it perform equally well across different patient populations? What are the failure modes?
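To make the "real-world performance data" ask concrete, here's a minimal sketch of the kind of summary any vendor - or better, an independent evaluator - could publish from a validation set scored against a reference standard. The counts and the polyp-detection framing below are hypothetical; this is an illustration, not anyone's actual product data.

```python
# Illustrative only: the basic metrics physicians should be shown for any
# clinical AI tool, computed from a confusion matrix against a reference
# standard. The counts below are made up for demonstration.

def performance_summary(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return the core accuracy metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # how often real findings are caught
    specificity = tn / (tn + fp)   # how often normals are correctly called normal
    ppv = tp / (tp + fp)           # if it flags something, how often it's real
    npv = tn / (tn + fn)           # if it says "normal," how often that's true
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": round(sensitivity, 3),
        "specificity": round(specificity, 3),
        "ppv": round(ppv, 3),
        "npv": round(npv, 3),
        "accuracy": round(accuracy, 3),
    }

# Hypothetical example: a polyp-detection tool evaluated on 1,000 cases at
# one site. Real reporting should break this out by site and by population.
print(performance_summary(tp=180, fp=60, tn=730, fn=30))
```

The arithmetic is trivial; the point is that these numbers should be reported per site and per population, because a tool can look excellent in aggregate while underperforming in exactly the patients you see.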
Questions We Should Ask For Every New Product
On validation:
What was the training dataset? Does it reflect my patient population?
What are the actual accuracy metrics, not just marketing claims? If they can't be made available, that's a red flag.
Has this been tested in real-world settings or just curated datasets?
Are there peer-reviewed publications on this specific tool? (This is a must-have for any hardware or diagnostic device.)
If it doesn't require FDA approval, has the company published validation data anyway?
On real-world performance:
How does it handle unusual cases or edge scenarios?
What types of errors does it make, and how often?
Has performance been tested as the algorithm updates?
On workflow integration:
Does this actually fit my practice, or just the idealized demo version?
How much time will I spend checking the AI's work?
Can I easily spot when it makes mistakes?
On accountability:
If this contributes to patient harm, who's liable?
Can I see ongoing performance metrics after deployment?
Is there a transparent process for reporting problems?
How available/responsive is the company when issues or concerns arise?
Am I trapped in a contract, or am I free to leave if my expectations aren't met?
What Needs to Change
I want to be crystal clear - I'm not anti-AI. I've seen these tools genuinely improve efficiency and sometimes even accuracy. But we need to grow beyond the "move fast and break things" mentality when clinical care is involved.
Here's what I think needs to happen:
Create voluntary standards for unregulated AI. If workflow and documentation tools don't require FDA oversight, the industry should establish voluntary certification standards that become competitive advantages. Make quality a selling point. Think of it like LEED certification for energy efficiency: not required, but it tells the consumer that a defined quality standard is being met.
Demand real-world evidence for everything. Whether FDA-regulated or not, companies should publish post-market performance data. The tools that actually work in diverse settings should be able to prove it.
Build physician-led evaluation networks. Our medical societies and academic centers should collaborate on independent AI assessments. Create standardized testing. Share results openly. This needs to be transparent, though - institutions and societies should recuse themselves from testing when there are relationships with vendors.
Make marketplaces accountable. If Epic, Google, and Microsoft are building AI stores for clinicians, they should curate based on quality, not just commercial relationships. Feature tools with the best evidence. Flag (or flat-out remove!) those with concerning performance. If vendors risk removal from the marketplace for underperformance, they have a real incentive to keep quality consistently high.
Establish AI stewardship programs. Just like antibiotic stewardship. Protocols for evaluating tools before adoption. Ongoing monitoring. Regular reassessment of whether we're using the best options available. Think of this as a "stamp of approval".
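And to show what the "ongoing monitoring" piece of a stewardship program might look like in practice - it also speaks to the "updates change everything" problem from earlier - here's a minimal sketch: audit a sample of AI-generated notes each month, and flag the tool when its error rate drifts meaningfully above the baseline you measured at adoption. The audit workflow, sample sizes, and threshold are all assumptions on my part, not an existing product or standard.

```python
# Illustrative stewardship check: audit a random sample of AI-generated notes
# each month, count documentation errors, and flag the tool if the error rate
# drifts significantly above the baseline measured when the tool was adopted.
# Uses a simple two-proportion z-test; the numbers here are arbitrary examples.
from math import sqrt

def drift_alert(baseline_errors: int, baseline_notes: int,
                current_errors: int, current_notes: int,
                z_threshold: float = 2.0) -> bool:
    """Return True if the current error rate is meaningfully worse than baseline."""
    p1 = baseline_errors / baseline_notes
    p2 = current_errors / current_notes
    pooled = (baseline_errors + current_errors) / (baseline_notes + current_notes)
    se = sqrt(pooled * (1 - pooled) * (1 / baseline_notes + 1 / current_notes))
    if se == 0:
        return False
    z = (p2 - p1) / se
    return z > z_threshold  # one-sided: we only care about getting worse

# Hypothetical audit: 8 errors in 200 notes at adoption vs. 22 in 200 this month
if drift_alert(8, 200, 22, 200):
    print("Flag for re-review: documentation error rate has drifted upward.")
```

Run a check like this after every model update, not just at adoption - that's the antibiotic-stewardship analogy carried all the way through.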
Take Home Points:
That realization has me rethinking how I evaluate AI tools - and it has me convinced we all need to be asking more questions going forward.
So here's what's changing for me: before I add another AI tool to my workflow, I'm asking the hard questions. I'm looking for validation data whether it's required or not. I'm comparing alternatives. I'm treating AI tools with the same evidence-based scrutiny I apply to medications and procedures.
And here's my challenge to all of us: Let's not let convenience erode our standards.
We can have innovation and validation. We can have marketplaces and quality control. But it requires us—physicians, health systems, AI companies, and regulators—to commit to something better than what we have now.
Whether an AI app requires FDA approval or not, if it's influencing patient care, "good enough" isn't good enough.
The AI revolution is already here. We're living in a world where some tools face minimal oversight and others face none, and both are available for download with a credit card and a thirty-day trial.
The question is whether we're going to demand better—or just keep downloading apps and hoping they work.
How are you evaluating AI tools in your practice? As rigorously as you'd evaluate a new drug? Or more like a new app on your phone? I'm genuinely curious how others are navigating this.
Sources:
FDA AI/ML-Enabled Medical Devices: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices
_____________________________________________________________________________________________________
DISCLAIMER: Please note that this blog is intended for Informational Use only and is not intended to replace personal evaluation and treatment by a medical provider. The information provided on this website is not intended as a substitute for medical advice or treatment. Please consult your doctor for any information related to your personal care.














