cross-posted from: https://programming.dev/post/36289727
Comments
Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities. When forced to reason beyond familiar answer patterns, all models show declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment.
A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results indicate that these systems are more brittle than their benchmark scores suggest.
I think we mostly agree, but saying “anything related to intelligence, skills, and knowledge” is too broad.
It’s not a popular idea here on Lemmy, but IMO gen AI can save some time in some circumstances.
Problems arise when one relies on it for reasoning, and it can be very difficult to know when you’re doing that.
For example, I’m a consultant in a specialised, heavily regulated area of law. I’ve been doing this for 20 years.
Gen AI can convert my verbal notes into a pretty impressive written file note ready for presentation to a client or third party.
However, it’s critically important to know when it omits something, which it often does.
In summary, in situations like this gen AI can save me several hours of drafting a complex document.
However, a layperson could explain a problem and gen AI could produce a convincing analysis, but the layperson wouldn’t know what had been omitted or overlooked.
If I don’t know anything about nutrition and ask for a meal plan tailored to my needs, I have no way to evaluate whether something has been overlooked.