cross-posted from: https://programming.dev/post/36289727

Comments

Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities. When forced to reason beyond familiar answer patterns, all models show declines in accuracy, challenging claims that artificial intelligence is ready for autonomous clinical deployment.

A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results indicate that these systems are more brittle than their benchmark scores suggest.
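
To make that comparison concrete, here is a minimal sketch of how such a robustness check might be run on a multiple-choice benchmark. The query_model callable is a hypothetical stand-in for whatever LLM API is under test, and the disruption shown (replacing the correct option's text with "None of the other answers") is one plausible perturbation, not necessarily the exact one used in the study.

```python
# Sketch of a pattern-disruption robustness check for multiple-choice QA.
# Assumptions: each question is a dict with "stem", "options" (letter -> text),
# and "answer" (the correct letter); query_model(question) returns a letter.
from typing import Callable

Question = dict

def disrupt(question: Question) -> Question:
    """Replace the correct option's text so memorized answer patterns no longer apply."""
    options = dict(question["options"])
    options[question["answer"]] = "None of the other answers"
    return {**question, "options": options}

def accuracy(questions: list[Question], query_model: Callable[[Question], str]) -> float:
    correct = sum(query_model(q) == q["answer"] for q in questions)
    return correct / len(questions)

def robustness_gap(questions: list[Question], query_model: Callable[[Question], str]) -> float:
    """Accuracy on the original items minus accuracy on the disrupted items."""
    original = accuracy(questions, query_model)
    disrupted = accuracy([disrupt(q) for q in questions], query_model)
    return original - disrupted
```

Run against a held-out question set, a gap on the order of 0.80 - 0.42 = 0.38 is the kind of drop described above.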

  • panda_abyss@lemmy.ca · 7 days ago

    I agree, but being able to reconstruct content like that does reflect some intelligence.

    That being said, when you have no way of telling what is in-sample vs out-of-sample, and which outputs are correct versus just convincing gibberish, you should never rely on these tools.

    The only time I really find them useful is with tools and RAG, where they can filter piles of content and then route me to useful parts faster.
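
    A minimal sketch of that filter-then-route pattern (the retrieval/routing half, not the tool-calling half), assuming a hypothetical embed function backed by any sentence-embedding model and chunks that carry "text" and "source" fields:

    ```python
    # Sketch of the "filter piles of content, then route to the useful parts" pattern.
    # Assumptions: `embed` returns a numpy vector for a string; each chunk is a
    # dict with "text" and "source" fields.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def route(query: str, chunks: list[dict], embed, top_k: int = 5) -> list[dict]:
        """Return pointers to the top-k chunks most similar to the query."""
        q_vec = embed(query)
        scored = [(cosine(q_vec, embed(c["text"])), c) for c in chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Hand back citations rather than a generated answer, so the reader
        # can go verify the underlying material directly.
        return [{"source": c["source"], "score": round(s, 3)} for s, c in scored[:top_k]]
    ```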

          • panda_abyss@lemmy.ca · 7 days ago

            That’s not an accurate characterization.

            There are LLMs trained on brute-forced sets of lemmas that can then predict new ones, and there are “regular” models evaluated on math that can produce new theorems from prompting plus their latent parameters.