AI is ready to take over Python programming, but not much else – Computerworld

They said that the benchmark contains 310 work environments across 52 professional domains including coding, crystallography, genealogy and music sheet notation. Each environment consists of real documents totaling around 15K tokens in length, and five to 10 complex editing tasks that a user might ask an LLM to perform.

In the paper’s abstract, they stated: “Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.”

Those mistakes are significant, they said. “The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%.”

Benchmark exercise receives a thumbs up

Brian Jackson, principal research director at Info-Tech Research Group, found the results very interesting. “Putting a list of LLMs to the test across different work domains yields a lot of useful insights,” he said. “I think this type of benchmark exercise could be helpful to enterprise developers who are looking to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved.”
