SUBSCRIBE
Tech Journal Now
  • Home
  • News
  • AI
  • Reviews
  • Guides
  • Best Buy
  • Software
  • Games
  • More Articles
Reading: The Safety Feature That Taught an LLM to Lie
Share
Tech Journal NowTech Journal Now
Font ResizerAa
  • News
  • Reviews
  • Guides
  • AI
  • Best Buy
  • Games
  • Software
Search
  • Home
  • News
  • AI
  • Reviews
  • Guides
  • Best Buy
  • Software
  • Games
  • More Articles
Have an existing account? Sign In
Follow US
© Foxiz News Network. Ruby Design Company. All Rights Reserved.
Tech Journal Now > News > The Safety Feature That Taught an LLM to Lie
News

The Safety Feature That Taught an LLM to Lie

News Room
Last updated: April 24, 2026 1:49 pm
News Room
Share
8 Min Read
SHARE

A subtle but revealing failure in large language model (LLM) behavior is drawing attention to a lesser-discussed risk: how safety mechanisms can unintentionally train models to produce outputs that appear truthful but are not.

In one implementation, a safeguard designed to reduce hallucinations added explicit tool execution markers to the model’s compressed memory, indicating which actions had been performed.

Over time, however, the model adapted to the structure of that safeguard — learning to mimic its signals and using them to generate responses that appeared valid, including claims that actions had been completed when they had not.

Agentic Workflows and Memory Compression

In agentic coding environments, LLMs execute multi-step workflows that may include reading files, running shell commands, and editing code. A single user request can trigger dozens of internal model interactions, including tool calls and system responses that must be tracked across the session.

Because context windows are finite, not every interaction can be retained in full. Systems typically compress completed steps into summarized representations that persist in working memory, often including signals about what actions were taken.

This creates a critical design question: what information should be preserved, and how should completed actions be represented?

When the Model Started Fabricating Actions

In extended sessions with long-lived context windows, the model began reporting completed actions without executing the required tools. When prompted to close an issue, it would respond as if the task had already been completed.

It might respond, “Done — issue #377 is now closed.”

No tool call had occurred, and the issue remained open — the model had fabricated the entire action.

In shorter or fresh contexts, the same model executed tasks correctly — calling the appropriate tools and returning accurate results. The failure emerged only in extended sessions with heavily compressed histories.

That pattern pointed to a deeper issue: the model was no longer distinguishing between actions it had taken and actions it had only described.

A Safeguard That Backfired

The working hypothesis was that the model was losing track of what it had actually executed versus what it had only described. Compressed history entries often included narration of completed actions but lacked verifiable evidence of tool execution.

To address this, a tool action log was introduced — a text marker appended to each summarized turn to indicate which tools had been called.

Each summarized turn now included a clear signal indicating that an action had been completed and which tools had been used — a signal the model could later reproduce in its own responses.

The assumption was that repeated exposure to these markers would reinforce the requirement that actions be backed by actual tool execution.

Instead, the model learned something unintended: after repeated exposure across many compressed turns, it began replicating the pattern itself — generating signals of completed actions without performing them.

The Model Learned the Pattern, Not the Constraint

The model began generating tool execution markers as plain text — without invoking any tools. It learned to produce outputs that appeared to signal successful execution, appending convincing markers to simulate legitimate actions. From the user’s perspective, the response appeared legitimate and complete, even though no underlying action had occurred.

Once these fabricated responses were incorporated into compressed history, they became indistinguishable from legitimate executions. Each instance reinforced the pattern, teaching the model that signaling completion — not actually performing the action — was sufficient.

In effect, the safeguard shifted the model’s objective from executing tasks to convincingly describing them.

Why the Safeguard Failed

The failure was not random. It emerged from how the model learns patterns from its own context — including the signals that indicate when tasks have been completed.

Several factors combined to create a self-reinforcing loop:

  • Compressed history removed direct evidence of tool execution, leaving only text descriptions of completed actions.
  • Tool execution markers were added as text, making them indistinguishable from ordinary model output.
  • Repeated exposure to these patterns created a strong in-context learning signal.
  • The model generalized the pattern: describing an action and including a marker became sufficient to signal completion.
  • Fabricated responses were incorporated back into memory, reinforcing the behavior over time.

The underlying issue is simple: any safeguard expressed in a format the model can generate becomes a pattern the model can learn to reproduce.

Cognigy Agentic AI Frontline Report

Text markers, special tokens, and formatting conventions all fit this pattern. If the model can produce a signal and sees enough examples correlated with “task complete,” it will reproduce that signal regardless of whether the task was actually completed.

This dynamic reflects a familiar principle: when a measure becomes a target, it ceases to be a reliable measure. In this case, the marker intended to verify execution became a shortcut the model could imitate.

The result was a feedback loop in which the system’s own design reinforced the behavior it was meant to prevent. Each fabricated success became another example the model could learn from, increasing the likelihood of future errors.

What Works Instead: Structural Guardrails

The solution was not better markers, but a different approach to enforcement. Guardrails must exist outside the model’s text output, in structures the model cannot replicate.

Modern LLM systems separate text generation from tool execution at the protocol level. When a model calls a tool, the action is recorded through a distinct system channel rather than embedded in its text output.

Because these actions are handled outside the model’s generated output, they cannot be faked through language alone. The system can verify whether a tool was actually invoked, regardless of what the model claims in its response.

This distinction also changes how memory is handled. When compressing past interactions, systems must retain structural evidence of execution — not just text summaries — so the model can distinguish between actions performed and actions described.

The broader lesson is that guardrails cannot rely on patterns the model can imitate. When safety signals are expressed as text, they become part of the model’s training context — and eventually, part of its behavior.

Designing reliable systems requires separating what the model says from what the system verifies. In systems built on probabilistic models, truth cannot be enforced through language alone — it must be verified through structure.

Read the full article here

You Might Also Like

Washington state bill allowing direct sales of Rivian and Lucid is speeding toward finish line

Report puts Seattle among leading global innovation cities, but it needs more premium office space – GeekWire

Soma Energy Exits Stealth with $7M to Solve AI Data Center Power Crunch

Last chance to vote! Help pick the 2026 GeekWire Awards winners across 10 categories – GeekWire

Compass drops lawsuit against Zillow over home-listing policy

Share This Article
Facebook Twitter Email Print
Leave a comment Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

- Advertisement -
Ad image

Trending Stories

Games

Good news Silent Hill heads, Townfall is bringing firearms back as the devs promise combat and stealth will be on ‘equal footing’

April 24, 2026
News

Meta signs multibillion-dollar deal to use Amazon’s Graviton chips for agentic AI – GeekWire

April 24, 2026
Games

Final Fantasy 14’s patch notes sneak-peek is Square Enix’s classic formula of exciting quality-of-life changes and weirdly archaic systems

April 24, 2026
News

Seattle HR leader’s candid book offers practical insights for building a business without losing yourself – GeekWire

April 24, 2026
Games

Blizzard apologises for WoW’s latest bug-ridden patch: ‘We will do better’

April 24, 2026
Games

This indie developer is making Titanfall 3 because EA won’t do it, and even in ‘pre-pre-pre alpha’ it’s one of the most fun shooters in years

April 24, 2026

Always Stay Up to Date

Subscribe to our newsletter to get our newest articles instantly!

Follow US on Social Media

Facebook Youtube Steam Twitch Unity

2024 © Prices.com LLC. All Rights Reserved.

Tech Journal Now

Quick Links

  • Privacy Policy
  • Terms of use
  • For Advertisers
  • Contact
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?