The biggest AI mistake: Pretending guardrails will ever protect you

News Room
Last updated: December 15, 2025 7:22 am

The fact that the guardrails from all the major AI players can be easily bypassed is hardly news. The mostly unaddressed problem is what enterprise IT leaders need to do about it.

Once IT decision-makers accept that guardrails don’t consistently protect anything, the presumptions they make about AI projects are mostly rendered moot. Other techniques to protect data must be implemented.

The reports of guardrail bypasses are becoming legion: poetry disables protections, as does leveraging chat history, inserting invisible characters, and using hexadecimal encoding and emojis. Beyond those, patience and playing the long game, among other tactics, can wreak havoc, affecting just about every generative AI (genAI) and agentic model.

The risks are hardly limited to what attackers can accomplish. The models themselves have shown a willingness to disregard their own protections when they see those protections as an impediment to accomplishing an objective, as Anthropic has confirmed.

To extend the road analogy that gives a guardrail its name: these “guardrails” are not guardrails in the physical, concrete-barrier sense. They are not even strong deterrents, in the speed-bump sense. They are more akin to a single broken yellow line: a weak suggestion, with no enforcement or even serious discouragement.

If I may borrow a line from popular social media video blogger Ryan George in his writer-vs.-producer movie pitch series, an attacker wanting to get around today’s guardrails will find it “super easy, barely an inconvenience.” It’s as if homeowners protected their homes by putting “Do Not Enter” signs on every door while leaving the windows open and the doors unlocked.

So, what should an AI project look like once we accept that guardrails won’t force a model or agent to do what it’s told?

IT has a few options. First, wall off either the model/agent or the data you want to protect. 

“Stop granting AI systems permissions you wouldn’t grant humans without oversight,” said Yvette Schmitter, CEO of the Fusion Collective consulting firm. “Implement the same audit points, approval workflows, and accountability structures for algorithmic decisions that you require for human decisions. Knowing guardrails can’t be relied on means designing systems where failure is visible. You wouldn’t let a hallucinating employee make 10,000 consequential decisions per hour with no supervision. Stop letting your AI systems do exactly that.”
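To make that concrete, here is a minimal sketch of what Schmitter’s audit points and approval workflows could look like in code. Everything in it — the action names, the risk list, the approver callback — is a hypothetical illustration, not anyone’s production design:

```python
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("agent-audit")

# Actions the agent may take on its own; everything else is escalated.
LOW_RISK_ACTIONS = {"read_public_doc", "summarize", "draft_reply"}

@dataclass
class ProposedAction:
    agent_id: str
    action: str    # e.g. "send_email", "update_record"
    payload: dict  # arguments the agent wants to pass

def requires_human_approval(action: ProposedAction) -> bool:
    # Deny by default: anything not explicitly low-risk needs sign-off.
    return action.action not in LOW_RISK_ACTIONS

def execute_with_oversight(action: ProposedAction, approver=None) -> dict:
    # Every proposed action lands in the audit trail, approved or not.
    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": action.agent_id,
        "action": action.action,
        "payload": action.payload,
    }))
    if requires_human_approval(action):
        if approver is None or not approver(action):
            return {"status": "pending_review"}
    # Only now is the action actually dispatched (dispatch omitted here).
    return {"status": "executed"}
```

The specific checks matter less than where they live: the gate and the audit trail sit outside the model, where no prompt can rewrite them.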

Gary Longsine, CEO at IllumineX, agreed. He argued that the same defenses enterprises use to block employees from unauthorized data access now need to be deployed against genAI models and AI agents. “The only real thing that you can do is secure everything that exists outside of the LLM,” Longsine said.

Taken to its extreme, that might mean keeping a genAI model in an isolated environment and feeding it only the data you want it to access. It’s not quite an air-gapped server, but it’s close: the model can’t be tricked into revealing data it never had access to in the first place.
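In practice, that isolation is enforced at ingestion: the model is only ever handed documents from an explicit allow-list, so even a successful jailbreak has nothing extra to surface. A minimal sketch, with hypothetical source names and a toy in-memory store standing in for the real document system:

```python
# Toy stand-in for the enterprise document store (hypothetical names).
DOCUMENT_STORE = {
    "dealer_inventory": ["2024 sedan, 12 in stock"],
    "published_pricing": ["Base trim MSRP: $28,500"],
    "hr_records": ["<sensitive: must never reach the model>"],
}

# Only these sources may ever be placed in the model's context window.
ALLOWED_SOURCES = {"dealer_inventory", "published_pricing"}

def load_context(requested_sources: list[str]) -> list[str]:
    """Collect context documents, dropping anything not allow-listed.

    Data the model never receives is data no prompt can exfiltrate.
    """
    docs: list[str] = []
    for source in requested_sources:
        if source not in ALLOWED_SOURCES:
            continue  # never fetched, so never in the context window
        docs.extend(DOCUMENT_STORE.get(source, []))
    return docs

# Even if an attacker's prompt asks for "hr_records", it is never loaded.
print(load_context(["dealer_inventory", "hr_records"]))
```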

Capital One toyed with something similar: it created genAI systems for auto dealerships but gave the large language model (LLM) it used access only to public data. The company also pushed open-source models and avoided hyperscalers, which addressed another guardrail issue: when agents are actively managed by a third-party firm in a cloud environment, your rules don’t necessarily get obeyed. Taking back control might mean literally doing that.

Longsine said some companies could cooperate to build their own data center, but that effort would be ambitious and costly. (He put the price tag at $2 billion, though it could easily cost far more, and it might not even meaningfully address the problem.)

Let’s say five enterprises built a data center that only those five could access. Who would set the rules? And how much would any one of those companies trust the other four, especially when management changes? The companies might wind up replacing a hyperscaler with a much smaller makeshift hyperscaler, and still have the same control problems.

Here’s the painful part: there are many genAI proofs of concept out there today that simply won’t work if management stops believing in guardrails. At the board level, it seems, the Tinkerbell strategy remains alive and well: boards seem to think guardrails will work if only everyone claps their hands loudly enough.

Consider an AI deployment that lets employees query HR information and is supposed to show each employee or manager only the records they are entitled to see. But those apps, and countless others just like them, take the easy coding approach: they grant the model access to all HR data and rely on guardrails to enforce proper access. That won’t work with AI.
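The sturdier pattern is to resolve what the authenticated requester is entitled to see before anything reaches the model, and to pass only that slice into the prompt. A hedged sketch with toy data (a real system would pull entitlements from the existing HR database and identity service):

```python
# Toy HR records and reporting lines; hypothetical data for illustration.
HR_RECORDS = {
    "alice": {"salary": 95000, "manager": "carol"},
    "bob":   {"salary": 88000, "manager": "carol"},
}
DIRECT_REPORTS = {"carol": {"alice", "bob"}}

def records_visible_to(requester: str) -> dict:
    """Return only the HR records this requester is entitled to see.

    The model never receives the full table, so guardrails are not the
    only thing standing between an employee and a colleague's salary.
    """
    visible = {}
    if requester in HR_RECORDS:
        visible[requester] = HR_RECORDS[requester]   # own record
    for report in DIRECT_REPORTS.get(requester, set()):
        visible[report] = HR_RECORDS[report]         # direct reports only
    return visible

def build_prompt(requester: str, question: str) -> str:
    # Only the pre-filtered slice ever enters the context window.
    return f"HR data: {records_visible_to(requester)}\nQuestion: {question}"

# Bob asks about Alice; her record was never placed in the prompt.
print(build_prompt("bob", "What is Alice's salary?"))
```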

I’m not saying guardrails will never work. On the contrary, my observations suggest they do, roughly 70% to 80% of the time. In some better-designed rollouts, that figure might hit 90%.

But that’s the ceiling. And when it comes to protecting data access — especially potential exfiltration to anyone who asks the right prompt — 90% won’t suffice. And IT leaders who sign off on projects hoping that will do are in for a very uncomfortable 2026. 
