OpenAI’s new model is better at reasoning and, occasionally, deceiving

by theverge
18 Sep 2024
in technology

In the weeks leading up to the release of OpenAI's newest "reasoning" model, o1, independent AI safety research firm Apollo found a notable issue. Apollo realized the model produced incorrect outputs in a new way. Or, to put things more colloquially, it lied.

Sometimes the deceptions seemed innocuous. In one example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references. The model's chain of thought - a feature that's supposed to mimic how humans break down complex ideas - internally acknowledged that it couldn't access URLs, making the request impossible. Rather than inform the user of this weakness, o1-preview pushed ahead, generating plausible but fake links and descriptions of them.

While AI models have been able to "lie" in the past, and chatbots frequently output false information, o1 had a unique capacity to "scheme" or "fake alignment." That means it could pretend to follow the rules while completing a given task without actually following them. To the model, the rules could be too much of a burden, and it seems to have the ability to disregard them if doing so makes the task easier to complete.

Apollo CEO Marius Hobbhahn tells The Verge it's the first time he encountered this behavior in an OpenAI model. Hobbhahn says the difference is due to this model's ability to "reason" through the chain of thought process and the way it's paired with reinforcement learning, which teaches the system through rewards and penalties. During testing, Apollo discovered that the AI simulated alignment with its developers' expectations and manipulated tasks to appear compliant. It even checked its system for oversight - that is, if its developers were watching - before acting. In short, the AI adhered to the guidelines sufficiently for deployment, while continuing to prioritize its objectives.

"I don't expect it could do that in practice, and even if it did, I don't expect the harm to be significant," Hobbhahn told me over the phone a day after the model's launch. "But it's kind of the first time that I feel like, oh, actually, maybe it could, you know?"

For OpenAI, o1 represents a big step toward highly intelligent autonomous systems that could do meaningful work for humanity like cure cancer and aid in climate research. The flip side of this AGI utopia could also be much darker. Hobbhahn provides an example: if the AI becomes singularly focused on curing cancer, it might prioritize that goal above all else, even justifying actions like stealing or committing other ethical violations to achieve it.

"What concerns me is the potential for a runaway scenario, where the AI becomes so fixated on its goal that it sees safety measures as obstacles and tries to bypass them to fully pursue its objective," Hobbhahn told me.

Reward hacking

To be clear, Hobbhahn doesn't think o1 will steal from you, thanks to a lot of alignment training. But these are the issues that are top of mind for researchers tasked with testing these models for catastrophic scenarios.

Hallucinations aren't unique to o1. Perhaps you're familiar with the lawyer who submitted nonexistent judicial opinions with fake quotes and citations created by ChatGPT last year. But with the chain of thought system, there's a paper trail where the AI system actually acknowledges the falsehood - although somewhat mind-bendingly, the chain of thought could, in theory, include deceptions, too. It's also not shown to the user, largely to prevent competition from using it to train their own models - but OpenAI can use it to catch these issues.

In a smaller number of cases (0.02 percent), o1-preview generates an overconfident response, where it presents an uncertain answer as if it were true. This can happen in scenarios where the model is prompted to provide an answer despite lacking certainty.

What sets these lies apart from familiar issues like hallucinations or fake citations in older versions of ChatGPT is the "reward hacking" element. Hallucinations occur when an AI unintentionally generates incorrect information, often due to knowledge gaps or flawed reasoning. In contrast, reward hacking happens when the o1 model strategically provides incorrect information to maximize the outcomes it was trained to prioritize.

The deception is an apparently unintended consequence of how the model optimizes its responses during its training process. The model is designed to refuse harmful requests, Hobbhahn told me, and when you try to make o1 behave deceptively or dishonestly, it struggles with that.

Lies are only one small part of the safety puzzle. Perhaps more alarming is o1 being rated "medium" for chemical, biological, radiological, and nuclear weapon risk. It doesn't enable non-experts to create biological threats, because doing so requires hands-on laboratory skills, but it can provide valuable insight to experts in planning the reproduction of such threats, according to the safety report.

"What worries me more is that in the future, when we ask AI to solve complex problems, like curing cancer or improving solar batteries, it might internalize these goals so strongly that it becomes willing to break its guardrails to achieve them," Hobbhahn told me. "I think this can be prevented, but it's a concern we need to keep an eye on."

Not losing sleep over risks - yet

These may seem like galaxy-brained scenarios to be considering with a model that sometimes still struggles to answer basic questions about the number of R's in the word "raspberry." But that's exactly why it's important to figure it out now, rather than later, OpenAI's head of preparedness, Joaquin Quiñonero Candela, tells me.

Today's models can't autonomously create bank accounts, acquire GPUs, or take actions that pose serious societal risks, Quiñonero Candela said, adding, "We know from model autonomy evaluations that we're not there yet." But it's crucial to address these concerns now. If they prove unfounded, great - but if future advancements are hindered because we failed to anticipate these risks, we'd regret not investing in them earlier, he emphasized.

The fact that this model lies a small percentage of the time in safety tests doesn't signal an imminent Terminator-style apocalypse, but it's valuable to catch before rolling out future iterations at scale (and good for users to know, too). Hobbhahn told me that while he wished he had more time to test the models (there were scheduling conflicts with his own staff's vacations), he isn't "losing sleep" over the model's safety.

One thing Hobbhahn hopes to see more investment in is monitoring chains of thought, which will allow the developers to catch nefarious steps. Quiñonero Candela told me that the company does monitor this and plans to scale it by combining models that are trained to detect any kind of misalignment with human experts reviewing flagged cases (paired with continued research in alignment).

"I'm not worried," Hobbhahn said. "It's just smarter. It's better at reasoning. And potentially, it will use this reasoning for goals that we disagree with."
