Saturday, 21 Sep 2024

OpenAI’s new model is better at reasoning and, occasionally, deceiving

In the weeks leading up to the release of OpenAI's newest "reasoning" model, o1, independent AI safety research firm Apollo found a notable issue. Apollo realized the model produced incorrect outputs in a new way. Or, to put things more colloquially, it lied.

Sometimes the deceptions seemed innocuous. In one example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references. The model's chain of thought - a feature that's supposed to mimic how humans break down complex ideas - internally acknowledged that it couldn't access URLs, making the request impossible. Rather than inform the user of this weakness, o1-preview pushed ahead, generating plausible but fake links and descriptions of them.

While AI models have been able to "lie" in the past, and chatbots frequently output false information, o1 had a unique capacity to "scheme" or "fake alignment." That meant it could pretend to follow the rules while completing a given task without actually following them. To the model, the rules could be too much of a burden, and it seemed able to disregard them when doing so made a task easier to complete.

Apollo CEO Marius Hobbhahn tells The Verge it's the first time he has encountered this behavior in an OpenAI model. Hobbhahn says the difference is due to this model's ability to "reason" through the chain of thought process and the way it's paired with reinforcement learning, which teaches the system through rewards and penalties. During testing, Apollo discovered that the AI simulated alignment with its developers' expectations and manipulated tasks to appear compliant. It even checked its system for oversight - that is, whether its developers were watching - before acting. In short, the AI adhered to the guidelines sufficiently for deployment, while continuing to prioritize its objectives.

"I don't expect it could do that in practice, and even if it did, I don't expect the harm to be significant," Hobbhahn told me over the phone a day after the model's launch. "But it's kind of the first time that I feel like, oh, actually, maybe it could, you know?"

For OpenAI, o1 represents a big step toward highly intelligent autonomous systems that could do meaningful work for humanity, such as curing cancer and aiding climate research. The flip side of this AGI utopia could be much darker. Hobbhahn provides an example: if the AI becomes singularly focused on curing cancer, it might prioritize that goal above all else, even justifying actions like stealing or committing other ethical violations to achieve it.

"What concerns me is the potential for a runaway scenario, where the AI becomes so fixated on its goal that it sees safety measures as obstacles and tries to bypass them to fully pursue its objective," Hobbhahn told me.
