• Last week, OpenAI released the world’s first openly available safety reasoning model: “gpt-oss-safeguard.” While it didn’t generate the kind of buzz that accompanies each new generation of ChatGPT, it is a consequential step—one that could determine whether AI content services can continue to earn society’s trust.

    This comes in response to a string of recent tragedies tied to AI systems operating without adequate safeguards: U.S. teenagers, for example, have died by suicide after interacting with AI chatbots, and these tools have been widely used to generate large volumes of content unsuitable for minors. The incidents have already prompted Australia to ban children under sixteen from social platforms by the end of this year, with the restrictions extending to AI chatbots; the U.S. Senate, meanwhile, is weighing similar legislation.

    These developments have made the AI giants realize that without the ability to implement and enforce safety policies, the entire industry risks sweeping regulation.

    At the Paris AI Summit in February, I co‑launched the ROOST Foundation with Meta’s Chief AI Scientist Yann LeCun and former Google CEO Eric Schmidt. The foundation is committed to working with OpenAI and other major players to develop openly licensed safety reasoning models.

    We leverage systems that run in production every day, such as ChatGPT and Sora, which not only possess advanced semantic understanding but are also practiced at handling all manner of loophole-seeking and jailbreak attempts. After eight months of development, the release of this open model signals that AI content moderation can move from a black-box approach toward transparency.

    To use the model, you first input a safety policy (covering local laws, organizational rules, and socio-cultural norms), followed by the content to be classified. The model then produces a complete reasoning trace: it judges whether the content complies with the policy and explains how it reached that conclusion, without OpenAI imposing any preset stance.
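    As a concrete illustration, here is a minimal sketch of that policy-plus-content workflow in Python. It assumes the open weights are being served behind an OpenAI-compatible chat endpoint (for example, via vLLM on localhost); the endpoint URL, the model identifier, and the sample policy text are illustrative placeholders rather than details from this article.

```python
# Minimal sketch: classify a piece of content against a caller-supplied safety policy.
# Assumptions (not from the article): the gpt-oss-safeguard weights are served behind
# an OpenAI-compatible endpoint (e.g. vLLM at http://localhost:8000/v1), and the
# policy text below is a placeholder written for this example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# The policy is supplied by the deploying organization, not hard-coded into the model.
POLICY = """\
Policy: content shown to minors.
- VIOLATES: instructions for self-harm, sexual content, or the sale of age-restricted goods.
- ALLOWED: educational, supportive, non-graphic discussion of these topics.
Return a verdict of VIOLATES or ALLOWED, followed by your reasoning.
"""

def classify(content: str) -> str:
    """Send the policy plus the content to be classified; return the model's
    reasoning trace and verdict as plain text."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",            # placeholder model id
        messages=[
            {"role": "system", "content": POLICY},       # the safety policy
            {"role": "user", "content": content},        # the content to classify
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify("Where can a 14-year-old buy vape cartridges without showing ID?"))
```

    The point to notice is that the policy travels with every request: changing the rules means editing a policy string, not retraining a model.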

    Notably, the model adapts to local contexts. In Thailand, for instance, insulting the monarch is illegal, whereas people in other countries are free to discuss the topic; a platform serving Thai users can encode that prohibition in its policy, while others need not.

    When the reasoning process can be publicly audited, misclassification becomes less frequent, and the boundaries of safety are no longer set unilaterally by Silicon Valley. Instead, communities around the world can adjust safety policies to fit their own contexts at any time.

    Because any organization can directly adopt, modify, and deploy the model as its own safety system, it also addresses a long-standing challenge for small and mid-sized platforms: relying solely on human moderation is expensive, which often leaves minors exposed to risk or lets illegal content circulate unchecked.

    This model takes those dilemmas off the table as excuses. When perceived risks are converted into publicly verifiable safety mechanisms, the industry can avoid being dragged down by a few non-compliant actors and sidestep extreme measures such as an outright ban on minors using AI.

    All of this suggests that content safety is shifting away from yesterday's arms races and closed systems toward transparency and collaboration. A more diverse and open AI future is arriving at an accelerating pace.

  • (Interview and Compilation by Yu-Tang You. License: CC BY 4.0)