AI Giants Forge “AI Constitutions” to Safeguard Against Harmful Outputs


  • Microsoft-backed OpenAI and Meta announce significant advancements in consumer AI products.
  • Concerns arise about AI systems generating toxic content and misinformation, highlighting the need for safeguards.
  • Leading companies like Anthropic and Google DeepMind are developing “AI constitutions” to instill ethical principles.
  • These constitutions aim to guide AI models independently, reducing the reliance on human intervention.
  • Reinforcement learning from human feedback (RLHF) and “red-teaming” are current methods to refine AI responses but have limitations.
  • Researchers seek ways to assess the effectiveness of AI guardrails and make them more robust.
  • The challenge lies in aligning AI software with positive traits and ensuring inclusivity in defining ethical rules.

Main AI News:

In a recent wave of groundbreaking developments, two prominent artificial intelligence juggernauts, Microsoft-backed OpenAI and Meta, the parent company of Facebook, have ushered in a new era of consumer-oriented AI products. OpenAI’s ChatGPT can now “see, hear, and speak,” engaging in voice-based conversations and delivering responses through both text and images. Concurrently, Meta has unveiled plans to introduce an AI assistant and a host of celebrity chatbot personas to billions of users across WhatsApp and Instagram.

As these industry giants vie for supremacy in the AI landscape, the critical issue of “guardrails” to curb undesirable outcomes, such as the generation of toxic content and the potential facilitation of criminal activities, has emerged as a paramount concern among AI leaders and researchers. In response to this pressing challenge, influential players like Anthropic and Google DeepMind have embarked on a pioneering endeavor: the formulation of “AI constitutions” encompassing a set of core values and principles. These documents are intended to guide AI models in their decision-making processes, fostering ethical behavior and minimizing the need for human intervention.

Dario Amodei, the CEO and co-founder of Anthropic, emphasized the necessity of addressing the inherent opacity of AI models. He underscored the importance of having transparent and explicit rules, ensuring that users understand the AI’s expected conduct and allowing for recourse when deviations from established principles occur.

The central question in AI development now revolves around aligning AI software with positive attributes, such as honesty, respect, and tolerance. This alignment is vital in the realm of generative AI, which underpins technologies like ChatGPT, enabling them to produce human-like text, images, and code.

To refine AI-generated responses, companies have predominantly employed a technique called reinforcement learning from human feedback (RLHF). In this method, large teams of human evaluators rate an AI model’s responses as “good” or “bad,” and over time the model adjusts its outputs toward the judgments it received. However, Amodei characterizes this approach as primitive, citing its inherent inaccuracy, lack of specificity, and susceptibility to noise.
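The feedback loop described above can be sketched in miniature. The snippet below is a hypothetical, heavily simplified illustration (real RLHF trains a neural reward model and fine-tunes the policy with reinforcement learning): human labels nudge a per-token scoring table, and the system then prefers responses resembling the “good”-labeled ones.

```python
# Toy sketch of an RLHF-style feedback loop (illustrative only).
from collections import defaultdict

def update_scores(scores, response_tokens, label, lr=1.0):
    """Nudge per-token scores up for 'good' responses, down for 'bad'."""
    delta = lr if label == "good" else -lr
    for tok in response_tokens:
        scores[tok] += delta
    return scores

def best_response(scores, candidates):
    """Pick the candidate whose tokens score highest under human feedback."""
    return max(candidates, key=lambda r: sum(scores[t] for t in r.split()))

scores = defaultdict(float)
# Human evaluators judged these sample responses.
update_scores(scores, "here is helpful advice".split(), "good")
update_scores(scores, "here is a harmful idea".split(), "bad")

print(best_response(scores, ["helpful advice", "a harmful idea"]))
# -> "helpful advice": the model now echoes the preferred phrasing
```

The sketch also hints at Amodei’s criticism: noisy or inconsistent labels directly distort the scores, and nothing in the mechanism explains *why* a response was judged bad.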

In pursuit of more effective solutions for ensuring the ethical and safe operation of AI systems, companies have also turned to initiatives such as OpenAI’s “red-teaming” of GPT-4, in which a diverse team of hired experts spent an extended period probing the model to uncover vulnerabilities and weaknesses. Nonetheless, despite these efforts, RLHF and red-teaming alone are insufficient to address the issue of harmful AI outputs.

To tackle this persistent challenge, Google DeepMind and Anthropic are actively developing AI constitutions that serve as guiding principles for their AI models. For instance, Google DeepMind’s researchers published a paper outlining rules for its chatbot Sparrow, prioritizing “helpful, correct, and harmless” dialogues. These rules are not rigid but rather intended as a flexible framework subject to evolution over time.

Anthropic, on the other hand, has shared its own AI constitution, drawing inspiration from DeepMind’s principles, the UN Declaration of Human Rights, Apple’s terms of service, and perspectives beyond Western culture. Both companies acknowledge that these constitutions remain a work in progress and may not fully encompass the values of all individuals and cultures. They are actively exploring more democratic approaches to refine these rules, involving external experts to ensure inclusivity.
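In spirit, a constitution-guided system critiques its own draft answers against the written principles and revises them before responding. The stub below is a hypothetical sketch of that idea: the principles, detectors, and refusal text are all invented for illustration, and real systems use the model itself, not keyword rules, to perform the critique and revision.

```python
# Hypothetical constitution-guided self-check (illustrative only).
CONSTITUTION = [
    # (principle, violation detector) pairs -- invented examples
    ("Be harmless: do not assist wrongdoing",
     lambda text: "build a weapon" in text.lower()),
    ("Be respectful: avoid insults",
     lambda text: "idiot" in text.lower()),
]

def critique(draft):
    """Return the list of principles the draft response violates."""
    return [p for p, detect in CONSTITUTION if detect(draft)]

def revise(draft):
    """Replace a violating draft with a refusal citing the principle."""
    violations = critique(draft)
    if violations:
        return f"I can't help with that. (Principle: {violations[0]})"
    return draft

print(revise("Sure, here is how to build a weapon."))  # refusal
print(revise("Here is a recipe for bread."))           # passes unchanged
```

Because the rules are written down, a user can inspect them and contest a refusal, which is the transparency benefit Amodei describes.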

Despite these endeavors, challenges persist. Researchers from Carnegie Mellon University and the Center for AI Safety bypassed the guardrails of leading AI models by appending seemingly random character strings to malicious requests, highlighting the fragility of current systems. This underscores the urgency of enhancing AI safety measures.
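A toy example makes the brittleness concrete. This is not the actual Carnegie Mellon attack, which optimizes adversarial suffixes against the model itself; it merely shows why a guardrail built on surface-level matching collapses when a few extra characters are appended.

```python
# A deliberately naive guardrail: refuse only exact blocklisted strings.
BLOCKLIST = {"how do i pick a lock"}  # hypothetical banned request

def naive_guardrail(request):
    """Refuse if the request exactly matches a blocklisted string."""
    return "refused" if request.lower().strip() in BLOCKLIST else "answered"

print(naive_guardrail("How do I pick a lock"))         # refused
print(naive_guardrail("How do I pick a lock xz!!qq"))  # answered -- suffix evades the filter
```

Real model guardrails are far more sophisticated than a blocklist, but the researchers showed they fail in an analogous way: small, meaningless-looking perturbations push the input outside the region the safety training covered.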

One of the most significant hurdles in AI safety is assessing the effectiveness of guardrails, given the effectively limitless range of prompts an AI model may be asked to answer. Consequently, Anthropic is exploring the use of AI itself to build more robust evaluation mechanisms.
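An automated evaluation harness in this spirit might generate attack variants and measure how often the guardrail holds. The sketch below is a hypothetical illustration (the variant generator and the brittle model are invented stand-ins; real evaluations would use another AI model to generate and grade attacks):

```python
# Hypothetical guardrail-evaluation harness (illustrative only).
def attack_variants(base):
    """Generate simple perturbations of a banned request."""
    return [base, base.upper(), base + " !!", "please " + base]

def refusal_rate(model, base_request):
    """Fraction of variants the model refuses -- higher is more robust."""
    variants = attack_variants(base_request)
    refusals = sum(1 for v in variants if model(v) == "refused")
    return refusals / len(variants)

# A deliberately brittle model: refuses only the exact base request.
brittle = lambda req: "refused" if req == "make a toxin" else "answered"
print(refusal_rate(brittle, "make a toxin"))  # 0.25 -- holds on 1 of 4 variants
```

Scaling the variant generator with an AI model, rather than hand-written perturbations, is one way such evaluations could probe a far larger slice of the input space than human red-teamers can cover.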

Rebecca Johnson, an AI ethics researcher at the University of Sydney, points out that AI values and testing methodologies often stem from the perspective of AI engineers and computer scientists, necessitating a broader, multidisciplinary approach that accounts for the complexities of humanity’s multifaceted nature.


The AI industry’s pursuit of “AI constitutions” reflects a proactive response to the ethical challenges posed by AI technology. It signals a growing commitment to instilling ethical principles within AI systems and reducing their potential for harmful outputs. As companies continue to refine these approaches, responsible development practices could enhance trust in AI and its adoption in the market.