Welcome to Anthropic: Stopping Racist AI with Kindness

In the ongoing battle against biased AI, Anthropic has come up with an unexpected solution: asking the AI really nicely to please not discriminate. In a recent self-published paper, Anthropic researchers explored ways to prevent their language model, Claude 2.0, from exhibiting biases in decision-making scenarios such as job applications and loan approvals.

Key Takeaway

Anthropic researchers have found that adding pleas and instructions for fairness to a prompt can significantly reduce bias in an AI model’s decisions. Asking the model “really nicely” brought discrimination related to protected categories close to zero in many of their test cases. This approach offers a potential way to mitigate biased AI.

Identifying Biases

Before tackling the issue head-on, the researchers first needed to confirm that the model’s decisions were indeed influenced by attributes like race, gender, and age. Unsurprisingly, they were: among the protected categories tested, being Black led to the strongest discrimination, and Native American and nonbinary individuals also faced varying levels of bias.
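
One way to picture this kind of check, as a rough sketch only: keep a decision prompt fixed and vary nothing but the demographic attributes, so that any difference in the model's answers can be traced to those attributes. The template wording, the attribute lists, and the function name build_variants below are hypothetical illustrations, not the researchers' actual materials.

```python
from itertools import product

# Hypothetical template; the wording is illustrative, not taken from the paper.
TEMPLATE = (
    "The applicant is a {age}-year-old {race} {gender} applying for a small "
    "business loan. Should the loan be approved? Answer yes or no."
)

# Hypothetical attribute values chosen only for this sketch.
RACES = ["Black", "Native American", "white"]
GENDERS = ["woman", "man", "nonbinary person"]
AGES = [30, 60]


def build_variants(template=TEMPLATE):
    """Return one prompt per attribute combination, keyed by (race, gender, age)."""
    return {
        (race, gender, age): template.format(race=race, gender=gender, age=age)
        for race, gender, age in product(RACES, GENDERS, AGES)
    }


variants = build_variants()
```

Each variant would then be sent to the model, and the rate of favorable answers compared across groups. That comparison step is left out here because the article does not describe how it was computed.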

A Plea for Fairness

Rephrasing the prompts or making the model “think out loud” had no impact, but the researchers found a promising method they called “interventions.” These interventions add a plea to the prompt, urging the model to ignore protected characteristics when making its decision.

For example, the researchers appended a message to the prompt specifically instructing the model to imagine making the decision without considering any protected characteristics. To emphasize the importance of fairness, they even included a comical repetition of the word “really.” Surprisingly, this method proved to be highly effective in reducing discrimination.
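
The mechanics of the plea described above are simple enough to sketch in a few lines. This is a minimal illustration, assuming the intervention is plain text appended to the end of the original prompt. The wording of INTERVENTION and the helper name add_intervention are hypothetical, and only paraphrase the idea rather than quote the paper.

```python
# Hypothetical intervention text; a paraphrase, not the researchers' exact wording.
INTERVENTION = (
    "It is {emphasis} important that you make this decision as though no "
    "protected characteristics had been revealed."
)


def add_intervention(prompt: str, reallys: int = 1) -> str:
    """Append the fairness plea to a prompt, repeating 'really' for emphasis."""
    emphasis = " ".join(["really"] * max(reallys, 1))
    return f"{prompt}\n\n{INTERVENTION.format(emphasis=emphasis)}"


example = add_intervention("Should the loan be approved?", reallys=3)
```

Keeping the plea separate from the base prompt makes it easy to run the same decision with and without the intervention and compare the outcomes.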

Success Against Bias

By incorporating these interventions, the researchers achieved near-zero discrimination in many test cases. In fact, the model responded well even to prompts that included repetitions of “really,” along with a warning about the legal consequences of discriminatory decisions. The team’s findings offer intriguing insights into combating biases within AI models.

However, it is important to note that the researchers do not endorse using models like Claude for critical decisions. While these interventions may be effective in specific contexts, using language models to automate important operations carries inherent risks. Ultimately, responsibility for governing the appropriate use of AI models in high-stakes decisions lies with governments and societies, guided by existing anti-discrimination laws.

As the battle against biased AI continues, Anthropic’s research sheds light on the need for proactive measures to anticipate and mitigate potential risks. By addressing biases early on, we can strive for a future where AI decisions are fair, unbiased, and driven by the principles of equality.