The era of AI agents is upon us. According to Mark Zuckerberg, "soon there will be more AI agents than people." He's not alone in making such assertions. Google CEO Sundar Pichai pronounced that this year "AI will power 60% of personal device interactions, with Gen Z adopting AI agents as their preferred method of interaction." And AI pioneer Andrew Ng says, "AI agents will become an integral part of our daily lives, helping us with everything from scheduling appointments to managing our finances. They will make our lives more convenient and efficient." Very soon, every business, government, and person will have dozens, or even thousands, of AI agents continually performing actions on their behalf. Some of those agents will interact only with other agents.
Agents derive much of their value from their ability to act autonomously, meaning you'll no longer be tied to your computer or device to complete a transaction. This kind of hyper-automation isn't new: cloud computing already relies on a variety of technologies for autonomous scaling, workload management, and database tuning. But AI agents differ significantly from these technologies because they aren't deterministic; an agent won't perform the same transaction in exactly the same manner every time it runs a task. For this reason, our existing approaches to managing automation simply don't apply.
Can AI Safety Be Entirely Encapsulated in the Model?
In pursuit of AI safety, most experts call for better governance, improved training, stronger guard rails, and more thorough testing of models before their release. In this way, the theory goes, the companies creating frontier models will produce benevolent, altruistic LLMs—good netizens of the world, as it were. With safety built right into the model, they assert, we can blithely assume that any agents based on such models are also trustworthy.
But this also means agents will be self-governing, and therefore prone to error and exposed to jailbreaks and other attacks. In my view, this is a dangerous approach. It's a "black box of trust," because the characteristics of trillion-parameter models will forever remain inscrutable to the people relying on them. Consequently, the actions that AI agents take will remain mysterious and difficult to account for or reverse. It seems to me that we need something external to the frontier models, something that sets rules of fair play, in order to establish inter-agent trust.
Thinking Outside the Model
What's needed to establish trust across AI agents is a framework for "benchmark testing" and incentivizing agents during actual operation, not just in the laboratory. This trust framework lives outside all of the models and enforces strict adherence to its rules of engagement. Think of the trust framework as the playing board (like a Go board) and the AI agents as the players on it. Whenever AI agents interact, they first need to agree to adhere to the rules of the game they're about to play.
I believe there are structural solutions needed to facilitate inter-agent trust in virtual spaces; these would include systems for agent identification and reputation. In this post, I'll only go so far as to offer a list of requirements that such structures would need to fulfill.
Now, the word "trust" is heavily overloaded, so let me pause here to define how I'm using it. In this context, trust refers to "a multilateral, durable collaborative action." If this definition feels a bit exacting, just know it isn't mine; it originates with work by Nobel Prize winner Elinor Ostrom, Edella Schlager, and others in their groundbreaking research into how parties can come to trust each other in the absence of a central authority. What they found is that such arrangements do arise in practice: parties can come to trust one another, provided some ground rules are set and enforced. This seems a proper template for AI-to-AI trust.
So what are the principles of trust? The research is extensive and beyond the scope of a single blog post. However, Schlager summarizes them this way:
Exclusion: The group must be able to guard the resource against freeloading, theft, or vandalism.
Rationality: The agreed-upon rules must be attuned to the context of the resource.
Involvement: Members have avenues to participate in modifying the operational rules.
Monitoring: There is effective monitoring and auditing of policies.
Enforcement: Sanctions can be imposed on violators of the rules.
Arbitration: Appropriators have access to low-cost but effective conflict resolution.
Autonomy: The rights of appropriators to devise their own institutions are not challenged by external governmental authorities.
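To make these principles a bit more concrete, here is a minimal sketch of how a trust framework might track whether a group of agents satisfies them. Everything in it, from the names to the all-or-nothing check, is my own illustrative assumption rather than part of any existing system.

```python
from dataclasses import dataclass, field

# The seven principles, expressed as a simple checklist. The names and this
# encoding are purely illustrative; they are not part of any existing library.
PRINCIPLES = (
    "exclusion",    # guard the resource from freeloading, theft, or vandalism
    "rationality",  # rules attuned to the context of the resource
    "involvement",  # members can help modify the operational rules
    "monitoring",   # effective monitoring and auditing of policies
    "enforcement",  # sanctions can be imposed on rule violators
    "arbitration",  # low-cost but effective conflict resolution
    "autonomy",     # members may devise their own institutions
)

@dataclass
class TrustChecklist:
    """Tracks which principles a given group of agents currently satisfies."""
    satisfied: dict = field(
        default_factory=lambda: {p: False for p in PRINCIPLES}
    )

    def mark(self, principle: str) -> None:
        # Record that the group has put a given principle into practice.
        if principle not in self.satisfied:
            raise ValueError(f"unknown principle: {principle}")
        self.satisfied[principle] = True

    def all_satisfied(self) -> bool:
        # In this sketch, a group counts as trustworthy only when every
        # principle is met; a real framework would be far more nuanced.
        return all(self.satisfied.values())
```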
I'll be exploring the ramifications and application of these "7 Principles of Trust" in more technical depth in follow-on posts. For now, let's just name a few of the features a trust framework for AI agents will require (a rough interface sketch follows the list):
A method for AI agents to form groups and know each other, which implies some form of mutual identification of all actors involved
A method for group members to agree on the rules of the "game" or transaction at hand
An ability for AI agents to monitor behaviors of other members in the group
A method for rewarding behaviors that fall inside the boundaries of play and sanctioning those that fall outside
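Here is a rough sketch of what the surface of such a framework could look like, with one method per requirement above. Every name and signature is hypothetical; treat it as a sketch of the shape of the thing, not an implementation.

```python
from abc import ABC, abstractmethod

class TrustFramework(ABC):
    """Illustrative interface for the external 'playing board' described above.
    All method names and signatures here are hypothetical."""

    @abstractmethod
    def register_agent(self, agent_id: str, credentials: bytes) -> bool:
        """Mutually identify an agent before it may join any group."""

    @abstractmethod
    def form_group(self, agent_ids: list, rules: dict) -> str:
        """Create a group whose members agree to a shared rule set for the
        transaction at hand; returns a group identifier."""

    @abstractmethod
    def report_behavior(self, group_id: str, agent_id: str, action: dict) -> None:
        """Record an observed action so the other group members can monitor it."""

    @abstractmethod
    def apply_outcome(self, group_id: str, agent_id: str, in_bounds: bool) -> None:
        """Reward behavior inside the boundaries of play; sanction behavior outside them."""
```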
As mentioned before, I have several proposals on how such a system could work, all of which will appear on this blog. Ideally, I'll be hearing from many of you with even better ideas!