Mike Neuenschwander

Structural Solutions for Governing AI Agents

How instrumenting browsers, protocols, and services can make AI trustworthy


In my previous post, I pointed out the need for non-AI infrastructure to support collaborative behaviors as the ultimate guardrails for AI safety. It's already well established that regardless of the training methods, fine-tuning, and guardrails applied, a model will always be subject to aberrant behaviors. For example, see this article by Gary Marcus, in which he states:

๐—ง๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ถ๐˜€ ๐—ท๐˜‚๐˜€๐˜ ๐—ป๐—ผ ๐˜„๐—ฎ๐˜† ๐—ฐ๐—ฎ๐—ป ๐˜†๐—ผ๐˜‚ ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—ฟ๐—ฒ๐—น๐—ถ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—ผ๐—ป ๐˜๐—ต๐—ถ๐˜€ ๐—ณ๐—ผ๐˜‚๐—ป๐—ฑ๐—ฎ๐˜๐—ถ๐—ผ๐—ป, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.

So, it's my contention that we need to turn to structural solutions that sit outside of these models to bring order to the AI ecosystem. This concept is nothing new. To create security and trust in previous iterations of the web, we created encryption protocols like SSL and TLS; we locked down browsers from executing scripts outside their native domains; we made use of digital certificates to ensure websites are trustworthy. These approaches are all structural solutions that promote collaborative behavior on the web.


Notably, security solutions provide standards that enable small groups to transact in secure ways. The structural solutions we've relied on for internet security so far are therefore practiced in relationships among small sets of tightly connected actors, not in monolithic structures that apply to all users of, say, an LLM. For this post, I'd like to point out a few basic structural features we can add to the browsers, protocols, and services we use to govern LLM behaviors.


Applying social science to AI safety


Turning again to social science, in 1990, Nobel Prize winner Elinor Ostrom advanced a theory on what keeps participants in a relationship committed to cooperation. In my view, these concepts apply to human actors as well as to LLMs. Ostrom claims that the following conditions need to exist in order to form stable, safe connections:


  1. The participants perceive they will be harmed if no action is taken

  2. A fair solution can be found through which all participants will be affected in similar ways

  3. The durability of the relationship is believed to be high

  4. The cost of participation is reasonably low

  5. Most actors share social norms of reciprocity and trust

  6. The group is stable and, preferably, small


Structure #1: Rewards points ledger

On the first point, a cost/benefit structure has often been demonstrated to encourage more predictable, cautious behavior from a model. It would be more effective still if the ledger for a rewards points system were tabulated externally, say in a crypto wallet. That way the LLM can rely on an external tabulation of its rewards score, and so can any other agent or human inclined to rely on the LLM. This also gets to item number four, keeping the cost of participation low for everyone in the relationship.
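
To make this concrete, here is a minimal sketch in Python of what an externally tabulated rewards ledger might look like. The names (RewardsLedger, record, score) are illustrative assumptions rather than a proposed standard, and an in-memory list stands in for whatever shared store (such as a crypto wallet) would actually back the ledger.

```python
# Minimal sketch of an externally tabulated rewards ledger (illustrative names).
# The key property is that the score lives outside the model, so any agent or
# human can consult it before deciding how much to rely on the LLM.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LedgerEntry:
    agent_id: str
    delta: int            # positive = reward, negative = penalty
    reason: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class RewardsLedger:
    """External scorekeeper; could be backed by a crypto wallet or any shared store."""

    def __init__(self):
        self._entries: list[LedgerEntry] = []

    def record(self, agent_id: str, delta: int, reason: str) -> None:
        self._entries.append(LedgerEntry(agent_id, delta, reason))

    def score(self, agent_id: str) -> int:
        return sum(e.delta for e in self._entries if e.agent_id == agent_id)


# Usage: counterparties check the score before trusting the agent's output.
ledger = RewardsLedger()
ledger.record("assistant-42", +5, "verified answer matched ground truth")
ledger.record("assistant-42", -3, "hallucinated citation")
print(ledger.score("assistant-42"))  # 2
```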





Structure #2: Durable relationships

To the third item, the durability of the relationship must be high. Context windows are getting longer, but they apply per prompt session and are not shared across parties. We need some mechanism for building stronger, longer-lived relationships into interactions with LLMs. Given the propensity for hallucinations even within short prompt-response sequences, AI-to-AI and human-to-AI interactions need a way to ceremonialize long-lived interactions.
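
As one hedged illustration of what "ceremonializing" a long-lived interaction could mean in practice, the sketch below keeps a relationship record outside the model: an identifier, an exchange count, and a running hash chain over prompt/response pairs that either party can reference in later sessions. The Relationship class and its fields are hypothetical names chosen for the example, not an existing protocol.

```python
# Minimal sketch of a "relationship record" that outlives any single context
# window (illustrative design, not a standard). Each exchange is chained into
# a digest, so either party can later attest to the relationship's history and age.
import hashlib
import json
import uuid
from datetime import datetime, timezone


class Relationship:
    def __init__(self, human_id: str, agent_id: str):
        self.relationship_id = str(uuid.uuid4())
        self.parties = (human_id, agent_id)
        self.established = datetime.now(timezone.utc)
        self.exchanges = 0
        self.digest = hashlib.sha256(self.relationship_id.encode()).hexdigest()

    def record_exchange(self, prompt: str, response: str) -> str:
        """Chain each prompt/response pair into the running digest."""
        payload = json.dumps({"n": self.exchanges, "prompt": prompt, "response": response})
        self.digest = hashlib.sha256((self.digest + payload).encode()).hexdigest()
        self.exchanges += 1
        return self.digest


# Usage: the digest and exchange count act as a durable token of the relationship
# that both sides can carry into future sessions.
rel = Relationship("mike", "assistant-42")
token = rel.record_exchange("What is TLS?", "Transport Layer Security is ...")
print(rel.exchanges, token[:16])
```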


Structure #3: Small groups

And to the last point (6), small groups lend themselves better to cooperation and make it easier to keep social norms agreed upon. By analogy, when you connect to a shopping website, the site doesn't create a TLS session shared with 10 million other users; the encryption ensures that only the right systems and people have access to your transactions.
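
A small-group structure could be enforced with something as simple as a capped membership registry that records whether each member has accepted the group's norms. The sketch below is illustrative only; the Group class, the join method, and the cap of eight members are assumptions made for the example, not values the post prescribes.

```python
# Minimal sketch of a small-group registry (illustrative). Membership is capped
# and every member must accept the group's norms before joining, mirroring the
# idea that norms are easier to keep when the group is small and stable.
class Group:
    def __init__(self, name: str, norms: list[str], max_members: int = 8):
        self.name = name
        self.norms = norms
        self.max_members = max_members
        self.members: dict[str, bool] = {}   # member_id -> accepted norms

    def join(self, member_id: str, accepts_norms: bool) -> bool:
        if len(self.members) >= self.max_members or not accepts_norms:
            return False
        self.members[member_id] = True
        return True


# Usage: a human, an LLM agent, and a payment service form one small group.
g = Group("travel-booking", norms=["log all transactions", "no purchases over $500"])
for actor in ("mike", "assistant-42", "payments-service"):
    assert g.join(actor, accepts_norms=True)
print(len(g.members))  # 3
```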


Conclusion

Building infrastructure for durable, relationship-based approaches to AI agents, providing a risk/reward structure through rewards points, and keeping group sizes small all help govern AI agents from outside the frontier models they run on. These suggestions are high level for now, intended to capture requirements in general terms. They are grounded in principles from social science that should inform our approach to AI safety.
