How instrumenting browsers, protocols, and services can make AI trustworthy
In my previous post, I pointed out the need for non-AI infrastructure to support collaborative behaviors as the ultimate guardrails for AI safety. It's already well established that, regardless of the training methods, fine-tuning, and guardrails applied, a model will always be subject to aberrant behaviors. For example, see this article by Gary Marcus in which he states:
"There is just no way you can build reliable agents on this foundation," where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.
So, it's my contention that we need to turn to structural solutions that sit outside of these models to bring order to the AI ecosystem. This concept is nothing new. To create security and trust in previous iterations of the web, we developed encryption protocols like SSL and TLS; we locked down browsers from executing scripts outside their native domains; we made use of digital certificates to ensure websites are trustworthy. These approaches are all structural solutions that promote collaborative behavior on the web.
Notably, security solutions provide standards that enable small groups to transact in secure ways. The structural solutions we've relied on for internet security so far are therefore practiced in relationships among small sets of tightly connected actors, not monolithic structures that apply to all users of, say, an LLM. For this post, I'd like to point out a few basic structural features that we can add to the browsers, protocols, and services we use to govern LLM behaviors.
Applying social science to AI safety
Turning again to social science, in 1990, Nobel Prize winner Elinor Ostrom advanced a theory on what keeps participants in a relationship committed to cooperation. In my view, these concepts apply to human actors as well as to LLMs. Ostrom claims that the following conditions need to exist in order to form stable, safe connections:
The participants perceive they will be harmed if no action is taken
A fair solution can be found through which all participants will be affected in similar ways
The durability of the relationship is believed to be high
The cost of participation is reasonably low
Most actors share social norms of reciprocity and trust
The group is stable and, preferably, small
Structure #1: Rewards points ledger
On the first point, cost/benefit framing has often been shown to encourage more predictable, cautious behavior from a model. It would be more effective still if the ledger for a rewards points system were tabulated externally, say in a crypto wallet. That way, the LLM can rely on an external tabulation of its rewards score, as can any other agent or human inclined to rely on the LLM. This also speaks to item four, keeping the cost of participation low for everyone in the relationship.
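As a rough illustration, here is a minimal sketch of an externally tabulated rewards ledger. The RewardLedger class, its method names, and the agent identifiers are all hypothetical; a production version might live in a crypto wallet or another tamper-evident store, as suggested above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RewardEntry:
    """A single reward or penalty event recorded outside the model."""
    agent_id: str   # identity of the LLM/agent being scored
    delta: int      # positive for cooperative behavior, negative otherwise
    reason: str     # human-readable justification for the adjustment
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class RewardLedger:
    """External tabulation of rewards, independent of any model's context."""

    def __init__(self) -> None:
        self._entries: list[RewardEntry] = []

    def record(self, agent_id: str, delta: int, reason: str) -> None:
        # Append-only: prior entries are never rewritten, so any
        # participant can audit how a score was reached.
        self._entries.append(RewardEntry(agent_id, delta, reason))

    def score(self, agent_id: str) -> int:
        # Any agent or human can query the balance without trusting
        # the LLM's own account of its behavior.
        return sum(e.delta for e in self._entries if e.agent_id == agent_id)


ledger = RewardLedger()
ledger.record("assistant-42", +5, "followed the agreed tool-use policy")
ledger.record("assistant-42", -3, "returned an unverifiable citation")
print(ledger.score("assistant-42"))  # 2
```

The key design choice is that the ledger is append-only and lives outside the model, so the score is something other parties can verify rather than something the model reports about itself.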
Structure #2: Durable relationships
To the third item, the durability of the relationship must be high: context windows are getting longer, but they are scoped to a single prompt exchange, not to an ongoing, multi-party relationship. We need some mechanism for building stronger relationships into interactions with LLMs. Given the propensity for hallucinations, even within short prompt-response sequences, AI-to-AI and human-to-AI interactions need a way to ceremonialize long-lived interactions.
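One way to picture "ceremonializing" a relationship is a signed record of the relationship that outlives any single context window. The sketch below is only illustrative: the Relationship structure, the HMAC-based attestation, and the pre-shared key are assumptions, not an established protocol.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict


@dataclass
class Relationship:
    """A long-lived record of an AI-to-AI or human-to-AI relationship."""
    relationship_id: str
    participants: tuple[str, ...]  # small, enumerated set of parties
    interactions: int = 0          # grows across many prompt/response turns


def attest(rel: Relationship, shared_secret: bytes) -> str:
    """Produce a tamper-evident attestation of the relationship state.

    Either party can recompute this outside the model to confirm that
    the relationship's history has not been silently rewritten.
    """
    payload = json.dumps(asdict(rel), sort_keys=True).encode()
    return hmac.new(shared_secret, payload, hashlib.sha256).hexdigest()


secret = b"negotiated-out-of-band"  # hypothetical pre-shared key
rel = Relationship("rel-001", ("human:alice", "agent:assistant-42"))

for _ in range(3):                  # three separate prompt/response turns
    rel.interactions += 1

print(attest(rel, secret))          # stable proof of the accumulated history
```

The point is not the particular cryptography but that the relationship itself becomes a durable, verifiable artifact rather than something that evaporates when the context window resets.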
Structure #3: Small groups
And to the last point (6), small groups lend themselves better to cooperation and offer a better chance of maintaining agreed-upon social norms. By analogy, when you connect to a shopping website, the website doesn't create a TLS session with 10 million other users; the encryption ensures that only the right systems and people have access to your transactions.
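To make the analogy concrete, here is a small sketch that gates an interaction on an explicit, bounded membership list, much as a TLS session is scoped to the endpoints that negotiated it. The SmallGroup class, its size limit, and the participant names are illustrative assumptions.

```python
class SmallGroup:
    """A bounded set of participants allowed to interact with an agent."""

    def __init__(self, members: set[str], max_size: int = 8):
        if len(members) > max_size:
            raise ValueError("group too large to sustain shared norms")
        self._members = frozenset(members)

    def authorize(self, participant: str) -> bool:
        # Only enumerated members take part, mirroring how a TLS session
        # is scoped to the parties that negotiated it.
        return participant in self._members


group = SmallGroup({"human:alice", "agent:assistant-42", "agent:retriever-7"})
print(group.authorize("agent:assistant-42"))    # True
print(group.authorize("agent:unknown-crawler")) # False
```

Keeping the group small and enumerated is what makes the other structures workable: a rewards ledger and a durable relationship record are far easier to maintain among a handful of known parties than across millions of anonymous ones.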
Conclusion
Building durable, relationship-based infrastructure around AI agents, providing a risk/reward mechanism through externally tabulated rewards points, and keeping group sizes small all help govern AI agents from outside the frontier models they run on. These suggestions are deliberately high level for now, meant to capture requirements in general terms. They are grounded in principles from social science that should inform our approach to AI safety.