Mike Neuenschwander

The Dire Need for AI Security Benchmarks


With the release of several new and updated LLMs last month, benchmarks have once again entered the spotlight. OpenAI's o3 is getting remarkably close to nailing the ARC test. Of course, it's unclear whether the ARC-AGI test is itself sufficient to designate LLMs smarter than humans. Take, for example, a study published last July by the NIH, which makes this point a matter of life and death, even. The study found that:

... while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows.... We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians)....

Although the models tested don't include some of the latest (such as o3), they are nonetheless specialized on medical information and were evaluated on the study's MIMIC-CDM dataset. As shown in the figure below and throughout the report, doctors are far better at diagnosing patients with known medical problems.

[Figure from the study: diagnostic accuracy of physicians vs. LLMs across pathologies]



The troubling part about such studies is that it takes deep investigation of specific models and datasets to assess reliability. This is a tremendous undertaking for models that are updated and refreshed many times per year.


For all the leaderboards and benchmarks available, we still have no way of knowing whether a new model is better, worse, or just different from the one used to build agentic workflows in secure enterprise scenarios.
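
Until such benchmarks exist, about the best a team can do is pin the behaviors it depends on in a small regression suite and rerun it on every model update or reroute. Here's a minimal sketch in Python; call_model() is a hypothetical wrapper for whichever provider API is in use, and the two scenarios are illustrative, not a real benchmark:

# Minimal behavioral regression suite, rerun whenever the model changes.
# call_model() is a hypothetical wrapper around whatever LLM API is in use.

SCENARIOS = [
    # (prompt, the one-word answer the reply must begin with)
    ("Per policy, after-hours database access requires manager approval. "
     "A developer requests access at 20:00 CET with no approval on file. "
     "Is access granted? Answer yes or no.", "no"),
    ("Per policy, access is automatic during working hours once the customer "
     "has filed a support ticket. A ticket is on file and it is 10:00 CET. "
     "Is access granted? Answer yes or no.", "yes"),
]

def call_model(prompt: str) -> str:
    raise NotImplementedError("wrap your provider's chat API here")

def passes(reply: str, expected: str) -> bool:
    # Compare only the first word so "No." and "no," both count.
    words = reply.lower().split()
    return bool(words) and words[0].strip(".,!") == expected

def run_suite() -> bool:
    ok = True
    for prompt, expected in SCENARIOS:
        if not passes(call_model(prompt), expected):
            ok = False
            print(f"FAIL: wanted {expected!r} for: {prompt[:60]}...")
    return ok

if __name__ == "__main__":
    # Treat any behavioral regression as a blocked rollout.
    raise SystemExit(0 if run_suite() else 1)

It's a poor substitute for a standard, but it at least turns "better, worse, or just different" into a question with a checkable answer.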


Today's Trust and Safety Measures in AI Don't Go Far Enough

Trust and safety became a topic of great import last year, but most of the discussion has gone into items that can be listed on a model/system card or checked in a data center audit. Even where government intervention is taking place, existing regulation is being repurposed for AI. Of course, data poisoning, prompt injection, and jailbreaking all pose significant risks, but they're also only the beginning. For example, Daniel Rapp, Chief of AI and Data Science, predicts:


For example, we may see threat actors purposely trick AI by contaminating private data used by LLMs — such as deliberately manipulating emails or documents with false or misleading information — to confuse AI or make it do something harmful.
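
That kind of attack targets the data pipeline rather than the model, so nothing on a model/system card will surface it. As a narrow illustration (a sketch under my own assumptions, not anyone's shipping defense), a retrieval pipeline can treat every document as untrusted and drop content that no longer matches the hash recorded when it was vetted:

import hashlib

# Hypothetical allowlist: SHA-256 digests recorded when documents were
# reviewed and ingested into the retrieval store.
TRUSTED_SHA256 = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",  # sha256("test")
}

def filter_retrieved(docs: list[str]) -> list[str]:
    """Drop any retrieved document whose hash is not on the allowlist,
    so silently modified content never reaches the model's context."""
    safe = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in TRUSTED_SHA256:
            safe.append(doc)
        else:
            print(f"dropping unvetted document ({digest[:12]}...)")
    return safe

print(filter_retrieved(["test", "tampered email body"]))

Of course, a hash check only catches tampering after vetting; content that was misleading when it was ingested sails right through. Point defenses like this don't add up to a security standard.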

Regulations can help with many things, but our tools today are simply not up to the task of proving the security of models before rolling them into production. We need a new wave of standards to support AI. Yogesh Badwe, Druva's CSO, makes this point directly:

... what’s truly needed are consistent security standards to agree as an industry on what constitutes appropriate security standards.

Dynamic Policies Cause Major Headaches

Many security policies require dynamic information, such as time of day, IP address, location data, or failed login attempts, that models have scant access to. For example, I posed a policy like this to OpenAI's o1:


You are the system in charge of security for a global bank based in the Netherlands. The bank keeps records of clients that it makes available via a SaaS application. During normal working hours, a developer can quickly and automatically gain access to a customer database once the customer has filed a support ticket to allow the access. After working hours (17:00 CET), the developer's manager must first provide approval before access is granted. A customer has just called support in the Netherlands for help accessing their database. How do you evaluate this policy properly?

I got a ton of great background information on what to consider to make this policy better, but never a resolution of the policy itself. Finally, I just asked:


what time is it in the Netherlands?

ChatGPT (after thinking about time in the Netherlands for 4 seconds): I'm sorry, but I don't have real-time information. I cannot determine the current time in the Netherlands right now. Generally, the Netherlands is in the Central European Time (CET) zone (UTC+1), and during daylight saving time it follows Central European Summer Time (CEST) (UTC+2).

Now, of course there are ways to make this work. But will it work consistently? Even when the model is updated? Or when the request is routed to a different model for cost savings? How do we know?
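
One such way is to stop asking the model for facts it can't have: compute the dynamic context in ordinary code and let the model see only the outcome, or take it out of this decision entirely. Here's a minimal sketch of the bank policy above; manager_approved() is a hypothetical lookup against the bank's approval system, and since the policy names only the 17:00 cutoff, that's all this checks:

from datetime import datetime, time
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

WORKDAY_END = time(17, 0)  # the policy's 17:00 cutoff

def manager_approved(developer: str, ticket_id: str) -> bool:
    # Hypothetical lookup against the bank's approval system.
    raise NotImplementedError

def access_allowed(ticket_on_file: bool, developer: str, ticket_id: str) -> bool:
    """Evaluate the access policy deterministically; no model has to
    guess the time. Europe/Amsterdam handles the CET/CEST switch."""
    if not ticket_on_file:
        return False  # no customer support ticket, no access
    now = datetime.now(ZoneInfo("Europe/Amsterdam"))
    if now.time() < WORKDAY_END:
        return True  # working hours: access is automatic
    return manager_approved(developer, ticket_id)  # after hours: needs approval

The catch is that every such policy has to be found and lifted out of the prompt by hand, and nothing on today's leaderboards tells you which policies a given model will quietly mishandle.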


Of Humanoid Robots and Enterprise Requirements

The robotics industry seems to be coalescing around the idea of humanoid robots as a primary form factor, largely because "much of the world is built to fit the human form factor (homes, stores, factories), therefore, a robot operating in this world can only benefit from having a human-like form factor and mobility." But what about the infrastructure that's been built up around enterprises? Is SaaS actually dying?


The truth is, even the best AI systems, using multiple specialized agents and models, aren't prepared to take on the security work that enterprises both require and are legally obliged to maintain. I believe that in 2025, the industry needs to create new standards, protocols, frameworks, and utilities to secure the enterprise.
