A stronger model with higher scores looks like a free upgrade, until the agent that worked last week starts getting things wrong, quietly. Here is what happened when we ran one agent on two frontier models and changed nothing else.
A customer is halfway through a return flow with your agent. They've shared the order number, the item, and the reason for the return. Then they pause to ask: "Wait, do you offer…
Last week a failed tool call caused GPT-5.4-mini to cancel a real order simply because a customer asked a question involving cancellation. Here's a quick test that catches it.
Expertise.ai is a known disruptor in the AI space, building AI sales agents that guide prospects through personalized flows. Here's how Voxli untangled their testing workflow.
To assess how AI agents perform with tool calls, we recently ran the same multi-turn conversation across the three tiers of OpenAI's GPT-5.4: standard, mini, and nano.
In our last post we covered the risks of agent speculation. Today we look at how to set up Voxli to catch that speculation with a feature called Hallucination detection.
It's no surprise that hallucinations are a common, well-known failure in agentic AI testing. The agent starts to overpromise, begins to fabricate answers, and even claims that it…