A stronger model with higher scores looks like a free upgrade, until the agent that worked last week starts getting things wrong, quietly. Here is what happened when we ran one agent on two frontier models and changed nothing else.
Recently, to assess AI Agent performance with tool calls, we executed the same multi-turn conversation across the three tiers of OpenAI's GPT-5.4: standard, mini, and nano.
It's no surprise that hallucinations are a common known failure during agentic AI testing. The agent starts to overpromise, begins to fabricate answers and even claims that it…