A better model can make your agent worse
A stronger model with higher scores looks like a free upgrade, until the agent that worked last week starts getting things wrong, quietly. Here is what happened when we ran one agent on two frontier models and changed nothing else.
Voxli