Will frontier models become agentic?

Abstract

Will frontier models become agentic? Some quite likely will - assuming that agentic models can significantly improve output quality. The rest of this essay explains the reasoning.

Improving LLM quality

There's some excitement about the promise of agents for improving LLM quality. In particular, it seems as though we could selectively deploy workflows involving preliminary research, chain-of-thought execution, test design and execution, and criticism by a QA agent. Such workflows are quite likely to improve response quality.
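
As a rough illustration, such a pipeline might look something like the sketch below. This is a minimal sketch in Python; complete() is a hypothetical stand-in for whatever completion API is in use, and the stage prompts are placeholders rather than a tested recipe.

    # A minimal sketch of a multi-stage agent workflow, assuming a
    # generic completion function complete(prompt) -> str (hypothetical).
    def complete(prompt: str) -> str:
        """Stand-in for a call to whatever LLM completion API is in use."""
        raise NotImplementedError

    def answer_with_workflow(question: str) -> str:
        # Preliminary research: gather relevant background first.
        research = complete(f"List background facts relevant to: {question}")
        # Chain-of-thought draft: reason step by step before answering.
        draft = complete(f"Using these notes:\n{research}\n"
                         f"Think step by step, then answer: {question}")
        # Test design and execution: check the draft against the question.
        critique = complete(f"Design checks that a good answer to "
                            f"'{question}' should pass, then apply them to:\n{draft}")
        # Criticism / revision by a QA agent, using the test results.
        return complete(f"Revise this draft answer:\n{draft}\n"
                        f"in light of this critique:\n{critique}")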

Trading tokens for quality

"Chain of thought" execution has been adopted widely by those benchmarking LLMs. "Chain of thought" execution trades tokens for quality. It allows the LLM to think by "talking to itself" before committing to producing a final answer. It turns out that adding notes into the context and breaking the problem into component parts has a positive effect on response quality.

While "chain of thought" execution is great, there are other areas where a similar approach could be used to trade the number of tokens for response quality. Tokens could be output related to:

  • Preliminary research
  • Test design
  • Test execution
  • Criticism

I describe an approach to doing this using a single request here. A single request may not seem much like using agents - but some of the principles used are those typically associated with agent workflows.
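
A single-request version can simply ask for the intermediate work as labelled sections of one response. Below is a minimal sketch; the section names and wording are illustrative and are not quoted from the approach linked above.

    # One request that asks for research, tests and criticism inline,
    # followed by a final answer - trading output tokens for quality.
    def single_request_prompt(question: str) -> str:
        return (
            "Answer the question below in four labelled sections:\n"
            "RESEARCH: relevant background facts.\n"
            "TESTS: checks a good answer should pass, and their results.\n"
            "CRITICISM: weaknesses found and how to address them.\n"
            "FINAL ANSWER: the revised answer, standing alone.\n\n"
            f"Question: {question}"
        )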

I don't have results to present yet, but most of those involved seem to think that agent workflows have the potential to improve the quality of responses.

Will frontier labs offer agent workflows?

This is the main question I want to address in this essay.

Today, we can see that labs offer models for a variety of tastes and budgets. Most frontier models are large neural networks oriented around token prediction. We can also see that some models use a "Mixture of Experts" approach to improve the quality of outputs. This is already a small step towards a workflow involving multiple agents. In particular, a workflow featuring a research agent, a testing agent and a critic could likely improve many responses.

In the future it seems quite likely that users will be able to interact with specific experts through an API. It also seems likely that the experts will increase in number and variety. The list of available experts is likely to ultimately include advanced models making use of agent workflows to improve response quality.

Typical interactions with such systems are likely to go through a "triage" phase, which chooses an expert to assign to the query, followed by an "execution" phase, during which the assigned expert processes the query, if possible.
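
A minimal sketch of that two-phase interaction is given below, assuming a hypothetical complete() call and a small, made-up catalogue of named experts.

    # Triage then execution: a router picks an expert, the expert handles
    # the query. The expert names and routing prompt are hypothetical.
    EXPERTS = {
        "fast": "Answer briefly and directly.",
        "reasoner": "Think step by step before answering.",
        "agentic": "Research, draft, test and criticise before answering.",
    }

    def handle(query: str, complete) -> str:
        # Triage phase: ask a cheap model to pick an expert by name.
        choice = complete(
            f"Pick one of {sorted(EXPERTS)} for this query: {query}"
        ).strip()
        expert = EXPERTS.get(choice, EXPERTS["fast"])  # fall back if unsure
        # Execution phase: the chosen expert processes the query.
        return complete(f"{expert}\n\n{query}")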

Benchmarks

Progress in LLM development is sometimes driven by "evals" - tests or benchmarks. What we have seen so far indicates that, if it is possible to improve performance via changes to the prompt, then that is what will happen. Specifically, it was previously found that using a "chain of thought" prompt helped with benchmarks. The reaction was to add this instruction to the prompt during the testing process.

We may see a similar thing here. If research, testing or criticism have beneficial effects on benchmark scores, then they too could be incorporated into benchmark execution via prompt engineering. This is likely to make incorporating these workflows into the models themselves seem less urgent - because it would not further improve benchmark scores.
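
At the harness level, that kind of prompt engineering might look something like the sketch below; complete() and score() are hypothetical stand-ins for the benchmark's own machinery.

    # Wrap each benchmark item with workflow instructions before scoring.
    WRAPPER = ("Research the question, think step by step, check your "
               "answer, then give the final answer on its own line.\n\n")

    def run_benchmark(items, complete, score):
        # items: iterable of (question, expected_answer) pairs.
        results = [score(complete(WRAPPER + q), expected) for q, expected in items]
        return sum(results) / len(results)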

Timescale

While agent workflows may be coming to frontier models, we really need a cost-benefit analysis to see how much benefit they can bring and at what cost. Until we have that, the timescale for deployment is quite uncertain.

Tim Tyler | Contact | http://matchingpennies.com/