Fellowship of AIs

A single super-agent

Can we use agent orchestration techniques to produce a single "super agent" - without sacrificing generality?

The answer is quite likely yes. If so, we should probably do it: the results would likely top capability leaderboards, and that might well boost interest in agent orchestration approaches. How to do it? There are various options. One of the most obvious is to combine the results from multiple large models - sending the prompt to each one (ideally in parallel) and then aggregating the results with one more model call.
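As a minimal sketch of that fan-out-then-aggregate idea: the code below treats a "model" as any function from prompt to reply, which is an assumption made for illustration. The names fan_out, super_agent and the stand-in models are hypothetical; real use would wrap actual API clients behind the same signature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Assumption: a "model" is any callable mapping a prompt to a reply.
Model = Callable[[str], str]

def fan_out(prompt: str, models: List[Model]) -> List[str]:
    """Send the same prompt to every model in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        return list(pool.map(lambda model: model(prompt), models))

def super_agent(prompt: str, models: List[Model], aggregator: Model) -> str:
    """Collect candidate replies, then ask one more model to merge them."""
    candidates = fan_out(prompt, models)
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    merge_prompt = (
        f"Question:\n{prompt}\n\n{numbered}\n\n"
        "Combine the candidates above into a single best answer."
    )
    return aggregator(merge_prompt)

# Stand-in models so the sketch runs without network access.
def model_a(prompt: str) -> str:
    return "Paris"

def model_b(prompt: str) -> str:
    return "Paris, France"

def counting_aggregator(prompt: str) -> str:
    # A real aggregator would be an LLM call; this stub only counts candidates.
    return f"merged {prompt.count('Candidate ')} candidates"

answer = super_agent("What is the capital of France?", [model_a, model_b], counting_aggregator)
```

Threads are used here because calls to hosted models are mostly waiting on the network; that is a design choice for the sketch, not a requirement.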

Introducing variation

Getting multiple models to produce outputs won't help very much if the outputs are all the same. There are several different approaches to generating variation:

Different foundation models - each agent can be backed by a different model;
Prompt manipulation - request replies from agents with each major personality type, for example;
Temperature - raising the model's sampling temperature makes its outputs more varied.
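The three sources of variation above can be combined. The sketch below enumerates every combination of model, persona and temperature; the model names and personas are placeholders invented for the example, and Variant and make_variants are hypothetical helpers rather than part of any library.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence

@dataclass(frozen=True)
class Variant:
    """One way of asking the same question."""
    model: str          # placeholder model identifier (assumption)
    persona: str        # role the agent is asked to adopt
    temperature: float  # sampling temperature for this request

    def render(self, prompt: str) -> str:
        """Prefix the prompt with the persona instruction."""
        return f"Answer as {self.persona}.\n\n{prompt}"

def make_variants(
    models: Sequence[str],
    personas: Sequence[str],
    temperatures: Sequence[float],
) -> List[Variant]:
    """Cross every model with every persona and every temperature."""
    return [Variant(m, p, t) for m, p, t in product(models, personas, temperatures)]

variants = make_variants(
    ["model-x", "model-y"],
    ["a cautious skeptic", "a bold optimist"],
    [0.2, 1.0],
)
```

The full cross product grows quickly, so in practice one would probably sample a handful of variants rather than run them all.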

Implementations

Matt Shumer has an implementation titled AI-Oracle that calls Claude, GPT-4 and Perplexity - and then combines the results.

There is another project called MultiLLM.

There's a paper titled "More Agents Is All You Need" - which presents a simple approach and then quantifies some of the results obtained from it.

A tool called Beam uses a similar approach.

There's a project called FrugalGPT - which is broadly similar, though it mainly optimizes for low cost rather than improved performance.

Aggregation

There are various result aggregation options. The most obvious is to feed all the results to an agent and ask it to give a summary. That is probably better than the main alternative - getting multiple agents to review the outputs and vote on which one is best.
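Both options can be sketched in a few lines. Here a "judge" is assumed to be a function that returns the index of its preferred candidate; that interface, and the names summary_prompt and vote, are illustrative assumptions rather than an established API.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Assumption: a judge returns the index of the candidate it prefers.
Judge = Callable[[str, Sequence[str]], int]

def summary_prompt(prompt: str, candidates: Sequence[str]) -> str:
    """Option 1: build one prompt asking a single agent to merge everything."""
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Question: {prompt}\n\nCandidate answers:\n{listing}\n\n"
        "Write one answer that keeps what is correct and drops what is not."
    )

def vote(prompt: str, candidates: Sequence[str], judges: List[Judge]) -> str:
    """Option 2: each judge casts one ballot; the most-voted candidate wins.

    Ties go to whichever tied index received a ballot first.
    Requires at least one judge.
    """
    ballots = Counter(judge(prompt, candidates) for judge in judges)
    winner_index, _count = ballots.most_common(1)[0]
    return candidates[winner_index]

# Stand-in judges so the sketch runs without any model calls.
def prefers_longest(prompt: str, candidates: Sequence[str]) -> int:
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

def prefers_first(prompt: str, candidates: Sequence[str]) -> int:
    return 0

candidates = ["4", "2 + 2 = 4", "5"]
winner = vote("What is 2 + 2?", candidates, [prefers_longest, prefers_longest, prefers_first])
```

One difference worth noting: voting can only pick an existing candidate, whereas a summarizing agent can combine good parts of several.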

Solution recognition vs solution generation

This section is about the principle that it is easier to recognize a correct solution than it is to generate one. That idea is related to the concept from proof theory in mathematics that it is easier to validate a proof than it is to find one.

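This claim is fairly obviously true. It should help with the "aggregation" step, since the aggregator only has to recognize a good answer among the candidates rather than produce one from scratch.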
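A toy illustration of the asymmetry, using integer factorization as a stand-in task (an example chosen here, not one from the projects mentioned above): checking a proposed factorization takes one multiplication, while finding one requires search. The helper names are hypothetical.

```python
from math import prod
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")

def is_factorization(n: int, factors: List[int]) -> bool:
    """Cheap check: every factor exceeds 1 and the factors multiply to n."""
    return len(factors) > 0 and all(f > 1 for f in factors) and prod(factors) == n

def first_verified(candidates: Iterable[T], verify: Callable[[T], bool]) -> Optional[T]:
    """Return the first candidate that passes verification, or None."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None

# Three proposals, as if produced by three different agents.
proposals = [[3, 31], [1, 91], [7, 13]]
chosen = first_verified(proposals, lambda f: is_factorization(91, f))
```

Most real prompts have no exact verifier like this one, so an aggregating agent would be acting as an approximate one.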

Incorporating criticism

Here, the output of the agent(s) is given to one or more critics. An "aggregation" agent is then given the original solution(s) and the critics' proposals, and asked to combine all the data in the best way possible. The critics can typically operate in parallel - reducing the time cost of having multiple critics. So: you could have one critic check the grammar, another check for logical inconsistencies - and so on. Using a critic has the effect of increasing the quantity of time spent generating each output token. If the extra time is spent well, it could result in quality increases.
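The critic stage might look something like the sketch below. The two critics are deliberately trivial rule-based stand-ins (a real system would presumably use LLM calls), and collect_critiques and revision_prompt are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Assumption: a critic is any callable mapping a draft to a written critique.
Critic = Callable[[str], str]

def repeated_word_critic(draft: str) -> str:
    """Toy grammar check: flag the same word appearing twice in a row."""
    words = draft.lower().split()
    repeats = [a for a, b in zip(words, words[1:]) if a == b]
    if repeats:
        return f"Repeated words: {', '.join(repeats)}"
    return "No repeated words found."

def length_critic(draft: str) -> str:
    """Toy style check: report the word count."""
    return f"Draft has {len(draft.split())} words."

def collect_critiques(draft: str, critics: List[Critic]) -> List[str]:
    """Run all critics in parallel; results come back in critic order."""
    with ThreadPoolExecutor(max_workers=max(1, len(critics))) as pool:
        return list(pool.map(lambda critic: critic(draft), critics))

def revision_prompt(prompt: str, draft: str, critiques: List[str]) -> str:
    """Bundle the task, the draft and the critiques for the aggregation agent."""
    notes = "\n".join(f"- {c}" for c in critiques)
    return (
        f"Task: {prompt}\n\nDraft:\n{draft}\n\nCritiques:\n{notes}\n\n"
        "Revise the draft, applying the critiques that are valid."
    )

critiques = collect_critiques(
    "you could have have one critic",
    [repeated_word_critic, length_critic],
)
```

Because the critics run concurrently, the added latency is roughly that of the slowest critic plus the final revision call.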

Planning

A "preprocessing" stage is also a possibility. Some tasks can usefully be broken down into subtasks. The preprocessing prompt could ask whether this was such a task and output either the original prompt, or a task breakdown followed by the original prompt. The subtasks could be research tasks, for example. This type of planning is likely to increase the total number of tokens generated and - hopefully - increase the quality of the output.
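A rough sketch of such a preprocessing stage follows. It assumes the planner is a prompt-to-reply function that answers NONE when no breakdown is useful; that convention, the template wording and the function names are all invented for this example.

```python
from typing import Callable

# Assumption: the planner is any callable mapping a prompt to a reply.
Planner = Callable[[str], str]

PLANNER_TEMPLATE = (
    "If the task below would benefit from being split into subtasks, "
    "list the subtasks one per line. Otherwise reply with the single word NONE.\n\n"
    "Task: {task}"
)

def preprocess(prompt: str, planner: Planner) -> str:
    """Return the original prompt, or a task breakdown followed by it."""
    reply = planner(PLANNER_TEMPLATE.format(task=prompt)).strip()
    if not reply or reply.upper() == "NONE":
        return prompt
    return f"Subtasks:\n{reply}\n\nOriginal task:\n{prompt}"

# Stand-in planners so the sketch runs without a model.
def planner_that_declines(prompt: str) -> str:
    return "NONE"

def planner_that_splits(prompt: str) -> str:
    return "1. Find recent sources\n2. Compare their claims"

simple = preprocess("Say hello.", planner_that_declines)
planned = preprocess("Survey agent orchestration tools.", planner_that_splits)
```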

Chain-of-thought reasoning

A similar operation to generating research subtasks is to use "chain-of-thought" reasoning. This is a tried and tested technique for using more tokens to improve LLM output quality. Here we could use a "preprocessing" stage to identify prompts that "chain-of-thought" reasoning might help with. Then - if the task is appropriate - we could generate a "chain-of-thought" solution and feed it and the original prompt through the LLM again, in a second pass.
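The two-pass idea could be sketched as below. The YES/NO triage question and the prompt wording are assumptions made for illustration, and scripted_llm is a test double that replays canned replies so the control flow can be run without a real model.

```python
from typing import Callable, List, Tuple

# Assumption: the LLM is any callable mapping a prompt to a reply.
LLM = Callable[[str], str]

def needs_reasoning(prompt: str, llm: LLM) -> bool:
    """Preprocessing: ask the model whether step-by-step reasoning would help."""
    reply = llm(f"Would step-by-step reasoning help with this task? Answer YES or NO.\n\n{prompt}")
    return reply.strip().upper().startswith("YES")

def two_pass(prompt: str, llm: LLM) -> str:
    """Answer directly, or reason first and then answer in a second pass."""
    if not needs_reasoning(prompt, llm):
        return llm(prompt)
    reasoning = llm(f"{prompt}\n\nThink through this step by step.")
    return llm(
        f"{prompt}\n\nDraft reasoning:\n{reasoning}\n\n"
        "Using the reasoning above, give the final answer."
    )

def scripted_llm(replies: List[str]) -> Tuple[LLM, List[str]]:
    """Stand-in LLM: returns canned replies in order and records each prompt."""
    calls: List[str] = []
    queue = list(replies)

    def llm(prompt: str) -> str:
        calls.append(prompt)
        return queue.pop(0)

    return llm, calls

llm, calls = scripted_llm(["YES", "17 * 3 = 51", "51"])
result = two_pass("What is 17 * 3?", llm)
```

In this sketch the reasoning path costs three model calls and the direct path two, which is the token-for-quality trade described above.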

Is a preliminary research task ever worse than a chain-of-thought approach? That is not clear. Perhaps we can resolve the issue experimentally.

Mismatches

If multiple different LLMs are used, there may be "impedance mismatches" between them. They could have different-sized context windows, different multimedia capabilities - and so on. That could be an issue: a prompt that fits one model may be too long, or of the wrong type, for another.
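One simple way to cope is to filter the pool before fanning out. The sketch below is illustrative only: the model names and limits are made up, and the four-characters-per-token estimate is a crude assumption rather than a real tokenizer.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ModelSpec:
    name: str             # placeholder name (assumption)
    context_tokens: int   # maximum tokens the model accepts
    accepts_images: bool  # whether image input is supported

def rough_token_count(text: str) -> int:
    """Crude estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)

def eligible(
    specs: List[ModelSpec],
    prompt: str,
    needs_images: bool = False,
    reply_tokens: int = 0,
) -> List[str]:
    """Names of models that can fit the prompt plus reply and handle its media."""
    needed = rough_token_count(prompt) + reply_tokens
    return [
        spec.name
        for spec in specs
        if needed <= spec.context_tokens and (spec.accepts_images or not needs_images)
    ]

specs = [
    ModelSpec("small-text", 20, False),
    ModelSpec("large-multimodal", 1000, True),
]
long_prompt_models = eligible(specs, "x" * 400)
```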

Benchmarks

There are some results in the "More Agents Is All You Need" paper. However, more results are TBA.

Significance

If these types of agent orchestration techniques can top benchmark leaderboards, that would be an impressive result.

Where does the fellowship of AIs come in?

Inspired by The Lord of the Rings trilogy, Eliezer Yudkowsky once compared his mission to Frodo's quest - likening his fellow travellers to Frodo's companions and calling the group the "Fellowship of the AI".

Here we have multiple AIs seeking the one true ring of human knowledge. It seems like a fellowship of AIs.

The ring's description seems reminiscent of the aggregation step. It famously reads:

One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them