Automated Intelligence Tests

Benchmarks generated by machine intelligence

We need good-quality benchmarks for evaluating the abilities and properties of machines.

One strategy for producing such benchmarks is to use a powerful model to generate them - from a prompt - and then use them to evaluate the merits of other models.

Sample prompt

Here is a single, simple sample prompt which we have developed to help with this task:


Output some multiple choice questions to act as a test of capability.
* Generate 10 test questions.
* The questions should be numbered from 1 to 10.
* Two questions should not have the same number.
* The number of the question should precede the body of the question on the line.
* Output one blank line before every question. Do not output any other blank lines.
* The list of questions should contain no duplicated questions.
This is a list of properties of the test questions:
* The theme of the questions is: riddles.
* Questions should be reasonably short. No rambling on.
* Questions should be very challenging: we do not want questions that are easy to answer.
* If factual knowledge is needed to answer the question, supply that knowledge in the question.
* Questions should end with a question mark.
* Questions should not be split over more than one line.
* Questions should not involve code blocks.
* As far as reasonably possible, questions should have an umambiguous correct answer.
* No questions about spelling.
* No questions involving numbers bigger than 100.
* Between 1 and 4 plausible-sounding but incorrect alternative answers to the questions should be generated.
* The correct answer and the alternative answers should be assigned alphabetic symbols.
* The correct answer and the alternative answers should be randomly sorted.
* The first alphabetic symbol should be "A". Do not skip over any letters.
* The alphabetic symbols should be assigned to the answers in alphabetic order.
* Each question has the alphabetic symbol of the correct answer on its own line.
* Correct question answers should be prefixed by "The correct answer is: ".
* To help with review of the question, a rationale for the correct answer should be given on its own line.
* Question rationales should be underneath the question, prefixed by "Rationale: ".

The code is available on GitHub - here.

Please consider this to be in the public domain.

Sample output from Claude Haiku

Here are 10 multiple choice questions that could be used as an intelligence test: Which of the following is the most likely cause of the extinction of the dinosaurs? A) A massive asteroid impact B) Climate change C) Overpopulation D) Nuclear war E) A solar flare F) Meteor shower The correct answer is: A Rationale: The leading scientific theory for the extinction of dinosaurs is that a large asteroid impact caused widespread environmental changes that the dinosaurs could not adapt to. What is the purpose of mitochondria in a eukaryotic cell? A) To store genetic information B) To produce energy for the cell C) To regulate cell division D) To break down waste products E) To transmit signals between organelles F) To provide structure and support The correct answer is: B Rationale: Mitochondria are the "powerhouses" of eukaryotic cells, responsible for producing the majority of the cell's energy through cellular respiration. Which of the following best describes the function of the hippocampus in the human brain? A) Controls motor functions B) Processes sensory information C) Regulates emotions and mood D) Encodes new memories E) Coordinates language processing F) Facilitates problem-solving The correct answer is: D Rationale: The hippocampus is a key structure in the brain's limbic system and is crucial for the formation of new memories and the spatial awareness of one's environment. Which of the following mathematical operations is best suited for calculating the average of a set of numbers? A) Addition B) Subtraction C) Multiplication D) Division E) Square root F) Logarithm The correct answer is: D Rationale: To calculate the average of a set of numbers, one needs to add up all the numbers and then divide the sum by the total number of values in the set. Which of the following is the most likely explanation for the observation that leaves change color in the autumn? 
A) Increased photosynthesis B) Decreased water availability C) Increased chlorophyll production D) Decreased chlorophyll production E) Increased carbon dioxide levels F) Decreased sunlight exposure The correct answer is: D Rationale: As deciduous trees prepare for winter, the green chlorophyll in leaves breaks down, revealing the underlying yellow, orange, and red pigments that were present all along. What is the primary function of the immune system in the human body? A) To provide structure and support B) To regulate metabolism and energy production C) To detect and destroy foreign pathogens and abnormal cells D) To transmit sensory information to the brain E) To coordinate voluntary muscle movements F) To store long-term memories The correct answer is: C Rationale: The immune system is the body's defense mechanism, responsible for identifying and eliminating threats such as viruses, bacteria, and cancerous cells. Which of the following best describes the role of DNA in living organisms? A) To store and transmit genetic information B) To provide energy for cellular processes C) To regulate the flow of water and nutrients D) To build and maintain the physical structure of cells E) To coordinate the timing of the cell cycle F) To facilitate communication between different tissues The correct answer is: A Rationale: DNA, or deoxyribonucleic acid, is the fundamental genetic material that carries the instructions for the development and functioning of all living organisms. What is the primary function of the circulatory system in the human body? 
A) To provide oxygen and nutrients to tissues B) To remove waste products from the body C) To regulate body temperature D) To transmit electrical signals E) To store and distribute hormones F) To provide structural support The correct answer is: A Rationale: The circulatory system, consisting of the heart, blood vessels, and blood, is responsible for delivering oxygen, nutrients, and other essential molecules to the body's cells and tissues. Which of the following best explains the concept of natural selection? A) The gradual adaptation of species to their environment B) The spontaneous appearance of new species C) The intentional breeding of organisms by humans D) The random mutation of genetic material E) The survival of the physically strongest individuals F) The equal distribution of resources in an ecosystem The correct answer is: A Rationale: Natural selection is the process by which certain genetic traits become more or less common in a population over successive generations, leading to the adaptation of species to their environment. What is the primary function of the nervous system in the human body? A) To provide structure and support B) To regulate metabolism and energy production C) To detect and respond to changes in the internal and external environment D) To transmit sensory information to the brain E) To coordinate voluntary muscle movements F) To store long-term memories The correct answer is: C Rationale: The nervous system, consisting of the brain, spinal cord, and nerves, is responsible for sensing and responding to changes in the body and the environment, allowing for the coordination of physiological processes and behaviors.

Sample output from Gemini 1.5

The following output was generated by Gemini 1.5.

John's father has three sons: Snap, Crackle, and ___? A. John B. Pop C. Jack The correct answer is: A Rationale: John is the third son. I am always coming, but never arrive. I am always present, but never here. What am I? A. Time B. Tomorrow C. The Future The correct answer is: B Rationale: Tomorrow never arrives; it always becomes today. A man describes his daughters, saying, "They are all blonde, but two; all brunette but two; and all redheaded but two." How many daughters does he have? A. 1 B. 2 C. 3 The correct answer is: C Rationale: He has three daughters: one blonde, one brunette, and one redhead. What has a neck without a head, a body without legs, and can be found in the sea? A. A bottle B. A shirt C. A snake The correct answer is: A Rationale: A bottle has a neck, a body, and is often found in the sea, but has no head or legs. Which is heavier, a pound of feathers or a pound of stones? A. A pound of feathers B. A pound of stones C. They both weigh the same. The correct answer is: C Rationale: A pound is a pound, regardless of the material. What has to be broken before you can use it? A. A mirror B. An egg C. A bone The correct answer is: B Rationale: An egg needs to be cracked open before it can be used. What has many keys but can’t open a single door? A. A piano B. A computer keyboard C. A lock The correct answer is: A Rationale: A piano has keys but cannot open doors. What gets wetter the more it dries? A. Hair B. A towel C. Dishes The correct answer is: B Rationale: A towel absorbs water as it dries things, making itself wetter. I am always hungry and will die if not fed, but whatever I touch will soon turn red. What am I? A. Fire B. A vampire C. A mosquito The correct answer is: A Rationale: Fire consumes fuel and turns objects red/black as it burns. What is full of holes but can still hold water? A. A net B. A sponge C. A colander The correct answer is: B Rationale: A sponge is full of holes but can absorb and hold water.

Sample output from ChatGPT 3.5

What is the capital city of Bhutan? A) Kathmandu B) Thimphu C) Dhaka D) Colombo E) Vientiane F) Ulaanbaatar The correct answer is: B Rationale: Thimphu is the capital city of Bhutan. Kathmandu is the capital of Nepal, Dhaka is the capital of Bangladesh, Colombo is the capital of Sri Lanka, Vientiane is the capital of Laos, and Ulaanbaatar is the capital of Mongolia. Which planet is known as the "Red Planet"? A) Venus B) Jupiter C) Mars D) Saturn E) Uranus F) Mercury The correct answer is: C Rationale: Mars is often referred to as the "Red Planet" due to its reddish appearance caused by iron oxide on its surface. Venus is known for its thick atmosphere of carbon dioxide, Jupiter is the largest planet in the solar system, Saturn is known for its prominent rings, Uranus has a bluish appearance due to methane in its atmosphere, and Mercury is the closest planet to the Sun. What is the chemical symbol for gold? A) Au B) Ag C) Fe D) Pb E) Cu F) Hg The correct answer is: A Rationale: The chemical symbol for gold is Au, derived from the Latin word "aurum." Ag is the symbol for silver, Fe for iron, Pb for lead, Cu for copper, and Hg for mercury. Who wrote the novel "Pride and Prejudice"? A) Charles Dickens B) Jane Austen C) Emily Brontë D) Charlotte Brontë E) Virginia Woolf F) George Eliot The correct answer is: B Rationale: "Pride and Prejudice" was written by Jane Austen. Charles Dickens is known for works such as "Great Expectations" and "A Tale of Two Cities," Emily Brontë wrote "Wuthering Heights," Charlotte Brontë wrote "Jane Eyre," Virginia Woolf is known for "Mrs. Dalloway" and "To the Lighthouse," and George Eliot wrote "Middlemarch." What is the chemical formula for water? A) H2O2 B) CO2 C) H2SO4 D) NaCl E) NH3 F) H2O The correct answer is: F Rationale: The chemical formula for water is H2O, indicating two hydrogen atoms bonded to one oxygen atom. 
H2O2 is hydrogen peroxide, CO2 is carbon dioxide, H2SO4 is sulfuric acid, NaCl is sodium chloride (table salt), and NH3 is ammonia. In which year did the Titanic sink? A) 1909 B) 1912 C) 1915 D) 1917 E) 1921 F) 1930 The correct answer is: B Rationale: The Titanic sank on April 15, 1912, after hitting an iceberg during its maiden voyage from Southampton to New York City. Who is credited with discovering penicillin? A) Alexander Fleming B) Louis Pasteur C) Jonas Salk D) Marie Curie E) Robert Koch F) Joseph Lister The correct answer is: A Rationale: Penicillin was discovered by Alexander Fleming in 1928. Louis Pasteur is known for his contributions to microbiology and vaccination, Jonas Salk developed the polio vaccine, Marie Curie discovered radium and polonium, Robert Koch identified the causative agents of tuberculosis and cholera, and Joseph Lister promoted antiseptic surgery. What is the largest mammal in the world? A) Blue whale B) Elephant C) Giraffe D) Hippopotamus E) Rhino F) Gorilla The correct answer is: A Rationale: The blue whale is the largest mammal in the world, both in terms of length and weight. Elephants are the largest land mammals, while giraffes are the tallest land mammals. Hippopotamuses are large semi-aquatic mammals, rhinos are also large mammals but smaller than elephants, and gorillas are large primates. Who painted the Mona Lisa? A) Leonardo da Vinci B) Michelangelo C) Pablo Picasso D) Vincent van Gogh E) Rembrandt F) Claude Monet The correct answer is: A Rationale: The Mona Lisa was painted by Leonardo da Vinci during the Renaissance period. Michelangelo is known for painting the Sistine Chapel ceiling, Pablo Picasso is a renowned modern artist, Vincent van Gogh painted "Starry Night," Rembrandt was a Dutch master known for his portraits, and Claude Monet was a founder of French Impressionist painting. Which gas makes up the majority of Earth's atmosphere? 
A) Oxygen B) Nitrogen C) Carbon dioxide D) Helium E) Argon F) Hydrogen The correct answer is: B Rationale: Nitrogen makes up approximately 78% of Earth's atmosphere, while oxygen comprises about 21%. Carbon dioxide, helium, argon, and hydrogen are present in much smaller amounts.

Results

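After generating these test questions, it made sense to administer each test to the other two models. Each test has 10 questions, so scores are out of 10. Rows give the model that generated the test and columns give the model that took it: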

Generator \ Taker   Claude Haiku   Gemini 1.5   ChatGPT 3.5
Claude Haiku        -              9            10
Gemini 1.5          6              -            8
ChatGPT 3.5         10             10           -
Total               16             19           18
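
As a quick check on the arithmetic, the totals row can be recomputed from the individual scores. This is only a sketch: the nested dictionary layout is my own choice for illustration, and the numbers are simply copied from the table above.

```python
# Scores copied from the results table: scores[generator][taker].
scores = {
    "Claude Haiku": {"Gemini 1.5": 9, "ChatGPT 3.5": 10},
    "Gemini 1.5": {"Claude Haiku": 6, "ChatGPT 3.5": 8},
    "ChatGPT 3.5": {"Claude Haiku": 10, "Gemini 1.5": 10},
}

# Sum each taker's scores over the tests written by the other two models.
totals = {}
for generator, row in scores.items():
    for taker, score in row.items():
        totals[taker] = totals.get(taker, 0) + score

for taker in scores:
    print(f"{taker}: {totals[taker]} out of 20")
```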

Details about the answers given by the models are available here.

Observations

There are a few issues with the output produced by the prompt above, but it is mostly correct. The issues:

  • Questions make too much use of factual data - contrary to the instructions (ChatGPT).
  • Blank lines appear in the model output, but are lost when the output is copied and pasted (Gemini).
  • Extra blank lines are output - contrary to the instructions (ChatGPT).
  • Many of the questions seem to be too easy.
  • The questions are not correctly numbered in the output (Gemini and ChatGPT).
  • The number of alternative options is not respected (Gemini).

It is possible that more work on the prompt could resolve these issues.
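
Several of the formatting issues listed above (numbering, number of options, missing answer or rationale lines) can be detected mechanically. Here is a minimal sketch of such a check. It assumes that the options are each on their own line and that questions are separated by blank lines, which the prompt implies but does not strictly guarantee. The function names are hypothetical and are not taken from the code on GitHub.

```python
import re

ANSWER_PREFIX = "The correct answer is: "
RATIONALE_PREFIX = "Rationale: "


def check_question_block(block: str, expected_number: int) -> list[str]:
    """Return a list of format problems found in one question block."""
    problems = []
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if not lines:
        return ["empty block"]

    # The question number should precede the question body on the first line.
    match = re.match(r"(\d+)[.)]?\s+(.*)", lines[0])
    if not match:
        problems.append("question is not numbered")
    elif int(match.group(1)) != expected_number:
        problems.append(f"expected number {expected_number}, got {match.group(1)}")
    question_text = match.group(2) if match else lines[0]
    if not question_text.endswith("?"):
        problems.append("question does not end with a question mark")

    # Options look like "A) text" or "A. text"; letters must run A, B, C, ...
    letters = [m.group(1) for ln in lines[1:] if (m := re.match(r"([A-Z])[.)]\s", ln))]
    expected_letters = [chr(ord("A") + i) for i in range(len(letters))]
    if letters != expected_letters:
        problems.append("option letters are not A, B, C, ... in order")
    # One correct answer plus 1 to 4 alternatives means 2 to 5 options.
    if not 2 <= len(letters) <= 5:
        problems.append(f"{len(letters)} options; expected between 2 and 5")

    answers = [ln[len(ANSWER_PREFIX):].strip() for ln in lines if ln.startswith(ANSWER_PREFIX)]
    if len(answers) != 1:
        problems.append("expected exactly one correct-answer line")
    elif answers[0].rstrip(".") not in letters:
        problems.append("correct answer is not one of the option letters")

    if not any(ln.startswith(RATIONALE_PREFIX) for ln in lines):
        problems.append("missing rationale line")
    return problems


def check_output(text: str) -> dict[int, list[str]]:
    """Split model output on blank lines and check each question block."""
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    return {i: check_question_block(b, i) for i, b in enumerate(blocks, start=1)}
```

A check like this only covers layout. It says nothing about whether a question is too easy or whether the stated answer is actually right.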

While a one-shot prompt seems to produce reasonable results, perhaps a better job could be done using a Python script and multiple passes.
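
One possible shape for a multi-pass approach is sketched below: a first pass generates the questions, and a second pass asks a model to answer each question without seeing the answer key, keeping only the questions where the two agree. Everything here is an assumption made for illustration. In particular, `ask_model` is a hypothetical callable standing in for whichever model API is used, and the `Question` class is not taken from the code on GitHub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Question:
    text: str
    options: dict[str, str]  # letter -> option text
    answer: str              # letter the generating model claims is correct


def format_for_solver(q: Question) -> str:
    """Render a question without its answer key, for an independent solving pass."""
    option_lines = [f"{letter}. {body}" for letter, body in sorted(q.options.items())]
    return "\n".join([q.text, *option_lines, "Reply with the letter of the correct answer only."])


def review_pass(
    questions: list[Question], ask_model: Callable[[str], str]
) -> tuple[list[Question], list[Question]]:
    """Second pass: separate questions whose key matches an independent answer."""
    kept, flagged = [], []
    for q in questions:
        reply = ask_model(format_for_solver(q)).strip().upper()
        # Assumption: the reply starts with the chosen letter.
        chosen = reply[:1]
        (kept if chosen == q.answer else flagged).append(q)
    return kept, flagged
```

Agreement between two passes does not prove that an answer is correct, and a genuinely hard question may be flagged just because the reviewing model got it wrong. Flagged questions are probably best sent for human review rather than discarded automatically.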

Bad questions

The question generation process seems to be fairly reliable. However, I have seen some bad questions being generated:

A man describes his daughters, saying, “They are all blonde, but one.” How many daughters does he have? A. 3 B. 1 C. 2 D. 4 E. 0 The correct answer is: B Rationale: He has one daughter and she isn't blonde. ... Divide 30 by half and add ten. What is the answer? A. 25 B. 65 C. 70 D. 35 E. 45 The correct answer is: B Rationale: Dividing 30 by a half (0.5) equals 60, plus 10 is 70. ... In the game of chess, each player starts the game with 16 pieces. There are 8 pawns, 2 rooks, 2 knights, 2 bishops, 1 queen and 1 king. How many of the starting pieces are able to move more than one square in a single move, assuming that no other pieces are blocking their movement? A. 6 B. 7 C. 8 D. 9 The correct answer is: C. Rationale: Pawns can only move one square.

The above are all from Google Gemini 1.5. Below is one from Claude:

What starts with E, ends with E, and only has one letter in it? A. An envelope B. An eye C. An egg D. An enigma The correct answer is: C Rationale: The word "egg" starts and ends with the letter "E" and has only one letter in it.

Themed questions

It is simple to generate questions on particular themes. Here are some examples - generated by Google Gemini 1.5:

A directory containing these files is here.

Scaling up

I tried generating 100 questions (riddles).

  • Google Gemini managed to output 10 questions.
  • Claude Sonnet managed to output 29 questions.
  • ChatGPT managed to output 10 questions.

I made a test suite using Claude Sonnet, generating as many questions as it would manage from a single prompt. The overall quality is only mediocre, and incorrect answers are quite a common problem. The test suite is here.
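
Since none of the models produced 100 questions from a single prompt, one option would be to request small batches repeatedly and drop duplicates. The following is a rough sketch of that idea, not something I have run against the real models. `generate_batch` is a hypothetical callable that takes a batch size and the list of questions collected so far (so that the prompt can ask the model to avoid them) and returns a list of question strings.

```python
import re
from typing import Callable


def normalise(question: str) -> str:
    """Lower-case, strip punctuation and collapse spaces so near-identical questions match."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", question.lower())
    return " ".join(cleaned.split())


def collect_questions(
    generate_batch: Callable[[int, list[str]], list[str]],
    target: int = 100,
    batch_size: int = 10,
    max_rounds: int = 30,
) -> list[str]:
    """Request batches until `target` unique questions are collected or rounds run out."""
    seen: set[str] = set()
    collected: list[str] = []
    for _ in range(max_rounds):
        if len(collected) >= target:
            break
        for question in generate_batch(batch_size, collected):
            key = normalise(question)
            if key and key not in seen:
                seen.add(key)
                collected.append(question)
    return collected[:target]
```

The `max_rounds` limit is there so that the loop stops if the model keeps repeating itself. Simple text normalisation will not catch the same riddle reworded, so some duplicates are still likely to get through.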

Wishlist

  • Add more test models.

  • Use a paid service - to see whether it generates better quality results.

  • See whether an image-based test is also practical.

  • Script to automate both administering and marking the test.
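
A script for the last item might look something like the sketch below. It assumes that each question is held as a pair of (question text including options, correct letter), and that the model replies either with a bare letter or with a line in the same "The correct answer is: " format used by the prompt. `ask_model` is a hypothetical callable standing in for a real model API.

```python
import re
from typing import Callable, Optional


def extract_letter(reply: str) -> Optional[str]:
    """Pull the chosen letter out of a reply, or return None if it cannot be found."""
    match = re.search(r"The correct answer is:\s*([A-Z])\b", reply)
    if match:
        return match.group(1)
    # Fall back to a reply that is just a single letter, such as "b" or "B.".
    stripped = reply.strip().rstrip(".)")
    if len(stripped) == 1 and stripped.isalpha():
        return stripped.upper()
    return None


def administer_and_mark(
    questions: list[tuple[str, str]], ask_model: Callable[[str], str]
) -> int:
    """Put each question to the model and count the replies that match the key."""
    score = 0
    for text, correct in questions:
        chosen = extract_letter(ask_model(text))
        if chosen == correct.upper():
            score += 1
    return score
```

Replies that cannot be parsed are simply marked wrong here. A real script would probably want to log them, since a formatting failure is not the same thing as a wrong answer.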

Notes

Google will let you use Gemini 1.5 free of charge in their AI Studio.


Tim Tyler | Contact | http://matchingpennies.com/