Creating a standard for question difficulty levels for MIMIR leagues

- Paul Pop

Apr 19, 2024

While we are nearing our next season (which we plan to make happen at the end of next month or beginning of June), I have quite a difficult topic to discuss.

Question difficulty level is something all question setters/editors aspire to get right, so as to give every player an equivalent number of questions of the same difficulty level. Failing to do so will result in advantages for one or more of the players, and defeats one basic assumption of the mimir format: that everyone has an equal chance of scoring the maximum number of points (which will be equal if every player is equally strong). This can result in a sub-optimal experience, and one common complaint in mimir leagues is that seats are not balanced.

Is a perfectly balanced set possible with all the quads having the same gradient of difficulty levels (from 1 to 4)? Perhaps. But, given the data from many seasons of many mimir quiz leagues, including ours, it is likely impossible. We rarely find quads which have the assigned difficulty level (before the game-week) equalling that of the difficulty based on the data collected from the game-week. For example, see our public scoreboard from season 1. You can see here that the assigned difficulty level (in the quad code) is rarely in the same order as ‘D.op = Difficulty level based on quarter-part (25%) distribution of player-wise data’ or ‘D.g = Difficulty level based on quarter-part (25%) distribution of game-wise data’.

I dare you to find a game-week in any of the mimir quiz leagues where D.op or D.g or equivalent, is in the same order for every single quad. I strongly believe that has never happened in the history of any mimir quiz league, and if it has happened, I suspect the quiz setter is an AI which has gathered all the information about you including your private browsing history, your deepest darkest secrets and phobias, so as to assess the likelihood of you answering a question correctly or passing or getting it incorrect.

So, I propose creating a standard for preparing questions to fit a difficulty level rating, instead of basing it solely on averages from play-testing before the game-week (usually with a maximum of 8 people, and often only 4) or on a “feels like level X” assignment of difficulty level. While these two can be used as corroborative tools, we should try to develop a more robust system for assigning question difficulty level, one which has less to do with our feelings or low-sample-size averages, and more to do with an objective methodological approach.

What I have to propose here is this: instead of doing post-hoc modifications of questions to make them a certain level of perceived difficulty, we (setters and players) collectively agree that a framework of carefully created guidelines represents the difficulty levels (instead of assessing it based on D.op or D.g or equivalent ratings) and use those guidelines to frame any question so as to create questions of a certain difficulty level.

We at EMU invite everyone, whether the setters/editors of a mimir league or players of such leagues, to provide feedback on this first attempt to create a set of guidelines. We can build it into something better with all your help, and everyone (the setters/editors who will start receiving fewer complaints, and the players who will acknowledge that the setters/editors have followed carefully drafted guidelines) benefits from it. You can even name this standard: instead of ‘Kyoto Protocol’, it could be ‘Kyu two two protocol’ (Why two 2’s (diff. level) protocol). Give your suggestions/feedback for this set of guidelines, and suggest a good name for this standard in the comments section, by email, or on our Facebook group.

So, finally, here are the guidelines:

Guidelines for different difficulty levels:

1. Level 1 should be questions whose answers are likely to be known and/or answered by 76-100% of the participants. So, these should be easy questions. These are questions which require almost no recall from memory, as the answers can come from ‘muscle memory’. No critical thinking is necessary for cracking such questions. While it is not recommended, it is permitted to have answers which have come before in other quizzes (but not EMU) (“Peters”) for level 1 questions. The questions should be easily workoutable from the clues. These are the ways one can achieve this:

1a) Keep at least one easy hint, ideally a tangential hint like example or pop-culture reference that leads to the entire answer. For e.g., (bolded here),
Q: Some carbon from the terrestrial biosphere and hydrosphere is returned to the atmosphere via natural emissions of [BLANK]. The primary natural source of [BLANK] emissions is wetlands, followed by either onshore geological sources (gas–oil seeps, mud volcanoes, diffuse microseepage, and geothermal manifestations including volcanoes) or termites. Termites, who occur predominantly in the tropical and subtropical latitudes release [BLANK] during the anaerobic decomposition of plant biomass in their gut. FITB.
A: Methane/CH4. Termite mounds mitigate approx. half of termite methane emissions (which are only ~1 to 3% of global methane emissions).
1b) When clueing, use clues that are quite direct with no misdirection. See example above. Here is another example (see bolded portion):
Q: The article (see image) describes the significant maturation of an Italian artist in his scientific method, over the course of his life (from 1452 to 1519), through his sketches and notes. For example, an early sketch of the human head by him illustrates the medieval doctrine that the vertical and horizontal axes of the skull must cross at the site of the sensus communis, or common sense, where all perceptions were believed to be gathered. While anatomically detailed, his early sketches drew from previous misconceptions. Identify the artist, whose most famous work can be found in the Louvre Museum in Paris. https://imgur.com/PigDli4.jpg (article); https://imgur.com/FE74sNz.jpg (sketching)
A: Leonardo da Vinci. Accept Leonardo or da Vinci (his most famous painting being Mona Lisa) https://imgur.com/YO5iI4e.jpg
1c) Use questions with very simple terms or explanations as answers, irrespective of the number of the words. Examples:
Q: Hirudin is a protein found in the salivary glands of leeches. What does it help in? https://imgur.com/9fYhjcO.jpg
A: It helps them feed without the coagulating/clotting of the blood of the victim OR it thins the blood
Q: With a length of ~2,000 km and width of ~16-320 km, the largest aggregation of coral reefs in the world, found off the Queensland coast of Australia, is of what type? Reefs of this type are separated from the coast by wider, deeper lagoons. At their shallowest points, they can reach the water’s surface, forming a [BLANK] to navigation. https://imgur.com/GEMAJIL.jpg
A: Barrier reefs. The one in Australia being the Great Barrier Reef https://imgur.com/UYl4vmi.jpg
1d) The answers can be those which are known by (almost) everyone AND exposed to on a daily basis (see examples above).
1e) Letter count should be added for all level 1 questions unless it is extremely easy.

2. Level 2 should be questions whose answers are likely to be known by 51-75% of the participants, or likely to be answered by 51-75%. So, these should be moderately difficult questions, leaning towards the easier side. They will only require minimal recall from memory and/or minimal critical thinking. For this, the questions should be workoutable with a little effort, as all the relevant clues will directly lead to the full answer. These are the ways one can achieve this:

2a) Keep a tangential hint that leads to the entire answer. For e.g., (bolded here), the clue leads to the answer ‘tripod’, and not ‘tri’ or ‘pod’ alone.
Q: Bathypterois grallator, the [BLANK] fish or [BLANK] spiderfish, is a deep-sea benthic fish found relatively widely in the Atlantic, Pacific, and Indian oceans. They seem to prefer to perch on the pelagic sediment using elongated fin rays in the tail and two pelvic fins to stand. This resembles a [BLANK], the reason for their naming, which is shared with a camera equipment used to stabilize the camera and used in low light photography. FITB. https://imgur.com/HZW5GMJ.jpg
A: Tripod https://imgur.com/PngpVps.jpg
2b) When clueing, use clues that are rather direct, but along with something that may very slightly confuse the participants. For example, the bolded portion in the question above, the clue almost only leads to the answer ‘tripod’, but the ‘low light photography’ part may make some people rethink the answer for a bit.
2c) Use questions with one or two-word moderately difficult answers, but not more than that (unless it's a moderately difficult explanation). Examples are:
One word answer (only bolded portion is needed):
Q: Lippershey is most famous for filing for the patents of a device to the States General of Netherlands who granted him no patents, but instead 900 florins to modify it into a binocular. What device?
A: Refracting telescope https://imgur.com/gibD7bB.jpg
Simple explanation type
Q: The age difference between Reuben Blake and his twin sister is around 4 years and 10 months. This is because their parents Simo and Jody Blake got pregnant via in-vitro fertilization twice from the same embryo batch. However, twins being born in different years happens naturally too, like in the case of Buffalo, N.Y.’s Ronan Rosputni and his brother Rory. How?
A: New Year’s Eve and New Year’s Day (accept any similar variant). Ronan Rosputni was born at 11:37 p.m. on Dec. 31, 2011 and his brother Rory was born at 12:10 a.m. on Jan. 1.
2d) The answers can be those which are known by everyone but NOT exposed to on a daily basis (see examples above).
2e) Letter count can be used to aid the participants with the questions.

3. Level 3 should be questions whose answers are likely known by 26-50% of the participants, or likely to be answered by 26-50% because the time is not sufficient to work out the clues. So, it should be a reasonably tough question, involving good recall from memory and/or moderate critical thinking. For this, either the question must have somewhat harder clues or the clues should lead to only one part of the answer. These are the ways one can achieve this:

3a) Keep hints that lead to only part of the answer (which may not be required). For e.g., (bolded here) the clue leads to a part of the name that is not necessary, as the surname is the only necessary part (the rest of the name merely earns brownie points):
Q: Play audio-visual after the question. Which composer met a Common Starling Sturnus vulgaris on May 27th, 1784, in a Viennese shop while the bird was singing an improvised version of the theme from his Piano Concerto no. 17 in G major? [BLANK] bought and took him home to be a family pet. Lyanda Lynn Haupt details this in her book [BLANK]’s Starling. In the audio-visual, you can hear Haupt’s rescued starling Carmen singing the [BLANK] composition (feebly). He has a pack of wild canids in his name. https://imgur.com/yHUB54V.mp4
A: Wolfgang Amadeus Mozart https://imgur.com/f5oN3DV.jpg
3b) When clueing, use clues that are rather cryptic/indirect, but guessable by everyone if they put some thought into it (like the example above).
3c) Use questions with one or two complex word answers, but not more than that.
Q: In a paper published in the PNAS, Burrows and Ostriker (Pr: O-stry-ker) derive by dimensional and physical analysis using basic physical arguments, the characteristic masses and sizes of important objects in the universe in terms of just a few fundamental [BLANKS]. One of these is ħ, which is the reduced X Y (6 apostrophe 1, 8) (h/2 pi). What is h, which can be used to express all physical quantities in terms of only three dimensional quantities, one each for mass, length, and time?
A: Planck's constant (6.62607015 × 10^-34 m² kg/s)
3d) The answers can be those which are used by specialists in a field as well as known by a good number of people outside of the field, but not so frequently heard or seen (see example above).
3e) Letter count in the question is only necessary for questions which straddle between level 3 and 4 difficulty levels.

4. Level 4 should be questions whose answers are likely known by 25% of the participants or less, or likely to be answered by 25% or less because the time is not sufficient to work out the clues. So, it should be a higher-order thinking question, involving excellent recall from memory and/or heavy critical thinking. For this, either the question must have minimal clues or the clues should be complicated enough to keep players thinking for a while as the clock ticks. These are the ways one can achieve this:

4a) Keep any revealing tangential hints to zero. For e.g., if the answer is ‘Point Nemo’, no reference to the titular Clown Fish character in the animated movie should be made, as a hint.
4b) When clueing, use cryptic/indirect clues rather than direct clues. E.g. (bolded):
Q: This is a technique to look at atoms in biological systems using one of the smallest atomic particles. Name this technique (4 hyphen 8,10) which has revolutionized molecular biology, earning its creators the 2017 Nobel Prize for Chemistry. The first word is used for a type of preservation too (Short form accepted).
A: Cryo-electron microscopy/cryo-EM/Transmission electron cryomicroscopy/CryoTEM
4c) Use questions with longer/complex or multi-word answers (even up to 3 or 4 words, hyphenated or otherwise, as in the example above), or even two-part answers. This, however, doesn’t include questions which may require answering in a sentence if the answers themselves are quite easy. For example (of a question that is not level 4):
Q: Play video first. In the British sitcom ‘The IT Crowd’, the catchphrase Roy is referring to, is the most common advice he (and others in the IT department) gives for solving any IT problem. Even today, this advice works (but a temporary fix) for a lot of situations since it terminates the stuck process allowing any misbehaving code to replenish itself. Identify the catchphrase or solution. https://imgur.com/mXN2kR0.mp4
A: "Have you tried turning it off and on again?"/Rebooting/Restarting (accept answers with the same meaning).
Here is an example of a sentence-type answer that is level 4:
Q: Although rare, heteroparental superfecundation happens in humans, which results in the production of what biological oddity? Clytemnestra by Tyndareus and Helen by Zeus from Greek mythology are an example. https://imgur.com/nZntRlQ.jpg - Clytemnestra and Helen from Helen of Troy (2003)
A: Twins from two fathers (prompt on twins, or twins from different parents or similar). It's due to the fertilization of two different ova/eggs from the same batch by sperms of different males. Leda was the mother of both Clytemnestra and Helen.
Here is an example of a two-part answer:
Q: Subspecies of the birds found in the island country X are sometimes named taprobanus after its ancient name Taprobane (Pr: tap-ROB-a-nee) (16th-17th C). Examples include the subspecies of Pied Cuckoo Clamator jacobinus taprobanus, and a subspecies of Common Kingfisher Alcedo atthis taprobana. The latter subspecies is also found in the southern portion of an adjacent country. The X White-eye, endemic to X, has the name Zosterops ‘Y’ensis. Y is another old name for the country X. Identify X AND Y.
A: X = Sri Lanka and Y = Ceylon (the adjacent country being India) https://imgur.com/tuF1LZY.jpg
4d) The answers can be very niche, which only a subject specialist would have likely heard of.
4e) Don’t provide a letter count for the answer, within the question.
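
The four percentage bands above can be written down as a small lookup. The following is a minimal Python sketch, not part of EMU's tooling: the function name `level_from_percent` is hypothetical, and treating values that fall between the stated bands (for example 75.5%) as belonging to the easier level is an assumption made only for illustration.

```python
def level_from_percent(percent_correct: float) -> int:
    """Map the expected share of players answering correctly (0-100)
    to a guideline difficulty level from 1 (easiest) to 4 (hardest).

    Bands from the guidelines: 76-100 -> 1, 51-75 -> 2, 26-50 -> 3,
    25 or less -> 4. Values between bands (e.g. 75.5) go to the
    easier level; that tie-breaking rule is an assumption.
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct > 75:
        return 1
    if percent_correct > 50:
        return 2
    if percent_correct > 25:
        return 3
    return 4


if __name__ == "__main__":
    # Illustrative inputs only, not league data.
    for pct in (95, 60, 40, 24):
        print(pct, "->", level_from_percent(pct))
```

A setter could use something like this to turn a rough estimate ("about 60% of the room should get this") into the level the guidelines would assign, before checking the question against points a) to e) for that level.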

Example quad from season 1

While there are some examples of good gradation from last season, there are only a few perfect ones, as it is nearly impossible to predict which questions are of a particular difficulty level for 100+ people. Since we have more setters play-testing this season, and have a season of data, we can probably better predict the difficulty levels.

To decide what difficulty level your questions are, you can look up the difficulty levels calculated in the Public Scoreboard for season 1 for similar questions (you can look up the questions in the public sets using the corresponding quad codes). Quad code = theme number (1 to 12), dot, question difficulty level. For example, 12.1 means that the quad belongs to the theme of Wildlife and has a question difficulty level of 1 (the least difficult question of the quad).
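
For anyone cross-referencing the scoreboard with the public sets in a script, the quad code format described above can be split into its two parts. This is only a sketch under the stated format (theme 1 to 12, a dot, level 1 to 4); the names `QuadCode` and `parse_quad_code` are hypothetical and do not come from any EMU tool.

```python
from typing import NamedTuple


class QuadCode(NamedTuple):
    theme: int  # theme number, 1 to 12
    level: int  # assigned difficulty level, 1 to 4


def parse_quad_code(code: str) -> QuadCode:
    """Split a quad code such as '12.1' into theme number and level."""
    theme_text, dot, level_text = code.strip().partition(".")
    if not dot or not theme_text.isdecimal() or not level_text.isdecimal():
        raise ValueError(f"not a quad code: {code!r}")
    theme, level = int(theme_text), int(level_text)
    if not 1 <= theme <= 12:
        raise ValueError(f"theme number out of range (1-12): {theme}")
    if not 1 <= level <= 4:
        raise ValueError(f"difficulty level out of range (1-4): {level}")
    return QuadCode(theme, level)


if __name__ == "__main__":
    parsed = parse_quad_code("12.1")
    print(parsed.theme, parsed.level)
```

The code is kept as a string rather than a decimal number on purpose, so that the theme and the level are never mixed up by numeric rounding or formatting.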

Examples:

The following quad from season 1 game-week 1 has a perfect gradation, based on difficulty levels calculated post-week from the quarter-part distribution of player-wise data. For example, if a question has a get rate (corrects/total attempts) of 0.95, it's level 1; if it's 0.24, it's level 4.

Level 1:

Q: See video first. Which long-standing myth did Eratosthenes bust? It is also a myth that this pre-Eratosthenes myth was popular after this discovery, although it has seen a resurgence in the past few decades.

A: Flat Earth

Level 2:

Q: In the documentary 'Behind the Curve', we see flat-Earthers themselves disproving flat-earth with experiments (but failing to accept the results due to confirmation bias). In one of those (as shown in the video), they employed which device which harnesses the principle of conservation of angular momentum? (Video prompt: a spinning object)

[A word count can be given here to make sure it is a level 2]

A: Gyroscope

Level 3:

Q: A chamber of what element does he plan to use after the zero Gauss chamber also disproved flat Earth? While it has silvery white colour, its surface can oxidise to have an iridescent tinge. See the spectrum of colours of the chamber (in the video). This is appropriate given the two letter symbol which can be read differently in the context of sexuality.

A: Bismuth (Bi). The clue being 'Bisexual' shortened as Bi (and colours of the pride flag).

Level 4:

Q: What device (4, 5, 7), a magnetic shield with an incredibly high field attenuation, does he use to enclose the gyroscope to try to "disprove" the previous results? The name contains a number, a unit of measurement of magnetic induction, and a term commonly seen associated with commerce. It also has an alternate 9 letter name (Video prompt: a cylindrical object).

[Note that in this particular level 4 question, clues to every word and the word count have been given because it’s an extremely difficult question. However, the hints can still be reduced, and we will do that for the next season.]

A: Zero Gauss Chamber/MuChamber
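
One simple way to check whether a quad came out with the intended gradation is to compute each question's get rate (corrects / total attempts) and see whether it falls steadily from the level 1 question to the level 4 question. The sketch below does only that ordering check; it is a simplification and an assumption, not the quarter-part calculation used on the Public Scoreboard, and the function names and the numbers in the usage example are made up for illustration.

```python
from typing import Mapping


def get_rate(corrects: int, attempts: int) -> float:
    """Get rate = correct answers divided by total attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= corrects <= attempts:
        raise ValueError("corrects must be between 0 and attempts")
    return corrects / attempts


def has_perfect_gradation(rate_by_level: Mapping[int, float]) -> bool:
    """True if the get rate strictly drops from assigned level 1 to level 4."""
    rates = [rate_by_level[level] for level in (1, 2, 3, 4)]
    return all(easier > harder for easier, harder in zip(rates, rates[1:]))


if __name__ == "__main__":
    # Hypothetical (corrects, attempts) per assigned level, not real data.
    quad = {1: (95, 100), 2: (62, 100), 3: (40, 100), 4: (24, 100)}
    rates = {level: get_rate(c, a) for level, (c, a) in quad.items()}
    print(has_perfect_gradation(rates))
```

A strict comparison is used, so two questions with identical get rates count as a failed gradation; a league could reasonably relax that.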

EMU’s Substack
