Measuring reader frustration: hapax percentage

Each time a confident beginning reader encounters a word that occurs only once in a text, a little seed of frustration is sown: there is no second chance for review, and no other context to shed light on the word’s meaning. Corpus linguistics calls such a word a hapax legomenon (Greek for “something said only once”); the plural is hapax legomena, and the short form is hapaxes.

For every word that occurs more than once in a text, the reader has a chance to refresh their memory and grow their understanding of the word’s use in context.

The more words that occur only once in a text, the more frustrating and tedious reading and vocabulary learning become for the confident beginning reader. (Of course, re-reading always rewards the reader, and one of the best ways to cement the knowledge of a rare word is to see it used again; however, this is small consolation during the initial struggles.) We can quantify this frustration with the formula:

$$\text{Percentage of words that are hapax legomena} = \frac{\text{Total number of words that occur only once}}{\text{Total number of words in the text}} \times 100$$
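As a minimal sketch, the ratio above can be computed in a few lines of Python. The function name `hapax_percentage` and the tokenization rule are assumptions made for illustration: here a word is simply a lowercased run of letters, with no reduction to dictionary entries, so the results will not exactly match counts based on dictionary forms.

```python
import re
from collections import Counter


def hapax_percentage(text: str) -> float:
    """Return the percentage of running words that occur exactly once."""
    # Assumption: a "word" is a lowercased run of letters (umlauts and ß
    # included); inflected forms are NOT merged into dictionary entries.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return 100.0 * hapaxes / len(words)


# Invented sample sentence, not a quotation from any of the works.
sample = "Der Hund sieht den Hund und der Mann sieht den Mond"
print(round(hapax_percentage(sample), 2))  # 3 hapaxes / 11 words -> 27.27
```

In the sample, "und", "Mann" and "Mond" each occur once among eleven running words, which gives 27.27 percent.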

The following table shows the instances of hapaxes in the major works of Franz Kafka:

| Work | Total words | Unique dictionary entries | Hapax legomena | Hapax legomena (% of total words) |
|---|---|---|---|---|
| Das Schloß | 108,155 | 6,588 | 2,821 | 2.61 |
| Der Prozeß | 71,776 | 5,160 | 2,182 | 3.04 |
| Amerika | 89,152 | 6,659 | 3,091 | 3.47 |
| Die Verwandlung | 19,157 | 2,894 | 1,520 | 7.93 |
| Brief an Den Vater | 16,292 | 2,638 | 1,437 | 8.82 |
| In der Strafkolonie | 10,280 | 1,894 | 1,003 | 9.76 |
| Vier Geschichten | 13,072 | 2,437 | 1,333 | 10.2 |
| Ein Landarzt | 13,139 | 2,778 | 1,595 | 12.14 |
| Betrachtung | 5,867 | 1,494 | 887 | 15.12 |
| Das Urteil | 3,995 | 1,089 | 682 | 17.07 |
| Aphorisms | 3,502 | 1,016 | 608 | 17.36 |
| Kleinere Werke | 1,709 | 666 | 410 | 23.99 |

Some conclusions that may help guide our study choices: Das Schloß is roughly 19,000 words longer than Amerika, yet reading Amerika requires learning 270 more words that occur only once (3,091 versus 2,821). Die Verwandlung is about 6,000 words longer than Ein Landarzt, yet both require about the same amount of vocabulary to be learned (2,894 versus 2,778 dictionary entries), and Die Verwandlung has fewer hapaxes (1,520 versus 1,595).
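Comparisons of this kind are simple subtractions on the table rows. The following sketch uses the counts for Die Verwandlung and Ein Landarzt copied from the table above; the names `WORKS` and `compare` are hypothetical helpers introduced only for this illustration.

```python
# Counts copied from the table above:
# (total words, unique dictionary entries, hapax legomena).
WORKS = {
    "Die Verwandlung": (19_157, 2_894, 1_520),
    "Ein Landarzt": (13_139, 2_778, 1_595),
}


def compare(a: str, b: str) -> dict:
    """Return the differences (a minus b) in length, vocabulary and hapaxes."""
    total_a, entries_a, hapax_a = WORKS[a]
    total_b, entries_b, hapax_b = WORKS[b]
    return {
        "extra_words": total_a - total_b,
        "extra_entries": entries_a - entries_b,
        "extra_hapaxes": hapax_a - hapax_b,
    }


print(compare("Die Verwandlung", "Ein Landarzt"))
# {'extra_words': 6018, 'extra_entries': 116, 'extra_hapaxes': -75}
```

A negative `extra_hapaxes` means the first work has fewer words that occur only once, even though it is the longer text.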

The hapax counts in the table above are not cumulative: taken as a whole, the Franz Kafka corpus above contains 356,096 words yet only 5,461 hapaxes, just 1.53 percent of the total, because a word that occurs once in one work often recurs in another. Moreover, these rare words typically appear elsewhere in modern German, as Kafka’s language is relatively close to contemporary German. By contrast, in Classical Latin and Greek one encounters hapax legomena that not only occur once in a work but are also unattested anywhere else in the limited surviving corpus of the language.
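To see why per-work hapax counts do not add up, consider a toy sketch with two invented "works" (the sentences below are made up for illustration and are not quotations from Kafka). The helper `hapaxes` is hypothetical and again assumes that a word is a lowercased run of letters, with no reduction to dictionary entries.

```python
import re
from collections import Counter


def hapaxes(text: str) -> set:
    """Return the set of words that occur exactly once in the text."""
    # Assumption: lowercased runs of letters count as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {word for word, n in Counter(words).items() if n == 1}


# Invented toy texts standing in for two separate works.
work_a = "der Käfer kriecht und der Käfer schläft"
work_b = "der Richter schläft und der Richter wartet"

print(sorted(hapaxes(work_a)))                 # ['kriecht', 'schläft', 'und']
print(sorted(hapaxes(work_b)))                 # ['schläft', 'und', 'wartet']
print(sorted(hapaxes(work_a + " " + work_b)))  # ['kriecht', 'wartet']
```

Each toy work has three hapaxes, but the combined text has only two, because "und" and "schläft" occur once in each work and therefore twice overall.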

Confident beginning readers should keep in mind that a word that is rare in one text may be common elsewhere; the reader may see it again in other texts or even hear it in conversation. This web application, which pairs texts with tracking of the user’s lookups, aims to greatly reduce the frustration that beginning readers feel when they encounter a word that occurs only once.

This analysis can also be done on the smaller works.

For the collection Betrachtung:

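| Section | Title | Total words | Unique dictionary entries | Hapax legomena | Hapax legomena (% of total words) |
|---|---|---|---|---|---|
| 18 | Unglücklichsein | 1,403 | 491 | 315 | 22.45 |
| 1 | Kinder auf der Landstraße | 1,075 | 480 | 311 | 28.93 |
| 2 | Entlarvung eines Bauernfängers | 629 | 327 | 219 | 34.82 |
| 5 | Der Ausflug ins Gebirge | 137 | 97 | 52 | 37.96 |
| 7 | Der Kaufmann | 598 | 336 | 242 | 40.47 |
| 10 | Die Vorüberlaufenden | 162 | 108 | 71 | 43.83 |
| 9 | Der Nachhauseweg | 153 | 105 | 76 | 49.67 |
| 11 | Der Fahrgast | 234 | 160 | 118 | 50.43 |
| 8 | Zerstreutes Hinausschaun | 88 | 65 | 44 | 50 |
| 3 | Der plötzliche Spaziergang | 250 | 181 | 131 | 52.4 |
| 6 | Das Unglück des Junggesellen | 143 | 109 | 77 | 53.85 |
| 12 | Kleider | 143 | 112 | 78 | 54.55 |
| 14 | Zum Nachdenken für Herrenreiter | 259 | 192 | 144 | 55.6 |
| 13 | Die Abweisung | 191 | 143 | 107 | 56.02 |
| 15 | Das Gassenfenster | 113 | 91 | 66 | 58.41 |
| 17 | Die Bäume | 43 | 41 | 27 | 62.79 |
| 4 | Entschlüsse | 181 | 152 | 117 | 64.64 |
| 16 | Wunsch, Indianer zu werden | 65 | 62 | 45 | 69.23 |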

For the collection Vier Geschichten:

| Section | Title | Total words | Unique dictionary entries | Hapax legomena | Hapax legomena (% of total words) |
|---|---|---|---|---|---|
| 4 | Josefine, die Sängerin oder Das Volk der Mäuse | 6,040 | 1,412 | 795 | 13.16 |
| 2 | Eine kleine Frau | 2,699 | 845 | 514 | 19.04 |
| 3 | Ein Hungerkünstler | 3,424 | 1,086 | 710 | 20.74 |
| 1 | Erstes Leid | 909 | 445 | 316 | 34.76 |

For the collection Kleinere Werke:

| Section | Title | Total words | Unique dictionary entries | Hapax legomena | Hapax legomena (% of total words) |
|---|---|---|---|---|---|
| 7 | Vom Scheintod | 257 | 135 | 68 | 26.46 |
| 6 | Eine alltägliche Verwirrung | 309 | 186 | 117 | 37.86 |
| 2 | Die Erfindung des Teufels | 228 | 148 | 94 | 41.23 |
| 8 | Heimkehr | 252 | 153 | 105 | 41.67 |
| 3 | Die Chinesischen Mauer und der Turmbau von Babel | 366 | 230 | 156 | 42.62 |
| 4 | Prometheus | 120 | 80 | 55 | 45.83 |
| 5 | Die Wahrheit über Sancho Pansa | 99 | 83 | 62 | 62.63 |
| 1 | Kleine Fabel | 78 | 68 | 50 | 64.1 |

For the collection Ein Landarzt:

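| Section | Title | Total words | Unique dictionary entries | Hapax legomena | Hapax legomena (% of total words) |
|---|---|---|---|---|---|
| 14 | Ein Bericht für eine Akademie | 3,189 | 1,037 | 641 | 20.1 |
| 2 | Ein Landarzt | 2,132 | 806 | 539 | 25.28 |
| 11 | Elf Söhne | 1,608 | 644 | 428 | 26.62 |
| 6 | Schakale und Araber | 1,301 | 548 | 357 | 27.44 |
| 5 | Vor dem Gesetz | 588 | 270 | 172 | 29.25 |
| 7 | Ein Besuch im Bergwerk | 879 | 415 | 274 | 31.17 |
| 13 | Ein Traum | 720 | 353 | 239 | 33.19 |
| 4 | Ein altes Blatt | 691 | 365 | 250 | 36.18 |
| 9 | Eine kaiserliche Botschaft | 323 | 190 | 127 | 39.32 |
| 10 | Die Sorge des Hausvaters | 476 | 275 | 189 | 39.71 |
| 12 | Ein Brudermord | 616 | 363 | 272 | 44.16 |
| 1 | Der neue Advokat | 260 | 183 | 141 | 54.23 |
| 3 | Auf der Galerie | 290 | 207 | 161 | 55.52 |
| 8 | Das nächste Dorf | 66 | 70 | 53 | 80.3 |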
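The per-collection tables above can be turned into a suggested reading order by sorting on the hapax percentage. The sketch below does this for Vier Geschichten, using the word and hapax counts copied from that collection's table. The names `Story` and `reading_order` are hypothetical, and ranking by this single number is only one possible heuristic.

```python
from typing import NamedTuple


class Story(NamedTuple):
    section: int       # position in the published collection
    title: str
    total_words: int
    hapaxes: int

    @property
    def hapax_percent(self) -> float:
        return 100.0 * self.hapaxes / self.total_words


# Counts copied from the Vier Geschichten table above, in published order.
VIER_GESCHICHTEN = [
    Story(1, "Erstes Leid", 909, 316),
    Story(2, "Eine kleine Frau", 2_699, 514),
    Story(3, "Ein Hungerkünstler", 3_424, 710),
    Story(4, "Josefine, die Sängerin oder Das Volk der Mäuse", 6_040, 795),
]


def reading_order(stories):
    """Sort stories from the lowest to the highest hapax percentage."""
    return sorted(stories, key=lambda s: s.hapax_percent)


for story in reading_order(VIER_GESCHICHTEN):
    print(f"{story.section}  {story.hapax_percent:5.2f}  {story.title}")
```

For this collection the suggested order is sections 4, 2, 3, 1, which differs from the published order.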

One thing is clear: for a confident beginning reader setting out to read Kafka's short stories for the first time, reading them in their published order is probably not the best approach. Students should strongly consider starting with the pieces that have a lower percentage of frustrating hapaxes. For a slightly different approach to learning progressions, see the article Which German text to read when? Learning a language is difficult; let’s make it as easy as possible.

Written by Todd Cook. Language enthusiast, modern outdoorsman, software craftsman.
