### Measuring reader frustration: Hapaxes percentage

Each time a confident beginning reader encounters a word that occurs only once in a text, a little seed of frustration is sown: there is no second chance for review, and no other context to shed light on the word’s meaning. Corpus linguistics calls such a word a hapax legomenon (Greek for “something said only once”); the plural is hapax legomena, or hapaxes for short.

For every word that occurs more than once in a text, the reader has a chance to refresh their memory and grow their understanding of the word’s use in context.

The more hapaxes a text contains, the more frustrating and tedious reading and vocabulary learning become for the confident beginning reader. (Of course, re-reading always rewards the reader, and one of the best ways to cement the knowledge of a rare word is to see it used again; however, this is small consolation during the initial struggles.) We can quantify this frustration with the formula:

$\frac{\text{Number of words that occur only once}}{\text{Total number of words in the text}} \times 100 = \text{Percentage of words that are hapax legomena}$
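The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the tables in this article: the `tokenize` and `hapax_stats` helpers are hypothetical names, and the tokenizer simply lowercases the text and splits on letters, whereas counting "unique dictionary entries" would presumably require reducing inflected forms to their dictionary headwords first.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: no lemmatization, so 'sieht' and 'sehen' count as
    different words. A real analysis would likely lemmatize first.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def hapax_stats(text):
    """Return (total words, distinct words, hapaxes, hapax percentage)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # A hapax is any word whose count in this text is exactly one.
    hapaxes = sum(1 for n in counts.values() if n == 1)
    total = len(tokens)
    percent = 100 * hapaxes / total if total else 0.0
    return total, len(counts), hapaxes, round(percent, 2)


sample = "Der Hund sieht den Mann und der Mann sieht den Hund nicht"
print(hapax_stats(sample))  # (12, 7, 2, 16.67)
```

In the sample sentence only "und" and "nicht" occur once, so 2 of the 12 words are hapaxes.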

The following table shows the number of hapaxes in the major works of Franz Kafka, sorted by hapax percentage:

| Work | Total Words | Unique Dictionary Entries | Hapax legomena | Percent of words that are hapax legomena |
|---|---:|---:|---:|---:|
| Das Schloß | 108,155 | 6,588 | 2,821 | 2.61 |
| Der Prozeß | 71,776 | 5,160 | 2,182 | 3.04 |
| Amerika | 89,152 | 6,659 | 3,091 | 3.47 |
| Die Verwandlung | 19,157 | 2,894 | 1,520 | 7.93 |
| Brief an Den Vater | 16,292 | 2,638 | 1,437 | 8.82 |
| In der Strafkolonie | 10,280 | 1,894 | 1,003 | 9.76 |
| Vier Geschichten | 13,072 | 2,437 | 1,333 | 10.2 |
| Ein Landarzt | 13,139 | 2,778 | 1,595 | 12.14 |
| Betrachtung | 5,867 | 1,494 | 887 | 15.12 |
| Das Urteil | 3,995 | 1,089 | 682 | 17.07 |
| Aphorisms | 3,502 | 1,016 | 608 | 17.36 |
| Kleinere Werke | 1,709 | 666 | 410 | 23.99 |

Some conclusions that may help guide our study choices: Amerika is about 19,000 words shorter than Das Schloß, yet it contains 270 more words that occur only once. Die Verwandlung is about 6,000 words longer than Ein Landarzt, yet the two require about the same amount of vocabulary to be learned (2,894 versus 2,778 dictionary entries), and Die Verwandlung has fewer hapaxes (1,520 versus 1,595).

The hapax counts in the table above are not cumulative: a word that occurs only once in one work may well appear in another. Taken as a whole, the Franz Kafka corpus above contains 356,096 words but only 5,461 hapaxes, just 1.53 percent of the total. Moreover, these rare words typically appear elsewhere in modern German, as Kafka’s language is relatively close to contemporary German. By contrast, in Classical Latin and Greek, one encounters hapax legomena that not only occur once in a work but also appear nowhere else in the limited surviving corpus of the language.
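A small sketch shows why per-work hapax counts cannot simply be added up. The two "works" below are invented toy token lists, and `hapaxes` is a hypothetical helper; the point is only that a word occurring once in each of two works occurs twice in the combined corpus and so stops being a hapax.

```python
from collections import Counter


def hapaxes(tokens):
    """Return the set of words that occur exactly once in tokens."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n == 1}


# Toy data, invented for illustration only.
works = {
    "work_a": "der hund bellt und der mann lacht".split(),
    "work_b": "der mann bellt nicht".split(),
}

# Hapaxes counted separately within each work.
per_work = {title: hapaxes(tokens) for title, tokens in works.items()}
summed = sum(len(words) for words in per_work.values())

# Hapaxes counted once over the combined corpus.
all_tokens = [t for tokens in works.values() for t in tokens]
corpus_hapaxes = hapaxes(all_tokens)

print(summed)               # 9
print(len(corpus_hapaxes))  # 4
```

Here "bellt" and "mann" are hapaxes within each toy work but not within the combined corpus, which is why the corpus total is smaller than the sum of the per-work counts.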

Confident beginning readers should keep in mind that a word that is rare in one text may be common elsewhere: the reader may see it again in other texts or even hear it in conversation. By pairing texts with a record of the user’s lookups, this web application aims to greatly reduce the frustration beginning readers feel when they meet a word that occurs only once.

The same analysis can be applied to the individual pieces within the shorter collections.

For the collection Betrachtung:

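| Section | Title | Total Words | Unique Dictionary Entries | Hapax legomena | Percent of words that are hapax legomena |
|---:|---|---:|---:|---:|---:|
| 18 | Unglücklichsein | 1,403 | 491 | 315 | 22.45 |
| 1 | Kinder auf der Landstraße | 1,075 | 480 | 311 | 28.93 |
| 2 | Entlarvung eines Bauernfängers | 629 | 327 | 219 | 34.82 |
| 5 | Der Ausflug ins Gebirge | 137 | 97 | 52 | 37.96 |
| 7 | Der Kaufmann | 598 | 336 | 242 | 40.47 |
| 10 | Die Vorüberlaufenden | 162 | 108 | 71 | 43.83 |
| 9 | Der Nachhauseweg | 153 | 105 | 76 | 49.67 |
| 11 | Der Fahrgast | 234 | 160 | 118 | 50.43 |
| 8 | Zerstreutes Hinausschaun | 88 | 65 | 44 | 50 |
| 3 | Der plötzliche Spaziergang | 250 | 181 | 131 | 52.4 |
| 6 | Das Unglück des Junggesellen | 143 | 109 | 77 | 53.85 |
| 12 | Kleider | 143 | 112 | 78 | 54.55 |
| 14 | Zum Nachdenken für Herrenreiter | 259 | 192 | 144 | 55.6 |
| 13 | Die Abweisung | 191 | 143 | 107 | 56.02 |
| 15 | Das Gassenfenster | 113 | 91 | 66 | 58.41 |
| 17 | Die Bäume | 43 | 41 | 27 | 62.79 |
| 4 | Entschlüsse | 181 | 152 | 117 | 64.64 |
| 16 | Wunsch, Indianer zu werden | 65 | 62 | 45 | 69.23 |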

For the collection Vier Geschichten:

| Section | Title | Total Words | Unique Dictionary Entries | Hapax legomena | Percent of words that are hapax legomena |
|---:|---|---:|---:|---:|---:|
| 4 | Josefine, die Sängerin oder Das Volk der Mäuse | 6,040 | 1,412 | 795 | 13.16 |
| 2 | Eine kleine Frau | 2,699 | 845 | 514 | 19.04 |
| 3 | Ein Hungerkünstler | 3,424 | 1,086 | 710 | 20.74 |
| 1 | Erstes Leid | 909 | 445 | 316 | 34.76 |

For the collection Kleinere Werke:

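| Section | Title | Total Words | Unique Dictionary Entries | Hapax legomena | Percent of words that are hapax legomena |
|---:|---|---:|---:|---:|---:|
| 7 | Vom Scheintod | 257 | 135 | 68 | 26.46 |
| 6 | Eine alltägliche Verwirrung | 309 | 186 | 117 | 37.86 |
| 2 | Die Erfindung des Teufels | 228 | 148 | 94 | 41.23 |
| 8 | Heimkehr | 252 | 153 | 105 | 41.67 |
| 3 | Die Chinesischen Mauer und der Turmbau von Babel | 366 | 230 | 156 | 42.62 |
| 4 | Prometheus | 120 | 80 | 55 | 45.83 |
| 5 | Die Wahrheit über Sancho Pansa | 99 | 83 | 62 | 62.63 |
| 1 | Kleine Fabel | 78 | 68 | 50 | 64.1 |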

For the collection Ein Landarzt:

| Section | Title | Total Words | Unique Dictionary Entries | Hapax legomena | Percent of words that are hapax legomena |
|---:|---|---:|---:|---:|---:|
| 14 | Ein Bericht für eine Akademie | 3,189 | 1,037 | 641 | 20.1 |
| 2 | Ein Landarzt | 2,132 | 806 | 539 | 25.28 |
| 11 | Elf Söhne | 1,608 | 644 | 428 | 26.62 |
| 6 | Schakale und Araber | 1,301 | 548 | 357 | 27.44 |
| 5 | Vor dem Gesetz | 588 | 270 | 172 | 29.25 |
| 7 | Ein Besuch im Bergwerk | 879 | 415 | 274 | 31.17 |
| 13 | Ein Traum | 720 | 353 | 239 | 33.19 |
| 4 | Ein altes Blatt | 691 | 365 | 250 | 36.18 |
| 9 | Eine kaiserliche Botschaft | 323 | 190 | 127 | 39.32 |
| 10 | Die Sorge des Hausvaters | 476 | 275 | 189 | 39.71 |
| 12 | Ein Brudermord | 616 | 363 | 272 | 44.16 |
| 1 | Der neue Advokat | 260 | 183 | 141 | 54.23 |
| 3 | Auf der Galerie | 290 | 207 | 161 | 55.52 |
| 8 | Das nächste Dorf | 66 | 70 | 53 | 80.3 |

One thing is clear: for a confident beginning reader setting out to read Kafka's short stories for the first time, reading them in their published order is probably not the best approach. Students should strongly consider starting with the pieces that have a lower percentage of frustrating hapaxes. For a slightly different approach to learning progressions, see the article Which German text to read when? Learning a language is difficult; let’s make it as easy as possible.
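As a rough illustration of choosing a reading order by hapax percentage, the sketch below sorts the four pieces of Vier Geschichten using the word and hapax counts from the table for that collection. The `Story` type and its field names are hypothetical, introduced only for this example, and sorting on this single measure is just one possible heuristic.

```python
from typing import NamedTuple


class Story(NamedTuple):
    section: int       # position in the published collection
    title: str
    total_words: int
    hapaxes: int

    @property
    def hapax_percent(self):
        """Percentage of words in the story that occur only once."""
        return round(100 * self.hapaxes / self.total_words, 2)


# Figures taken from the Vier Geschichten table in this article.
stories = [
    Story(1, "Erstes Leid", 909, 316),
    Story(2, "Eine kleine Frau", 2699, 514),
    Story(3, "Ein Hungerkünstler", 3424, 710),
    Story(4, "Josefine, die Sängerin oder Das Volk der Mäuse", 6040, 795),
]

# Lowest hapax percentage first, i.e. least frustrating first.
reading_order = sorted(stories, key=lambda s: s.hapaxes / s.total_words)

for story in reading_order:
    print(f"{story.hapax_percent:6.2f}  {story.section}  {story.title}")
```

Sorted this way, the suggested order is sections 4, 2, 3, 1, which reverses the position of Erstes Leid from first to last.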

Written by Todd Cook. Language enthusiast, modern outdoorsman, software craftsman.