German Gender and Big Data

Hello everyone,

and first of all let me say a big fat

Thank you for 30.000 comments

Yup, we passed it some day last week and usually I do some kind of giveaway but I actually missed it this time.
So we’ll do the next one at 40.000, I guess :). But yeah, vielen lieben Dank. Your comments are part of the reason why this blog even made it through the first few years and without them it would feel really lonely around here. And also, you ask lots of great questions that add value to the article itself.
I read every single comment and I’ll continue to do that, so keep ’em coming :).

And now on to our next point which is… my failure.
Honestly, this week was a complete failure for me, article-wise. My plan was to do an exercise on noun gender, and I thought I had an idea.
I didn’t want to test single nouns but rather the rules that Slavica told us about last time. But for this to actually work, we’d need to do it flashcard-style… so you get asked the same thing repeatedly until it’s automatic. And then I realized I actually don’t have a flashcard setup in my quiz software.
So then I thought “Okay, I’m gonna do nouns” and I looked for lists of the most common nouns and so on, but actually it’s the same thing… I need a flashcard setup, not just “one up multiple choice”. So yeah… there went my idea of doing a gender practice :/.

I do have something interesting about gender though, because a while back someone sent me a really interesting email. His day name is Emmanuel Haton, but I’ll refer to him with his secret identity…. Excel-Man. He had done some serious number crunching on the issue of German gender and I found this so interesting that I absolutely had to share that with you.

So, using a giant database of German nouns and their frequency, Excel-Man used his superpower to check for trends between endings and gender, and calculated how accurate each rule of thumb is if you just follow it blindly.
And he didn’t stop there. He then collected the most common exceptions to each rule of thumb and calculated the accuracy again. And it’s all weighted by frequency, mind you. So if 99 of 100 words follow a certain pattern but that one exception is super common, learning it will increase the accuracy quite a bit.
And to top it all off, it even accounts for compounds. So if die Sicht was an exception for instance, its frequency would also include words like Einsicht and Aussicht and so on.
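
To make the "weighted by frequency" idea a bit more concrete, here is a minimal sketch in Python. It is not Excel-Man's actual spreadsheet, just an illustration under assumptions: the names `Noun` and `weighted_accuracy` are made up for this example, and the frequencies are toy numbers, not DeReWo data.

```python
from dataclasses import dataclass


@dataclass
class Noun:
    word: str
    gender: str       # "m", "f" or "n"
    frequency: float  # occurrences in the corpus, compounds included


def weighted_accuracy(nouns, guess, learned=()):
    """Share of noun occurrences where `guess` (or a memorized exception) is right."""
    total = sum(n.frequency for n in nouns)
    correct = sum(
        n.frequency
        for n in nouns
        if n.gender == guess or n.word in learned
    )
    return correct / total


# Toy numbers for illustration only, NOT taken from DeReWo.
nouns = [
    Noun("Zeitung", "f", 40),
    Noun("Wohnung", "f", 35),
    Noun("Sprung", "m", 25),
]

print(weighted_accuracy(nouns, "f"))                      # 0.75
print(weighted_accuracy(nouns, "f", learned={"Sprung"}))  # 1.0
```

In this toy list, two out of three words follow the rule, but because the exception is frequent, blindly saying "feminine" is right for 75% of the occurrences, and memorizing that single exception lifts it to 100%.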

He used the DeReWo frequency database of the Leibniz-Institut für Deutsche Sprache, which contains about 23,000 singular nouns and which is based on a collection of recent German texts totalling 23 billion words. That’s thousands and thousands of words.
So it’s a pretty damn nice insight into regularities and how regular they are and all in all an incredible piece of work and I’m gonna share part of his email and the tables with you below.

Not all of it is equally useful in practice, but I’m sure you’ll find some really nice ideas and insights. And it’s generally just very interesting to see how accurate these trends really are.
So… take a look and then tell me what you think. I’m really really curious for your feedback and if you can use this for your studies.
Oh and also, since I don’t have an exercise, if you have a great tool for practicing German gender, please share it in the comments. I know there’s loads of apps out there but I have no idea what’s good.
So, I’ll see you in the comments, have a great weekend and bis nächste Woche :)

Oh, by the way… there’s a big new feature about to come to the website. Not sure when, but I’d say two weeks at the most.
It’s really really exciting, so get ready :) :) :).

***

German Gender and Big Data
(by Excel-Man)

I looked into the question of noun gender in German. I know, for a native speaker, this is a most uninteresting topic. But that’s only because you’ve been immersed all along in the language and therefore gender is obvious. Not so for a foreign learner, for whom it is something of an enigma and a holy grail. And for a language perfectionist like me it is a tragedy: I can easily umsonst walk by a bakery if I can’t remember the correct gender for Kuchen…

I decided to have a go at it, not from a semantic point of view (meaning: babies, diminutives and young animals are neuter; days, months and seasons are masculine, etc.) but using another ally… drumroll…

BIG DATA

Ok, really simple statistics. Ok, really some analysis in an Excel table. My hope was to find some usable rules, and I did find some. As always in languages, few rules are valid 100% of the time, but I found some regularities that can save the day with a pretty good success rate.

Also, I made sure to consider frequency data. I have read that overall in the German vocabulary the percentage of masculine/feminine words is such and such, which can help make some “bets” when you don’t know the gender of a word. But to be usable, this kind of tip needs to be based on the usage frequency of the words, not only on their raw number.

Ideally usage frequency of spoken German should be considered, because rules of thumb are much more useful when speaking, as you don’t have the possibility to open a dictionary and check. Unfortunately I couldn’t get my hands on any such spoken word frequency database. Instead I relied on the general DeReWo database from the Leibniz-Institut für Deutsche Sprache.

The results

Some known things were confirmed:

  •  words in -UNG are feminine except der Sprung
  •  words in -ION are mostly feminine
  •  …

But there were also some less obvious things, for instance about the gender of words that end in -r.

The gender of words ending in -R

For some reason a final R sounds masculine to me, but that’s not exactly true.
First of all, let’s be semantic for one second: if the -R word designates a feminine person (die Mutter) or a masculine person (der Bauer), then go for the person’s gender.
The exception is das Opfer, and I remember it by thinking that the Opfer doesn’t actually do anything. Similarly, devices that do something (der Computer) are masculine.
– otherwise, -AUER is always feminine
– other words ending in -ER are, to my surprise, more often neuter (das Meer, das Messer), except if the first letter is K. In that case, go for masculine (der Koffer).
– words ending in -UR and -HR are mostly feminine, apart from some exceptions, including das Jahr
– for other -R words (that is, not ending in -ER), try masculine and you’ll be right 78% of the time.
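
The -R rules above can be read as a little decision procedure. Here is a rough sketch of it in Python. The function name `guess_gender_r` and the parameters `person_gender` and `is_device` are hypothetical, introduced only for this illustration. Whether a word names a person or a device cannot be read off the spelling, so the caller has to supply that. The exceptions (das Opfer, das Jahr, …) are deliberately left out.

```python
def guess_gender_r(word, person_gender=None, is_device=False):
    """Guess the gender of a noun ending in -r using the rules of thumb above.

    person_gender: "f" or "m" if the noun names a female/male person, else None.
    is_device: True if the noun names a device that does something.
    Both are assumptions the caller has to provide.
    """
    w = word.lower()
    if not w.endswith("r"):
        raise ValueError("this sketch only covers nouns ending in -r")
    if person_gender in ("f", "m"):
        return person_gender      # die Mutter, der Bauer
    if w.endswith("er"):
        if is_device:
            return "m"            # der Computer
        if w.endswith("auer"):
            return "f"            # -AUER is feminine
        if w.startswith("k"):
            return "m"            # der Koffer
        return "n"                # das Meer, das Messer
    if w.endswith(("ur", "hr")):
        return "f"                # mostly feminine
    return "m"                    # other -R words: try masculine


print(guess_gender_r("Mutter", person_gender="f"))  # f
print(guess_gender_r("Computer", is_device=True))   # m
print(guess_gender_r("Koffer"))                     # m
print(guess_gender_r("Messer"))                     # n
print(guess_gender_r("Jahr"))                       # f (wrong: das Jahr is an exception)
```

The last line shows why the exceptions matter: the rule alone gets das Jahr wrong, and das Jahr is a very common word.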

Words ending with I

  •  words ending in “-ei” are overwhelmingly feminine
  •  other words in -i are 95% masculine

Other endings

  •  words ending in -f, -b, -g (except -ung) are about 90% masculine
  •  words ending in -us, -uss, -uß (except das Haus) are about 90% masculine.
     Easy to remember when thinking that -us is the typical masculine marker in Latin.
  •  similarly, words in -um (the typical Latin marker for neuter) are indeed almost all neuter, except for…
  •  …those in -aum, which are masculine.
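
If you like to think in code, the ending rules in this list boil down to "find the longest ending you know and say its gender". Below is a small sketch of that idea in Python. The names `RULES`, `EXCEPTIONS` and `guess_gender` are made up for this example, and only the endings and the two exceptions mentioned in the text so far (das Haus, der Sprung) are included.

```python
# Endings and trends taken from the list above.
RULES = {
    "ung": "f",
    "aum": "m",
    "uss": "m",
    "ei": "f",
    "us": "m",
    "uß": "m",
    "um": "n",
    "i": "m",
    "f": "m",
    "b": "m",
    "g": "m",
}

# Exceptions named in the text; a fuller list would come from the tables.
EXCEPTIONS = {"haus": "n", "sprung": "m"}


def guess_gender(word):
    """Return "m", "f" or "n" based on the ending, or None if no rule applies."""
    w = word.lower()
    # Memorized exceptions first; endswith also catches compounds like Rathaus.
    for known, gender in EXCEPTIONS.items():
        if w.endswith(known):
            return gender
    # Longest ending wins, so -ung beats -g and -aum beats -um.
    for ending in sorted(RULES, key=len, reverse=True):
        if w.endswith(ending):
            return RULES[ending]
    return None


print(guess_gender("Bäckerei"))  # f
print(guess_gender("Baum"))      # m
print(guess_gender("Zentrum"))   # n
print(guess_gender("Zeitung"))   # f
print(guess_gender("Rathaus"))   # n
print(guess_gender("Kuchen"))    # None
```

Checking the longer endings first is what keeps -ung from being swallowed by the -g rule. A word like Taxi would still come out wrong here, which is exactly what the exception lists are for.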

For more statistical data, see results in tables A and B below. In every case you can see that learning a few exceptions seriously increases the probability of guessing right.

Finally, usage statistics let us know the list of words that are so frequent, including the frequency of their compounds, that it is worth learning their gender by heart (see table C).

Conclusion: doing this analysis was slightly geeky but fun. In total, the situations described here cover over 70% of the usage.
Enough to make me feel safer when ordering einen Kuchen at the bakery. And yes, it is der Kuchen.

** if you’re reading on mobile, please flip to landscape!! The full tables should fit then **

Table A

Gender of nouns in -R

(in total, 16% of all word occurrences, so these words make up 16% of the total nouns used)

| ending | trend | accuracy | extras | accuracy with extras |
|---|---|---|---|---|
| -er and is a feminine (Mutter) or masculine person (Bauer) | f/m | 99% | das Opfer | 100% |
| -er and a device that does something, or a month | m | 97% | das Thermometer (and other “meter” devices), die Leiter | 99% |
| -auer (if not a masculine person) | f | 100% | | |
| begins with K- | m | 91% | das Klavier, die Kammer, das Kloster | 100% |
| other words in -er | n | 58% | der Fehler, der Meter, der Liter, der Ärger, der Hunger, der Finger, die Nummer, die Steuer, die Feier | 80% |
| -hr or -ur | f | 67% | das Jahr, das Ohr, das Rohr, der Verkehr, der Schwur, der Azur | 100% |
| other words ending in -r | m | 78% | das Tor, das Paar, das Haar, die Tür, die Bar, die Schar | 94% |

Table B

Other noun gender rules

| when a noun ends with: | if you say: | then you are x% right | and if you learn these exceptions: | then you are x% right | (this rule covers x% of total noun occurrences) |
|---|---|---|---|---|---|
| -ung | f | 99% | der Sprung | 100% | 9.2% |
| -ion | f | 98% | der Champion, der Spion, der Skorpion, das Stadion, das Ion | 100% | 2.3% |
| -eit | f | 98% | der Streit | 100% | 2.7% |
| -ft | f | 94% | der Saft, der Lift, der Stift, das Geschäft, das Heft, der Schaft (but other words in -schaft are f) | 99% | 1.7% |
| -e (except if a masculine person) | f | 90% | das Ende, das Interesse, das Gebäude, das Gelände, das Finale, das Auge, der Name, der Kaffee, der Schnee, der Käse | 97% | .4% |
| -nn | m | 98% | das Kinn, das Zinn | 99% | 0.7% |
| -g | m | 91% | die Burg, das Zeug, das XXX-ing | 99% | 2.9% |
| -pf | m | 99% | das Geschöpf | 100% | 0.3% |
| -ik | f | 99% | der Streik, das Mosaik | 100% | 1.0% |
| -b | m | 89% | das Lob, das Grab, das Weib, das Pub, das Kalb, das Verb, das Laub | 99% | 0.4% |
| -f | m | 88% | der/das Golf, der/die Elf, das Schaf, das Schiff, das Dorf, das Kaff | 99% | 1.4% |
| -tz | m | 89% | das XXX-etz | 99% | 1.3% |
| -nz | f | 90% | der Tanz, der Schwanz, der Glanz, der Kranz | 99% | 0.5% |
| -us, -uss, -uß (except das Haus) | m | 92% | die Nuss, das Muß, das Plus, das Minus, das Virus, das Aus | 98% | 1.0% |
| words in -um: | | | | | |
| -aum | m | 100% | | | 0.3% |
| other words in -um | n | 97% | der Konsum, der Irrtum, der Reichtum | 100% | 0.9% |
| words in -is: | | | | | |
| -nis | n | 86% | der Penis, die Kenntnis, die Erlaubnis | 96% | 0.5% |
| -eis | m | 94% | das Eis, das Gleis | 99% | 0.5% |
| other words in -is | f | 77% | das Remis, das Palais, der Mais, der Cannabis, der Kürbis | 95% | 0.1% |
| words in “-i”: | | | | | |
| -ei | f | 98% | der Schrei, der Papagei, der Brei, das Geschrei | 100% | 0.6% |
| other words in -i | m | 95% | die Safari, die Salami, die Gaudi, das Taxi, das Alibi, das Sushi, das Müsli | 99% | 0.8% |

Table C

Frequent nouns worth learning by heart

The frequency includes the frequency of the compound words with the same root.

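| noun | frequency |
|---|---|
| der Tag | 1.8% |
| die Xxx-schaft (but der Schaft) | 1.1% |
| das Jahr | 1.0% |
| die Zeit | 0.8% |
| die Stadt | 0.6% |
| der Euro | 0.6% |
| das Ende | 0.5% |
| das Spiel | 0.5% |
| der Fall | 0.5% |
| das Haus | 0.5% |
| der Platz | 0.4% |
| der Xxx-trag | 0.4% |
| der Satz | 0.4% |
| die Arbeit | 0.4% |
| der Punkt | 0.3% |
| das Land | 0.3% |
| der Meter (the unit, but das Thermometer) | 0.3% |
| der Bau | 0.3% |
| das Leben | 0.3% |
| die Welt | 0.3% |
| der Gang | 0.3% |
| der Artikel | 0.3% |
| der Rat | 0.3% |
| der Ort | 0.3% |
| der Weg | 0.2% |
| das Geld | 0.2% |
| die Sicht | 0.2% |
| das Unternehmen | 0.2% |
| der Grund | 0.2% |
| der Schluss | 0.2% |
| das Bild | 0.2% |
| das Thema | 0.2% |
| der Abend | 0.2% |
| die Wahl | 0.2% |
| die Zahl | 0.2% |
| das Mal | 0.2% |
| der Kreis | 0.2% |
| der Zug | 0.2% |
| das Amt | 0.2% |
| das Werk | 0.2% |
| die Form | 0.2% |
| der Raum | 0.2% |
| das Wort (but die Antwort) | 0.2% |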

***

Let us know what you think and of course also if you have any questions.
See you in the comments :)

 
