German Gender and Big Data

Hello everyone,

and first of all let me say a big fat

Thank you for 30,000 comments

Yup, we passed it at some point last week. Usually I do some kind of giveaway, but I actually missed it this time.
So we’ll do the next one at 40,000, I guess :). But yeah, vielen lieben Dank. Your comments are part of the reason why this blog even made it through the first few years, and without them it would feel really lonely around here. Also, you ask lots of great questions that add value to the articles themselves.
I read every single comment and I’ll continue to do that, so keep ’em coming :).

And now on to our next point which is… my failure.
Honestly, this week was a complete failure for me, article-wise. My plan was to do an exercise on noun gender, and I thought I had an idea.
I didn’t want to test single nouns but rather the rules that Slavica told us about last time. But for this to actually work, we’d need to do it flashcard-style… so you get asked the same thing repeatedly until it’s automatic. And then I realized I actually don’t have a flashcard setup in my quiz software.
So then I thought “Okay, I’m gonna do nouns” and I looked for lists of the most common nouns and so on, but actually it’s the same thing… I need a flashcard setup, not just “one-off multiple choice”. So yeah… there went my idea of doing a gender practice :/.

I do have something interesting about gender though, because a while back someone sent me a really interesting email. His day name is Emmanuel Haton, but I’ll refer to him by his secret identity… Excel-Man. He had done some serious number crunching on the issue of German gender, and I found it so interesting that I absolutely had to share it with you.

So, using a giant database of German nouns and their frequency, Excel-Man used his superpower to check for trends between endings and gender, and for each trend he calculated how accurate the rule of thumb is if you just follow it blindly.
And he didn’t stop there. He then collected the most common exceptions to each rule of thumb and calculated the accuracy again. And it’s all weighted by frequency, mind you. So if 99 of 100 words follow a certain pattern but that one exception is super common, learning that exception will increase the accuracy quite a bit.
And to top it all off, it even accounts for compounds. So if die Sicht was an exception for instance, its frequency would also include words like Einsicht and Aussicht and so on.
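To make the idea of frequency-weighted accuracy a bit more concrete, here is a minimal Python sketch. It is not Excel-Man’s actual spreadsheet logic: the toy word list, the made-up frequencies, the function name `weighted_accuracy`, and the crude “a compound ends with its root word” check are all assumptions for illustration.

```python
# Hypothetical toy data: (noun, gender, relative frequency).
# The frequencies are invented for illustration, not taken from DeReWo.
TOY_NOUNS = [
    ("Zeitung", "f", 50.0),
    ("Wohnung", "f", 30.0),
    ("Sprung", "m", 15.0),
    ("Vorsprung", "m", 5.0),  # compound of the exception "Sprung"
]


def weighted_accuracy(nouns, suffix, guess, learned=()):
    """Share of occurrences (not of distinct words) that are guessed right.

    nouns   -- iterable of (word, gender, frequency)
    suffix  -- the ending the rule of thumb applies to, e.g. "ung"
    guess   -- the gender the rule of thumb predicts, e.g. "f"
    learned -- exceptions the learner knows by heart; assumed to cover
               their compounds too (simple endswith check)
    """
    total = 0.0
    correct = 0.0
    for word, gender, freq in nouns:
        lower = word.lower()
        if not lower.endswith(suffix):
            continue
        total += freq
        # A learned exception counts as correct, including its compounds.
        known = any(lower.endswith(exc.lower()) for exc in learned)
        if known or gender == guess:
            correct += freq
    return correct / total if total else 0.0


blind = weighted_accuracy(TOY_NOUNS, "ung", "f")
with_exception = weighted_accuracy(TOY_NOUNS, "ung", "f", learned=("Sprung",))
```

In this toy list, only two of the four distinct words follow the rule, but weighting by frequency gives the blind guess a much higher score, and learning the single exception covers its compound as well.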

He used the DeReWo frequency database of the Leibniz-Institut für Deutsche Sprache, which contains about 23,000 singular nouns and is based on a collection of recent German texts totalling 23 billion words. That’s thousands and thousands of words.
So it’s a pretty damn nice insight into regularities and how regular they actually are. All in all, it’s an incredible piece of work, and I’m gonna share part of his email and the tables with you below.

Not all of it is equally useful in practice, but I’m sure you’ll find some really nice ideas and insights. And it’s generally just very interesting to see how accurate these trends really are.
So… take a look and then tell me what you think. I’m really really curious about your feedback and whether you can use this for your studies.
Oh and also, since I don’t have an exercise, if you have a great tool for practicing German gender, please share it in the comments. I know there’s loads of apps out there but I have no idea what’s good.
So, I’ll see you in the comments, have a great weekend and bis nächste Woche :)

Oh, by the way… there’s a big new feature about to come to the website. Not sure when, but I’d say two weeks at the most.
It’s really really exciting, so get ready :) :) :).

***

German Gender and Big Data
(by Excel-Man)

I looked into the question of noun gender in German. I know, for a native speaker, this is a most uninteresting topic. But that’s only because you’ve been immersed all along in the language and therefore gender is obvious. Not so for a foreign learner, for whom it is something of an enigma and a holy grail. And for a language perfectionist like me it is a tragedy: I can easily umsonst walk by a bakery if I can’t remember the correct gender for Kuchen…

I decided to have a go at it, not from a semantic point of view (meaning: babies, diminutives and young animals are neuter; days, months and seasons are masculine, etc.) but using another ally… drumroll…

BIG DATA

Ok, really simple statistics. Ok, really some analysis in an Excel table. My hope was to find some usable rules, and I did find some. As always in languages, few rules are valid 100% of the time, but I found some regularities that can save the day with a pretty good success rate.

Also, I made sure to consider frequency data. I have read that overall in the German vocabulary the percentage of masculine/feminine words is such and such, which can help make some “bets” when you don’t know the gender of a word. But, to be usable, this kind of tip needs to be based on the usage frequency of the words, not only on their raw number.

Ideally usage frequency of spoken German should be considered, because rules of thumb are much more useful when speaking, as you don’t have the possibility to open a dictionary and check. Unfortunately I couldn’t get my hands on any such spoken word frequency database. Instead I relied on the general DeReWo database from the Leibniz-Institut für Deutsche Sprache.

The results

Some known things were confirmed:

  •  words in -UNG are feminine except der Sprung
  •  words in -ION are mostly feminine
  •  ...

But there were also some less obvious things, for instance about the gender of words that end in -r.

The gender of words ending in -R

For some reason, a final R sounds masculine to me, but that’s not exactly true.
First of all, let’s be semantic for a second: if the -R word designates a feminine person (die Mutter) or a masculine person (der Bauer), then go for the person’s gender.
The exception is das Opfer, and I remember it by thinking that the Opfer doesn’t actually do anything. Similarly, devices that do something (der Computer) are masculine.
– otherwise, -AUER is always feminine
– other words ending in -ER are, to my surprise, more often neuter (das Meer, das Messer), except if the first letter is K – in that case, go for masculine (der Koffer)
– words ending in -UR and -HR are mostly feminine, with some exceptions, including das Jahr
– for other -R words (that is, not ending in -ER), try masculine and you’ll be right 78% of the time.
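The -R rules above form a small decision procedure, which can be sketched in Python as follows. This is only an illustration under assumptions: the function name `guess_r_gender` and the two inputs `person_gender` and `is_device` are hypothetical, and the meaning-based information (person or device) has to be supplied by the caller because it cannot be read off the spelling. The function returns the rule-of-thumb guess, which is not always the true gender.

```python
def guess_r_gender(word, person_gender=None, is_device=False):
    """Return the rule-of-thumb guess ("m", "f" or "n") for a noun ending in -r.

    person_gender -- "m" or "f" if the noun designates a person (hypothetical input)
    is_device     -- True if the noun is a device that does something (hypothetical input)
    """
    w = word.lower()
    if not w.endswith("r"):
        raise ValueError("these rules only cover nouns ending in -r")

    # 1. Persons take their own gender (die Mutter, der Bauer).
    if person_gender is not None:
        return person_gender
    # 2. Otherwise -auer is feminine (die Mauer).
    if w.endswith("auer"):
        return "f"
    # 3. Words in -er: devices and K- words are masculine, the rest lean neuter.
    if w.endswith("er"):
        if is_device or w.startswith("k"):
            return "m"
        return "n"
    # 4. -ur and -hr are mostly feminine.
    if w.endswith(("ur", "hr")):
        return "f"
    # 5. Any other -r word: try masculine.
    return "m"


guesses = {
    "Mutter": guess_r_gender("Mutter", person_gender="f"),
    "Mauer": guess_r_gender("Mauer"),
    "Koffer": guess_r_gender("Koffer"),
    "Messer": guess_r_gender("Messer"),
    "Computer": guess_r_gender("Computer", is_device=True),
    "Natur": guess_r_gender("Natur"),
    "Motor": guess_r_gender("Motor"),
}
```

Exceptions such as das Jahr or das Opfer would still come out wrong here, which is exactly why they are worth learning separately.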

Words ending with I

  •  words ending with “-ei” are overwhelmingly feminine
  •  other words in -i are 95% masculine.

Other endings

  •  Words ending in -f, -b, -g (except -ung) are 90% masculine
  •  Words ending in -us, -uss, -uß (except das Haus) are 90% masculine.
     Easy to remember when thinking that -us is the typical masculine marker in Latin.
  •  Similarly, words in -um (the typical Latin marker for neuter) are indeed almost all neuter except for…
  •  …those in -aum, which are masculine.
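Several of these endings overlap (-ung vs. -g, -aum vs. -um, -ei vs. -i), so the more specific ending has to win. A minimal Python sketch of that “longest ending first” lookup is shown here. The names `SUFFIX_RULES`, `EXCEPTIONS` and `guess_gender` are made up for illustration, the rule set only contains the endings from the bullets above, and the result is a guess based on the trend, not a guarantee.

```python
# Endings and their trend gender, taken from the bullets above.
SUFFIX_RULES = {
    "ung": "f",
    "ei": "f",
    "i": "m",
    "f": "m",
    "b": "m",
    "g": "m",
    "us": "m",
    "uss": "m",
    "uß": "m",
    "aum": "m",
    "um": "n",
}

# Learned exceptions; assumed to cover compounds ending in the same word.
EXCEPTIONS = {
    "haus": "n",
}


def guess_gender(word):
    """Return "m", "f" or "n" as a rule-of-thumb guess, or None if no rule applies."""
    w = word.lower()
    # Known exceptions take priority over the ending rules.
    for exception, gender in EXCEPTIONS.items():
        if w.endswith(exception):
            return gender
    # Try longer endings first so that -ung beats -g and -aum beats -um.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if w.endswith(suffix):
            return SUFFIX_RULES[suffix]
    return None


examples = {w: guess_gender(w) for w in
            ["Bäckerei", "Zeitung", "Zug", "Raum", "Datum", "Krankenhaus", "Bus", "Tisch"]}
```

Sorting the endings by length keeps the table flat and easy to extend; a word that matches no listed ending simply returns `None` instead of a forced guess.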

For more statistical data, see results in tables A and B below. In every case you can see that learning a few exceptions seriously increases the probability of guessing right.

Finally, usage statistics let us know the list of words that are so frequent, including the frequency of their compounds, that it is worth learning their gender by heart (see table C).

Conclusion: doing this analysis was slightly geeky but fun. In total, the situations described here cover over 70% of the usage.
Enough to make me feel safer when ordering einen Kuchen at the bakery. And yes, it is der Kuchen.

** if you’re reading on mobile, please flip to landscape!! The full tables should fit then **

Table A

Gender of nouns in -R

(in total, these words make up 16% of all noun occurrences)

| ending | trend | accuracy | extras | accuracy with extras |
|---|---|---|---|---|
| -er and is a feminine (Mutter) or masculine person (Bauer) | f/m | 99% | das Opfer | 100% |
| -er and a device that does something, or a month | m | 97% | das Thermometer (and other “meter” devices), die Leiter | 99% |
| -auer (if not a masculine person) | f | 100% | | |
| begins with K- | m | 91% | das Klavier, die Kammer, das Kloster | 100% |
| other words in -er | n | 58% | der Fehler, der Meter, der Liter, der Ärger, der Hunger, der Finger, die Nummer, die Steuer, die Feier | 80% |
| -hr or -ur | f | 67% | das Jahr, das Ohr, das Rohr, der Verkehr, der Schwur, der Azur | 100% |
| other words ending in -r | m | 78% | das Tor, das Paar, das Haar, die Tür, die Bar, die Schar | 94% |

Table B

Other noun gender rules

| when a noun ends with: | if you say: | then you are x% right | and if you learn these exceptions: | then you are x% right | (this rule covers x% of total noun occurrences) |
|---|---|---|---|---|---|
| -ung | f | 99% | der Sprung | 100% | 9.2% |
| -ion | f | 98% | der Champion, der Spion, der Skorpion, das Stadion, das Ion | 100% | 2.3% |
| -eit | f | 98% | der Streit | 100% | 2.7% |
| -ft | f | 94% | der Saft, der Lift, der Stift, das Geschäft, das Heft, der Schaft (but other words in -schaft are f) | 99% | 1.7% |
| -e (except if a masculine person) | f | 90% | das Ende, das Interesse, das Gebäude, das Gelände, das Finale, das Auge, der Name, der Kaffee, der Schnee, der Käse | 97% | .4% |
| -nn | m | 98% | das Kinn, das Zinn | 99% | 0.7% |
| -g | m | 91% | die Burg, das Zeug, das XXX-ing | 99% | 2.9% |
| -pf | m | 99% | das Geschöpf | 100% | 0.3% |
| -ik | f | 99% | der Streik, das Mosaik | 100% | 1.0% |
| -b | m | 89% | das Lob, das Grab, das Weib, das Pub, das Kalb, das Verb, das Laub | 99% | 0.4% |
| -f | m | 88% | der/das Golf, der/die Elf, das Schaf, das Schiff, das Dorf, das Kaff | 99% | 1.4% |
| -tz | m | 89% | das XXX-etz | 99% | 1.3% |
| -nz | f | 90% | der Tanz, der Schwanz, der Glanz, der Kranz | 99% | 0.5% |
| -us, -uss, -uß (except das Haus) | m | 92% | die Nuss, das Muß, das Plus, das Minus, das Virus, das Aus | 98% | 1.0% |
| words in -um | | | | | |
| -aum | m | 100% | | | 0.3% |
| other words in -um | n | 97% | der Konsum, der Irrtum, der Reichtum | 100% | 0.9% |
| words in -is | | | | | |
| -nis | n | 86% | der Penis, die Kenntnis, die Erlaubnis | 96% | 0.5% |
| -eis | m | 94% | das Eis, das Gleis | 99% | 0.5% |
| other words in -is | f | 77% | das Remis, das Palais, der Mais, der Cannabis, der Kürbis | 95% | 0.1% |
| words in -i | | | | | |
| -ei | f | 98% | der Schrei, der Papagei, der Brei, das Geschrei | 100% | 0.6% |
| other words in -i | m | 95% | die Safari, die Salami, die Gaudi, das Taxi, das Alibi, das Sushi, das Müsli | 99% | 0.8% |

Table C

Frequent nouns worth learning by heart

The frequency includes the frequency of the compound words with the same root.

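| noun | frequency |
|---|---|
| der Tag | 1.8% |
| die Xxx-schaft (but der Schaft) | 1.1% |
| das Jahr | 1.0% |
| die Zeit | 0.8% |
| die Stadt | 0.6% |
| der Euro | 0.6% |
| das Ende | 0.5% |
| das Spiel | 0.5% |
| der Fall | 0.5% |
| das Haus | 0.5% |
| der Platz | 0.4% |
| der Xxx-trag | 0.4% |
| der Satz | 0.4% |
| die Arbeit | 0.4% |
| der Punkt | 0.3% |
| das Land | 0.3% |
| der Meter (the unit, but das Thermometer) | 0.3% |
| der Bau | 0.3% |
| das Leben | 0.3% |
| die Welt | 0.3% |
| der Gang | 0.3% |
| der Artikel | 0.3% |
| der Rat | 0.3% |
| der Ort | 0.3% |
| der Weg | 0.2% |
| das Geld | 0.2% |
| die Sicht | 0.2% |
| das Unternehmen | 0.2% |
| der Grund | 0.2% |
| der Schluss | 0.2% |
| das Bild | 0.2% |
| das Thema | 0.2% |
| der Abend | 0.2% |
| die Wahl | 0.2% |
| die Zahl | 0.2% |
| das Mal | 0.2% |
| der Kreis | 0.2% |
| der Zug | 0.2% |
| das Amt | 0.2% |
| das Werk | 0.2% |
| die Form | 0.2% |
| der Raum | 0.2% |
| das Wort (but die Antwort) | 0.2% |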

***

Let us know what you think and of course also if you have any questions.
See you in the comments :)

 

Your Thoughts and Questions

bluellama
10 months ago

Is ‘großartig’ actually common in colloquial speech, or only with Cari and Manuel?

Aaron
2 years ago

This reminds me of a couple Linguistics courses I took in university. Thanks for doing it, very useful! ;)

DW05
2 years ago

Is there a link to the Excel worksheet? That would be really helpful.

ui bkk
2 years ago

okuk

Napay 1
2 years ago

Hi Emmanuel

The last table is not well fitted for PC, the second column allows one digit instead three (which is needed)

Besides that, this post is amazing I will share from now on to all german learners :)

Aditiwari
2 years ago

Der Schung is another exception

Vienna
2 years ago
Reply to  Emanuel

Possibly meant “der Schwung”

Francesca Greenoak
2 years ago

How interesting. As 53% of German words are feminine, I learned the endings that are mostly f. I pick up the rest as I go: busk it as m or n.

Anonymous
2 years ago

My comments is nowhere to be found. Let me repeat then; words such as Schrei or Sprung do not have an ending which is something that is reserved for multisyllabic words. Ei and ung are part of the stem of the word here therefore you cannot derive their gender based on an ending syllable that does not exist. We don’t say in English that the word jump has en ending in ump or mp etc.

Alice
2 years ago

I just try to learn the article with the noun, because that was stressed by the online programs early on. I will try to look over all this information, but there is no foolproof way to learn German definite articles. The feminine ones I usually get right, but the masculine and neuter ones I sometimes forget. I practice definite articles every day (with three different subscription programs), and I always get a few wrong no matter how much I study them.
Just going to keep trying…
;)

DianaM
2 years ago

Thanks so much, super useful.

I do have one more adjustment to your English – “it’s all weighed by frequency”: “weighed” should be “weighted”. That’s the correct term in statistical analysis.

https://www.investopedia.com/terms/w/weightedaverage.asp

Edit – am I doing something wrong? I tried to download the .pdf, but only got part of the article. It stops part way through Table B.

Cengiz
2 years ago

have i just noticed or was there time references under messages all the time?! i tried a few articles and it is same in all messages. although i said it would be better with dates, your argument for not putting also seems valid. anyway now people can say their opinion about both styles.

coleussanctus
2 years ago

Interesting stuff. I would be curious to poke around in the data and dig into some of the finer details, like how many words are in each category, what the frequency looks like, how many words are compound. From poking around on dict.cc for words that end in *auer, it looks like you can have pretty much everything you need for daily use if you learn 3 words – Bauer (m), Dauer (f), Mauer (f). I picked a small category on purpose because it’s easy to look at, but I still thought it was interesting.

Another thought, there are obviously strong trends based on the form of the word (which is a huge help when you know what the word looks like but don’t know the meaning), but I wonder how many of the exceptions could be explained from the meaning of the word. For nouns ending in -nis, it seems like they tend to be neuter if they describe something concrete, or at least with a tangible aspect, and feminine for more abstract ideas. Erkenntnis, Kenntnis (f); Ergebnis, Verhältnis (n). It’s more of a broad trend than anything else, and it breaks down at some point, but I did find a fair amount of research about abstract nouns tending to be feminine. I have some other theories about how to explain the gender split for -nis nouns (that could possibly apply to other nouns), but that’s a bit too much for a comment. How useful any of this is for daily learning, I don’t know, but this is what I end up doing when I have nowhere to go.

Dr.Rami
2 years ago

i’d like to thank you for your free-membership gift that allows me to get into this massive amount of pretty good thinking approachs in learning germany and languages in general
i’ll pay that back as soon as i can … what a decent person you are!
thank you again

Dr.Rami
2 years ago
Reply to  Dr.Rami

but where could i find the date of each article … seem to be unchronicled

Cengiz
2 years ago
Reply to  Dr.Rami

yeah, though i just love this blog, thats a concern for me too. i am surprised that most people doesnt seem to care about it at all. without date it feels like i am floating in a space of articles without any point of reference. it is as if; it werent for the emails that Emanuel sends weekly one could easily assume that this blog had stopped years ago and we are just lingering in a museum of old entries. i dont mean “old” is bad, just that without reference it feels kind of weird. i dont know, maybe i am the weird :) then again, maybe this “weirdness” is what makes this blog special. because there is really something different about this blog that makes it stand out among other tons of similar websites. sure it is due to emanuels übermensch dedication and courtesy but maybe this date thing also adds a unique flavor.

Cpot
2 years ago

My God. This is very good information. Thank you very much for letting me know.

Ryan
2 years ago

Wow. Very nice! Concerning words like computer. My experience is, that you look at the gender of the German word.
In German, these words are called Anglizismen (anglicisms).
Example:

Der Computer – > der Rechner
Das Notebook – > das Notizbuch
Die E-Mail – > could be based on “die Post”

There could be exceptions. I haven’t spend much time looking for them.

Metin
2 years ago

Am I the only one who got surprised by this article? I have searched all internet to see such a compilation but could not find it. This is useful not only for learners but also for teachers.

Thank you. Really nice work

Richard
2 years ago

Like many of your posts, I printed this one out. They’re great! However, the absence of a pdf version in this case does make for inconvenience: ie having to hand-write the text obscured by the Privacy and Cookies message at the bottom of each page..

pmccann
2 years ago
Reply to  Richard

Hang in there: the button for the pdf version often magically emerges after Elsa’s typos have been seen to.

Anonymous
2 years ago

I can guarantee that virtually every word ending in -ung is feminine. Der Sprung is not an exception, it does not have an -ung ending, an ending is defined as a separate syllable attached to the stem of the word, Several of the other examples you gave followed that mistaken pattern.

kalamazoo
2 years ago

On Amazon, you can buy a self-published book called “Der Die Das, the secrets of German gender’ by Constantin Vayenas. It seems like that book might have a somewhat similar approach, looking for patterns below the surface. The reviews it gets are good. Is anyone here familiar with this book???

Bruce Terry
2 years ago
Reply to  kalamazoo

Yes, it is very deeply researched and well-presented.