Does anyone know of a good library for mapping a person's name to his or her gender? [closed]
I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like
john => "M",
mary => "F",
alex => "A", #ambiguous
I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).
Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender", let me be clear: my application does not interact with anyone. It does not send emails or contact anyone in any way. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come later.
Solution 1:
gender.c is an open-source C program that does a good job. It comes with data for 44568 first names from all around the world. There is good documentation and a description of the file format (basically plain text), so it should not be too difficult to read it from your own application.
Here is what the author says:
A few words on quality of data
The dictionary of first names has been prepared with utmost care. For example, the Turkish, Indian and Korean names in this dictionary have all been independently classified by several native speakers. I also took special care to list only those names which can currently be found.
The lesson from this?
Any modifications should be done very cautiously (and they must also adhere to the sorting required by the search algorithm). For example, knowing that "Sascha" is a boy's name in Germany, the author never assumed the English "Sasha" to be a girl's name. Knowing that "Jan" is a boy's name in Germany, I never assumed it to be also a English short form of "Janet". Another case in point is the name "Esra". This is a boy's name in Germany, but a girl's name in Turkey.
The program calculates a probability of the name being male or female. It can do so from the name alone or from the name plus the country of origin, which gives significantly better results.
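To illustrate why the country matters, here is a minimal Python sketch of a country-aware lookup. It is not gender.c's algorithm or file format: the table, the country codes, and the function name `guess_gender` are all made up for illustration, and the few entries come only from the examples in the quote above.

```python
from typing import Optional

# Illustrative table only, built from the examples quoted above.
# This is NOT gender.c's data format; the country codes are an assumption.
BY_COUNTRY = {
    ("esra", "de"): "M",
    ("esra", "tr"): "F",
    ("jan", "de"): "M",
    ("sascha", "de"): "M",
}


def guess_gender(name: str, country: Optional[str] = None) -> str:
    """Return "M", "F", or "A" (ambiguous or unknown)."""
    key = name.strip().lower()

    # Prefer an exact (name, country) match when a country is supplied.
    if country is not None:
        hit = BY_COUNTRY.get((key, country.strip().lower()))
        if hit is not None:
            return hit

    # Otherwise answer only if every country we know about agrees.
    seen = {g for (n, _), g in BY_COUNTRY.items() if n == key}
    if len(seen) == 1:
        return seen.pop()
    return "A"


print(guess_gender("Esra"))        # no country given, entries disagree
print(guess_gender("Esra", "tr"))  # country resolves the ambiguity
```

With this toy table, "Esra" alone comes back as ambiguous, while adding the country gives a definite answer. The real program works with probabilities rather than a hard lookup.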
You can download it from the website of the German computer magazine c't (article "40 000 Namen"). The article is in German, but don't worry: all the documentation is in English. If you are not interested in the article, here is the direct FTP link: 0717-182.zip. The zip file contains the source code, a Windows executable, the database, and the documentation.
Solution 2:
The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.
EDIT: The link for the 2010 names is dead, but there are working links and libraries in the comments.
Solution 3:
"I tell ya, life ain't easy for a boy named 'Sue.'"
...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.
Solution 4:
I've built a free API that gives a probabilistic guess of the gender based on a first name. Rather than using any of the approaches mentioned above, I use a huge dataset of profiles from social networks to provide a probabilistic guess along with a certainty factor. It also supports optional filtering by country or language ID. It's getting better by the day as more profiles are added to the dataset.
It's free to use at http://genderize.io
One thing you should consider is using a tool that takes demographics into account, as naming conventions depend heavily on them.
Example
http://api.genderize.io?name=kim
{"name":"kim","gender":"female","probability":"0.89","count":1440}
http://api.genderize.io?name=kim&country_id=dk
{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}
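For anyone who wants to turn those responses into the M/F/A values from the question, here is a rough Python sketch. It only builds the URL and parses a response body. The endpoint and the `name` and `country_id` parameters are the ones shown above and may have changed since. The helper names and the 0.8 threshold are my own assumptions, not part of the API.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.genderize.io"  # endpoint as given in this answer


def build_url(name, country_id=None):
    """Build a request URL in the form shown in the examples above."""
    params = {"name": name}
    if country_id:
        params["country_id"] = country_id
    return BASE_URL + "?" + urlencode(params)


def classify(body, threshold=0.8):
    """Map a JSON response body to "M", "F", or "A" (ambiguous).

    The threshold is an arbitrary cut-off chosen for this sketch.
    """
    data = json.loads(body)
    gender = data.get("gender")
    # The sample responses carry the probability as a string.
    probability = float(data.get("probability") or 0)
    if gender == "male" and probability >= threshold:
        return "M"
    if gender == "female" and probability >= threshold:
        return "F"
    return "A"


# The two sample responses from above, used here instead of a live call.
plain = '{"name":"kim","gender":"female","probability":"0.89","count":1440}'
danish = '{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}'

print(build_url("kim", "dk"))
print(classify(plain), classify(danish))
```

Raising the threshold (say to 0.9) pushes borderline names such as "kim" without a country into the ambiguous bucket, which may be what you want when the output affects grammar.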
Solution 5:
Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:
Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.
Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
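Either approach boils down to comparing two counts. A minimal Python sketch of that last step, assuming you already have the two numbers (how you obtain them is up to the API in question); the function names and the 0.7 cut-off are arbitrary choices for illustration:

```python
def male_share(male_hits, female_hits):
    """Fraction of hits that point to male, or None if there are no hits."""
    total = male_hits + female_hits
    if total == 0:
        return None
    return male_hits / total


def guess_from_counts(male_hits, female_hits, threshold=0.7):
    """Return "M", "F", or "A" depending on how lopsided the counts are."""
    share = male_share(male_hits, female_hits)
    if share is None:
        return "A"
    if share >= threshold:
        return "M"
    if share <= 1 - threshold:
        return "F"
    return "A"


# The "Richard his" vs. "Richard her" result counts quoted above.
print(male_share(592_000_000, 179_000_000))
print(guess_from_counts(592_000_000, 179_000_000))
```

For the Richard numbers the male share works out to roughly 0.77, so with this cut-off the guess is "M". Search result counts are noisy, so treat the share as a weak signal at best.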