Why is Stanford CoreNLP gender identification nondeterministic?


Problem description

I get the following results. As you can see, the name edward gets different results (null and MALE). This has happened with several names.

edward, Gender: null
james, Gender: MALE
karla, Gender: null
edward, Gender: MALE

Additionally, how can I customize the gender dictionaries? I want to add Spanish and Chinese names.

Recommended answer

You've asked several questions here!

1.) Karla is not in the default gender mappings file, which is why it gets null.

2.) If you want to make your own custom file, it should be in this format:

JOHN\tMALE

There should be one NAME\tGENDER entry per line, where \t stands for a tab character.
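As a rough illustration of the format described above, the following Java sketch checks that each line has exactly two tab-separated fields and loads them into a map. The class name `GenderMapFormat` and the method `parse` are hypothetical names chosen for this example, and the strict two-field check is an assumption based on the JOHN\tMALE example, not something taken from CoreNLP itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenderMapFormat {
    // Hypothetical helper: parses NAME<TAB>GENDER lines into an ordered map.
    // Blank lines are skipped; any other malformed line raises an exception.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 2
                    || parts[0].trim().isEmpty()
                    || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException(
                        "Expected NAME<TAB>GENDER, got: " + line);
            }
            map.put(parts[0].trim(), parts[1].trim());
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map =
                parse(Arrays.asList("JOHN\tMALE", "KARLA\tFEMALE"));
        System.out.println(map); // prints {JOHN=MALE, KARLA=FEMALE}
    }
}
```

Running a check like this over a hand-edited file before handing it to the pipeline can catch entries where a space was typed instead of a tab.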

The GenderAnnotator can only take one file for the mappings, so you need to make a new file that contains the default entries plus the names you want to add.

The default gender mappings file is in the stanford-corenlp-3.5.2-models.jar file.

You can extract the default gender mappings file from that jar in this manner:

mkdir tmp-stanford-models-expanded
cp /path/of/stanford-corenlp-3.5.2-models.jar tmp-stanford-models-expanded
cd tmp-stanford-models-expanded
jar xf stanford-corenlp-3.5.2-models.jar

There should now be a directory tmp-stanford-models-expanded/edu.

The file you want is tmp-stanford-models-expanded/edu/stanford/nlp/models/gender/first_name_map_small
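Since the annotator reads a single mappings file, one way to keep the default names while adding your own is to merge the extracted default file with a file of extra entries. The sketch below is only an illustration of that idea: the class `MergeGenderMaps`, the method `merge`, and the choice to let later entries override earlier ones are assumptions made for this example, not part of CoreNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeGenderMaps {
    // Hypothetical helper: merges two NAME<TAB>GENDER files into one.
    // Entries from `extra` replace entries in `base` with the same name.
    // Lines that do not have exactly two tab-separated fields are skipped.
    static void merge(Path base, Path extra, Path out) throws IOException {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Path p : new Path[] {base, extra}) {
            for (String line : Files.readAllLines(p, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    merged.put(parts[0].trim(), parts[1].trim());
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : merged.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: MergeGenderMaps <default> <extra> <out>");
            return;
        }
        merge(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

For instance, it could be run with first_name_map_small as the first argument, a file of Spanish and Chinese names as the second, and the output path later passed to the pipeline as the third.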

3.) Build your pipeline in this manner to use your custom gender dictionary:

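import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// gender is listed before ner so the gender annotator runs first
props.setProperty("annotators",
    "tokenize, ssplit, pos, lemma, gender, ner");
// path to the custom NAME\tGENDER mappings file
props.setProperty("gender.firstnames", "/path/to/your/gender_dictionary.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);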

4.) Try running gender BEFORE ner in your pipeline (see my ordering of the annotators above). The RegexNERSequenceClassifier (the class that adds the gender tags) can get blocked if tokens already have NER tags. It looks to me like running the gender annotator first will fix the problem, so when you build the pipeline, make sure gender comes before ner.

The sequence "edward james karla edward" is tagged "O O PERSON PERSON" by the NER tagger. I am not entirely sure why the first two tokens get "O" for their NER tags. I would note that "Edward James Karla Edward" yields "PERSON PERSON PERSON PERSON". Keep in mind that the NER tagger factors in position in the sentence, so perhaps being lowercase at the beginning of the sentence causes the first token "edward" to be marked as "O"?

If you have any issues with this, please let me know and I will be happy to help more!

TL;DR

1.) Karla gets null because that name is not in the gender dictionary.

2.) You can make your own gender mappings file with one NAME\tGENDER entry per line. Make sure the property "gender.firstnames" is set to the path of your new gender mappings file.

3.) Make sure the gender annotator goes before the ner annotator. This should fix the problem!
