Entities in my gazette are not recognized

Question

I would like to create a custom NER model. Here is what I did:

Training data (stanford-ner.tsv):

Hello    O
!    O
My    O
name    O
is    O
Damiano    PERSON
.    O

Properties (stanford-ner.prop):

trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true

Gazette (gazzetta.txt):

PERSON John
PERSON Andrea

I build the model from the command line with:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop stanford-ner.prop

And I test it with:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -loadClassifier ner-model.ser.gz -textFile test.txt

I ran two tests with the following texts:

>>> TEST 1 <<<

  • TEXT: Hello! My name is Damiano and this is a fake text to test.

  • OUTPUT: Hello/O !/O My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O

>>> TEST 2 <<<

  • TEXT: Hello! My name is John and this is a fake text to test.

  • OUTPUT: Hello/O !/O My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O

As you can see, only the "Damiano" entity is found. That entity appears in my training data, whereas "John" (second test) appears only in the gazette. So the question is:

Why is the John entity not recognized?

Thank you very much.

Recommended answer

The Stanford NER FAQ says:

If a gazette is used, this does not guarantee that words in the gazette are always used as a member of the intended class, and it does not guarantee that words outside the gazette will not be chosen. It simply provides another feature for the CRF to train against. If the CRF has higher weights for other features, the gazette features may be overwhelmed.

If you want something that will recognize text as a member of a class if and only if it is in a list of words, you might prefer either the regexner or the tokensregex tools included in Stanford CoreNLP. The CRF NER is not guaranteed to accept all words in the gazette as part of the expected class, and it may also accept words outside the gazette as part of the class.
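
To illustrate the difference between a gazette feature and a hard list lookup, here is a minimal Python sketch that post-processes the classifier's `word/TAG` output and forces the label for any token found in the list. It is not RegexNER or TokensRegex, only a stand-in for the idea of list-driven tagging. The function names `load_gazette` and `overlay` are hypothetical, and the sketch assumes single-token entries in the `CLASS word` layout of gazzetta.txt.

```python
# Sketch only: a hard list lookup applied on top of CRF output.
# This is NOT RegexNER; load_gazette and overlay are hypothetical helpers.

def load_gazette(lines):
    """Parse 'CLASS word' lines (the gazzetta.txt layout) into {word: CLASS}.

    Assumption: each entry is a single token; longer entries are skipped.
    """
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            label, word = parts
            table[word] = label
    return table


def overlay(tagged, gazette):
    """Relabel every token of 'word/TAG ...' output whose word is in the gazette.

    Tokens that are not in the gazette keep whatever tag the CRF assigned.
    """
    out = []
    for item in tagged.split():
        # rpartition splits on the LAST slash, so a token like './O' still works.
        word, _, tag = item.rpartition("/")
        out.append(f"{word}/{gazette.get(word, tag)}")
    return " ".join(out)


gazette = load_gazette(["PERSON John", "PERSON Andrea"])
crf_output = "Hello/O !/O My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O"
print(overlay(crf_output, gazette))
```

With the output from test 2, the sketch relabels `John/O` as `John/PERSON` and leaves every other token alone. Note that it only adds labels: a strict "if and only if" rule would also have to reset PERSON tags on words outside the list.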

By the way, it is not good practice to test machine learning pipelines in a "unit-test" way, i.e. with only one or two examples, because they are meant to work on much larger volumes of data and, more importantly, they are probabilistic by nature.
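
One way to move beyond single-sentence checks is to score the classifier's output against hand-labeled sentences and look at aggregate numbers. The following Python sketch computes token-level precision and recall for one label from `word/TAG` lines. The helper names `parse` and `score` are hypothetical, and the gold labels in the example are assumptions about what the asker expected, not output from any tool.

```python
# Sketch only: token-level precision/recall over many tagged sentences.
# parse and score are hypothetical helpers, not part of Stanford NER.

def parse(tagged):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        word, _, tag = item.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs


def score(gold_lines, predicted_lines, label="PERSON"):
    """Return (precision, recall) for `label`, comparing tokens position by position.

    Assumption: gold and predicted lines have identical tokenization.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_lines, predicted_lines):
        for (_, g), (_, p) in zip(parse(gold), parse(pred)):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed gold labels for shortened versions of the two test sentences.
gold = ["My/O name/O is/O Damiano/PERSON", "My/O name/O is/O John/PERSON"]
pred = ["My/O name/O is/O Damiano/PERSON", "My/O name/O is/O John/O"]
print(score(gold, pred))  # (1.0, 0.5)
```

Two sentences are still far too few to draw conclusions; the point is that the same function scales to a test file with hundreds of lines.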

If you want to check whether your gazette file is actually used, it may be better to start from the existing examples (see austen.gaz.prop and austen.gaz.txt at the bottom of the FAQ page quoted above), replace several of the names with your own, and then check. If that fails, first try changing your test, e.g. add more names, rephrase the text, and so on.
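
A likely contributing factor in the question's setup is that neither gazette name occurs in the seven-token training file, so the gazette feature never co-occurs with the PERSON label during training and the CRF has nothing to learn a weight from. The Python sketch below builds a somewhat larger tab-separated training file in which gazette names do appear. The templates, the helper name `make_training_rows`, and the use of a blank line between sentences are all illustrative assumptions; real annotated text is preferable to synthetic sentences.

```python
# Sketch only: put gazette names into training sentences so the gazette
# feature can co-occur with the PERSON label. Templates are hypothetical.

TEMPLATES = [
    ["Hello", "!", "My", "name", "is", "{name}", "."],
    ["Yesterday", "{name}", "called", "me", "."],
    ["I", "met", "{name}", "at", "the", "station", "."],
]


def make_training_rows(names, templates, label="PERSON"):
    """Return 'token<TAB>label' rows, one sentence per (name, template) pair."""
    rows = []
    for name in names:
        for template in templates:
            for token in template:
                if token == "{name}":
                    # A multi-word name yields one labeled row per word.
                    rows.extend(f"{part}\t{label}" for part in name.split())
                else:
                    rows.append(f"{token}\tO")
            rows.append("")  # assumed: blank line separates sentences
    return rows


rows = make_training_rows(["John", "Andrea", "Damiano"], TEMPLATES)
with open("stanford-ner.generated.tsv", "w", encoding="utf-8") as handle:
    handle.write("\n".join(rows))
```

The output file name is made up so that the original stanford-ner.tsv is not overwritten; point `trainFile` at it only after checking the rows by hand.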
