Train a model using named entities
Question
I am looking at Stanford CoreNLP and using its Named Entity Recognizer. I have different kinds of input text that I need to tag with my own entity types, so I started training my own model, but it doesn't seem to be working.
For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"
I went through the examples for training my own models, and I am looking for only a few words that I am interested in.
My jane-austen-emma-ch1.tsv looks like this:
Toyota PERS
Land Cruiser PERS
From the above input text I am only interested in two terms: one is Toyota and the other is Land Cruiser.
The austen.prop looks like this:
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Run the following command to generate the ner-model.ser.gz file:
java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
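import java.util.List;

import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations.AnswerAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

// The class name is illustrative; the original post showed only the main method.
public class NerDemo {
    public static void main(String[] args) {
        // Stock 7-class model shipped with the CoreNLP models jar
        String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
        // Custom model produced by the training command above
        String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
        try {
            // Combine both models; the custom model is listed first
            NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                    serializedClassifier2, serializedClassifier);
            String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
            System.out.println("---");
            List<List<CoreLabel>> out = classifier.classify(ss);
            // Print every token followed by the label assigned to it
            for (List<CoreLabel> sentence : out) {
                for (CoreLabel word : sentence) {
                    System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
                }
                System.out.println();
            }
        } catch (Exception e) {
            // Covers model-loading failures as well as the ClassCastException caught originally
            e.printStackTrace();
        }
    }
}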
This is the output I am getting:
Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS
which I think is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-valued field).
Thanks for the help. Any help is really appreciated.
Recommended answer
The NERClassifier* is word-level; that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can join the words that form a phrase into a single token with an underscore. So in both your labeled examples and your test examples, you would change "Land Cruiser" to "Land_Cruiser".
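As a rough sketch of that preprocessing step, the same phrase-joining could be applied to both the training text and the text you classify. Everything here is illustrative rather than part of Stanford NER: the class `PhraseJoiner` and its two methods are hypothetical helpers, tokens are split on whitespace instead of with the Stanford tokenizer, and the one-token-per-line, tab-separated layout with `O` for uninteresting tokens is an assumption about how the training file is meant to look (it matches the `map = word=0,answer=1` setting in the question).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Stanford NER.
public class PhraseJoiner {

    // Replace each multi-word phrase with an underscore-joined single token,
    // e.g. "Land Cruiser" becomes "Land_Cruiser".
    public static String joinPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            result = result.replace(phrase, phrase.replace(' ', '_'));
        }
        return result;
    }

    // Emit one "token<TAB>label" line per whitespace-separated token.
    // Tokens found in `entities` get `label`; every other token gets "O".
    // Assumption: whitespace splitting stands in for real tokenization.
    public static List<String> toTrainingLines(String text, Set<String> entities, String label) {
        List<String> lines = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            lines.add(token + "\t" + (entities.contains(token) ? label : "O"));
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        // Join the phrase first, so it is a single token everywhere.
        String joined = joinPhrases(text, Arrays.asList("Land Cruiser"));
        Set<String> entities = new HashSet<>(Arrays.asList("Toyota", "Land_Cruiser"));
        for (String line : toTrainingLines(joined, entities, "PERS")) {
            System.out.println(line);
        }
    }
}
```

Calling `joinPhrases` on the test string before passing it to the classifier keeps the test input consistent with the joined tokens used in training.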