How to train a French NER based on the Stanford NLP Conditional Random Fields model?
Question
I discovered the Stanford NLP tools and found them really interesting. I'm a French data miner / data scientist, fond of text analysis, and I would love to use your tools, but the lack of a French NER model is quite puzzling to me.
I would love to build my own French NER, and perhaps even contribute it to the package if it is considered worthy. Could you brief me on the requirements for training a CRF for French NER based on Stanford CoreNLP?
Thanks.
Recommended answer
NB: I am not a developer of the Stanford tools, nor an NLP expert, just an ordinary user who also needed this kind of information at some point. Also note that part of the information given below comes from the official FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
Here are the steps I followed to train my own NER:
- Install Java 8.
- Create a train/test sample. It must take the form of .tsv files (one token per line, followed by a tab and its label) with the following format:
Venez O
découvrir O
lundi DAY
le O
nouvel O
espace O
de O
vente O
ODHOJS ORGANISATION
Depending on the original format of your text, you can create this sample with SQL statements or other NLP tools. The labelling is the most laborious part, as I don't know of any way to do it other than by hand.
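As a starting point for the hand labelling, the sketch below turns raw text into the two-column layout with every token pre-labelled O, so that only the entity tokens need editing afterwards. It assumes plain whitespace tokenization, which is cruder than a real tokenizer (punctuation stays attached to words), and the file names raw.txt and draft.tsv are hypothetical.

```shell
# Hypothetical input: one raw sentence to be labelled.
printf 'Venez découvrir lundi le nouvel espace de vente ODHOJS\n' > raw.txt

# Put each whitespace-separated token on its own line,
# then append a tab and the default label O to every non-empty line.
tr -s '[:space:]' '\n' < raw.txt | awk 'NF { printf "%s\tO\n", $0 }' > draft.tsv

# Inspect the draft, then edit the labels of entity tokens by hand
# (for example lundi -> DAY, ODHOJS -> ORGANISATION).
cat draft.tsv
```

Once the labels have been corrected, the draft can be split into train.tsv and test.tsv.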
Train the model with this command:
java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop prop.txt
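The prop.txt file is also described in the FAQ linked above.
This should create a serialized model file (ner-model.ser.gz in the commands below, as named by the serializeTo property in prop.txt) containing the newly trained model.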
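For reference, a prop.txt for the training command might look like the sketch below. It is adapted from the example in the official FAQ, with the file names train.tsv and ner-model.ser.gz assumed to match the other commands in this answer. The feature flags are the FAQ's generic starting point and have not been tuned for French.

```shell
# Write a minimal prop.txt (a sketch adapted from the Stanford NER FAQ example).
cat > prop.txt <<'EOF'
# training file: one token per line, token and label separated by a tab
trainFile = train.tsv
# where the trained classifier is saved; the .gz suffix gzips it
serializeTo = ner-model.ser.gz
# column 0 holds the word, column 1 holds the gold label
map = word=0,answer=1

# order of the CRF
maxLeft=1

# features to train with (starting point, not tuned for French)
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
EOF
```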
Test the model performance:
java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile test.tsv > test.res
The input test.tsv has the same format as the train.tsv file. The output in test.res has an extra column containing the class predicted by the NER. The last lines also show a summary in terms of precision, recall and F1.
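Beyond the summary scores, it can help to see which labels get confused with which. The sketch below counts the (gold -> predicted) mismatches, assuming the output has three tab-separated columns in the order token, gold label, predicted label. The file sample.res and its contents are made up for illustration; point the awk command at your own test.res instead.

```shell
# Hypothetical excerpt of a result file: token, gold label, predicted label.
printf 'lundi\tDAY\tDAY\nODHOJS\tORGANISATION\tO\nvente\tO\tO\nmardi\tDAY\tO\n' > sample.res

# Keep only three-column lines where gold and prediction differ,
# then count each kind of confusion, most frequent first.
awk -F'\t' 'NF == 3 && $2 != $3 { print $2 " -> " $3 }' sample.res \
  | sort | uniq -c | sort -rn > confusions.txt

cat confusions.txt
```

Labels that are frequently predicted as O usually point to entity types with too few training examples.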
Finally, you can use your NER on real data:
java -cp "stanford-ner.jar:lib/*" -mx5g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt -outputFormat inlineXML > test.res
Hope it helps.