How to train a French NER based on the Stanford NLP Conditional Random Fields model?

Question

I discovered the Stanford NLP tools and found them really interesting. I'm a French data miner / data scientist, fond of text analysis, and would love to use your tools, but the NER being unavailable in French is quite puzzling to me.

I would love to make my own French NER, and perhaps even provide it as a contribution to the package if it is considered worthy. So, could you brief me on the requirements for training a CRF for French NER based on Stanford CoreNLP?

Thanks.

Recommended answer

NB: I am not a developer of the Stanford tools, nor an NLP expert, just an ordinary user who also needed this information at some point. Also note that part of the information given below comes from the official FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a

Here are the steps I followed to train my own NER:

  1. Install Java 8
  2. Create train and test samples. They must take the form of .tsv files, with one token per line followed by a tab and its label, in the following format:

  Venez    O
  découvrir    O
  lundi    DAY
  le    O
  nouvel    O
  espace    O
  de    O
  vente    O
  ODHOJS    ORGANISATION

Depending on the original format of your text, you can create this sample with SQL statements or other NLP tools. The labelling is the most complicated part, as I don't know of any way to do it other than by hand.
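As a rough illustration of how such a file could be bootstrapped before hand-labelling, here is a minimal Python sketch. The function name text_to_tsv and the regex tokenizer are assumptions made for this example, not part of the Stanford tools. It splits raw text into tokens and writes every token with the default label O, so that only the real entities need to be corrected by hand afterwards.

```python
import re

def text_to_tsv(text, default_label="O"):
    """Return one 'token<TAB>label' line per token, all with the default label."""
    # Naive tokenizer (an assumption for this sketch): runs of word characters,
    # which in Python 3 includes accented letters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(f"{token}\t{default_label}" for token in tokens)

sample = "Venez découvrir lundi le nouvel espace de vente ODHOJS"
print(text_to_tsv(sample))
```

The tokenization used to build the training file should stay close to the one applied to real text at prediction time, so this regex is only a starting point.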

Train the model with this command:

java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop prop.txt

The prop.txt file is also described in the FAQ mentioned above.
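For orientation, a prop.txt could look roughly like the fragment below. It is modelled on the example properties file in the Stanford NER CRF FAQ. The file names train.tsv and ner-model.ser.gz are taken from the commands in this answer, and the feature flags are a generic starting point from that FAQ, not settings tuned for French.

```properties
# Training file: one token per line, token in column 0, label in column 1
trainFile = train.tsv
# Where to save the trained model; the .gz suffix makes it gzipped
serializeTo = ner-model.ser.gz
map = word=0,answer=1

# Order of the CRF
maxLeft = 1

# Features (generic defaults from the FAQ example, not tuned for French)
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useDisjunctive = true
useSequences = true
usePrevSequences = true
# Word shape features
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```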

This should create a new serialized model file (the ner-model.ser.gz used in the commands below) containing the newly trained model.

Test the model's performance:

java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile test.tsv > test.res

The input test.tsv has the same format as the train.tsv file. The output in test.res has an extra column containing the class predicted by the NER. The last lines also show a summary in terms of precision, recall and F1.
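If you want to recompute scores yourself from the three-column output, the following Python sketch may help. It assumes each token row is whitespace-separated as "token gold predicted" and skips every other line. The function name token_level_scores is hypothetical, and the scores are token-level, so they can differ from the summary printed by the tool.

```python
from collections import Counter

def token_level_scores(lines, ignore="O"):
    """Per-label (precision, recall, F1) from 'token gold predicted' rows."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a token row
        _, gold, pred = parts
        if gold == pred:
            if gold != ignore:
                tp[gold] += 1
        else:
            if pred != ignore:
                fp[pred] += 1  # predicted a label that is not the gold one
            if gold != ignore:
                fn[gold] += 1  # missed the gold label
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Illustrative rows, not real tool output. To score a real file instead:
#   with open("test.res", encoding="utf-8") as f:
#       scores = token_level_scores(f)
rows = ["Venez\tO\tO", "lundi\tDAY\tDAY", "ODHOJS\tORGANISATION\tO"]
for label, (p, r, f1) in sorted(token_level_scores(rows).items()):
    print(f"{label}\tP={p:.2f}\tR={r:.2f}\tF1={f1:.2f}")
```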

Finally, you can use your NER on real data:

java -cp "stanford-ner.jar:lib/*" -mx5g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz  -textFile test.txt -outputFormat inlineXML > test.res
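With -outputFormat inlineXML, recognized entities are wrapped in tags named after their label. A minimal Python sketch for pulling them back out is shown below. The tagged string is an illustrative example built from the sample sentence above, not real output, and the function name extract_entities is an assumption. The sketch also does not handle any character escaping the tool may apply.

```python
import re

# Matches <LABEL>text</LABEL> where the closing tag repeats the opening one.
ENTITY_RE = re.compile(r"<([A-Za-z_]+)>(.*?)</\1>", re.DOTALL)

def extract_entities(inline_xml):
    """Return (label, text) pairs for every tagged span in the input."""
    return [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(inline_xml)]

tagged = ("Venez découvrir <DAY>lundi</DAY> le nouvel espace de vente "
          "<ORGANISATION>ODHOJS</ORGANISATION>")
for label, text in extract_entities(tagged):
    print(f"{label}\t{text}")
```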

Hope this helps.
