SpaCy model training data: WikiNER


Problem description

For the spaCy 2.0 model xx_ent_wiki_sm, the documentation mentions the "WikiNER" dataset, which points to the article 'Learning multilingual named entity recognition from Wikipedia'.

Is there any resource for downloading this dataset so that the model can be retrained? Or a script for processing a Wikipedia dump?

Recommended answer

The data server from Joel's (and my) former research group seems to be offline: http://downloads.schwa.org/wikiner

I found a mirror of the wp3 files, which are the ones I'm using in spaCy, here: https://github.com/dice-group/FOX/tree/master/input/Wikiner

To retrain the spaCy model, you'll need to create a train/dev split (I'll get mine online for direct comparison, but for now just take a random cut) and name the files with the .iob extension; see the sketch after the command below for one way to make that cut. Then use:

spacy convert -n 10 /path/to/file.iob /output/directory
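For the random cut, a minimal sketch along these lines should work. It assumes the aggregated wp3 file keeps one sentence per line with whitespace-separated token|POS|IOB-tag items (the layout spaCy's .iob converter reads); the file names and the 90/10 ratio are just placeholders:

import random

# Hypothetical paths: point SOURCE at the downloaded aggregated wp3 file.
SOURCE = "aij-wikiner-en-wp3"
TRAIN_OUT = "train.iob"
DEV_OUT = "dev.iob"
DEV_FRACTION = 0.1  # size of the random dev cut
random.seed(0)

with open(SOURCE, encoding="utf-8") as f:
    # Each non-empty line is one sentence of token|POS|IOB-tag items.
    sentences = [line for line in f if line.strip()]

random.shuffle(sentences)
split = int(len(sentences) * DEV_FRACTION)

with open(DEV_OUT, "w", encoding="utf-8") as dev_file:
    dev_file.writelines(sentences[:split])
with open(TRAIN_OUT, "w", encoding="utf-8") as train_file:
    train_file.writelines(sentences[split:])

Run the spacy convert command once for each of the two resulting .iob files.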

The -n 10 argument is important for use in spaCy: it concatenates sentences into 'pseudo-paragraphs' of 10 sentences each. This lets the model learn that documents can contain more than one sentence.
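If you want to check the effect of -n 10, a small sketch like the following can inspect the converted output. It assumes spaCy 2.x's JSON training format (a list of documents, each holding paragraphs, each paragraph holding sentences), and the output path is hypothetical, so adjust it to whatever spacy convert actually wrote:

import json

# Hypothetical path: the JSON file that `spacy convert` produced for train.iob.
with open("/output/directory/train.json", encoding="utf-8") as f:
    docs = json.load(f)

# Count how many sentences each converted document contains.
for doc in docs[:3]:
    n_sents = sum(len(par["sentences"]) for par in doc["paragraphs"])
    print("document", doc.get("id"), "->", n_sents, "sentences")  # expect 10 with -n 10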
