Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

Problem Description

I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.

It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector.

That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec would represent them as a single vector. So far so good.

The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion using:

...
from gensim.corpora import WikiCorpus

# WikiCorpus applies gensim's default preprocessing to each article:
# tokenization, lowercasing, and removal of punctuation and short tokens.
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    # get_texts() yields each article as a list of preprocessed tokens
    article = " ".join(text) + "\n"
    output.write(article)
...

This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line in the resulting output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, since I guess it relies on punctuation and capitalization for NER (?).

Does anyone know whether I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still parses the Wikipedia articles from the compressed Wikipedia dump straight into text? Or does anyone know a better way to accomplish this? Thanks in advance!
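
For reference, here is a minimal sketch of the kind of behaviour being asked for. It assumes a gensim version whose WikiCorpus accepts the tokenizer_func and lower keyword arguments (an assumption about the installed release); the pass-through tokenizer is a hypothetical illustration that keeps case and punctuation by splitting on whitespace only:

from gensim.corpora import WikiCorpus

def raw_tokenizer(content, token_min_len, token_max_len, lower):
    # Hypothetical pass-through tokenizer: keep case and punctuation,
    # split on whitespace only. The signature matches what WikiCorpus
    # passes to its tokenizer_func.
    return content.split()

# wiki_bz2_file: path to the enwiki-*-pages-articles.xml.bz2 dump
wiki = WikiCorpus(wiki_bz2_file, dictionary={},
                  tokenizer_func=raw_tokenizer, lower=False)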

Recommended Answer

You can use a gensim word2vec pretrained model in spaCy, but the problem here is your processing pipeline's order:

  1. You pass the texts to gensim
  2. Gensim parses and tokenizes the strings
  3. You normalize the tokens
  4. You pass the tokens back to spaCy
  5. You create a w2v corpus (with spaCy) (?)

That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU
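
To see the effect, a short check with a pretrained English model (en_core_web_sm here is just an example; the exact entities found will vary with the model):

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained English model

raw = nlp("Barack Obama visited North Carolina.")
flat = nlp("barack obama visited north carolina")

print([(ent.text, ent.label_) for ent in raw.ents])
# e.g. [('Barack Obama', 'PERSON'), ('North Carolina', 'GPE')]
print([(ent.text, ent.label_) for ent in flat.ents])
# typically far fewer (often no) entities once case and punctuation are gone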

What you probably want to do is:

  1. You pass the texts to spaCy
  2. spaCy parses them with NER
  3. spaCy tokenizes them accordingly, keeping entities as one token (see the sketch after this list)
  4. you load the gensim w2v model with spacy.load()
  5. you use the loaded model to create the w2v corpus in spaCy
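
A minimal sketch of steps 1-3, assuming spaCy 2.x and a pretrained English pipeline (en_core_web_sm is just an example): each recognized entity is merged into a single token, and internal spaces are replaced with underscores so that e.g. "North Carolina" survives as one word2vec token:

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained English pipeline

def entities_as_single_tokens(text):
    doc = nlp(text)
    # Merge each named-entity span into one token (retokenizer API)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    # Replace internal spaces: "North Carolina" -> "North_Carolina"
    return [tok.text.replace(" ", "_") for tok in doc]

print(entities_as_single_tokens("I moved to North Carolina last year."))
# ['I', 'moved', 'to', 'North_Carolina', 'last', 'year', '.']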

All you need to do is download the model from gensim and tell spaCy to look for it from the command line:

  1. wget [model URL]
  2. python -m spacy init-model [options] [the file you just downloaded]

Here is the command line documentation for init-model: https://spacy.io/api/cli#init-model

Then load it just like en_core_web_md. You can use .txt, .zip or .tgz models.
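
Put together, a hedged end-to-end sketch (the vector file name and the output directory below are placeholders; init-model refers to the spaCy 2.x CLI linked above):

# Convert the downloaded gensim vectors into a loadable spaCy model:
#   python -m spacy init-model en ./my_w2v_model --vectors-loc word2vec.txt.gz

import spacy

nlp = spacy.load("./my_w2v_model")    # load it like any packaged model
print(nlp.vocab["house"].vector[:5])  # the word2vec vectors are now available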
