Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

Problem Description

I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.

It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector.

That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec would represent them as a single vector. So far so good.

The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion using:

...
from gensim.corpora import WikiCorpus

# WikiCorpus applies gensim's default preprocessing to each article:
# tokenization, lowercasing, and removal of punctuation and short tokens.
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    # get_texts() yields each article as a list of preprocessed tokens
    article = " ".join(text) + "\n"
    output.write(article)
...

This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line in the resulting output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, since I guess it relies on punctuation and capitalization for NER (?).

Does anyone know whether I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still parses the Wikipedia articles from the compressed Wikipedia dump straight into text? Or does anyone know a better way to accomplish this? Thanks in advance!
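
For reference, here is a minimal sketch of the kind of behaviour being asked for. It assumes a gensim version whose WikiCorpus accepts the tokenizer_func and lower keyword arguments (an assumption about the installed release); the pass-through tokenizer is a hypothetical illustration that keeps case and punctuation by splitting on whitespace only:

from gensim.corpora import WikiCorpus

def raw_tokenizer(content, token_min_len, token_max_len, lower):
    # Hypothetical pass-through tokenizer: keep case and punctuation,
    # split on whitespace only. The signature matches what WikiCorpus
    # passes to its tokenizer_func.
    return content.split()

# wiki_bz2_file: path to the enwiki-*-pages-articles.xml.bz2 dump
wiki = WikiCorpus(wiki_bz2_file, dictionary={},
                  tokenizer_func=raw_tokenizer, lower=False)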

Recommended Answer

You can use a gensim word2vec pretrained model in spaCy, but the problem here is your processing pipeline's order:

  1. You pass the texts to gensim
  2. Gensim parses and tokenizes the strings
  3. You normalize the tokens
  4. You pass the tokens back to spaCy
  5. You create a w2v corpus (with spaCy) (?)

That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU
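
To see the effect, a short check with a pretrained English model (en_core_web_sm here is just an example; the exact entities found will vary with the model):

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained English model

raw = nlp("Barack Obama visited North Carolina.")
flat = nlp("barack obama visited north carolina")

print([(ent.text, ent.label_) for ent in raw.ents])
# e.g. [('Barack Obama', 'PERSON'), ('North Carolina', 'GPE')]
print([(ent.text, ent.label_) for ent in flat.ents])
# typically far fewer (often no) entities once case and punctuation are gone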

What you probably want to do is:

  1. You pass the texts to spaCy
  2. spaCy parses them with NER
  3. spaCy tokenizes them accordingly, keeping entities as one token (see the sketch after this list)
  4. you load the gensim w2v model with spacy.load()
  5. you use the loaded model to create the w2v corpus in spaCy
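
A minimal sketch of steps 1-3, assuming spaCy 2.x and a pretrained English pipeline (en_core_web_sm is just an example): each recognized entity is merged into a single token, and internal spaces are replaced with underscores so that e.g. "North Carolina" survives as one word2vec token:

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained English pipeline

def entities_as_single_tokens(text):
    doc = nlp(text)
    # Merge each named-entity span into one token (retokenizer API)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    # Replace internal spaces: "North Carolina" -> "North_Carolina"
    return [tok.text.replace(" ", "_") for tok in doc]

print(entities_as_single_tokens("I moved to North Carolina last year."))
# ['I', 'moved', 'to', 'North_Carolina', 'last', 'year', '.']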

All you need to do is download the model from gensim and tell spaCy to look for it from the command line:

  1. wget [model URL]
  2. python -m spacy init-model [options] [the file you just downloaded]

Here is the command line documentation for init-model: https://spacy.io/api/cli#init-model

Then load it just like en_core_web_md. You can use .txt, .zip or .tgz models.
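
Put together, a hedged end-to-end sketch (the vector file name and the output directory below are placeholders; init-model refers to the spaCy 2.x CLI linked above):

# Convert the downloaded gensim vectors into a loadable spaCy model:
#   python -m spacy init-model en ./my_w2v_model --vectors-loc word2vec.txt.gz

import spacy

nlp = spacy.load("./my_w2v_model")    # load it like any packaged model
print(nlp.vocab["house"].vector[:5])  # the word2vec vectors are now available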
