How to get the Wikipedia corpus text with punctuation by using gensim WikiCorpus?


Question


I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, WikiCorpus retrieves only the text. After searching the web I found these pages:

  1. A page from the gensim GitHub issues section. Someone asked this question, and the answer (by Piskvorky) was to subclass WikiCorpus. Luckily, the same page included code implementing the suggested 'subclass' solution, provided by Rhazegh. (link)
  2. A page from Stack Overflow titled "Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus". However, no clear answer was provided there, and the question was treated in the context of spaCy. (link)


I decided to use the code provided on page 1. My current code (mywikicorpus.py):

import sys
import os
sys.path.append('C:\\Users\\Ghaliamus\\Anaconda2\\envs\\wiki\\Lib\\site-packages\\gensim\\corpora\\')

from wikicorpus import *

def tokenize(content):
    # override original method in wikicorpus.py
    return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
        if len(token) <= 15 and not token.startswith('_')]

def process_article(args):
    # override original method in wikicorpus.py
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
        result = tokenize(text)
    return result, title, pageid


class MyWikiCorpus(WikiCorpus):
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)

    def get_texts(self):
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for tokens, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(tokens)
                if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                articles += 1
                positions += len(tokens)
                if self.metadata:
                    yield (tokens, (pageid, title))
                else:
                    yield tokens
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length


Then I used another piece of code by Pan Yang (link). This code initializes a WikiCorpus object and retrieves the text. The only change in my current code is initializing MyWikiCorpus instead of WikiCorpus. The code (process_wiki.py):

from __future__ import print_function
import logging
import os.path
import six
import sys
import mywikicorpus as myModule



if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki-20180601-pages-articles.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = myModule.MyWikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")


Through the command line I ran the process_wiki.py code. I got the text of the corpus, with this last line in the command prompt:


(2018-06-05 09:18:16,480: INFO: Finished Saved 4526191 articles)


When I read the file in Python, I checked the first article and found it was without punctuation. Example:


(anarchism is a political philosophy that advocates self governed societies based on voluntary institutions these are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical or free associations anarchism holds the state to be undesirable unnecessary and harmful while opposition to the state is central anarchism specifically entails opposing authority or hierarchical)


I have two related questions, and I hope you can help me with them, please:

  1. Is there anything wrong with the pipeline I reported above?
  2. Regardless of that pipeline, if I opened the gensim wikicorpus Python code (wikicorpus.py) and wanted to edit it, which line should I add, remove, or update (and with what, if possible) to get the same results but with punctuation?


Many thanks for your time reading this long post.

Best wishes,

Ghaliamus

Answer


The problem lies in your defined tokenize function:

def tokenize(content):
    return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


The function utils.tokenize(content, lower=True, errors='ignore') simply tokenizes the article into a list of tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores punctuation.


For example, when you call utils.tokenize("I love eating banana, apple") it returns ["i", "love", "eating", "banana", "apple"] (with lower=True, the tokens are also lowercased).
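To see why the punctuation disappears, here is a minimal pure-Python sketch. The regex below is a simplified stand-in for gensim's internal alphabetic token pattern, not the library's exact implementation: it only ever matches runs of letters, so punctuation characters fall between matches and are never emitted as tokens.

```python
import re

# Simplified stand-in for gensim's alphabetic token pattern (assumption:
# not the exact regex gensim uses). Only runs of letters can match, so
# commas, periods, etc. are silently dropped.
PAT_ALPHABETIC = re.compile(r"[A-Za-z]+")

def tokenize_like_gensim(content, lower=True):
    # Collect only the alphabetic runs; punctuation never appears in the output.
    tokens = PAT_ALPHABETIC.findall(content)
    return [t.lower() for t in tokens] if lower else tokens

print(tokenize_like_gensim("I love eating banana, apple"))
# ['i', 'love', 'eating', 'banana', 'apple']
```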


Anyway, you can define your own tokenize function as follows to retain punctuation:

def tokenize(content):
    # override original method in wikicorpus.py
    return [token.encode('utf8') for token in content.split()
            if len(token) <= 15 and not token.startswith('_')]
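One caveat with content.split(): punctuation stays attached to the neighboring word (e.g. "banana," becomes a single token). If you would rather have punctuation marks as separate tokens, a minimal alternative sketch (a hypothetical helper, not part of gensim's API) can match word runs and single punctuation characters separately:

```python
import re

def tokenize_keep_punct(content):
    # Hypothetical alternative to the split()-based tokenize above:
    # \w+ matches runs of word characters, [^\w\s] matches any single
    # punctuation character, so each mark becomes its own token.
    return [token.encode('utf8') for token in re.findall(r"\w+|[^\w\s]", content.lower())
            if len(token) <= 15 and not token.startswith('_')]

print(tokenize_keep_punct("I love eating banana, apple."))
# [b'i', b'love', b'eating', b'banana', b',', b'apple', b'.']
```

Whether attached or separate punctuation tokens work better for a doc2vec model is a design choice worth testing on your own corpus.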

