How do I prepare to use entire wikipedia for natural language processing?

Problem Description

I am a bit new here. I have a project where I have to download and use Wikipedia for NLP. The questions I am facing are as follows: I only have 12 GB of RAM, but the English wiki dump is over 15 GB compressed. Does this limit my ability to process the wiki? I do not need any pictures from the wiki. Do I need to uncompress the dump before processing? Can someone tell me the steps required, or point me to related content? Thanks in advance.

Solution

The easiest way to process a Wikipedia dump is to rely on the kiwix.org dumps, which you can find at: https://wiki.kiwix.org/wiki/Content_in_all_languages

Then, using Python, you can do the following:

% wget http://download.kiwix.org/zim/wiktionary_eo_all_nopic.zim
...
% pip install --user libzim
% ipython
In [2]: from libzim.reader import File

In [3]: total = 0
   ...:
   ...: with File("wiktionary_eo_all_nopic.zim") as reader:
   ...:     for uid in range(0, reader.article_count):
   ...:         page = reader.get_article_by_id(uid)
   ...:         total += len(page.content)
   ...: print(total)

This is simplistic processing, but it should give you the idea and get you started. In particular, as of 2020, the raw Wikipedia dumps in wikimarkup are very difficult to process, in the sense that you cannot convert wikimarkup to HTML (infoboxes included) without a full MediaWiki setup. There is also the REST API, but why struggle when the work is already done :)
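
For reference, here is a minimal sketch of the REST API route mentioned above: the Wikimedia REST API can serve pre-rendered (Parsoid) HTML for a single page, so you never touch wikimarkup yourself. The endpoint path, the requests library and the function below are assumptions for illustration, not part of the original answer, and pulling the whole wiki this way would hammer the API, so prefer the ZIM dump for whole-wiki work.

# Minimal sketch (assumptions: endpoint path and requests library); fetches the
# pre-rendered HTML of a single article instead of converting wikimarkup locally.
import requests

def fetch_page_html(title, lang="en"):
    # /page/html/{title} returns the Parsoid-rendered HTML of the article
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{title}"
    response = requests.get(url, headers={"User-Agent": "nlp-prep-example/0.1"})
    response.raise_for_status()
    return response.text

html = fetch_page_html("Natural_language_processing")
print(len(html))  # size of the rendered HTML, just to confirm it worked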

Regarding where to store the data after processing, I think the industry standard is PostgreSQL or Elasticsearch (which also requires a lot of memory), but I really like hoply and, more generally, OKVS (ordered key-value stores).
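
To illustrate the PostgreSQL option, a minimal sketch could look like the following; the table schema, connection string and psycopg2 driver are my own assumptions for the example, and the attribute names in the final comment depend on the ZIM reader you use.

# Minimal sketch (assumptions: psycopg2 driver, table schema, connection string);
# persists extracted page text into PostgreSQL so it can be queried later.
import psycopg2

conn = psycopg2.connect("dbname=wiki user=wiki password=wiki host=localhost")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id    BIGINT PRIMARY KEY,  -- article uid from the ZIM loop above
        title TEXT,
        body  TEXT                 -- text or HTML extracted from the article
    )
""")

def store_page(uid, title, body):
    # Upsert so the loader can be re-run without duplicate-key errors
    cur.execute(
        "INSERT INTO pages (id, title, body) VALUES (%s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title, body = EXCLUDED.body",
        (uid, title, body),
    )

# e.g. inside the ZIM loop above (exact attribute names may differ):
# store_page(uid, page.title, page.content)
conn.commit()
cur.close()
conn.close()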
