Does WikiCorpus from the gensim library work on an Arabic Wikipedia dump?


Question

I have seen code that uses WikiCorpus on an Arabic Wikipedia dump, and I know the process takes a long time to run. I also searched for the warning I get when executing it, which says:

UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

The answers said that it is fine, nothing serious, just a warning. But after waiting about 3 days with no response, I started wondering whether it really works on the Arabic dump file, or whether I have to do some kind of pre-processing before passing the Arabic dump file to the WikiCorpus object. The data size is about 989.6 MB. I surrounded the WikiCorpus line with two print calls to see when it starts and when it finishes executing, like this:

print('start WikiCorpus')
wiki = WikiCorpus(self.in_f)
print('finish WikiCorpus')

where self.in_f is the Arabic Wikipedia dump, like this: (/the path where the file located/arwiki-20200201-pages-articles.xml.bz2). The second print call was never reached during the run.

Recommended answer

It should work, especially if Arabic has clear word delimiters (like spaces between words).
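To illustrate the point about word delimiters, here is a minimal sketch that splits a short Arabic sentence into words with a Unicode-aware regular expression. It is a simplified stand-in, not gensim's actual tokenizer; the names `WORD_RE` and `simple_tokenize` and the sample sentence are made up for this example.

```python
import re

# Runs of Unicode word characters, excluding digits and the underscore.
# Python 3 string patterns are Unicode-aware by default, so Arabic
# letters count as word characters.
WORD_RE = re.compile(r'[^\W\d_]+')


def simple_tokenize(text):
    """Return the word-like runs found in text (hypothetical helper)."""
    return WORD_RE.findall(text)


sample = 'اللغة العربية لغة جميلة'
tokens = simple_tokenize(sample)
print(len(tokens), tokens)
```

Because the words are separated by plain spaces, a whitespace- or regex-based tokenizer needs no Arabic-specific pre-processing for text like this.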

However, lots of things are harder on Windows, given that gensim and most related Python data-science libraries get more development, testing, and use elsewhere, and there are some Windows-specific oddities with multiprocessing. If you have the option of working on another OS, that can make things easier.

There was another recent question describing a similar problem with an en dump and WikiCorpus. There are ideas of things to check in my answer there, though it's unclear whether the asker ever resolved the problem.

Also, when using code that relies on Python multiprocessing on Windows, it may be especially necessary to put your code in a 'main' block that won't be re-run if your file is re-imported by other processes, and to call the Windows-specific freeze_support() function. See some recent discussion of a related matter on the gensim project list.
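The 'main' block advice could look roughly like the sketch below. It assumes gensim is installed and that the dump path is passed on the command line; the function names `build_corpus` and `main` and the script name in the usage message are hypothetical, not taken from the asker's code.

```python
import multiprocessing
import sys


def build_corpus(dump_path):
    """Construct a WikiCorpus from a dump file (assumes gensim is installed)."""
    # Imported inside the function so that merely importing this module,
    # as a spawned worker process may do on Windows, stays cheap.
    from gensim.corpora import WikiCorpus

    print('start WikiCorpus')
    # With default arguments the constructor builds a dictionary, which
    # takes a full pass over the dump and can be very slow.
    wiki = WikiCorpus(dump_path)
    print('finish WikiCorpus')
    return wiki


def main(argv):
    """Hypothetical entry point: expects the dump path as the only argument."""
    if len(argv) != 2:
        print('usage: python build_wiki.py <dump.xml.bz2>')
        return 1
    build_corpus(argv[1])
    return 0


if __name__ == '__main__':
    # Runs only when the file is executed directly, not when it is
    # re-imported by worker processes.
    multiprocessing.freeze_support()  # has no effect outside Windows
    main(sys.argv)
```

Keeping all top-level work behind the `__name__` check means a re-import by a child process only defines the functions and does not start another corpus build.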
