How to split large wikipedia dump .xml.bz2 files in Python?


Question

I am trying to build an offline wiktionary from the Wikimedia dump files (.xml.bz2) using Python. I started with this article as a guide. It involves a number of languages, and I wanted to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hurdle now is to efficiently split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.

I know that the bz2 library exists in Python, but it provides only compress and decompress operations. What I need is something like what bz2recover does from the command line, which splits a large file into a number of smaller chunks.
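For reference, a minimal sketch of what the standard bz2 module does offer: compressed streams that can be read and written much like ordinary files (the file names below are just placeholders):

import bz2

# Read a compressed dump line by line without unpacking it to disk;
# iterating a BZ2File yields raw bytes for each line.
with bz2.BZ2File('pages-articles.xml.bz2') as src:
    for line in src:
        pass  # process each line here

# Writing works the same way: bytes go in, a .bz2 file comes out.
with bz2.BZ2File('out.xml.bz2', 'w') as dst:
    dst.write(b'<page>...</page>\n')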

One more important point: the splitting shouldn't break up page contents, which start with <page> and end with </page> in the compressed XML document.

Is there an existing library that can handle this situation, or does the code have to be written from scratch? (Any outline/pseudo-code would be greatly helpful.)

Note: I would like to make the resulting package cross-platform compatible, hence I can't use OS-specific commands.

Answer

In the end I wrote a Python script myself:

import os
import bz2

def split_xml(filename):
    '''Takes the file name of a wiktionary .xml.bz2 dump as input and writes
    smaller compressed chunks of it into the directory "chunks",
    never splitting inside a <page>...</page> element.
    '''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Helper that builds the chunk file name for a given counter value
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    # Open the first chunk file in write mode
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read the dump line by line (iterating a BZ2File yields bytes)
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # the closing </page> tag marks the end of a wiki page
        if b'</page>' in line:
            pagecount += 1
        # start a new chunk after every 2000 pages
        if pagecount > 1999:
            # print(chunkname(filecount))  # for debugging
            chunkfile.close()
            pagecount = 0   # reset the page counter
            filecount += 1  # move on to the next chunk file
            chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Close the last, possibly partial, chunk
    chunkfile.close()

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
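As a follow-up to the script above (not part of the original answer), here is a rough sketch of how the generated chunks might be scanned during a search. The helper name find_page and the title-matching logic are assumptions; only the chunks/ layout comes from split_xml():

import os
import bz2
import xml.etree.ElementTree as ET

def find_page(title, chunkdir='chunks'):
    '''Scan every chunk produced by split_xml() and return the wikitext of the
    first <page> whose <title> matches. Assumes each chunk contains complete,
    well-formed <page> elements, as the splitter above guarantees.'''
    for name in sorted(os.listdir(chunkdir)):
        if not name.endswith('.xml.bz2'):
            continue
        with bz2.BZ2File(os.path.join(chunkdir, name)) as chunk:
            buf = []
            inside = False
            for line in chunk:
                if b'<page>' in line:
                    inside = True
                if inside:
                    buf.append(line)
                if inside and b'</page>' in line:
                    # parse the buffered <page>...</page> fragment
                    page = ET.fromstring(b''.join(buf))
                    if page.findtext('title') == title:
                        return page.findtext('revision/text')
                    buf = []
                    inside = False
    return None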
