Reading 1000s of XML documents with BeautifulSoup

Problem description

I'm trying to read a bunch of XML files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file.

You can see a sample of the data here (warning: this will initiate the download of a 108 MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code:

from __future__ import print_function
from bs4 import BeautifulSoup  # To get everything
import os

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]

    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        f = open(full_filename, "r")
        xml = f.read()
        soup = BeautifulSoup(xml)
        del xml
        del soup
        f.close()

If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line, it works for a moment and then eventually craps out - it just crashes the Python kernel. I'm using Enthought Canopy, but I've tried running it from the command line and it craps out there, too.

I thought, perhaps, it was not deallocating the space for the variable "soup", so I tried adding the "del" commands. Same problem.

Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.

Recommended answer

Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.

Like this:

import xml.etree.cElementTree as cET
import os

# ...

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]

    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        parsed = cET.parse(full_filename)  # fast C parser, no Soup overhead
        del parsed
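
From there, finishing the renaming step just means pulling the number out of the parsed tree and calling os.rename(). A minimal sketch, assuming (hypothetically) that the number lives in a <doc-number> element; substitute whatever tag your files actually use:

import os
import xml.etree.cElementTree as cET

def rename_by_number(directory):
    for filename in os.listdir(directory):
        full_filename = os.path.join(directory, filename.strip())
        tree = cET.parse(full_filename)
        number = tree.find(".//doc-number")  # hypothetical tag name
        if number is not None and number.text:
            new_name = os.path.join(directory, number.text.strip() + ".xml")
            os.rename(full_filename, new_name)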

If your XML is formatted correctly, this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.
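
For streaming, the standard library's iterparse() reads a document incrementally and lets you throw elements away as soon as you've seen them, so the full tree never has to fit in memory. A minimal sketch, again assuming the hypothetical <doc-number> tag:

import xml.etree.cElementTree as cET

def find_number_streaming(full_filename):
    # iterparse fires an "end" event as each element finishes,
    # so we can inspect it and immediately release it.
    for event, elem in cET.iterparse(full_filename):
        if elem.tag == "doc-number":  # hypothetical tag name
            number = elem.text
            elem.clear()
            return number
        elem.clear()  # discard everything else to keep memory flat
    return None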
