Python XML Parsing Algorithm Speed


Problem description

I'm currently parsing a large XML file of the following form in a python-flask webapp on heroku:

<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>

The code I use to parse and analyze it, and to display it through Flask, is the following:

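from flask import Flask, render_template, session
from lxml import etree

app = Flask(__name__)  # assumed; the original snippet does not show how app is created

# Parse the whole file once at import time; recover=True tolerates malformed markup.
file = open("books/filename.xml")
parser = etree.XMLParser(recover=True)
tree = etree.parse(file, parser)
root = tree.getroot()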

def getChapter(volume, chapter):
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter-1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data is None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    # Map 1-based positions to volume names without shadowing the built-in list and dict.
    return dict(enumerate(volumeList(), start=1))

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

The issue is that whenever Flask renders a volume with many chapters (roughly whenever chapterCount(session['volume']) exceeds 50), loading and processing the page takes a very long time. By comparison, when the app loads a volume with fewer than 10 to 15 chapters, loading is almost instantaneous, even in the live webapp. Is there a good way to optimize this and improve speed and performance? Thanks a lot!

PS: For reference, this is my old getChapter function. I stopped using it because I don't want to refer to an individual `li` in the code and want the code to work with any generic XML file. It was considerably faster than the current getChapter function, though:

def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()'%(volume,chapter))
    if data == []:
        data = None
    return data

Thanks a lot!

Solution

Have you heard about BeautifulSoup?

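BeautifulSoup does the tedious work of parsing XML for you. With the "xml" parser it hands the actual parsing to lxml, which is implemented in C.

I expect this to be much faster (and much more readable), mainly because counting chapters no longer means fetching every chapter: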

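from bs4 import BeautifulSoup

filename = "test.xml"
# The "xml" parser requires lxml to be installed; parse the file once at startup.
with open(filename) as f:
    soup = BeautifulSoup(f, "xml")

def chapterCount(volume_name):
    volume = soup.find("volume", attrs={"name": volume_name})
    # recursive=False looks only at the volume's direct children.
    return len(volume.find_all("chapter", recursive=False))

def getChapter(volume_name, chapter_number):
    volume = soup.find("volume", attrs={"name": volume_name})
    # Attribute values are strings, so compare against str(chapter_number).
    chapter = volume.find("chapter", attrs={"n": str(chapter_number)})
    # Child elements only: whitespace between tags is skipped and no tag name is hard-coded.
    items = chapter.find_all(True, recursive=False)
    return "\n".join(item.get_text() for item in items)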


# from now on, it's the same as your original code

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

Not only will getChapter be faster; the main point is that chapterCount no longer has to call it once per chapter in order to count the chapters in a volume. The two functions are now completely independent of each other.
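To make the cost of the original approach concrete, the sketch below counts how many times the whole tree gets scanned while counting chapters. It is not the asker's code: it re-creates the same loop structure with the standard library's xml.etree.ElementTree standing in for lxml, and make_book, volume_list, and scans are names made up for illustration.

```python
import xml.etree.ElementTree as ET


def make_book(chapters, items):
    """Build a one-volume book with the given number of chapters and items."""
    book = ET.Element("book", name="bookname")
    volume = ET.SubElement(book, "volume", n="1", name="v1")
    for c in range(chapters):
        chapter = ET.SubElement(volume, "chapter", n=str(c + 1))
        for i in range(items):
            ET.SubElement(chapter, "li", n=str(i + 1)).text = "text"
    return book


root = make_book(chapters=50, items=20)
scans = 0  # how many times the whole tree was walked


def volume_list():
    """Stand-in for tree.xpath('//volume/@name'): walks every element."""
    global scans
    scans += 1
    return [v.get("name") for v in root.iter("volume")]


def get_chapter(volume, chapter):
    """Same shape as the question's getChapter: one full scan per item."""
    data = []
    i = 0
    while True:
        try:
            data.append(root[volume_list().index(volume)][chapter - 1][i].text)
        except IndexError:
            break
        i += 1
    return data or None


def chapter_count(volume):
    """Same shape as the question's chapterCount: fetch chapters until one is missing."""
    count = 0
    while get_chapter(volume, count + 1) is not None:
        count += 1
    return count


count = chapter_count("v1")
print(count, scans)  # prints: 50 1051
```

With 50 chapters of 20 items each, every chapter costs 21 full scans (20 items plus the failing lookup), and the final missing chapter costs one more. The work grows with chapters times items, and each scan itself grows with the size of the file.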

Results from both functions:

>>> print(chapterCount("volume1name"))
2

>>> print(getChapter("volume1name", 2))
li 1 content
li 2 content

EDIT:

I asked a separate question to see whether there is a faster way to count the chapters. Update: the answer is to pass recursive=False to find_all, so that BS searches only the direct children of the element instead of its entire subtree. Alternatively, use lxml directly.
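If you go without BeautifulSoup, one option is to walk the tree once at startup and keep the result in a dictionary, so that each request is a plain lookup. This is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree so it runs without third-party packages (lxml.etree offers the same findall and get calls used here), the sample document is inlined, and build_index, INDEX, chapter_count, and get_chapter are hypothetical names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li><li n="2">li 2 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li><li n="2">li 2 content</li></chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
  </volume>
</book>"""


def build_index(root):
    """Map each volume name to a list of chapters, each a list of item texts."""
    index = {}
    for volume in root.findall("volume"):
        chapters = []
        for chapter in volume.findall("chapter"):
            # Iterating an element yields its direct children, whatever their tag,
            # so nothing here depends on the items being called "li".
            chapters.append([child.text for child in chapter])
        index[volume.get("name")] = chapters
    return index


# Built once, at import time, instead of on every request.
INDEX = build_index(ET.fromstring(SAMPLE))


def chapter_count(volume_name):
    """Number of chapters in a volume; 0 for an unknown volume."""
    return len(INDEX.get(volume_name, []))


def get_chapter(volume_name, chapter_number):
    """Item texts of the chapter at a 1-based position, or None if it does not exist."""
    chapters = INDEX.get(volume_name, [])
    if 1 <= chapter_number <= len(chapters):
        return chapters[chapter_number - 1]
    return None


print(chapter_count("volume1name"))
print(get_chapter("volume1name", 2))
```

Like the question's code, this addresses chapters by position rather than by their n attribute. The trade-off is memory: the index holds all item text, which is usually acceptable when the parsed tree is already kept in memory anyway.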

EDIT:

I just noticed that you call render_default_values in your view. You shouldn't do that, or at least you should give the function a different name, because "render default values" means... well, render default values.

Letting this function render something else depending on a global variable (session) is not very Pythonic and can lead to spaghetti code (hard-to-trace bugs, etc.).
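One way to avoid the hidden dependency on session is to pass the volume and chapter in as arguments and have a small helper return the template variables. The sketch below is framework-free so it can run on its own; chapter_context and BOOK are hypothetical names, and BOOK is a hard-coded stand-in for whatever parsed structure the app really uses.

```python
# Hypothetical in-memory book: volume name -> list of chapters -> list of item texts.
BOOK = {
    "volume1name": [
        ["li 1 content", "li 2 content"],
        ["li 1 content", "li 2 content"],
    ],
}


def chapter_context(book, volume, chapter):
    """Build the template variables from explicit arguments only.

    Returns None when the volume or chapter does not exist, so the caller
    decides how to report the error.
    """
    chapters = book.get(volume, [])
    if not 1 <= chapter <= len(chapters):
        return None
    return {
        "volume": volume,
        "chapter": chapters[chapter - 1],
        "count": len(chapters),
    }


# In the Flask view this would look roughly like:
#
#   @app.route('/<volume>/<int:chapter>')
#   def goto(volume, chapter):
#       context = chapter_context(BOOK, volume, chapter)
#       if context is None:
#           abort(404)
#       return render_template("index.html", **context)

print(chapter_context(BOOK, "volume1name", 2))
```

Because the helper depends only on its arguments, it can be tested without a request context, and the view no longer needs to write to session just to pass values along.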
