How can I process XML asynchronously in Python?


Question

I have a large XML data file (>160M) to process, and it seems like SAX/expat/pulldom parsing is the way to go. I'd like to have one thread that sifts through the nodes and pushes the nodes to be processed onto a queue, while other worker threads pull the next available node off the queue and process it.

I have the following (it should have locks, I know; it will, later):

import sys, time
import xml.parsers.expat
import threading

q = []

def start_handler(name, attrs):
    q.append(name)

def do_expat():
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_handler
    p.buffer_text = True
    print("opening {0}".format(sys.argv[1]))
    with open(sys.argv[1]) as f:
        print("file is open")
        p.ParseFile(f)
        print("parsing complete")


t = threading.Thread(group=None, target=do_expat)
t.start()

while True:
    print(q)
    time.sleep(1)

The problem is that the body of the while block gets called only once, and then I can't even interrupt it with Ctrl-C. On smaller files the output is as expected, but that seems to indicate that the handler only gets called once the document is fully parsed, which seems to defeat the purpose of a SAX parser.

I'm sure it's my own ignorance, but I don't see where I'm making the mistake.

PS: I also tried changing start_handler thus:

def start_handler(name, attrs):
    def app():
        q.append(name)
    u = threading.Thread(group=None, target=app)
    u.start()

No love, though.

Recommended answer

ParseFile, as you've noticed, just "gulps down" everything, which is no good for the incremental parsing you want to do! So, just feed the file to the parser a bit at a time, making sure to yield control to other threads as you go, e.g.:

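while True:
  data = f.read(BUFSIZE)   # BUFSIZE: a chunk size of your choosing
  if not data:
    p.Parse('', True)      # final call tells expat the document is complete
    break
  p.Parse(data, False)     # handlers fire as each chunk is parsed
  time.sleep(0.0)          # let other threads run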

The time.sleep(0.0) call is Python's way to say "yield to other threads if any are ready and waiting"; the Parse method is documented in the xml.parsers.expat module documentation.
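To see the chunked loop run end to end, here is a minimal self-contained sketch. The helper name parse_in_chunks, the on_start callback parameter and the 64 KiB default for BUFSIZE are illustrative assumptions, not part of the original answer; the stream is opened in binary mode so that expat receives bytes.

```python
import io
import time
import xml.parsers.expat

BUFSIZE = 64 * 1024  # assumed chunk size; tune it for your data


def parse_in_chunks(stream, on_start, bufsize=BUFSIZE):
    """Feed a binary stream to expat one chunk at a time (hypothetical helper)."""
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = on_start
    while True:
        data = stream.read(bufsize)
        if not data:
            p.Parse(b"", True)   # final call: tell expat the document is complete
            break
        p.Parse(data, False)     # handlers fire for whatever this chunk completes
        time.sleep(0.0)          # give other threads a chance to run


# Tiny in-memory document; the 8-byte chunk size deliberately splits tags
# across chunks to show that expat copes with partial input.
names = []
parse_in_chunks(io.BytesIO(b"<root><a/><b x='1'/></root>"),
                lambda name, attrs: names.append(name),
                bufsize=8)
print(names)
```

For a real file, something like `with open(path, "rb") as f: parse_in_chunks(f, start_handler)` inside the parsing thread should behave the same way.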

The second point is, forget locks for this usage! Use Queue.Queue instead (the module is named queue in Python 3); it's intrinsically thread-safe and almost invariably the best and simplest way to coordinate multiple threads in Python. Just make a Queue instance q, have the handler call q.put(name) on it, and have the worker threads block on q.get(), waiting for more work to do. It's SO simple!
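A rough sketch of that queue pattern, using the Python 3 module name queue. The worker body (upper-casing each name), the pool size of three and the results list are placeholders invented for illustration; the list is the workers' own shared output, so it gets a small lock, while the queue itself needs none.

```python
import queue
import threading

q = queue.Queue()                 # thread-safe: no explicit locking around put/get
results = []
results_lock = threading.Lock()   # protects the workers' shared output list only


def worker():
    while True:
        name = q.get()            # blocks until the producer puts something
        with results_lock:
            results.append(name.upper())   # placeholder for real processing
        q.task_done()             # lets q.join() know this item is finished


# Daemon threads, so idle workers never keep the process alive.
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

# In the real program the expat start handler would call q.put(name).
for name in ["root", "a", "b"]:
    q.put(name)

q.join()                          # wait until every queued item was processed
print(sorted(results))
```

Calling q.join() in the main thread is one way to make sure all queued work has finished before the program exits.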

(There are several auxiliary strategies you can use to coordinate the termination of worker threads when there's no more work for them to do, but the simplest, absent special requirements, is to just make them daemon threads, so they will all terminate when the main thread does; see the threading docs.)
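The answer does not say which auxiliary strategies it has in mind. One common option, shown here purely as an assumption, is a sentinel object: the producer queues one sentinel per worker after the real work, and each worker exits when it receives one. The names STOP, NUM_WORKERS and seen are hypothetical.

```python
import queue
import threading

STOP = object()        # hypothetical sentinel meaning "no more work"
NUM_WORKERS = 2
q = queue.Queue()
seen = []
seen_lock = threading.Lock()


def worker():
    while True:
        item = q.get()
        if item is STOP:   # each worker consumes exactly one sentinel, then exits
            break
        with seen_lock:
            seen.append(item)


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for name in ["root", "a", "b"]:
    q.put(name)
for _ in threads:
    q.put(STOP)            # one sentinel per worker, queued after the real work

for t in threads:
    t.join()               # non-daemon threads: wait for a clean shutdown
print(sorted(seen))
```

Compared with daemon threads, this lets every worker finish its current item and exit cleanly, at the cost of a little extra bookkeeping.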
