How to get intermediate results in IPython parallel processing?


Problem description


My work involves processing lots of XML files; to get faster results I want to use IPython's parallel processing. Below is my sample code, in which I just count the number of elements in an XML/XSD file with the cElementTree module.

>>> from IPython.parallel import Client
>>> import os
>>> c = Client()
>>> c.ids
>>> lview = c.load_balanced_view()
>>> lview.block =True
>>> def return_len(xml_filepath):
        import xml.etree.cElementTree as cElementTree
        tree = cElementTree.parse(xml_filepath)
        my_count=0
        file_result=[]
        cdict={}
        for elem in tree.getiterator():
            cdict[my_count]={}
            if elem.tag:
                cdict[my_count]['tag']=elem.tag
            if elem.text:
                cdict[my_count]['text']=(elem.text).strip()
            if elem.attrib.items():
                cdict[my_count]['xmlattb']={}
                for key, value in elem.attrib.items():
                    cdict[my_count]['xmlattb'][key]=value
            if list(elem):
                cdict[my_count]['xmlinfo']=len(list(elem))
            if elem.tail:
                cdict[my_count]['tail']=elem.tail.strip()
            my_count+=1
        output=xml_filepath.split('\\')[-1],len(cdict)
        return output
        ## return cdict
>>> def get_dir_list(target_dir, *extensions):
        """
        This function will filter out the files from given dir based on their extensions
        """
        my_paths=[]
        for top, dirs, files in os.walk(target_dir):
            for nm in files:
                fileStats = os.stat(os.path.join(top, nm))
                if nm.split('.')[-1] in extensions:
                    my_paths.append(top+'\\'+nm)
        return my_paths
>>> r=lview.map_async(return_len,get_dir_list('C:\\test_folder','xsd','xml'))


To get the final result I have to do >>> r.get(); this gives me the result only once the whole process has completed.

My question is: can I get the intermediate results as they finish?
For example, if I have applied my work to a folder containing 1000 XML/XSD files, can I get each result immediately after that particular file has been processed? Like: 1st file is done --> show its result... 2nd file is done --> show its result... 1000th file is done --> show its result. Not like the current behavior above, which waits until the final file is finished and then shows the complete result for all 1000 files.
Also, to deal with import/namespace errors I have defined the import inside the return_len function; is there a better way to deal with that?

Recommended answer


Sure. AsyncMapResult (the type returned by map_async) is iterable immediately, and the items yielded by iteration are the same as the list ultimately produced by r.get(). So after you do:

amr = lview.map_async(return_len, get_dir_list('C:\\test_folder','xsd','xml'))

you can do:

for r in amr:
    print r

or keep the index with enumerate:

for i,r in enumerate(amr):
    print i, r 

or perform reductions with the reduce builtin:

summary_result = reduce(myfunc, amr)
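(A note on the reduce line: in Python 3, reduce is no longer a builtin and must be imported from functools. A minimal sketch under that assumption, where add_counts is a hypothetical stand-in for myfunc and a small list stands in for the AMR:)

```python
from functools import reduce

# Hypothetical combiner standing in for myfunc: it folds the
# (filename, element_count) pairs that return_len yields into a total.
def add_counts(total, item):
    filename, count = item
    return total + count

# A small list standing in for the amr iterable.
fake_amr = [("a.xml", 10), ("b.xsd", 25), ("c.xml", 7)]
summary_result = reduce(add_counts, fake_amr, 0)
print(summary_result)  # 42
```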


All of these will iterate through your results as they arrive. If you don't care about the ordering and the time per task varies significantly, you can pass map_async(..., ordered=False). If you do this, when you iterate through the AMR, you will get individual results on a first-come-first-served basis, rather than preserving the submission order.
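The first-come-first-served behavior can be hard to picture without a running cluster. As a rough standard-library analogue (not IPython.parallel itself), multiprocessing.Pool.imap_unordered yields each result the moment its task completes, much like iterating an AMR created with ordered=False:

```python
from multiprocessing import Pool

def square(n):
    return n * n

def collect_unordered(values):
    # imap_unordered yields each result as soon as its task finishes,
    # in completion order rather than submission order.
    with Pool(4) as pool:
        return list(pool.imap_unordered(square, values))

if __name__ == "__main__":
    out = collect_unordered(range(10))
    # Same items as the ordered result; only the arrival order may differ.
    print(sorted(out))
```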

There is more information in the IPython docs.



also to deal with import/namespace error i have defined import inside of return_len function; is there any better way to deal with that?


Yes and no. There are a few ways to set up the engine namespace, such as using modules, the @parallel.require("module") decorator, or simply performing the import explicitly with %px import xml.etree.cElementTree as cElementTree, each of which has benefits in certain scenarios. But I often find putting imports inside the function to be the easiest way, with the fewest surprises.
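A minimal sketch of that pattern (using xml.etree.ElementTree, since the cElementTree alias was removed in Python 3.9): because the import statement lives inside the function body, the module gets imported on whichever engine executes the function, so the engine's namespace needs no prior setup:

```python
def count_elements(xml_text):
    # The import runs wherever the function runs, so a remote engine
    # needs no prior %px import or @require decoration.
    import xml.etree.ElementTree as ElementTree
    root = ElementTree.fromstring(xml_text)
    # root.iter() visits the root element plus all of its descendants.
    return sum(1 for _ in root.iter())

if __name__ == "__main__":
    print(count_elements("<a><b/><c>text</c></a>"))  # 3
```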

