GIL for IO bounded thread in C extension (HDF5)

Problem Description

I have a sampling application that acquires 250,000 samples per second, buffers them in memory, and eventually appends them to an HDFStore provided by pandas. In general, this works great. However, I have a thread that runs and continually empties the data acquisition device (DAQ), and it needs to run on a somewhat regular basis; a deviation of about a second tends to break things. Below is an extreme case of the timings observed. Start indicates a DAQ read starting, Finish is when it finishes, and IO indicates an HDF write (both DAQ and IO occur in separate threads).

Start        : 2016-04-07 12:28:22.241303
IO (1)       : 2016-04-07 12:28:22.241303
Finish       : 2016-04-07 12:28:46.573440 (0.16 Hz, 24331.26 ms)
IO Done (1)  : 2016-04-07 12:28:46.573440 (24332.39 ms)

As you can see, it takes 24 seconds to perform this write (a typical write is about 40 ms). The HDD I'm writing to is not under load, so this delay shouldn't be caused by contention (utilisation is around 7% while running). I have disabled indexing on my HDFStore writes. My application runs numerous other threads, all of which print status strings, so it seems the IO task is blocking all other threads. I've spent quite a bit of time stepping through code to figure out where things slow down, and it's always within a method provided by a C extension, which leads to my questions:

  1. Can Python (I'm using 3.5) preempt execution in a C extension? The question "Concurrency: Are Python extensions written in C/C++ affected by the Global Interpreter Lock?" seems to indicate that it can't unless the extension explicitly yields.
  2. Does Pandas' HDF5 C code implement any yielding for I/O? If so, does that mean the delay is due to a CPU-bound task? I have disabled indexing.
  3. Any suggestions for how I can get somewhat consistent timings? I'm thinking of moving the HDF5 code into another process. That only helps to a certain extent, though, as I can't really tolerate ~20-second writes anyway, especially when they're unpredictable.

Here's an example you can run to see the issue:

import pandas as pd
import numpy as np
from timeit import default_timer as timer
import datetime
import random
import threading
import time

def write_samples(store, samples, overwrite):
    """Write one block of samples: overwrite on the first call, append after."""
    frame = pd.DataFrame(samples, dtype='float64')

    if not overwrite:
        store.append("df", frame, format='table', index=False)
    else:
        store.put("df", frame, format='table', index=False)

def begin_io():
    # complevel=0 disables compression; the random suffix avoids
    # clobbering the file from an earlier run.
    store = pd.HDFStore("D:\\slow\\test" + str(random.randint(0, 100)) + ".h5",
                        mode='w', complevel=0)

    counter = 0
    while True:
        data = np.random.rand(50000, 1)
        start_time = timer()
        write_samples(store, data, counter == 0)
        end_time = timer()

        print("IO Done      : %s (%.2f ms, %d)" %
              (datetime.datetime.now(), (end_time - start_time) * 1000, counter))

        counter += 1

    store.close()  # unreachable: the loop above never exits

def dummy_thread():
    # Stand-in for the DAQ thread: wakes every 10 ms and reports how long
    # it actually slept. Large gaps show the IO thread holding the GIL.
    previous = timer()
    while True:
        now = timer()
        print("Dummy Thread  : %s (%d ms)" % (datetime.datetime.now(), (now - previous) * 1000))
        previous = now
        time.sleep(0.01)


if __name__ == '__main__':
    # daemon=True so the helper thread doesn't keep the process alive on exit
    threading.Thread(target=dummy_thread, daemon=True).start()
    begin_io()

You will get output similar to:

IO Done      : 2016-04-08 10:51:14.100479 (3.63 ms, 470)
Dummy Thread  : 2016-04-08 10:51:14.101484 (12 ms)
IO Done      : 2016-04-08 10:51:14.104475 (3.01 ms, 471)
Dummy Thread  : 2016-04-08 10:51:14.576640 (475 ms)
IO Done      : 2016-04-08 10:51:14.576640 (472.00 ms, 472)
Dummy Thread  : 2016-04-08 10:51:14.897756 (321 ms)
IO Done      : 2016-04-08 10:51:14.898782 (320.79 ms, 473)
IO Done      : 2016-04-08 10:51:14.901772 (3.29 ms, 474)
IO Done      : 2016-04-08 10:51:14.905773 (2.84 ms, 475)
IO Done      : 2016-04-08 10:51:14.908775 (2.96 ms, 476)
Dummy Thread  : 2016-04-08 10:51:14.909777 (11 ms)

Recommended Answer

The answer is no, these writers do not release the GIL. See the documentation here. I know you are not actually trying to write with multiple threads, but this should clue you in: strong locks are held while writes happen, precisely to prevent multiple writers. Both PyTables and h5py do this, as it's part of the HDF5 standard.

You can look at SWMR, though it is not directly supported by pandas. The PyTables docs here and here point to solutions; these generally involve having a separate process pull data off a queue and write it. Rough sketches of both approaches follow.
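
As an illustration of the SWMR route, here is a minimal sketch using h5py directly, since pandas' HDFStore does not expose SWMR. It assumes an h5py/HDF5 build new enough for SWMR (HDF5 >= 1.10); the file name swmr_demo.h5 and dataset name samples are invented for the example.

import h5py
import numpy as np

# Writer side: the file must use the latest file format
# before SWMR mode can be enabled.
f = h5py.File("swmr_demo.h5", "w", libver="latest")
dset = f.create_dataset("samples", shape=(0, 1), maxshape=(None, 1),
                        dtype="float64")
f.swmr_mode = True  # readers may now open the file concurrently

for _ in range(5):
    block = np.random.rand(50000, 1)
    n = dset.shape[0]
    dset.resize(n + block.shape[0], axis=0)  # grow along the first axis
    dset[n:] = block
    dset.flush()  # make the new rows visible to SWMR readers

f.close()

A reader would open the same file with h5py.File("swmr_demo.h5", "r", libver="latest", swmr=True) and call refresh() on the dataset to see new rows. Note that SWMR addresses concurrent reading while one process writes; it does not by itself shorten a slow write or release the GIL.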

The queue-and-writer-process approach is, in any event, generally a much more scalable pattern.
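
Here is a rough sketch of that pattern, adapted from the question's example: the acquiring side puts raw sample blocks on a multiprocessing.Queue, and a dedicated writer process drains it and performs the HDFStore calls, so a slow write only stalls the writer process's own GIL. The names writer_process and offload.h5 and the None shutdown sentinel are illustrative choices, not anything pandas prescribes.

import multiprocessing as mp

import numpy as np
import pandas as pd

def writer_process(queue, path):
    # Runs in its own process: a long HDF5 write holds only this
    # process's GIL, not the acquisition process's.
    store = pd.HDFStore(path, mode='w', complevel=0)
    first = True
    while True:
        samples = queue.get()      # blocks until a block arrives
        if samples is None:        # sentinel: close the store and exit
            break
        frame = pd.DataFrame(samples, dtype='float64')
        if first:
            store.put("df", frame, format='table', index=False)
            first = False
        else:
            store.append("df", frame, format='table', index=False)
    store.close()

if __name__ == '__main__':
    queue = mp.Queue()
    writer = mp.Process(target=writer_process, args=(queue, "offload.h5"))
    writer.start()

    # The acquisition side only enqueues; queue.put returns quickly
    # even while the writer is stuck inside a long HDF5 call.
    for _ in range(10):
        queue.put(np.random.rand(50000, 1))

    queue.put(None)  # ask the writer to finish
    writer.join()

The question's caveat still applies: queuing hides write latency from the DAQ thread, but if individual writes can stall for ~20 seconds, the queue (and therefore memory) has to be able to absorb that backlog.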
