Is there any elegant way to define a DataFrame with a column of dtype array?


Problem description

I want to process stock level-2 data in pandas. For simplicity, suppose each row holds four kinds of data:

  • millis: timestamp, int64
  • last_price: the last trade price, float64
  • ask_queue: the volumes on the ask side, a fixed-size (200) array of int32
  • bid_queue: the volumes on the bid side, a fixed-size (200) array of int32

This can easily be defined as a structured dtype in numpy:

dtype = np.dtype([
   ('millis', 'int64'), 
   ('last_price', 'float64'), 
   ('ask_queue', ('int32', 200)), 
   ('bid_queue', ('int32', 200))
])

That way, I can access ask_queue and bid_queue like this:

In [17]: data = np.random.randint(0, 100, 1616 * 5).view(dtype)

# compute the average of ask_queue levels 5 ~ 10
In [18]: data['ask_queue'][:, 5:10].mean(axis=1)  
Out[18]: 
array([33.2, 51. , 54.6, 53.4, 15. , 37.8, 29.6, 58.6, 32.2, 51.6, 34.4,
       43.2, 58.4, 26.8, 54. , 59.4, 58.8, 38.8, 35.2, 71.2])
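
The session above relies on the platform's default integer width when it reinterprets random integers through `.view(dtype)`. As a minimal, platform-independent sketch of the same idea, the structured array can instead be allocated and filled field by field. The row count, the seed and the value ranges below are assumptions chosen only for illustration:

```python
import numpy as np

dtype = np.dtype([
    ('millis', 'int64'),
    ('last_price', 'float64'),
    ('ask_queue', ('int32', 200)),
    ('bid_queue', ('int32', 200)),
])

rng = np.random.default_rng(0)   # seed picked only for reproducibility
n_rows = 20                      # hypothetical number of snapshots

# allocate one record per snapshot, then fill each field separately
data = np.zeros(n_rows, dtype=dtype)
data['millis'] = np.arange(n_rows, dtype=np.int64)
data['last_price'] = rng.uniform(10.0, 11.0, n_rows)
data['ask_queue'] = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)
data['bid_queue'] = rng.integers(0, 100, size=(n_rows, 200), dtype=np.int32)

# a sub-array field is exposed as an ordinary 2D array of shape (n_rows, 200)
level_mean = data['ask_queue'][:, 5:10].mean(axis=1)
print(dtype.itemsize, data['ask_queue'].shape, level_mean.shape)
```

Each record occupies 8 + 8 + 800 + 800 = 1616 bytes, which is where the `1616` in the original `randint` call comes from.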

My question is: how do I define a DataFrame that contains this data?

Here are two solutions:

A. Set ask_queue and bid_queue as two columns with array values, as follows:

In [5]: df = pd.DataFrame(data.tolist(), columns=data.dtype.names)

In [6]: df.dtypes
Out[6]: 
millis          int64
last_price    float64
ask_queue      object
bid_queue      object
dtype: object

However, there are at least two problems with this solution:

  1. ask_queue and bid_queue lose the dtype of a 2D array and all the convenient methods that come with it;
  2. Performance, since each column becomes an array of objects rather than a 2D array.
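
To make the two problems concrete, here is a small sketch, not taken from the question, that mimics solution A by placing one 1D array in each cell. The 4-row frame and the `np.arange` fill are assumptions for illustration. The column comes back as `object`, so the 2D block has to be rebuilt with `np.stack` before any slicing across levels is possible:

```python
import numpy as np
import pandas as pd

n_rows, depth = 4, 200
ask = np.arange(n_rows * depth, dtype=np.int32).reshape(n_rows, depth)

# one ndarray per cell, as in solution A (illustrative construction)
df = pd.DataFrame({
    'millis': np.arange(n_rows, dtype=np.int64),
    'ask_queue': list(ask),
})
print(df.dtypes)   # ask_queue is reported as object

# problem 1: the column itself cannot be sliced as a 2D array;
# the block must first be rebuilt, which copies all the data
ask_2d = np.stack(df['ask_queue'].tolist())
mean_5_10 = ask_2d[:, 5:10].mean(axis=1)
print(ask_2d.shape, ask_2d.dtype, mean_5_10[0])
```

The rebuild step runs on every access, which is the performance cost named in item 2.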

B. Flatten ask_queue and bid_queue into 2 * 200 columns:

In [8]: ntype = np.dtype([('millis', 'int64'), ('last_price', 'float64')] + 
   ...:                  [(f'{name}{i}', 'int32') for name in ['ask', 'bid'] for i in range(200)])

In [9]: df = pd.DataFrame.from_records(data.view(ntype))

In [10]: df.dtypes
Out[10]: 
millis          int64
last_price    float64
ask0            int32
ask1            int32
ask2            int32
ask3            int32
ask4            int32
ask5            int32
...

This is better than solution A, but the 2 * 200 columns look redundant.
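
For comparison, a sketch of how the level 5 ~ 10 average from the numpy example might look on the flattened layout. The frame here is built directly from a small 2D array instead of through `from_records`, and the 4-row size is an assumption, so treat it as an illustration of the column naming only:

```python
import numpy as np
import pandas as pd

n_rows, depth = 4, 200
ask = np.arange(n_rows * depth, dtype=np.int32).reshape(n_rows, depth)

# flattened layout: one int32 column per queue level (ask0 ... ask199)
df = pd.DataFrame(ask, columns=[f'ask{i}' for i in range(depth)])
df.insert(0, 'millis', np.arange(n_rows, dtype=np.int64))

# the slice [:, 5:10] becomes an explicit list of column names
ask_cols = [f'ask{i}' for i in range(5, 10)]
mean_5_10 = df[ask_cols].mean(axis=1)

# the full 2D block can be recovered without going through objects
ask_block = df[[f'ask{i}' for i in range(depth)]].to_numpy()
print(mean_5_10.iloc[0], ask_block.shape, ask_block.dtype)
```

The dtype survives, but every array-style operation has to be spelled out in terms of column names.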

Is there any solution that can take advantage of the structured dtype in numpy? I wonder whether `ExtensionArray` or `ExtensionDtype` can solve this.

Recommended answer

Q : Is there any solution that can take advantage of the structured dtype in numpy?

Working with L2-DoM data has two-fold complications compared to the plain ToB ( Top-of-the-Book ) price-feed data. a) The native feed is fast ( very fast: FIX Protocol or other private data feeds deliver records with hundreds or thousands ( more during fundamental events on majors ) of L2-DoM changes per millisecond ), so both processing and storage must be performance-oriented. b) Any kind of offline analysis has to successfully manipulate and efficiently process large data sets, due to the nature of item a).

  • Storage preferences
  • Using numpy-alike syntax preferences
  • Performance preferences

Given that pandas.DataFrame was set as the preferred storage type, let's respect that, even though the syntax and performance preferences may be adversely affected.

Going another way is possible, yet it may introduce unknown refactoring / re-engineering costs that the O/P's operational environment need not, or is already unwilling to, bear.

Having said this, pandas feature limitations have to be put into the design considerations, and all the other steps will have to live with them, unless this preference gets revised at some future time.

This request is sound and clear, as numpy tools are fast and smartly crafted for high-performance number-crunching. Given the set storage preference, we will implement a pair of numpy tricks so as to fit into the pandas 2D DataFrame, all at reasonable costs in both the .STORE and .RETRIEVE directions:

 # on .STORE:
 testDF['ask_DoM'][aRowIDX] = ask200.dumps()      # type(ask200) <class 'numpy.ndarray'>

 # on .RETRIEVE:
 L2_ASK = np.loads( testDF['ask_DoM'][aRowIDX] )  # type(L2_ASK) <class 'numpy.ndarray'>
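
A self-contained sketch of the same round trip follows. Two things in it are assumptions, not part of the answer: `pickle.loads` stands in for `np.loads` (`np.loads` has been deprecated in favour of `pickle.loads` and is not available in recent NumPy releases), and `.at[...]` replaces the chained `testDF[...][...] = ...` assignment, which newer pandas versions may not propagate to the frame. The 3-row frame is purely illustrative:

```python
import pickle

import numpy as np
import pandas as pd

ask200 = np.arange(200, dtype=np.int32)

# an object column is needed to hold one bytes blob per cell
testDF = pd.DataFrame({
    'millis': np.arange(3, dtype=np.int64),
    'ask_DoM': pd.Series([None] * 3, dtype=object),
})
aRowIDX = 1

# on .STORE: ndarray.dumps() returns the pickle of the array as bytes
testDF.at[aRowIDX, 'ask_DoM'] = ask200.dumps()

# on .RETRIEVE: unpickling restores values, shape and dtype
L2_ASK = pickle.loads(testDF.at[aRowIDX, 'ask_DoM'])
print(type(L2_ASK), L2_ASK.dtype, L2_ASK.shape)
```

As with any pickle-based storage, such cells should only be unpickled when they come from a trusted source.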


Performance preferences : TESTED

Net add-on costs of the proposed solution for both the .STORE and .RETRIEVE directions were tested to take:

A one-time cost in the .STORE direction of no less than 70 [us] and no more than ~ 160 [us] per cell for the given scales of L2_DoM arrays ( avg: 78 [us], StDev: 9-11 [us] ):

>>> [ f( [testDUMPs() for _ in range(1000)] ) for f in (np.min,np.mean,np.std,np.max) ]
[72, 79.284, 11.004153942943548, 150]
[72, 78.048, 10.546135548152224, 160]
[71, 78.584,  9.887971227708949, 139]
[72, 76.9,    8.827332496286745, 132]

A repeating cost in the .RETRIEVE direction of no less than 46 [us] and no more than ~ 123 [us] per cell for the given scales of L2_DoM arrays ( avg: 50 [us], StDev: ~ 6-10 [us] ):

>>> [ f( [testLOADs() for _ in range(1000)] ) for f in (np.min,np.mean,np.std,np.max) ]
[46, 50.337, 9.655194197943405, 104]
[46, 49.649, 9.462272665697178, 123]
[46, 49.513, 9.504293766503643, 123]
[46, 49.77,  8.367165350344164, 114]
[46, 51.355, 6.162434583831296,  89]

Even higher performance is to be expected when using better architecture-aligned int64 datatypes ( yes, at the cost of doubled storage, yet the costs of computation will decide whether this move has a performance edge ) and from the chance to use memoryview-based manipulations, which can narrow the bottleneck and shave the add-on latency down to about 22 [us].
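
The answer does not show what the memoryview-based variant looks like. One possible reading, offered only as an assumption, is to skip pickling entirely and store the raw buffer, then rebuild the array with `np.frombuffer`. The dtype and length are no longer carried by the stored bytes, so they must be known by convention ( here int32, 200 levels ):

```python
import numpy as np

ask200 = np.arange(200, dtype=np.int32)

# on .STORE: raw buffer only, 200 * 4 = 800 bytes, no pickle header
raw = ask200.tobytes()

# on .RETRIEVE: reinterpret the bytes without copying them;
# the dtype has to be supplied because the bytes do not record it
L2_ASK = np.frombuffer(raw, dtype=np.int32)

# the same call also accepts a memoryview over the stored bytes
L2_ASK_mv = np.frombuffer(memoryview(raw), dtype=np.int32)

print(len(raw), L2_ASK.shape, L2_ASK.flags.writeable)
```

The resulting array is a read-only view over the stored bytes; call `.copy()` on it if it needs to be modified. The 22 [us] figure above is the answer author's own measurement and is not reproduced by this sketch.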

Tests were run under py3.5.6, numpy v1.15.2, using:

>>> import numpy as np; ask200 = np.arange( 200, dtype = np.int32 ); s = ask200.dumps()
>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> def testDUMPs():
...     aClk.start()
...     s = ask200.dumps()
...     return aClk.stop()
... 
>>> def testLOADs():
...     aClk.start()
...     a = np.loads( s )
...     return aClk.stop()
...
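
The harness above depends on `zmq.Stopwatch`, which reports elapsed microseconds. For readers without pyzmq, a roughly equivalent sketch using only the standard library might look like the following. The helper name `time_us` is hypothetical, `pickle.loads` stands in for `np.loads`, and the numbers it prints will differ from the ones quoted above depending on the machine and library versions:

```python
import pickle
import time

import numpy as np

ask200 = np.arange(200, dtype=np.int32)
s = ask200.dumps()

def time_us(fn, repeats=1000):
    """Call fn() `repeats` times and return per-call durations in microseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1000.0)
    return samples

dump_us = time_us(ask200.dumps)              # .STORE direction
load_us = time_us(lambda: pickle.loads(s))   # .RETRIEVE direction

# same summary as in the answer: min, mean, std, max
for name, samples in (('dumps', dump_us), ('loads', load_us)):
    print(name, [f(samples) for f in (np.min, np.mean, np.std, np.max)])
```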

Platform CPU, cache hierarchy and RAM details:

>>> get_numexpr_cpuinfo_details_on_CPU()

'TLB size'______________________________:'1536 4K pages'
'address sizes'_________________________:'48 bits physical, 48 bits virtual'
'apicid'________________________________:'17'
'bogomips'______________________________:'7199.92'
'bugs'__________________________________:'fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2'
'cache size'____________________________:'2048 KB'
'cache_alignment'_______________________:'64'
'clflush size'__________________________:'64'
'core id'_______________________________:'1'
'cpu MHz'_______________________________:'1400.000'
'cpu cores'_____________________________:'2'
'cpu family'____________________________:'21'
'cpuid level'___________________________:'13'
'flags'_________________________________:'fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold'
'fpu'___________________________________:'yes'
'fpu_exception'_________________________:'yes'
'initial apicid'________________________:'1'
'microcode'_____________________________:'0x6000626'
'model'_________________________________:'1'
'model name'____________________________:'AMD FX(tm)-4100 Quad-Core Processor'
'physical id'___________________________:'0'
'power management'______________________:'ts ttp tm 100mhzsteps hwpstate cpb'
'processor'_____________________________:'1'
'siblings'______________________________:'4'
'stepping'______________________________:'2'
'vendor_id'_____________________________:'AuthenticAMD'
'wp'____________________________________:'yes'
