Writing to multiple adjacent columns in pandas efficiently

Problem description

With a numpy ndarray it is possible to write to multiple columns at once without making a copy first (as long as they are adjacent). To write to the first three columns of an array, I would write

a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)

I was hoping that in pandas I could similarly select multiple adjacent columns by "label-slicing", like so (assuming the first 3 columns are labeled 'a','b','c')

a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)

or similarly

a.iloc[0,3:6] = 1,2,3 # this is equally slow

However, this takes several hundred milliseconds, whereas writing to a numpy array takes only a few microseconds. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I could find to write to the dataframe with good speed is to work on the underlying ndarray directly

a.values[0,0:3] = 1,2,3 # this works fine and is fast

Have I missed something in the pandas docs, or is there no way to do multiple adjacent column indexing on a pandas dataframe with speed comparable to numpy?
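One workaround that does not depend on pandas internals is to do the writes in a plain ndarray first and wrap it in a DataFrame only afterwards. This is a different technique from the `.values` approach above, shown here as a minimal sketch; the array size and column names are assumptions chosen to resemble the data in this question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 1000 rows, six int64 columns.
columns = ["image_id", "pixel_index", "skin", "r", "g", "b"]
arr = np.zeros((1000, len(columns)), dtype=np.int64)

# Write to adjacent columns directly in the ndarray (no pandas indexing involved).
arr[0, 3:6] = 4, 5, 6

# Wrap the finished array in a DataFrame once, after all writes are done.
df = pd.DataFrame(arr, columns=columns)
print(df.loc[0, "r":"b"].tolist())
```

This only helps when the writes can be batched before the DataFrame is needed; it does not speed up `.loc` or `.iloc` themselves.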

Edit

Here's the actual dataframe I am working with.

>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
   image_id  pixel_index  skin    r    g    b
0        21       113764     0    0    0    0
1        13       187789     0  183  149  173
2        17       535758     0  147   32   35
3        31         6255     0  116    1   16
4        15       119272     0  238  229  224
>> prath.dtypes
image_id       int64
pixel_index    int64
skin           int64
r              int64
g              int64
b              int64
dtype: object

Here are some runtime comparisons for the different indexing methods (again, pandas indexing is very slow)

>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop
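For anyone without the original database, the comparison can be approximated outside IPython with the standard `timeit` module. The sketch below uses a hypothetical all-zero frame of 100,000 rows (far smaller than the 5.9M-row frame above), so the absolute numbers will differ from the ones reported here and will vary with the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000  # hypothetical size, chosen only to keep the run short
cols = ["image_id", "pixel_index", "skin", "r", "g", "b"]
arr = np.zeros((n_rows, len(cols)), dtype=np.int64)
df = pd.DataFrame(arr.copy(), columns=cols)  # the frame gets its own copy of the data


def write_loc():
    # Label-based slice assignment on the DataFrame.
    df.loc[0, "r":"b"] = [4, 5, 6]


def write_iloc():
    # Position-based slice assignment on the DataFrame.
    df.iloc[0, 3:6] = [4, 5, 6]


def write_ndarray():
    # Direct assignment on the standalone ndarray.
    arr[0, 3:6] = 4, 5, 6


for fn in (write_loc, write_iloc, write_ndarray):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.1f} µs per call")
```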

Recommended answer

We are adding the ability to index directly even in a multi-dtype frame. This is in master now and will be in 0.17.0. You can do this in versions before 0.17.0, but it requires more manipulation of the internals.

In [1]: df = DataFrame({'A' : range(5), 'B' : range(6,11), 'C' : 'foo'})

In [2]: df.dtypes
Out[2]: 
A     int64
B     int64
C    object
dtype: object

The copy=False flag is new. It gives you a dict mapping dtypes to blocks (the blocks are separated by dtype)

In [3]: b = df.as_blocks(copy=False)

In [4]: b
Out[4]: 
{'int64':    A   B
 0  0   6
 1  1   7
 2  2   8
 3  3   9
 4  4  10, 'object':      C
 0  foo
 1  foo
 2  foo
 3  foo
 4  foo}

Here is the underlying numpy array.

In [5]: b['int64'].values
Out[5]: 
array([[ 0,  6],
       [ 1,  7],
       [ 2,  8],
       [ 3,  9],
       [ 4, 10]])

This is the array in the original data set

In [7]: id(df._data.blocks[0].values)
Out[7]: 4429267232

Here is our view on it. They are the same

In [8]: id(b['int64'].values.base)
Out[8]: 4429267232
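The `.base` check above relies on ordinary numpy view semantics, which can be reproduced without pandas. The following is a minimal numpy-only sketch (the array names are illustrative, not taken from the answer). It may be a useful reference point because, as far as I know, `as_blocks` is not available in recent pandas releases, so the pandas transcript here may not run on a current install.

```python
import numpy as np

original = np.zeros((5, 2), dtype=np.int64)  # freshly allocated, owns its data
view = original[:, 0:2]                      # basic slicing returns a view, not a copy

# A view records the array it was sliced from in its .base attribute.
print(view.base is original)

# Writing through the view changes the original array.
view[0, 1] = -2
print(original[0, 1])

# Changing the dtype forces a copy, so the link to the original is lost.
converted = original.astype(object)
print(np.shares_memory(original, converted))
```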

Now you can access the frame and use pandas set operations to modify it. You can also access the numpy array directly via .values, which is now a view into the original.

Modifications incur no speed penalty, because no copies are made as long as you don't change the dtype of the data itself (e.g. don't try to put a string here; it will work, but the view will be lost)

In [9]: b['int64'].loc[0,'A'] = -1

In [11]: b['int64'].values[0,1] = -2

Since we have a view, these writes change the underlying data.

In [12]: df
Out[12]: 
   A   B    C
0 -1  -2  foo
1  1   7  foo
2  2   8  foo
3  3   9  foo
4  4  10  foo

Note that if you modify the shape of the data (e.g. by adding a column), the views will be lost.
