Writing to multiple adjacent columns in pandas efficiently
Question
With a numpy ndarray it is possible to write to multiple columns at a time without making a copy first (as long as they are adjacent). If I wanted to write to the first three columns of an array I would write
a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)
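As a minimal sketch of the numpy behaviour described above (the array shape and the names `buffer_before` and `buffer_after` are made up for illustration), a slice assignment writes into the existing buffer instead of allocating a new one:

```python
import numpy as np

# Hypothetical 4x6 integer array standing in for 'a'.
a = np.zeros((4, 6), dtype=np.int64)

# Address of the data buffer before the write.
buffer_before = a.__array_interface__["data"][0]

# Write to three adjacent columns of the first row in one statement.
a[0, 0:3] = 1, 2, 3

# Address of the data buffer after the write.
buffer_after = a.__array_interface__["data"][0]

print(a[0])
print(buffer_before == buffer_after)
```

The buffer address is unchanged, which is why the write costs only microseconds regardless of how many rows the array has.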
I was hoping that in pandas I would similarly be able to select multiple adjacent columns by "label-slicing" like so (assuming the first 3 columns are labeled 'a','b','c')
a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)
or similarly
a.iloc[0,3:6] = 1,2,3 # this is equally slow
However, this takes several hundred milliseconds, whereas writing to a numpy array takes only a few microseconds. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I have found to write to the dataframe at a reasonable speed is to work on the underlying ndarray directly
a.values[0,0:3] = 1,2,3 # this works fine and is fast
Have I missed something in the pandas docs, or is there no way to index multiple adjacent columns of a pandas dataframe with speed comparable to numpy?
Edit
Here's the actual dataframe I am working with.
>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
   image_id  pixel_index  skin    r    g    b
0        21       113764     0    0    0    0
1        13       187789     0  183  149  173
2        17       535758     0  147   32   35
3        31         6255     0  116    1   16
4        15       119272     0  238  229  224
>> prath.dtypes
image_id int64
pixel_index int64
skin int64
r int64
g int64
b int64
dtype: object
Here are some runtime comparisons for the different indexing methods (again, pandas indexing is very slow)
>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop
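To reproduce this kind of comparison without the original database, a rough sketch on a synthetic frame might look like the following. The frame size, the all-zero data, and the repeat count are assumptions chosen for illustration, and absolute timings vary with pandas version and machine, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'prath' table: six int64 columns.
columns = ["image_id", "pixel_index", "skin", "r", "g", "b"]
frame = pd.DataFrame(np.zeros((100_000, 6), dtype=np.int64), columns=columns)
array = np.zeros((100_000, 6), dtype=np.int64)


def write_frame():
    # Positional write to three adjacent columns through pandas indexing.
    frame.iloc[0, 3:6] = [4, 5, 6]


def write_array():
    # The same write on a plain ndarray.
    array[0, 3:6] = 4, 5, 6


repeats = 20
frame_seconds = timeit.timeit(write_frame, number=repeats) / repeats
array_seconds = timeit.timeit(write_array, number=repeats) / repeats

print(f"DataFrame.iloc write: {frame_seconds * 1e6:.1f} µs per call")
print(f"ndarray write:        {array_seconds * 1e6:.1f} µs per call")
```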
Recommended answer
We are adding the ability to index directly even in a multi-dtype frame. This is in master now and will be in 0.17.0. You can do this in < 0.17.0, but it requires (more) manipulation of the internals.
In [1]: df = DataFrame({'A' : range(5), 'B' : range(6,11), 'C' : 'foo'})
In [2]: df.dtypes
Out[2]:
A int64
B int64
C object
dtype: object
The copy=False flag is new. It gives you a dict mapping each dtype to a block (the frame is split by dtype)
In [3]: b = df.as_blocks(copy=False)
In [4]: b
Out[4]:
{'int64': A B
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10, 'object': C
0 foo
1 foo
2 foo
3 foo
4 foo}
Here is the underlying numpy array.
In [5]: b['int64'].values
Out[5]:
array([[ 0, 6],
[ 1, 7],
[ 2, 8],
[ 3, 9],
[ 4, 10]])
This is the array in the original data set
In [7]: id(df._data.blocks[0].values)
Out[7]: 4429267232
Here is our view of it. They are the same object
In [8]: id(b['int64'].values.base)
Out[8]: 4429267232
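The same identity check can be illustrated with plain numpy. This is only a sketch of the view mechanism, not of pandas internals, and the names `owner` and `view` are made up: a slice's `.base` attribute points at the array that owns the memory, so writes through the slice show up in the owner.

```python
import numpy as np

# An array that owns its memory (its own .base is None).
owner = np.zeros((5, 2), dtype=np.int64)

# Basic slicing returns a view, not a copy.
view = owner[:, 0:1]

print(view.base is owner)             # the view points back at the owner
print(np.shares_memory(owner, view))  # both use the same buffer

# Writing through the view modifies the owner.
view[0, 0] = -1
print(owner[0, 0])
```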
Now you can access the frame and use pandas set operations to modify it. You can also access the numpy array directly via .values, which is now a view into the original.
You will not incur any speed penalty for modifications as copies won't be made as long as you don't change the dtype of the data itself (e.g. don't try to put a string here; it will work but the view will be lost)
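A numpy-only analogy may help show why a dtype change breaks the link. This is a sketch with made-up names (`ints`, `as_objects`), not a reproduction of what pandas does internally: an int64 buffer cannot hold a string, so the data has to be converted into a new array, and the new array no longer shares memory with the old one.

```python
import numpy as np

ints = np.arange(6, dtype=np.int64).reshape(3, 2)

# Storing a string requires an object array, and astype allocates a new buffer.
as_objects = ints.astype(object)
as_objects[0, 0] = "foo"

print(np.shares_memory(ints, as_objects))  # the two arrays are independent
print(ints[0, 0])                          # the original is untouched
```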
In [9]: b['int64'].loc[0,'A'] = -1
In [11]: b['int64'].values[0,1] = -2
Since we have a view, you can then change the underlying data.
In [12]: df
Out[12]:
   A   B    C
0 -1  -2  foo
1  1   7  foo
2  2   8  foo
3  3   9  foo
4  4  10  foo
Note that if you modify the shape of the data (e.g. by adding a column), the views will be lost.
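DataFrame.as_blocks was deprecated and then removed in later pandas releases, so the session above only runs on old versions. As a version-independent alternative (not the approach from this answer), one option when every column shares a dtype is to batch the writes on an explicit ndarray copy and wrap the result in a DataFrame afterwards. The small frame and the constant values below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real table: six int64 columns.
columns = ["image_id", "pixel_index", "skin", "r", "g", "b"]
frame = pd.DataFrame(np.zeros((5, 6), dtype=np.int64), columns=columns)

# Take an explicit copy so the writes never depend on whether pandas
# hands out a view or a read-only array.
data = frame.to_numpy(copy=True)

# Positions of the adjacent target columns 'r' through 'b'.
start = frame.columns.get_loc("r")
stop = frame.columns.get_loc("b") + 1

# Many cheap row writes on the plain array.
for row in range(len(data)):
    data[row, start:stop] = 4, 5, 6

# Rebuild the frame once, after all writes are done.
result = pd.DataFrame(data, index=frame.index, columns=frame.columns)
print(result)
```

This pays for one copy out and one construction back in, which only makes sense when many writes are batched between the two steps.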