Changing values in pandas dataframe does not work
Question
I'm having a problem changing values in a dataframe. I would also like advice on a problem I need to solve and on the proper way to use pandas to solve it; I'd appreciate help with both. I have a file containing information about how well audio files match speakers. The file looks something like this:
wave_path spk_name spk_example# score mark comments isUsed
190 122_65_02.04.51.800.wav idoD idoD 88 NaN NaN False
191 121_110_20.17.27.400.wav idoD idoD 87 NaN NaN False
192 121_111_00.34.57.300.wav idoD idoD 87 NaN NaN False
193 103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
194 131_101_02.08.06.500.wav idoD idoD_0 96 HIT VP False
What I need to do is a kind of sophisticated counting. I need to group the results by speaker and perform a calculation for each speaker. I then proceed with the speaker that gave the best result, but before proceeding I need to mark all the files I used for the calculation as used, i.e. change the isUsed value to True in every row in which they appear (a file can appear more than once). Then I run another iteration: calculate for each speaker, mark the used files, and so on until no speakers are left to calculate.
I thought a lot about how to implement this process using pandas. It is quite easy to implement in plain Python, but it would take a lot of looping and data structuring, which I guess would slow the process down significantly. I'm also using this process to learn pandas' capabilities more deeply.
I came up with the following solution. As preparation, I'll group by speaker name and set the file name as the index with the set_index method. I will then iterate over the groupby object and apply the calculation function, which will return the selected speaker and the files to be marked as used.
Then I'll iterate over the files and mark them as used (this should be fast and simple since I set them as the index beforehand), and so on until I finish calculating.
First, I'm not sure about this solution, so feel free to tell me your thoughts on it. Now, I've tried implementing it and ran into trouble:
First I indexed by file name; no problem here:
In [53]:
marked_results['isUsed'] = False
ind_res = marked_results.set_index('wave_path')
ind_res.head()
Out[53]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
131_101_02.08.06.500.wav idoD idoD 99 HIT VP False
144_35_22.46.38.700.wav idoD idoD 96 HIT VP False
41_09_17.10.11.700.wav idoD idoD 93 HIT TEST False
122_188_03.19.20.400.wav idoD idoD 93 NaN NaN False
Then I chose a file and checked that I get the entries relevant to that file:
In [54]:
example_file = ind_res.index[0];
ind_res.ix[example_file]
Out[54]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
No problems here either. Then I tried to change the isUsed value for that file to True, and that's where I hit the problem:
In [56]:
ind_res.ix[example_file]['isUsed'] = True
ind_res.ix[example_file].isUsed = True
ind_res.ix[example_file]
Out[56]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
So, you see the problem. Nothing has changed. What am I doing wrong? Should the problem described above be solved using pandas at all?
And also: 1. How can I access a specific group in a groupby object? I thought that, instead of setting the files as the index, I could group by file and then use that groupby object to apply a changing function to all occurrences of a file. But I didn't find a way to access a specific group, and passing the group name as a parameter, calling apply on all the groups, and then acting on only one of them didn't seem right to me.
I hope it is not too long... :)
Recommended answer
Indexing pandas objects can return two fundamentally different objects: a view or a copy.
If mask is a basic slice, then df.ix[mask] returns a view of df. Views share the same underlying data as the original object (df), so modifying the view also modifies the original object.
If mask is something more complicated, such as an arbitrary sequence of indices, then df.ix[mask] returns a copy of some rows in df. Modifying the copy has no effect on the original.
In your case, since the rows which share the same wave_path occur at arbitrary locations, ind_res.ix[example_file] returns a copy. So
ind_res.ix[example_file]['isUsed'] = True
has no effect on ind_res.
Instead, you could use
ind_res.ix[example_file, 'isUsed'] = True
to modify ind_res. However, see below for a groupby suggestion which I think might be closer to what you really want.
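As a minimal sketch of the single-step assignment above: the data below is made up for illustration, and .loc is used in place of .ix, since .ix was removed in pandas 1.0 and .loc is its label-based replacement. Passing the row label and the column label in one indexing call makes the assignment act on ind_res itself, even when the label is repeated.

```python
import pandas as pd

# Hypothetical data: the same file name appears on non-adjacent rows.
ind_res = pd.DataFrame(
    {
        "wave_path": ["a.wav", "b.wav", "a.wav", "c.wav"],
        "score": [99, 96, 97, 93],
        "isUsed": False,
    }
).set_index("wave_path")

example_file = ind_res.index[0]

# Row label and column label go into a single indexing call, so the
# assignment is applied to ind_res itself, for every row carrying
# the label example_file.
ind_res.loc[example_file, "isUsed"] = True

print(ind_res)
```

Both rows labelled a.wav end up with isUsed set to True, while the other rows are left untouched.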
Jeff has already provided a link to the pandas docs, which state that
The rules about when a view on the data is returned are entirely dependent on NumPy.
Here are the (complicated) rules which describe when a view or a copy is returned. Basically, however, the rule is that if the index requests a regularly spaced slice of the underlying array, a view is returned; otherwise a copy is returned (out of necessity).
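The NumPy rule can be checked directly on a plain array. This is only an illustrative sketch with made-up data: np.shares_memory reports whether two arrays overlap in memory, which separates a view from a copy.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

# Basic slicing (a regularly spaced selection) returns a view.
view = arr[0:2]

# Indexing with a list of positions ("fancy indexing") returns a copy.
fancy = arr[[0, 2]]

print(np.shares_memory(arr, view))
print(np.shares_memory(arr, fancy))

# Writing through the view changes the original array ...
view[0, 0] = 100
# ... while writing to the copy leaves the original alone.
fancy[1, 0] = -1

print(arr)
```

After this runs, arr[0, 0] is 100 because of the write through the view, and arr[2, 0] is still 6 because the write to fancy only touched the copy.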
Here is a simple example which uses a basic slice. A view is returned by df.ix, so modifying subdf modifies df as well:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3),
columns=list('ABC'), index=[0,1,2,3])
subdf = df.ix[0]
print(subdf.values)
# [0 1 2]
subdf.values[0] = 100
print(subdf)
# A 100
# B 1
# C 2
# Name: 0, dtype: int32
print(df) # df is modified
# A B C
# 0 100 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
Here is a simple example which uses "fancy indexing" (arbitrary rows selected). A copy is returned by df.ix, so modifying subdf does not affect df.
df = pd.DataFrame(np.arange(12).reshape(4,3),
columns=list('ABC'), index=[0,1,0,3])
subdf = df.ix[0]
print(subdf.values)
# [[0 1 2]
# [6 7 8]]
subdf.values[0] = 100
print(subdf)
# A B C
# 0 100 100 100
# 0 6 7 8
print(df) # df is NOT modified
# A B C
# 0 0 1 2
# 1 3 4 5
# 0 6 7 8
# 3 9 10 11
Notice that the only difference between the two examples is that in the first, where a view is returned, the index was [0,1,2,3], whereas in the second, where a copy is returned, the index was [0,1,0,3].
Since we are selecting rows where the index is 0, in the first example we can do that with a basic slice. In the second example, the rows where the index equals 0 could appear at arbitrary locations, so a copy has to be returned.
Despite having ranted on about the subtlety of pandas/NumPy slicing, I really don't think that
ind_res.ix[example_file, 'isUsed'] = True
is what you are ultimately looking for. You probably want to do something more like
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3),
columns=list('ABC'))
df['A'] = df['A']%2
print(df)
# A B C
# 0 0 1 2
# 1 1 4 5
# 2 0 7 8
# 3 1 10 11
def calculation(grp):
    # Set column C to True for every row in this group.
    grp['C'] = True
    return grp
newdf = df.groupby('A').apply(calculation)
print(newdf)
yields
A B C
0 0 1 True
1 1 4 True
2 0 7 True
3 1 10 True
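The iterative process described in the question could be sketched along the following lines. Everything specific here is an assumption made for illustration: the sample rows, the speaker name danaK, and the per-speaker calculation (a plain sum of scores standing in for the real one). The sketch relies on get_group to fetch the rows of one specific group, and on a boolean mask with .loc to mark every occurrence of a file in a single assignment.

```python
import pandas as pd

# Hypothetical results table; file names can repeat across speakers.
results = pd.DataFrame(
    {
        "wave_path": ["f1.wav", "f2.wav", "f1.wav", "f3.wav", "f4.wav"],
        "spk_name": ["idoD", "idoD", "danaK", "danaK", "danaK"],
        "score": [99, 96, 90, 85, 80],
    }
)
results["isUsed"] = False

selected = []  # speakers in the order they were chosen
while not results["isUsed"].all():
    # Only rows whose file has not been used yet take part.
    unused = results[~results["isUsed"]]
    grouped = unused.groupby("spk_name")

    # Placeholder calculation (an assumption): total score per speaker.
    totals = grouped["score"].sum()
    best = totals.idxmax()
    selected.append(best)

    # get_group returns the rows belonging to one specific group.
    files = grouped.get_group(best)["wave_path"]

    # Mark every row that mentions one of those files, whichever
    # speaker it belongs to, in one assignment on the original frame.
    results.loc[results["wave_path"].isin(files), "isUsed"] = True

print(selected)
print(results)
```

Each pass marks at least one previously unused row, so the loop ends once every file has been used. Because the assignment goes through results.loc with a mask, it never writes to a temporary copy.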