Changing values in pandas dataframe doesn't work


Problem description

I'm having a problem changing values in a dataframe. I would also like advice on a problem I need to solve and on the proper way to use pandas to solve it. I'd appreciate help with both. I have a file containing information about how well audio files match speakers. The file looks something like this:

         wave_path                 spk_name  spk_example#  score  mark  comments  isUsed
    190  122_65_02.04.51.800.wav   idoD      idoD          88     NaN   NaN       False
    191  121_110_20.17.27.400.wav  idoD      idoD          87     NaN   NaN       False
    192  121_111_00.34.57.300.wav  idoD      idoD          87     NaN   NaN       False
    193  103_31_18.59.12.800.wav   idoD      idoD_0        99     HIT   VP        False
    194  131_101_02.08.06.500.wav  idoD      idoD_0        96     HIT   VP        False

What I need to do is a kind of sophisticated counting. I need to group the results by speaker and run some calculation for each speaker. I then proceed with the speaker that gave the best result, but before proceeding I need to mark all the files I used for the calculation as used, i.e. change the isUsed value to True for each row in which they appear (files can appear more than once). Then I make another iteration: calculate for each speaker, mark the used files, and so on until there are no more speakers left to calculate.

I thought a lot about how to implement this process using pandas (it is quite easy to implement in regular Python, but it would take a lot of looping and data structuring, which I guess would slow the process down significantly; I'm also using this process to learn pandas' abilities more deeply).

I came up with the following solution. As preparation steps, I'll group by speaker name and set the file name as the index with the set_index method. I will then iterate over the groupby object and apply the calculation function, which will return the selected speaker and the files to be marked as used.

Then I'll iterate over the files and mark them as used (this should be fast and simple since I set them as the index beforehand), and so on until I finish calculating.

First, I'm not sure about this solution, so feel free to tell me your thoughts on it. Now, I've tried implementing this and ran into trouble:

First I indexed by file name, no problem here:

In [53]:

    marked_results['isUsed'] = False
    ind_res = marked_results.set_index('wave_path')
    ind_res.head()

Out[53]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav      idoD    idoD    99  HIT     VP  False
    131_101_02.08.06.500.wav     idoD    idoD    99  HIT     VP  False
    144_35_22.46.38.700.wav      idoD    idoD    96  HIT     VP  False
    41_09_17.10.11.700.wav       idoD    idoD    93  HIT     TEST    False
    122_188_03.19.20.400.wav     idoD    idoD    93  NaN     NaN     False

Then I chose a file and checked that I get the entries relevant to that file:

In [54]:

    example_file = ind_res.index[0];
    ind_res.ix[example_file]

Out[54]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav  idoD    idoD    99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_0  99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_1  97  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_2  95  HIT     VP  False

No problems here either. Then I tried to change the isUsed value for that file to True, and that's where I ran into the problem:

In [56]:

    ind_res.ix[example_file]['isUsed'] = True
    ind_res.ix[example_file].isUsed = True
    ind_res.ix[example_file]
Out[56]:
    spk_name    spk_example#    score   mark    comments    isUsed
    wave_path                       
    103_31_18.59.12.800.wav  idoD    idoD    99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_0  99  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_1  97  HIT     VP  False
    103_31_18.59.12.800.wav  idoD    idoD_2  95  HIT     VP  False

So, you see the problem. Nothing has changed. What am I doing wrong? Should the problem described above be solved using pandas?

And also: 1. How can I access a specific group of a groupby object? I thought that maybe, instead of setting the files as the index, I could group by file and then use that groupby object to apply a changing function to all of a file's occurrences. But I didn't find a way to access a specific group by passing the group name as a parameter, and calling apply on all the groups and then acting on only one of them didn't seem "right" to me.

I hope it is not too long... :)

Solution

Indexing pandas objects can return two fundamentally different objects: a view or a copy.

If mask is a basic slice, then df.ix[mask] returns a view of df. Views share the same underlying data as the original object (df), so modifying the view also modifies the original object.

If mask is something more complicated, such as an arbitrary sequence of indices, then df.ix[mask] returns a copy of some rows in df. Modifying the copy has no effect on the original.

In your case, since the rows which share the same wave_path occur at arbitrary locations, ind_res.ix[example_file] returns a copy. So

ind_res.ix[example_file]['isUsed'] = True

has no effect on ind_res.

Instead, you could use

ind_res.ix[example_file, 'isUsed'] = True

to modify ind_res. However, see below for a groupby suggestion which I think might be closer to what you really want.
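In current pandas the same fix is spelled with `.loc`, since `.ix` was deprecated in pandas 0.20 and removed in 1.0. The following is a minimal sketch on a made-up four-row frame (the file names and the contents of `ind_res` here are illustrative assumptions, not the asker's data). It suggests that the chained form leaves the frame untouched, while a single `.loc` call updates every row carrying that label:

```python
import warnings

import pandas as pd

# Hypothetical stand-in for ind_res: a duplicated, unsorted wave_path index.
ind_res = pd.DataFrame(
    {"spk_example": ["idoD", "idoD_0", "idoD_1", "idoD"],
     "isUsed": [False, False, False, False]},
    index=pd.Index(["a.wav", "b.wav", "a.wav", "c.wav"], name="wave_path"),
)
example_file = ind_res.index[0]

# Chained indexing: the assignment lands on a temporary copy.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    ind_res.loc[example_file]["isUsed"] = True
after_chained = ind_res["isUsed"].tolist()

# One indexing call with both row label and column label: modifies ind_res itself.
ind_res.loc[example_file, "isUsed"] = True
after_loc = ind_res["isUsed"].tolist()

print(after_chained)  # [False, False, False, False]
print(after_loc)      # [True, False, True, False]
```

The point of the single call is that pandas sees the row selection and the column selection together, so there is no intermediate object for the assignment to get lost in.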

Jeff has already provided a link to the Pandas docs which state that

The rules about when a view on the data is returned are entirely dependent on NumPy.

Here are the (complicated) rules which describe when a view or copy is returned. Basically, however, the rule is that if the index is requesting a regularly spaced slice of the underlying array, then a view is returned; otherwise a copy (out of necessity) is returned.
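The NumPy side of that rule can be checked directly. This is a small sketch using a throwaway array (the names `arr`, `basic` and `fancy` are just for illustration): a regularly spaced slice shares memory with the original, while selecting rows by an arbitrary list of positions does not.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

basic = arr[0:2]      # regularly spaced slice -> view
fancy = arr[[0, 2]]   # arbitrary row positions -> copy

print(np.shares_memory(arr, basic))  # True
print(np.shares_memory(arr, fancy))  # False

basic[0, 0] = 100     # writes through to arr
fancy[1, 0] = -1      # only changes the copy
print(arr[0, 0], arr[2, 0])  # 100 6
```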


Here is a simple example which uses a basic slice. A view is returned by df.ix, so modifying subdf modifies df as well:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,2,3])

subdf = df.ix[0]
print(subdf.values)
# [0 1 2]
subdf.values[0] = 100
print(subdf)
# A    100
# B      1
# C      2
# Name: 0, dtype: int32

print(df)           # df is modified
#      A   B   C
# 0  100   1   2
# 1    3   4   5
# 2    6   7   8
# 3    9  10  11


Here is a simple example which uses "fancy indexing" (arbitrary rows selected). A copy is returned by df.ix. So modifying subdf does not affect df.

df = pd.DataFrame(np.arange(12).reshape(4,3), 
         columns=list('ABC'), index=[0,1,0,3])

subdf = df.ix[0]
print(subdf.values)
# [[0 1 2]
#  [6 7 8]]

subdf.values[0] = 100
print(subdf)
#      A    B    C
# 0  100  100  100
# 0    6    7    8

print(df)          # df is NOT modified
#    A   B   C
# 0  0   1   2
# 1  3   4   5
# 0  6   7   8
# 3  9  10  11

Notice the only difference between the two examples is that in the first, where a view is returned, the index was [0,1,2,3], whereas in the second, where a copy is returned, the index was [0,1,0,3].

Since we are selecting rows where the index is 0, in the first example we can do that with a basic slice. In the second example, the rows where the index equals 0 could appear at arbitrary locations, so a copy has to be returned.
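One way to see the difference between the two indexes is to ask each one where label 0 lives. This is a rough sketch using `Index.get_loc` on the same two indexes as the examples above (the variable names are made up): the unique index resolves to a single position, while the duplicated, unsorted index can only answer with a boolean mask over all rows.

```python
import pandas as pd

unique_idx = pd.Index([0, 1, 2, 3])
dup_idx = pd.Index([0, 1, 0, 3])

pos = unique_idx.get_loc(0)   # a single integer position
mask = dup_idx.get_loc(0)     # a boolean mask, one entry per row

print(unique_idx.is_unique, pos)
print(dup_idx.is_unique, mask)
```

A single position can be served straight from the existing data, whereas a mask has to gather the matching rows into a new object.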


Despite having ranted on about the subtlety of Pandas/NumPy slicing, I really don't think that

ind_res.ix[example_file, 'isUsed'] = True

is what you are ultimately looking for. You probably want to do something more like

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4,3), 
                  columns=list('ABC'))
df['A'] = df['A']%2
print(df)
#    A   B   C
# 0  0   1   2
# 1  1   4   5
# 2  0   7   8
# 3  1  10  11

def calculation(grp):
    grp['C'] = True
    return grp

newdf = df.groupby('A').apply(calculation)
print(newdf)

which yields

   A   B     C
0  0   1  True
1  1   4  True
2  0   7  True
3  1  10  True
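On the side question about reaching one particular group, and on marking files as used, here is a hedged sketch rather than a full solution. It assumes a tiny made-up `marked_results` frame (the speaker `danaB` and all values are invented), uses `get_group` to pull out a single speaker's rows by key, and then marks every row that mentions one of that speaker's files with a boolean mask, without setting `wave_path` as the index:

```python
import pandas as pd

# Hypothetical miniature of marked_results; all values are made up.
marked_results = pd.DataFrame({
    "wave_path": ["a.wav", "b.wav", "a.wav", "c.wav"],
    "spk_name": ["idoD", "idoD", "danaB", "danaB"],
    "score": [99, 87, 95, 90],
    "isUsed": False,
})

by_speaker = marked_results.groupby("spk_name")

# Pull out one group by its key instead of calling apply on every group.
idod_rows = by_speaker.get_group("idoD")
used_files = idod_rows["wave_path"].unique()

# Mark every row that mentions one of those files, whichever speaker it belongs to.
marked_results.loc[marked_results["wave_path"].isin(used_files), "isUsed"] = True
print(marked_results)
```

Because the mask and the column are passed to `.loc` in one call, the assignment goes into `marked_results` itself, so the same step could be repeated inside the outer loop over speakers.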
