Difference of sets of multiple values for single column in pandas


Problem description


I've got some grouped tabular data, and in this data there's a column in which each data point can hold a set of different values. I'm trying to calculate the difference between that set and the set of the preceding data point in the same group. For example, given the data below, I'm trying to calculate the difference between the values of Tokens at Timestep n and the values of Tokens for the row with Timestep n - 1, for each Dyad,Participant combination:

| Dyad | Participant | Timestep | Tokens            |
|------|-------------|----------|-------------------|
| 1    | A           | 1        | apple,banana      |
| 1    | B           | 1        | apple,orange      |
| 1    | A           | 2        | banana            |
| 1    | B           | 2        | orange,kumquat    |
| 1    | A           | 3        | orange            |
| 1    | B           | 3        | orange,pear       |
| 2    | A           | 1        | orange,pear       |
| 2    | B           | 1        | apple,banana,pear |
| 2    | A           | 2        | banana,persimmon  |
| 2    | B           | 2        | apple             |
| 2    | A           | 3        | banana            |
| 2    | B           | 3        | apple             |

How can I best accomplish this using pandas?

Expected result

I ultimately want to create a new column holding the output of a function token_overlap(data), which computes the ratio of Tokens values that overlap with the values of the preceding datapoint:

| Dyad | Participant | Timestep | Tokens            | TokenOverlap |
|------|-------------|----------|-------------------|--------------|
| 1    | A           | 1        | apple,banana      | (no value)   |
| 1    | B           | 1        | apple,orange      | (no value)   |
| 1    | A           | 2        | banana            | 0.5          |
| 1    | B           | 2        | orange,kumquat    | 0.333        |
| 1    | A           | 3        | orange            | 0            |
| 1    | B           | 3        | orange,pear       | 0.333        |
| 2    | A           | 1        | orange,pear       | (no value)   |
| 2    | B           | 1        | apple,banana,pear | (no value)   |
| 2    | A           | 2        | banana,persimmon  | 0            |
| 2    | B           | 2        | apple             | 0.333        |
| 2    | A           | 3        | banana            | 0.5          |
| 2    | B           | 3        | apple             | 1            |

Current approach

I convert the multi-values into a frozenset by using the converters keyword of pandas.read_csv(...):

def parse_set(cell_value: str) -> FrozenSet[str]:
    return frozenset(cell_value.split(','))

round_tokens = pandas.read_csv(inpath, converters={"Tokens": parse_set})
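
For readers who want to try the converter without a file on disk, here is a minimal sketch. It assumes a tab-separated layout so that the commas inside Tokens are not read as column separators; the in-memory `csv_text` and the choice of separator are illustrative assumptions, not details from the original data.

```python
import io
from typing import FrozenSet

import pandas as pd


def parse_set(cell_value: str) -> FrozenSet[str]:
    # Split the raw cell text on commas and freeze the result into a set.
    return frozenset(cell_value.split(","))


# Hypothetical two-row sample, tab-separated (an assumption for illustration).
csv_text = (
    "Dyad\tParticipant\tTimestep\tTokens\n"
    "1\tA\t1\tapple,banana\n"
    "1\tB\t1\tapple,orange\n"
)

round_tokens = pd.read_csv(
    io.StringIO(csv_text),
    sep="\t",
    converters={"Tokens": parse_set},
)

# Each cell of the Tokens column is now a frozenset, not a string.
first_cell = round_tokens["Tokens"].iloc[0]
```

Because the cells are plain Python objects, the column ends up with dtype object.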

I then create groups of Dyad,Participant datapoints using pandas.DataFrame.groupby(..):

round_tokens.sort_values(["Dyad", "Timestep"])
dyad_participants = round_tokens.groupby(["Dyad", "Participant"])

However, I am unsure how to get each row's Tokens value together with its predecessor's (each of which should be a frozenset): I have a function which tries to do so, but I'm unsure whether the function itself is wrong or whether I'm extracting the row data incorrectly.

def token_overlap(data):
    own_relevant_tokens = data["Tokens"]
    prev_tokens = data.shift(-1)["Tokens"]
    overlap = own_relevant_tokens.intersection(prev_tokens)
    union = own_relevant_tokens.union(prev_tokens)
    return len(overlap) / len(union)

round_tokens["TokenOverlap"] = dyad_participants.apply(token_overlap)

However, this doesn't actually work: The actual error is

AttributeError: 'Series' object has no attribute 'union'

but I know I'm not using/understanding/grokking the pandas API correctly, hence the extremely long-winded question. How can I group my data and then, inside each group, calculate a metric using set-like values from one row and the same column's values for the row preceding it?


In the real data, there are over 1,000 possible values for Tokens, so, at least to me, this task would be even harder if I enumerated the presence of each token as Boolean values, e.g. Token_Apple, Token_Banana, etc.

Solution

Setup

df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple

tokens = df.Tokens.str.split(',', expand=False).apply(frozenset) 

tokens
0           (apple, banana)
1           (orange, apple)
2                  (banana)
3         (orange, kumquat)
4                  (orange)
5            (orange, pear)
6            (orange, pear)
7     (apple, banana, pear)
8       (persimmon, banana)
9                   (apple)
10                 (banana)
11                  (apple)
Name: Tokens, dtype: object


# union logic - https://stackoverflow.com/a/46402781/4909087
df = (
    df.assign(Tokens=tokens)
      .groupby(['Dyad', 'Participant'])
      .apply(lambda x: (x.Tokens.str.len() - x.Tokens.diff().str.len())
                       / pd.Series([len(k[0].union(k[1]))
                                    for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))],
                                   index=x.index))
      .reset_index(level=[0, 1], name='TokenOverlap')
      .assign(Timestep=df.Timestep, Tokens=df.Tokens)
      .sort_values(['Dyad', 'Timestep', 'Participant'])
      .fillna('(no value)')
      [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
)

df

    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1

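In a nutshell, this code groups by Dyad and Participant and then computes a pairwise ratio between each row and the one before it. This needs a somewhat complicated groupby and apply, since a few set union and difference operations are involved: the size of the current set minus the size of its difference from the previous set gives the size of the intersection, which is then divided by the size of the union. The core logic is inside the groupby.apply, while the rest is just prettification.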
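
The same intersection-over-union ratio can also be sketched more explicitly by pairing each row with its predecessor through a grouped shift and applying a plain Python function. This is an alternative formulation, not the code from the answer; the helper name `jaccard` and the rebuilt sample frame are assumptions made for illustration, and missing values are left as NaN rather than replaced with '(no value)'.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Dyad": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Participant": list("ABABABABABAB"),
    "Timestep": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "Tokens": [
        "apple,banana", "apple,orange", "banana", "orange,kumquat",
        "orange", "orange,pear", "orange,pear", "apple,banana,pear",
        "banana,persimmon", "apple", "banana", "apple",
    ],
})


def jaccard(current, previous):
    # Hypothetical helper: the first row of each group has no predecessor,
    # so anything that is not a frozenset yields NaN.
    if not isinstance(previous, frozenset):
        return np.nan
    return len(current & previous) / len(current | previous)


# Order rows so that, within each group, timesteps are ascending.
df = df.sort_values(["Dyad", "Participant", "Timestep"])
token_sets = df["Tokens"].str.split(",").apply(frozenset)

# Shift within each (Dyad, Participant) group to get the preceding set.
prev_sets = token_sets.groupby([df["Dyad"], df["Participant"]]).shift(1)

df["TokenOverlap"] = [jaccard(c, p) for c, p in zip(token_sets, prev_sets)]
df = df.sort_index()  # restore the original row order
```

Keeping the set arithmetic in an ordinary function makes it easy to test on its own, at the cost of a Python-level loop over the rows.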

This code runs in:

10 loops, best of 3: 19.2 ms per loop


Breakdown

df2 = df.assign(Tokens=tokens)

# core logic: per-group overlap ratio (intersection size / union size)
df2 = df2.groupby(['Dyad', 'Participant']).apply(
    lambda x: (x.Tokens.str.len() - x.Tokens.diff().str.len())
              / pd.Series([len(k[0].union(k[1]))
                           for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))],
                          index=x.index))

# the rest is presentation
df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
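
One detail in the breakdown is the direction of the shift: shift(1) lines each row up with the row before it, whereas shift(-1), as used in the question's attempt, lines it up with the row after it. A tiny sketch on a made-up Series (the labels are placeholders) shows the difference:

```python
import pandas as pd

s = pd.Series(["t1", "t2", "t3"])

prev = s.shift(1)   # each position holds the value of the preceding row
nxt = s.shift(-1)   # each position holds the value of the following row

# prev: NaN, "t1", "t2"
# nxt:  "t2", "t3", NaN
```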
