Difference of sets of multiple values for single column in pandas
I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference of that set from that of its preceding data point in the group it's a member of. For example, given the data below, I'm trying to calculate the difference of the values of Tokens† for Timestep value n from the values of Tokens for the row with Timestep value n - 1, for each Dyad,Participant combination:
| Dyad | Participant | Timestep | Tokens |
|------|-------------|----------|-------------------|
| 1 | A | 1 | apple,banana |
| 1 | B | 1 | apple,orange |
| 1 | A | 2 | banana |
| 1 | B | 2 | orange,kumquat |
| 1 | A | 3 | orange |
| 1 | B | 3 | orange,pear |
| 2 | A | 1 | orange,pear |
| 2 | B | 1 | apple,banana,pear |
| 2 | A | 2 | banana,persimmon |
| 2 | B | 2 | apple |
| 2 | A | 3 | banana |
| 2 | B | 3 | apple |
How can I best accomplish this using pandas?
Expected result
I ultimately want to create a new column with the output of a function token_overlap(data) which computes the ratio of Tokens values that overlap with the values of the data point preceding it:
| Dyad | Participant | Timestep | Tokens | TokenOverlap |
|------|-------------|----------|-------------------| -------------|
| 1 | A | 1 | apple,banana | (no value) |
| 1 | B | 1 | apple,orange | (no value) |
| 1 | A | 2 | banana | 0.5 |
| 1 | B | 2 | orange,kumquat | 0.333 |
| 1 | A | 3 | orange | 0 |
| 1 | B | 3 | orange,pear | 0.333 |
| 2 | A | 1 | orange,pear | (no value) |
| 2 | B | 1 | apple,banana,pear | (no value) |
| 2 | A | 2 | banana,persimmon | 0 |
| 2 | B | 2 | apple | 0.333 |
| 2 | A | 3 | banana | 0.5 |
| 2 | B | 3 | apple | 1 |
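The ratios in the table above are consistent with intersection over union (the Jaccard index) of a row's token set and the previous row's token set in the same Dyad,Participant group. The following is a minimal sketch under that assumption; the helper name jaccard_overlap is hypothetical and does not come from the question.

```python
from typing import FrozenSet, Optional


def jaccard_overlap(current: FrozenSet[str],
                    previous: Optional[FrozenSet[str]]) -> Optional[float]:
    """Return |current & previous| / |current | previous|.

    Returns None when there is no previous row, or when both sets are
    empty and the ratio is undefined.
    """
    if previous is None:
        return None
    union = current | previous
    if not union:
        return None
    return len(current & previous) / len(union)


# Dyad 1, participant B, going from Timestep 1 to Timestep 2 in the table.
prev_tokens = frozenset("apple,orange".split(","))
curr_tokens = frozenset("orange,kumquat".split(","))
ratio = jaccard_overlap(curr_tokens, prev_tokens)
print(round(ratio, 3))  # 0.333
```

Only "orange" is shared and the union has three members, which gives the 0.333 shown for that row.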
Current approach
I convert the multi-values into a frozenset by using the converters keyword of pandas.read_csv(...):
from typing import FrozenSet
import pandas

def parse_set(cell_value: str) -> FrozenSet[str]:
    return frozenset(cell_value.split(','))
round_tokens = pandas.read_csv(inpath, converters={"Tokens": parse_set})
I then create groups of Dyad,Participant data points using pandas.DataFrame.groupby(..):
round_tokens.sort_values(["Dyad", "Timestep"])
dyad_participants = round_tokens.groupby(["Dyad", "Participant"])
However, I am unsure how to get each row and its predecessor's Tokens value (which should be a frozenset): I have a function which tries to do so, but I'm unsure whether the function itself is wrong or whether I'm extracting the row data incorrectly.
def token_overlap(data):
    own_relevant_tokens = data["Tokens"]
    prev_tokens = data.shift(-1)["Tokens"]
    overlap = own_relevant_tokens.intersection(prev_tokens)
    union = own_relevant_tokens.union(prev_tokens)
    return len(overlap) / len(union)

round_tokens["TokenOverlap"] = dyad_participants.apply(token_overlap)
However, this doesn't actually work: The actual error is
AttributeError: 'Series' object has no attribute 'union'
but I know I'm not using/understanding/grokking the pandas API correctly, hence the extremely long-winded question. How can I group my data and then, inside each group, calculate a metric using set-like values from one row and the same column's values for the row preceding it?
† In the real data, there are over 1,000 possible values for Tokens, so, at least to me, this task would be even harder if I enumerated the presence of each token as Boolean values, e.g. Token_Apple, Token_Banana, etc.
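One likely reading of the AttributeError above: inside a group, data["Tokens"] is the whole column for that group (a pandas Series), not a single frozenset, so set methods such as union are not available on it. The sketch below illustrates this on a tiny made-up frame; the demo DataFrame and variable names are illustrative assumptions, not the asker's data.

```python
import pandas as pd

# A tiny stand-in for one Dyad/Participant pair from the question's data.
demo = pd.DataFrame({
    "Dyad": [1, 1],
    "Participant": ["A", "A"],
    "Timestep": [1, 2],
    "Tokens": [frozenset({"apple", "banana"}), frozenset({"banana"})],
})

for key, group in demo.groupby(["Dyad", "Participant"]):
    column = group["Tokens"]        # the group's whole column, not one cell
    is_series = isinstance(column, pd.Series)
    has_union = hasattr(column, "union")
    first_cell = column.iloc[0]     # an individual cell is the frozenset
    print(key, is_series, has_union, type(first_cell).__name__)
```

The set operations therefore have to be applied cell by cell, pairing each cell with the one from the preceding row of the same group.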
Setup
df
Dyad Participant Timestep Tokens
0 1 A 1 apple,banana
1 1 B 1 apple,orange
2 1 A 2 banana
3 1 B 2 orange,kumquat
4 1 A 3 orange
5 1 B 3 orange,pear
6 2 A 1 orange,pear
7 2 B 1 apple,banana,pear
8 2 A 2 banana,persimmon
9 2 B 2 apple
10 2 A 3 banana
11 2 B 3 apple
tokens = df.Tokens.str.split(',', expand=False).apply(frozenset)
tokens
0 (apple, banana)
1 (orange, apple)
2 (banana)
3 (orange, kumquat)
4 (orange)
5 (orange, pear)
6 (orange, pear)
7 (apple, banana, pear)
8 (persimmon, banana)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object
# union logic - https://stackoverflow.com/a/46402781/4909087
df = df.assign(Tokens=tokens)\
       .groupby(['Dyad', 'Participant']).apply(\
            lambda x: (x.Tokens.str.len() -
                       x.Tokens.diff().str.len()) \
                    / pd.Series([len(k[0].union(k[1]))
                                 for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))\
       .reset_index(level=[0, 1], name='TokenOverlap')\
       .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
       .sort_values(['Dyad', 'Timestep', 'Participant'])\
       .fillna('(no value)')\
       [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
df
Dyad Participant Timestep Tokens TokenOverlap
0 1 A 1 apple,banana (no value)
1 1 B 1 apple,orange (no value)
2 1 A 2 banana 0.5
3 1 B 2 orange,kumquat 0.333333
4 1 A 3 orange 0
5 1 B 3 orange,pear 0.333333
6 2 A 1 orange,pear (no value)
7 2 B 1 apple,banana,pear (no value)
8 2 A 2 banana,persimmon 0
9 2 B 2 apple 0.333333
10 2 A 3 banana 0.5
11 2 B 3 apple 1
In a nutshell, this code groups by Dyad and Participant and then computes the pairwise overlap ratio between each row and the row before it. This needs a somewhat complicated groupby and apply, since we need to do a few set union and difference operations. The core logic is inside the groupby.apply, while the rest is just prettification.
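The numerator in that lambda relies on a small set identity: the size of the current set minus the size of its difference from the previous set equals the size of their intersection. A minimal sketch with plain frozensets, using values from Dyad 1, participant B (the variable names are illustrative only):

```python
current = frozenset({"orange", "kumquat"})
previous = frozenset({"apple", "orange"})

# Tokens in the current row that were not in the previous one.
newly_added = current - previous

# |A| - |A \ B| counts the tokens carried over from the previous row ...
kept = len(current) - len(newly_added)
# ... which is exactly the size of the intersection.
assert kept == len(current & previous)

ratio = kept / len(current | previous)
print(round(ratio, 3))  # 0.333
```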
This code runs in:
10 loops, best of 3: 19.2 ms per loop
Breakdown
df2 = df.assign(Tokens=tokens)
df2 = df2.groupby(['Dyad', 'Participant']).apply(\
        lambda x: (x.Tokens.str.len() -
                   x.Tokens.diff().str.len()) \
                / pd.Series([len(k[0].union(k[1]))
                             for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))  # the for loop is part of this huge line
df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
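For comparison, here is a more explicit variant of the same idea. This is a sketch, not the answer's method: it swaps the diff-based lambda for a grouped shift(1) followed by a plain Python loop over (current, previous) pairs. The helper overlap and the temporary TokenSet column are hypothetical names, and missing predecessors are left as NaN rather than filled with '(no value)'.

```python
import pandas as pd

rows = [
    (1, "A", 1, "apple,banana"), (1, "B", 1, "apple,orange"),
    (1, "A", 2, "banana"),       (1, "B", 2, "orange,kumquat"),
    (1, "A", 3, "orange"),       (1, "B", 3, "orange,pear"),
    (2, "A", 1, "orange,pear"),  (2, "B", 1, "apple,banana,pear"),
    (2, "A", 2, "banana,persimmon"), (2, "B", 2, "apple"),
    (2, "A", 3, "banana"),       (2, "B", 3, "apple"),
]
df = pd.DataFrame(rows, columns=["Dyad", "Participant", "Timestep", "Tokens"])

# Parse the comma-separated strings into frozensets, as in the Setup section.
token_sets = df["Tokens"].str.split(",").apply(frozenset)

# Order rows by time within each group, then fetch each row's predecessor.
ordered = df.assign(TokenSet=token_sets).sort_values(
    ["Dyad", "Participant", "Timestep"])
previous = ordered.groupby(["Dyad", "Participant"])["TokenSet"].shift(1)


def overlap(current, prev):
    """Intersection over union; NaN when the row has no predecessor."""
    if not isinstance(prev, frozenset):
        return float("nan")
    return len(current & prev) / len(current | prev)


ordered["TokenOverlap"] = [
    overlap(c, p) for c, p in zip(ordered["TokenSet"], previous)
]

# Restore the original row order and drop the helper column.
result = ordered.sort_index().drop(columns="TokenSet")
print(result)
```

Keeping the set arithmetic in an ordinary function makes the per-row logic easier to test in isolation, at the cost of a Python-level loop over the rows.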