Pandas: for each unique value in one column, get the unique values in another column
Question
I have a dataframe where each row contains various metadata pertaining to a single Reddit comment (e.g. author, subreddit, comment text).
I want to do the following: for each author, collect all the subreddits they have commented in, and turn this into a pandas dataframe where each row corresponds to one author and holds a list of the unique subreddits they have commented in.
I have been trying combinations of the following, but can't get it to work:
Attempt 1:
group = df['subreddit'].groupby(df['author']).unique()
list(group)
Attempt 2:
from collections import defaultdict
subreddit_dict = defaultdict(list)
for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)
for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)
subreddit_df = pd.DataFrame.from_dict(subreddit_dict,
                                      orient='index')
Recommended answer
Here are two strategies to do it. No doubt there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...
Solution 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())
# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a    [sr1, sr2]
b         [sr2]
The author is the index, and the single column holds the list of all subreddits they are active in (this is how I interpreted your desired output from your description).
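As a hedged variation on the groupby approach, the column can be selected before aggregating, which avoids passing a lambda to `apply`. The sketch below reuses the toy data from this answer; the column name `subreddits` is only an illustrative choice, not something from the question.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each group.
per_author = df.groupby('author')['subreddit'].unique()

# Optional: turn the author index into a regular column.
# 'subreddits' is a hypothetical column name picked for this example.
per_author_df = per_author.reset_index(name='subreddits')
print(per_author_df)
```

This is essentially the asker's first attempt written with `df.groupby(...)`; calling `list(...)` on the result discards the author index, which is probably why that attempt looked wrong.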
If you want each subreddit in a separate column, which might be more usable depending on what you want to do with it, you can do this afterwards:
df2 = df2.apply(pd.Series)
Result:
>>> df2
          0    1
author
a       sr1  sr2
b       sr2  NaN
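`apply(pd.Series)` builds one Series per row, which can be slow on large frames. A possible alternative, sketched here on the same toy data, is to hand plain Python lists to the `DataFrame` constructor, which pads shorter rows with missing values. The variable names `unique_subs` and `wide` are made up for this example.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author.
unique_subs = df.groupby('author')['subreddit'].unique()

# Build the wide frame from lists; rows with fewer subreddits
# are padded with missing values.
wide = pd.DataFrame([list(v) for v in unique_subs],
                    index=unique_subs.index)
print(wide)
```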
Solution 2: iterating through the dataframe
You can make a new dataframe with all unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
Then get the list of all unique subreddits each author is active in, and assign it to a new column:
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
                     for _, x in df2.iterrows()]
This gives you:
>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
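Note that `set` does not guarantee element order, so the lists above may come out in a different order from run to run. If a stable, first-seen order matters, and an `author` column is wanted rather than an index, one possible sketch is to drop duplicate (author, subreddit) pairs and then collect the rest into lists. This is an alternative technique, not the method used above, and `result` is an illustrative name.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

result = (
    df.drop_duplicates(subset=['author', 'subreddit'])  # keep first occurrence of each pair
      .groupby('author', sort=False)['subreddit']       # keep authors in first-seen order
      .agg(list)                                        # collect remaining subreddits per author
      .reset_index()
      .rename(columns={'subreddit': 'subreddits'})
)
print(result)
```

This avoids the row-by-row `iterrows()` loop, which tends to be slow on large frames.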