Pandas, for each unique value in one column, get unique values in another column

Question

I have a dataframe where each row contains various metadata pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following: for each author, I want to grab a list of all the subreddits they have comments in, and transform this data into a pandas dataframe where each row corresponds to an author and holds a list of all the unique subreddits they comment in.

I am currently trying some combination of the following, but can't get it to work:

Attempt 1:

group = df['subreddit'].groupby(df['author']).unique()
list(group) 

Attempt 2:

from collections import defaultdict
subreddit_dict  = defaultdict(list)

for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict, 
                            orient = 'index')

Recommended answer

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

import pandas as pd

df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...

Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:

group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

>>> df2
author
a    [sr1, sr2]
b         [sr2]

The author is the index, and the single column holds the array of all unique subreddits they are active in (this is how I interpreted the output you wanted, according to your description).
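As a hedged variation on the groupby approach, the sketch below selects the `subreddit` column before calling `unique()`, so no lambda is needed, and then moves the author from the index back into an ordinary column. The column name `subreddits` and the variable names are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same toy data as in the answer above
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each author group
unique_per_author = df.groupby('author')['subreddit'].unique()

# Turn each NumPy array into a plain list and make 'author' a regular column
result = (unique_per_author
          .apply(list)
          .rename('subreddits')   # illustrative column name
          .reset_index())

print(result)
```

This yields one row per author, which matches the shape described in the question.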

If you want each subreddit in a separate column, which might be more usable depending on what you want to do with the result, you can do this afterwards:

df2 = df2.apply(pd.Series)

Result:

>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN
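An alternative way to get the same wide layout, sketched here under the assumption that the per-author values are first converted to plain lists, is to pass them straight to the `pd.DataFrame` constructor. Shorter rows are padded with missing values. The name `wide` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author, indexed by author
df2 = df.groupby('author')['subreddit'].unique()

# Build the wide frame from a list of lists; rows with fewer
# subreddits are padded with missing values
wide = pd.DataFrame([list(values) for values in df2], index=df2.index)

print(wide)
```

On large frames this tends to avoid the per-row overhead of `apply(pd.Series)`, though the difference is worth measuring on your own data.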

Solution 2: iterating through the dataframe

You can make a new dataframe with all unique authors:

df2 = pd.DataFrame({'author':df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]

This gives you:

>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
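Note that `set` does not guarantee any particular order, which is why `sr2` can appear before `sr1` in the output above. If a stable order matters, one possible adjustment (a sketch, not part of the original answer) is to sort each set and to iterate over the author column directly instead of using `iterrows()`:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One row per unique author
df2 = pd.DataFrame({'author': df['author'].unique()})

# sorted(set(...)) gives a deterministic, alphabetical list per author
df2['subreddits'] = [
    sorted(set(df.loc[df['author'] == author, 'subreddit']))
    for author in df2['author']
]

print(df2)
```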
