How to get the count of words from a DataFrame based on conditions


Problem description

I have the following two dataframes, badges and comments. From the badges dataframe I have created a list of 'gold users', i.e. users whose Class=1.

Here Name means the 'Name of Badge' and Class means the level of the badge (1=Gold, 2=Silver, 3=Bronze).

I have already done the text preprocessing on comments['Text'] and now want to find the counts of the top 10 words that gold users use in comments['Text'].

I tried the code below but am getting this error:
"KeyError: "None of [Index(['1532', '290', '1946', '1459', '6094', '766', '10446', '3106', '1',\n '1587',\n ...\n '35760', '45979', '113061', '35306', '104330', '40739', '4181', '58888',\n '2833', '58158'],\n dtype='object', length=1708)] are in the [index]". Please suggest a way to fix this.

Note: I received some answers on datascience.stackexchange, but they did not work. Link to StackExchange Problem

Dataframe 1 (badges)

   Id | UserId |  Name          |        Date              |Class | TagBased
   2  | 23     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   3  | 22     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   4  | 21     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   5  | 20     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   6  | 19     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False

Dataframe 2 (comments)

   Id|                    Text                             |    UserId  
    6|  [2006, course, allen, knutsons, 2001, course, ...  |    3   
    8|  [also, theo, johnsonfreyd, note, mark, haimans...  |    1

Code

#Classifying Users
df_gold_users = badges[(badges['Class'] == '1')]
df_silver_users = badges[(badges['Class'] != '1') & (badges['Class'] == '2') ]
df_bronze_users = badges[(badges['Class'] != '1') & (badges['Class'] != '2') & (badges['Class'] == '3')]

gold_users = df_gold_users['UserId'].value_counts().index
silver_users = df_silver_users['UserId'].value_counts().index
bronze_users = df_bronze_users['UserId'].value_counts().index

#Text Cleaning (clean_text function tokenizes and lemmatizes)
comments['Text'] = comments['Text'].apply(lambda x: clean_text(x))

#Getting comments made by Gold Users
for index,rows in comments.iterrows():
  gold_comments = rows[comments.Text.loc[gold_users]]
  Counter(gold_comments)

Expected output

#Top 10 Words that appear the most in the comments made by gold users with their count.
[['scholar',20],['school',18],['bus',15],['class',14],['teacher',14],['bell',13],['time',12],['books',11],['bag',9],['student',7]]

Recommended answer

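import itertools
import pandas as pd

# Keep only the gold badges (Class is compared as a string, as in the question)
df_gold_users = badges[(badges['Class'] == '1')]
# Attach each gold user's comments by joining on UserId
df = pd.merge(df_gold_users, comments, on='UserId')
# Flatten the per-comment token lists into one list of words
gold_text = list(itertools.chain.from_iterable(df['Text'].to_list()))
# Pair every word with a dummy value so it can be counted with groupby
gold_text = list(map(lambda x: [x, 1], gold_text))
gold_text_df = pd.DataFrame(gold_text, columns=['Text', 'xyz'])
gold_text_df = gold_text_df.groupby('Text')['xyz'].count().reset_index().sort_values(by=['xyz'], ascending=False)
# Top 10 words with their counts, as a list of [word, count] pairs
gold_text_df.head(10).values.tolist()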
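
For comparison, here is a minimal sketch of another way to get the same kind of result, using a boolean mask and collections.Counter instead of a merge. It also shows the likely cause of the KeyError in the question: .loc selects by index label, so comments.Text.loc[gold_users] looks for the user ids among the row labels of comments rather than in the UserId column. The toy data below is hypothetical and only mimics the shape of the two frames; it assumes ids and Class are stored as strings (as the error message suggests) and that comments['Text'] already holds lists of tokens.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data shaped like the question's frames (not the real data).
badges = pd.DataFrame({
    "UserId": ["1", "1", "3", "5"],
    "Class": ["1", "1", "3", "1"],
})
comments = pd.DataFrame({
    "UserId": ["1", "3", "5", "1"],
    "Text": [
        ["school", "bus", "school"],
        ["course", "allen"],
        ["school", "teacher"],
        ["bus", "bell"],
    ],
})

# Unique ids of users holding at least one gold badge.
gold_users = badges.loc[badges["Class"] == "1", "UserId"].unique()

# Select rows by the values in the UserId column. Using
# comments.Text.loc[gold_users] would instead look the ids up among the
# index labels, which is what raises the KeyError.
gold_comments = comments.loc[comments["UserId"].isin(gold_users), "Text"]

# Count every token across the selected comments.
word_counts = Counter(token for tokens in gold_comments for token in tokens)

# Ten most frequent words as [word, count] pairs.
top_words = [[word, count] for word, count in word_counts.most_common(10)]
print(top_words)
```

One difference from the merge-based code is worth noting: a user who holds several gold badges appears several times in badges, so a merge on UserId repeats that user's comments once per badge and inflates the counts. Filtering with isin keeps each comment once. Which behavior is wanted depends on the data, so treat this as a sketch, not a drop-in replacement.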
