Count words in a column of strings in Pandas


Question

I have a pandas dataframe that contains queries and counts for a given time period and I'm hoping to turn this dataframe into a count of unique words. For example, if the dataframe contained the below:

query          count
foo bar        10
super          8 
foo            4
super foo bar  2

I'd like to end up with the dataframe below. For example, the word 'foo' accounts for a total count of 16 across the table (10 + 4 + 2).

word    count
foo     16
bar     12
super   10

I'm working with the below function, but it hardly seems like the optimal way to do this and it ignores the total count for each row.

import re
from collections import Counter

def _words(df):
  return Counter(re.findall(r'\w+', ' '.join(df['query'])))

Any help would be greatly appreciated.

Thanks in advance!

Recommended answer

Option 1

df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

bar      12
foo      16
super    10
dtype: int64
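
To make Option 1 easier to follow, here is a minimal sketch that rebuilds the sample data from the question and prints the intermediate indicator matrix. The variable names `dummies` and `word_counts` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# One indicator column per word: 1 if the word occurs in that row's query, else 0.
dummies = df["query"].str.get_dummies(sep=" ")
print(dummies)

# Transposing gives a (words x rows) matrix; the dot product with the per-row
# counts adds up the counts of every row that contains each word.
word_counts = dummies.T.dot(df["count"])
print(word_counts)  # bar 12, foo 16, super 10
```

Because get_dummies produces 0/1 indicators, a word repeated within a single query is counted once for that row.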


Option 2

df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()

bar      12
foo      16
super    10
dtype: int64


Option 3
numpy.bincount + pd.factorize
This also highlights cytoolz.mapcat, which maps a function over a sequence and returns an iterator over the concatenated results. That's cool!

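import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

# Flatten all words, then encode each word as an integer code (f) plus its uniques (u).
f, u = pd.factorize(np.array(list(cytoolz.mapcat(str.split, q.tolist())), dtype=object))
# Words per query: number of spaces + 1 (assumes words are separated by single spaces).
l = np.char.count(q.astype(str), ' ') + 1

# Repeat each row's count once per word, then sum those weights per word code.
pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)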

foo      16
bar      12
super    10
dtype: int64
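
If cytoolz is not available, the same idea can be sketched with the standard library. This is a substitution rather than the answer's original code: `itertools.chain.from_iterable(map(str.split, ...))` stands in for `cytoolz.mapcat`, and the number of words per query is taken from the token lists instead of counting spaces. All variable names here are illustrative.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split every query, then flatten the token lists into one sequence of words.
tokens = [q.split() for q in df["query"]]
words = np.array(list(chain.from_iterable(tokens)), dtype=object)

# Integer code per word occurrence, plus the unique words in order of first appearance.
codes, uniques = pd.factorize(words)

# Repeat each row's count once for every word in that row.
lengths = [len(t) for t in tokens]
weights = df["count"].to_numpy().repeat(lengths)

# Sum the weights that share the same word code.
totals = np.bincount(codes, weights=weights).astype(int)
result = pd.Series(totals, index=uniques)
print(result)  # foo 16, bar 12, super 10
```

Taking lengths from the split tokens keeps the weights aligned with the words even when a query contains consecutive spaces.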


Option 4
Absurd use of stuff... just use option 1.

pd.DataFrame(dict(
    query=' '.join(df['query']).split(),
    count=df['count'].repeat(df['query'].str.count(' ') + 1)
)).groupby('query')['count'].sum()

query
bar      12
foo      16
super    10
Name: count, dtype: int64
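
As a further illustration that is not part of the original answer, the same split-and-sum idea can be written with DataFrame.explode. This sketch assumes pandas 0.25 or later, where explode is available; the column name `word` is made up for the example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split each query into a list of words, then give every word its own row.
# The row's count is carried along with each of its words.
exploded = df.assign(word=df["query"].str.split()).explode("word")

# Sum the carried counts per word.
word_counts = exploded.groupby("word")["count"].sum()
print(word_counts)  # bar 12, foo 16, super 10
```

Unlike the indicator-based options, this counts a word once per occurrence, so a word repeated inside one query contributes that row's count each time it appears.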
