How to calculate number of words in a string in DataFrame?


Problem description

Suppose we have a simple DataFrame:

import pandas as pd

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can we count the number of words in each keyword and summarise how many keywords have each word count, similar to:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

Recommended answer

If I understand correctly, you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split to split each string on whitespace, then apply len to count the elements in each resulting list. Calling value_counts on that result aggregates the counts into a frequency table.
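To make the chain above easier to follow, here is a small sketch that prints each intermediate result on the same sample data. The variable names (tokens, lengths, freq) are illustrative only and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    ['one apple', 'banana', 'box of oranges',
     'pile of fruits outside', 'one banana', 'fruits'],
    columns=['fruits'],
)

# Step 1: split each string on whitespace, giving a Series of lists
tokens = df['fruits'].str.split()

# Step 2: number of elements in each list, i.e. words per row
lengths = tokens.apply(len)

# Step 3: how many rows share each word count
freq = lengths.value_counts()

print(tokens.tolist())
print(lengths.tolist())             # [2, 1, 3, 4, 2, 1]
print(freq.sort_index().to_dict())  # {1: 2, 2: 2, 3: 1, 4: 1}
```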

We then rename the index and sort it to get the desired output.
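One thing to watch for: once the index has been converted to strings, sort_index orders the labels lexicographically, so a label such as "10 words:" would sort before "2 words:". The sketch below is a variation, not the original answer's code. It sorts while the index is still numeric and only then builds the labels, and it also uses the singular "word" for a count of 1, as in the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Tally words per row and sort while the index is still numeric
freq = df['fruits'].str.split().str.len().value_counts().sort_index()

# Build the labels afterwards: "word" for 1, "words" otherwise
freq.index = [f"{n} word:" if n == 1 else f"{n} words:" for n in freq.index]

print(freq)
```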

Update

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

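Timings (note that the second timed expression omits the value_counts step, so the two measurements are not strictly like-for-like):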

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a 6K-row df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
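To rerun the comparison outside IPython, something like the following sketch could be used. It makes two assumptions that are not in the original answer: the 6K-row frame is built by repeating the six sample strings 1000 times, and both variants include value_counts so that they do the same work. The helper names with_apply and with_str_len are hypothetical. Timings depend on the machine and pandas version, so no expected numbers are shown.

```python
import timeit

import pandas as pd

sample = ['one apple', 'banana', 'box of oranges',
          'pile of fruits outside', 'one banana', 'fruits']

# Assumption: 6 rows repeated 1000 times gives a 6000-row frame
df = pd.DataFrame({'fruits': sample * 1000})


def with_apply():
    # Python-level len called once per row
    return df['fruits'].str.split().apply(len).value_counts()


def with_str_len():
    # Same tally using the .str accessor instead of apply
    return df['fruits'].str.split().str.len().value_counts()


# Both variants should produce identical tallies
assert with_apply().sort_index().tolist() == with_str_len().sort_index().tolist()

for fn in (with_apply, with_str_len):
    # Best of 3 runs, 20 calls each, reported per call
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```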
