How to calculate number of words in a string in DataFrame?
Question
Suppose we have a simple DataFrame:
import pandas as pd
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']
How can we count the number of words in each keyword and summarise how often each word count occurs, similar to:
1 word: 2
2 words: 2
3 words: 1
4 words: 1
Recommended answer
IIUC, you can do the following:
In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[89]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
Here we use the vectorised str.split to split each string on whitespace, and then apply len to count the elements in each resulting list. We can then call value_counts to aggregate the frequency of each word count.
We then rename the index and sort it to get the desired output.
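To make the chain easier to follow, here is a minimal sketch that breaks it into its intermediate steps. The variable names (words, words_per_row, frequency) are introduced here for illustration only and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving a list of words per row.
words = df['fruits'].str.split()

# Step 2: count the words in each list.
words_per_row = words.apply(len)

# Step 3: count how many rows share each word count, ordered by word count.
frequency = words_per_row.value_counts().sort_index()

print(words_per_row.tolist())  # [2, 1, 3, 4, 2, 1]
print(list(zip(frequency.index.tolist(), frequency.tolist())))
# [(1, 2), (2, 2), (3, 1), (4, 1)]
```

Sorting while the index is still numeric keeps the rows in word-count order; the text labels can be attached afterwards.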
Update
This can also be done using str.len rather than apply, which should scale better:
In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[41]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
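If the labels should match the wording in the question exactly ("1 word:" versus "2 words:"), one possible sketch combines str.len with value_counts and maps the index through a small helper. The helper name label is hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

def label(n):
    # Hypothetical helper: "1 word:" for a single word, "<n> words:" otherwise.
    return f"{n} word:" if n == 1 else f"{n} words:"

# Words per row via the vectorised str.len, then the frequency of each count.
counts = df['fruits'].str.split().str.len().value_counts()

# Sort while the index is still numeric, then swap in the text labels.
counts = counts.sort_index()
counts.index = counts.index.map(label)

print(counts.index.tolist())  # ['1 word:', '2 words:', '3 words:', '4 words:']
print(counts.tolist())        # [2, 2, 1, 1]
```

Sorting before relabelling avoids lexicographic ordering of the string labels, which would place "10 words:" ahead of "2 words:" on data with longer strings.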
Timings (the first statement also includes value_counts, so the two are not strictly like for like):
In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop
For a 6K-row df:
In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
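To reproduce a comparison like this outside IPython, a rough sketch using the standard timeit module is shown below. It assumes the 6K-row frame was built by repeating the six sample rows 1000 times, which the original does not state, and it includes value_counts in both variants so they do the same work. Absolute numbers will vary with machine and pandas version.

```python
import timeit

import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']

# Assumption: a 6,000-row frame made by repeating the sample rows.
df = pd.DataFrame({'fruits': base * 1000})

def with_apply():
    # Python-level len applied to each list of words.
    return df['fruits'].str.split().apply(len).value_counts()

def with_str_len():
    # Vectorised string accessor for the same per-row count.
    return df['fruits'].str.split().str.len().value_counts()

for fn in (with_apply, with_str_len):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```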