Count occurrences of each of certain words in pandas dataframe


Question

I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Is there a method to match regular expression and get the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

Recommended answer

Update: The original answer (kept further down) counts the rows that contain a substring, not the total number of matches.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
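
For the case in the question (around 100 strings, with a count for each one), a minimal sketch is to call .str.count once per word and collect the sums. The word list below is hypothetical, and re.escape is used on the assumption that the words should be matched literally rather than as regular expressions:

```python
import re
import pandas as pd

df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

# Hypothetical word list; in practice this would hold the ~100 strings to match.
words = ['he', 'wo', 'l']

# re.escape turns each word into a literal pattern, so characters such as
# '.' or '+' inside a word are not interpreted as regex syntax.
per_word = pd.Series(
    {w: int(df['words'].str.count(re.escape(w)).sum()) for w in words}
)
print(per_word)
```

If only whole words should match, each pattern could be wrapped as r'\b' + re.escape(w) + r'\b'.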

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match, you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
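
The same idea can be repeated for several words to get the number of matching rows per word. This is only a sketch: the word list and the missing value in the sample data are made up for illustration, regex=False assumes plain substring matching is wanted, and na=False makes missing values count as non-matches:

```python
import pandas as pd

# Sample data with one missing value, added here for illustration only.
df = pd.DataFrame(['hello', 'world', None, 'hehe'], columns=['words'])

# Hypothetical word list.
words = ['he', 'wo']

# For each word, count the rows whose text contains it at least once.
rows_per_word = {
    w: int(df['words'].str.contains(w, regex=False, na=False).sum())
    for w in words
}
print(rows_per_word)
```

Note that 'hehe' contributes a single row for 'he' here, whereas .str.count would count it twice.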
