Python pandas: count the number of regex matches in a string
Question
I have a dataframe of sentences and a dictionary of terms grouped into topics, and I want to count the number of term matches for each topic in each sentence.
import pandas as pd

terms = {
    'animals': ['fox', 'deer', 'eagle'],
    'people': ['John', 'Rob', 'Steve'],
    'games': ['basketball', 'football', 'hockey'],
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'The quick brown fox was playing basketball today',
        'John and Rob visited the eagles nest, the foxes ran away',
        'Bill smells like a wet dog',
        'Steve threw the football at a deer. But the football missed',
        'Sheriff John does not like hockey',
    ],
})
So far I have created one column per topic and, by iterating over the dictionary, marked a row with 1 if any of that topic's words is present.
df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])
for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1
print(df)
I get:
                                                 Foo  Score animals games  \
0   The quick brown fox was playing basketball today      4       1     1
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN
2                         Bill smells like a wet dog      2     NaN   NaN
3  Steve threw the football at a deer. But the fo...      7       1     1
4                  Sheriff John does not like hockey      8     NaN     1

  people
0    NaN
1      1
2    NaN
3      1
4      1
What is the best way to count how many words from each topic appear in each sentence? And is there a more efficient way of looping over the dictionary without using cython?
Recommended answer
You can use split with stack to get one row per word, which is about 5 times faster than the Counter solution:
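# One row per word; the 'index' column holds the original row label.
df1 = (df.Foo.str.split(expand=True).stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='Foo'))
for k, v in terms.items():
    # True where the word contains any term of topic k
    df1[k] = df1.Foo.str.contains('|'.join(v))
# print(df1)
# Sum only the topic columns so the text column is not aggregated.
print(df1.groupby('index')[list(terms)].sum().astype(int))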
       games  animals  people
index
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
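As a possible alternative, pandas also provides Series.str.count, which counts regex matches in each string directly, so the sentences do not need to be split first. The sketch below is not part of the original answer: the helper name count_topic_matches is hypothetical, and escaping the terms with re.escape is an added assumption so that every term is matched literally. Unlike the split approach, it counts every occurrence, including repeated matches inside a single word.

```python
import re

import pandas as pd

terms = {
    'animals': ['fox', 'deer', 'eagle'],
    'people': ['John', 'Rob', 'Steve'],
    'games': ['basketball', 'football', 'hockey'],
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'The quick brown fox was playing basketball today',
        'John and Rob visited the eagles nest, the foxes ran away',
        'Bill smells like a wet dog',
        'Steve threw the football at a deer. But the football missed',
        'Sheriff John does not like hockey',
    ],
})


def count_topic_matches(frame, topics, column='Foo'):
    """Return one column per topic with the number of regex matches per row.

    Hypothetical helper for illustration only.
    """
    counts = {}
    for topic, words in topics.items():
        # Build one alternation pattern per topic, e.g. 'fox|deer|eagle'.
        # re.escape keeps characters such as '.' from acting as regex syntax.
        pattern = '|'.join(re.escape(word) for word in words)
        counts[topic] = frame[column].str.count(pattern)
    return pd.DataFrame(counts, index=frame.index)


counts = count_topic_matches(df, terms)
print(counts)
```

The result can be attached to the original frame with df.join(counts) if the counts should sit next to the Score and Foo columns.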
Timings:
In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop
In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop
Code:
def a(df):
    # split + stack: one row per word, then flag and sum per original row
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    # Sum only the topic columns so the text column is not aggregated.
    return df1.groupby('index')[list(terms)].sum().astype(int)

def b(df):
    # Counter solution: replace each term by its word count, row by row
    from collections import Counter
    df1 = pd.DataFrame(terms)
    res = []
    for i, r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)
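The %timeit figures come from an IPython session. To run a rough comparison from a plain script, the standard-library timeit module can be used instead. The sketch below makes some assumptions: split_stack_counts is a hypothetical variant of a() that takes the terms as an argument, and it calls dropna() after stack() so that the padding cells created by split(expand=True) are discarded explicitly. Absolute timings depend on the machine and the pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd

terms = {
    'animals': ['fox', 'deer', 'eagle'],
    'people': ['John', 'Rob', 'Steve'],
    'games': ['basketball', 'football', 'hockey'],
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'The quick brown fox was playing basketball today',
        'John and Rob visited the eagles nest, the foxes ran away',
        'Bill smells like a wet dog',
        'Steve threw the football at a deer. But the football missed',
        'Sheriff John does not like hockey',
    ],
})


def split_stack_counts(frame, topics):
    """Count, per row, the words that contain a term of each topic.

    Hypothetical variant of a() for illustration only.
    """
    # One entry per word, indexed by (original row, word position).
    words = frame['Foo'].str.split(expand=True).stack().dropna()
    # Boolean flag per word and topic: does the word contain any term?
    flags = pd.DataFrame({
        topic: words.str.contains('|'.join(topic_terms))
        for topic, topic_terms in topics.items()
    })
    # Sum the flags over the first index level, i.e. the original row.
    return flags.groupby(level=0).sum().astype(int)


result = split_stack_counts(df, terms)

# Time 20 calls and report the average per call in milliseconds.
runs = 20
elapsed = timeit.timeit(lambda: split_stack_counts(df, terms), number=runs)
print(result)
print('%.2f ms per call' % (elapsed / runs * 1000))
```

Any other candidate function can be timed the same way by swapping it into the lambda.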