pandas find strings in common among Series


Question

I have a Series of keywords extracted from a bigger DataFrame, and a DataFrame with, among others, a column of strings. I would like to mask the DataFrame to find which strings contain at least one keyword. The "Keywords" Series is as follows (sorry for the weird words):

Skilful
Wilful
Somewhere
Thing
Strange

The DataFrame looks as follows:

User_ID;Tweet
01;hi all
02;see you somewhere
03;So weird
04;hi all :-)
05;next big thing
06;how can i say no?
07;so strange
08;not at all

So far I have used the str.contains() function from pandas, like this:

mask = df['Tweet'].str.contains(str(Keywords['Keyword'][4]), case=False)

which works well for finding the "Strange" string in the DataFrame and returns:

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
Name: Tweet, dtype: bool
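
For reference, the sample above can be rebuilt as a self-contained snippet. This is only a sketch: the data is typed in by hand from the listings, and it assumes that Keywords is a DataFrame with a 'Keyword' column, which is what the indexing Keywords['Keyword'][4] suggests.

```python
import pandas as pd

# Sample data copied from the listings above (assumed layout)
Keywords = pd.DataFrame(
    {"Keyword": ["Skilful", "Wilful", "Somewhere", "Thing", "Strange"]}
)
df = pd.DataFrame({
    "User_ID": ["01", "02", "03", "04", "05", "06", "07", "08"],
    "Tweet": [
        "hi all", "see you somewhere", "So weird", "hi all :-)",
        "next big thing", "how can i say no?", "so strange", "not at all",
    ],
})

# Case-insensitive substring test for a single keyword ("Strange")
mask = df["Tweet"].str.contains(str(Keywords["Keyword"][4]), case=False)
print(mask)
```

Only row 6 ("so strange") comes back True, matching the output shown above.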

What I would like to do is mask the whole DataFrame with all of the keywords in the array, so that I get something like this:

0    False
1     True
2    False
3    False
4     True
5    False
6     True
7    False
Name: Tweet, dtype: bool

Is it possible without looping through the array? In my real case I have to search through millions of strings, so I'm looking for a fast method.

Thanks for your help.

Recommended answer

Another way to achieve this is to use pd.Series.isin() with map and apply. With your sample it looks like this:

df    # DataFrame

   User_ID              Tweet
0        1             hi all
1        2  see you somewhere
2        3           So weird
3        4         hi all :-)
4        5     next big thing
5        6  how can i say no?
6        7         so strange
7        8         not at all


w    # Series

0      Skilful
1       Wilful
2    Somewhere
3        Thing
4      Strange
dtype: object


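# list of booleans: True when any whole word of the tweet is a keyword
# (list() is needed on Python 3, where map returns an iterator)
masked = list(map(lambda x: any(w.apply(str.lower).isin(x)),
                  df['Tweet'].apply(str.lower).apply(str.split)))

df['Tweet_masked'] = masked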

Result:

df
Out[13]: 
   User_ID              Tweet Tweet_masked
0        1             hi all        False
1        2  see you somewhere         True
2        3           So weird        False
3        4         hi all :-)        False
4        5     next big thing         True
5        6  how can i say no?        False
6        7         so strange         True
7        8         not at all        False

As a side note, isin only matches when a whole string equals one of the values. If you want str.contains-style substring matching instead, here is the variant:

# True when any lowercased keyword occurs anywhere inside the lowercased tweet
masked = list(map(lambda x: any(_ in x for _ in w.apply(str.lower)),
                  df['Tweet'].apply(str.lower)))
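
To make the difference between the two variants concrete, here is a small sketch. It uses list comprehensions rather than map + lambda, and the second tweet ("many things") is made up for illustration and is not part of the sample data.

```python
import pandas as pd

w = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])
# The second row is a hypothetical tweet added only to show the difference
tweets = pd.Series(["next big thing", "many things"])

keywords = w.str.lower()

# Whole-word match: split each tweet and test membership with isin
whole_word = [bool(keywords.isin(words).any())
              for words in tweets.str.lower().str.split()]

# Substring match: a keyword may appear inside a longer word
substring = [any(k in t for k in keywords) for t in tweets.str.lower()]

print(whole_word)  # [True, False]
print(substring)   # [True, True]
```

The isin variant misses "things" because no whole word equals "thing", while the substring variant catches it.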

Updated: as @Alex pointed out, it could be even more efficient to combine map with a regexp. In fact, I don't much like map + lambda either, so here we go:

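import re

# One pattern that matches any keyword anywhere in the tweet, ignoring case
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)

# list() is needed on Python 3, where map returns an iterator
masked = list(map(bool, map(r.match, df['Tweet'])))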
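
The same joined pattern can also be passed directly to str.contains, which avoids the explicit map. This is a sketch rather than part of the original answer: it assumes the keywords should be matched literally (hence re.escape), and it has not been benchmarked against the versions above on millions of rows.

```python
import re

import pandas as pd

w = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])
df = pd.DataFrame({
    "Tweet": [
        "hi all", "see you somewhere", "So weird", "hi all :-)",
        "next big thing", "how can i say no?", "so strange", "not at all",
    ],
})

# Escape each keyword so regex metacharacters are matched literally,
# then join them into a single alternation: Skilful|Wilful|...
pattern = "|".join(re.escape(k) for k in w)

# One vectorized call over the whole column, case-insensitive
df["Tweet_masked"] = df["Tweet"].str.contains(pattern, case=False, regex=True)

print(df["Tweet_masked"].tolist())
# [False, True, False, False, True, False, True, False]
```

If the column may contain missing values, str.contains also accepts an na argument to decide how those rows are reported.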
