Using Pandas to Filter String In Cell with Multiple Values
Question
I am using pandas to filter a DataFrame with str.contains(), but my logic drops rows that I want to keep. I don't know how to handle this in pandas.
A sample cell in the Excel sheet that I am working with looks like this:
Case #1: Don't flag this, because there is a different recipient, bob@gmail.com
Recipient
---------
joe@work.com, bob@gmail.com, sally@work.com
Case #2: Flag this, because every recipient contains @work.com
Recipient
---------
mike@work.com, taylor@work.com, barbra@work.com
I only need the filter to apply when a specific condition holds. For example, if Recipient contains only the email joe@work.com, drop the row. But if the Recipient column contains 'joe@work.com, bob@gmail.com' (yes, the values are comma-separated like that in a single cell), keep it. Eventually, the flagged rows will be dropped from a final report. So I want to drop every row that contains only @work.com addresses, but keep any row that contains both @gmail.com and @work.com addresses.
The query below flags everything, even if the Recipient column contains 'gmail.com':
df['EMAIL10'] = (df['Type'].str.contains('Email')
                 & df['Type'].str.contains('Tracking | Data')
                 & df['Recipient'].str.contains('@work.com'))
Let me know if any clarification is needed.
Answer
You can create a Boolean mask that indicates whether or not all of the separate words in a cell contain '@work'.
First, split so that each word becomes a separate list element, and explode will turn this into one big Series, with the index duplicated and pointing back to the index of your original DataFrame. .str.contains checks your condition on each word, and groupby(level=0).all() (spelled all(level=0) in pandas versions before 2.0, where the level argument still existed) checks whether it's True for every word in a row from your original DataFrame.
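import pandas as pd
df = pd.DataFrame({'col': ['joe@work.com, bob@gmail.com, sally@work.com',
                           'mike@work.com, taylor@work.com, barbra@work.com']})
# Split each cell into addresses, explode to one address per row (the index repeats),
# test each address, then reduce back to one value per original row.
# groupby(level=0).all() is equivalent to the older .all(level=0), removed in pandas 2.0.
df['all_work'] = (df['col'].str.split(', ').explode()
                  .str.contains('@work').groupby(level=0).all())
print(df)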
col all_work
0 joe@work.com, bob@gmail.com, sally@work.com False
1 mike@work.com, taylor@work.com, barbra@work.com True
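Applied to the question's flag column, the mask can take the place of the plain Recipient test. The following is only a sketch under assumptions: the 'Type' values and the extra rows are made up for illustration, the helper name all_work and the final report variable are hypothetical, and the pattern 'Tracking|Data' is written without the spaces around '|' on the assumption that those spaces were not intended.

```python
import pandas as pd

# Hypothetical sample data; the 'Type' values are invented for this sketch.
df = pd.DataFrame({
    'Type': ['Email Tracking', 'Email Tracking', 'Email Data', 'Phone Data'],
    'Recipient': [
        'joe@work.com, bob@gmail.com, sally@work.com',
        'mike@work.com, taylor@work.com, barbra@work.com',
        'joe@work.com',
        'ann@work.com',
    ],
})

# True only when every comma-separated address in the cell contains '@work.com'.
# regex=False treats the pattern as a literal string, so '.' is not a wildcard.
all_work = (
    df['Recipient']
    .str.split(', ')
    .explode()
    .str.contains('@work.com', regex=False)
    .groupby(level=0)
    .all()
)

# Same Type conditions as in the question, combined with the per-row mask.
df['EMAIL10'] = (
    df['Type'].str.contains('Email')
    & df['Type'].str.contains('Tracking|Data')
    & all_work
)

# Keep only the rows that were not flagged.
report = df[~df['EMAIL10']]
print(report)
```

Because the mask is indexed by the original row labels, it aligns with the other Boolean Series when they are combined with &.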
To illustrate, after the split and explode, we have:
df['col'].str.split(', ').explode()
0 joe@work.com
0 bob@gmail.com # Each item split separately
0 sally@work.com
1 mike@work.com
1 taylor@work.com
1 barbra@work.com
#|
#Index corresponds to Index of the original DataFrame
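The remaining two steps can be inspected the same way. This is a small sketch using the same sample data; the intermediate names exploded, is_work and all_work are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ['joe@work.com, bob@gmail.com, sally@work.com',
                           'mike@work.com, taylor@work.com, barbra@work.com']})

# One address per row; the index repeats the original row labels (0, 0, 0, 1, 1, 1).
exploded = df['col'].str.split(', ').explode()

# One Boolean per address; the repeated index is preserved.
is_work = exploded.str.contains('@work', regex=False)
print(is_work)

# Collapse back to one value per original row:
# True only if every address in that row matched.
all_work = is_work.groupby(level=0).all()
print(all_work)
```

Grouping on level=0 is what ties each address back to the cell it came from, so the result has one entry per row of the original DataFrame.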