Using Pandas to Filter String In Cell with Multiple Values

Question

I am using pandas to filter a DataFrame with str.contains(), but my logic drops rows that I want to keep. I don't know how to sort this out with pandas.

A sample cell in the Excel sheet that I am working with looks like this:

Case #1: Don't flag this, because there is a different recipient, bob@gmail.com

Recipient
---------
joe@work.com, bob@gmail.com, sally@work.com

Case #2: Flag this, because every recipient contains @work.com

Recipient
---------
mike@work.com, taylor@work.com, barbra@work.com

I have a situation where the filter should only apply in a specific case. For example, if 'Recipient' contains only the email joe@work.com, drop that row. But if the Recipient column contains 'joe@work.com, bob@gmail.com' (yes, the values are comma-separated like that in a single cell), keep it. Eventually, the flagged rows will be dropped from a final report. So I want to drop every row that contains only @work.com addresses, but not a row that also contains a @gmail.com address.

The query below drops everything, even when the Recipient column contains 'gmail.com':

df['EMAIL10'] = (
    df['Type'].str.contains('Email')
    & df['Type'].str.contains('Tracking | Data')
    & df['Recipient'].str.contains('@work.com')
)
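
As a minimal sketch of why the Recipient condition cannot tell the two cases apart: str.contains is True whenever the pattern occurs anywhere in the cell, so one @work.com address is enough. The Series name `recipients` is made up for illustration, and regex=False is used here only so that the dot is matched literally.

```python
import pandas as pd

recipients = pd.Series([
    "joe@work.com, bob@gmail.com, sally@work.com",      # Case #1: mixed domains
    "mike@work.com, taylor@work.com, barbra@work.com",  # Case #2: only @work.com
])

# True if "@work.com" appears anywhere in the cell, regardless of
# what the other comma-separated addresses are.
any_work = recipients.str.contains("@work.com", regex=False)

print(any_work.tolist())  # [True, True]
```

Both cells come back True, which is why Case #1 is flagged along with Case #2.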

Let me know if anything needs clarification.

Recommended answer

You can create a Boolean mask that indicates whether all of the separate words in a cell contain '@work'.

First, split so that each word becomes a separate list element, then explode to turn this into one long Series, with the index duplicated and pointing back to the index of your original DataFrame. .str.contains checks your condition, and all(level=0) checks whether it is True for every word in a row of your original DataFrame. Note that the level argument of all() was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions the equivalent reduction is groupby(level=0).all().

import pandas as pd

df = pd.DataFrame({'col': ['joe@work.com, bob@gmail.com, sally@work.com', 
                           'mike@work.com, taylor@work.com, barbra@work.com']})

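# groupby(level=0).all() reduces the exploded rows back to one value per
# original row; it replaces .all(level=0), which pandas 2.0 removed.
df['all_work'] = df['col'].str.split(', ').explode().str.contains('@work').groupby(level=0).all()
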
print(df)
                                               col  all_work
0      joe@work.com, bob@gmail.com, sally@work.com     False
1  mike@work.com, taylor@work.com, barbra@work.com      True
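
For comparison, here is a sketch of the same mask computed with plain Python string methods instead of explode. The helper name `only_work_addresses` is hypothetical, and the sketch assumes every cell is a string (a missing value would raise an error) and that the domain to test is exactly '@work.com'.

```python
import pandas as pd

df = pd.DataFrame({'col': ['joe@work.com, bob@gmail.com, sally@work.com',
                           'mike@work.com, taylor@work.com, barbra@work.com']})


def only_work_addresses(cell):
    """Return True if every comma-separated address ends with '@work.com'."""
    # strip() removes the space left behind after each comma
    return all(addr.strip().endswith('@work.com') for addr in cell.split(','))


df['all_work'] = df['col'].apply(only_work_addresses)

print(df['all_work'].tolist())  # [False, True]
```

This is usually slower than the vectorized string methods on large frames, but it can be easier to read and to extend with extra per-address rules.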

To illustrate the split-and-explode step, the intermediate Series looks like this:

df['col'].str.split(', ').explode()

 0       joe@work.com 
 0      bob@gmail.com   # Each item split separately
 0     sally@work.com
 1      mike@work.com
 1    taylor@work.com
 1    barbra@work.com
#|
#Index corresponds to Index of the original DataFrame
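
Putting it together with the question's other conditions could look roughly like the sketch below. The 'Type' values are invented for illustration, the pattern 'Tracking|Data' is written without the spaces around the pipe on the assumption that those spaces were unintentional, and `report` is a hypothetical name for the frame that remains once flagged rows are removed.

```python
import pandas as pd

df = pd.DataFrame({
    # Hypothetical 'Type' values; the question does not show this column.
    'Type': ['Email Tracking', 'Email Tracking', 'Email Data'],
    'Recipient': [
        'joe@work.com, bob@gmail.com, sally@work.com',
        'mike@work.com, taylor@work.com, barbra@work.com',
        'joe@work.com',
    ],
})

# One row per address, then reduce back to one Boolean per original row.
all_work = (
    df['Recipient']
    .str.split(',')
    .explode()
    .str.strip()                  # drop the space that follows each comma
    .str.endswith('@work.com')
    .groupby(level=0)
    .all()
)

# Flag a row only when every recipient is a @work.com address.
df['EMAIL10'] = (
    df['Type'].str.contains('Email')
    & df['Type'].str.contains('Tracking|Data')
    & all_work
)

# Keep the rows that were not flagged.
report = df[~df['EMAIL10']]

print(df['EMAIL10'].tolist())   # [False, True, True]
print(report.index.tolist())    # [0]
```

Only the mixed-domain row survives, which matches Case #1 (keep) and Case #2 (flag) from the question.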
