Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract


Question

I have a dataframe and I am trying to select the rows where one of the columns matches any of a set of strings. The dataframe looks like this:

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

And another dataframe with URL patterns:

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use

import pandas as pd

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines='skip' instead.
urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it gives me

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix this?

Recommended answer

At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time']; it does not make use of the capturing group. The UserWarning is therefore alerting you that the regex defines a capturing group whose captured text is never used.
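If the captured text is actually what you want, the warning's own suggestion applies. The following is a minimal sketch, using the toy data from the example further down rather than the asker's files, of how str.contains and str.extract differ for the same pattern:

```python
import pandas as pd

df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

# str.extract returns the text captured by each group, one column per group.
# Rows that do not match get NaN.
extracted = df['event_time'].str.extract('g(.*)')

# str.contains on a group-free version of the pattern only says
# whether each row matched.
matched = df['event_time'].str.contains('g.*', regex=True)

print(extracted)
print(matched)
```

Here the unnamed group ends up in a column labelled 0; a named group such as (?P<rest>.*) would give the column that name instead.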

If you wish to get rid of the UserWarning, you could find and remove the capturing groups from the regex pattern(s). None appear in the regex patterns you posted, but they must be present in your actual file. Look for parentheses outside of the character classes.
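With a long pattern file, searching for the parentheses by eye can be tedious. One possible way to locate them, sketched here under the assumption that the patterns are available as a list of strings, is to compile each pattern and check its groups attribute, which counts capturing groups. The third pattern below is hypothetical and only serves to show a hit; the first two are taken from the question.

```python
import re

patterns = [
    r'003\.ru\/sonyxperia',
    r'003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_',
    r'1click\.ru\/(sony|motorola)',  # hypothetical pattern with a capturing group
]

# Parentheses inside a character class [...] are literal characters,
# so only the third pattern reports a capturing group.
offenders = [p for p in patterns if re.compile(p).groups > 0]

for p in offenders:
    print('has capturing group:', p)
```

In the asker's setting, the list would come from urls.url.values.tolist().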

Alternatively, you could suppress this particular UserWarning by placing

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
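A filter installed this way stays active for the rest of the program. If that is broader than you want, a scoped variant is possible with warnings.catch_warnings. The sketch below uses toy data, and it matches on the fragment "match groups" rather than the full sentence, on the assumption that the exact wording of the message may differ between pandas versions:

```python
import warnings

import pandas as pd

df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

# The filter is only active inside the with-block; the previous
# warning configuration is restored on exit.
with warnings.catch_warnings():
    warnings.filterwarnings(
        'ignore',
        message='.*match groups',  # regex matched against the start of the message
        category=UserWarning,
    )
    mask = df['event_time'].str.contains('g(.*)', regex=True)

print(df[mask])
```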

Here is a simple example which demonstrates the problem (and the solution):

# import warnings
# warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning

import pandas as pd

df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})

urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
# urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

substr = urls.url.values.tolist()
df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

  script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})   

avoids the UserWarning.
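Simply dropping the parentheses is not always possible, for instance when they delimit an alternation. In that case a non-capturing group, written (?:...), keeps the grouping without capturing anything. The sketch below uses made-up patterns on the toy data and records the warnings raised by str.contains to compare the two forms; the helper name match_group_warnings is purely illustrative.

```python
import re
import warnings

import pandas as pd

df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

capturing = 'g(ou|ru)'        # plain parentheses create a capturing group
non_capturing = 'g(?:ou|ru)'  # (?:...) groups without capturing

# A compiled pattern reports how many capturing groups it contains.
n_capturing = re.compile(capturing).groups
n_non_capturing = re.compile(non_capturing).groups


def match_group_warnings(pattern):
    """Collect the 'match groups' warnings str.contains raises for pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        df['event_time'].str.contains(pattern, regex=True)
    return [w for w in caught if 'match groups' in str(w.message)]


warned_capturing = match_group_warnings(capturing)
warned_non_capturing = match_group_warnings(non_capturing)

# Both patterns select the same rows; only the capturing form warns.
mask = df['event_time'].str.contains(non_capturing, regex=True)
matched = df.loc[mask, 'event_time'].tolist()
print(matched)
```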
