How to use regex re.compile match() or findall() in list comprehension

Problem description

I am trying to use regex in a list comprehension without having to use the pandas extract() function.

I want to use regex because my code may later need more complex pattern matching. A kind user here suggested that I use the str accessor functions, but that works mainly because the current pattern is simple enough.

For now, I need to return the pandas rows that either contain nan or whose value under ODFS_FILE_CREATE_DATETIME is not a string of 10 digits, i.e. does not match the current format (for example 2020012514). To that end I tried to bypass the str methods and use regex. However, this doesn't do anything: it puts every row into my list of tuples, even though I told it to include only rows that contain nan or for which bool(regex.search()) is not true:

import re

import pandas as pd

def process_csv_formatting(csv):
    odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'], dtype={'ODFS_FILE_CREATE_DATETIME': str})
    odfscsv_df['CSV_FILENAME'] = csv.name
    odfscdate_re = re.compile(r"\d{10}")
    errortup = [(odfsname, "Bad_ODFS_FILE_CREATE_DATETIME= " + str(cdatetime), csv.name) for odfsname, cdatetime in zip(odfscsv_df['ODFS_LOG_FILENAME'], odfscsv_df['ODFS_FILE_CREATE_DATETIME']) if not odfscdate_re.search(str(cdatetime))]
    emptypdf = pd.DataFrame(columns=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])

    #print([tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(axis=1) | odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ].values])
    m1 = odfscsv_df.isna().any(axis=1)
    s = odfscsv_df['ODFS_FILE_CREATE_DATETIME']
    m2 = ~s.astype(str).str.isnumeric()
    m2 = bool(odfscdate_re.search(str(s)))
    m4 = not m2
    print(m4)
    m3 = s.astype(str).str.len().ne(10)

    #print([tuple(x) for x in odfscsv_df[m1 | m2 | m3].values])
    print([tuple(x) for x in odfscsv_df[m1 | ~bool(odfscdate_re.search(str(s)))].values])

    if len(errortup) != 0:
        #print(errortup)  #put this in log file statement somehow
        #print(errortup[0][2])
        return emptypdf
    else:
        return odfscsv_df

Recommended answer

If you want to use the re module, you need to apply it to the column with map, so that the regex is tested against each value separately. For 10-digit strings, use the pattern r"^\d{10}$".

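import re

odfscdate_re = re.compile(r"^\d{10}$")

# Rows that contain at least one missing value.
m1 = odfscsv_df.isna().any(axis=1)
# Rows whose value is not exactly ten digits; map runs the regex on each element.
m2 = odfscsv_df['ODFS_FILE_CREATE_DATETIME'].map(
    lambda x: odfscdate_re.search(str(x)) is None)
[tuple(x) for x in odfscsv_df[m1 | m2].values]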
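
For anyone who wants to try this without the original CSV, here is a minimal, self-contained sketch of the same idea. The sample DataFrame, the helper name is_bad, and the variable names are assumptions made up for illustration; they are not part of the original question. It also shows what probably went wrong in the question: str() applied to a whole Series produces one long string (the printed column), so a single search on it yields a single bool rather than one result per row.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df = pd.DataFrame({
    "ODFS_LOG_FILENAME": ["a.log", "b.log", "c.log", "d.log"],
    "ODFS_FILE_CREATE_DATETIME": ["2020012514", "20200125", None, "2020-01-25"],
})

# str() on a Series returns the printed column as ONE string, so searching it
# gives one answer for the whole column instead of one answer per row.
whole_column_text = str(df["ODFS_FILE_CREATE_DATETIME"])
whole_column_matches = bool(re.compile(r"\d{10}").search(whole_column_text))

# Anchored pattern: the value must consist of ten digits and nothing else.
date_re = re.compile(r"^\d{10}$")


def is_bad(value):
    """Return True when the value is not a string of exactly ten digits."""
    return date_re.search(str(value)) is None


# Element-wise masks: one boolean per row.
has_nan = df.isna().any(axis=1)
bad_format = df["ODFS_FILE_CREATE_DATETIME"].map(is_bad)

# Rows that are missing a value or have a malformed timestamp.
bad_rows = [tuple(row) for row in df[has_nan | bad_format].values]
print([row[0] for row in bad_rows])
```

Because both masks are boolean Series aligned on the same index, they can be combined with | and used directly to filter the DataFrame.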

Note: depending on your requirements, you can also use match instead of search. With the ^ and $ anchors in the pattern, both give the same result here.
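
To make the difference between the matching methods concrete, the following sketch compares search, match, and fullmatch from the standard re module on a few hand-picked strings. The sample strings and the helper name classify are assumptions chosen for illustration only.

```python
import re

unanchored = re.compile(r"\d{10}")
anchored = re.compile(r"^\d{10}$")


def classify(text):
    """Return (unanchored search, unanchored match, anchored search, fullmatch)."""
    return (
        bool(unanchored.search(text)),     # ten digits anywhere in the string
        bool(unanchored.match(text)),      # ten digits at the start of the string
        bool(anchored.search(text)),       # ten digits, end or trailing newline next
        bool(unanchored.fullmatch(text)),  # the whole string is exactly ten digits
    )


samples = ["2020012514", "x2020012514", "20200125140", "2020012514\n", "nan"]
results = {text: classify(text) for text in samples}

for text, flags in results.items():
    print(repr(text), flags)
```

The unanchored pattern accepts an 11-digit string because ten of its digits match, which is why the anchors matter. Note also that $ still matches just before a trailing newline, so fullmatch is the strictest of the four checks.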
