Match regex to its type in another dataframe

Question

How can I match a data value against the regex for its type when the regex lives in another dataframe? Here are the sample Data df and Regex df. Note that the two dataframes have different shapes, because the Regex df is only a reference table and contains only unique values.

Data df

  Country    Type    Data
  MY         ABC     MY1234567890
  IT         ABC     IT1234567890
  PL         PQR     PL123456
  MY         ABC     456792abc
  IT         ABC     MY45889976
  IT         ABC     IT56788897

Regex df

  Country    Type    Regex
  MY         ABC     ^MY[0-9]{10}
  IT         ABC     ^IT[0-9]{10}
  PL         PQR     ^PL
  MY         DEF     ^\w{6,10}$
  IT         XYZ     ^\w{6,10}$

For data that does not match its own regex, how can I look for a match within the same Country by scanning all the Types that Country has? For example, 'MY45889976' has Country IT and Type ABC but does not follow the regex for that pair. It does match another Type for its Country, namely XYZ. So a new column should be added that gives the Type it matches.

My desired output is something like this:

    Country Type          Data     Data Quality   Suggestion
0      MY    ABC  MY1234567890          1            0
1      IT    ABC  IT1234567890          1            0
2      IT    ABC    MY45889976          0           XYZ
3      IT    ABC   IT567888976          0           XYZ
4      PL    PQR      PL123456          1            0
5      MY    XYZ     456792abc          0           DEF

This is what I have done so far to match the regex and get the Data Quality column (after concatenating the two dataframes):

df['Data Quality'] = df.apply(lambda r: int(bool(re.match(r['Regex'], r['Data']))), axis=1)
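The line above needs a `Regex` column on every data row. A minimal sketch of how that step might look, assuming the "concatenation" is a left join on `Country` and `Type` (the question does not show it). The names `regex_df`, `merged` and `matches` are illustrative, and the frames are rebuilt from the sample tables:

```python
import re
import pandas as pd

# Sample data copied from the tables in the question
df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT"],
    "Type": ["ABC", "ABC", "PQR", "ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT56788897"],
})
regex_df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT"],
    "Type": ["ABC", "ABC", "PQR", "DEF", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL",
              r"^\w{6,10}$", r"^\w{6,10}$"],
})

# Assumed join: a left merge keeps every data row and attaches its own regex
merged = df.merge(regex_df, on=["Country", "Type"], how="left")

def matches(row):
    # A row without a reference regex (NaN after the join) counts as a non-match
    if pd.isna(row["Regex"]):
        return 0
    return int(re.match(row["Regex"], row["Data"]) is not None)

merged["Data Quality"] = merged.apply(matches, axis=1)
```

The `pd.isna` guard matters only when a Country/Type pair is missing from the reference table; without it, `re.match` would raise on the NaN pattern.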

But I'm not sure how to move forward. Is there an easy way to do this without concatenation, and how can I find a matching regex by scanning all the Types while staying within the row's own Country? Thanks.

Recommended answer

See also: matching a column against its own regex held in another column (Python).

Just apply a function that builds a new Suggestion column; its logic follows your description.

import re

def func(dfRow):
    # Look up the regex registered for this row's own Country and Type
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if not sameDF.empty and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        return 0  # the data already matches its own regex
    # Otherwise try every regex registered for the same Country
    sameCountryDF = regexDF.loc[regexDF['Country'] == dfRow['Country']]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]  # the first matching Type becomes the suggestion
    return None  # no Type of this Country matches

df["Suggestion"] = df.apply(func, axis=1)
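The function above filters the reference dataframe once or twice for every data row. As a possible alternative, the regexes could be compiled once and grouped by Country in a plain dictionary. This is only a sketch of the same logic, not the answer's method: `regex_rows`, `patterns` and `suggest` are illustrative names, and the reference rows are copied from the question's Regex df.

```python
import re
from collections import defaultdict

# Reference rows copied from the Regex df in the question: (Country, Type, Regex)
regex_rows = [
    ("MY", "ABC", r"^MY[0-9]{10}"),
    ("IT", "ABC", r"^IT[0-9]{10}"),
    ("PL", "PQR", r"^PL"),
    ("MY", "DEF", r"^\w{6,10}$"),
    ("IT", "XYZ", r"^\w{6,10}$"),
]

# Country -> {Type: compiled pattern}, built once up front
patterns = defaultdict(dict)
for country, type_, pattern in regex_rows:
    patterns[country][type_] = re.compile(pattern)

def suggest(country, type_, data):
    """Return 0 if data matches its own Type, else another matching Type or None."""
    by_type = patterns.get(country, {})
    own = by_type.get(type_)
    if own is not None and own.match(data):
        return 0
    # Scan the remaining Types of the same Country, in reference-table order
    for other_type, compiled in by_type.items():
        if other_type != type_ and compiled.match(data):
            return other_type
    return None
```

With the dataframes from the question, this could be plugged in as `df.apply(lambda r: suggest(r["Country"], r["Type"], r["Data"]), axis=1)`, which needs no join between the two frames.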
