Match regex to its type in another dataframe
Question
How can I match each data value against the regex for its type when the regexes live in another dataframe? Below are the sample Data df and Regex df. Note that the two have different shapes, because the Regex df is only a reference table and contains only unique values.
**Data df**

Country  Type  Data
MY       ABC   MY1234567890
IT       ABC   IT1234567890
PL       PQR   PL123456
MY       ABC   456792abc
IT       ABC   MY45889976
IT       ABC   IT56788897

**Regex df**

Country  Type  Regex
MY       ABC   ^MY[0-9]{10}
IT       ABC   ^IT[0-9]{10}
PL       PQR   ^PL
MY       DEF   ^\w{6,10}$
IT       XYZ   ^\w{6,10}$
For data that does not match its own regex, how can I find a match within the same Country by scanning all the Types that country has? For example, the value 'MY45889976' does not follow the regex for its Country (IT) and Type (ABC), but it matches another Type for that country, XYZ. In that case a new column should be added giving the Type it matches.
My desired output is something like this:
Country Type Data Data Quality Suggestion
0 MY ABC MY1234567890 1 0
1 IT ABC IT1234567890 1 0
2 IT ABC MY45889976 0 XYZ
3 IT ABC IT567888976 0 XYZ
4 PL PQR PL123456 1 0
5 MY XYZ 456792abc 0 DEF
This is what I have done to match the regex and get the Data Quality column (after concatenation):
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
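The question does not show the step that puts the Regex column next to each data row. A minimal sketch of one way to do it, assuming a left merge on Country and Type (the names `regex_df` and `merged` are illustrative, not from the question):

```python
import re
import pandas as pd

# A few rows copied from the question's sample tables.
df = pd.DataFrame({
    "Country": ["MY", "IT", "PL"],
    "Type": ["ABC", "ABC", "PQR"],
    "Data": ["MY1234567890", "MY45889976", "PL123456"],
})
regex_df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT"],
    "Type": ["ABC", "ABC", "PQR", "DEF", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL",
              r"^\w{6,10}$", r"^\w{6,10}$"],
})

# Attach each row's own regex; a left merge keeps every data row,
# leaving Regex as NaN when no (Country, Type) pair exists.
merged = df.merge(regex_df, on=["Country", "Type"], how="left")

# 1 if the value matches its own regex, 0 otherwise (including a missing regex).
merged["Data Quality"] = merged.apply(
    lambda r: int(isinstance(r["Regex"], str)
                  and re.match(r["Regex"], r["Data"]) is not None),
    axis=1,
)
print(merged)
```

The `isinstance` guard is there only so that rows without a reference regex yield 0 instead of raising an error.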
But I'm not sure how to move forward. Is there an easy way to do this without concatenation, and how can I find the matching regex by scanning all the Types while staying tied to the row's Country only? Thanks.
Recommended answer
See: Match column with its own regex in another column Python
Just apply a function to create a new Suggestion column; its logic follows your description.
import re

def func(dfRow):
    # Find the regex registered for the same Country and Type.
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if sameDF.size > 0 and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        return 0
    # Otherwise scan every Type of the same Country and return the first one that matches.
    sameCountryDF = regexDF.loc[regexDF['Country'] == dfRow['Country']]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]
    # No Type for this Country matches the value.
    return None

df["Suggestion"] = df.apply(func, axis=1)
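For readers who want to try the idea end to end, here is a self-contained sketch built on the question's sample tables. It is a variation, not the answer's exact code: it assumes a pre-compiled lookup dictionary (`patterns`, a name introduced here) instead of filtering `regexDF` for every row, and it fills both Data Quality and Suggestion in one pass without merging the frames.

```python
import re
import pandas as pd

# Sample frames copied from the question's tables.
df = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT", "IT"],
    "Type": ["ABC", "ABC", "PQR", "ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "IT1234567890", "PL123456",
             "456792abc", "MY45889976", "IT56788897"],
})
regexDF = pd.DataFrame({
    "Country": ["MY", "IT", "PL", "MY", "IT"],
    "Type": ["ABC", "ABC", "PQR", "DEF", "XYZ"],
    "Regex": [r"^MY[0-9]{10}", r"^IT[0-9]{10}", r"^PL",
              r"^\w{6,10}$", r"^\w{6,10}$"],
})

# Assumed helper structure: {country: {type: compiled pattern}}.
patterns = {}
for ref in regexDF.itertuples(index=False):
    patterns.setdefault(ref.Country, {})[ref.Type] = re.compile(ref.Regex)

def check(row):
    by_type = patterns.get(row["Country"], {})
    own = by_type.get(row["Type"])
    # The value follows the regex of its own Country and Type.
    if own is not None and own.match(row["Data"]):
        return pd.Series({"Data Quality": 1, "Suggestion": 0})
    # Otherwise try the other Types registered for the same Country.
    for type_name, pattern in by_type.items():
        if type_name != row["Type"] and pattern.match(row["Data"]):
            return pd.Series({"Data Quality": 0, "Suggestion": type_name})
    # Nothing matched; 0 here is this sketch's choice of placeholder.
    return pd.Series({"Data Quality": 0, "Suggestion": 0})

result = df.join(df.apply(check, axis=1))
print(result)
```

Compiling each pattern once avoids re-filtering the reference frame on every row, which may matter on larger data; for a handful of rows either approach should behave the same.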