过滤字符串是否包含子字符串pyspark [英] Filter if String contain sub-string pyspark
问题描述
我有2个数据集.在每一行中,我都有几列.但是我只想在每个数据集中只使用 2列,而不必在两个数据集之间进行任何联接,合并或组合.
I have 2 datasets. In each one I have several columns. But I want to use only 2 columns from each dataset, without doing any join, merge or combination between the both of the datasets.
示例数据集1:
column_dataset_1 <String> | column_dataset_1_normalized <String>
-----------------------------------------------------------------------
11882621-V021BRP161305-1 | 11882621V021BRP1613051
-----------------------------------------------------------------------
W-B.7120RP1605794 | WB7120RP1605794
-----------------------------------------------------------------------
D/57RP.1534421 | D57RP1534421
-----------------------------------------------------------------------
125858G_022BR/P070751 | 125858G022BRP070751
-----------------------------------------------------------------------
300B.5190C57/51507 | 300B5190C5751507
-----------------------------------------------------------------------
示例数据集2
column_dataset_2 <String> | column_dataset_2_normalized <String>
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto, | PorejemplosiWB7120RP1605794setratadeunarchivodetexto
-------------------------------------------------------------------------------------------------------------------------------------------------------------
se abrirá en un programa de procesamiento de texto. | seabrirenunprogramadeprocesamientodetexto
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------------------------------------------------------------------------------------
utilizados 125858G_022BR/P070751 frecuentemente (por ejemplo, un texto que describe | utilizados125858G022BRP070751frecuentementeporejemplountextoquedescribe
--------------------------------------------------------------------------------------------------------------------------------------------------------------
column_dataset_1_normalized是column_dataset_1的结果被规范化column_dataset_2_normalized是column_dataset_2的结果已规范化
column_dataset_1_normalized is the result of column_dataset_1 is normalized column_dataset_2_normalized is the resut of column_dataset_2 is normalized
我想比较 column_dataset_1_normalized
中是否存在 column_dataset_2_normalized
.如果是,我应该从column_dataset_2
I want to compare column_dataset_1_normalized
if is exist in column_dataset_2_normalized
.
If yes I should extract it from column_dataset_2
示例:
WB7120RP1605794
位于 column_dataset_1_normalized
的第二行
中,存在于column_dataset_2_normalized 的第一行中,因此我应该从
column_dataset_2
中将其实际值[WB.7120RP1605794]
提取出来,并将其存储在数据集2的新列中.
WB7120RP1605794
is in the second line
of column_dataset_1_normalized
, is exist in the first line of column_dataset_2_normalized
, so I should extract
it's real value [W-B.7120RP1605794]
, from column_dataset_2
and store it in a new column in dataset 2.
与 125858G022BRP070751
相同的是 column_dataset_2_normalized
中的第四行,我应该从column_dataset_2 [125858G_022BR/P070751]
中提取它.比较应该以 column_dataset_1_normalized
的值一个一的值,然后在 column_dataset_2_normalized
的所有单元格中进行搜索.
And the same for 125858G022BRP070751
is in forth line in column_dataset_2_normalized
, I should extract it from column_dataset_2 [125858G_022BR/P070751]
.
The comparaison should, take one by one value of column_dataset_1_normalized
and search it in all the cell of column_dataset_2_normalized
.
为了进行标准化,我使用以下代码仅保留数字和字母:
For normalization I used this code to kepp only number and letter:
df = df.withColumn(
"column_normalized",
F.regexp_replace(F.col("column_to_normalize"), "[^a-zA-Z0-9]+", ""))
有人可以向我提出建议,我该怎么做?谢谢
Someone can propose me a suggestion how can I do it ? Thank you
推荐答案
有两种连接两个数据框的方法:
There are various way to join two dataframes:
(1)使用SQL函数 instr ,
(1) find the location/position of string column_dataset_1_normalized in column_dataset_2_normalized by using SQL function locate, instr, position etc, return a position (1-based) if exists
from pyspark.sql.functions import expr
cond1 = expr('locate(column_dataset_1_normalized,column_dataset_2_normalized)>0')
cond2 = expr('instr(column_dataset_2_normalized,column_dataset_1_normalized)>0')
cond3 = expr('position(column_dataset_1_normalized IN column_dataset_2_normalized)>0')
(2)使用正则表达式 rlike 来从 column_dataset_2_normalized 中找到 column_dataset_1_normalized ,这仅在 column_dataset_1_normalized
(2) use regex rlike to find column_dataset_1_normalized from column_dataset_2_normalized, this is only valid when no regex meta-characters is shown in column_dataset_1_normalized
cond4 = expr('column_dataset_2_normalized rlike column_dataset_1_normalized')
运行以下代码,并使用上述条件之一,例如:
Run the following code and use one of the above conditions, for example:
df1.join(df2, cond1).select('column_dataset_1').show(truncate=False)
+---------------------+
|column_dataset_1 |
+---------------------+
|W-B.7120RP1605794 |
|125858G_022BR/P070751|
+---------------------+
根据注释,匹配的子字符串可能与 df1.column_dataset_1 不同,因此我们需要对子字符串进行反向工程规范化的字符串.根据进行标准化的方式,以下udf可能会有所帮助(请注意,这将不包括匹配项中可能存在的任何前导/后跟非数字).基本上,我们将通过char遍历字符串,并在原始字符串中找到规范化字符串的开始/结束索引,然后采用子字符串:
Per comments, the matched sub-string might not be the same as df1.column_dataset_1, so we will need to reverse-engineer the sub-string from the normalized string. Based on how the normalization is conducted, the following udf might help (notice this will not cover any leading/trailing non-alnum that might be in the matched). Basically, we will iterate through the string by chars and find the start/end index of the normalized string in the original string, then take the sub-string:
from pyspark.sql.functions import udf
@udf('string')
def find_matched(orig, normalized):
n, d = ([], [])
for i in range(len(orig)):
if orig[i].isalnum():
n.append(orig[i])
d.append(i)
idx = ''.join(n).find(normalized)
return orig[d[idx]:d[idx+len(normalized)]] if idx >= 0 else None
df1.join(df2, cond3) \
.withColumn('matched', find_matched('column_dataset_2', 'column_dataset_1_normalized')) \
.select('column_dataset_2', 'matched', 'column_dataset_1_normalized') \
.show(truncate=False)
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
|column_dataset_2 |matched |column_dataset_1_normalized|
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
|Por ejemplo, si W-B.7120RP-1605794se trata de un archivo de texto, |W-B.7120RP-1605794 |WB7120RP1605794 |
|utilizados 125858G_022BR/P-070751 frecuentemente (por ejemplo, un texto que describe|125858G_022BR/P-070751 |125858G022BRP070751 |
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
这篇关于过滤字符串是否包含子字符串pyspark的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!