Filter if String contains sub-string pyspark


Problem Description

I have 2 datasets. In each one I have several columns, but I want to use only 2 columns from each dataset, without doing any join, merge, or combination between the two datasets.

Example dataset 1:

column_dataset_1 <String>    |      column_dataset_1_normalized <String>
-----------------------------------------------------------------------
11882621-V021BRP161305-1     |      11882621V021BRP1613051
-----------------------------------------------------------------------
W-B.7120RP1605794            |      WB7120RP1605794
-----------------------------------------------------------------------
D/57RP.1534421               |      D57RP1534421
-----------------------------------------------------------------------
125858G_022BR/P070751        |      125858G022BRP070751
-----------------------------------------------------------------------
300B.5190C57/51507           |      300B5190C5751507
-----------------------------------------------------------------------

Example dataset 2:

column_dataset_2 <String>                                                            |  column_dataset_2_normalized <String>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto,                    |  PorejemplosiWB7120RP1605794setratadeunarchivodetexto
--------------------------------------------------------------------------------------------------------------------------------------------------------------
se abrirá en un programa de procesamiento de texto.                                  |  seabrirenunprogramadeprocesamientodetexto
--------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                     |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
utilizados 125858G_022BR/P070751 frecuentemente (por ejemplo, un texto que describe  |  utilizados125858G022BRP070751frecuentementeporejemplountextoquedescribe
--------------------------------------------------------------------------------------------------------------------------------------------------------------

column_dataset_1_normalized is the result of normalizing column_dataset_1, and column_dataset_2_normalized is the result of normalizing column_dataset_2.

I want to check whether each value of column_dataset_1_normalized exists in column_dataset_2_normalized. If it does, I should extract it from column_dataset_2.

Example:

WB7120RP1605794, in the second line of column_dataset_1_normalized, exists in the first line of column_dataset_2_normalized, so I should extract its real value [W-B.7120RP1605794] from column_dataset_2 and store it in a new column in dataset 2.

The same goes for 125858G022BRP070751, which is in the fourth line of column_dataset_2_normalized; I should extract it from column_dataset_2 as [125858G_022BR/P070751]. The comparison should take the values of column_dataset_1_normalized one by one and search for each of them in every cell of column_dataset_2_normalized.

For normalization I used this code to keep only numbers and letters:

from pyspark.sql import functions as F

df = df.withColumn(
    "column_normalized",
    F.regexp_replace(F.col("column_to_normalize"), "[^a-zA-Z0-9]+", ""))
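For reference, a minimal sketch of how the two sample dataframes above could be built and normalized end to end; the SparkSession handle spark and the normalize helper are assumptions for illustration, not part of the original question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the tables above.
df1 = spark.createDataFrame(
    [("11882621-V021BRP161305-1",), ("W-B.7120RP1605794",),
     ("D/57RP.1534421",), ("125858G_022BR/P070751",), ("300B.5190C57/51507",)],
    ["column_dataset_1"])
df2 = spark.createDataFrame(
    [("Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto,",),
     ("se abrirá en un programa de procesamiento de texto.",),
     ("",),
     ("utilizados 125858G_022BR/P070751 frecuentemente (por ejemplo, un texto que describe",)],
    ["column_dataset_2"])

# Keep only letters and digits, as in the normalization step above.
def normalize(df, src, dst):
    return df.withColumn(dst, F.regexp_replace(F.col(src), "[^a-zA-Z0-9]+", ""))

df1 = normalize(df1, "column_dataset_1", "column_dataset_1_normalized")
df2 = normalize(df2, "column_dataset_2", "column_dataset_2_normalized")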

Can someone suggest how I can do this? Thank you.

Answer

There are various ways to join the two dataframes:

(1) Find the location/position of the string column_dataset_1_normalized in column_dataset_2_normalized by using the SQL functions locate, instr, position etc., which return a position (1-based) if it exists:

    from pyspark.sql.functions import expr

    cond1 = expr('locate(column_dataset_1_normalized,column_dataset_2_normalized)>0')
    cond2 = expr('instr(column_dataset_2_normalized,column_dataset_1_normalized)>0')
    cond3 = expr('position(column_dataset_1_normalized IN column_dataset_2_normalized)>0')
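A side note (my own, not from the original answer): the SQL-expression form is used because the Python helpers F.locate and F.instr take the substring as a plain string literal rather than a column, so a column-against-column test has to be written in SQL. A non-equi join on such a condition is effectively a filtered cross join; a minimal sketch of the equivalent spelling, assuming df1 and df2 as above:

    # Pair every row of df1 with every row of df2, then keep the pairs where
    # the substring test passes; semantically the same as df1.join(df2, cond1).
    matches = df1.crossJoin(df2).where(cond1)

    # On some Spark 2.x versions the implicit cartesian product may need:
    # spark.conf.set("spark.sql.crossJoin.enabled", "true")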

(2) Use the regex operator rlike to find column_dataset_1_normalized in column_dataset_2_normalized. This is only valid when no regex meta-characters appear in column_dataset_1_normalized:

    cond4 = expr('column_dataset_2_normalized rlike column_dataset_1_normalized')
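In this particular question the meta-character caveat is moot, since the normalization keeps only [a-zA-Z0-9] and nothing else can survive into column_dataset_1_normalized. For data that is not pre-sanitized, one possible hardening (my own sketch, not part of the original answer) is to quote the pattern with Java-regex literal markers:

    # \Q ... \E make the enclosed text literal in the Java regex engine that
    # Spark's rlike uses; concat builds the quoted pattern per row.
    cond4_quoted = expr(r"column_dataset_2_normalized rlike concat('\\Q', column_dataset_1_normalized, '\\E')")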

Run the following code using one of the above conditions, for example:

df1.join(df2, cond1).select('column_dataset_1').show(truncate=False)
+---------------------+
|column_dataset_1     |
+---------------------+
|W-B.7120RP1605794    |
|125858G_022BR/P070751|
+---------------------+

EDIT: Per the comments, the matched sub-string might not be identical to df1.column_dataset_1, so we need to reverse-engineer the sub-string from the normalized string. Based on how the normalization is conducted, the following udf might help (notice it will not cover any leading/trailing non-alphanumeric characters that might belong to the matched value). Basically, we iterate through the string character by character, record the start/end indices of the normalized string within the original string, and then take the sub-string:

from pyspark.sql.functions import udf

@udf('string')
def find_matched(orig, normalized):
  # n collects the alphanumeric characters of `orig`; d maps each one
  # back to its index in the original string.
  n, d = ([], [])
  for i in range(len(orig)):
    if orig[i].isalnum():
      n.append(orig[i])
      d.append(i)
  # Locate the normalized value inside the alnum-only version of `orig`.
  idx = ''.join(n).find(normalized)
  # Map back to original indices: slice from the first matched character
  # through the last one (avoids an IndexError when the match ends the string).
  return orig[d[idx]:d[idx + len(normalized) - 1] + 1] if idx >= 0 else None

df1.join(df2, cond3) \
   .withColumn('matched', find_matched('column_dataset_2', 'column_dataset_1_normalized')) \
   .select('column_dataset_2', 'matched', 'column_dataset_1_normalized') \
   .show(truncate=False)

+------------------------------------------------------------------------------------+---------------------+---------------------------+
|column_dataset_2                                                                    |matched              |column_dataset_1_normalized|
+------------------------------------------------------------------------------------+---------------------+---------------------------+
|Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto,                   |W-B.7120RP1605794    |WB7120RP1605794            |
|utilizados 125858G_022BR/P070751 frecuentemente (por ejemplo, un texto que describe |125858G_022BR/P070751|125858G022BRP070751        |
+------------------------------------------------------------------------------------+---------------------+---------------------------+
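As a quick sanity check of the index-mapping logic outside Spark, the udf body can be run as a plain Python function (duplicated here purely for illustration):

def find_matched_py(orig, normalized):
    # Same logic as the udf above, as a plain function for local testing.
    n, d = [], []
    for i, ch in enumerate(orig):
        if ch.isalnum():
            n.append(ch)
            d.append(i)
    idx = ''.join(n).find(normalized)
    return orig[d[idx]:d[idx + len(normalized) - 1] + 1] if idx >= 0 else None

assert find_matched_py(
    'Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto,',
    'WB7120RP1605794') == 'W-B.7120RP1605794'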
