Spark: extract columns from a string


Problem description

I need help parsing a string that contains a value for each attribute. Below is my sample string:

otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>

Sometimes the string includes additional values that use a different delimiter, for example:

otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString.. 

The output columns should be:

Name      | Type | SqVal | ID | Loc  | dest 
-------------------------------------------
Series VR | 1Ac4 | 34    | 2  | null | null
Series X  | 1B3  | 34    | 2  | sfo  | chc 

Recommended answer

As we discussed, to use the str_to_map function on your sample data, we can set pairDelim and keyValueDelim to the following:

pairDelim: '(?i)>? *(?=Name|Type|SqVal|conn ID|conn Loc|dest|$)'
keyValueDelim: '=<?'

Here pairDelim is case-insensitive (?i): an optional > followed by zero or more spaces, then a lookahead for one of the predefined keys (generated dynamically with '|'.join(keys)) or the end-of-string anchor $. keyValueDelim is an '=' followed by an optional <.

from pyspark.sql import functions as F

df = spark.createDataFrame([                                               
   ("otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>",),   
   ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..",)
],["value"])

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]

# on Spark 3.0+, uncomment the following conf to avoid the duplicate map key error
#spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("m", F.expr("str_to_map(value, '(?i)>? *(?={}|$)', '=<?')".format('|'.join(keys)))) \
    .select([F.col('m')[k].alias(k) for k in keys]) \
    .show()
+---------+----+-----+-------+--------+--------------------+
|     Name|Type|SqVal|conn ID|conn Loc|                dest|
+---------+----+-----+-------+--------+--------------------+
|Series VR|1Ac4|   34|      2|    null|                null|
| Series X| 1B3|   34|      2|     sfo|chc bridge otherp...|
+---------+----+-----+-------+--------+--------------------+
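To see how the two delimiters carve up a row, here is a rough pure-Python emulation. It is only a sketch: it uses Python's re module as a stand-in for the Java regex engine behind Spark, and str_to_map_sketch is a hypothetical helper, not a Spark API. It assumes that str_to_map splits the text on pairDelim and then splits each piece once on keyValueDelim.

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
pair_delim = r"(?i)>? *(?={}|$)".format("|".join(keys))
kv_delim = r"=<?"

def str_to_map_sketch(text):
    """Hypothetical stand-in for str_to_map: split into pairs, then key/value."""
    result = {}
    for piece in re.split(pair_delim, text):
        # Split each piece once; a piece without '=' gets a None value
        parts = re.split(kv_delim, piece, maxsplit=1)
        # A plain dict keeps the last value for a repeated key
        result[parts[0]] = parts[1] if len(parts) == 2 else None
    return result

row = ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> "
       "conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..")
parsed = str_to_map_sketch(row)
print({k: parsed.get(k) for k in keys})
```

In this emulation the leading text otherPartofString becomes a key with no value, and dest keeps everything up to the end of the string, which matches the truncated dest column in the output above.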

The value of the last mapped key needs some post-processing, since there is no anchor or pattern to separate it from the unrelated text that follows (this can affect any key, depending on which one comes last in a row). Please let me know if you can specify such a pattern.

Edit: if the map approach turns out to be less efficient for case-insensitive lookups, since it requires some expensive pre-processing, try the following:

ptn = '|'.join(keys)
df.select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)'.format(k,ptn), 1).alias(k) for k in keys]).show()
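The per-key pattern can be checked outside Spark with a small sketch. It again uses Python's re module in place of Spark's Java regex engine, and regexp_extract_sketch is a hypothetical helper that mimics regexp_extract returning an empty string when nothing matches.

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
ptn = "|".join(keys)

def regexp_extract_sketch(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract: '' when there is no match."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""

rows = [
    "otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>",
    "otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> "
    "conn Loc=sfo dest=chc bridge otherpartofString..",
]

extracted = [
    {
        # Same pattern as above: key, '=', optional '<', lazy value, then a
        # lookahead for the next key or the end of the string
        k: regexp_extract_sketch(
            r, r"(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)".format(k, ptn), 1)
        for k in keys
    }
    for r in rows
]
for rec in extracted:
    print(rec)
```

Missing keys come back as empty strings rather than null, and the last key in a row still absorbs the trailing text, as with the map approach.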

If the angle brackets < and > are used only when a value or its next adjacent key contains non-word characters, the extraction can be simplified with some pre-processing that first wraps the bare values in angle brackets:

df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<([^>]+)>'.format(k), 1).alias(k) for k in keys]) \
    .show()
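The pre-processing step can be sketched in plain Python as well. This assumes Python's re behaves like Spark's Java regex for these simple patterns; note that the replacement is written \1 here, whereas Spark uses $1.

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
row = ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> "
       "conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..")

# Wrap bare word values in angle brackets: 'Loc=sfo' becomes 'Loc=<sfo>'.
# Values already written as '=<...>' are untouched because '<' is not a word char.
wrapped = re.sub(r"=(\w+)", r"=<\1>", row)

def extract_bracketed(text, key):
    """Return the text between '<' and '>' after key, or '' if absent."""
    m = re.search(r"(?i)\b{0}=<([^>]+)>".format(key), text)
    return m.group(1) if m else ""

simplified = {k: extract_bracketed(wrapped, k) for k in keys}
print(wrapped)
print(simplified)
```

Because every value now ends at a closing >, dest no longer picks up the trailing text. This only holds under the stated assumption that unbracketed values are single words.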

Edit-2: added a dictionary to handle key aliases:

keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]

# aliases are matched case-insensitively; list only the keys that have aliases
key_aliases = {
    'Type': [ 'ThisType', 'AnyName' ],
    'ID': ['conn ID'],
    'Loc': ['conn Loc']
}

# set up regex pattern for each key differently
key_ptns = [ (k, '|'.join([k, *key_aliases[k]]) if k in key_aliases else k) for k in keys ]  
#[('Name', 'Name'),
# ('Type', 'Type|ThisType|AnyName'),
# ('SqVal', 'SqVal'),
# ('ID', 'ID|conn ID'),
# ('Loc', 'Loc|conn Loc'),
# ('dest', 'dest')]  

df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b(?:{0})=<([^>]+)>'.format(p), 1).alias(k) for k,p in key_ptns]) \
    .show()
+--------------------+---------+----+-----+---+---+----+
|               value|     Name|Type|SqVal| ID|Loc|dest|
+--------------------+---------+----+-----+---+---+----+
|otherPartofString...|Series VR|1Ac4|   34|  2|   |    |
|otherPartofString...| Series X| 1B3|   34|  2|sfo| chc|
+--------------------+---------+----+-----+---+---+----+
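To illustrate how an alias is resolved, here is a small pure-Python sketch. The input row is hypothetical (the sample data does not use ThisType), and Python's re again stands in for Spark's regex engine.

```python
import re

keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]
key_aliases = {
    "Type": ["ThisType", "AnyName"],
    "ID": ["conn ID"],
    "Loc": ["conn Loc"],
}
# One alternation per output column: the key itself plus any aliases
key_ptns = [(k, "|".join([k, *key_aliases.get(k, [])])) for k in keys]

# Hypothetical row (not from the question) that uses the alias ThisType
alias_row = ("otherPartofString Name=<Series X> ThisType=<1B3> "
             "conn ID=<2> conn Loc=sfo dest=chc bridge")

# Same pre-processing as above: wrap bare word values in angle brackets
wrapped_alias_row = re.sub(r"=(\w+)", r"=<\1>", alias_row)

def extract_first(text, pattern):
    """Return capture group 1 of the first match, or '' if there is none."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

alias_result = {
    k: extract_first(wrapped_alias_row, r"(?i)\b(?:{0})=<([^>]+)>".format(p))
    for k, p in key_ptns
}
print(alias_result)
```

The non-capturing group (?:...) keeps the value in capture group 1 regardless of which alias matched, so the output column is always named after the canonical key.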
