Pyspark find entire array in master array and replace using another array

Problem description

I have a pandas implementation of this question here, and I want to implement the same thing in PySpark for a Spark environment.

I have 2 CSV files. The first CSV has a keyword column and a corresponding lookupid column. I converted these into 2 lists in pure Python.

keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##','##13##','##12##','##13##']
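One way the two lists could be built from the first CSV is sketched below with the standard csv module. The in-memory `keywords_csv` text is a hypothetical stand-in for the real file, and the header names `keyword` and `lookupid` are assumed from the description above.

```python
import csv
import io

# Hypothetical stand-in for the first CSV file; header names are assumed.
keywords_csv = io.StringIO(
    "keyword,lookupid\n"
    "IT Manager,##10##\n"
    "Sales Manager,##13##\n"
    "IT Analyst,##12##\n"
    "Store Manager,##13##\n"
)

# Read every row once, then split the two columns into parallel lists.
rows = list(csv.DictReader(keywords_csv))
keyword = [row["keyword"] for row in rows]
lookupid = [row["lookupid"] for row in rows]
```

With a real file, `open(path, newline="")` would take the place of the `io.StringIO` object.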

The second CSV file has a title column, with sample data below:

current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years

I want to find and replace the keywords using regular expressions and return the output below:

current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years

How can I do this using PySpark? Please suggest.

Solution

Here's a way to do this using pyspark.sql.functions.regexp_replace() and a simple loop:

First, create a sample dataset:

data = [
    ("I have been working here as a store manager.",),
    ("I am sales manager.",),
    ("I joined as an IT analyst and was promoted to IT manager.",)
]

# sqlCtx is assumed to be an existing SQLContext; a SparkSession works the same way here
df = sqlCtx.createDataFrame(data, ["current_title"])
df.show(truncate=False)
#+---------------------------------------------------------+
#|current_title                                            |
#+---------------------------------------------------------+
#|I have been working here as a store manager.             |
#|I am sales manager.                                      |
#|I joined as an IT analyst and was promoted to IT manager.|
#+---------------------------------------------------------+

Now apply each replacement:

import pyspark.sql.functions as f

keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##','##13##','##12##','##13##']

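for k, replacement in zip(keyword, lookupid):
    # \b anchors the keyword at word boundaries; (?i) makes the match case-insensitive
    pattern = r'\b(?i)' + k + r'\b'
    # Overwrite the column with the result of this replacement
    df = df.withColumn(
        'current_title',
        f.regexp_replace(f.col('current_title'), pattern, replacement)
    )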

Don't worry about the loop here: Spark evaluates lazily, so each iteration only adds a step to the query plan and nothing runs yet. If you look at the execution plan, you will see that Spark chains these operations into a single projection, so they all happen in one pass through the data:

df.explain()  

== Physical Plan ==
*Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(current_title#737, \b(?i)IT Manager\b, ##10##), \b(?i)Sales Manager\b, ##13##), \b(?i)IT Analyst\b, ##12##), \b(?i)Store Manager\b, ##13##) AS current_title#752]
+- Scan ExistingRDD[current_title#737]
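The plan shows one nested regexp_replace expression. As a sketch, the same nesting can be built explicitly with functools.reduce and applied in a single withColumn call, which should yield an equivalent expression. The helper names `build_patterns` and `replace_all` are hypothetical, and the keywords are assumed to contain no regex metacharacters.

```python
from functools import reduce

keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##', '##13##', '##12##', '##13##']


def build_patterns(keywords, lookup_ids):
    """Pair each keyword's regex with its replacement (hypothetical helper)."""
    # Assumes the keywords contain no regex metacharacters.
    return [(r'\b(?i)' + k + r'\b', r) for k, r in zip(keywords, lookup_ids)]


def replace_all(df, column, pairs):
    """Nest one regexp_replace per pair and apply them in a single withColumn."""
    # Imported here so build_patterns can be used without Spark installed.
    import pyspark.sql.functions as f

    expr = reduce(
        lambda col, pair: f.regexp_replace(col, pair[0], pair[1]),
        pairs,
        f.col(column),
    )
    return df.withColumn(column, expr)


pairs = build_patterns(keyword, lookupid)
# df = replace_all(df, 'current_title', pairs)  # df is the DataFrame created earlier
```

The replacements are nested in list order, so `'IT Manager'` is applied first, as in the loop version.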

Finally, the output:

df.show(truncate=False)
#+-------------------------------------------------+
#|current_title                                    |
#+-------------------------------------------------+
#|I have been working here as a ##13##.            |
#|I am ##13##.                                     |
#|I joined as an ##12## and was promoted to ##10##.|
#+-------------------------------------------------+
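The expected output can also be sanity-checked locally, without a cluster, by mirroring the replacements in plain Python. This is only a sketch of the same logic, not the Spark code path: Spark's regexp_replace uses Java regular expressions, while Python's `re` module in recent versions rejects an inline `(?i)` that is not at the start of the pattern, so the flag is passed separately here. The function name `replace_titles` is hypothetical.

```python
import re

keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##', '##13##', '##12##', '##13##']


def replace_titles(text, keywords=keyword, lookup_ids=lookupid):
    """Apply each keyword replacement in order, ignoring case."""
    for k, replacement in zip(keywords, lookup_ids):
        # re.escape guards against keywords that contain regex metacharacters.
        pattern = r'\b' + re.escape(k) + r'\b'
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


rows = [
    "I have been working here as a store manager.",
    "I am sales manager.",
    "I joined as an IT analyst and was promoted to IT manager.",
]
results = [replace_titles(row) for row in rows]
for line in results:
    print(line)
```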
