Pyspark find entire array in master array and replace using another array
I have a pandas implementation of this question here. I want to implement this using pyspark for a Spark environment.
I have 2 csv files. The first csv has a keyword column and a corresponding lookupid column. I converted these into 2 lists in pure python:
keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##','##13##','##12##','##13##']
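For completeness, the two lists could be read from the first csv with the standard library. This is a sketch only: the file name lookup.csv and the header names keyword and lookupid are assumptions, since the question doesn't show the file itself.

```python
import csv

# Sketch: assumes the first csv is "lookup.csv" with header columns
# "keyword" and "lookupid" (both names are assumptions).
def load_lookup(path):
    keyword, lookupid = [], []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            keyword.append(row["keyword"])
            lookupid.append(row["lookupid"])
    return keyword, lookupid
```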
The second csv file has a title column, with sample data below:
current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary role is to bring new customers.
I initially joined as an IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years
I want to do a find-and-replace using regular expressions and return the output below:
current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary role is to bring new customers.
I initially joined as an ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years
How can I do this using pyspark?
Here's a way to do this using pyspark.sql.functions.regexp_replace()
and a simple loop:
First, create a sample dataset:
data = [
("I have been working here as a store manager.",),
("I am sales manager.",),
("I joined as an IT analyst and was promoted to IT manager.",)
]
df = sqlCtx.createDataFrame(data, ["current_title"])
df.show(truncate=False)
#+---------------------------------------------------------+
#|current_title |
#+---------------------------------------------------------+
#|I have been working here as a store manager. |
#|I am sales manager. |
#|I joined as an IT analyst and was promoted to IT manager.|
#+---------------------------------------------------------+
Now apply each replacement:
import pyspark.sql.functions as f
keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##','##13##','##12##','##13##']
for k, replacement in zip(keyword, lookupid):
pattern = r'\b(?i)' + k + r'\b'
df = df.withColumn(
'current_title',
f.regexp_replace(f.col('current_title'), pattern, replacement)
)
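To sanity-check the replacement logic without a Spark session, the same chaining can be mimicked with Python's re module. This is a local sketch only (Spark's regexp_replace actually uses Java's regex engine); note the scoped (?i:...) flag, since recent Python versions reject a bare (?i) in mid-pattern.

```python
import re
from functools import reduce

keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
lookupid = ['##10##', '##13##', '##12##', '##13##']

def replace_all(text):
    # Apply each pattern in order, exactly as the withColumn loop does:
    # each step feeds the previous step's result into the next re.sub.
    return reduce(
        lambda t, kr: re.sub(r'\b(?i:' + re.escape(kr[0]) + r')\b', kr[1], t),
        zip(keyword, lookupid),
        text,
    )
```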
Don't worry about the loop here, as Spark is lazy. If you look at the execution plan, you will see that it's smart enough to chain these operations so they all happen in one pass over the data:
df.explain()
== Physical Plan ==
*Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(current_title#737, \b(?i)IT Manager\b, ##10##), \b(?i)Sales Manager\b, ##13##), \b(?i)IT Analyst\b, ##12##), \b(?i)Store Manager\b, ##13##) AS current_title#752]
+- Scan ExistingRDD[current_title#737]
Finally, the output:
df.show(truncate=False)
#+-------------------------------------------------+
#|current_title |
#+-------------------------------------------------+
#|I have been working here as a ##13##. |
#|I am ##13##. |
#|I joined as an ##12## and was promoted to ##10##.|
#+-------------------------------------------------+
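One caveat: regexp_replace treats its pattern argument as a regular expression (Java's flavor, in Spark's case), so a keyword containing regex metacharacters would misfire. A hedged sketch of building a safer pattern, using the hypothetical keyword "C++ Developer" purely for illustration; the (?i) flag is also moved to the front, where recent Python versions require inline flags to appear:

```python
import re

def build_pattern(k):
    # Escape regex metacharacters in the keyword and anchor it with
    # word boundaries; (?i) up front makes the match case-insensitive.
    return r'(?i)\b' + re.escape(k) + r'\b'
```

Word boundaries next to escaped punctuation (e.g. the trailing "+" in "C++") behave differently than next to letters, so patterns built this way are still worth spot-checking against real data.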