Count occurrences of a list of substrings in a pyspark df column
Question
I want to count the occurrences of a list of substrings, and create a column holding that count, based on a column in a pyspark df which contains a long string.
Input:
ID History
1 USA|UK|IND|DEN|MAL|SWE|AUS
2 USA|UK|PAK|NOR
3 NOR|NZE
4 IND|PAK|NOR
lst=['USA','IND','DEN']
Output :
ID History Count
1 USA|UK|IND|DEN|MAL|SWE|AUS 3
2 USA|UK|PAK|NOR 1
3 NOR|NZE 0
4 IND|PAK|NOR 1
Answer
# Importing requisite packages and creating a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, size, regexp_replace

spark = SparkSession.builder.getOrCreate()
values = [(1,'USA|UK|IND|DEN|MAL|SWE|AUS'),(2,'USA|UK|PAK|NOR'),(3,'NOR|NZE'),(4,'IND|PAK|NOR')]
df = spark.createDataFrame(values, ['ID','History'])
df.show(truncate=False)
+---+--------------------------+
|ID |History |
+---+--------------------------+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|
|2 |USA|UK|PAK|NOR |
|3 |NOR|NZE |
|4 |IND|PAK|NOR |
+---+--------------------------+
The idea is to split the string on these three delimiters, lst=['USA','IND','DEN'], and then count the number of substrings produced.
For example, the string USA|UK|IND|DEN|MAL|SWE|AUS gets split into '', '|UK|', '|' and '|MAL|SWE|AUS'. Since 4 substrings were created from 3 delimiter matches, 4 - 1 = 3 gives the count of these strings appearing in the column string.
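The same arithmetic can be checked in plain Python with re.split, which accepts an alternation pattern over the three substrings (a quick sketch, not part of the original answer):

```python
import re

s = 'USA|UK|IND|DEN|MAL|SWE|AUS'
# Splitting on any of the three substrings yields 4 pieces,
# one more than the number of matches (the first piece is empty
# because the string starts with a match).
pieces = re.split('USA|IND|DEN', s)
print(pieces)           # ['', '|UK|', '|', '|MAL|SWE|AUS']
print(len(pieces) - 1)  # 3
```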
I am not sure if multi-character delimiters are supported in Spark, so as a first step we replace any of these 3 substrings in the list ['USA','IND','DEN'] with a flag/dummy value % (you could use something else as well). The following code does this replacement -
df = df.withColumn('History_X', col('History'))
lst = ['USA','IND','DEN']
for i in lst:
    df = df.withColumn('History_X', regexp_replace(col('History_X'), i, '%'))
df.show(truncate=False)
+---+--------------------------+--------------------+
|ID |History |History_X |
+---+--------------------------+--------------------+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|%|UK|%|%|MAL|SWE|AUS|
|2 |USA|UK|PAK|NOR |%|UK|PAK|NOR |
|3 |NOR|NZE |NOR|NZE |
|4 |IND|PAK|NOR |%|PAK|NOR |
+---+--------------------------+--------------------+
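Since regexp_replace takes a regular expression, the replacement loop could presumably be collapsed into a single call by joining the list into an alternation pattern, e.g. regexp_replace(col('History'), '|'.join(lst), '%') - safe here because the list entries contain no regex metacharacters. The effect of the alternation pattern, sketched in plain Python:

```python
import re

lst = ['USA', 'IND', 'DEN']
# '|'.join produces 'USA|IND|DEN' - here '|' is regex OR, not a literal pipe
pattern = '|'.join(lst)
s = 'USA|UK|IND|DEN|MAL|SWE|AUS'
# All three substrings are replaced in one pass
print(re.sub(pattern, '%', s))  # %|UK|%|%|MAL|SWE|AUS
```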
Finally, we count the number of substrings by splitting the column with % as the delimiter, taking the size of the resulting array with the size function, and subtracting 1 from it.
df = df.withColumn('Count', size(split(col('History_X'), "%")) - 1).drop('History_X')
df.show(truncate=False)
+---+--------------------------+-----+
|ID |History |Count|
+---+--------------------------+-----+
|1 |USA|UK|IND|DEN|MAL|SWE|AUS|3 |
|2 |USA|UK|PAK|NOR |1 |
|3 |NOR|NZE |0 |
|4 |IND|PAK|NOR |1 |
+---+--------------------------+-----+
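On Spark 2.4+, the dummy-marker step can likely be skipped by tokenizing History on the literal pipe and intersecting with the list, e.g. size(array_intersect(split(col('History'), '\\|'), array(*[lit(x) for x in lst]))) - note that array_intersect deduplicates, so it counts distinct matches, which coincides with the occurrence count here because each country appears at most once per row. The token-membership logic, sketched in plain Python:

```python
lst = ['USA', 'IND', 'DEN']
rows = {1: 'USA|UK|IND|DEN|MAL|SWE|AUS',
        2: 'USA|UK|PAK|NOR',
        3: 'NOR|NZE',
        4: 'IND|PAK|NOR'}
# Split each History on the literal pipe and count tokens found in lst
counts = {k: sum(tok in lst for tok in v.split('|')) for k, v in rows.items()}
print(counts)  # {1: 3, 2: 1, 3: 0, 4: 1}
```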