Spark Dataframe - Python - count substring in string
Question
I have a Spark dataframe with a column ("assigned_products") of type string that contains values such as the following:
"POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"
I would like to count the occurrences of "+" in the string and return that value in a new column.
I tried the following, but I keep getting errors.
from pyspark.sql.functions import col
DF.withColumn('Number_Products_Assigned', col("assigned_products").count("+"))
I'm running my code in Azure Databricks on a cluster running Apache Spark 2.3.1.
Recommended answer
Here's a non-UDF solution. Split the string on the character you are trying to count; the value you want is the length of the resulting array minus 1:
from pyspark.sql.functions import col, size, split
DF.withColumn('Number_Products_Assigned', size(split(col("assigned_products"), r"\+")) - 1)
You have to escape the + because split treats its pattern argument as a regular expression, and + is a special regex character.
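The escaping point can be checked outside Spark. The following is a minimal sketch in plain Python, not the answer's own code: it uses Python's re module as a stand-in for the Java regex engine that Spark uses, on the assumption that both treat + as a quantifier. The variable names are illustrative only.

```python
import re

# Sample value taken from the question.
value = ("POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+"
         "OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING")

# Escaped: r"\+" matches a literal plus sign, so the string splits into 4 parts.
parts = re.split(r"\+", value)
count_via_split = len(parts) - 1

# Cross-check against Python's plain (non-regex) substring count.
count_via_str = value.count("+")

# Unescaped: "+" alone is a quantifier with nothing to repeat,
# so it is rejected as an invalid pattern.
try:
    re.compile("+")
    unescaped_is_valid = True
except re.error:
    unescaped_is_valid = False

print(count_via_split, count_via_str, unescaped_is_valid)
```

Both counts agree at 3 for the sample value, which matches the Spark output for that row.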
+--------------------+------------------------+
|   assigned_products|Number_Products_Assigned|
+--------------------+------------------------+
|POWER BI PRO+Powe...|                       3|
+--------------------+------------------------+