Spark Dataframe - Python - count substring in string


Question

I have a Spark dataframe with a string column ("assigned_products") that contains values such as the following:

"POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"

I would like to count the occurrences of "+" in the string and return that value in a new column.

I tried the following, but I keep getting errors.

from pyspark.sql.functions import col
DF.withColumn('Number_Products_Assigned', col("assigned_products").count("+"))

I'm running my code in Azure Databricks on a cluster running Apache Spark 2.3.1.

Recommended answer

Here's a non-UDF solution. Split the string on the character you are trying to count; the value you want is the length of the resulting array minus 1:

from pyspark.sql.functions import col, size, split
DF.withColumn('Number_Products_Assigned', size(split(col("assigned_products"), r"\+")) - 1)

You have to escape the + because split interprets its pattern as a regular expression, and + is a special regex character.

+--------------------+------------------------+
|   assigned_products|Number_Products_Assigned|
+--------------------+------------------------+
|POWER BI PRO+Powe...|                       3|
+--------------------+------------------------+
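
The split-and-subtract logic can be checked without a cluster. The following is only a minimal sketch in plain Python: it uses Python's `re` module rather than the Java regex engine that Spark's split relies on, and the helper name `count_by_split` is hypothetical. Both engines treat + as a quantifier, so the escaping concern is the same.

```python
import re

# Sample value taken from the question.
value = (
    "POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+"
    "OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"
)


def count_by_split(text, literal):
    """Count occurrences of `literal` by splitting on it (hypothetical helper).

    Splitting on a separator that occurs n times yields n + 1 pieces,
    so the count is the number of pieces minus 1.
    """
    # re.escape turns "+" into a pattern that matches a literal plus sign.
    pieces = re.split(re.escape(literal), text)
    return len(pieces) - 1


plus_count = count_by_split(value, "+")

# An unescaped "+" is a quantifier with nothing to repeat, so it is not
# a valid pattern on its own.
try:
    re.compile("+")
    unescaped_is_valid = True
except re.error:
    unescaped_is_valid = False

print(plus_count, unescaped_is_valid)
```

For a single literal character, the result matches what Python's built-in `str.count` returns for the same string, which makes it a convenient cross-check on a few sample rows before running the Spark version on the full dataframe.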
