Spark Dataframe - Python - count substring in string

Views: 275 Published: 2020/9/4 4:32:21 python string apache-spark pyspark


Problem description

I have a Spark dataframe with a column ("assigned_products") of type string that contains values such as the following:

"POWER BI PRO+Power BI (free)+AUDIO CONFERENCING+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"

I would like to count the occurrences of "+" in the string and return that value in a new column.

I tried the following, but I keep getting errors.

from pyspark.sql.functions import col
DF.withColumn('Number_Products_Assigned', col("assigned_products").count("+"))

I'm running my code in Azure Databricks on a cluster running Apache Spark 2.3.1.

Recommended answer

Here's a non-UDF solution. Split the string on the character you are trying to count; the value you want is the length of the resulting array minus 1:

from pyspark.sql.functions import col, size, split
DF.withColumn('Number_Products_Assigned', size(split(col("assigned_products"), r"\+")) - 1)

You have to escape the + because split treats its second argument as a regular expression, and + is a special regex character (a quantifier).

+--------------------+------------------------+
|   assigned_products|Number_Products_Assigned|
+--------------------+------------------------+
|POWER BI PRO+Powe...|                       3|
+--------------------+------------------------+
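The "length of the split result minus 1" arithmetic can be sanity-checked without a cluster. The following is a minimal plain-Python sketch, not Spark code: it uses Python's `re` module as a stand-in for the regex engine Spark uses, and the helper name `count_by_split` is hypothetical. It only illustrates why the separator must be escaped and why the result for the sample value is 3.

```python
import re

# Sample value taken from the question.
assigned_products = (
    "POWER BI PRO+Power BI (free)+AUDIO CONFERENCING"
    "+OFFICE 365 ENTERPRISE E5 WITHOUT AUDIO CONFERENCING"
)


def count_by_split(text, separator):
    """Count occurrences of a literal separator as len(split parts) - 1.

    Hypothetical helper for illustration; it mirrors size(split(...)) - 1.
    """
    # re.escape makes "+" match literally instead of acting as a quantifier.
    parts = re.split(re.escape(separator), text)
    return len(parts) - 1


# An unescaped "+" is not a valid pattern on its own in Python's re module.
try:
    re.split("+", assigned_products)
    unescaped_failed = False
except re.error:
    unescaped_failed = True

split_count = count_by_split(assigned_products, "+")
direct_count = assigned_products.count("+")  # cross-check with str.count

print(unescaped_failed, split_count, direct_count)
```

Both counts agree with the 3 shown in the output table above: three separators produce four pieces, so the piece count minus 1 gives the number of occurrences.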
