count values in multiple columns that contain a substring based on strings of lists pyspark
Problem description
I have a data frame in Pyspark like below. I want to count values in two columns based on some lists and populate new columns for each list.
df.show()
+---+-------------+--------------+
| id| device| device_model|
+---+-------------+--------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 2| iphone| apple iphone|
| 3| spy camera| |
| 3| cctv| cctv|
+---+-------------+--------------+
The lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
I want to count the device and device_model for each id and pivot the values in a new data frame.
I want to count the values in both the device_model and device columns for each id that match the strings in the list.
For example: in phone_list I have an iphone string; this should count both values, iphone and iphone5.
My desired result:
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 4| 2| 2|
| 2| 2|null| 1|
| 3| null| 2| 3|
+---+------+----+--------+
I have done the following:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones')
     .when(df.device.isin(pc_list), 'pc')
     .when(df.device.isin(security_list), 'security')
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
Using the above I can only handle the device column, and only when the string matches exactly. I can't figure out how to do it for both columns, or when a value merely contains the string.
How can I get the desired result?
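The gap between exact matching and substring matching can be illustrated in plain Python (using phone_list from above; this is only an analogy for what isin does versus what is needed):

```python
phone_list = ['iphone', 'android', 'nokia']

# isin-style exact membership: 'iphone5' is not literally in the list
exact = 'iphone5' in phone_list

# substring check: 'iphone5' contains 'iphone', so it should be counted
contains = any(w in 'iphone5' for w in phone_list)
```

Here `exact` is False while `contains` is True, which is exactly the behavior the question is asking for.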
Recommended answer
Here is a working solution. I have used a udf function for checking the strings and calculating the sum. You can use built-in functions instead if possible. (Comments are provided as explanation.)
from pyspark.sql import functions as F
from pyspark.sql import types as t

# dictionary mapping the new column names to the lists
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

# udf: count how many of the list's substrings appear in the two columns,
# guarding against null/empty cells
def checkDevices(device, deviceModel, name):
    total = 0
    for x in columnLists[name]:
        if device and x in device:
            total += 1
        if deviceModel and x in deviceModel:
            total += 1
    return total

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

# populate the per-row count returned from the udf into the respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

# finally group by id and sum
df.groupBy('id').agg(F.sum('phone').alias('phone'),
                     F.sum('pc').alias('pc'),
                     F.sum('security').alias('security')).show()
which should give you:
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
| 3| 0| 2| 3|
| 1| 4| 2| 2|
| 2| 2| 0| 1|
+---+-----+---+--------+
The aggregation part can be generalized in the same way for the rest. Improvements and modifications are in your hands. :)