Count values in multiple columns that contain a substring, based on lists of strings, in PySpark


Problem description

I have a data frame in PySpark like the one below. I want to count values in two columns based on some lists and populate a new column for each list.

df.show()

+---+-------------+--------------+
| id|       device|  device_model|
+---+-------------+--------------+
|  3|      mac pro|           mac|
|  1|       iphone|       iphone5|
|  1|android phone|       android|
|  1|   windows pc|       windows|
|  1|   spy camera|    spy camera|
|  2|             |        camera|
|  2|       iphone|  apple iphone|
|  3|   spy camera|              |
|  3|         cctv|          cctv|
+---+-------------+--------------+

The lists are below:

phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']

I want to count the device and device_model values for each id and pivot the counts into a new data frame.

I want to count the values in both the device_model and device columns, for each id, that match the strings in each list.

For example: phone_list contains the string iphone, which should count both of the values iphone and iphone5.

The result I want:

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     4|   2|       2|
|  2|     2|null|       1|
|  3|  null|   2|       3|
+---+------+----+--------+

I have tried the following:

df.withColumn('cat', 
    F.when(df.device.isin(phone_list), 'phones').otherwise(
    F.when(df.device.isin(pc_list), 'pc').otherwise(
    F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()

With the above I can only handle the device column, and only when the string matches exactly. I can't figure out how to do this for both columns and for values that merely contain the string.

How can I get the desired result?

Recommended answer

Here is a working solution. I have used a udf function to check the strings and calculate the sum. You can use built-in functions instead if possible. (Comments are provided as explanation.)

from pyspark.sql import functions as F
from pyspark.sql import types as t

# dictionary mapping each new column name to its list of substrings
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

# udf function: counts how many strings of the named list occur in device and in device_model
def checkDevices(device, deviceModel, name):
    total = 0  # renamed from `sum` to avoid shadowing the Python built-in
    for x in columnLists[name]:
        # the None checks guard against null values, which would otherwise raise a TypeError
        if device is not None and x in device:
            total += 1
        if deviceModel is not None and x in deviceModel:
            total += 1
    return total

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

# populate one column per list with the count returned by the udf
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

# finally group by id and sum the per-row counts
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()

which should give you

+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
|  3|    0|  2|       3|
|  1|    4|  2|       2|
|  2|    2|  0|       1|
+---+-----+---+--------+
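
To see where these numbers come from, the counting rule can be reproduced in plain Python without Spark. This is only a sanity-check sketch: the rows are retyped from the sample data in the question, and the names rows, column_lists and totals are introduced here for illustration.

```python
from collections import defaultdict

# Sample rows retyped from the question: (id, device, device_model)
rows = [
    (3, "mac pro", "mac"),
    (1, "iphone", "iphone5"),
    (1, "android phone", "android"),
    (1, "windows pc", "windows"),
    (1, "spy camera", "spy camera"),
    (2, "", "camera"),
    (2, "iphone", "apple iphone"),
    (3, "spy camera", ""),
    (3, "cctv", "cctv"),
]

column_lists = {
    "phone": ["iphone", "android", "nokia"],
    "pc": ["windows", "mac"],
    "security": ["camera", "cctv"],
}

# For every row, each list string adds 1 per column value that contains it
totals = defaultdict(lambda: {name: 0 for name in column_lists})
for row_id, device, device_model in rows:
    for name, substrings in column_lists.items():
        for s in substrings:
            totals[row_id][name] += (s in device) + (s in device_model)

for row_id in sorted(totals):
    print(row_id, totals[row_id])
```

Note that a value containing two strings from the same list would be counted twice under this rule; that case does not occur in the sample data.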

The aggregation part can be generalized in the same way as the rest. Improvements and modifications are in your hands. :)
