Filtering on number of times a value appears in PySpark


Question


I have a file with a column containing IDs. Usually, an ID appears only once, but occasionally, they're associated with multiple records. I want to count how many times a given ID appeared, and then split into two separate dfs so I can run different operations on both. One df should be where IDs only appear once, and one should be where IDs appear multiple times.


I was able to successfully count the number of instances an ID appeared by grouping on ID and joining the counts back onto the original df, like so:

newdf = df.join(df.groupBy('ID').count(),on='ID')


This works nicely, as I get an output like so:

ID      Thing  count
287099  Foo     3
287099  Bar     3
287099  Foobar  3
321244  Barbar  1
333032  Barfoo  2
333032  Foofoo  2


But, now I want to split the df so that I have a df where count = 1, and count > 1. The below and variations thereof didn't work, however:

singular = df2.filter(df2.count == 1)


I get a 'TypeError: condition should be string or Column' error instead. When I tried displaying the type of the column, it says the count column is an instance. How can I get PySpark to treat the count column the way I need it to?

Answer


count is a method of DataFrame, so attribute access (df2.count) returns the bound method rather than your count column:

>>> df2.count
<bound method DataFrame.count of DataFrame[id: bigint, count: bigint]>


filter, however, needs a Column to operate on, so refer to the column with bracket notation instead:

singular = df2.filter(df2['count'] == 1)

