Filtering on number of times a value appears in PySpark

Question
I have a file with a column containing IDs. Usually an ID appears only once, but occasionally it's associated with multiple records. I want to count how many times a given ID appears, then split the data into two separate dfs so I can run different operations on each. One df should be where IDs appear only once, and one should be where IDs appear multiple times.
I was able to successfully count the number of times an ID appears by grouping on ID and joining the counts back onto the original df, like so:
newdf = df.join(df.groupBy('ID').count(),on='ID')
This works nicely, as I get output like this:
ID Thing count
287099 Foo 3
287099 Bar 3
287099 Foobar 3
321244 Barbar 1
333032 Barfoo 2
333032 Foofoo 2
But now I want to split the df so that I have one df where count = 1 and one where count > 1. The following, and variations of it, didn't work, however:
singular = df2.filter(df2.count == 1)
Instead, I get a 'TypeError: condition should be string or Column' error. When I tried displaying the type of the column, it said the count column is an instance. How can I get PySpark to treat the count column the way I need it to?
Answer
count is a method of the DataFrame:
>>> df2.count
<bound method DataFrame.count of DataFrame[id: bigint, count: bigint]>
whereas filter needs a Column to operate on. Change it as below:
singular = df2.filter(df2['count'] == 1)