Filtering on number of times a value appears in PySpark

Question
I have a file with a column containing IDs. Usually an ID appears only once, but occasionally it's associated with multiple records. I want to count how many times a given ID appears, then split the data into two separate dfs so I can run different operations on each. One df should be where IDs appear only once, and one should be where IDs appear multiple times.
I was able to successfully count the number of times an ID appears by grouping on ID and joining the counts back onto the original df, like so:
newdf = df.join(df.groupBy('ID').count(),on='ID')
This works nicely, as I get output like this:
ID Thing count
287099 Foo 3
287099 Bar 3
287099 Foobar 3
321244 Barbar 1
333032 Barfoo 2
333032 Foofoo 2
But now I want to split the df so that I have one df where count = 1 and one where count > 1. The following, and variations of it, didn't work, however:
singular = df2.filter(df2.count == 1)
Instead, I get a 'TypeError: condition should be string or Column' error. When I tried displaying the type of the column, it said the count column is an instance. How can I get PySpark to treat the count column the way I need it to?
Answer
count is a method of the DataFrame:
>>> df2.count
<bound method DataFrame.count of DataFrame[id: bigint, count: bigint]>
whereas filter needs a Column to operate on. Change it as below:
singular = df2.filter(df2['count'] == 1)