Find and remove matching column values in pyspark
Question
I have a pyspark dataframe where occasionally a column will have a wrong value that matches another column. It looks something like this:
| Date       | Latitude   |
|------------|------------|
| 2017-01-01 | 43.4553    |
| 2017-01-02 | 42.9399    |
| 2017-01-03 | 43.0091    |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin(), but I can't seem to get it to work. If I try
df['Date'].isin(['Latitude'])
I get:
Column<(Date IN (Latitude))>
Any suggestions?
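As the printed output suggests, .isin() only builds a boolean Column expression; nothing is filtered until that expression is passed to filter() or where(), and plain Python values such as 'Latitude' are treated as string literals rather than column references. A minimal sketch of the usual pattern, with purely illustrative literal values:

```python
from pyspark.sql import functions as F

# isin() with plain Python values compares the column against literals,
# so this keeps only rows whose Date is one of these strings.
kept = df.filter(F.col("Date").isin(["2017-01-01", "2017-01-02"]))

# Negate the expression with ~ to drop the matching rows instead.
dropped = df.filter(~F.col("Date").isin(["2017-01-01", "2017-01-02"]))
```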
Answer
If you're more comfortable with SQL syntax, here is an alternative way using a pyspark-sql condition inside filter():
df = df.filter("Date NOT IN (Latitude)")
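The same condition can also be written with Column operations instead of a SQL string. A sketch of that equivalent, assuming the goal is simply to drop rows where the two columns hold the same value:

```python
from pyspark.sql import functions as F

# Keep only rows where Date differs from Latitude.
df = df.filter(F.col("Date") != F.col("Latitude"))
```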
Or equivalently, using pyspark.sql.DataFrame.where():
df = df.where("Date NOT IN (Latitude)")
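A self-contained sketch showing the whole flow on the sample data from the question (the SparkSession setup is just boilerplate to make the example runnable):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-matching-rows").getOrCreate()

data = [
    ("2017-01-01", "43.4553"),
    ("2017-01-02", "42.9399"),
    ("2017-01-03", "43.0091"),
    ("2017-01-04", "2017-01-04"),  # bad row: Latitude duplicates Date
]
df = spark.createDataFrame(data, ["Date", "Latitude"])

# SQL-style condition inside filter(): drop rows where Date matches Latitude.
cleaned = df.filter("Date NOT IN (Latitude)")
cleaned.show()
# Expected output: only the first three rows remain,
# since the 2017-01-04 row has Date equal to Latitude.
```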