Pyspark:根据多种条件过滤数据框 [英] Pyspark: Filter dataframe based on multiple conditions
本文介绍了Pyspark:根据多种条件过滤数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我想首先根据以下条件过滤数据帧(d <5),其次(如果col1中的值等于col3中的对应值,col2的值不等于col4中的对应值).
I want to filter dataframe according to the following conditions firstly (d<5) and secondly (value of col2 not equal its counterpart in col4 if value in col1 equal its counterpart in col3).
如果原始数据帧DF
如下:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| C| xxx| D| vv| 10|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| E| xxx| F| vvv| 6|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xxx| 4|
| G| xxx| G| xx| 4|
| G| xxx| G| xxx| 12|
| B|xxxx| B| xx| 13|
+----+----+----+----+---+
所需的数据框为:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
我尝试过的代码未按预期工作:
Code I have tried that did not work as expected:
cols=[('A','xx','D','vv',4),('C','xxx','D','vv',10),('A','x','A','xx',3),('E','xxx','B','vv',3),('E','xxx','F','vvv',6),('F','xxxx','F','vvv',4),('G','xxx','G','xxx',4),('G','xxx','G','xx',4),('G','xxx','G','xxx',12),('B','xxxx','B','xx',13)]
df=spark.createDataFrame(cols,['col1','col2','col3','col4','d'])
df.filter((df.d<5)& (df.col2!=df.col4) & (df.col1==df.col3)).show()
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| x| A| xx| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
我应该怎么做才能达到预期的效果?
What should I do to achieve the desired result?
推荐答案
您的逻辑条件是错误的. IIUC,您想要的是:
Your logic condition is wrong. IIUC, what you want is:
import pyspark.sql.functions as f
df.filter((f.col('d')<5))\
.filter(
((f.col('col1') != f.col('col3')) |
(f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
)\
.show()
我将filter()
步骤分为2个以提高可读性的调用,但是您可以等效地在一行中完成.
I broke the filter()
step into 2 calls for readability, but you could equivalently do it in one line.
输出:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
这篇关于Pyspark:根据多种条件过滤数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文