pyspark bitwiseAND vs ampersand operator


Question

I am trying to add a column to a dataframe that indicates when two different values are both found in a nested array:

 expr1 = array_contains(df.child_list, "value1")
 expr2 = array_contains(df.child_list, "value2")

I got it to work with the ampersand operator:

 df.select(...).withColumn("boolTest", expr1 & expr2)

Then I tried to replace this with bitwiseAND, the thought being that I would want to AND a list of these expressions together dynamically.

This fails with an error:

 df.select(...).withColumn("boolTest", expr1.bitwiseAND(expr2))

 cannot resolve ..... due to data type mismatch: '(array_contains(c1.`child_list`, 'value1') & 
array_contains(c1.`child_list`, 'value2'))' requires integral type, 
not boolean;;

What's the distinction and what am I doing wrong?

Answer

The & and | operators on BooleanType columns in pyspark act as logical AND and OR operations. In other words, they take True/False as input and output True/False.

The bitwiseAND function does bit-by-bit AND'ing of two numeric values, so it takes two integers and outputs their bitwise AND.
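Plain Python's own `&` operator shows the same duality, which may help build intuition before touching Spark at all (a standalone sketch, no pyspark involved):

```python
# On booleans, & acts as a logical AND and returns a bool
logical = True & False        # False

# On integers, & performs a bitwise AND
bitwise = 0xFF & 0x0A         # 0b11111111 & 0b00001010 -> 10

print(logical, bitwise)       # False 10
```

The difference in pyspark is that the two behaviors are split across `&` (on BooleanType columns) and `bitwiseAND` (on integral columns), rather than being overloaded on one operator.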

Here is an example of each:

from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([
  StructField("b1", BooleanType()),
  StructField("b2", BooleanType()),
  StructField("int1", IntegerType()),
  StructField("int2", IntegerType())
])
data = [
  (True, True, 0x01, 0x01),
  (True, False, 0xFF, 0xA),
  (False, False, 0x01, 0x00)
]

# `spark` is the SparkSession (pre-created in the pyspark shell);
# createDataFrame accepts a plain list, so sc.parallelize is not needed
df = spark.createDataFrame(data, schema)


df2 = df.withColumn("logical", df.b1 & df.b2) \
        .withColumn("bitwise", df.int1.bitwiseAND(df.int2))

df2.printSchema()
df2.show()

+-----+-----+----+----+-------+-------+
|   b1|   b2|int1|int2|logical|bitwise|
+-----+-----+----+----+-------+-------+
| true| true|   1|   1|   true|      1|
| true|false| 255|  10|  false|     10|
|false|false|   1|   0|  false|      0|
+-----+-----+----+----+-------+-------+


>>> df2.printSchema()
root
 |-- b1: boolean (nullable = true)
 |-- b2: boolean (nullable = true)
 |-- int1: integer (nullable = true)
 |-- int2: integer (nullable = true)
 |-- logical: boolean (nullable = true)
 |-- bitwise: integer (nullable = true)

If you want to dynamically AND together a list of columns, you can do it like this:

from functools import reduce            # reduce lives in functools in Python 3
from pyspark.sql.functions import col

columns = [col("b1"), col("b2")]
df.withColumn("result", reduce(lambda a, b: a & b, columns))
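Applied to the original question, the same pattern lets you AND an arbitrary list of `array_contains` expressions. The fold itself is just `functools.reduce`; the sketch below runs on plain booleans so it is self-contained, but with pyspark `Column` objects (e.g. `array_contains(df.child_list, v)` for each value `v`) the `&` inside the lambda builds the combined column expression instead of evaluating it immediately:

```python
from functools import reduce

# Hypothetical per-value results; with pyspark these would instead be
# Column expressions such as array_contains(df.child_list, v)
checks = [True, True, False]

# Fold & over the list: the result is True only if every check is True
all_found = reduce(lambda a, b: a & b, checks)
print(all_found)  # False
```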

