pyspark bitwiseAND vs ampersand operator
Question
I am trying to add a column to a DataFrame that indicates when two different values are both found in a nested array:
expr1 = array_contains(df.child_list, "value1")
expr2 = array_contains(df.child_list, "value2")
I got it to work with the ampersand operator:
df.select(...).withColumn("boolTest", expr1 & expr2)
Then I tried to replace this with bitwiseAND, the thought being that I would eventually want to AND a list of these expressions together dynamically.
This fails with an error:
df.select(...).withColumn("boolTest", expr1.bitwiseAND(expr2))
cannot resolve ..... due to data type mismatch: '(array_contains(c1.`child_list`, 'value1') &
array_contains(c1.`child_list`, 'value2'))' requires integral type,
not boolean;;
What's the difference, and what am I doing wrong?
Answer
The & and | operators work on BooleanType columns in pyspark as logical AND and OR operations. In other words, they take True/False as input and output True/False.
The bitwiseAND function does bit-by-bit ANDing of two numeric values. So it can take two integers and output the bitwise AND of them.
Here is an example of each:
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([
    StructField("b1", BooleanType()),
    StructField("b2", BooleanType()),
    StructField("int1", IntegerType()),
    StructField("int2", IntegerType())
])

data = [
    (True, True, 0x01, 0x01),
    (True, False, 0xFF, 0xA),
    (False, False, 0x01, 0x00)
]

df = sqlContext.createDataFrame(sc.parallelize(data), schema)

# "logical" uses & on boolean columns; "bitwise" uses bitwiseAND on integers
df2 = df.withColumn("logical", df.b1 & df.b2) \
        .withColumn("bitwise", df.int1.bitwiseAND(df.int2))

df2.show()
df2.printSchema()
+-----+-----+----+----+-------+-------+
| b1| b2|int1|int2|logical|bitwise|
+-----+-----+----+----+-------+-------+
| true| true| 1| 1| true| 1|
| true|false| 255| 10| false| 10|
|false|false| 1| 0| false| 0|
+-----+-----+----+----+-------+-------+
root
|-- b1: boolean (nullable = true)
|-- b2: boolean (nullable = true)
|-- int1: integer (nullable = true)
|-- int2: integer (nullable = true)
|-- logical: boolean (nullable = true)
|-- bitwise: integer (nullable = true)
If you want to dynamically AND together a list of columns, you can do it like this:
from functools import reduce  # required on Python 3
from pyspark.sql.functions import col

columns = [col("b1"), col("b2")]
df.withColumn("result", reduce(lambda a, b: a & b, columns))