Handle string to array conversion in a PySpark DataFrame
Question
I have a CSV file which, when read into a Spark DataFrame, shows the following in printSchema:
-- list_values: string (nullable = true)
The values in the column list_values look something like:
[[[167, 109, 80, ...]]]
Is it possible to convert this to an array type instead of a string?
I tried splitting it, using code available online for similar problems:
df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))
But if I run the above code, the array I get skips a lot of the values in the original array.
The output of the above code is:
[, 109, 80, 69, 5...
which is different from the original array (167 is missing):
[[[167, 109, 80, ...]]]
Since I am new to Spark, I don't know much about how this is done. (In plain Python I could have used ast.literal_eval, but Spark has no provision for this.)
So I'll repeat the question:
How can I convert/cast an array stored as a string to an array, i.e.
'[]' to [] conversion
Recommended answer
Suppose your DataFrame was the following:
df.show()
#+----+------------------+
#|col1| col2|
#+----+------------------+
#| a|[[[167, 109, 80]]]|
#+----+------------------+
df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":
from pyspark.sql.functions import split, regexp_replace
df2 = df.withColumn(
"col3",
split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()
#+----+------------------+--------------+
#|col1| col2| col3|
#+----+------------------+--------------+
#| a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: string (containsNull = true)
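To check what the pattern does without starting a Spark session, the same strip-then-split steps can be reproduced in plain Python. This is only a sketch: it uses Python's re module rather than Spark's Java regex engine (the two should agree for a pattern this simple), and the sample string is the one from the example DataFrame above.

```python
import re

# Sample value copied from the example DataFrame above.
raw = "[[[167, 109, 80]]]"

# Same pattern as in the regexp_replace call: strip exactly three leading
# "[" characters and exactly three trailing "]" characters.
stripped = re.sub(r"(^\[\[\[)|(\]\]\]$)", "", raw)

# Split on ", " to get a list of strings, mirroring what split() does in Spark.
parts = stripped.split(", ")

print(stripped)  # 167, 109, 80
print(parts)     # ['167', '109', '80']
```

Note that the pattern assumes exactly three levels of brackets and a ", " separator. Rows with a different nesting depth or different spacing would need an adjusted pattern.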
If you wanted the column as an array of integers, you could use cast:
from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: integer (containsNull = true)
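The question mentions ast.literal_eval. If the bracket depth or spacing varies between rows, one possible alternative is to parse each string in Python and flatten the result. The sketch below shows only the plain-Python parsing logic. The name parse_nested_ints is a hypothetical helper, not part of PySpark, and wrapping it in a UDF (for example with pyspark.sql.functions.udf and an ArrayType(IntegerType()) return type) is an assumption that is not shown here. A Python UDF is generally slower than the built-in functions used above.

```python
import ast


def parse_nested_ints(text):
    """Parse a string such as "[[[167, 109, 80]]]" into a flat list of ints.

    Hypothetical helper for illustration only; returns None for None input,
    which is how a null cell would typically arrive in a Python UDF.
    """
    if text is None:
        return None

    # literal_eval safely evaluates Python literals (lists, numbers, ...).
    value = ast.literal_eval(text)

    flat = []

    def walk(item):
        # Recurse into nested lists/tuples; collect leaf values as ints.
        if isinstance(item, (list, tuple)):
            for child in item:
                walk(child)
        else:
            flat.append(int(item))

    walk(value)
    return flat


print(parse_nested_ints("[[[167, 109, 80]]]"))  # [167, 109, 80]
print(parse_nested_ints("[]"))                  # []
```

Flattening discards the nesting. If the nested structure matters, the parsed value could be returned as-is instead, with a matching nested array return type.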