Read a bytes column in Spark


Problem Description

I have a data set which contains an ID field that is in an unknown (and not friendly) encoding. I can read the single column using plain python and verify that the values are distinct and consistent across multiple data sets (i.e. it can be used as a primary key for joining).

When loading the file using spark.read.csv, it seems that Spark is converting the column to UTF-8. However, some of the multibyte sequences are converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD in hex).

Is there a way to force Spark to read the column as bytes and not as a string?

Here is some code that can be used to recreate my issue (let column a be the ID field):

Create a file with some sample data

# Write the raw ID bytes straight to disk so the file reproduces the problem.
data = [
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0', b'1', b'a'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1', b'2', b'b'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb2', b'3', b'c')
]

with open('sample.csv', 'wb') as f:
    header = [b"a", b"b", b"c"]
    f.write(b",".join(header) + b"\n")
    for d in data:
        f.write(b",".join(d) + b"\n")

Read with Pandas

import pandas as pd

# latin-1 maps every byte 0x00-0xFF to a distinct character, so the raw ID
# bytes survive decoding and can be hex-encoded in the converter.
df = pd.read_csv("sample.csv", encoding="latin-1",
                 converters={"a": lambda x: x.encode("latin-1").hex()})
print(df)
#                  a  b  c
#0  baed858e91d4c7b0  1  a
#1  baed858e91d4c7b1  2  b
#2  baed858e91d4c7b2  3  c

Try reading the same file with Spark

spark_df = spark.read.csv("sample.csv", header=True)
spark_df.show()
#+-----+---+---+
#|a    |b  |c  |
#+-----+---+---+
#|�텎��ǰ|1  |a  |
#|�텎��DZ|2  |b  |
#|�텎��Dz|3  |c  |
#+-----+---+---+

Yikes! OK, so how about converting to hex?

import pyspark.sql.functions as f
spark_df.withColumn("a", f.hex("a")).show(truncate=False)
#+----------------------------+---+---+
#|a                           |b  |c  |
#+----------------------------+---+---+
#|EFBFBDED858EEFBFBDEFBFBDC7B0|1  |a  |
#|EFBFBDED858EEFBFBDEFBFBDC7B1|2  |b  |
#|EFBFBDED858EEFBFBDEFBFBDC7B2|3  |c  |
#+----------------------------+---+---+

(In this example the values are distinct, but that's not true in my larger file)

As you can see, the values are close, but some of the bytes have been replaced by EF BF BD.

Is there any way to read the file in Spark (maybe using rdd?) so that my output looks like the pandas version:

#+----------------+---+---+
#|a               |b  |c  |
#+----------------+---+---+
#|baed858e91d4c7b0|1  |a  |
#|baed858e91d4c7b1|2  |b  |
#|baed858e91d4c7b2|3  |c  |
#+----------------+---+---+

I've tried casting to byte and specifying the schema so that this column is ByteType(), but that didn't work.

Edit

I am using Spark v 2.1.

Recommended Answer

The problem is rooted in the fact that delimited files are poorly suited to binary data.

If there is a known, consistent encoding for the text, use the charset option. See https://github.com/databricks/spark-csv#features (I don't know of a good place in the 2.x docs where delimited reading options are described so I still go back to the 1.x docs). I would recommend experimenting with 8-bit ASCII, e.g., ISO-8859-1 or US-ASCII.
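For the sample file above, a minimal sketch might look like the following. It assumes a single-byte ISO-8859-1 read preserves every raw byte, and uses the encoding/charset parameter of spark.read.csv in Spark 2.x together with the built-in encode() and hex() functions; it is an illustration, not the only way to wire this up.

import pyspark.sql.functions as f

# Read with a single-byte charset so no byte gets replaced with U+FFFD,
# then turn the strings back into the original bytes and render them as hex.
spark_df = spark.read.csv("sample.csv", header=True, encoding="ISO-8859-1")
spark_df.withColumn("a", f.hex(f.encode("a", "ISO-8859-1"))).show(truncate=False)
# Column a should now show BAED858E91D4C7B0 / ...B1 / ...B2, matching the raw bytes.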

If there is no such encoding, you would need to either transform the input to a different format, e.g., base64 encoding the first column, or manipulate the read data to get it back to what you need.
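As a concrete illustration of the base64 route, here is a hedged sketch against the sample file above. The intermediate file name sample_b64.csv is made up for the example, and the field split assumes the ID bytes never contain a comma, which is precisely why delimited files and binary data mix poorly.

import base64
import pyspark.sql.functions as f

# Pre-process with plain Python so column "a" is base64 text before Spark reads it.
with open("sample.csv", "rb") as src, open("sample_b64.csv", "wb") as dst:
    dst.write(next(src))                      # header line, copied unchanged
    for line in src:
        a, rest = line.rstrip(b"\n").split(b",", 1)
        dst.write(base64.b64encode(a) + b"," + rest + b"\n")

spark_df = spark.read.csv("sample_b64.csv", header=True)
# unbase64() restores the original bytes as a binary column; hex() renders them.
spark_df.withColumn("a", f.hex(f.unbase64("a"))).show(truncate=False)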
