Read a bytes column in Spark


Problem Description

I have a data set which contains an ID field that is in an unknown (and not friendly) encoding. I can read the single column using plain python and verify that the values are distinct and consistent across multiple data sets (i.e. it can be used as a primary key for joining).

When loading the file using spark.read.csv, it seems that Spark is converting the column to UTF-8. However, some of the multibyte sequences are converted to the Unicode character U+FFFD REPLACEMENT CHARACTER (EF BF BD in hex).

Is there a way to force Spark to read the column as bytes and not as a string?

Here is some code that can be used to recreate my issue (let column a be the ID field):

Create a file with some sample data

# Write the raw ID bytes straight to disk so the file reproduces the problem.
data = [
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb0', b'1', b'a'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb1', b'2', b'b'),
    (b'\xba\xed\x85\x8e\x91\xd4\xc7\xb2', b'3', b'c')
]

with open('sample.csv', 'wb') as f:
    header = [b"a", b"b", b"c"]
    f.write(b",".join(header) + b"\n")
    for d in data:
        f.write(b",".join(d) + b"\n")

Read with Pandas

import pandas as pd

# latin-1 maps every byte 0x00-0xFF to a distinct character, so the raw ID
# bytes survive decoding and can be hex-encoded in the converter.
df = pd.read_csv("sample.csv", encoding="latin-1",
                 converters={"a": lambda x: x.encode("latin-1").hex()})
print(df)
#                  a  b  c
#0  baed858e91d4c7b0  1  a
#1  baed858e91d4c7b1  2  b
#2  baed858e91d4c7b2  3  c

Try reading the same file with Spark

spark_df = spark.read.csv("sample.csv", header=True)
spark_df.show()
#+-----+---+---+
#|a    |b  |c  |
#+-----+---+---+
#|�텎��ǰ|1  |a  |
#|�텎��DZ|2  |b  |
#|�텎��Dz|3  |c  |
#+-----+---+---+

Yikes! OK, so how about converting to hex?

import pyspark.sql.functions as f
spark_df.withColumn("a", f.hex("a")).show(truncate=False)
#+----------------------------+---+---+
#|a                           |b  |c  |
#+----------------------------+---+---+
#|EFBFBDED858EEFBFBDEFBFBDC7B0|1  |a  |
#|EFBFBDED858EEFBFBDEFBFBDC7B1|2  |b  |
#|EFBFBDED858EEFBFBDEFBFBDC7B2|3  |c  |
#+----------------------------+---+---+

(In this example the values are distinct, but that's not true in my larger file)

As you can see, the values are close, but some of the bytes have been replaced by EF BF BD.

Is there any way to read the file in Spark (maybe using rdd?) so that my output looks like the pandas version:

#+----------------+---+---+
#|a               |b  |c  |
#+----------------+---+---+
#|baed858e91d4c7b0|1  |a  |
#|baed858e91d4c7b1|2  |b  |
#|baed858e91d4c7b2|3  |c  |
#+----------------+---+---+

I've tried casting to byte and specifying the schema so that this column is ByteType(), but that didn't work.

Edit

I am using Spark v 2.1.

Recommended Answer

The problem is rooted in the fact that delimited files are poorly suited to binary data.

If there is a known, consistent encoding for the text, use the charset option. See https://github.com/databricks/spark-csv#features (I don't know of a good place in the 2.x docs where delimited reading options are described so I still go back to the 1.x docs). I would recommend experimenting with 8-bit ASCII, e.g., ISO-8859-1 or US-ASCII.
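For the sample file above, a minimal sketch might look like the following. It assumes a single-byte ISO-8859-1 read preserves every raw byte, and uses the encoding/charset parameter of spark.read.csv in Spark 2.x together with the built-in encode() and hex() functions; it is an illustration, not the only way to wire this up.

import pyspark.sql.functions as f

# Read with a single-byte charset so no byte gets replaced with U+FFFD,
# then turn the strings back into the original bytes and render them as hex.
spark_df = spark.read.csv("sample.csv", header=True, encoding="ISO-8859-1")
spark_df.withColumn("a", f.hex(f.encode("a", "ISO-8859-1"))).show(truncate=False)
# Column a should now show BAED858E91D4C7B0 / ...B1 / ...B2, matching the raw bytes.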

If there is no such encoding, you would need to either transform the input to a different format, e.g., base64 encoding the first column, or manipulate the read data to get it back to what you need.
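As a concrete illustration of the base64 route, here is a hedged sketch against the sample file above. The intermediate file name sample_b64.csv is made up for the example, and the field split assumes the ID bytes never contain a comma, which is precisely why delimited files and binary data mix poorly.

import base64
import pyspark.sql.functions as f

# Pre-process with plain Python so column "a" is base64 text before Spark reads it.
with open("sample.csv", "rb") as src, open("sample_b64.csv", "wb") as dst:
    dst.write(next(src))                      # header line, copied unchanged
    for line in src:
        a, rest = line.rstrip(b"\n").split(b",", 1)
        dst.write(base64.b64encode(a) + b"," + rest + b"\n")

spark_df = spark.read.csv("sample_b64.csv", header=True)
# unbase64() restores the original bytes as a binary column; hex() renders them.
spark_df.withColumn("a", f.hex(f.unbase64("a"))).show(truncate=False)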
