Reading a zst archive in Scala & Spark: native zStandard library not available


Problem description

I'm trying to read a zst-compressed file using Spark on Scala.

 import org.apache.spark.sql._
 import org.apache.spark.sql.types._

 // Explicit schema for the JSON records, so Spark does not have to infer it.
 val schema = new StructType()
   .add("title", StringType, true)
   .add("selftext", StringType, true)
   .add("score", LongType, true)
   .add("created_utc", LongType, true)
   .add("subreddit", StringType, true)
   .add("author", StringType, true)

 // Read the zstd-compressed JSON file using that schema.
 val df_with_schema = spark.read.schema(schema).json("/home/user/repos/concepts/abcde/RS_2019-09.zst")

 df_with_schema.take(1)

Unfortunately this produces the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.0.101 executor driver): java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.
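
Where that exception comes from (my own note, not part of the error output): Hadoop's ZStandardCodec checks whether the libhadoop loaded into the JVM was built with zstd support before it will decompress anything. A minimal sketch of that check, assuming hadoop-common is on the classpath (e.g. inside spark-shell); the variable names are mine:

 import org.apache.hadoop.util.NativeCodeLoader

 // Was libhadoop.so found and loaded into this JVM at all?
 val nativeLoaded = NativeCodeLoader.isNativeCodeLoaded
 // Only meaningful (and safe to call) once libhadoop is loaded:
 // was that libhadoop compiled with zstd support?
 val zstdSupported = nativeLoaded && NativeCodeLoader.buildSupportsZstd
 println(s"libhadoop loaded: $nativeLoaded, built with zstd: $zstdSupported")
 // If either value is false, the codec raises the RuntimeException quoted above.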

My hadoop checknative looks as follows, but I understand from here that Apache Spark has its own ZStandardCodec.

Native library checking:

  • hadoop: true /opt/hadoop/lib/native/libhadoop.so.1.0.0
  • zlib: true /lib/x86_64-linux-gnu/libz.so.1
  • zstd: true /lib/x86_64-linux-gnu/libzstd.so.1
  • snappy: true /lib/x86_64-linux-gnu/libsnappy.so.1
  • lz4: true revision:10301
  • bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
  • openssl: false EVP_CIPHER_CTX_cleanup
  • ISA-L: false libhadoop was built without ISA-L support
  • PMDK: false The native code was built without PMDK support.
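
If I read this correctly (an assumption on my part rather than anything confirmed), Spark's own ZStandardCodec covers Spark-internal compression such as shuffle data, while a .zst input file is matched by its extension through Hadoop's CompressionCodecFactory to Hadoop's ZStandardCodec, which is where the native check above happens. A small sketch of that resolution, again assuming hadoop-common is on the classpath:

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.compress.CompressionCodecFactory

 // Which codec does Hadoop pick for a .zst file name? Building the factory and
 // looking up the codec does not need the native library; only decompressing does.
 val factory = new CompressionCodecFactory(new Configuration())
 val codec = factory.getCodec(new Path("/home/user/repos/concepts/abcde/RS_2019-09.zst"))
 println(codec.getClass.getName)   // expected: org.apache.hadoop.io.compress.ZStandardCodec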

Any ideas are appreciated, thank you!

UPDATE 1: As per this post, I now understand better what the message means, namely that zstd is not enabled by default when Hadoop is compiled, so one possible solution would obviously be to build it with that flag enabled.

Answer

Since I didn't want to build Hadoop myself, I took inspiration from the workaround used here and configured Spark to use the Hadoop native libraries:

spark.driver.extraLibraryPath=/opt/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/hadoop/lib/native
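
At least for the driver, the library path has to be known before the JVM is launched, so I would put both settings in conf/spark-defaults.conf or pass them with --conf to spark-submit/spark-shell rather than set them from inside an already-running session (this is my understanding, not something stated in the original workaround). As a rough sanity check, from inside the session one should then see something like:

 // Confirm the setting reached the driver; expect /opt/hadoop/lib/native.
 println(spark.sparkContext.getConf.get("spark.driver.extraLibraryPath"))
 // The driver's library path should now include that directory as well.
 println(System.getProperty("java.library.path"))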

I can now read the zst archive into a DataFrame with no issues.
