Trying to read and write parquet files from S3 with local Spark


Question

I'm trying to read and write Parquet files from my local machine to S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously there are configurations to be made, but I could not find a clear reference on how to make them.

Currently my Spark session reads local Parquet mocks and is defined as such:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

Answer

I'm going to have to correct the post by himanshuIIITian slightly (sorry).

  1. Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster, works with the newer S3 clusters (Seoul, Frankfurt, London, ...), and scales better. S3N has fundamental performance issues which have only been fixed in the latest version of Hadoop, by deleting that connector entirely. Move on. (A configuration sketch follows this list.)

  2. You cannot safely use S3 as a direct destination of a Spark query, not with the classic "FileSystem" committers available today. Write to your local file:// and then copy the data up afterwards, using the AWS CLI. You'll get better performance, as well as the guarantees of reliable writing which you would normally expect from IO.
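
To make both points concrete, here is a minimal sketch of a local session configured for the s3a connector. This is not from the original answer: it assumes the hadoop-aws module (and its matching AWS SDK) is on the classpath, that credentials are available as environment variables, and that the bucket and paths are placeholders to replace with your own.

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  // Standard hadoop-aws properties; the "spark.hadoop." prefix passes them
  // through to the Hadoop configuration. Reading the credentials from the
  // environment here is purely for illustration.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Read with the s3a:// scheme, not s3n:// (point 1).
val df = sparkSession.read.parquet("s3a://my-bucket/input/")

// Write to a local path rather than directly to S3 (point 2), then copy the
// result up afterwards, e.g. `aws s3 cp --recursive /tmp/output s3://my-bucket/output/`.
df.write.parquet("file:///tmp/output")

If the credentials are already exposed through the environment variables above, the two .config lines can usually be omitted, since the s3a connector's default credential chain checks the environment itself.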
