pyspark csv at url to dataframe, without writing to disk
Question
How can I read a csv at a url into a dataframe in Pyspark without writing it to disk?
I've tried the following with no luck:
import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')
f = StringIO(text)

# Fails: DataFrameReader.csv() accepts a path (or, since Spark 2.3, an RDD of
# strings), not a Python file-like object such as StringIO.
df1 = sqlContext.read.csv(f, header=True, schema=customSchema)
df1.show()
Answer
TL;DR It is not possible, and in general transferring data through the driver is a dead end.
- Before Spark 2.3, the csv reader can read only from a URI (and http is not supported).
- In Spark 2.3 you can use an RDD:

  spark.read.csv(sc.parallelize(text.splitlines()))

  but the data will be written to disk.
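For context, a minimal end-to-end sketch of that RDD route might look like the following; the SparkSession bootstrap and the reuse of the question's url are assumptions added here, not part of the original answer:

import urllib.request

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
text = urllib.request.urlopen(url).read().decode("utf-8")

# Since Spark 2.3, DataFrameReader.csv() also accepts an RDD of strings;
# header=True treats the first line as the column names.
df = spark.read.csv(sc.parallelize(text.splitlines()), header=True)
df.show(5)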
Alternatively, you can createDataFrame from Pandas:

spark.createDataFrame(pd.read_csv(url))

but this again writes to disk.
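Spelled out, that pandas route is just the following sketch (it assumes pandas is installed on the driver and that spark is an existing SparkSession):

import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# pandas fetches the URL itself and parses it in driver memory;
# createDataFrame then distributes the rows, inferring a schema from the dtypes.
df = spark.createDataFrame(pd.read_csv(url))
df.show(5)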
If the file is small, I'd just use SparkFiles:

from pyspark import SparkFiles

spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("iris.csv"), header=True)
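Note that addFile downloads the URL to a temporary directory on the driver (and makes the file available to every executor), and SparkFiles.get resolves the local copy by its basename, which is why the argument here is "iris.csv". So strictly speaking this route touches disk too; it just saves you from managing the intermediate file yourself.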