使用Spark Databricks平台从URL读取数据 [英] reading data from URL using spark databricks platform
问题描述
尝试在Databricks社区版平台上使用Spark从url读取数据 我尝试使用spark.read.csv和SparkFiles,但仍然缺少一些简单的要点
trying to read data from url using spark on databricks community edition platform i tried to use spark.read.csv and using SparkFiles but still, i am missing some simple point
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
与路径有关的错误:
Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'
我也尝试了其他方法
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF
SyntaxError: invalid syntax
File "<command-332010883169993>", line 16
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
^
SyntaxError: invalid syntax
数据应该直接加载到databricks文件夹中,或者我应该能够使用spark.read从URL直接加载,有任何建议
data should be loaded directly to databricks folder or i should be able load directly from url using spark.read, any suggestions
推荐答案
尝试一下.
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
**df = spark.read.csv("file://"+SparkFiles.get("adult.csv"), header=True, inferSchema= True)**
只需获取csv网址的几列.
Just fetching few columns of your csv url.
df.select("age","workclass","fnlwgt","education").show(10);
>>> df.select("age","workclass","fnlwgt","education").show(10);
+---+----------------+------+---------+
|age| workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39| State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38| Private|215646| HS-grad|
| 53| Private|234721| 11th|
| 28| Private|338409|Bachelors|
| 37| Private|284582| Masters|
| 49| Private|160187| 9th|
| 52|Self-emp-not-inc|209642| HS-grad|
| 31| Private| 45781| Masters|
| 42| Private|159449|Bachelors|
+---+----------------+------+---------+
SparkFiles获取驱动程序或工作程序本地文件的绝对路径.这就是为什么它找不到它的原因.
SparkFiles get the absolute path of the file which is local to your driver or worker. That's the reason why it was not able to find it.
这篇关于使用Spark Databricks平台从URL读取数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!