使用 spark databricks 平台从 URL 读取数据 [英] reading data from URL using spark databricks platform
问题描述
尝试在databricks社区版平台上使用spark从url读取数据我尝试使用 spark.read.csv 并使用 SparkFiles 但仍然缺少一些简单的点
trying to read data from url using spark on databricks community edition platform i tried to use spark.read.csv and using SparkFiles but still, i am missing some simple point
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
得到与路径相关的错误:
got path related error:
路径不存在:dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-9635783;csv代码>
我也试过其他方法
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF
SyntaxError: invalid syntax
File "<command-332010883169993>", line 16
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
^
SyntaxError: invalid syntax
数据应该直接加载到databricks文件夹或者我应该能够使用spark.read直接从url加载,任何建议
data should be loaded directly to databricks folder or i should be able load directly from url using spark.read, any suggestions
推荐答案
试试这个.
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
**df = spark.read.csv("file://"+SparkFiles.get("adult.csv"), header=True, inferSchema= True)**
只需获取您的 csv 网址的几列.
Just fetching few columns of your csv url.
df.select("age","workclass","fnlwgt","education").show(10);
>>> df.select("age","workclass","fnlwgt","education").show(10);
+---+----------------+------+---------+
|age| workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39| State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38| Private|215646| HS-grad|
| 53| Private|234721| 11th|
| 28| Private|338409|Bachelors|
| 37| Private|284582| Masters|
| 49| Private|160187| 9th|
| 52|Self-emp-not-inc|209642| HS-grad|
| 31| Private| 45781| Masters|
| 42| Private|159449|Bachelors|
+---+----------------+------+---------+
SparkFiles 获取驱动程序或工作器本地文件的绝对路径.这就是它无法找到它的原因.
SparkFiles get the absolute path of the file which is local to your driver or worker. That's the reason why it was not able to find it.
这篇关于使用 spark databricks 平台从 URL 读取数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!