Reading data from URL using Spark Databricks platform


Problem Description

I'm trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried to use spark.read.csv together with SparkFiles, but I'm still missing some simple point.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

I got a path-related error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-9635783;csv

I also tried another way:

val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString

# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF

  File "<command-332010883169993>", line 16
    val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
              ^
SyntaxError: invalid syntax
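(The SyntaxError above occurs because the snippet is Scala but was executed in a Python notebook cell. Note also that `sc.parallelize(content)` on a whole string would distribute individual characters rather than lines; the commented-out `split` line was the right idea. A minimal Python sketch of that split-and-parse step — the sample content and column names here are made up for illustration:)

```python
# Sketch: split raw CSV text into a header and rows, as the commented-out
# Scala line intended. The sample string stands in for the downloaded content.
content = "age,workclass\n39,State-gov\n50,Self-emp-not-inc\n"

# Drop empty lines (the trailing newline produces one), then split fields.
lines = [line for line in content.split("\n") if line != ""]
header = lines[0].split(",")
rows = [line.split(",") for line in lines[1:]]

print(header)   # ['age', 'workclass']
print(rows[0])  # ['39', 'State-gov']
```

In a real notebook you could then build a DataFrame with `spark.createDataFrame(rows, header)`, though the `spark.read.csv` route is simpler.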

Should the data be loaded directly into a Databricks folder, or should I be able to load it directly from the URL using spark.read? Any suggestions?

Answer

Try this.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

df = spark.read.csv("file://"+SparkFiles.get("adult.csv"), header=True, inferSchema=True)

Just fetching a few columns of your CSV from the URL:

>>> df.select("age","workclass","fnlwgt","education").show(10);
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles.get returns the absolute path of the file, which is local to your driver or worker. That's why Spark was not able to find it: without the file:// scheme, the path was resolved against dbfs:/ instead of the local filesystem.
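The fix simply prepends the string "file://" to the local path. An equivalent, slightly more robust way to build the local-file URI is pathlib's as_uri; the path below is a made-up example of the kind of path SparkFiles.get returns:

```python
from pathlib import PurePosixPath

# Hypothetical local path of the kind SparkFiles.get("adult.csv") returns.
local_path = "/local_disk0/spark-abc123/userFiles-def456/adult.csv"

# as_uri() builds a well-formed file:// URI, handling the scheme and
# leading slashes for us.
uri = PurePosixPath(local_path).as_uri()
print(uri)  # file:///local_disk0/spark-abc123/userFiles-def456/adult.csv
```

Either `uri` or `"file://" + local_path` can then be passed to spark.read.csv.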

