Get CSV to Spark dataframe


Problem description

I'm using Python on Spark and would like to get a CSV file into a DataFrame.

The documentation for Spark SQL strangely provides no explanation of CSV as a source.

I have found Spark-CSV; however, I have issues with two parts of the documentation:

  • "This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3" 每次启动pyspark或spark-submit时,是否真的需要添加此参数?看起来很不雅致.有没有一种方法可以在python中导入它,而不是每次都重新下载它?

  • "This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3" Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?

  • df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"? (See the second sketch below.)
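On the first point: --packages downloads the jar once into the local Ivy cache (~/.ivy2 by default), so later launches reuse it rather than re-downloading. If retyping the flag still bothers you, here is a minimal sketch of one workaround, assuming you start PySpark from a plain Python process; the app name is illustrative and the coordinate is the one from the question:

import os

# Hypothetical one-time setup: declare the spark-csv coordinate in
# PYSPARK_SUBMIT_ARGS so it rides along with every context this process
# creates. The trailing "pyspark-shell" token is expected by PySpark's
# launcher. This must run before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-example")  # app name is illustrative
sqlContext = SQLContext(sc)

Alternatively, the equivalent setting can be placed once in conf/spark-defaults.conf (as spark.jars.packages, on versions that support that property), so the interactive shells pick it up automatically.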
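On the second point: "source" names the data-source implementation that Spark SQL should delegate to; "com.databricks.spark.csv" tells it to use the spark-csv package. A plain path works for local files, but an explicit file:// scheme keeps the path from being resolved against HDFS on a cluster. A minimal sketch, assuming spark-csv is on the classpath via --packages and reusing the path from the question:

# "source" selects the data-source implementation (here, spark-csv);
# "header" is an option passed through to that source; "path" may be any
# Hadoop-compatible URI; file:// pins it to the local filesystem.
df = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="file:///Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv",
)
df.show(5)  # quick sanity check on the first rows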

Recommended answer

1. Read the CSV file into an RDD and then generate a RowRDD from the original RDD.

2. Create the schema, represented by a StructType, matching the structure of the rows in the RDD created in step 1.

3. Apply the schema to the RDD of rows via the createDataFrame method provided by SQLContext.

from pyspark.sql.types import StructField, StructType, StringType

# Read the raw text file and split each line on commas.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

# Build a nullable string-typed StructField for each column name.
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD of tuples.
schemaPeople = sqlContext.createDataFrame(people, schema)

Source: Spark Programming Guide
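Once the DataFrame exists, it behaves like any other. As a short follow-up sketch, building on the schemaPeople DataFrame above with a hypothetical query, you can register it as a temporary table and query it with SQL:

# Register the DataFrame as a temporary table so it can be queried with SQL.
schemaPeople.registerTempTable("people")

# All columns were declared StringType above, so age compares as a string.
adults = sqlContext.sql("SELECT name FROM people WHERE age = '30'")
for row in adults.collect():
    print(row.name)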
