Get CSV to Spark dataframe
Question
I'm using Python on Spark and would like to load a CSV file into a DataFrame.
Strangely, the Spark SQL documentation does not explain how to use CSV as a source.
I have found Spark-CSV, but I have issues with two parts of its documentation:
- "This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?
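One way to avoid retyping the flag every time (a sketch; the package coordinates are taken from the question, and the config key names assume a reasonably recent Spark) is to register the package once in spark-defaults.conf, or export it through the PYSPARK_SUBMIT_ARGS environment variable:

```shell
# Option 1: add this line to conf/spark-defaults.conf so every job picks it up
#   spark.jars.packages  com.databricks:spark-csv_2.10:1.0.3

# Option 2: export once in your shell profile before launching pyspark
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
pyspark
```

Either way Spark resolves the package from its local Ivy cache after the first download, so it is not re-fetched from the network each launch.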
- df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv")
Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
Answer
1. Read the CSV file into an RDD and then generate a RowRDD from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the rows in the RDD created in step 1.
3. Apply the schema to the RDD of rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructField, StructType, StringType

# Read the file into an RDD; each element is one line of text.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = sqlContext.createDataFrame(people, schema)
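The per-line transformation used in the RDD pipeline above can be sanity-checked in plain Python before running it on a cluster (a minimal sketch; the sample lines are made up to match the people.txt format):

```python
# Stand-in for the RDD pipeline: split each comma-separated line, then strip
# whitespace from the second field, producing (name, age) tuples.
lines = ["Michael, 29", "Andy, 30", "Justin, 19"]
parts = [l.split(",") for l in lines]
people = [(p[0], p[1].strip()) for p in parts]
print(people)  # [('Michael', '29'), ('Andy', '30'), ('Justin', '19')]
```

These tuples are exactly what createDataFrame pairs with the StructType schema to produce the DataFrame.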
Source: Spark Programming Guide