Get CSV to Spark dataframe


Problem description

I'm using Python on Spark and would like to get a CSV into a dataframe.

Strangely, the documentation for Spark SQL does not cover CSV as a source.

I have found Spark-CSV; however, I have issues with two parts of the documentation:

  • "可以使用 --jars 命令行选项将此包添加到 Spark.例如,在启动 spark shell 时包含它: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"我真的需要在每次启动 pyspark 或 spark-submit 时添加这个参数吗?显得非常不雅观.有没有办法在python中导入而不是每次都重新下载?

  • "This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3" Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?

  • df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"? (See the sketch after this list.)
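
A common way around the first issue is to put the package in configuration once instead of retyping it: either add a spark.jars.packages entry to conf/spark-defaults.conf, or set the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. The sketch below shows the environment-variable approach together with a load call addressing the second issue, assuming the Spark 1.3-era API; the app name and the file:// path are illustrative:

import os

# Must be set before the SparkContext exists; the trailing "pyspark-shell"
# is required so PySpark knows how to launch the JVM. This avoids retyping
# --packages on every launch.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-example")  # illustrative app name
sqlContext = SQLContext(sc)

# "source" names the data-source implementation that parses the file,
# here the spark-csv package. A local file on Linux can be given as a
# plain path, or with an explicit file:// prefix when HDFS is the default
# filesystem.
df = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="file:///Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv",
)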

Recommended answer

1. Read the CSV file into an RDD, then generate a RowRDD from the original RDD.

2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.

3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

from pyspark.sql.types import StructType, StructField, StringType

# Read the raw text file and split each line on commas.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = sqlContext.createDataFrame(people, schema)
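
Following the same programming-guide example, the resulting DataFrame can be registered as a temporary table and queried with SQL as a quick check (the table name matches the example; the query itself is illustrative):

# Register the DataFrame as a temp table and query it with SQL.
schemaPeople.registerTempTable("people")
results = sqlContext.sql("SELECT name FROM people")
for row in results.collect():
    print(row.name)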

Source: Spark Programming Guide

