how to load large csv with many fields to Spark


Question

Happy New Year!

I know this type of similar question has been asked/answered before, however, mine is different:

I have a large CSV with 100+ fields and 100MB+, and I want to load it into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data).

Thank you very much.

UPDATE 1 (2016.12.31, 1:26pm EST):

I used the following approach and was able to load the data (sample data with limited columns); however, I need to auto-assign the header (from the CSV) as the field names in the DataFrame. But the DataFrame looks like:

Can anyone tell me how to do it? Note, I want to avoid any manual approach.

>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv') 
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5) 
>>> df = rdd.toDF() 
>>> df.show(5) 
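To auto-assign the header row as the DataFrame's column names in Spark 1.6 (where spark.read.csv is not available), one approach is to read the first line, parse it with the csv module, filter it out of the RDD, and pass the resulting names to toDF. The helper below is plain Python; the Spark calls in the comments are an untested sketch that reuses the `sc` context and file path from the snippet above:

```python
import csv
from io import StringIO

def parse_csv_line(line):
    """Split one CSV line into fields, honoring quoted commas."""
    return next(csv.reader(StringIO(line)))

# Sketch of the Spark 1.6 side (assumes the `sc` context and file used above):
#
#   raw = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
#   header_line = raw.first()                      # first physical line of the file
#   header = parse_csv_line(header_line)           # list of column names
#   data = (raw.filter(lambda line: line != header_line)
#              .map(parse_csv_line))
#   df = data.toDF(header)                         # column names come from the CSV
#   df.show(5)

# The parsing helper itself works on any CSV text:
print(parse_csv_line('ORIGIN,DEST,"CITY, STATE"'))  # ['ORIGIN', 'DEST', 'CITY, STATE']
```

One caveat of the sketch: filtering by equality with the header line would also drop any data row whose raw text is identical to the header, which is harmless for a typical header of column names.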

Answer

As noted in the comments, you can use spark.read.csv for Spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html)

df = spark.read.csv('your_file.csv', header=True, inferSchema=True)

Setting header to True parses the header into the DataFrame's column names. Setting inferSchema to True infers the table schema (but slows down reading, since it requires an extra pass over the data).
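The slowdown from inferSchema comes from that extra pass, in which Spark tries progressively wider types for each column. A minimal pure-Python sketch of the idea (not Spark's actual implementation, and with a much smaller type ladder):

```python
def infer_type(values):
    """Guess a column's type from its string values: try int, then float, else string."""
    for cast, name in ((int, 'int'), (float, 'double')):
        try:
            for v in values:
                cast(v)       # raises ValueError if any value does not fit
            return name
        except ValueError:
            continue          # fall through to the next, wider type
    return 'string'

# Two sample data rows, transposed into columns:
rows = [['2016', '1', 'DL'],
        ['2016', '1.5', 'AA']]
columns = list(zip(*rows))
print([infer_type(col) for col in columns])  # ['int', 'double', 'string']
```

In practice, supplying an explicit schema to spark.read.csv (via its schema parameter) avoids the inference pass entirely.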

Also see here: Load CSV file with Spark
