Load CSV file with Spark

Views: 236 | Published: 2017/2/24 15:13:49 | Tags: python csv apache-spark pyspark


Problem description

I'm new to Spark and I'm trying to read CSV data from a file with Spark.
Here's what I am doing:
sc.textFile('file.csv') \
    .map(lambda line: (line.split(',')[0], line.split(',')[1])) \
    .collect()
I would expect this call to give me a list of the first two columns of my file, but I'm getting this error:
File "<ipython-input-60-73ea98550983>", line 1, in <lambda>
IndexError: list index out of range
although my CSV file has more than one column.
Solution
Are you sure that all the lines have at least 2 columns? Could you try something like this, just to check:
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .collect()
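
To see why a single short line is enough to trigger the IndexError, here is a minimal sketch in plain Python that needs no Spark context. The sample lines are made up for illustration and only stand in for the contents of file.csv; the list comprehensions mirror the split, filter, and map steps above.

```python
# Hypothetical sample lines standing in for the contents of file.csv.
lines = ["a,b,c", "", "d,e,f", "no_comma_here"]


def first_two(line):
    """Same logic as the lambda in the question."""
    fields = line.split(",")
    return (fields[0], fields[1])


# An empty line splits into [''] (one element), so index 1 does not exist.
try:
    [first_two(line) for line in lines]
    failed = False
except IndexError:
    failed = True

# Same steps as the pipeline above: split, keep rows with 2+ fields, project.
split_rows = [line.split(",") for line in lines]
pairs = [(row[0], row[1]) for row in split_rows if len(row) > 1]

print(failed)  # True
print(pairs)   # [('a', 'b'), ('d', 'e')]
```

An empty line (for example a trailing newline at the end of the file) or a line without any comma both produce a one-element list, which is the usual suspect here.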
Alternatively, you could print the culprit (if any):
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) <= 1) \
    .collect()
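
If it helps to know where the offending lines are, the same check can carry a line number along. This is only a sketch in plain Python on made-up sample lines; find_short_rows and min_fields are hypothetical names introduced for illustration. On an RDD, zipWithIndex() could play a similar role to enumerate here, with the difference that its indices start at 0.

```python
# Hypothetical sample lines standing in for the contents of file.csv.
lines = ["a,b,c", "", "d,e,f", "no_comma_here"]


def find_short_rows(lines, min_fields=2):
    """Return (line_number, raw_line) for rows with fewer than min_fields fields."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if len(line.split(",")) < min_fields
    ]


culprits = find_short_rows(lines)
print(culprits)  # [(2, ''), (4, 'no_comma_here')]
```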


                        