Spark 2.3.0 Read Text File With Header Option Not Working


Question

The code below runs and creates a Spark DataFrame from a text file. However, I'm trying to use the header option so that the first line is used as the column names, and for some reason that doesn't seem to be happening. I cannot understand why! It must be something stupid, but I cannot solve this.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count")\
...     .config("spark.some.config.option", "some-value")\
...     .getOrCreate()
>>> df = spark.read.option("header", "true")\
...     .option("delimiter", ",")\
...     .option("inferSchema", "true")\
...     .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)

This returns:

[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'), Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'), Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]

>>> df.columns

This returns:

['value']

Recommended answer

Issue

The issue is that you are calling the .text API instead of .csv or .load. The .text reader does no CSV parsing, so the header, delimiter and inferSchema options have no effect on it. The .text API documentation says:

def text(self, paths):
    """
    Loads text files and returns a :class:`DataFrame` whose schema starts with a
    string column named "value", and followed by partitioned columns if there
    are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
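To see why the header line ends up as an ordinary row, it can help to mimic the two kinds of reader in plain Python, with no Spark involved. The sketch below is only an analogy and not Spark's implementation: `read_as_text` and `read_as_csv` are hypothetical helper names, and the sample data is just the three lines shown in the question.

```python
import csv
import io

# First three lines of the file, as shown in the question.
SAMPLE = (
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "2010-07-21,24.333,24.333,23.946,23.946,43321,0\n"
    "2010-07-22,24.644,24.644,24.362,24.487,18031,0\n"
)


def read_as_text(data):
    """Line-oriented read: one 'value' field per line, header line included."""
    return [{"value": line} for line in data.splitlines()]


def read_as_csv(data):
    """Header-aware read: the first line supplies the field names."""
    return list(csv.DictReader(io.StringIO(data)))


text_rows = read_as_text(SAMPLE)
csv_rows = read_as_csv(SAMPLE)

print(list(text_rows[0]))  # field names seen by the line-oriented read
print(list(csv_rows[0]))   # field names seen by the header-aware read
```

The line-oriented read keeps the header as its first record and exposes a single field, which matches what the question observed with .text. Only a reader that actually parses the delimiter can turn the first line into column names.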

Solution using .csv

Change the .text function call to .csv and the header option will be applied:

df = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .csv("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)

This should give you:

+-------------------+------+------+------+------+------+-------+
|Date               |Open  |High  |Low   |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0      |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0      |
+-------------------+------+------+------+------+------+-------+

Solution using .load

.load assumes the data source format is Parquet if no format is defined, so you also need to specify the format:

# Spark 2.x ships a built-in CSV source, so the short name "csv" is enough here.
df = spark.read\
    .format("csv")\
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)

I hope the answer is helpful.
