Spark 2.3.0 Read Text File With Header Option Not Working


Problem description


The code below is working and creates a Spark dataframe from a text file. However, I'm trying to use the header option to use the first line as the header, and for some reason it doesn't seem to be happening. I cannot understand why! It must be something stupid, but I cannot solve this.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()
>>> df = spark.read.option("header", "true")\
    .option("delimiter", ",")\
    .option("inferSchema", "true")\
    .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)

Returns the following:

[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'), Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'), Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]

>>> df.columns

Returns the following:

['value']

Solution

Issue

The issue is that you are using the .text API call instead of .csv or .load. If you read the .text API documentation, it says:

def text(self, paths):
    """Loads text files and returns a :class:`DataFrame` whose schema starts
    with a string column named "value", and followed by partitioned columns
    if there are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
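In other words, .text never parses delimiters or a header row; every line of the file, including the header line, lands as one string in a single column named value. Just to illustrate why the header option has no effect here, this is roughly what parsing that value column by hand would look like (a minimal sketch assuming the same file; the .csv route below is the one to actually use):

from pyspark.sql import functions as F

raw = spark.read.text("StockData/ETFs/aadr.us.txt")
parts = F.split(raw["value"], ",")

# Project the comma-separated pieces into named, typed columns and drop
# the header line -- exactly the work the .csv reader does for you.
parsed = raw.select(parts.getItem(0).alias("Date"),
                    parts.getItem(1).cast("double").alias("Open"),
                    parts.getItem(4).cast("double").alias("Close"))\
    .filter(F.col("Date") != "Date")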

Solution using .csv

Change the .text function call to .csv and you should be fine:

df = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .csv("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)

which should give you

+-------------------+------+------+------+------+------+-------+
|Date               |Open  |High  |Low   |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0      |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0      |
+-------------------+------+------+------+------+------+-------+
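To double-check that the header row really became the column names and that inferSchema picked up the types, a quick sanity check (assuming the same file as above) is:

print(df.columns)
# ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'OpenInt']

df.printSchema()

printSchema() also shows the inferred types; for example, Date is inferred as a timestamp, which is why it prints as 2010-07-21 00:00:00 above.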

Solution using .load

.load assumes the file is in Parquet format if no format option is defined, so you need to define a format option as well:

df = spark.read\
    .format("com.databricks.spark.csv")\
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)
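As a side note, in Spark 2.x the CSV reader is built in and "com.databricks.spark.csv" is just a long-form alias for it, so the same .load call also works with the short format name (equivalent to the snippet above):

df = spark.read\
    .format("csv")\
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")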

I hope the answer is helpful.
