Spark 2.3.0 Read Text File With Header Option Not Working
Problem description
The code below runs and creates a Spark DataFrame from a text file. However, I'm trying to use the header option to treat the first line as the header, and for some reason it has no effect. I cannot understand why. It must be something simple, but I cannot solve it.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> df = spark.read.option("header", "true") \
...     .option("delimiter", ",") \
...     .option("inferSchema", "true") \
...     .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)
This returns the following:
[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'), Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'), Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]
>>> df.columns
This returns the following:
['value']
Recommended answer
Issue
The issue is that you are calling the .text API instead of .csv or .load. The .text reader returns every line as a single string column and does not apply CSV options such as header, delimiter, or inferSchema. The .text API documentation says:
def text(self, paths):
    """
    Loads text files and returns a :class:`DataFrame` whose schema starts with a
    string column named "value", and followed by partitioned columns if there
    are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
Solution using .csv
Change the .text function call to .csv and you should be fine:
df = spark.read.option("header", "true") \
.option("delimiter", ",") \
.option("inferSchema", "true") \
.csv("StockData/ETFs/aadr.us.txt")
df.show(2, truncate=False)
which should give you
+-------------------+------+------+------+------+------+-------+
|Date               |Open  |High  |Low   |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0      |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0      |
+-------------------+------+------+------+------+------+-------+
Solution using .load
By default, .load assumes the file is in Parquet format, so you also need to specify the format:
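# "com.databricks.spark.csv" is the name of the older external CSV package;
# Spark 2.x also accepts the short name "csv" for its built-in CSV source.
df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")
df.show(2, truncate=False)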
I hope the answer is helpful.