pySpark (v2.4) DataFrameReader adds leading whitespace to column names
Here is a snippet of a CSV file that I have:
"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
1, 2222, 3, 3.5, 32312, 1981, 250000
2, 1628, 3, 2, 32308, 2009, 185000
3, 3824, 5, 4, 32312, 1954, 399000
4, 1137, 3, 2, 32309, 1993, 150000
5, 3560, 6, 4, 32309, 1973, 315000
Oddly, when I perform the following pySpark (v2.4) statements, the header column names (minus the first column) have leading whitespace. I've tried different quote and escape options, but to no avail.
Does anyone know why this is happening and how to strip the extra whitespaces on load? Thank you in advance!
>>> csv_file = '/tmp/file.csv'
>>> spark_reader = spark.read
>>> spark_reader.format('csv')
>>> spark_reader.option("inferSchema", "true")
>>> spark_reader.option("header", "true")
>>> spark_reader.option("quote", '"')
>>> df = spark_reader.load(csv_file)
>>> df.columns
['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"', ' "Zip"', ' "Year"', ' "List Price ($)"']
From the docs for pyspark.sql.DataFrameReader, you can use the ignoreLeadingWhiteSpace parameter.
ignoreLeadingWhiteSpace – A flag indicating whether or not leading whitespaces from values being read should be skipped. If None is set, it uses the default value, false.
In your case, you just need to add:
spark_reader.option("ignoreLeadingWhiteSpace", "true")