Importing text file with varying number of columns in Spark

Views: 111 Published: 2021/4/8 19:27:00 Tags: apache-spark pyspark

Question

I have a pipe-delimited file with varying numbers of columns, like this:

id|name|attribute|extraattribute
1|alvin|cool|funny
2|bob|tall
3|cindy|smart|funny

I'm trying to find an elegant way to import this into a dataframe using pyspark. I could try to fix the files to add a trailing | when the last column is missing (only the last column can be missing), but would love to find a solution that didn't involve changing the input files.

Recommended answer

You can use the method csv in the module pyspark.sql.readwriter and set mode="PERMISSIVE":

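# sqlCtx is an existing SQLContext; on Spark 2.0+ a SparkSession (spark.read.csv) works the same way
df = sqlCtx.read.csv("/path/to/file.txt", sep="|", mode="PERMISSIVE", header=True)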
df.show(truncate=False)
#+---+-----+---------+--------------+
#|id |name |attribute|extraattribute|
#+---+-----+---------+--------------+
#|1  |alvin|cool     |funny         |
#|2  |bob  |tall     |null          |
#|3  |cindy|smart    |funny         |
#+---+-----+---------+--------------+
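To sanity-check the expected result without a Spark installation, the same padding can be reproduced with Python's standard library. This is only a sketch of the outcome, not of how Spark parses the file: `csv.DictReader` fills missing trailing fields with its `restval` (`None` by default), which mirrors the `null` shown above. The inline `sample` string is an assumption standing in for the real file.

```python
import csv
import io

# Sample data copied from the question; the second data row lacks the last column.
sample = """id|name|attribute|extraattribute
1|alvin|cool|funny
2|bob|tall
3|cindy|smart|funny
"""

# DictReader takes the field names from the first row and pads short rows
# with `restval`, which defaults to None.
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
rows = [dict(row) for row in reader]

for row in rows:
    print(row)
```

The row for `bob` comes back with `extraattribute` set to `None`, matching the `null` cell in the Spark output.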

From the documentation:

PERMISSIVE : sets other fields to null when it meets a corrupted record.

When a schema is set by user, it sets null for extra fields.

This is much simpler than what I originally suggested in the comments.
