Pyspark: Remove UTF null character from pyspark dataframe
Problem description
I have a pyspark dataframe similar to the following:
from pyspark.sql import Row

df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])
Where one of the values for column e contains the UTF null character \u0000. If I try to load this df into a postgresql database, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?
I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:
from pyspark.sql.functions import regexp_replace, encode, decode, col

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))
Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.
I am using a postgres 9.4.9 database with UTF8 encoding.
Recommended answer
Ah wait - I think I have it. If I do something like this, it seems to work:
from pyspark.sql.functions import regexp_replace

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))
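As a quick sanity check (a minimal sketch, assuming the same df and column names as above), you can count how many rows still contain the character after the replacement:

# Should return 0 if the null character was stripped from column e
new_df.filter(new_df['e'].contains(null)).count()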
And then mapping to all string columns:
string_columns = ['d', 'e']
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)
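Since the question asks to avoid hardcoding which columns to clean, one option (a minimal sketch, not part of the original answer) is to derive string_columns from the dataframe's schema instead of listing the columns by hand:

from pyspark.sql.types import StringType

# Pick out every string-typed column from the schema automatically
string_columns = [f.name for f in df.schema.fields
                  if isinstance(f.dataType, StringType)]
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)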