How to filter out rows containing unreadable characters from a Spark DataFrame


Problem description

I am reading a Parquet file that contains fields such as device ID, IMEI, and so on. The Parquet file was produced by reading a sequence file made of cascading.tuple.Tuple objects.

Some rows contain unreadable characters, and I want to drop those rows entirely.

Here is how I am reading the file:

val sparkSession = SparkSession.builder().master(sparkMaster).appName(sparkAppName).config("spark.driver.memory", "32g").getOrCreate()

sparkSession.sparkContext.hadoopConfiguration.set("io.serializations", "cascading.tuple.hadoop.TupleSerialization") 

val df=sparkSession.read.parquet("hdfs://**.46.**.2*2:8020/test/oldData.parquet")

df.printSchema()

val filteredDF=df.select($"$DEVICE_ID", $"$DEVICE_ID_NEW", $"$IMEI", $"$WIFI_MAC_ADDRESS", $"$BLUETOOTH_MAC_ADDRESS", $"$TIMESTAMP").filter($"$TIMESTAMP" > 1388534400 && $"$TIMESTAMP" < 1483228800)

filteredDF.show(100)

import org.apache.spark.sql.functions.{udf,col,regexp_replace,trim}

val len=udf{ColVal:String => ColVal.size}

val new1DF=filteredDF.select(trim(col("deviceId")))

new1DF.show(100)

val newDF=new1DF.filter((len(col("deviceId")) <20))

newDF.show(100)

Even after filtering for device IDs shorter than 20 characters, I still get rows with very long device IDs that consist mostly of whitespace and unreadable characters.

Can someone point me to anything that might help me filter out such rows?

I have also tried to filter out device IDs containing characters from the Unicode Specials block, using this:

df.filter($"$DEVICE_ID" rlike "/[^\uFFFD]/g")


I got an empty DataFrame.

Schema:

root
 |-- deviceId: string (nullable = true)
 |-- deviceIdNew: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- wifiMacAddress: string (nullable = true)
 |-- bluetoothMacAddress: string (nullable = true)
 |-- timestamp: long (nullable = true)

Rows with unreadable characters:

+--------------------+
|      trim(deviceId)|
+--------------------+
|                    |
|+~C���...|
|���
    Cv�...|
|���
    Cv�...|
|             �#Inten|
|                �$
                   �|
|                    |
|                    |
|                    |
|                    |
|    0353445a712d877b|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    0577bc8a29754939|
|    08bdae9e37b48080|


Recommended answer

    val filteredDF=df.select("deviceId")
                     .filter((len(col("deviceId")) <17))
                     .filter($"$DEVICE_ID" rlike "^([A-Z]|[0-9]|[a-z])+$") 

This solved the problem.

What I had been missing were the regex anchors: ^ for the start of the input and $ for the end. With both in place, only rows whose entire deviceId value matches the pattern get through the filter.
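The effect of the anchors can be checked outside Spark. The sketch below is a plain-Scala illustration, not part of the original answer. It assumes that rlike behaves like java.util.regex's find(), where a match anywhere in the value counts. The object name AnchorDemo, the helper found, and the two sample values are hypothetical, and [A-Za-z0-9] is used as a shorter equivalent of ([A-Z]|[0-9]|[a-z]). The last check also suggests why the earlier "/[^\uFFFD]/g" attempt returned nothing: Java regexes have no /.../g delimiters, so the slashes and the g are matched as literal characters.

```scala
import java.util.regex.Pattern

object AnchorDemo {
  // True if the pattern matches anywhere inside the value (find semantics).
  def found(regex: String, value: String): Boolean =
    Pattern.compile(regex).matcher(value).find()

  // Hypothetical sample values modelled on the output shown in the question.
  val clean = "0577bc8a29754939" // a readable device ID
  val dirty = "\uFFFD#Inten"     // starts with U+FFFD, the replacement character

  def main(args: Array[String]): Unit = {
    // Unanchored: the substring "Inten" is enough, so the dirty value passes.
    println(found("[A-Za-z0-9]+", dirty))   // true

    // Anchored: the whole value must be letters or digits, so it is rejected.
    println(found("^[A-Za-z0-9]+$", dirty)) // false
    println(found("^[A-Za-z0-9]+$", clean)) // true

    // JavaScript-style literal: requires a literal "/", one character, then "/g".
    // No device ID contains that sequence, so nothing matches.
    println(found("/[^\uFFFD]/g", clean))   // false
  }
}
```

Under the same assumption, an unanchored pattern keeps any row that contains at least one alphanumeric run, which is why anchoring both ends is what makes the filter strict.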

This website really helped me generate and test the regular expression I needed.
