Create column based on regex matching without extraction


Problem description

I have a massive list of files like this:

file.txt
file.txt.tar.gz
file.txt.tgz
core123165
core123165.bak
file.jpg
file.jpg.bak
file.png
file.png.tgz
...

There are too many cases for me to list them all. I would like to deduce the file type from the extension or the file name.
The problem is that I would like to ignore a set of extensions such as tgz or bak. So far, here is my idea:

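import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types.StringType
// Note: createDF is not part of core Spark; it is presumably the spark-daria helper.

val DF = spark.createDF(
  List(("file.txt"),("file.txt.tar.gz"),("file.txt.tgz"),
      ("core123165"),("core123165.bak"),("file.jpg"),
      ("file.jpg.bak"),("file.png"),("file.png.tgz")),
  List(("name", StringType, true))
  )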

DF.withColumn("type",
when($"name".endsWith(".txt"), "text").
when($"name".endsWith(".txt.tar.gz"), "text").
when($"name".endsWith(".txt.tgz"), "text").
when($"name".endsWith(".txt.bz2"), "text").
when[...]
)

And so on. However, I will need a regex to identify core files, with something like ^core[0-9]{6}$, and I would like to use regex to identify the other types more easily, with something like ^.+\.txt$|^.+\.txt.zip$|^.+\.txt.gz$.
So my question is: is there a Spark/Scala method applicable to a column that does something like this:

val DF = spark.createDF(
  List(("file.txt"),("file.txt.tar.gz"),("file.txt.tgz"),
      ("core123165"),("core123165.bak"),("file.jpg"),
      ("file.jpg.bak"),("file.png"),("file.png.tgz")),
  List(("name", StringType, true))
  )

DF.withColumn("type",
when($"name".matches("^.+\.txt$|^.+\.txt.zip$|^.+\.txt.gz$|^.+\.txt.bz2$|^.+\.txt.tar.gz$|^.+\.txt.tgz$"), "text").
when($"name".matches("^core[0-9]{6}$|^core[0-9]{6}\.bak$"), "core")
[...]
)

This would greatly simplify my processing.

I know I could factor my regex further using ^.+\.txt(\.bak|\.tgz|\.bz2)$, but it was just an example.
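As a side note on the factored pattern: as written, the group is mandatory, so a plain file.txt would not match. A minimal plain-Scala sketch, assuming the intent is to accept both the bare extension and the ignored suffixes, adds a ? after the group. The names FactoredPattern, textPattern and isText are made up for illustration.

```scala
object FactoredPattern {
  // Triple-quoted string: backslashes reach the regex engine unchanged.
  // The trailing ? makes the ignored-suffix group optional.
  val textPattern = """^.+\.txt(\.bak|\.tgz|\.bz2)?$""".r

  // True when the whole name is "<something>.txt" plus an optional ignored suffix.
  def isText(name: String): Boolean = textPattern.findFirstIn(name).isDefined

  def main(args: Array[String]): Unit = {
    Seq("file.txt", "file.txt.tgz", "file.txt.bak", "file.jpg", "file.txtx")
      .foreach(n => println(s"$n -> ${isText(n)}"))
  }
}
```

The same pattern string can be passed to a Spark column method once a suitable one is identified.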

Recommended answer

rlike is the function you are looking for.

Also, inside a regular Scala string literal you need to escape each backslash \ with another backslash: \\. This would look like this:

DF.withColumn("type",
   when($"name" rlike "^.+\\.txt$|^.+\\.txt\\.zip$", "text").otherwise("other"))
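To tie this back to the full list in the question, here is a minimal sketch of how several rlike branches could be chained. It is an illustration under assumptions, not the asker's final code: the names FileTypeByRegex, ignoredSuffix and withType, the "image" label, and the exact list of ignored suffixes are all invented here, and the DataFrame is built with toDF so the example depends only on core Spark. Triple-quoted strings are used so the backslashes do not need doubling.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object FileTypeByRegex {
  // Hypothetical list of suffixes to ignore; the trailing ? makes the group optional.
  val ignoredSuffix: String = """(\.tar\.gz|\.tgz|\.gz|\.bz2|\.zip|\.bak)?"""

  // Adds a "type" column; the first when(...) branch whose regex matches wins.
  def withType(df: DataFrame): DataFrame =
    df.withColumn("type",
      when(col("name").rlike("""^.+\.txt""" + ignoredSuffix + "$"), "text")
        .when(col("name").rlike("^core[0-9]{6}" + ignoredSuffix + "$"), "core")
        .when(col("name").rlike("""^.+\.(jpg|png)""" + ignoredSuffix + "$"), "image")
        .otherwise("other"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("file-types")
      .getOrCreate()
    import spark.implicits._ // enables .toDF on a Seq

    val names = Seq(
      "file.txt", "file.txt.tar.gz", "file.txt.tgz",
      "core123165", "core123165.bak", "file.jpg",
      "file.jpg.bak", "file.png", "file.png.tgz").toDF("name")

    withType(names).show(false)
    spark.stop()
  }
}
```

rlike evaluates a Java regular expression and succeeds if the pattern is found anywhere in the value, so the ^ and $ anchors are what force the match to cover the whole file name.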
