Create column based on regex matching without extraction

Views: 64 Posted: 2020/9/4 2:42:28 Tags: regex scala apache-spark


Problem description

I have a massive list of file names like this:

file.txt
file.txt.tar.gz
file.txt.tgz
core123165
core123165.bak
file.jpg
file.jpg.bak
file.png
file.png.tgz
...

There are too many cases to list them all. I would like to deduce the file type from the extension or the file name.
The problem is that I would like to ignore a set of extensions such as tgz or bak. So far, here is my idea:

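// Imports assumed by this snippet. createDF is not part of core Spark;
// it presumably comes from a helper library such as spark-daria.
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types.StringType
import spark.implicits._  // enables the $"name" column syntax

val DF = spark.createDF(
  List(("file.txt"),("file.txt.tar.gz"),("file.txt.tgz"),
      ("core123165"),("core123165.bak"),("file.jpg"),
      ("file.jpg.bak"),("file.png"),("file.png.tgz")),
  List(("name", StringType, true))
  )

// One endsWith test per suffix: this is the part that does not scale.
DF.withColumn("type",
  when($"name".endsWith(".txt"), "text").
  when($"name".endsWith(".txt.tar.gz"), "text").
  when($"name".endsWith(".txt.tgz"), "text").
  when($"name".endsWith(".txt.bz2"), "text").
  when[...]
)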

And so on. However, I will need a regex to identify core files, with something like ^core[0-9]{6}$, and I would like to use regexes to identify the other types more easily, with something like ^.+\.txt$|^.+\.txt\.zip$|^.+\.txt\.gz$.
So my question is: is there a Spark/Scala method applicable to a column that does something like this:

val DF = spark.createDF(
  List(("file.txt"),("file.txt.tar.gz"),("file.txt.tgz"),
      ("core123165"),("core123165.bak"),("file.jpg"),
      ("file.jpg.bak"),("file.png"),("file.png.tgz")),
  List(("name", StringType, true))
  )

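// `matches` is the hypothetical Column method being asked about.
// Triple-quoted strings keep the backslashes literal; every alternative is separated by '|'.
DF.withColumn("type",
  when($"name".matches("""^.+\.txt$|^.+\.txt\.zip$|^.+\.txt\.gz$|^.+\.txt\.bz2$|^.+\.txt\.tar\.gz$|^.+\.txt\.tgz$"""), "text").
  when($"name".matches("""^core[0-9]{6}$|^core[0-9]{6}\.bak$"""), "core")
  [...]
)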

This would greatly simplify my processing.
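The patterns themselves can be sanity-checked outside Spark before wiring them into a column expression. The following is only a minimal sketch in plain Scala: the object name PatternCheck, the helper classify, and the "unknown" fallback label are hypothetical and not part of the question. It writes the alternation out with every branch separated by | and with the literal dots escaped, and it relies on String.matches, which requires the whole string to match.

```scala
object PatternCheck {
  // Same alternatives as in the question, one '|' between every branch,
  // and every literal dot escaped. Triple quotes keep backslashes literal.
  val TextPattern: String =
    """^.+\.txt$|^.+\.txt\.zip$|^.+\.txt\.gz$|^.+\.txt\.bz2$|^.+\.txt\.tar\.gz$|^.+\.txt\.tgz$"""
  val CorePattern: String =
    """^core[0-9]{6}$|^core[0-9]{6}\.bak$"""

  // Hypothetical helper: label of the first pattern that matches, else "unknown".
  def classify(name: String): String =
    if (name.matches(TextPattern)) "text"
    else if (name.matches(CorePattern)) "core"
    else "unknown"

  def main(args: Array[String]): Unit = {
    val names = List(
      "file.txt", "file.txt.tar.gz", "file.txt.tgz",
      "core123165", "core123165.bak", "file.jpg",
      "file.jpg.bak", "file.png", "file.png.tgz")
    // Print each file name next to the label it receives.
    names.foreach(n => println(s"$n -> ${classify(n)}"))
  }
}
```

With these patterns the three .txt-based names are labelled text, the two core names are labelled core, and the jpg and png names fall through to unknown, since no pattern for them is defined here.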

I know I could factor the regex further, using something like ^.+\.txt(\.bak|\.tgz|\.bz2)$, but this was just an example.
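One candidate for the method being asked about is Column.rlike, which tests a column against a Java regular expression and returns a boolean column without extracting anything. The sketch below is an illustration under assumptions, not a confirmed answer from the original thread: the object name, the factored patterns, the "unknown" fallback, and the local SparkSession setup are all hypothetical, and toDF is used in place of createDF so that only core Spark is required.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object FileTypeByRegex {
  // Illustrative factored patterns (assumptions, not taken from the question verbatim).
  // rlike performs a "find", so the ^ and $ anchors are what force a whole-name match.
  val TextPattern: String = """^.+\.txt(\.(zip|gz|bz2|tgz|tar\.gz))?$"""
  val CorePattern: String = """^core[0-9]{6}(\.bak)?$"""

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("file-type-by-regex")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      "file.txt", "file.txt.tar.gz", "file.txt.tgz",
      "core123165", "core123165.bak", "file.jpg",
      "file.jpg.bak", "file.png", "file.png.tgz").toDF("name")

    // Each when(...) tests the column against one regex; the first match wins.
    val typed = df.withColumn("type",
      when(col("name").rlike(TextPattern), "text")
        .when(col("name").rlike(CorePattern), "core")
        .otherwise("unknown"))

    typed.show(truncate = false)
    spark.stop()
  }
}
```

Keeping one pattern per type means a new ignorable suffix only has to be added to the optional group of the relevant pattern, rather than as another when clause.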
