Spark Regexp: Split column based on date


Problem description

I have a column, called "data", in my dataframe that looks like this:

{"blah:" blah," blah:" blah""10/7/17service

I would like to separate this into three different columns that look like:

col1: {"blah:" blah," blah:" blah"
col2: 10/7/17
col3: service

I have tried this:

val split = df
  .withColumn("col1", regexp_extract($"data", /(0[1-9]|1[012])[-\/.](0[1-9]|[12][0-9]|3[01])[-\/.](19|20)\d\d/, 1))
  .withColumn("col2", regexp_extract($"data", /(0[1-9]|1[012])[-\/.](0[1-9]|[12][0-9]|3[01])[-\/.](19|20)\d\d/, 2))

But this regex doesn't really get me through the door. I feel like I'm missing something about how the regex operator works in Spark. Any ideas?

Thanks so much!! :)

Edit, with rules for the columns:

  • col1: everything before the date value
  • col2: the date value
  • col3: everything after the date value

Answer

OK, so you want:

  • col1: match until it finds the last "
  • col2: match the date
  • col3: the rest of the string

The regex you need is:

    /(.+")(\d{1,2}\/\d{1,2}\/\d{1,2})(.+)/
    

However, when you use it in the regexp_extract() function, you must escape the backslashes, so for each column you'll use:

    regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", N)
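As an aside not in the original answer: in Scala you can sidestep the double escaping entirely with a triple-quoted string literal, which does not process backslash escapes. A minimal sketch (the object name is illustrative):

```scala
// Sketch (assumption, not part of the original answer): a triple-quoted
// Scala string literal leaves backslashes untouched, so the pattern can
// be written exactly as the raw regex, with no doubled backslashes.
object TripleQuoteDemo {
  val singleEscaped = "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)"
  val tripleQuoted  = """(.+")(\d{1,2}\/\d{1,2}\/\d{1,2})(.+)"""

  def main(args: Array[String]): Unit = {
    // Both literals denote exactly the same pattern string.
    println(singleEscaped == tripleQuoted)  // true
  }
}
```

Either literal can be passed to regexp_extract; the triple-quoted form is just easier to read and to keep in sync across the three withColumn calls.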

Based on the code you wrote, try this:

    val separate = df
      .withColumn("col1", regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 1))
      .withColumn("col2", regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 2))
      .withColumn("col3", regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 3))
    
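Since Spark's regexp_extract follows Java-regex semantics, the grouping logic can be sanity-checked outside Spark. A minimal plain-Scala sketch against the sample string from the question (the object and method names are illustrative, not part of the original answer):

```scala
import java.util.regex.Pattern

// Sketch: verifying the answer's pattern with plain java.util.regex,
// which is what Spark's regexp_extract uses under the hood.
object DateSplitCheck {
  private val pattern =
    Pattern.compile("(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)")

  // Returns (col1, col2, col3) when the whole string matches.
  def split(data: String): Option[(String, String, String)] = {
    val m = pattern.matcher(data)
    if (m.matches()) Some((m.group(1), m.group(2), m.group(3))) else None
  }

  def main(args: Array[String]): Unit = {
    val sample = "{\"blah:\" blah,\" blah:\" blah\"10/7/17service"
    // The greedy (.+\") backtracks to the LAST quote before the date,
    // so everything ahead of 10/7/17 lands in group 1.
    println(split(sample))
  }
}
```

This is why the regex works on the messy input: group 1 greedily swallows every embedded quote, group 2 captures the date, and group 3 takes whatever remains.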

