How to extract number from string column?


Problem description

I need to retrieve the order number from the comment column. The order number always starts with R and should be added to the table as a new column.

Input data:

code,id,mode,location,status,comment 
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received

Expected output:

AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856 
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463 

I tried this in spark-sql with the following query:

val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")

However, I get the following error:

org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0

Recommended answer

Spark SQL has no PatIndex function, hence the AnalysisException. You can use regexp_extract instead, which is defined as:

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

(R\\d{4}) means R followed by 4 digits (the backslash is doubled because the pattern is written as a Scala string literal). You can adapt it to other cases by using any valid regex.
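To see what the pattern matches independently of Spark, here is a minimal plain-Scala sketch. The object name OrderPatternCheck, the helper extract, and the fourth comment (a row with no order number) are hypothetical and not part of the original question.

```scala
object OrderPatternCheck {
  // Triple-quoted string, so the backslash does not need to be doubled.
  val orderPattern = """R\d{4}""".r

  // Returns Some(orderNumber) for the first match, or None when nothing matches.
  def extract(comment: String): Option[String] =
    orderPattern.findFirstIn(comment)

  def main(args: Array[String]): Unit = {
    val comments = Seq(
      "order got delayed R1657",
      "R7856 package damaged",
      "Order number R5463 not received",
      "no order number here" // hypothetical row, not in the original data
    )
    // Print each comment next to what the pattern extracts from it.
    comments.foreach(c => println(s"$c -> ${extract(c)}"))
  }
}
```

Plain Scala reports a missing match as None, whereas Spark's regexp_extract returns an empty string in that case. In Spark the same pattern is passed to regexp_extract: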

df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show

+-----+---+-------+---------+------+--------------------+-------+
| code| id|   mode| location|status|             comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad|     D|order got delayed...|  R1657|
|FY-YT|102|Airways|    Delhi|    ND|R7856 package dam...|  R7856|
|TY-OP|103|Airways|     Pune|     D|Order number R546...|  R5463|
+-----+---+-------+---------+------+--------------------+-------+
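For completeness, here is a self-contained sketch that rebuilds the sample data and applies the extraction both through the DataFrame API and through a SQL query, which is what the question originally attempted. It assumes Spark 2.0 or later (SparkSession, createOrReplaceTempView); on Spark 1.x the rough equivalents would be sqlContext.sql and registerTempTable. The object name ExtractOrderId, the helper addOrderId, and the app name are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract}

object ExtractOrderId {
  // DataFrame API: group 1 is the parenthesised part of the pattern.
  def addOrderId(df: DataFrame): DataFrame =
    df.withColumn("orderId", regexp_extract(col("comment"), "(R\\d{4})", 1))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("extract-order-id")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The three sample rows from the question.
    val df = Seq(
      ("AS-SD", 101, "Airways", "hyderabad", "D", "order got delayed R1657"),
      ("FY-YT", 102, "Airways", "Delhi", "ND", "R7856 package damaged"),
      ("TY-OP", 103, "Airways", "Pune", "D", "Order number R5463 not received")
    ).toDF("code", "id", "mode", "location", "status", "comment")

    addOrderId(df).show(false)

    // SQL variant: register the data under the table name A used in the question.
    // [0-9] replaces \d to sidestep backslash escaping inside the SQL string literal,
    // and group index 0 returns the whole match.
    df.createOrReplaceTempView("A")
    spark.sql("select *, regexp_extract(comment, 'R[0-9]{4}', 0) as orderId from A").show(false)

    spark.stop()
  }
}
```

Passing false to show disables the 20-character truncation, so the full comment text is printed. Rows whose comment contains no match get an empty string in orderId rather than null.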
