Strip or Regex function in Spark 1.3 Dataframe
Question
I have some code from PySpark 1.5 that I unfortunately have to port backwards to Spark 1.3. I have a column with elements that are alphanumeric, but I only want the digits. An example of an element in 'old_col' of 'df' is:
'125 Bytes'
In Spark 1.5 I was able to use
df.withColumn('new_col',F.regexp_replace('old_col','(\D+)','').cast("long"))
However, I cannot seem to come up with a solution using old 1.3 methods like SUBSTR or RLIKE. The reason is that the number of digits in front of "Bytes" will vary in length, so what I really need is the 'replace' or 'strip' functionality that I can't find in Spark 1.3. Any suggestions?
Answer
As long as you use a HiveContext, you can execute the corresponding Hive UDFs, either with selectExpr:
df.selectExpr("regexp_extract(old_col,'([0-9]+)', 1)")
or using plain SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT regexp_extract(old_col,'([0-9]+)', 1) FROM df")
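Since the original 1.5 code cast the result to a long, the same can be done in 1.3 with a SQL cast inside the expression. Below is a minimal sketch; the extract_digits helper is hypothetical, added only to sanity-check the pattern locally without a cluster, and sqlContext/df are assumed to exist as above:

```python
import re

# Group 1 of this pattern captures the leading run of digits,
# mirroring the Hive expression regexp_extract(old_col, '([0-9]+)', 1).
PATTERN = '([0-9]+)'

def extract_digits(s):
    # Hypothetical local stand-in for Hive's regexp_extract + cast,
    # useful for verifying the pattern on sample strings.
    m = re.search(PATTERN, s)
    return int(m.group(1)) if m else None

print(extract_digits('125 Bytes'))  # -> 125

# On the DataFrame itself (Spark 1.3, HiveContext assumed as sqlContext),
# the extract and the cast to long can be combined in one expression:
#   df.selectExpr(
#       "cast(regexp_extract(old_col, '([0-9]+)', 1) as bigint) as new_col")
```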