PySpark - String matching to create new column
Question
I have a dataframe like:
ID Notes
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:
ID Notes Employee
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Is regex or grep better here? What kind of function should I try? Thanks!
I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? i.e.:
ID Notes John Stacy Marsha
2345 Checked by John 1 0 0
2398 Verified by Stacy 0 1 0
3983 Double Checked on 2/23/17 by Marsha 0 0 1
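If the binary-column route is chosen, the per-employee indicators are simple to compute. Here is a minimal plain-Python sketch of that matching logic (the `employees` and `notes` lists are taken from the example above; in PySpark the same idea would use one `when(col('Notes').contains(name), 1)` column per employee):

```python
import re

employees = ['John', 'Stacy', 'Marsha']
notes = ['Checked by John',
         'Verified by Stacy',
         'Double Checked on 2/23/17 by Marsha']

# One 0/1 indicator per employee per note, mirroring the proposed binary columns.
# \b word boundaries avoid matching a name embedded inside a longer word.
indicators = [[1 if re.search(rf'\b{name}\b', note) else 0
               for name in employees]
              for note in notes]
```

This yields one row of indicators per note, in the same column order as `employees`.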
Answer
In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position where it follows "by" plus one or more spaces in the text column col('Notes').
In Detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),
('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Perform the required imports
from pyspark.sql.functions import regexp_extract, col
On df, extract the Employee name from the column using regexp_extract(column_name, regex, group_number).
Here the regex r'(.)(by)(\s+)(\w+)' means:
- (.) - any character (except newline)
- (by) - the literal word by in the text
- (\s+) - one or more whitespace characters
- (\w+) - one or more alphanumeric or underscore characters
and group_number is 4 because the group (\w+) is in the 4th position in the expression.
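The group numbering can be checked locally with Python's re module before running on Spark, since regexp_extract follows the same left-to-right numbering of parenthesised groups:

```python
import re

pattern = r'(.)(by)(\s+)(\w+)'
m = re.search(pattern, 'Double Checked on 2/23/17 by Marsha')

# Group 4 is the fourth parenthesised group: the word after "by".
name = m.group(4)
```

Here `name` is `'Marsha'`, matching what regexp_extract(..., 4) returns on the same row.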
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a cleaner version: it uses a single capture group, so the group number is simply 1.
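The cleaner one-group pattern can likewise be verified with re (the second note below is the extra row from the sample dataframe with trailing text after the name):

```python
import re

notes = ['Checked by John',
         'Verified by Srinivas than some random text']

# Single capture group: the word after "by" and its whitespace.
names = [re.search(r'.by\s+(\w+)', n).group(1) for n in notes]
```

Trailing text after the name does not affect the match, since (\w+) stops at the first non-word character.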