PySpark - String matching to create new column


Problem description

I have a dataframe like:

ID             Notes
2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha 

Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID      Notes                                  Employee
2345    Checked by John                        John
2398    Verified by Stacy                      Stacy
3983    Double Checked on 2/23/17 by Marsha    Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? IE:

ID      Notes                                  John    Stacy    Marsha
2345    Checked by John                        1       0        0
2398    Verified by Stacy                      0       1        0
3983    Double Checked on 2/23/17 by Marsha    0       0        1

Solution

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from anywhere in the text column (col('Notes')), as long as the name follows by and one or more spaces.


In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

# sc is the SparkContext that the PySpark shell or a Databricks notebook predefines
df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df, extract the employee name from the Notes column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means:

  • (.) - Any single character (except newline)
  • (by) - The literal text by
  • (\s+) - One or more whitespace characters
  • (\w+) - One or more alphanumeric or underscore characters

group_number is 4 because (\w+) is the 4th capture group in the expression.
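To see what each group captures, here is a minimal sketch that uses Python's built-in re module instead of Spark. It assumes Python and Spark's Java regex engine treat this simple pattern the same way, and it reuses the sample notes from above.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'(.)(by)(\s+)(\w+)')

notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

# For each note, collect the four captured groups of the first match.
groups = [PATTERN.search(note).groups() for note in notes]

for note, g in zip(notes, groups):
    # g[3] is group 4, the word that follows "by" and the whitespace.
    print(note, '->', g)
```

In every sample row the first three groups capture a space, by, and another space, and the fourth group is the name.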

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
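Note that the pattern picks up whatever word follows by, so Srinivas is extracted even though the question lists only John, Stacy and Marsha. If only those three names should match, one option is an alternation of the known names. The sketch below is plain Python; EMPLOYEES, NAME_PATTERN and extract_employee are hypothetical names, not part of the original answer. The helper returns an empty string when nothing matches, which is also what regexp_extract does.

```python
import re

# Hypothetical list of known employees, taken from the question.
EMPLOYEES = ['John', 'Stacy', 'Marsha']

# Builds \b(John|Stacy|Marsha)\b; re.escape guards against names that
# contain regex metacharacters.
NAME_PATTERN = r'\b(' + '|'.join(re.escape(name) for name in EMPLOYEES) + r')\b'


def extract_employee(note):
    """Return the first known employee name found in note, or '' if none."""
    match = re.search(NAME_PATTERN, note)
    return match.group(1) if match else ''


print(extract_employee('Double Checked on 2/23/17 by Marsha'))
print(repr(extract_employee('Verified by Srinivas than some random text')))
```

The same pattern string could presumably be passed to regexp_extract(col('Notes'), NAME_PATTERN, 1), assuming the names contain nothing that Java's regex engine would interpret differently.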

Databricks notebook

Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) seems a much cleaner version: it needs only one capture group, so group_number is 1.
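One thing to be aware of with the leading dot: it requires some character before by, and it does not require by to be a separate word. The sketch below compares the short pattern with a word-boundary variant, again using Python's re module as a stand-in for Spark. The \b variant and the sample strings are assumptions added for illustration, not part of the original answer.

```python
import re

SHORT = r'.by\s+(\w+)'      # pattern from the note above
BOUNDED = r'\bby\s+(\w+)'   # assumed variant using a word boundary


def first_group(pattern, text):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ''


samples = ['Checked by John', 'by Stacy', 'Left nearby Marsha']
for text in samples:
    print(repr(text), repr(first_group(SHORT, text)), repr(first_group(BOUNDED, text)))
```

With these samples, the short pattern misses a note that starts with by and matches the by inside nearby, while the word-boundary variant does the opposite. Which behaviour is preferable depends on how the notes are written.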
