PySpark - String matching to create new column
Question
I have a dataframe like:
ID Notes
2345 Checked by John
2398 Verified by Stacy
3983 Double Checked on 2/23/17 by Marsha
Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:
ID Notes Employee
2345 Checked by John John
2398 Verified by Stacy Stacy
3983 Double Checked on 2/23/17 by Marsha Marsha
Is regex or grep better here? What kind of function should I try? Thanks!
I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? i.e.:
ID Notes John Stacy Marsha
2345 Checked by John 1 0 0
2398 Verified by Stacy 0 1 0
3983 Double Checked on 2/23/17 by Marsha 0 0 1
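If the binary-column route is chosen, the per-employee indicators are simple to compute. Here is a minimal plain-Python sketch of that matching logic (the `employees` and `notes` lists are taken from the example above; in PySpark the same idea would use one `when(col('Notes').contains(name), 1)` column per employee):

```python
import re

employees = ['John', 'Stacy', 'Marsha']
notes = ['Checked by John',
         'Verified by Stacy',
         'Double Checked on 2/23/17 by Marsha']

# One 0/1 indicator per employee per note, mirroring the proposed binary columns.
# \b word boundaries avoid matching a name embedded inside a longer word.
indicators = [[1 if re.search(rf'\b{name}\b', note) else 0
               for name in employees]
              for note in notes]
```

This yields one row of indicators per note, in the same column order as `employees`.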
Answer
In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position where it follows "by" plus one or more spaces in the text column col('Notes').
In Detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),
('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Perform the required imports
from pyspark.sql.functions import regexp_extract, col
On df, extract the Employee name from the column using regexp_extract(column_name, regex, group_number).
Here the regex r'(.)(by)(\s+)(\w+)' means:
- (.) - any character (except newline)
- (by) - the literal word by in the text
- (\s+) - one or more whitespace characters
- (\w+) - one or more alphanumeric or underscore characters
and group_number is 4 because the group (\w+) is in the 4th position in the expression.
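The group numbering can be checked locally with Python's re module before running on Spark, since regexp_extract follows the same left-to-right numbering of parenthesised groups:

```python
import re

pattern = r'(.)(by)(\s+)(\w+)'
m = re.search(pattern, 'Double Checked on 2/23/17 by Marsha')

# Group 4 is the fourth parenthesised group: the word after "by".
name = m.group(4)
```

Here `name` is `'Marsha'`, matching what regexp_extract(..., 4) returns on the same row.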
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a cleaner version: it uses a single capture group, so the group number is simply 1.
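The cleaner one-group pattern can likewise be verified with re (the second note below is the extra row from the sample dataframe with trailing text after the name):

```python
import re

notes = ['Checked by John',
         'Verified by Srinivas than some random text']

# Single capture group: the word after "by" and its whitespace.
names = [re.search(r'.by\s+(\w+)', n).group(1) for n in notes]
```

Trailing text after the name does not affect the match, since (\w+) stops at the first non-word character.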