How to efficiently check if a list of words is contained in a Spark Dataframe?
Question
Using PySpark dataframes I'm trying to do the following as efficiently as possible. I have a dataframe with a column which contains text and a list of words I want to filter rows by. So:
The dataframe looks like this:
df:
col1 col2 col_with_text
a b foo is tasty
12 34 blah blahhh
yeh 0 bar of yums
The list will be list = [foo,bar]
And thus the result will be:
result:
col1 col2 col_with_text
a b foo
yeh 0 bar
Afterwards, not only will identical string matching be done, but similarity will also be tested using SequenceMatcher or the like. This is what I already tried:
def check_keywords(x):
    words_list = ['foo', 'bar']
    for word in x:
        if word == words_list[0] or word == words_list[1]:
            return x

result = df.map(lambda x: check_keywords(x)).collect()
Unfortunately I was unsuccessful, could someone help me out? Thanks in advance.
Answer
You should consider using pyspark sql module functions instead of writing a UDF; there are several regexp-based functions:
First let's start with a more complete sample data frame:
df = sc.parallelize([
    ["a", "b", "foo is tasty"],
    ["12", "34", "blah blahhh"],
    ["yeh", "0", "bar of yums"],
    ["haha", "1", "foobar none"],
    ["hehe", "2", "something bar else"],
]).toDF(["col1", "col2", "col_with_text"])
If you want to filter lines based on whether they contain one of the words in words_list, you can use rlike:
import pyspark.sql.functions as psf
words_list = ['foo', 'bar']
df.filter(psf.col('col_with_text').rlike(r'(^|\s)(' + '|'.join(words_list) + r')(\s|$)')).show()
+----+----+------------------+
|col1|col2| col_with_text|
+----+----+------------------+
| a| b| foo is tasty|
| yeh| 0| bar of yums|
|hehe| 2|something bar else|
+----+----+------------------+
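As a side note (my addition, not part of the original answer), word boundaries `\b` give a more compact pattern than the `(^|\s)...(\s|$)` alternation and match the same whitespace-delimited words here. A quick sanity check of the equivalent pattern using Python's `re` module on the sample rows:

```python
import re

# Hypothetical compact alternative to the (^|\s)...(\s|$) pattern, using \b
# word boundaries. Checked with Python's re; Spark's rlike uses Java regex,
# where \b behaves the same for plain ASCII words.
words_list = ['foo', 'bar']
pattern = r'\b(' + '|'.join(words_list) + r')\b'

rows = ['foo is tasty', 'blah blahhh', 'bar of yums',
        'foobar none', 'something bar else']
matches = [s for s in rows if re.search(pattern, s)]
print(matches)  # ['foo is tasty', 'bar of yums', 'something bar else']
```

Note that 'foobar none' is correctly rejected: `\b` requires a non-word character (or string edge) on each side, so the 'bar' inside 'foobar' does not match.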
If you want to extract the strings matching the regular expression, you can use regexp_extract:
df.withColumn(
    'extracted_word',
    psf.regexp_extract('col_with_text', r'(?=^|\s)(' + '|'.join(words_list) + r')(?=\s|$)', 0)
).show()
+----+----+------------------+--------------+
|col1|col2| col_with_text|extracted_word|
+----+----+------------------+--------------+
| a| b| foo is tasty| foo|
| 12| 34| blah blahhh| |
| yeh| 0| bar of yums| bar|
|haha| 1| foobar none| |
|hehe| 2|something bar else| |
+----+----+------------------+--------------+
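The question also mentions fuzzy matching with SequenceMatcher. A minimal sketch of how that could be wired in (my assumption, not part of the original answer): compute the best token-vs-keyword similarity in plain Python, then wrap that function in a UDF and filter on a threshold.

```python
from difflib import SequenceMatcher

words_list = ['foo', 'bar']

def best_ratio(text):
    # Highest similarity between any whitespace token of the text
    # and any keyword; 0.0 for empty text.
    return max(
        (SequenceMatcher(None, tok, w).ratio()
         for tok in text.split() for w in words_list),
        default=0.0,
    )

print(best_ratio('foo is tasty'))          # 1.0 (exact token match)
print(round(best_ratio('fooo here'), 2))   # 0.86 ('fooo' vs 'foo')

# In Spark this could be wrapped in a UDF (threshold 0.8 is an arbitrary choice):
#   ratio_udf = psf.udf(best_ratio, 'double')
#   df.filter(ratio_udf('col_with_text') > 0.8).show()
```

Keep in mind a Python UDF forfeits the performance advantage of the built-in regexp functions above, so it is best reserved for the fuzzy-matching step only.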