pyspark query and sql pyspark query
Question
Hey, I have a dataframe whose rows contain these columns: date and text. I need to find how many rows contain the word "corona" per day (two ways: dataframes and SQL).
- The word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it I need to count that as well.
I started by removing the punctuation from the text column. Then I added an indicator column to mark whether a row contains the word corona. After that I summed the check column and grouped by the date column.
1. I wanted to ask: is this the right way to do such a thing?
2. I tried to translate this to a pyspark SQL query (if I use this way, I need to add the check column with SQL code), but the results were very different. So how can I translate this?
dataframes way:
# above I defined the punctuation function and read the data into df
df = df.withColumn('no_punc_text',punc_udf('text'))
df = df.select('no_punc_text','dates')
df.registerTempTable('my_table')
df = df.withColumn("check", F.col("no_punc_text").rlike("corona " or " corona" or " corona ").cast("Integer"))
dfway = df.groupBy("dates").sum('check')
the sql way:
sqlw = spark.sql(
"""
select dates, sum(
case when (no_punc_text rlike 'corona ') then 1
when (no_punc_text rlike ' corona') then 1
when (no_punc_text rlike ' corona ') then 1 else 0 end
) as check
from my_table group by dates
""")
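A likely reason the two versions disagree: in the DataFrame attempt, the `or` chain between the pattern strings is plain Python and is evaluated before Spark ever sees it. Python's `or` returns its first truthy operand, and any non-empty string is truthy, so `rlike` only ever receives the single pattern "corona ":

```python
# Python's `or` returns its first truthy operand; every non-empty
# string is truthy, so the chain collapses to its first element.
pattern = "corona " or " corona" or " corona "
print(pattern)  # the two later alternatives are silently dropped
```

To test several alternatives in one regex you would need alternation inside the pattern string itself (e.g. "corona | corona"), though the word-boundary pattern in the accepted answer below makes that unnecessary.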
Recommended answer
Use a word boundary (\b) as below -
val df = Seq("corona", "corona?", "this is corona", "coronavirus", "corona's", "is this corona?")
.toDF("text")
.withColumn("dates", monotonically_increasing_id())
df.show(false)
df.printSchema()
/**
* +---------------+-----+
* |text |dates|
* +---------------+-----+
* |corona |0 |
* |corona? |1 |
* |this is corona |2 |
* |coronavirus |3 |
* |corona's |4 |
* |is this corona?|5 |
* +---------------+-----+
*
* root
* |-- text: string (nullable = true)
* |-- dates: long (nullable = false)
*/
Detect the word corona per the below requirement -
"the word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it I need to count that as well."
df.createOrReplaceTempView("my_table")
spark.sql(
"""
| select dates, sum(
| case when (text rlike '\\bcorona\\b') then 1
| else 0 end
| ) as check
| from my_table group by dates
""".stripMargin)
.show(false)
/**
* +-----+-----+
* |dates|check|
* +-----+-----+
* |2 |1 |
* |4 |1 |
* |5 |1 |
* |0 |1 |
* |1 |1 |
* |3 |0 |
* +-----+-----+
*/
Note that the coronavirus string is not detected as corona, since you don't want to consider substrings.
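The word-boundary behavior is easy to check locally with Python's stdlib re module, which interprets \b the same way as the Java regex engine Spark's rlike uses (the sample strings below mirror the answer's test data):

```python
import re

# \b matches the empty string at a word boundary: a transition between
# a word character ([A-Za-z0-9_]) and a non-word character or string edge.
pattern = re.compile(r"\bcorona\b")

samples = ["corona", "corona?", "this is corona",
           "coronavirus", "corona's", "is this corona?"]
matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

"coronavirus" is the only sample that fails: between "corona" and "virus" both characters are word characters, so no boundary exists there.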
In python -
sqlw = spark.sql(
"""
select dates, sum(
case when (text rlike '\\bcorona\\b') then 1
else 0 end
) as check
from my_table group by dates
""")