pyspark query and sql pyspark query


Problem description

Hey, I have a dataframe that contains rows with these columns: date and text. I need to find how many rows contain the word "corona" per day (in two ways: dataframes and SQL).

  • The word "corona" must be a whole word, not a substring, and if the word has a punctuation mark next to it, I need to count it as well.

I started by removing the punctuation from the text column. Then I added an indicator column to mark whether a row has the word corona in it. After that I summed the check column and grouped by the date column.
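(The punc_udf referenced in the code below is not shown in the question; a minimal sketch of what such a punctuation-stripping UDF might look like, where the name and implementation are assumptions:)

import string
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# hypothetical stand-in for the punctuation-removal UDF used below;
# strips all ASCII punctuation from a string, passing None through
punc_udf = F.udf(
    lambda s: s.translate(str.maketrans('', '', string.punctuation)) if s else s,
    StringType())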

1. I wanted to ask: is this the right way to do such a thing?

2. I tried to translate this to a PySpark SQL query (I need to add the check column with SQL code if I am using this way), but the results were very different, so how can I translate this?

dataframes way:

# above I defined the punctuation function and read the data into df
df = df.withColumn('no_punc_text', punc_udf('text'))
df = df.select('no_punc_text', 'dates')
df.registerTempTable('my_table')
# note: Python's `or` returns its first truthy operand, so the expression
# below passes only the single pattern "corona " to rlike
df = df.withColumn("check", F.col("no_punc_text")
                   .rlike("corona " or " corona" or " corona ")
                   .cast("Integer"))
dfway = df.groupBy("dates").sum('check')
the sql way:

sqlw = spark.sql(
    """
    select dates, sum(
      case when (no_punc_text rlike 'corona ') then 1
           when (no_punc_text rlike ' corona') then 1
           when (no_punc_text rlike ' corona ') then 1 else 0 end
    ) as check
    from my_table group by dates
    """)

Answer

Use word boundary (\b) as below:

    import spark.implicits._
    import org.apache.spark.sql.functions._

    val df = Seq("corona", "corona?", "this is corona", "coronavirus", "corona's", "is this corona?")
      .toDF("text")
      .withColumn("dates", monotonically_increasing_id())
    df.show(false)
    df.printSchema()

    /**
      * +---------------+-----+
      * |text           |dates|
      * +---------------+-----+
      * |corona         |0    |
      * |corona?        |1    |
      * |this is corona |2    |
      * |coronavirus    |3    |
      * |corona's       |4    |
      * |is this corona?|5    |
      * +---------------+-----+
      *
      * root
      * |-- text: string (nullable = true)
      * |-- dates: long (nullable = false)
      */

Detect the word corona based on the below requirement:

"The word corona needs to be a whole word, not a substring, and if the word has a punctuation mark next to it I need to count that as well."

    df.createOrReplaceTempView("my_table")
    spark.sql(
      """
        | select dates, sum(
        |         case when (text rlike '\\bcorona\\b') then 1
        |         else 0 end
        |        ) as check
        |        from my_table group by dates
      """.stripMargin)
      .show(false)

    /**
      * +-----+-----+
      * |dates|check|
      * +-----+-----+
      * |2    |1    |
      * |4    |1    |
      * |5    |1    |
      * |0    |1    |
      * |1    |1    |
      * |3    |0    |
      * +-----+-----+
      */

Please note that the string coronavirus is not detected as corona, since you don't want substrings to be considered.

In Python:

sqlw = spark.sql(
      """
         select dates, sum(
          case when (text rlike '\\bcorona\\b') then 1
          else 0 end
         ) as check
         from my_table group by dates
      """)
