pyspark query and sql pyspark query


Problem description

Hey, I have a dataframe that contains rows with these columns: date and text. I need to find how many rows contain the word "corona" per day (in two ways: dataframes and SQL).

  • The word "corona" must be a whole word, not a substring, and if the word has a punctuation mark next to it, I need to count it as well.

I started by removing the punctuation from the text column. Then I added an indicator column to mark whether a row has the word corona in it. After that I summed the check column and grouped by the date column.
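(The punc_udf referenced in the code below is not shown in the question; a minimal sketch of what such a punctuation-stripping UDF might look like, where the name and implementation are assumptions:)

import string
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# hypothetical stand-in for the punctuation-removal UDF used below;
# strips all ASCII punctuation from a string, passing None through
punc_udf = F.udf(
    lambda s: s.translate(str.maketrans('', '', string.punctuation)) if s else s,
    StringType())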

1. I wanted to ask: is this the right way to do such a thing?

2. I tried to translate this to a PySpark SQL query (I need to add the check column with SQL code if I am using this way), but the results were very different, so how can I translate this?

dataframes way:

# above I defined the punctuation function and read the data into df
df = df.withColumn('no_punc_text', punc_udf('text'))
df = df.select('no_punc_text', 'dates')
df.registerTempTable('my_table')
# note: Python's `or` returns its first truthy operand, so the expression
# below passes only the single pattern "corona " to rlike
df = df.withColumn("check", F.col("no_punc_text")
                   .rlike("corona " or " corona" or " corona ")
                   .cast("Integer"))
dfway = df.groupBy("dates").sum('check')
the sql way:

sqlw = spark.sql(
    """
    select dates, sum(
      case when (no_punc_text rlike 'corona ') then 1
           when (no_punc_text rlike ' corona') then 1
           when (no_punc_text rlike ' corona ') then 1 else 0 end
    ) as check
    from my_table group by dates
    """)

Answer

Use word boundary (\b) as below:

    import spark.implicits._
    import org.apache.spark.sql.functions._

    val df = Seq("corona", "corona?", "this is corona", "coronavirus", "corona's", "is this corona?")
      .toDF("text")
      .withColumn("dates", monotonically_increasing_id())
    df.show(false)
    df.printSchema()

    /**
      * +---------------+-----+
      * |text           |dates|
      * +---------------+-----+
      * |corona         |0    |
      * |corona?        |1    |
      * |this is corona |2    |
      * |coronavirus    |3    |
      * |corona's       |4    |
      * |is this corona?|5    |
      * +---------------+-----+
      *
      * root
      * |-- text: string (nullable = true)
      * |-- dates: long (nullable = false)
      */

Detect the word corona based on the below requirement:

"The word corona needs to be a whole word, not a substring, and if the word has a punctuation mark next to it I need to count that as well."

    df.createOrReplaceTempView("my_table")
    spark.sql(
      """
        | select dates, sum(
        |         case when (text rlike '\\bcorona\\b') then 1
        |         else 0 end
        |        ) as check
        |        from my_table group by dates
      """.stripMargin)
      .show(false)

    /**
      * +-----+-----+
      * |dates|check|
      * +-----+-----+
      * |2    |1    |
      * |4    |1    |
      * |5    |1    |
      * |0    |1    |
      * |1    |1    |
      * |3    |0    |
      * +-----+-----+
      */

Please note that the string coronavirus is not detected as corona, since you don't want substrings to be considered.

In Python:

sqlw = spark.sql(
      """
         select dates, sum(
          case when (text rlike '\\bcorona\\b') then 1
          else 0 end
         ) as check
         from my_table group by dates
      """)
