Count number of words in each sentence Spark Dataframes


Question

I have a Spark Dataframe where each row has a review.

+--------------------+
|          reviewText| 
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+

I tried:

SplitSentences = df.withColumn("split_sent",sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)

Then I created the function:

def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x: word_count(x))

df2 = SplitSentences.withColumn("word_count",
  wordcount_udf(col('split_sent')).cast(IntegerType()))

I want to count the words of each sentence in each review (row), but it doesn't work.

Answer

You can use the built-in split function to split the sentences and the built-in size function to count the length of the resulting array:

from pyspark.sql import functions as F

df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)

That way you don't need an expensive udf function.

As an example, let's say you have the following one-sentence dataframe:

+-----------------------------+
|reviewText                   |
+-----------------------------+
|this is text testing spliting|
+-----------------------------+

After applying the above size and split functions, you should get:

+-----------------------------+----------+
|reviewText                   |word_count|
+-----------------------------+----------+
|this is text testing spliting|5         |
+-----------------------------+----------+
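As a quick sanity check (plain Python, no Spark session needed), `F.size(F.split(col, ' '))` computes the same value as splitting the string on a single space and taking the list length:

```python
# Plain-Python equivalent of F.size(F.split(df['reviewText'], ' ')):
# split on a single space, then count the resulting tokens.
text = "this is text testing spliting"
word_count = len(text.split(' '))
print(word_count)  # 5
```

Note that both Spark's regex-based split on `' '` and Python's `str.split(' ')` produce empty tokens for consecutive spaces, so the two agree on that edge case too.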

If you have multiple sentences in one row, as below:

+----------------------------------------------------------------------------------+
|reviewText                                                                        |
+----------------------------------------------------------------------------------+
|this is text testing spliting. this is second sentence. And this is the third one.|
+----------------------------------------------------------------------------------+

Then you will have to write a udf function as below

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def countWordsInEachSentences(array):
    # number of whitespace-separated words in each sentence
    return [len(x.split()) for x in array]

# split the review on '. ' into sentences, then count words per sentence
countWordsSentences = F.udf(lambda x: countWordsInEachSentences(x.split('. ')),
                            ArrayType(IntegerType()))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)

which should give you:

+----------------------------------------------------------------------------------+----------+
|reviewText                                                                        |word_count|
+----------------------------------------------------------------------------------+----------+
|this is text testing spliting. this is second sentence. And this is the third one.|[5, 4, 6] |
+----------------------------------------------------------------------------------+----------+
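The udf's per-sentence logic can be verified in plain Python before registering it with Spark; this is a minimal sketch reusing the `countWordsInEachSentences` function from the answer above:

```python
def countWordsInEachSentences(array):
    # count whitespace-separated words in each sentence
    return [len(x.split()) for x in array]

review = ("this is text testing spliting. "
          "this is second sentence. And this is the third one.")
# the udf splits the review on '. ' before counting
counts = countWordsInEachSentences(review.split('. '))
print(counts)  # [5, 4, 6]
```

Checking the logic outside Spark first is cheap, since udf failures inside an executor surface as long Py4J stack traces that are harder to read.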

Hope the answer is helpful.
