Count number of words in each sentence Spark Dataframes
Problem description
I have a Spark Dataframe where each row has a review.
+--------------------+
| reviewText|
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+
I have tried:
SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)
Then I created the function:
def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x: word_count(x))

df2 = SplitSentences.withColumn("word_count",
    wordcount_udf(col('split_sent')).cast(IntegerType()))
I want to count the words of each sentence in each review (row), but it doesn't work.
Answer
You can use the split inbuilt function to split the sentences into words and the size inbuilt function to count the length of the resulting array:
from pyspark.sql import functions as F

df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)
This way you don't need an expensive udf function.
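For reference (an illustration, not part of the original answer), the per-row logic of `F.size(F.split(col, ' '))` mirrors plain Python's split on a single space, so it can be sanity-checked locally without a Spark session:

```python
def word_count(review_text):
    # Mirrors F.size(F.split(col, ' ')): split on single spaces
    # and count the resulting tokens.
    return len(review_text.split(' '))

print(word_count("this is text testing spliting"))  # 5
```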
For example, suppose you have the following one-sentence dataframe
+-----------------------------+
|reviewText |
+-----------------------------+
|this is text testing spliting|
+-----------------------------+
After applying the above size and split functions, you should get
+-----------------------------+----------+
|reviewText |word_count|
+-----------------------------+----------+
|this is text testing spliting|5 |
+-----------------------------+----------+
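One caveat worth noting (an observation, not from the original answer): splitting on a literal single space keeps empty tokens between consecutive spaces, which can inflate the count. Plain Python's `str.split` shows the same behaviour:

```python
text = "this is  double-spaced"
# split(' ') keeps an empty token between consecutive spaces
print(text.split(' '))   # ['this', 'is', '', 'double-spaced']
# split() with no argument collapses runs of whitespace instead
print(text.split())      # ['this', 'is', 'double-spaced']
```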
If you have multiple sentences in one row, as below,
+----------------------------------------------------------------------------------+
|reviewText |
+----------------------------------------------------------------------------------+
|this is text testing spliting. this is second sentence. And this is the third one.|
+----------------------------------------------------------------------------------+
then you will have to write a udf function as below
from pyspark.sql import functions as F

def countWordsInEachSentences(array):
    return [len(x.split()) for x in array]

countWordsSentences = F.udf(lambda x: countWordsInEachSentences(x.split('. ')))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)
which should give you
+----------------------------------------------------------------------------------+----------+
|reviewText |word_count|
+----------------------------------------------------------------------------------+----------+
|this is text testing spliting. this is second sentence. And this is the third one.|[5, 4, 6] |
+----------------------------------------------------------------------------------+----------+
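The udf's core logic can be checked without a Spark session; this pure-Python mirror (function name is illustrative) reproduces the [5, 4, 6] result above:

```python
def count_words_in_each_sentence(review):
    # Split the review into sentences on '. ', then count the
    # whitespace-separated tokens in each sentence.
    return [len(sentence.split()) for sentence in review.split('. ')]

review = "this is text testing spliting. this is second sentence. And this is the third one."
print(count_words_in_each_sentence(review))  # [5, 4, 6]
```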
I hope the answer is helpful.