Error during filtering and counting of words in Spark


Problem Description

I want to define 5 words, filter my dataset based on those 5 words, and count the number of occurrences. Suppose,

words = ['dog', 'cat','tiger','lion','cheetah']

I have a text file containing more than 2000 lines of sentences. I want to search my text file and return the number of occurrences.

I have searched the internet and found some code, like the following:

val filePath = sc.text_file("/user/cloudera/input/Hin*/datafile.txt")
val crimecounts =
  filePath.
    flatMap(line=>line.split(" ")).
    filter(w => (w =="dog") || (w == "lion")).
    map(word=>(word, 1)).
    reduceByKey(_ + _)

This code returns a wrong count for "lion", and surprisingly only the count of "lion" is returned at all. I have separately checked the correct count values using Python code. How should the code be corrected so that it returns the correct count for all 5 words? A subset of the data is as follows:

It was a hot summer day. A lion and a boar reach a small water body for a drink. Lion and boar begin arguing and fighting about who should drink first. After a while, they are tired and stop for breath, when they notice vultures above. Soon they realize that the vultures are waiting for one or both of them to fall, to feast on them. The lion and the boar then decide that it was best to make up and be friends than fight and become food for vultures. Lion and boar drink the water together and go their ways after.

I am a newbie to Spark. Can anyone help me in this regard?

Recommended Answer

There are several errors in your code: the array-creation part appears to be PySpark, but the rest of the code looks like Scala, and there is no text_file API on a SparkContext instance (the method is textFile).

Solution for PySpark:

words = ['dog', 'cat', 'tiger', 'lion', 'cheetah']

from operator import add

filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
crimecounts = (filePath
    .flatMap(lambda line: line.split(" "))  # tokenize each line on spaces
    .filter(lambda w: w.lower() in words)   # keep only the target words, case-insensitively
    .map(lambda word: (word.lower(), 1))    # normalize case so "Lion" and "lion" share one key
    .reduceByKey(add))                      # sum the occurrences per word
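To actually see the counts you still need an action to trigger the job. A minimal sketch (the collect-and-print step is my addition, not part of the original answer):

# Run the job and bring the small result set back to the driver;
# safe here because at most five distinct words survive the filter.
for word, count in crimecounts.collect():
    print(word, count)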

Solution for Scala:

val words = Array("dog", "cat", "tiger", "lion", "cheetah")

val filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
val crimecounts = filePath
  .flatMap(line => line.split(" "))            // tokenize each line on spaces
  .filter(w => words.contains(w.toLowerCase))  // keep only the target words, case-insensitively
  .map(word => (word.toLowerCase, 1))          // normalize case so "Lion" and "lion" share one key
  .reduceByKey(_ + _)                          // sum the occurrences per word
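One caveat applies to both snippets (my observation, not part of the original answer): splitting on a single space leaves punctuation attached, so tokens such as "drink." or "vultures." from the sample text never match the word list, which by itself can cause undercounting. A regex tokenizer avoids this; a minimal PySpark sketch, reusing the words, filePath, and add names from above:

import re

# \W+ splits on runs of non-word characters, so trailing punctuation
# like "first." or "vultures." is stripped before matching.
crimecounts = (filePath
    .flatMap(lambda line: re.split(r"\W+", line))
    .filter(lambda w: w.lower() in words)
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(add))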
