Total number of non-repeated words in each tweet


Problem description

I'm new to Java and Trident. I imported a project for fetching tweets, but I want to go further: how does this code get more than one tweet? From the code, it looks like tuple.getValue(0); means the first tweet only.

My problem is to collect all tweets in a HashSet or HashMap, so that I can get the total number of distinct words in each tweet.
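Independent of Storm, the per-tweet distinct-word count itself is straightforward in plain Java. A minimal sketch (the whitespace tokenization and lowercasing are assumptions; a real tokenizer would also strip punctuation):

```java
import java.util.*;

public class DistinctWordsPerTweet {
    // Count distinct (case-insensitive) words in a single tweet body.
    static int distinctWordCount(String body) {
        Set<String> words = new HashSet<>(
            Arrays.asList(body.toLowerCase().split("\\s+")));
        return words.size();
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "Storm storm TRIDENT tutorial",
            "hello hello world");
        for (String t : tweets) {
            // prints 3, then 2
            System.out.println(distinctWordCount(t));
        }
    }
}
```

The streaming difficulty discussed below is not this computation, but where and when to run it over an unbounded stream.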

public void execute(TridentTuple tuple, TridentCollector collector) {

This method is used to run the computation on a tweet:

public Values getValues(Tweet tweet, String[] words){
 }

This code takes the first tweet, gets its body, and converts it to an array of strings. I know what I need to solve, but I couldn't write it well.

My idea is something like:

for (int i=0;i<10;i++)
{
 Tweet tweet = (Tweet) tuple.getValue(i);   
}

Answer

问题"是获取所有推文中不同单词的数量"和作为流处理器的Strom之间的错配.您想要回答的查询只能在有限的Tweets集合上进行计算.但是,在流处理中,您将处理潜在的无限输入数据流.

The "problem" is a miss-match between "get the count of distinct words over all tweets" and Strom as a stream processor. The query you want to answer can only be computed on a finite set of Tweets. However, in stream processing you process an potential infinite stream of input data.

If you have a finite set of tweets, you might want to use a batch processing framework such as Flink, Spark, or MapReduce. If you really do have an infinite number of tweets, you must rephrase your question...

As you mentioned already, you actually want to "loop over all tweets". Since you are doing stream processing, there is no such concept: you have an infinite number of input tuples, and Storm applies execute() to each of them (i.e., you can think of it as if Storm "loops over the input" automatically, even if "looping" is not quite the right term for it). Because your computation is "over all tweets", you need to maintain state in your Bolt code, so that you can update this state for each tweet. The simplest form of state in Storm is a member variable in your Bolt class.

public class MyBolt implements ??? {
    // this is your "state" variable
    private final Set<String> allWords = new HashSet<String>();

    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet tweet = (Tweet) tuple.getValue(0);
        String tweetBody = tweet.getBody();
        String[] words = tweetBody.toLowerCase().split(regex);
        for (String w : words) {
            // as allWords is a set, you cannot add the same word twice;
            // the second add() call on the same word is simply ignored,
            // thus allWords contains each word exactly once
            this.allWords.add(w);
        }
    }
}

Right now, this code does not emit anything, because it is unclear what you actually want to emit. Since a stream has no end, you cannot say "emit the final count of the words contained in allWords". What you can do is emit the current count after each update. For this, add collector.emit(new Values(this.allWords.size())); at the end of execute().
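The update-then-emit pattern can be simulated outside of Storm with plain Java; the returned value below stands in for what collector.emit(...) would carry (a sketch, with whitespace tokenization assumed):

```java
import java.util.*;

public class RunningDistinctCount {
    // the "state" variable, as in the Bolt above
    private final Set<String> allWords = new HashSet<>();

    // Analogue of execute(): fold one tweet body into the state
    // and return the running distinct-word count to be "emitted".
    int update(String tweetBody) {
        for (String w : tweetBody.toLowerCase().split("\\s+")) {
            allWords.add(w);
        }
        return allWords.size();
    }

    public static void main(String[] args) {
        RunningDistinctCount bolt = new RunningDistinctCount();
        for (String body : Arrays.asList("a b a", "b c", "c d")) {
            // prints 2, 3, 4: the count grows as new words arrive
            System.out.println(bolt.update(body));
        }
    }
}
```

Each input produces one output value, which is exactly the per-update emission described above: a downstream consumer sees an ever-growing running count rather than a single final answer.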

Furthermore, I want to add that the presented solution only works correctly if no parallelism is applied to MyBolt; otherwise, the sets on different instances might contain the same word. To resolve this, you would need to tokenize each tweet into its words in a stateless Bolt and feed this stream of words into an adapted MyBolt that uses an internal Set as state. MyBolt must also receive its input via fieldsGrouping, to ensure disjoint sets of words across the instances.
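Why fieldsGrouping makes the parallel version correct can be illustrated without Storm: routing each word by its hash guarantees that the same word always lands on the same instance, so the per-instance sets stay disjoint and their sizes can simply be summed. A plain-Java sketch (the partition count and input words are arbitrary examples):

```java
import java.util.*;

public class FieldsGroupingSketch {
    // Route each word to one of n partitions by its hash,
    // mimicking what fieldsGrouping on the word field does.
    static List<Set<String>> partition(List<String> words, int n) {
        List<Set<String>> sets = new ArrayList<>();
        for (int i = 0; i < n; i++) sets.add(new HashSet<>());
        for (String w : words) {
            int idx = Math.floorMod(w.hashCode(), n);
            sets.get(idx).add(w);
        }
        return sets;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "d");
        List<Set<String>> sets = partition(words, 3);
        int total = 0;
        for (Set<String> s : sets) total += s.size();
        // the partition sets are disjoint, so the sum of their sizes
        // equals the global distinct-word count: prints 4
        System.out.println(total);
    }
}
```

With a random (shuffle) grouping instead, "a" could reach two instances and be counted twice, which is exactly the double-counting problem described above.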
