Words normalization using RDD

Question

Maybe this question is a little bit strange... But I'll try to ask it.

Everyone who has written applications using the Lucene API has seen something like this:

public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    tokenStream.reset();
    CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
    StringBuilder result = new StringBuilder();
    while (tokenStream.incrementToken())
    {
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result.append(normalizer.getNormalForm(token.toString())).append(' ');
        }
        catch (Exception e)
        {
            // skip tokens the normalizer cannot handle
        }
    }
    tokenStream.end();
    tokenStream.close();
    return result.toString();
}

Is it possible to rewrite words normalization using RDDs? Maybe someone has an example of this transformation, or can point to a web resource about it?

Thank you.

Answer

I recently used a similar example for a talk. It shows how to remove stop words. It has no normalization phase, but if that normalizer.getNormalForm comes from a reusable library, it should be easy to integrate.

This code could be a starting point:

// source text
val rdd = sc.textFile(...)
// stop words source
val stopWordsRdd = sc.textFile(...)
// bring the stop words to the driver and broadcast them
// => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
// split on non-word characters, lowercase, and drop empty tokens
val words = rdd.flatMap(line => line.split("\\W+").map(_.toLowerCase)).filter(_.nonEmpty)
val cleaned = words.mapPartitions { iterator =>
    val stopWordsSet = stopWordsBroadcast.value
    iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_))

Note: This is from the Spark job point of view. I'm not familiar with Lucene.
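To make the per-record logic easy to follow outside of Spark, here is a minimal plain-Java sketch of the same pipeline: split on non-word characters, lowercase, drop stop words, then apply a normalizer. The `normalForm` method below is a hypothetical stand-in for `normalizer.getNormalForm` (a naive suffix stripper, not a real stemmer); in the Spark version each of these steps maps to `flatMap`, `filter`, and `map` over the RDD.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NormalizePipeline {
    // Hypothetical stand-in for normalizer.getNormalForm:
    // a naive plural stripper, just to make the sketch runnable.
    static String normalForm(String word) {
        return word.endsWith("s") && word.length() > 3
                ? word.substring(0, word.length() - 1)
                : word;
    }

    static List<String> clean(String text, Set<String> stopWords) {
        return Arrays.stream(text.split("\\W+"))
                .map(String::toLowerCase)
                .filter(w -> !w.isEmpty() && !stopWords.contains(w))  // drop stop words and empty tokens
                .map(NormalizePipeline::normalForm)                   // normalization step
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(clean("The cats of a house", stop)); // [cat, house]
    }
}
```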
