Parsing JSON input in Hadoop Java


Question



My input data is in HDFS. I am simply trying to do a word count, but with a slight difference: the data is in JSON format. So each line of data is:

{"author":"foo", "text": "hello"}
{"author":"foo123", "text": "hello world"}
{"author":"foo234", "text": "hello this world"}

I only want to count the words in the "text" field.

How do I do this?

So far, I have tried the following variant:

// Imports needed at the top of the enclosing file:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public static class TokenCounterMapper
    extends Mapper<Object, Text, Text, IntWritable> {
    private static final Log log = LogFactory.getLog(TokenCounterMapper.class);
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        try {
            // Each input line is one JSON record; parse it and pull out "text".
            JSONObject jsn = new JSONObject(value.toString());
            String text = jsn.getString("text");
            log.info("Logging data");
            log.info(text);

            // Tokenize only the "text" field and emit (word, 1) pairs.
            StringTokenizer itr = new StringTokenizer(text);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        } catch (JSONException e) {
            // Skip malformed records instead of failing the whole task.
            log.warn("Skipping malformed JSON record: " + value, e);
        }
    }
}
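Setting Hadoop aside for a moment, the counting logic inside the mapper can be checked as a small stand-alone sketch. The class name `TextFieldWordCount` and the naive `extractText` helper below are hypothetical: the helper is only a stand-in for `new JSONObject(line).getString("text")`, so the example runs without the `org.json` jar on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class TextFieldWordCount {

    // Naive stand-in for new JSONObject(line).getString("text"),
    // only for illustrating the counting logic on well-formed input.
    static String extractText(String line) {
        int key = line.indexOf("\"text\"");
        int start = line.indexOf('"', line.indexOf(':', key) + 1) + 1;
        int end = line.indexOf('"', start);
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String[] lines = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        // Same tokenize-and-count logic as the mapper, but aggregated locally.
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(extractText(line));
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        System.out.println(counts.get("hello")); // 3
        System.out.println(counts.get("world")); // 2
        System.out.println(counts.get("this"));  // 1
    }
}
```

In the real job, the per-word `(word, 1)` pairs are emitted to the framework and the aggregation happens in the reducer rather than in a local map.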

But I am getting this error:

Error: java.lang.ClassNotFoundException: org.json.JSONException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Solution

It seems you forgot to embed the JSON library in your Hadoop job jar, so the task JVMs cannot find `org.json` classes at runtime. You can have a look here to see how to build your job jar with its library dependencies: http://tikalk.com/build-your-first-hadoop-project-maven
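One common way to do this, assuming a Maven build, is to bundle the dependency into a "fat" job jar with the maven-shade-plugin. The snippet below is a minimal sketch for the `pom.xml`; the `org.json` coordinates and version are an assumption, so check the artifact you actually depend on:

```xml
<dependencies>
  <dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20090211</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- Repackages all dependencies (including org.json) into the job jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Alternatively, if the driver parses its arguments with `GenericOptionsParser` (for example via `ToolRunner`), the extra jar can be shipped at submit time with `hadoop jar myjob.jar MyJob -libjars json.jar ...` instead of being shaded in.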
