Processing JSON using Java MapReduce


Question

I am new to Hadoop MapReduce.

I have an input text file where the data is stored as follows. Here are just a few tuples (data.txt):

{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}

This is the Java file that I am supposed to write my code in (CombineBooks.java):

package org.hwone;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

//TODO import necessary components

/*
*  Modify this file to combine books from the same author into a
*  single JSON object.
*  i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
*  Be aware that this may run on any number of nodes!
*
*/

public class CombineBooks {

  //TODO define variables and implement necessary components

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: CombineBooks <in> <out>");
      System.exit(2);
    }

    //TODO implement CombineBooks

    Job job = new Job(conf, "CombineBooks");

    //TODO implement CombineBooks

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

My task is to create a Hadoop program in "CombineBooks.java", returned in the "question-2" directory. The program should do the following: given the input author-book tuples, the map-reduce program should produce a JSON object which contains all the books from the same author in a JSON array, i.e.

{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]} 

Any idea how it can be done?

Solution

First, the JSON classes you are trying to work with are not available to you out of the box. To solve this:

  1. Go here and download as zip: https://github.com/douglascrockford/JSON-java
  2. Extract it to your sources folder in the subdirectory org/json/* (a quick parsing sanity check is sketched just below)
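
If the library is in place, a tiny standalone check like the one below should compile and run. This is only a hypothetical sanity check (the class name JsonCheck is made up for illustration); it is not part of the MapReduce job itself.

import org.json.JSONObject;

public class JsonCheck {
    public static void main(String[] args) throws Exception {
        // One tuple in the same format as the input file
        String line = "{\"author\":\"author1\", \"book\":\"book1\"}";
        JSONObject obj = new JSONObject(line);        // parse the JSON tuple
        System.out.println(obj.getString("author"));  // prints: author1
        System.out.println(obj.getString("book"));    // prints: book1
    }
}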

Next, the package declared in the first line of your code is incorrect; you should create a separate package, for instance "my.books".

Third, using a combiner here is useless; note in particular that the Reduce class below cannot double as a combiner, because its output types (NullWritable, Text) do not match the map output types (Text, Text).

Here's the code I ended up with; it works and solves your problem:

package my.books;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;

public class CombineBooks {

    // Mapper: each input value is one JSON line; emit an (author, book) pair per tuple
    public static class Map extends Mapper<LongWritable, Text, Text, Text>{

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{

            String author;
            String book;
            String line = value.toString();
            // With TextInputFormat each value is already a single line, so this
            // split normally yields exactly one element; it is kept as a safeguard.
            String[] tuple = line.split("\\n");
            try{
                for(int i=0;i<tuple.length; i++){
                    JSONObject obj = new JSONObject(tuple[i]);
                    author = obj.getString("author");
                    book = obj.getString("book");
                    context.write(new Text(author), new Text(book));
                }
            }catch(JSONException e){
                e.printStackTrace();
            }
        }
    }

    // Reducer: collect all books of one author into a JSON array and emit the combined JSON object
    public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{

            try{
                JSONObject obj = new JSONObject();
                JSONArray ja = new JSONArray();
                for(Text val : values){
                    JSONObject jo = new JSONObject().put("book", val.toString());
                    ja.put(jo);
                }
                obj.put("books", ja);
                obj.put("author", key.toString());
                context.write(NullWritable.get(), new Text(obj.toString()));
            }catch(JSONException e){
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: CombineBooks <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "CombineBooks");
        job.setJarByClass(CombineBooks.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here's the folder structure of my project:

src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java

Here's the input:

[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}

The command to run:

hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output

Here's the output:

[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}

You can use one of the following three options to make the org.json.* classes available on your cluster:

  1. Pack the org.json.* classes into your jar file (this can easily be done with a GUI IDE). This is the option I used in this answer
  2. Put the jar file containing the org.json.* classes on each of the cluster nodes, into one of the CLASSPATH directories (see yarn.application.classpath)
  3. Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use the job.addFileToClassPath call so that the jar is available to all of the tasks executing your job on the cluster. With the code in this answer, you would add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to main (a minimal sketch follows this list)
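
For option 3, here is a minimal sketch of what the call could look like. The class name AddJarToClassPath and the HDFS path "/libs/json.jar" are placeholders for illustration only; in practice you would add the single addFileToClassPath line to the existing main of CombineBooks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class AddJarToClassPath {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "CombineBooks");
        // The jar must already be on HDFS, e.g. uploaded with: hdfs dfs -put json.jar /libs/json.jar
        job.addFileToClassPath(new Path("/libs/json.jar"));
        // ...the rest of the job setup (mapper, reducer, input/output paths) stays as in CombineBooks above
    }
}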
