Processing JSON using Java MapReduce

Question
I am new to Hadoop MapReduce. I have an input text file where the data is stored as follows. Here are just a few of the tuples (data.txt):
{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}
This is the Java file that I am supposed to write my code in (CombineBooks.java):
package org.hwone;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

//TODO import necessary components

/*
 * Modify this file to combine books from the same author into a
 * single JSON object.
 * i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
 * Be aware that this may run on any number of nodes!
 */

public class CombineBooks {

    //TODO define variables and implement necessary components

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: CombineBooks <in> <out>");
            System.exit(2);
        }

        //TODO implement CombineBooks

        Job job = new Job(conf, "CombineBooks");

        //TODO implement CombineBooks

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
My task is to create a Hadoop program in "CombineBooks.java", to be returned in the "question-2" directory. The program should do the following: given the input author-book tuples, the map-reduce program should produce a JSON object which contains all the books from the same author in a JSON array, i.e.
{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}
Any idea how it can be done?

Solution

First, the JSON classes you are trying to work with are not available to your program. To solve this:
- Go here and download the repository as a zip: https://github.com/douglascrockford/JSON-java
- Extract it into your sources folder, under the subdirectory org/json/* (a sketch of this step follows the list)
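On a Unix shell, the download-and-extract step might look roughly like this (a sketch only; the archive name and its internal layout are assumptions and may differ from the actual repository):

wget https://github.com/douglascrockford/JSON-java/archive/master.zip
unzip master.zip
mkdir -p src/org/json
cp JSON-java-master/*.java src/org/json/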
Next, the package declared on the first line of your code is not the one you want; you should create a separate package, for instance "my.books".
Third, using a combiner here is useless: the Reduce class emits (NullWritable, Text) pairs while the map output is (Text, Text), so it cannot double as a combiner.
Here's the code I ended up with; it works and solves your problem:
package my.books;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.json.*;

public class CombineBooks {

    // Mapper: parse each input line as a JSON object and emit (author, book).
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // TextInputFormat delivers one line per call; the split is just a
            // defensive no-op for single-line values.
            String[] tuple = value.toString().split("\\n");
            try {
                for (int i = 0; i < tuple.length; i++) {
                    JSONObject obj = new JSONObject(tuple[i]);
                    String author = obj.getString("author");
                    String book = obj.getString("book");
                    context.write(new Text(author), new Text(book));
                }
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }

    // Reducer: collect all books of one author into a JSON array and emit the
    // combined object; the NullWritable key keeps the output to the JSON text only.
    public static class Reduce extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            try {
                JSONObject obj = new JSONObject();
                JSONArray ja = new JSONArray();
                for (Text val : values) {
                    ja.put(new JSONObject().put("book", val.toString()));
                }
                obj.put("books", ja);
                obj.put("author", key.toString());
                context.write(NullWritable.get(), new Text(obj.toString()));
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: CombineBooks <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "CombineBooks");
        job.setJarByClass(CombineBooks.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Here's the folder structure of my project:
src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java
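With this layout, everything can be compiled and packed into bookparse.jar roughly as follows (a minimal sketch, assuming the hadoop command is on the PATH; adjust the source list to the org/json files you actually extracted):

mkdir -p classes
javac -cp "$(hadoop classpath)" -d classes src/org/json/*.java src/org/json/zip/*.java src/my/books/CombineBooks.java
jar cf bookparse.jar -C classes .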
Here's the input:
[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}
The command to run:
hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output
Here's the output:
[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}
You can use one of these three options to put the org.json.* classes onto your cluster:

- Pack the org.json.* classes into your jar file (this can easily be done using a GUI IDE). This is the option I used in my answer.
- Put the jar file containing the org.json.* classes on each of the cluster nodes, into one of the CLASSPATH directories (see yarn.application.classpath).
- Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use the job.addFileToClassPath call to make this jar available to all of the tasks executing your job on the cluster. With my answer, you would add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to the main method, as sketched below.
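For the third option, the call sits in main() right after the Job is created; a sketch (the HDFS path is a placeholder for your actual jar location):

Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
// Ship the org.json jar, previously uploaded to HDFS, to every task's classpath.
job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>"));
// ...the rest of the job setup stays as shown above.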