Store documents (.pdf, .doc and .txt files) in MaprDB


Problem description

I need to store documents such as .pdf, .doc and .txt files in MaprDB. I saw an example for HBase in which files are stored in binary form and retrieved as files in Hue, but I am not sure how to implement it. Any idea how a document can be stored in MaprDB?

Recommended answer

First, I am not familiar with MaprDB, since I use Cloudera. But I do have experience with HBase, where I have stored many types of objects as byte arrays, as described below.

The most primitive way to store data in HBase, or in any other database, is as a byte array.
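As a minimal sketch of what "byte array" means for a document, the snippet below reads a whole file into a byte[] with the JDK's Files.readAllBytes. The class name FileToBytes and the temporary sample file are illustrative assumptions, not part of the original answer, and reading the whole file at once is only reasonable for documents that fit comfortably in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: turns any file (.pdf, .doc, .txt, ...) into a byte array.
final class FileToBytes {

    /** Reads the entire file into memory and returns its raw bytes. */
    static byte[] read(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real document so the sketch runs anywhere.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "hello document".getBytes(StandardCharsets.UTF_8));

        byte[] bytes = read(sample);
        System.out.println("Read " + bytes.length + " bytes");

        Files.delete(sample);
    }
}
```

The resulting byte[] is the value that would be handed to the database client as the cell or column content.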

You can do this with the Apache Commons Lang API, as shown below. This is probably the best option, and it applies to any Serializable object, including ones that hold image, audio or video data.

Please test this method with an object of the type you use for your files. SerializationUtils.serialize returns a byte array, which you can then insert.

import java.io.Serializable;
import org.apache.commons.lang.SerializationUtils;

/**
 * Round-trips a Serializable object through a byte array.
 */
public void testSerializeAndDeserialize() throws Exception {

    // Serialize here. Replace the placeholder string with your own
    // Serializable object holding the .pdf, .doc or .txt content.
    Serializable original = "your object here";
    byte[] bytes = SerializationUtils.serialize(original);

    // Deserialize the same bytes and check that you get the object back.
    String restored = (String) SerializationUtils.deserialize(bytes);
    System.out.println(original.equals(restored));
}

Note: the Apache Commons Lang jar is normally already available on a Hadoop cluster, so it is not an external dependency.

Another example:

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.commons.lang.SerializationUtils;

public class SerializationUtilsTrial {
  public static void main(String[] args) {
    try {
      // File to serialize object to
      String fileName = "testSerialization.ser";

      // New file output stream for the file
      FileOutputStream fos = new FileOutputStream(fileName);

      // Serialize String
      SerializationUtils.serialize("SERIALIZE THIS", fos);
      fos.close();

      // Open FileInputStream to the file
      FileInputStream fis = new FileInputStream(fileName);

      // Deserialize and cast into String
      String ser = (String) SerializationUtils.deserialize(fis);
      System.out.println(ser);
      fis.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
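For comparison, here is a sketch of the same round trip using only the JDK. SerializationUtils is a thin convenience wrapper around ObjectOutputStream and ObjectInputStream, so the plain version looks like this. StoredDocument, fileName and content are hypothetical names chosen for illustration; the idea is simply a Serializable holder that keeps a document's name together with its raw bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for a document's name and raw content.
final class StoredDocument implements Serializable {
    private static final long serialVersionUID = 1L;

    final String fileName;
    final byte[] content;

    StoredDocument(String fileName, byte[] content) {
        this.fileName = fileName;
        this.content = content;
    }

    /** Serializes this object to a byte array with standard Java serialization. */
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
        }
        return bos.toByteArray();
    }

    /** Rebuilds a StoredDocument from bytes produced by toBytes(). */
    static StoredDocument fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (StoredDocument) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] text = "plain text body".getBytes(StandardCharsets.UTF_8);
        StoredDocument original = new StoredDocument("notes.txt", text);

        byte[] serialized = original.toBytes();
        StoredDocument restored = StoredDocument.fromBytes(serialized);

        System.out.println(restored.fileName + " restored with "
                + restored.content.length + " content bytes");
    }
}
```

Keeping the file name next to the content is a design choice for this sketch: it lets the reader of the stored value recreate the file without a separate lookup. If only the raw file bytes are needed, Java serialization adds overhead and the bytes can be stored directly.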


If for any reason you do not want to use the SerializationUtils class provided by Apache Commons Lang, the PDF serialize and deserialize example below may help you understand the process better. It is lengthy, though; using SerializationUtils reduces the amount of code.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfSerializeAndDeserExample {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        File file = new File("someFile.pdf");

        FileInputStream fis = new FileInputStream(file);
        //System.out.println(file.exists() + "!!");
        //InputStream in = resource.openStream();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            for (int readNum; (readNum = fis.read(buf)) != -1;) {
                // Write readNum bytes from buf, starting at offset 0, to the byte array output stream.
                bos.write(buf, 0, readNum);
                System.out.println("read " + readNum + " bytes,");
            }
        } catch (IOException ex) {
            Logger.getLogger(PdfSerializeAndDeserExample.class.getName()).log(Level.SEVERE, null, ex);
        }
        byte[] bytes = bos.toByteArray();

At this point you have the byte array, so you can prepare a put request to upload it to the database (HBase or any other database).

Once it is persisted, you can read it back with an HBase get or scan to obtain the PDF bytes, and then use the code below to recreate the same file (someFile.pdf in this case).

        File someFile = new File("someFile.pdf");
        FileOutputStream fos = new FileOutputStream(someFile);
        fos.write(bytes);
        fos.flush();
        fos.close();
    }
}
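The example above leaves the input stream open and keeps going if reading fails. A more defensive variant, offered here only as a sketch, uses try-with-resources so both streams are always closed, and checks that the restored file matches the original bytes. DocumentRoundTrip, readAll and writeAll are illustrative names, and the temporary files stand in for a real PDF and for the database round trip.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper showing the read -> store -> write-back cycle.
final class DocumentRoundTrip {

    /** Copies the file into a byte array; the stream is closed even on failure. */
    static byte[] readAll(Path source) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(source)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    /** Writes the bytes to the target file, replacing any existing content. */
    static void writeAll(Path target, byte[] bytes) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("source", ".pdf");
        Path restored = Files.createTempFile("restored", ".pdf");
        // Dummy content: the four bytes of the "%PDF" file signature.
        Files.write(source, new byte[] {0x25, 0x50, 0x44, 0x46});

        byte[] bytes = readAll(source);
        // In a real application the bytes would be put into the database
        // here and fetched again before being written back.
        writeAll(restored, bytes);

        System.out.println("Identical: "
                + Arrays.equals(bytes, Files.readAllBytes(restored)));

        Files.delete(source);
        Files.delete(restored);
    }
}
```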

EDIT: Since you asked for HBase examples, I am adding one.

In the method below, yourcolumnasBytearray is your document file (for instance a PDF) converted to a byte array, using SerializationUtils.serialize as in the examples above.

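import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Put (or insert) a row
 */
@Override
public void addRecord(final String tableName, final String rowKey, final String family, final String qualifier,
                final byte[] yourcolumnasBytearray) throws Exception {
    try {
        // HBaseConnection, getTable and LOG belong to the surrounding class, which is not shown here.
        final HTableInterface table = HBaseConnection.getHTable(getTable(tableName));
        final Put put = new Put(Bytes.toBytes(rowKey));
        // Store the document bytes under the given column family and qualifier.
        put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), yourcolumnasBytearray);
        table.put(put);
        LOG.info("INSERT record " + rowKey + " to table " + tableName + " OK.");
    } catch (final IOException e) {
        e.printStackTrace();
    }
}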
