How to parse octet-stream files using Apache Tika?

Question

I have stored files of many different types (txt, doc, pdf, etc.) in Azure Blob storage. However, all of them are stored there with the content type 'octet-stream', and when I open the files to extract their text with Tika, Tika cannot detect the character encoding. How can I get around this problem?

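import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Locate the blob through the Hadoop FileSystem API
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

// Let Tika detect the type and extract the text; the stream is closed afterwards
try (InputStream stream = fs.open(pt)) {
    parser.parse(stream, handler, metadata);
}

spaceContentBuffer.append(handler.toString());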

Recommended answer

If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header with the Set Blob Properties operation.
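As a rough illustration of the REST route, the Java sketch below only builds (and does not send) a Set Blob Properties request that carries the "x-ms-blob-content-type" header. The class name, the method name, the account/container/blob names and the SAS query string are all hypothetical placeholders. It assumes the blob URL is already authorized with a SAS token; Shared Key authorization and any additional headers your account requires are left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of any Azure SDK.
final class SetBlobContentTypeRequest {

    // Builds a PUT request for the Set Blob Properties operation.
    // blobUrl is assumed to be a full blob URL, optionally ending in a SAS query string.
    static HttpRequest build(String blobUrl, String contentType) {
        // comp=properties selects the Set Blob Properties operation
        String separator = blobUrl.contains("?") ? "&" : "?";
        URI uri = URI.create(blobUrl + separator + "comp=properties");
        return HttpRequest.newBuilder(uri)
                .header("x-ms-blob-content-type", contentType)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL: replace the account, container, blob and SAS values with your own
        HttpRequest request = build(
                "https://myaccount.blob.core.windows.net/mycontainer/report.pdf?sig=PLACEHOLDER",
                "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually apply the change, the request would be passed to a java.net.http.HttpClient. Check the Set Blob Properties reference before doing so, since properties that are omitted from the request may be cleared on the blob.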

If you are using the Azure Storage Client Library, you can write code similar to the following:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
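The snippet above hard-codes "text/xml", but the question involves a mix of txt, doc and pdf files, so each blob would need its own content type. The following Java sketch shows one possible way to pick that value from the file extension. The class and method names are hypothetical, and the table covers only the three types mentioned in the question; anything else falls back to "application/octet-stream".

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for choosing the content type to store on a blob.
final class ContentTypes {

    private static final String FALLBACK = "application/octet-stream";

    // Only the file types named in the question; extend as needed
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "doc", "application/msword",
            "pdf", "application/pdf");

    // Returns the content type for a file name, based on its extension
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension to go by
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.txt", "letter.doc", "report.PDF", "archive"}) {
            System.out.println(name + " -> " + forFileName(name));
        }
    }
}
```

The returned string is what would be assigned to the blob's content type property when uploading or updating the blob.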
