How to parse octet-stream files using Apache Tika?
Question
I have stored files of many different types (txt, doc, pdf, etc.) in Azure Blob storage. However, all of them are stored there with the content type 'octet-stream', and when I open the files to extract their text with Tika, Tika can't detect the character encoding. How can I get around this problem?
// Open the blob through the Hadoop FileSystem API.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// try-with-resources closes the stream even if parsing throws.
try (InputStream stream = fs.open(pt)) {
    parser.parse(stream, handler, metadata);
}
spaceContentBuffer.append(handler.toString());
Recommended answer
If you call the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header through the Set Blob Properties operation.
If you are using the Azure Storage client library, you can write code similar to the following:
blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
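The snippet above hardcodes "text/xml", while the blobs in the question include txt, doc, and pdf files, so the value would presumably need to vary per file. As a rough sketch, written in Java to match the question's code, one way to choose the value is to look it up from the file extension. The class name BlobContentTypes, the method forFileName, and the small mapping table are all hypothetical and only illustrative; they are not part of Tika or any Azure SDK.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Tika or the Azure SDK): picks the content
// type to store on a blob from its file name instead of hardcoding one value.
final class BlobContentTypes {
    // Illustrative mapping only; extend it for the file types you actually store.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "xml", "text/xml",
            "pdf", "application/pdf",
            "doc", "application/msword");

    // Used when the extension is missing or unknown.
    static final String FALLBACK = "application/octet-stream";

    private BlobContentTypes() {}

    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension to look up
        }
        // Compare extensions case-insensitively, so "PDF" matches "pdf".
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.txt", "report.PDF", "letter.doc", "blob"}) {
            System.out.println(name + " -> " + forFileName(name));
        }
    }
}
```

The returned string would then be used wherever the content type is set, for example in place of the literal "text/xml" above. Unknown extensions deliberately fall back to "application/octet-stream" rather than guessing.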