How to parse octet-stream files using Apache Tika?

Views: 34 | Posted: 2021/11/14 23:48:42 | Tags: java azure-blob-storage apache-tika

Problem description

I have stored many different types of files in Azure Blob Storage: txt, doc, pdf, and so on. However, all of the files are stored there as 'octet-stream', and when I open them to extract text with Tika, Tika cannot detect the character encoding. How can I get around this problem?

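import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Locate the blob through the Hadoop FileSystem API
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);

// Let Tika auto-detect the format and extract the body text
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

// try-with-resources closes the stream once parsing is done
try (InputStream stream = fs.open(pt)) {
    parser.parse(stream, handler, metadata);
}

spaceContentBuffer.append(handler.toString());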

Recommended answer

If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header via the Set Blob Properties operation.
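
As a rough illustration of the REST route, the sketch below builds (but does not send) a Set Blob Properties request using only the JDK's java.net.http classes. The account, container, blob name and SAS token are placeholder values, the class and method names are made up for this example, and the x-ms-version value is an assumption; a real call also needs valid authorization and URL-encoded blob names.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds a Set Blob Properties request without sending it.
final class SetBlobContentType {

    // All parameters are placeholders supplied by the caller (hypothetical helper).
    static HttpRequest build(String account, String container, String blobName,
                             String sasToken, String contentType) {
        // Set Blob Properties is a PUT on the blob URL with comp=properties.
        URI uri = URI.create("https://" + account + ".blob.core.windows.net/"
                + container + "/" + blobName + "?comp=properties&" + sasToken);
        return HttpRequest.newBuilder(uri)
                .method("PUT", HttpRequest.BodyPublishers.noBody())
                .header("x-ms-blob-content-type", contentType)
                .header("x-ms-version", "2020-10-02") // assumed service version
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("myaccount", "mycontainer", "report.pdf",
                "sv=...&sig=...", "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending the request would be a matter of passing it to an HttpClient. It is worth checking the Set Blob Properties documentation first, since properties left out of the request may be cleared on the blob.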

If you are using the Azure Storage client library, you can write code similar to the following:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
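
The snippet above hard-codes "text/xml", but the container in the question mixes txt, doc and pdf files, so the value has to be chosen per blob. One possible approach, sketched here in Java with a hypothetical helper class and a deliberately tiny extension table (both are assumptions, not part of any Azure or Tika API), is to derive the content type from the blob name and fall back to application/octet-stream when the extension is unknown.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: picks a content type from a blob name's extension.
final class ContentTypeGuesser {

    private static final String FALLBACK = "application/octet-stream";

    // Illustrative subset only; extend it for the file types actually stored.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "pdf", "application/pdf",
            "doc", "application/msword",
            "xml", "text/xml");

    static String guess(String blobName) {
        int dot = blobName.lastIndexOf('.');
        // No extension, or a trailing dot: nothing to look up.
        if (dot < 0 || dot == blobName.length() - 1) {
            return FALLBACK;
        }
        String extension = blobName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.txt", "Report.PDF", "letter.doc", "data"}) {
            System.out.println(name + " -> " + guess(name));
        }
    }
}
```

The returned string is what would be assigned to the blob's content type before saving its properties. Extension-based guessing is only as good as the file names, so it will not help for blobs that were uploaded without a meaningful extension.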