How to parse octet-stream files using Apache Tika?

Views: 53 Published: 2020/9/4 23:12:07 Tags: java azure-storage-blobs apache-tika

Question

I have stored files of various types (txt, doc, pdf, etc.) in Azure Blob storage. However, all of them are stored there with the content type 'octet-stream', and when I open the files to extract their text with Tika, Tika can't detect the character encoding. How can I get around this problem?

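import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Locate the blob through the Hadoop FileSystem API
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

// try-with-resources closes the stream even if parsing throws
try (InputStream stream = fs.open(pt)) {
    parser.parse(stream, handler, metadata);
}

// Append the extracted text to the caller's buffer
spaceContentBuffer.append(handler.toString());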

Recommended answer

If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header via the Set Blob Properties operation.
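As a rough sketch of what that REST call could look like from Java, the snippet below only builds the Set Blob Properties request with the JDK's java.net.http client (Java 11+) and does not send it. The class name, account, container, blob name, SAS token, and the x-ms-version value are all placeholders assumed for illustration. Authenticating with a SAS token on the query string is just one option.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) a Set Blob Properties request.
class SetBlobContentType {

    // All parameters are hypothetical placeholders; sasToken is assumed to be
    // an already URL-encoded SAS query string that grants write permission.
    static HttpRequest build(String account, String container, String blob,
                             String sasToken, String contentType) {
        URI uri = URI.create("https://" + account + ".blob.core.windows.net/"
                + container + "/" + blob + "?comp=properties&" + sasToken);
        return HttpRequest.newBuilder(uri)
                .method("PUT", HttpRequest.BodyPublishers.noBody())
                .header("x-ms-version", "2019-12-12") // assumed service version
                .header("x-ms-blob-content-type", contentType)
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("myaccount", "mycontainer", "report.pdf",
                "sv=2019-12-12&sig=PLACEHOLDER", "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Sending it would be a matter of passing the request to HttpClient.newHttpClient().send(...). One thing to watch for: Set Blob Properties replaces the blob's other HTTP properties as well, so any you want to keep (cache control, content encoding, and so on) would need to be included in the same request.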

If you are using the Azure Storage Client Library, you can write code similar to the following:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
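
The snippet above hard-codes "text/xml", but a container holding a mix of txt, doc, and pdf files needs a different value per blob. Here is a minimal sketch of choosing one from the blob name, assuming the names keep their file extensions. The class name and the lookup table are hypothetical and not part of any Azure or Tika API, and it is written in Java to match the question rather than the client library shown above.

```java
import java.util.Locale;
import java.util.Map;

// Sketch only: picks a Content-Type from a blob name's extension.
class ContentTypeGuesser {

    // Deliberately tiny, hand-written table; extend it for your own file types.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "pdf", "application/pdf",
            "doc", "application/msword",
            "xml", "text/xml");

    static final String FALLBACK = "application/octet-stream";

    static String guess(String blobName) {
        int dot = blobName.lastIndexOf('.');
        if (dot < 0 || dot == blobName.length() - 1) {
            return FALLBACK; // no extension to go on
        }
        String ext = blobName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.txt", "Report.PDF", "letter.doc", "blob"}) {
            System.out.println(name + " -> " + guess(name));
        }
    }
}
```

The returned string is what you would assign as the blob's content type when uploading or when updating its properties. Unknown or missing extensions fall back to application/octet-stream, so such blobs stay as they are today.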
