Is Apache Tika able to extract foreign languages like Chinese and Japanese?

Question

Is Apache Tika able to extract text in foreign languages such as Chinese and Japanese?

I have the following code:

    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    InputStream stream = new ByteArrayInputStream(bytes);
    OutputStream outputstream = new ByteArrayOutputStream();
    ContentHandler textHandler = new BodyContentHandler(outputstream);
    Metadata metadata = new Metadata();
    // Set<String> langs = LanguageIdentifier.getSupportedLanguages();
    // metadata.set(Metadata.CONTENT_LANGUAGE, lang);
    // metadata.set(Metadata.FORMAT, hint);
    ParseContext context = new ParseContext();
    try {
        parser.parse(stream, textHandler, metadata, context);
        String extractedText = outputstream.toString();
        return extractedText;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }

If the input is a .doc file that contains Chinese characters, each Chinese character is extracted as "?".

Thanks a lot!

Recommended answer

Apache Tika is able to extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.

Tika also includes a number of unit tests that verify this works. One such test uses a sample Chinese email (testMSG_chinese.msg). If we run the command-line Tika app on it and grab the first few lines, we can see it working:

    $ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
    Alfresco MSG format testing ( MSG 格式測試 )
        From
        Tests Chang@FT (張毓倫)
        To
        Tests Chang@FT (張毓倫)
        Recipients
        tests.chang@fengttt.com

Or with this Japanese document (testRTFJapanese.rtf):

    $ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
    ゾルゲの処刑記録、
    ゾルゲと尾崎、淡々と最期

You just need to make sure that any text output you generate is stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
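
As a minimal sketch of the encoding point above, using only the Java standard library (it does not call Tika, and the class name `EncodingDemo` and the method `roundTrip` are made up for illustration): text is written through an `OutputStreamWriter` and the resulting bytes are decoded again. With a charset that cannot represent CJK characters, such as US-ASCII, each character comes back as "?", which resembles the symptom in the question. With UTF-8 on both sides the text survives. Whether this is what happened in the question depends on the platform default charset there, which the post does not state.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of Tika.
final class EncodingDemo {

    // Encodes the text into a byte buffer with writeCharset,
    // then decodes those bytes back into a String with readCharset.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, writeCharset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), readCharset);
    }

    public static void main(String[] args) {
        // "MSG " followed by two Chinese characters (U+683C, U+5F0F),
        // written as escapes so the source file encoding does not matter.
        String sample = "MSG \u683c\u5f0f";

        // US-ASCII cannot represent the two Chinese characters, so they are replaced.
        String lossy = roundTrip(sample, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        // UTF-8 on both the write and read side preserves the text.
        String intact = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("US-ASCII round trip equals input: " + lossy.equals(sample));
        System.out.println("UTF-8 round trip equals input: " + intact.equals(sample));
    }
}
```

In the question's code, both `BodyContentHandler(OutputStream)` and `ByteArrayOutputStream.toString()` fall back to the platform default encoding, so using an explicit UTF-8 charset at both ends is one likely fix.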
