Extract symbol characters from docx


Problem description

I'm developing a Java program that processes the XML content of docx files and converts it to a specific format. It works quite well, but I have problems if the Word file contains Symbol characters, e.g. Greek letters. In that case I see only little squares.

I checked the source and saw something like this:

<w:r w:rsidRPr="008E65F6"><w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol"/></w:rPr><w:t>ďˇ</w:t></w:r>

Or, if I set the encoding to UTF-8:

<w:r w:rsidRPr="008E65F6"><w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol"/></w:rPr><w:t></w:t></w:r>

When I view the file as hex, it seems that the Greek characters are encoded as EF 81 A1 for alpha, EF 81 A2 for beta, and so on.

I also tried val.getBytes(Charset.forName("utf8")), where val is the value of the <w:t> tag. The result is, for example, [-17, -127, -95]. The negative values surprised me.

So my question is: what is a safe and reliable way to convert these symbols to regular UTF-8 characters?

Recommended answer

Meanwhile, I have found the solution, so I am adding it as an answer for future reference.

I inspected the Symbol font with a glyph viewer and realized that it places its characters in the Private Use Area of Unicode. Other fonts, such as Times New Roman, store the same characters (e.g. Greek letters) in the normal Unicode ranges.
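The bytes quoted in the question can be checked against this observation. The following is a minimal sketch (the class and method names are made up for illustration) that decodes EF 81 A1 and asks Java whether the resulting code point is classified as private use. It also shows why getBytes returned negative numbers: Java's byte type is signed, so 0xEF prints as -17.

```java
import java.nio.charset.StandardCharsets;

public class PuaCheck {
    // Decodes UTF-8 bytes and returns the first code point of the result.
    static int firstCodePoint(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.codePointAt(0);
    }

    // True if Unicode classifies the code point as private use (general category Co).
    static boolean isPrivateUse(int cp) {
        return Character.getType(cp) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // Same three bytes as [-17, -127, -95] from the question.
        byte[] alpha = {(byte) 0xEF, (byte) 0x81, (byte) 0xA1};
        int cp = firstCodePoint(alpha);
        System.out.printf("U+%04X privateUse=%b%n", cp, isPrivateUse(cp));
        // Mask with 0xFF to view a signed Java byte as its unsigned value.
        System.out.printf("%d -> %02X%n", alpha[0], alpha[0] & 0xFF);
    }
}
```

The first line printed is U+F061 privateUse=true, which is why a font other than Symbol has no glyph to show and renders a square.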

So the solution is to map the Symbol glyphs to standard Unicode characters. I created a conversion table by hand for the Greek letters (upper and lower case), punctuation, digits and mathematical symbols available in the Symbol font. Note that even the order of the characters within the various ranges differs; for example, the Greek alphabet is not in the same order in Symbol as in Unicode. So I had to check the character codes one by one.
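To illustrate the ordering problem, here is a small sample of such a table. It is only a sketch, not the full hand-made table: the five entries reflect my understanding of the Symbol encoding (letters follow the Latin keyboard, so "c" is chi and "g" is gamma) and should be verified in a glyph viewer. The 0xF000 offset is an assumption that matches the bytes shown in the question, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolSample {
    // Assumed offset: the question's bytes decode to 0xF000 + the font's own code.
    static final int PUA_OFFSET = 0xF000;

    // A few illustrative entries: Symbol code point -> standard Unicode code point.
    static Map<Integer, Integer> sampleTable() {
        Map<Integer, Integer> table = new HashMap<>();
        table.put(PUA_OFFSET + 0x61, 0x03B1); // 'a' position -> alpha
        table.put(PUA_OFFSET + 0x62, 0x03B2); // 'b' position -> beta
        table.put(PUA_OFFSET + 0x63, 0x03C7); // 'c' position -> chi, not gamma
        table.put(PUA_OFFSET + 0x64, 0x03B4); // 'd' position -> delta
        table.put(PUA_OFFSET + 0x67, 0x03B3); // 'g' position -> gamma
        return table;
    }

    public static void main(String[] args) {
        // Print each pair as hex so the non-linear ordering is visible.
        sampleTable().forEach((from, to) ->
                System.out.printf("U+%04X -> U+%04X%n", from, to));
    }
}
```

Because consecutive Symbol codes do not map to consecutive Unicode code points, a fixed arithmetic offset cannot replace the lookup table.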

Once I had the conversion table, I stored it in a txt file. When my application finds a run in the Word file that is formatted with the Symbol font (the <w:rFonts> tag in the example), it calls the conversion method. In this method, I parse the txt file into a HashMap and replace the characters one by one, from Symbol code to Unicode:

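// Replaces each Symbol-font (Private Use Area) character with its Unicode equivalent.
public String convert(String symbolString) {
    StringBuilder sb = new StringBuilder();

    // Advance by code point rather than by char so surrogate pairs are not split.
    for (int k = 0; k < symbolString.length(); ) {
        int origCode = symbolString.codePointAt(k);
        Integer replaceCode = conversionTable.get(origCode);
        if (replaceCode != null) {
            sb.appendCodePoint(replaceCode);
        } else {
            // No mapping is known for this character.
            sb.append('?');
        }
        k += Character.charCount(origCode);
    }

    return sb.toString();
}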

Here conversionTable is the HashMap object containing the replacement codes, which are stored in the txt file as hex values.
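The layout of the txt file is not shown above, so the following sketch assumes a hypothetical format: one pair of hex values per line (Symbol code, then Unicode code), separated by whitespace, with lines starting with # treated as comments. The class and method names are likewise illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TableLoader {
    // Parses lines such as "F061 03B1" into a Symbol -> Unicode code point map.
    static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> table = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\\s+");
            if (parts.length < 2) {
                continue; // skip lines that do not hold a pair
            }
            // Both columns are hexadecimal, hence radix 16.
            table.put(Integer.parseInt(parts[0], 16), Integer.parseInt(parts[1], 16));
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("# symbol unicode", "F061 03B1", "F062 03B2");
        Map<Integer, Integer> table = parse(sample);
        // Look up the character stored for alpha in the question.
        System.out.println(new String(Character.toChars(table.get(0xF061))));
    }
}
```

In a real application the lines could come from Files.readAllLines on the table file, and the map would be built once and reused rather than re-parsed on every call to convert.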
