Convert hexadecimal character (ligature) to UTF-8 character


Question

I have some text content that was converted from a PDF file. There are some unwanted characters in the text, and I want to convert them to UTF-8 characters.

For instance, 'Artificial Immune System' comes out as 'Artiﬁcial Immune System': the 'ﬁ' is converted as if it were a single character. I used gdex to learn the ASCII value of the character, but I don't know how to replace it with the real value throughout the content.

Recommended answer

I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).

In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:

>>> import unicodedata
>>> unicodedata.name('\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", 'Arti\uFB01cial Immune System')
'Artificial Immune System'
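
NFKD does more than split ligatures, which is worth seeing before applying it to a whole document. The following is a minimal Python 3 sketch on a made-up sample string (the string and variable names are illustrative, not from the question). It shows that NFKD also decomposes accented letters into a base letter plus a combining mark and folds compatibility characters such as superscript digits, while NFKC applies the same compatibility folding but recomposes the accents.

```python
import unicodedata

# Illustrative sample: a ligature (U+FB01), an accented letter (U+00E9)
# and a superscript two (U+00B2).
text = "Arti\uFB01cial caf\u00E9 x\u00B2"

nfkd = unicodedata.normalize("NFKD", text)
nfkc = unicodedata.normalize("NFKC", text)

# NFKD: ligature split, accent separated into 'e' + U+0301, superscript folded to '2'.
print(nfkd == "Artificial cafe\u0301 x2")  # True

# NFKC: ligature split, accent kept as one code point, superscript folded to '2'.
print(nfkc == "Artificial caf\u00E9 x2")   # True
```

If folding superscripts or other compatibility characters is not acceptable for your text, neither form is a drop-in fix.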

So normalizing your strings with NFKD should help you along. If you find that this splits too much, then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:

>>> ligatures = {0xFB00: 'ff', 0xFB01: 'fi'}
>>> 'Arti\uFB01cial Immune System'.translate(ligatures)
'Artificial Immune System'
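
The mapping-table approach can be wrapped in a small helper for reuse across a whole document. This is only a sketch for Python 3: the names `LIGATURES`, `LIGATURE_TABLE` and `replace_ligatures` are hypothetical, and the table covers just the five f-ligatures at U+FB00 to U+FB04, so extend or trim it to match what actually appears in your text.

```python
# Hypothetical mapping table (an assumption, not an exhaustive list).
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}

# str.maketrans converts the single-character keys into the
# code-point keys that str.translate expects.
LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Replace only the ligatures listed in LIGATURES; leave all else untouched."""
    return text.translate(LIGATURE_TABLE)


sample = "Arti\uFB01cial e\uFB00ort, caf\u00E9"
cleaned = replace_ligatures(sample)
print(cleaned)  # Artificial effort, café
```

Unlike NFKD, this leaves accented letters and other compatibility characters exactly as they were.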

See the Wikipedia article to get ...
