Convert hexadecimal character (ligature) to utf-8 character

Views: 97 Published: 2021/4/21 20:20:33 Tags: python pdf character ligature


Question

I have text content that was converted from a PDF file. There are some unwanted characters in the text, and I want to convert them to UTF-8 characters.

For instance, 'Artificial Immune System' comes out as 'Artiﬁcial Immune System'. The ﬁ is converted as a single character. I used gdex to find the ASCII value of that character, but I don't know how to replace it with the real value throughout the content.

Recommended answer

I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better-looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "fi" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).
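To check which such characters a converted text actually contains, something like the following may help. It is only a sketch for Python 3; the helper name describe_non_ascii and the sample string are made up for illustration and are not part of the original question.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    # unicodedata.name() takes a default that is returned for unnamed characters
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in seen]

sample = "Arti\uFB01cial Immune System"  # hypothetical sample containing the fi ligature
for ch, code, name in describe_non_ascii(sample):
    print(code, name)  # prints: U+FB01 LATIN SMALL LIGATURE FI
```

Running this over the whole extracted text gives a short list of the characters that need attention before deciding how to replace them.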

In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:

>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
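One thing to keep in mind with the NFKD call above: it is a compatibility decomposition, so it rewrites more than ligatures. The sketch below (Python 3, where the u'' prefix is optional; the sample text is invented for illustration) compares NFKD with NFKC on a string that also contains an accented letter and a superscript digit.

```python
import unicodedata

# fi ligature, e with acute accent, superscript two
text = "Arti\uFB01cial caf\u00E9 x\u00B2"

nfkd = unicodedata.normalize("NFKD", text)
nfkc = unicodedata.normalize("NFKC", text)

# ascii() makes the otherwise invisible differences explicit
print(ascii(nfkd))  # 'Artificial cafe\u0301 x2'
print(ascii(nfkc))  # 'Artificial caf\xe9 x2'
```

Both forms split the ligature and flatten the superscript. NFKD additionally leaves the accent as a separate combining character, while NFKC recomposes it, which may be preferable if the text contains accented letters.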

So normalizing your strings with NFKD should help you along. If you find that this splits too much, then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:

>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
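The mapping table does not have to be typed by hand. As a sketch (Python 3; the names LIGATURES and split_ligatures are hypothetical, and restricting the table to U+FB00..U+FB06, the Latin ligatures in the Alphabetic Presentation Forms block, is an assumption about which characters matter), the table can be generated from NFKD one character at a time and then applied with translate, leaving every other character untouched:

```python
import unicodedata

# Map each Latin ligature code point (U+FB00..U+FB06) to its NFKD expansion,
# e.g. 0xFB01 -> "fi", 0xFB03 -> "ffi"
LIGATURES = {cp: unicodedata.normalize("NFKD", chr(cp)) for cp in range(0xFB00, 0xFB07)}

def split_ligatures(text):
    """Replace only the ligatures listed in LIGATURES; leave all other characters as they are."""
    return text.translate(LIGATURES)

print(split_ligatures("Arti\uFB01cial caf\u00E9 o\uFB03ce"))  # Artificial café office
```

Unlike normalizing the whole string, this keeps accented letters and other compatibility characters exactly as they appear in the extracted text.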

See the Wikipedia article for ...
