How to convert unicode accented characters to pure ascii without accents?

Views: 174 Published: 2020/7/12 18:48:25 Tags: python unicode wget unicode-normalization

Question

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t

The problem I'm having is that the original paragraph has all those squiggly lines, reversed letters, and such, so when I read the local files I end up with funny escape characters like \x85, \xa7, \x8d, etc.

My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters? For example, if there is an 'à', how do I convert it into a standard 'a'?

Python calling code:

import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)

I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me, Linux people, it was a client requirement), and the wget exe is being fired off from a Python 2.6 script file.

Recommended answer

how do I convert all those escape characters into their respective characters, like if there is a unicode à, how do I convert that into a standard a?

Assuming you have loaded your unicode text into a variable called my_unicode, normalizing à into a is this simple:

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
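The snippet above assumes my_unicode already holds decoded text rather than raw bytes. As a minimal sketch of getting there from a file saved by wget, the following decodes the file while reading it and then applies the same normalization. The file name and its contents are hypothetical stand-ins, and UTF-8 is an assumption: the original post does not say which encoding the downloaded page uses.

```python
import io
import os
import tempfile
import unicodedata

# Hypothetical stand-in for a page saved by wget: a small UTF-8 encoded file.
path = os.path.join(tempfile.mkdtemp(), 'apple-dict.html')
with io.open(path, 'wb') as f:
    f.write(u'voil\xe0'.encode('utf-8'))

# Assumption: the file is UTF-8; pass the page's real encoding if it differs.
with io.open(path, encoding='utf-8') as f:
    my_unicode = f.read()

# Same two steps as above: decompose, then drop everything that is not ASCII.
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
print(output)  # b'voila' on Python 3, 'voila' on Python 2
```

io.open is used here because it accepts an encoding argument on both Python 2.6 and Python 3. If the file is decoded with the wrong encoding, the accented letters will not be recognized as such and the normalization step cannot recover them.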

An explicit example...

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>

How it works
unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text, splitting each accented character into its base letter followed by combining marks (à becomes a plus U+0300, the combining grave accent). Calling .encode('ascii', 'ignore') on the result then transforms the NFD-mapped characters into ascii, ignoring errors, so the non-ascii combining marks are dropped and only the base letters remain.
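To make the two steps visible, here is a small sketch that wraps them in a helper and prints the code points produced by the decomposition. The name strip_accents is a hypothetical helper introduced for illustration, not part of unicodedata, and the final decode step is an addition so that the result is text rather than bytes on Python 3.

```python
import unicodedata


def strip_accents(text):
    """Return text with accents removed, keeping only ASCII characters."""
    # NFD splits each accented letter into a base letter plus combining marks,
    # e.g. U+00E0 becomes 'a' followed by U+0300 (combining grave accent).
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' silently drops every non-ASCII
    # code point, including the combining marks left by the decomposition.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Decode back so the result is text rather than bytes (matters on Python 3).
    return ascii_bytes.decode('ascii')


decomposed = unicodedata.normalize('NFD', u'\xe0')
print([hex(ord(ch)) for ch in decomposed])  # ['0x61', '0x300']
print(strip_accents(u'\xe0 la carte'))      # a la carte
```

One consequence of the 'ignore' error handler is worth keeping in mind: a character with no canonical decomposition, such as ø (U+00F8), is not mapped to a base letter and is dropped from the output entirely.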
