Python + Regex + UTF-8 doesn't recognize accents

Views: 103  Published: 2020/7/13 4:26:15  Tags: python regex utf-8

Question

My problem is that Python, using regex and re.search(), doesn't recognize accented characters even though I use UTF-8. Here is my code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Python 2: this literal is a byte string holding UTF-8 encoded bytes.
htmlString = '</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies.'

SearchStr = '(\<\/dd\>\<dt\>)+ ([\w+\,\.\s]+)([\&\#\d\;]+)(\<\/dt\>\<dd\>)+ (\w+) (\w+)'

Result = re.search(SearchStr, htmlString)

if Result:
    print Result.groups()

passavol23:jO$ catalanword.py
('</dd><dt>', 'Fine, thank you.', '&#160;', '</dt><dd>', 'Molt', 'b')

So the problem is that it doesn't recognize the é and stops there. Any help would be appreciated. I'm a Python beginner.

Recommended answer

By default, \w only matches ASCII characters; it is equivalent to [a-zA-Z0-9_]. Matching UTF-8 bytes with regular expressions is hard enough on its own, let alone matching only word characters: you would have to match byte ranges instead.

You'll need to decode from UTF-8 to unicode and use the re.UNICODE flag instead:

>>> re.search(SearchStr, htmlString.decode('utf8'), re.UNICODE).groups()
(u'</dd><dt>', u'Fine, thank you.', u'&#160;', u'</dt><dd>', u'Molt', u'b\xe9')
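
The same distinction can be reproduced in Python 3, where the Python 2 byte string from the question corresponds to a bytes object. The following is a minimal sketch, assuming Python 3 and a shortened pattern rather than the asker's full expression; the variable names are illustrative only.

```python
import re

# 'é' and 'à' are written as escapes so the snippet does not depend on
# the encoding of the source file.
text = 'Molt b\u00e9, gr\u00e0cies.'
raw = text.encode('utf-8')  # UTF-8 bytes, e.g. as read from a file or socket

# Bytes pattern: \w is ASCII-only, so the match stops before the bytes of 'é'.
bytes_match = re.search(rb'(\w+) (\w+)', raw)
print(bytes_match.groups())  # (b'Molt', b'b')

# str pattern on decoded text: \w is Unicode-aware by default in Python 3,
# so no re.UNICODE flag is needed.
text_match = re.search(r'(\w+) (\w+)', raw.decode('utf-8'))
print(text_match.groups())  # ('Molt', 'bé')
```

In both Python versions the fix is the same idea: decode the bytes first, then match against text.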

However, you should really be using an HTML parser to deal with HTML. BeautifulSoup, for example, will handle encoding and Unicode correctly for you.
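
As a rough illustration of the parser approach, here is a sketch that uses the standard library's html.parser in Python 3 rather than BeautifulSoup, so that it runs without installing anything. The class DefinitionListParser and the function extract_pairs are hypothetical helpers written for this example, not part of any library.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collects the text found inside <dt> and <dd> elements."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#160; into characters.
        super().__init__(convert_charrefs=True)
        self.items = []   # (tag, text) pairs in document order
        self._tag = None  # the <dt>/<dd> element currently open, if any
        self._buf = []    # text chunks collected for that element

    def _flush(self):
        if self._tag is not None:
            # strip() also removes the non-breaking space produced by &#160;.
            self.items.append((self._tag, ''.join(self._buf).strip()))
        self._tag, self._buf = None, []

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ('dt', 'dd'):
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit an element left open at the end of the input


def extract_pairs(html):
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.items


html = '</dd><dt> Fine, thank you.&#160;</dt><dd> Molt b\u00e9, gr\u00e0cies.'
print(extract_pairs(html))
# [('dt', 'Fine, thank you.'), ('dd', 'Molt bé, gràcies.')]
```

The parser works on decoded text and treats tags and character references structurally, so the accented words come through whole without any character-class tuning.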
