How do I get a regular expression to recognize non-ASCII characters as letters?

Question

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I'm correct, the regexes are removing the öäå, right? I mean, the only thing that should be left of the hex char after the regex is the x, but there is no x in place of the öäå on my page, so this little theory may not be correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is UTF-8 and the encoding on my site is also UTF-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.

Recommended answer

Always work in unicode and only convert to an encoded representation when necessary.
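As a rough illustration of that advice, here is a minimal sketch that decodes once at the input boundary, does all processing on text, and encodes once at the output boundary. It assumes Python 3 (the answer's own snippet is Python 2), and the sample string and variable names are made up for the example.

```python
import re

# Hypothetical sample input: UTF-8 bytes as they might arrive from a web page.
raw = "Malmö, Skåne".encode("utf-8")

# Decode once, as early as possible, so everything downstream sees text.
text = raw.decode("utf-8")

# Do the processing on text. Here: drop everything except word characters and spaces.
cleaned = re.sub(r"[^\w ]+", "", text)

# Encode once, as late as possible, when bytes are needed for output.
out = cleaned.encode("utf-8")

print(cleaned)
```

Keeping the decode and encode steps at the edges means the regular expression never sees raw UTF-8 bytes, which is where multi-byte characters such as ö, ä and å tend to get stripped.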

For this particular situation, you also need to use the re.U flag so \w matches unicode letters:

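# coding: utf-8
# Python 2 example: the byte-string literal is decoded to unicode first.
import re

location = "öäå".decode('utf-8')
# re.U makes \w match Unicode letters, so öäå survive the substitution.
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location  # prints öäå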
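For readers on Python 3, which the answer above does not cover, the situation is reversed: str patterns are Unicode-aware by default, and the re.ASCII flag is what restricts \w to ASCII. The sketch below is only an illustration with a made-up sample string; it keeps spaces as well as word characters.

```python
import re

# Hypothetical sample value; in the question it would come from Beautiful Soup.
location = "Malmö, Skåne!"

# Default behaviour for str patterns in Python 3: \w matches Unicode letters.
unicode_aware = re.sub(r"[^\w ]+", "", location)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so ö and å are removed as well.
ascii_only = re.sub(r"[^\w ]+", "", location, flags=re.ASCII)

print(unicode_aware)
print(ascii_only)
```

The second call reproduces the symptom described in the question: the Swedish letters vanish because the pattern no longer counts them as word characters.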
