python正则表达式查找带重音的单词 [英] python regex to find accented words

查看:62
本文介绍了python正则表达式查找带重音的单词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

请我需要帮助.尝试在文本(西班牙语)中查找带重音的单词时遇到问题.我必须在大文本中搜索以Nombre vernáculo"字样开头的第一段
例如,文本是这样的:Nombre vernáculo registrado en la zona de ..."
但是我的python脚本无法识别重音词.

Please I need help. I've got a problem when trying to find accented words in a text (in Spanish). I have to search in a large text the first paragraph starting with the words 'Nombre vernáculo'
For example, the text is like: "Nombre vernáculo registrado en la zona de ..."
But accented words are not recoginzed by my python script.

我尝试过:

re.compile('/(?<!\p{L})(vern[áa]culo*)(?!\p{L})/')
re.compile(r'Nombre vern[a\xc3\xa1]culo\.', re.UNICODE)
re.compile ('[A-Z][a-záéíóúñ]+')
\p{Lu}] [\p{Ll}]+ \b

我已阅读以下主题:

grep/regex 找不到带重音的词
带有重音字符的 Python 正则表达式奇怪行为
Python 正则表达式和重音表达式
Python:使用带有重音字符的正则表达式和标记(负回顾)

我还发现了一些几乎有用的东西:

Also I found something that almost work:

In [95]: dd=re.search(r'^\w.*', 'Nombre vernáculo' )
In [96]: dd.group(0)
Out[96]: 'Nombre vern\xc3\xa1culo'

但它也会返回文本中所有带重音的单词.

But it also returns all accented words in the text.

对此的任何帮助将不胜感激.谢谢.

Any help with this will be appreciaded. Thanks.

推荐答案

最简单的方法与您在 Python 3 中的方法相同.这意味着您必须明确使用 unicode 而不是 str 对象,包括 u 前缀的字符串文字.而且,理想情况下,在文件顶部有一个显式的编码声明,这样您也可以用 Unicode 编写文字.

The simplest way to do this is the same way you'd do it in Python 3. This means you have to explicitly use unicode instead of str objects, include u-prefixed string literals. And, ideally, an explicit coding declaration at the top of your file so you can write the literals in Unicode as well.

# -*- coding: utf-8 -*-

import re

pattern = re.compile(ur'Nombre vern[aá]culo'`)
text = u'Nombre vernáculo'
match = pattern.search(text)
print match

请注意,我在模式末尾省略了 \..您的文本不以 . 结尾,因此您不应该寻找一个,否则会失败.

Notice that I left off the \. on the end of the pattern. Your text doesn't end in a ., so you shouldn't be looking for one, or it's going to fail.

当然,如果你想搜索除源代码之外的其他地方的文本,你需要decode('utf-8'),或者io.opencodecs.open 文件,而不仅仅是 open

Of course if you want to search text that comes from somewhere besides your source code, you'll need to decode('utf-8') it, or io.open or codecs.open the file instead of just open, etc.

如果您不能使用编码声明,或者不能相信您的文本编辑器能够处理 UTF-8,您仍然可以使用 Unicode 字符串,只需使用它们的 Unicode 代码点对字符进行转义:

If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings, just escape the characters with their Unicode code points:

import re

pattern = re.compile(ur'Nombre vern[a\xe1]culo'`)
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match

<小时>

如果您必须使用 str,那么您必须手动编码为 UTF-8 并转义单个字节,就像您尝试做的那样.但是现在您不是要匹配单个字符,而是要匹配多字符序列 \xc3\xa1.所以你不能使用字符类.取而代之的是,您已将其明确地写成一组交替:


If you have to use str, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character, but a multi-character sequence, \xc3\xa1. So you can't use a character class. Instead, you have write it out explicitly as a group with alternation:

pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match

这篇关于python正则表达式查找带重音的单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆