在 Python 中搜索连字符的所有 Unicode 变体 [英] Searching for all Unicode variation of hyphens in Python

查看:44
本文介绍了在 Python 中搜索连字符的所有 Unicode 变体的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在尝试从转换为文本文件的 PDF 中提取某些文本.PDF 来自各种来源,我不知道它们是如何生成的.

I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.

我试图提取的模式是一个简单的两位数字,后跟一个连字符,然后是另外两位数字,例如12-34.所以我写了一个简单的正则表达式 \d\d-\d\d 并希望它能工作.

The pattern I was trying to extract was a simply two digits, follows by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.

然而,当我测试它时,我发现它错过了一些点击.后来我注意到至少有两个连字符表示为 \u2212\xad.所以我将我的正则表达式更改为 \d\d[-\u2212\xad]\d\d 并且它起作用了.

However when I test it I found that it missed some hits. Later I noted that there are at least two hyphens represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.

我的问题是,由于我要提取这么多 PDF 以至于我不知道还有哪些其他连字符变体,是否有任何正则表达式涵盖所有连字符",希望看起来比 [-\u2212\xad] 表达式?

My question is, since I am going to extract so many PDF that I don't know what other variations of hyphen are out there, is there any regex expression covering all "hyphens", and hopefully looks better than the [-\u2212\xad] expression?

推荐答案

您在问题标题中要求的解决方案暗示了一种白名单方法,这意味着您需要找到您认为是的字符类似于连字符.

The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.

您可以参考标点符号、破折号类别,Unicode 类别列出了所有可能的 Unicode 连字符.

You may refer to the Punctuation, Dash Category, that Unicode cateogry lists all the Unicode hyphens possible.

您可以使用 PyPi 正则表达式模块并使用 \p{Pd} 匹配任何 Unicode 连字符的模式.

You may use a PyPi regex module and use \p{Pd} pattern to match any Unicode hyphen.

或者,如果您只能使用 re,请使用

Or, if you can only work with re, use

[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]

您可以使用其他在 Unicode 名称中包含 minus 的 Unicode 字符扩展此列表,请参阅 这个列表.

You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.

黑名单 方法意味着您不希望在两对数字之间匹配特定字符.如果你想匹配任何非空格,你可以使用 \S.如果要匹配任何标点或符号,请使用 (?:[^\w\s]|_).

A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).

这篇关于在 Python 中搜索连字符的所有 Unicode 变体的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆