Find names with Regular Expression


Problem description

For finding names in a big text I have the following regex

([A-Z][a-z]*)[\s-]([A-Z][a-z]*)

This works fine for normal names like "Jack Oneill" or "John Guidetti". But there are a few possibilities that I want to find, but cannot. Like:

Chandler Murial Bing
Gandalf the Gray
Pieter van den Woude

I cannot seem to get this right with my limited knowledge of Regular Expressions. Can anyone help me (and please provide a good website / book for this) :)
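To make the gap concrete, here is a minimal sketch that runs the expression above against the example names. Python is an assumption (the question does not name a language), and `first_match` is a hypothetical helper introduced only for this illustration.

```python
import re

# The expression from the question, unchanged.
ORIGINAL = re.compile(r"([A-Z][a-z]*)[\s-]([A-Z][a-z]*)")

def first_match(text):
    """Return the first matched substring, or None when nothing matches."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None

for name in ["Jack Oneill", "Chandler Murial Bing",
             "Gandalf the Gray", "Pieter van den Woude"]:
    print(f"{name!r} -> {first_match(name)!r}")
# 'Jack Oneill' -> 'Jack Oneill'
# 'Chandler Murial Bing' -> 'Chandler Murial'
# 'Gandalf the Gray' -> None
# 'Pieter van den Woude' -> None
```

The expression only ever pairs two adjacent capitalized words, so a third word is dropped and any lowercase word in the middle prevents a match.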

Solution

The best way to approach a regular expression problem is to describe the matches you are looking for (usually called a grammar).

For example, from your question, I might describe it like the following:

  1. A capitalized word is defined as one capital letter followed by 1+ lowercase letters, or one capital letter followed by a . (an initial).
  2. An uncapitalized word is defined as 1 lowercase letter followed by 1+ lowercase letters/dashes (not perfect, because that could allow ending in a dash).
  3. The first word is a capitalized word.
  4. The last word is a capitalized word.
  5. 0+ capitalized words follow the first word.
  6. Then 0-2 uncapitalized words come between those capitalized words and the last word.
  7. At least two words.
  8. Words are separated by whitespace.

If this provides a reasonably close match to the desired result set (and to be clear, for names, there are so many variations that you will either have false positives or false negatives), then you begin constructing the expression:

  1. Capitalized word: [A-Z]([a-z]+|\.)
  2. Uncapitalized word: [a-z][a-z\-]+

Result:

 [A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
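As a sketch of how the two building blocks combine into that result, the pieces can be kept as named strings and assembled in code. Python is an assumption here, the names `CAP`, `LOW` and `NAME` are hypothetical, and I have swapped the capturing groups for non-capturing ones so that only the whole match matters; the set of strings matched is unchanged.

```python
import re

# Building blocks from the list above. Non-capturing groups (?:...) are a
# substitution made for this sketch; they match the same text.
CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word, or an initial such as "A."
LOW = r"[a-z][a-z\-]+"        # uncapitalized word (two or more characters)

# first word, 0+ capitalized words, 0-2 uncapitalized words, last word
NAME = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{LOW}){{0,2}}\s+{CAP}")

for name in ["Chandler Murial Bing", "Gandalf the Gray", "Pieter van den Woude"]:
    print(name, bool(NAME.fullmatch(name)))
# Chandler Murial Bing True
# Gandalf the Gray True
# Pieter van den Woude True
```

Keeping the pieces separate makes it easier to tighten one rule (for example, the uncapitalized word) without retyping the whole expression.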

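Matches in the sample text below (Chandler Muriel Bing, Pieter van den Woude, A. A. Milne, Gandalf the Gray, and Friends Cast and Crew):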

Hello my name is Chandler Muriel Bing. I have a friend who is named Pieter van den Woude and he has another friend, A. A. Milne. Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew.
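The sample can be checked with a short script. This is a sketch in Python (an assumption; any engine with the same syntax should behave alike), and `find_names` is a hypothetical helper. It uses `finditer` with `group(0)` because the expression contains capturing groups, which would make `re.findall` return the groups rather than the full names.

```python
import re

# The result expression from above, copied as written.
NAME_RE = re.compile(
    r"[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)"
)

def find_names(text):
    """Return every full match of NAME_RE in text, in order of appearance."""
    return [m.group(0) for m in NAME_RE.finditer(text)]

SAMPLE = (
    "Hello my name is Chandler Muriel Bing. I have a friend who is named "
    "Pieter van den Woude and he has another friend, A. A. Milne. "
    "Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."
)

for name in find_names(SAMPLE):
    print(name)
# Chandler Muriel Bing
# Pieter van den Woude
# A. A. Milne
# Gandalf the Gray
# Friends Cast and Crew
```

"Hello my name is Chandler" is not matched because three uncapitalized words sit between the capitalized ones, one more than the grammar allows.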

Problems:

  • Because you want to match Gandalf the Gray and Pieter van den Woude, you will inevitably match other sets that consist of names with uncapitalized words in between (Friends Cast and Crew). The above grammar attempts to limit the problem by allowing at most 2 uncapitalized words. You could also create a set of allowed uncapitalized words instead ("van", "der", "the"), and only match those words.
  • Doesn't allow for non-Latin-alphabet letters, ligatures, diacritics, etc.
  • As I and others have pointed out, regular expressions will never be perfect for this situation, but as you said, you want something to get you most of the way there. In this case, the above expression should do a pretty good job, but consider it a blunt instrument! You've been warned.
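The allow-list idea from the first problem could look like the sketch below. Python is again an assumption, the particle list is a hypothetical starting set ("den" is added so that Pieter van den Woude still matches), and `find_names_strict` is a name made up for this example.

```python
import re

CAP = r"[A-Z](?:[a-z]+|\.)"        # capitalized word, or an initial
PARTICLE = r"(?:van|den|der|the)"  # hypothetical allow-list; extend as needed

# Same grammar as before, but only allow-listed words may appear in lowercase.
NAME_STRICT = re.compile(
    rf"{CAP}(?:\s+{CAP})*(?:\s+{PARTICLE}){{0,2}}\s+{CAP}"
)

def find_names_strict(text):
    """Return every full match of NAME_STRICT in text."""
    return [m.group(0) for m in NAME_STRICT.finditer(text)]

print(find_names_strict("Gandalf the Gray joined the Friends Cast and Crew."))
# ['Gandalf the Gray', 'Friends Cast']
```

The false positive shrinks from "Friends Cast and Crew" to "Friends Cast" but does not disappear, since two adjacent capitalized words still satisfy the grammar. This remains a blunt instrument.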
