A strict regular expression for matching chemical formulae


Question


While processing a large textual chemical database with Perl, I ran into the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the answers suggested there are too loose for my requirements.

Specifically, my (admittedly limited) research led me to this posting, which gives a regex for the currently accepted chemical symbols. I'll copy it here for reference:

[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb

(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
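As a quick sanity check of the symbol pattern, here is a minimal Python sketch. It assumes the `C[...]` class includes `m` and `n` (so that Cm and Cn are accepted, as the examples above state), and it uses `fullmatch` so that a candidate only passes if the whole string is a single symbol. The names `SYMBOL` and `is_symbol` are made up for this illustration.

```python
import re

# Compact alternation for the element symbols, as quoted in the question.
# Assumption: the C[...] class contains m and n so that Cm (curium) and
# Cn (copernicium) are accepted, in line with the examples in the text.
SYMBOL = re.compile(
    r"[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|"
    r"C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|"
    r"M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|"
    r"T[abcehilms]|Xe|Yb"
)


def is_symbol(text: str) -> bool:
    """Return True only if the whole string is exactly one element symbol."""
    return SYMBOL.fullmatch(text) is not None


for candidate in ["C", "Cm", "Cn", "Cg", "Cx"]:
    print(candidate, is_symbol(candidate))
```

Note that an unanchored search would accept Cg as well, because the leading C alone is a valid symbol; anchoring the match is what makes the check strict.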

As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.

So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?

(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)

Solution

Brief

Why not build one large regex that does what you want while still keeping it readable? The regex is meant to be used together with a loop that walks over the matches for the bracket and parentheses groups.


Assumptions

I am assuming the following since the OP has not given a full list of positive and negative matches:

  • Nested parentheses aren't possible
  • Nested square brackets aren't possible
  • Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
  • Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group

If any of these assumptions are incorrect, please let me know so that I can fix the regex accordingly.


Answer

View this regex in use here

Code

(?(DEFINE)
  (?# Periodic elements )
  (?<Hydrogen>H)
  (?<Helium>He)
  (?<Lithium>Li)
  (?<Beryllium>Be)
  (?<Boron>B)
  (?<Carbon>C)
  (?<Nitrogen>N)
  (?<Oxygen>O)
  (?<Fluorine>F)
  (?<Neon>Ne)
  (?<Sodium>Na)
  (?<Magnesium>Mg)
  (?<Aluminum>Al)
  (?<Silicon>Si)
  (?<Phosphorus>P)
  (?<Sulfur>S)
  (?<Chlorine>Cl)
  (?<Argon>Ar)
  (?<Potassium>K)
  (?<Calcium>Ca)
  (?<Scandium>Sc)
  (?<Titanium>Ti)
  (?<Vanadium>V)
  (?<Chromium>Cr)
  (?<Manganese>Mn)
  (?<Iron>Fe)
  (?<Cobalt>Co)
  (?<Nickel>Ni)
  (?<Copper>Cu)
  (?<Zinc>Zn)
  (?<Gallium>Ga)
  (?<Germanium>Ge)
  (?<Arsenic>As)
  (?<Selenium>Se)
  (?<Bromine>Br)
  (?<Krypton>Kr)
  (?<Rubidium>Rb)
  (?<Strontium>Sr)
  (?<Yttrium>Y)
  (?<Zirconium>Zr)
  (?<Niobium>Nb)
  (?<Molybdenum>Mo)
  (?<Technetium>Tc)
  (?<Ruthenium>Ru)
  (?<Rhodium>Rh)
  (?<Palladium>Pd)
  (?<Silver>Ag)
  (?<Cadmium>Cd)
  (?<Indium>In)
  (?<Tin>Sn)
  (?<Antimony>Sb)
  (?<Tellurium>Te)
  (?<Iodine>I)
  (?<Xenon>Xe)
  (?<Cesium>Cs)
  (?<Barium>Ba)
  (?<Lanthanum>La)
  (?<Cerium>Ce)
  (?<Praseodymium>Pr)
  (?<Neodymium>Nd)
  (?<Promethium>Pm)
  (?<Samarium>Sm)
  (?<Europium>Eu)
  (?<Gadolinium>Gd)
  (?<Terbium>Tb)
  (?<Dysprosium>Dy)
  (?<Holmium>Ho)
  (?<Erbium>Er)
  (?<Thulium>Tm)
  (?<Ytterbium>Yb)
  (?<Lutetium>Lu)
  (?<Hafnium>Hf)
  (?<Tantalum>Ta)
  (?<Tungsten>W)
  (?<Rhenium>Re)
  (?<Osmium>Os)
  (?<Iridium>Ir)
  (?<Platinum>Pt)
  (?<Gold>Au)
  (?<Mercury>Hg)
  (?<Thallium>Tl)
  (?<Lead>Pb)
  (?<Bismuth>Bi)
  (?<Polonium>Po)
  (?<Astatine>At)
  (?<Radon>Rn)
  (?<Francium>Fr)
  (?<Radium>Ra)
  (?<Actinium>Ac)
  (?<Thorium>Th)
  (?<Protactinium>Pa)
  (?<Uranium>U)
  (?<Neptunium>Np)
  (?<Plutonium>Pu)
  (?<Americium>Am)
  (?<Curium>Cm)
  (?<Berkelium>Bk)
  (?<Californium>Cf)
  (?<Einsteinium>Es)
  (?<Fermium>Fm)
  (?<Mendelevium>Md)
  (?<Nobelium>No)
  (?<Lawrencium>Lr)
  (?<Rutherfordium>Rf)
  (?<Dubnium>Db)
  (?<Seaborgium>Sg)
  (?<Bohrium>Bh)
  (?<Hassium>Hs)
  (?<Meitnerium>Mt)
  (?<Darmstadtium>Ds)
  (?<Roentgenium>Rg)
  (?<Copernicium>Cn)
  (?<Nihonium>Nh)
  (?<Flerovium>Fl)
  (?<Moscovium>Mc)
  (?<Livermorium>Lv)
  (?<Tennessine>Ts)
  (?<Oganesson>Og)
  (?# Regex )
  (?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
  (?<Num>(?:[1-9]\d*)?)
  (?<ElementGroup>(?:(?&Element)(?&Num))+)
  (?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
  (?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
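The regex above relies on `(?(DEFINE)...)` and `(?&name)` subroutine calls, which Perl and PCRE support but Python's built-in `re` module does not. For readers who want to try the same structure elsewhere, the following is a rough Python sketch that composes the equivalent pieces by string interpolation. All names (`SYMBOLS`, `UNIT`, `PAREN`, `ITEM`, `BRACKET`, `FORMULA`, `is_formula`) are invented for this sketch. It also flattens the nested repetitions (`(?&ElementGroup)+` around a group that already repeats), which should accept the same strings while avoiding heavy backtracking on invalid input.

```python
import re

# Element symbols; two-letter symbols come before the one-letter symbol
# with the same initial, mirroring the ordering of the Element group.
SYMBOLS = """
Ac Ag Al Am Ar As At Au Ba Be Bh Bi Bk Br B
Ca Cd Ce Cf Cl Cm Cn Co Cr Cs Cu C Db Ds Dy Er Es Eu
Fe Fl Fm Fr F Ga Gd Ge He Hf Hg Ho Hs H In Ir I Kr K
La Li Lr Lu Lv Mc Md Mg Mn Mo Mt Na Nb Nd Ne Nh Ni No Np N
Og Os O Pa Pb Pd Pm Po Pr Pt Pu P Ra Rb Re Rf Rg Rh Rn Ru
Sb Sc Se Sg Si Sm Sn Sr S Ta Tb Tc Te Th Ti Tl Tm Ts
U V W Xe Yb Y Zr Zn
""".split()

ELEMENT = "(?:" + "|".join(SYMBOLS) + ")"   # counterpart of Element
NUM = r"(?:[1-9]\d*)?"                      # counterpart of Num
UNIT = ELEMENT + NUM                        # one symbol plus optional count
PAREN = rf"\((?:{UNIT})+\){NUM}"            # counterpart of ElementParenthesesGroup
ITEM = rf"(?:{UNIT}|{PAREN})"               # anything allowed inside [ ]
# Counterpart of ElementSquareBracketGroup: a parentheses group first,
# or a parentheses group last, with at least one other item.
BRACKET = rf"\[(?:{PAREN}{ITEM}+|{ITEM}+{PAREN})\]{NUM}"

FORMULA = re.compile(rf"(?:{BRACKET}|{PAREN}|{UNIT})+")


def is_formula(text: str) -> bool:
    """Return True if the whole string is a formula under the stated assumptions."""
    return FORMULA.fullmatch(text) is not None


for text in ["C2H6O", "(CH3)2CFCOO(CH2)2Si(CH3)2Cl", "CgCx"]:
    print(text, is_formula(text))
```

Because this version drops the named capture groups, it only validates a formula; it does not extract the individual groups.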

Explanation

  1. The first part of the (?(DEFINE)) section lists each periodic element (ordered by atomic number for easy lookup).
  2. The Element group is a simple alternation (|) of the elements listed in 1. Symbols are ordered alphabetically by first character, with two-letter symbols placed before one-letter symbols (so that, for example, Calcium Ca is tried before Carbon C).
  3. ElementGroup specifies a run of elements in the format: one or more Element, each optionally followed by a Num. Num is either empty or a positive integer with no leading zero.
    • Valid Examples
      • C - Element
      • CH - Element followed by another Element
      • CH3 - Element followed by another Element and a Num
      • O2 - Element followed by a Num
    • Invalid Examples
      • N0 - 0 cannot be used explicitly
      • N01 - Num group specifies the number must begin with 1-9 or not have a number
      • A - Element does not exist
      • c - Element does not exist - case sensitive regex
  4. ElementParenthesesGroup specifies an ElementGroup enclosed in parentheses ( ), optionally followed by a Num. The parentheses must contain at least one Element.
    • Valid Examples
      • (CH) - ElementGroup surrounded by parentheses
      • (CH3) - ElementGroup surrounded by parentheses
      • (CH3NO4) - multiple ElementGroup surrounded by parentheses
      • (CH3NO4)2 - multiple ElementGroup surrounded by parentheses followed by a Num
    • Invalid Examples
      • (CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
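  5. ElementSquareBracketGroup specifies a sequence of ElementParenthesesGroup and ElementGroup between square brackets [ ], optionally followed by a Num. It must contain at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup). As written, the pattern also requires an ElementParenthesesGroup to be the first or the last member inside the brackets.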
    • Valid Examples
      • [CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
      • [(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup followed by Num
      • [(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
    • Invalid Examples
      • [(NO4)] - Does not contain second group, brackets [ ] are redundant
      • [NO4] - Does not contain ElementParenthesesGroup
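The valid and invalid examples listed above can be checked mechanically. The sketch below is a Python approximation, not the original regex: it rebuilds each group as a plain string (the names `ELEMENT_GROUP`, `PAREN_GROUP`, `BRACKET_GROUP`, `CASES` and `matches` are hypothetical) and tests every example against the group it belongs to.

```python
import re

SYMBOLS = """
Ac Ag Al Am Ar As At Au Ba Be Bh Bi Bk Br B
Ca Cd Ce Cf Cl Cm Cn Co Cr Cs Cu C Db Ds Dy Er Es Eu
Fe Fl Fm Fr F Ga Gd Ge He Hf Hg Ho Hs H In Ir I Kr K
La Li Lr Lu Lv Mc Md Mg Mn Mo Mt Na Nb Nd Ne Nh Ni No Np N
Og Os O Pa Pb Pd Pm Po Pr Pt Pu P Ra Rb Re Rf Rg Rh Rn Ru
Sb Sc Se Sg Si Sm Sn Sr S Ta Tb Tc Te Th Ti Tl Tm Ts
U V W Xe Yb Y Zr Zn
""".split()

ELEMENT = "(?:" + "|".join(SYMBOLS) + ")"
NUM = r"(?:[1-9]\d*)?"
ELEMENT_GROUP = rf"(?:{ELEMENT}{NUM})+"
PAREN_GROUP = rf"\({ELEMENT_GROUP}\){NUM}"
ITEM = rf"(?:{ELEMENT}{NUM}|{PAREN_GROUP})"
BRACKET_GROUP = rf"\[(?:{PAREN_GROUP}{ITEM}+|{ITEM}+{PAREN_GROUP})\]{NUM}"


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the given group pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, example, expected result), taken from the lists above.
CASES = [
    (ELEMENT_GROUP, "C", True), (ELEMENT_GROUP, "CH", True),
    (ELEMENT_GROUP, "CH3", True), (ELEMENT_GROUP, "O2", True),
    (ELEMENT_GROUP, "N0", False), (ELEMENT_GROUP, "N01", False),
    (ELEMENT_GROUP, "A", False), (ELEMENT_GROUP, "c", False),
    (PAREN_GROUP, "(CH)", True), (PAREN_GROUP, "(CH3)", True),
    (PAREN_GROUP, "(CH3NO4)", True), (PAREN_GROUP, "(CH3NO4)2", True),
    # Same string typed with the digit 0 instead of the letter O: rejected,
    # because N0 is not a valid element-plus-count.
    (PAREN_GROUP, "(CH3N04)2", False),
    (PAREN_GROUP, "(CH[NO4])", False),
    (BRACKET_GROUP, "[CH3(NO4)]", True), (BRACKET_GROUP, "[(NO4)CH]2", True),
    (BRACKET_GROUP, "[(NO4)(CH3)]", True),
    (BRACKET_GROUP, "[(NO4)]", False), (BRACKET_GROUP, "[NO4]", False),
    # Parentheses group neither first nor last inside the brackets.
    (BRACKET_GROUP, "[C(NO4)C]", False),
]

for pattern, text, expected in CASES:
    status = "ok" if matches(pattern, text) == expected else "MISMATCH"
    print(f"{text!r:14} expected={expected!s:5} {status}")
```

A table of cases like this is also a convenient place to add your own positive and negative examples if the assumptions need to change.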

Additional Information

I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.

Ensure the following flags are set:

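  • g - ensures global matches (every formula in the input is found, not just the first)
  • x - ensures whitespace in the pattern is ignored, which the multi-line layout above depends on
  • m - multi-line mode; use it if the data spans multiple lines (separated by newline characters) so that ^ and $ anchor at each line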
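For comparison, here is how the same three settings look in Python, as a hedged illustration rather than a port of the full regex. `re.VERBOSE` plays the role of x, `re.MULTILINE` the role of m, and `finditer` stands in for g, since Python has no global flag. To keep the sketch short, the pattern (named `FORMULA_LINE` here, a made-up name) only handles flat formulas without parentheses or brackets, and it assumes the compact symbol alternation with `m` and `n` in the `C[...]` class.

```python
import re

FORMULA_LINE = re.compile(
    r"""
    ^                               # start of a line (needs re.MULTILINE)
    (?:
        (?: [BCFHIKNOPSUVWY] | [ISZ][nr] | [ACELP][ru] | A[cglmst]
          | B[aehikr] | C[adeflmnos] | D[bsy] | Es | F[elmr] | G[ade]
          | H[efgos] | Kr | L[aiv] | M[cdgnot] | N[abdehiop] | O[gs]
          | P[abdmot] | R[abe-hnu] | S[bcegim] | T[abcehilms] | Xe | Yb
        )                           # one element symbol
        (?: [1-9]\d* )?             # optional count without a leading zero
    )+
    $                               # end of the same line
    """,
    re.VERBOSE | re.MULTILINE,
)

text = "C2H6O\nCgCx\nH2O\nNaCl\n"

# finditer returns every match in the input, like the g flag does.
found = [m.group(0) for m in FORMULA_LINE.finditer(text)]
print(found)
```

Lines that are not valid formulas, such as CgCx, are skipped because both anchors must hold on the same line.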

Note: The regex only keeps the last group of type X that it finds (overwriting any previously captured group of that type). This is the default behaviour of the regex engine and there is currently no way to override it, so the captures may not be what you expect. You can see this with the last example in the linked regex, as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl, since it contains several groups of each type.
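The overwriting behaviour, and the loop mentioned in the Brief, can be illustrated with a short Python sketch. Python's `re` behaves the same way for a capture group inside a repetition: only the final match of that group is kept. The loop here is one possible approach, assumed for illustration: validate the whole string first, then walk over it piece by piece with `finditer`. The names `WHOLE`, `PIECE` and `parts` are hypothetical.

```python
import re

SYMBOLS = """
Ac Ag Al Am Ar As At Au Ba Be Bh Bi Bk Br B
Ca Cd Ce Cf Cl Cm Cn Co Cr Cs Cu C Db Ds Dy Er Es Eu
Fe Fl Fm Fr F Ga Gd Ge He Hf Hg Ho Hs H In Ir I Kr K
La Li Lr Lu Lv Mc Md Mg Mn Mo Mt Na Nb Nd Ne Nh Ni No Np N
Og Os O Pa Pb Pd Pm Po Pr Pt Pu P Ra Rb Re Rf Rg Rh Rn Ru
Sb Sc Se Sg Si Sm Sn Sr S Ta Tb Tc Te Th Ti Tl Tm Ts
U V W Xe Yb Y Zr Zn
""".split()

ELEMENT = "(?:" + "|".join(SYMBOLS) + ")"
NUM = r"(?:[1-9]\d*)?"
PAREN = rf"\((?:{ELEMENT}{NUM})+\){NUM}"

formula = "(CH3)2CFCOO(CH2)2Si(CH3)2Cl"

# One match over the whole string: the named group sits inside a
# repetition, so only its final occurrence is kept.
WHOLE = re.compile(rf"(?:(?P<Parentheses>{PAREN})|{ELEMENT}{NUM})+")
whole_match = WHOLE.fullmatch(formula)
print(whole_match.group("Parentheses"), whole_match.span("Parentheses"))

# Loop over the pieces instead: each iteration yields one group,
# so nothing is overwritten.
PIECE = re.compile(
    rf"(?P<Parentheses>{PAREN})|(?P<Group>(?:{ELEMENT}{NUM})+)"
)
parts = [(m.lastgroup, m.group(0)) for m in PIECE.finditer(formula)]
for kind, piece_text in parts:
    print(kind, piece_text)
```

The loop relies on two-letter symbols being tried before one-letter ones; without an end anchor to force backtracking, a one-letter-first ordering would split Si into S and a stray i.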
