Using regular expression as a tokenizer?

Question

I am trying to tokenize my corpus into sentences. I tried spaCy and NLTK, but they did not work well because my text is a bit tricky. Below is an artificial sample I made that covers all the edge cases I know of:

It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

How I want the sentences tokenized:

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a rarest of rare case where extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4)15. In the light of the above discussion, while
 maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

Here is the regular expression I am using:

sent = re.split('(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ',j)

I am not well versed in regular expressions, so I am adding conditions by hand, for example for v. and vs. I am also ignoring the period when it is preceded by a number, for example 15.

Problems I am facing:

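  1. If there is no gap between two sentences, they are not split correctly.
  2. I also want the period to be ignored when the preceding word is capitalized, for example No. or Mr.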
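
One detail of the pattern above is worth checking: `[vs]` is a character class that matches a single character, `v` or `s`, so `(?<![vs]\.)` blocks a split after any word ending in `s.`, not only after the abbreviation `vs.`. A minimal sketch of the difference, using a made-up sample sentence and a simplified pattern rather than the full expression from the question:

```python
import re

sample = "He lost his keys. She found them."

# Character class: refuses to split after ANY "v." or "s." (here, "keys.").
with_class = re.split(r"(?<![vs]\.)(?<=\.) ", sample)

# Literal lookbehind: refuses to split only after the exact text "vs.".
with_literal = re.split(r"(?<!vs\.)(?<=\.) ", sample)

print(with_class)
print(with_literal)
```

The literal form is still crude (it would also protect a word such as "revs."), which is one reason a list of abbreviations tends to work better than a chain of lookbehinds.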

Recommended answer

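In general, you cannot rely on one single Great White infallible regex. You have to write a function that uses several regexes (both positive and negative), together with a dictionary of abbreviations and some basic language parsing that knows, for example, that 'I', 'USA', 'FCC' and 'TARP' are capitalized in English. (Reference)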
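
The idea can be sketched in a few lines: mask the periods that do not end a sentence with a placeholder, mark the remaining ones as sentence ends, then restore the placeholder. The abbreviation set `ABBREVIATIONS` and the function name `naive_split` are assumptions made for illustration, and the sketch covers far fewer cases than real text needs:

```python
import re

# Hypothetical abbreviation dictionary; extend it for your own corpus.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "No", "vs", "v"}

def naive_split(text):
    # Longest alternatives first, so "Mrs" is tried before "Mr".
    abbr = "|".join(sorted(ABBREVIATIONS, key=len, reverse=True))
    # A period right after a known abbreviation is not a sentence end.
    text = re.sub(r"\b(" + abbr + r")[.]", r"\1<prd>", text)
    # Every period still unmasked ends a sentence, whether or not a space follows.
    text = text.replace(".", ".<stop>")
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

print(naive_split("Mr. Smith took the No. 2 bus.He arrived late."))
```

Because the split happens on the `<stop>` marker rather than on whitespace, two sentences with no gap between them are still separated.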

Following this guideline, the function below uses several regexes to parse your sentences. It is a modification of D Greenberg's answer.

Code

import re

def split_into_sentences(text):
    # Regex building blocks (raw strings avoid invalid-escape warnings)
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = r"([0-9])"
    section = r"(Section \d+)([.])(?= \w)"
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])
    # text replacement
    # replace unwanted stop period with <prd>
    # actual stop periods with <stop>
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...","<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(section,"\\1<prd>",text)
    text = re.sub(item_number,"\\1<prd>",text)
    text = re.sub(abbreviations, "\\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")

    # Tokenize sentence based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # remove last since only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]

    return sentences
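
The `websites` and `enclosed` steps pass a function to `re.sub`, so that only the periods inside each match are masked. A small standalone sketch of that one step, using only the parenthesis pattern and a hypothetical helper name `mask_enclosed_periods`:

```python
import re

def mask_enclosed_periods(text):
    # Non-greedy match of one parenthesized group; nested parentheses are
    # not handled in this sketch.
    parenthesized = r"\((.*?)\)"
    # The callback receives each match object and returns its replacement.
    return re.sub(parenthesized, lambda m: m.group().replace(".", "<prd>"), text)

print(mask_enclosed_periods("See the exhibit (Ex. 73, p. 4). It was accepted."))
```

Periods outside the parentheses are left untouched, so the later `.` to `.<stop>` replacement still marks them as sentence ends.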

Usage

for index, token in enumerate(split_into_sentences(s), start = 1):
    print(f'{index}) {token}')

Tests

1. Input

s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.
'''

Output

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death  to one cannot be generalised.
2) However, the High Court while enhancing the same from life to  death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a  rarest of rare case where extreme penalty of death is called for instead sentence of  imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the  above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,  award of extreme penalty of death by the High Court is set aside and we restore the sentence of  life imprisonment as directed by the trial Court.

2. Input

s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.He's arriving on flight No. 48213 out of Denver.He'll take the No. 2 bus from the airport.However, he may grab a taxi instead.'''

Output

1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.

3. Input

s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''

Output

1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.

4. Input

s = '''He went to New York. He is 10 years old.'''

Output

1) He went to New York.
2) He is 10 years old.

5. Input

s = '''15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''

Output

1) 15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.
