Chunking sentences using the word 'but' with RegEx

Problem description

I am attempting to chunk sentences using RegEx at the word 'but' (or any other coordinating conjunction words). It's not working...

import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)  # `grammar` is not shown in the question
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK':
        print(subtree.node())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.

Solution

I think you can simply do

import re
result = re.split(r"\s+(?:but|and)\s+", sentence)

where

`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Start a non-capturing group: match the expression inside, but do not capture it
  `but`     Match the characters "but" literally
  `|`       Or, if that alternative fails, try the next one (the group fails only if every alternative fails)
  `and`     Match the characters "and" literally
`)`         End of the non-capturing group
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
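
Applied to the sentence from the question, the pattern above can be tried out with the short sketch below. It assumes the input is the raw sentence string, not the POS-tagged list that the question stores in `sentence`, since `re.split` works on strings; the names `text` and `parts` are only illustrative.

```python
import re

# Assumption: the raw sentence string, before tokenizing or POS-tagging.
text = "There are no large collections present but there is spinal canal stenosis."

# Split wherever "but" or "and" appears as a whitespace-delimited word.
parts = re.split(r"\s+(?:but|and)\s+", text)
print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the conjunction must have whitespace on both sides, an "and" inside a longer word such as "band" is not treated as a split point.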

You can add more conjunctions to the group, separated by the pipe character |. Take care, though, that these words do not contain characters that have special meaning in a regex. If in doubt, escape them first with re.escape(word).
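
One way to follow that advice is to build the alternation from a word list and escape each entry. The sketch below is only an illustration: the helper name `split_on_conjunctions`, the `CONJUNCTIONS` list, and the use of `re.IGNORECASE` are assumptions, not part of the original answer.

```python
import re

# Hypothetical word list; extend it with whatever conjunctions you need.
CONJUNCTIONS = ["but", "and", "or", "yet"]

def split_on_conjunctions(text, words=CONJUNCTIONS):
    """Split text at any of the given words when surrounded by whitespace."""
    # re.escape makes each word match literally, even if it contains
    # characters such as "." that are special in a regex.
    alternation = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\s+(?:" + alternation + r")\s+", re.IGNORECASE)
    return pattern.split(text)

print(split_on_conjunctions(
    "There are no large collections present but there is spinal canal stenosis."
))
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Escaping matters as soon as an entry contains a special character: with the illustrative entry "vs.", the escaped pattern matches only a literal dot, so "cats vs. dogs" is split while "cats vsx dogs" is left alone.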
