Chunking sentences using the word 'but' with RegEx
I am attempting to chunk sentences with RegEx at the word 'but' (or any other coordinating conjunction). It's not working...
import nltk
from nltk.tokenize import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)  # `grammar` is not shown in the question
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK':
        print(subtree.node())
I need to split the sentence "There are no large collections present but there is spinal canal stenosis."
into two:
1. "There are no large collections present"
2. "there is spinal canal stenosis."
I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.
I think you can simply do
import re
result = re.split(r"\s+(?:but|and)\s+", sentence)
where
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:` Match the regular expression below, do not capture
`but` Match the characters "but" literally (the next alternative is attempted only if this one fails)
`|` Or match regular expression number 2 below (the entire group fails if this one fails to match)
`and` Match the characters "and" literally
`)` End of the non-capturing group
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
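Applied to the sentence from the question, the split might look like the sketch below. It assumes `sentence` holds the raw text, not the tagged list returned by `nltk.pos_tag`, since `re.split` works on strings.

```python
import re

# Assumption: `sentence` is the raw string, not the output of nltk.pos_tag.
sentence = "There are no large collections present but there is spinal canal stenosis."

# Split wherever "but" or "and" appears with whitespace on both sides.
# The conjunction and the surrounding whitespace are dropped from the result.
parts = re.split(r"\s+(?:but|and)\s+", sentence)

print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the pattern requires whitespace on both sides, a word that merely contains "and" or "but" (such as "band") is not treated as a split point.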
You can add more conjunction words in there, separated by a pipe character `|`.
Take care though that these words do not contain characters that have special meaning in regex. If in doubt, escape them first with `re.escape(word)`.
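One way to combine both suggestions is to build the alternation from a word list and escape each entry. This is only a sketch: the helper name `split_on_conjunctions`, the word list, and the sample sentence are made up for illustration, and the case-insensitive flag is an extra assumption that is not part of the pattern above.

```python
import re

def split_on_conjunctions(text, conjunctions):
    """Split text at any of the given conjunctions (hypothetical helper)."""
    # Escape each word so regex metacharacters in it are matched literally.
    alternation = "|".join(re.escape(word) for word in conjunctions)
    # Same shape as the pattern above: whitespace, one conjunction, whitespace.
    # re.IGNORECASE is an added assumption so that "But" also matches.
    pattern = re.compile(r"\s+(?:" + alternation + r")\s+", flags=re.IGNORECASE)
    return pattern.split(text)

# Illustrative word list and sentence, not taken from the question.
words = ["but", "and", "or", "yet"]
pieces = split_on_conjunctions(
    "The scan is clear and the patient is stable but follow-up is advised.",
    words,
)
print(pieces)
# ['The scan is clear', 'the patient is stable', 'follow-up is advised.']
```

Note that this splits on the listed words wherever they appear between whitespace, regardless of their part of speech; it does not use the CC tags from the tagger.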