How to use an NLTK regex pattern to extract a specific phrase chunk?

Views: 273 Published: 2020/5/18 0:33:05 Tags: python regex nlp nltk text-chunking


Question

I have written the following regex to tag certain phrase patterns:

pattern = """
        P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
        P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
        P3: {<NP1><IN><NP2>}
        P4: {<NP2><IN><NP1>}

    """

This pattern correctly tags a phrase such as:

a = 'The pizza was good but pasta was bad'

and gives the desired output with 2 phrases:

pizza was good
pasta was bad

However, if my sentence is something like:

a = 'The pizza was awesome and brilliant'

it matches only the phrase:

'pizza was awesome'

instead of the desired:

'pizza was awesome and brilliant'

How do I extend my regex pattern so that it also covers the second example?

Recommended answer

First, let's take a look at the POS tags that NLTK gives:

>>> from nltk import pos_tag
>>> sent = 'The pizza was awesome and brilliant'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
>>> sent = 'The pizza was good but pasta was bad'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

(Note: the above are the outputs from pos_tag in NLTK v3.1; older versions might differ.)

What you want to capture is essentially:

NN VBD JJ CC JJ
NN VBD JJ

So let's catch them with these patterns:

>>> from nltk import RegexpParser
>>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
>>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
>>> patterns = """
... P: {<NN><VBD><JJ><CC><JJ>}
... {<NN><VBD><JJ>}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])

So that's "cheating" by hardcoding!

Let's go back to the POS patterns:

NN VBD JJ CC JJ
NN VBD JJ

These can be simplified to a single pattern in which the trailing CC JJ is optional:

NN VBD JJ (CC JJ)?

So you can use the optional operator in the regex, e.g.:

>>> patterns = """
... P: {<NN><VBD><JJ>(<CC><JJ>)?}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
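
The question asks for the phrases as plain strings rather than as a tree. A minimal sketch of that last step, assuming the tagged tokens are exactly the pos_tag outputs shown earlier (they are hardcoded here so that no tagger model is needed); the helper name `phrases` is made up for illustration:

```python
from nltk import RegexpParser

# Tagged tokens copied from the pos_tag outputs shown earlier.
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

# Same grammar as above: the trailing <CC><JJ> group is optional.
chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")

def phrases(tagged, label='P'):
    """Hypothetical helper: return every chunk with the given label as a string."""
    tree = chunker.parse(tagged)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

print(phrases(tagged1))  # ['pizza was awesome and brilliant']
print(phrases(tagged2))  # ['pizza was good', 'pasta was bad']
```

Filtering on the label skips the root 'S' node, which `subtrees()` would otherwise return as well.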

Most probably you're using the old tagger, which is why your patterns are different, but I guess you can see how to capture the phrases you need using the example above.

The steps are:

First, check what the POS patterns are using pos_tag
Then generalize the patterns and simplify them
Then put them into the RegexpParser
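
The "generalize" step can be taken a little further if your data needs it. The sketch below is one possible generalization, not part of the original answer: it uses wildcard tags such as `<NN.*>` so that plural nouns and other verb forms also match, and `(<CC><JJ.*>)*` so that any number of conjoined adjectives is captured. The tagged sentence is hand-written for illustration and is an assumption, not real pos_tag output.

```python
from nltk import RegexpParser

# Hand-written tags for illustration only (an assumption, not tagger output).
tagged = [('The', 'DT'), ('pizzas', 'NNS'), ('were', 'VBD'),
          ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ'),
          ('and', 'CC'), ('cheap', 'JJ')]

# <NN.*> matches NN, NNS, ...; <VB.*> matches VB, VBD, ...;
# the final group may repeat zero or more times.
general_chunker = RegexpParser("P: {<NN.*>+<VB.*><JJ.*>(<CC><JJ.*>)*}")

tree = general_chunker.parse(tagged)
chunks = [' '.join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'P')]
print(chunks)
```

A broader pattern will also match more phrases you may not want, so it is worth checking it against a sample of your own tagged sentences before relying on it.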
