如何从BIO分块语句中提取块? - Python [英] How to extract chunks from BIO chunked sentences? - python

查看:69
本文介绍了如何从BIO分块语句中提取块? - Python的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出一个输入句子,该句子具有

Give an input sentence, that has BIO chunk tags:

[('What','B-NP'),('is','B-VP'),('the','B-NP'),('airspeed', 'I-NP'),('of','B-PP'),('an','B-NP'),('unladen','I-NP'), ('sallow','I-NP'),('?','O')]

[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

我需要提取相关短语,例如如果要提取'NP',则需要提取包含B-NPI-NP的元组片段.

I would need to extract the relevant phrases out, e.g. if I want to extract 'NP', I would need to extract the fragments of tuples that contains B-NP and I-NP.

[输出]:

[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

(注意:提取的元组中的数字表示令牌索引.)

(Note: the numbers in the extract tuples represent the token index.)

我尝试使用以下代码将其提取:

I have tried extracting it using the following code:

def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-'+chunk_type in pos: # Append the word to the current_chunk.
            current_chunk.append((word))
            current_chunk_position.append((idx))
        else:
            if current_chunk: # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk) 
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str 
                current_chunk = []
                current_chunk_position = []
    if current_chunk: # Flush the last chunk.
        yield ' '.join(current_chunk), '-'.join(current_chunk_position)


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))

但是当我有相同类型的相邻块时:

But when I have adjacent chunk of the same type:

tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

print (list(extract_chunks(tagged_sent, chunk_type='NP')))

它输出以下内容:

[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]

而不是期望的:

[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]

如何从上面的代码中解决此问题?

除了上面的代码是如何完成的以外,还有更好的解决方案来提取特定chunk_type的所需块吗?

Other than how it's done from the code above, is there a better solution to extract the desired chunks of a specific chunk_type?

推荐答案

def extract_chunks(tagged_sent, chunk_type):
    grp1, grp2, chunk_type = [], [], "-" + chunk_type
    for ind, (s, tp) in enumerate(tagged_sent):
        if tp.endswith(chunk_type):
            if not tp.startswith("B"):
                grp2.append(str(ind))
                grp1.append(s)
            else:
                if grp1:
                    yield " ".join(grp1), "-".join(grp2)
                grp1, grp2 = [s], [str(ind)]
    yield " ".join(grp1), "-".join(grp2)

输出:

In [2]: l = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'),
   ...:                ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

In [3]: list(extract_chunks(l, "NP"))
Out[3]: 
[('The Mitsubishi Electric Company', '0-1-2-3'),
 ('Managing Director', '4-5'),
 ('ramen', '7')]

In [4]: l = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

In [5]: list(extract_chunks(l, "NP"))
Out[5]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

这篇关于如何从BIO分块语句中提取块? - Python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆