How to detokenize spaCy text without doc context?


Problem description

I have a sequence-to-sequence model trained on tokens formed by spaCy's tokenization. This applies to both the encoder and the decoder.

The output is a stream of tokens from the seq2seq model. I want to detokenize that text to form natural text.

Example:

Input to Seq2Seq: some text

Output from Seq2Seq: This is invalid.
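
To make the problem concrete: naively joining the model's output tokens with spaces keeps spaCy's token boundaries visible, because contractions and punctuation stay split off. A tiny illustration (the token list is taken from the tokenized output shown further down in the answer):

tokens = ["And", "who", "wo", "n't", "."]

# A plain whitespace join leaves the contraction and the period detached:
print(" ".join(tokens))   # -> "And who wo n't ."
# What we actually want back is the natural text: "And who won't."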

Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer?
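
For comparison: if you still have the spaCy Doc (the document context the title refers to), reversing tokenization is trivial, because every Token stores its trailing whitespace. A minimal sketch of that Doc-based round trip, which is exactly the information that is lost once all you have is a list of token strings:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("And who won't. be doing")

# Each token remembers the whitespace that followed it in the original text,
# so concatenating text_with_ws reconstructs the input exactly.
detokenized = "".join(token.text_with_ws for token in doc)
assert detokenized == "And who won't. be doing"

# A bare list of token strings (e.g. seq2seq output) carries no whitespace_
# information, which is why a rule-based reconstruction like the one below is needed.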

Recommended answer

TL;DR: I've written a piece of code that attempts to do it; the snippet is below.

Another approach, with a computational complexity of O(n^2)*, would be to use a function I just wrote. The main thought was: "What spaCy splits shall be rejoined once more!"

#!/usr/bin/env python
import spacy
import string


class detokenizer:
    """ This class is an attempt to detokenize a spaCy-tokenized sentence """
    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens : list):
        """ Call this method to get a list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens : list) -> str:
        """ Call this method to get the detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens : list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of the spaCy-tokenized words is equal to
        the length of the joined and then spaCy-tokenized words...

        In other words, we say we should join only if the join is reversible.
        e.g.:
            for the text ["The", "man", "."]
            we would join "man" with "."
            but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        # Never glue a following word onto a token ending in punctuation (e.g. "." + "And").
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join

Usage:

import spacy

dt = detokenizer()

sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)

string_tokens = [a.text for a in spaCy_tokenized]

detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)

print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)

Output:

I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']

Downfalls:

With this approach you may easily merge "do" and "nt", as well as strip the space between the dot "." and the preceding word. This method is not perfect, as there are multiple possible combinations of sentences that lead to a specific spaCy tokenization.
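
A quick way to see why this is irreversible in general: two different source strings can produce exactly the same spaCy token sequence, so the token list alone cannot tell them apart. A small check, assuming the same en_core_web_sm model as above (the expected token lists are based on the tokenized output shown earlier):

import spacy

nlp = spacy.load("en_core_web_sm")

# Both inputs come out as the tokens ['do', 'nt'], so a detokenizer cannot
# know whether the space was there in the original text.
print([t.text for t in nlp("dont")])    # ['do', 'nt']
print([t.text for t in nlp("do nt")])   # ['do', 'nt']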

I am not sure if there is a method to fully detokenize a sentence when all you have is the spaCy-separated text, but this is the best I've got.

After searching for hours on Google, only a few answers came up, with this very Stack question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)
