Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy


Problem description

I am trying to tag and parse text that has already been split up into sentences and has already been tokenized. As an example:

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

The fastest way to process batches of text is .pipe(). However, it is not clear to me how I can use that with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but it threw an error:

docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]
nlp.tagger(docs)   # raises AttributeError: the tagger expects a single Doc, not a list
nlp.parser(docs)

Traceback:

Traceback (most recent call last):
  File "C:\Python\Python37\Lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Python\projects\PreDicT\predicting-wte\build_id_dictionary.py", line 204, in process_batch
    self.nlp.tagger(docs)
  File "pipes.pyx", line 377, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.predict
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 169, in __call__
    return self.predict(x)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 40, in predict
    X = layer(X)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 169, in __call__
    return self.predict(x)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 133, in predict
    y, _ = self.begin_update(X, drop=None)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 14, in begin_update
    features = [self._get_feats(doc) for doc in docs]
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 14, in <listcomp>
    features = [self._get_feats(doc) for doc in docs]
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 21, in _get_feats
    arr = doc.doc.to_array(self.attrs)[doc.start : doc.end]
AttributeError: 'list' object has no attribute 'doc'
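
The traceback shows the actual problem: Tagger.__call__ operates on a single Doc, so passing a list fails as soon as the tagger tries to read doc.doc on it. A minimal workaround under the spaCy v2 API used in the question (a sketch, separate from the accepted answer below) is to run the components one document at a time, or batch them through the components' own .pipe() methods:

docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]

# Option 1: call each component on one Doc at a time
for doc in docs:
    nlp.tagger(doc)
    nlp.parser(doc)

# Option 2: batch-process via the components' .pipe() methods
docs = list(nlp.parser.pipe(nlp.tagger.pipe(docs)))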

Recommended answer

Just replace the default tokenizer in the pipeline with nlp.tokenizer.tokens_from_list instead of calling it separately:

import spacy
nlp = spacy.load('en')
nlp.tokenizer = nlp.tokenizer.tokens_from_list

for doc in nlp.pipe([['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]):
    for token in doc:
        print(token, token.pos_)

Output:

I PRON
like VERB
cookies NOUN
. PUNCT
Do VERB
you PRON
? PUNCT
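
A note for newer spaCy versions: tokens_from_list was deprecated in the v2.x line and later removed, so the answer above only works on older releases. A sketch of the equivalent on current spaCy (assuming v3 and the en_core_web_sm model): build Doc objects directly from the word lists and pass them to nlp.pipe(), which skips tokenization for Doc input.

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

# Build pre-tokenized Docs directly from the word lists
docs = [Doc(nlp.vocab, words=words) for words in sents]

# nlp.pipe() accepts Doc objects (spaCy v3+) and skips the tokenizer
for doc in nlp.pipe(docs):
    for token in doc:
        print(token, token.pos_)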

