Get indices of original text from nltk word_tokenize


Question

I am tokenizing a text using nltk.word_tokenize, and I would also like to get, for every token, the index of its first character in the original raw text, i.e.

import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
print(tokens)  # ['hello', 'world']

How can I also get the array [0, 6] corresponding to the raw indices of the tokens?

Recommended answer

I think what you are looking for is the span_tokenize() method. The word_tokenize function itself returns only the token strings, not their spans, so here is a code example with another tokenizer.

from nltk.tokenize import WhitespaceTokenizer
s = "Good muffins cost $3.88\nin New York."
span_generator = WhitespaceTokenizer().span_tokenize(s)
spans = [span for span in span_generator]
print(spans)

Which gives:

[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]

To get just the offsets:

offsets = [span[0] for span in spans]
print(offsets)  # [0, 5, 13, 18, 24, 27, 31]
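WhitespaceTokenizer splits differently from word_tokenize (for example, it keeps "York." as one token), so its spans do not line up with word_tokenize output. If you need offsets for the exact tokens word_tokenize returned, one possible approach is to align them back to the raw text with str.find. The sketch below is not part of NLTK: token_offsets is a hypothetical helper, and it assumes every token appears verbatim in the text. That assumption fails when word_tokenize rewrites characters, as it does with double quotes, so such cases would need extra handling.

```python
def token_offsets(text, tokens):
    """Return the start index in `text` of each token, scanning left to right.

    Hypothetical helper, not an NLTK function. Assumes each token occurs
    verbatim in `text`, in order.
    """
    offsets = []
    cursor = 0  # search resumes after the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after index {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets


# Tokens as nltk.word_tokenize('hello world') returns them
tokens = ['hello', 'world']
print(token_offsets('hello world', tokens))  # [0, 6]
```

Moving the cursor past each match keeps repeated words from resolving to their first occurrence.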

For further information (on the different tokenizers available), see the tokenize API docs.
