Create new tokens and tuples from existing ones based on conditions

Question

This is closely related to a previous question, but I am having difficulty adapting it to my use case.

I have a sentence: "Forbes Asia 200 Best Under 500 Billion 2011"

I have tokens like:

oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011']

And the indices of where a previous parser has figured out where there should be location or number slots:

numberTokenIDs =  {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00}
locationTokenIDs = {(0, 1): u'Forbes Asia'}

The token IDs correspond to the index of the tokens where there are locations or numbers, the objective is to obtain a new set of tokens like:

newTokens = [u'ForbesAsia', u'200', u'Best', u'Under', u'500Billion', u'2011']

With new number and location token IDs, perhaps like the following (to avoid index-out-of-bounds exceptions):

numberTokenIDs =  {(5,): 2011.0, (1,): 200.0, (4,): 500000000000.00}
locationTokenIDs = {(0,): u'Forbes Asia'}

Essentially I would like to go through the new reduced set of tokens, and be able to ultimately create a new sentence called:

"LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT"

This would be done by going through the new set of tokens and replacing the token at each token ID with either LOCATION_SLOT or NUMBER_SLOT. If I did this with the current set of number and location token IDs, I would get:

"LOCATION_SLOT LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT NUMBER_SLOT".

How can I do this?

Another example is:

Location token IDs are:  (0, 1)
Number token IDs are:  (3, 4)

Old sampleTokens [u'United', u'Kingdom', u'USD', u'1.240', u'billion']

Where I want to both delete tokens and also change location and number token IDs to be able to replace the sentence like:

sampleTokens[numberTokenID] = "NUMBER_SLOT"
sampleTokens[locationTokenID] = "LOCATION_SLOT"

So that the replaced tokens are [u'LOCATION_SLOT', u'USD', u'NUMBER_SLOT']

Note that the concatenation should join all the values in a tuple if there is more than one; a tuple may also contain more than two elements, e.g. The United States of America.

Recommended answer

This should work (if I understood correctly):

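token_by_index = dict(enumerate(oldTokens))
# list(...) keeps this working on Python 3, where dict.keys() returns a view
groups = list(numberTokenIDs) + list(locationTokenIDs)
for group in groups:
    # join the group's tokens and store the result under the group's first index
    token_by_index[group[0]] = ''.join(token_by_index.pop(index)
                                       for index in group)
# the indices are unique, so sorting the (index, token) pairs restores sentence order
newTokens = [token for _, token in sorted(token_by_index.items())]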
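
The merging step can be checked against the second example from the question. The following is only a sketch in Python 3: the function name `merge_groups` is introduced here for illustration and is not part of the original answer, and it assumes each index tuple is in ascending order and that no index appears in two groups.

```python
def merge_groups(tokens, groups):
    """Concatenate the tokens of each index group into the group's first slot.

    `groups` is an iterable of index tuples such as (0, 1) or (5, 6).
    Returns a new list; the input list is left untouched.
    """
    token_by_index = dict(enumerate(tokens))
    for group in groups:
        # Pop every member of the group, then store the joined text
        # back under the group's first index.
        merged = ''.join(token_by_index.pop(index) for index in group)
        token_by_index[group[0]] = merged
    # The remaining keys are unique, so sorting them restores sentence order.
    return [token_by_index[index] for index in sorted(token_by_index)]


sample_tokens = ['United', 'Kingdom', 'USD', '1.240', 'billion']
print(merge_groups(sample_tokens, [(0, 1), (3, 4)]))
# ['UnitedKingdom', 'USD', '1.240billion']
```

Because the merged text is stored under the first index of each group, a group with more than two elements (for example a four-token location) is handled the same way.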

To find the new token IDs:

# note: this lookup assumes every token in newTokens is distinct
new_index_by_token = {token: index for index, token in enumerate(newTokens)}
numberTokenIDs = {(new_index_by_token[token_by_index[group[0]]],): value
                  for group, value in numberTokenIDs.items()}
locationTokenIDs = {(new_index_by_token[token_by_index[group[0]]],): value
                    for group, value in locationTokenIDs.items()}
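
Looking up the new index by token text only works while every merged token is distinct; a sentence containing the same token twice (say, two occurrences of "200") would map both to one index. A possible variant, sketched below in Python 3, remaps by position instead and also builds the slot sentence the question asks for. The function name `slot_sentence` and its return shape are assumptions made for this illustration, and it assumes index tuples are ascending and do not overlap.

```python
def slot_sentence(tokens, number_ids, location_ids):
    """Return (new_tokens, new_number_ids, new_location_ids, sentence)."""
    label_by_first = {}  # first old index of a group -> slot label
    merged_text = {}     # first old index of a group -> concatenated tokens
    covered = set()      # old indices absorbed into a group (all but the first)
    for ids, label in ((number_ids, 'NUMBER_SLOT'),
                       (location_ids, 'LOCATION_SLOT')):
        for group in ids:
            label_by_first[group[0]] = label
            merged_text[group[0]] = ''.join(tokens[i] for i in group)
            covered.update(group[1:])

    # Old indices that survive, in order; their position is the new index.
    kept = [i for i in range(len(tokens)) if i not in covered]
    new_index = {old: new for new, old in enumerate(kept)}

    new_tokens = [merged_text.get(i, tokens[i]) for i in kept]
    new_number_ids = {(new_index[g[0]],): v for g, v in number_ids.items()}
    new_location_ids = {(new_index[g[0]],): v for g, v in location_ids.items()}
    sentence = ' '.join(label_by_first.get(i, tokens[i]) for i in kept)
    return new_tokens, new_number_ids, new_location_ids, sentence


old_tokens = ['Forbes', 'Asia', '200', 'Best', 'Under', '500', 'Billion', '2011']
number_ids = {(7,): 2011.0, (2,): 200.0, (5, 6): 500000000000.00}
location_ids = {(0, 1): 'Forbes Asia'}

new_tokens, new_numbers, new_locations, sentence = slot_sentence(
    old_tokens, number_ids, location_ids)
print(new_tokens)  # ['ForbesAsia', '200', 'Best', 'Under', '500Billion', '2011']
print(sentence)    # LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT
```

Since the new index is derived from the surviving old positions rather than from token text, repeated tokens keep separate IDs.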
