Assign 0 to certain words when the words are not present


Problem description

This is my first post on Stack Overflow, and I am completely new to coding, so please bear with me.

I am working on an experiment that uses two sets of data documents. Doc1 looks like this:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464

TOPIC:topic_2 ....
.....
.....

TOPIC:topic_3 1066.0
say 0.062
word 0.182

and so on, up to 100 topics.

In this document, some words appear in every topic and others in only a few. If a word is missing from a topic, I would like it to be listed in that topic with a value of 0. For example, the word BBC is present in topic 2 but not in topic 1, so I would like my list to look like this:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0

I have to multiply these values by another set of values stored in a second document. For that, I use:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # doc2 holds one weight per topic, separated by whitespace
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and TOPIC headers; keep the "word value" lines
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))

My doc2 has the following format:

  0.566667 0.0333333 0.133333 0 0 0  2.43333 0 0.13333......... till 100 values. 

The above code considers the word "say". It finds the word in 3 topics and gathers its values in a list like [0.015, 0.045, 0.062]. That list is then multiplied position by position with the values in doc2: 0.015 by the 0th value in doc2, 0.045 by the 1st value, and 0.062 by the 2nd value. But this is not what I want. The word "say" does not appear in topic_2, so the list must be [0.015, 0.045, 0, 0.062]. Then, when these values are multiplied by the values at the corresponding positions in doc2, they give:

P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)

So the code works as it stands; it only needs this one modification.
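The arithmetic above can be checked with a short sketch. It uses only the rounded numbers quoted in the question; the names `compact`, `padded`, and `weighted_sum` are made up for illustration and are not part of the original code.

```python
# First four weights from doc2 as quoted in the question (topic_0 .. topic_3).
weights = [0.566667, 0.0333333, 0.133333, 0.0]

# Rounded values of "say" per topic; topic_2 does not contain the word.
compact = [0.015, 0.045, 0.062]       # what the current code collects
padded = [0.015, 0.045, 0.0, 0.062]   # what is wanted: a 0 holds topic_2's slot


def weighted_sum(values, weights):
    """Multiply values and weights position by position and add them up."""
    return sum(v * w for v, w in zip(values, weights))


wrong = weighted_sum(compact, weights)  # 0.062 meets topic_2's weight
right = weighted_sum(padded, weights)   # 0.062 meets topic_3's weight
print(wrong, right)
```

Without the placeholder 0, every value after the gap shifts one position to the left and is multiplied by the wrong topic's weight, which is why the two sums differ.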

Recommended answer

As a concise alternative for rewriting the blocks, you can store all the names in a set, build an OrderedDict that maps each TOPIC header to the lines of its block, find the names missing from each block using set.difference against the full set (words), and write them at the end of the block:

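from itertools import tee
from collections import OrderedDict

d = OrderedDict()
with open('input.txt') as f, open('new', 'w') as new:
    # three independent iterators over the same file
    f2, f3, f = tee(f, 3)
    next(f3)  # keep f3 one line ahead of f2
    # every word that occurs in any topic block
    words = {line.split()[0] for line in f if not line.startswith('TOPIC') and line.strip()}

    for line in f2:
        if line.startswith('TOPIC'):
            key = line
            try:
                next_line = next(f3)
                # collect the lines of this block until the next TOPIC header
                while not next_line.startswith('TOPIC'):
                    # normalise the line ending so the last line of the file also ends in a newline
                    d.setdefault(key, []).append(next_line.rstrip('\n') + '\n')
                    next_line = next(f3)
            except StopIteration:
                # end of file reached inside the last block
                pass

    for k, v in d.items():
        block_words = {line.split()[0] for line in v if line.strip()}
        # words that occur somewhere in the file but not in this block
        missing = words.difference(block_words)
        new.writelines([k] + v + ['{} {}\n'.format(i, 0) for i in missing])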

Result:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
president 0
s 0
BBC 0
tell 0
Mr 0
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
web 0
image 0
online 0
site 0
Internet 0
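If the zero-filled file is only an intermediate step toward the multiplication described in the question, the zeros can also be supplied at lookup time instead of being written out. The following is a minimal sketch of that idea, not the answer's method: it assumes doc1 has already been read into a string, that each TOPIC block corresponds to one weight in doc2 in the same order, and the names `parse_topics` and `word_scores` as well as the sample data are invented for illustration.

```python
def parse_topics(text):
    """Return one {word: value} dict per TOPIC block, in file order."""
    topics = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank line between blocks
        if parts[0].startswith("TOPIC"):
            topics.append({})  # a header starts a new block
        elif topics and len(parts) == 2:
            topics[-1][parts[0]] = float(parts[1])
    return topics


def word_scores(doc1_text, doc2_text):
    """Weighted sum per word; a word missing from a topic counts as 0."""
    topics = parse_topics(doc1_text)
    weights = [float(x) for x in doc2_text.split()]
    vocabulary = set().union(*topics)  # every word seen in any block
    return {
        word: sum(t.get(word, 0.0) * w for t, w in zip(topics, weights))
        for word in sorted(vocabulary)
    }


# Made-up sample data in the same layout as doc1 and doc2.
sample_doc1 = """TOPIC:topic_0 5892.0
site 0.5
say 0.25

TOPIC:topic_1 12366.0
Mr 0.5

TOPIC:topic_2 1066.0
say 0.5
"""
sample_doc2 = "2 4 8"

scores = word_scores(sample_doc1, sample_doc2)
for word, score in scores.items():
    print("Prob for {} is {}".format(word, score))
```

Here `dict.get(word, 0.0)` plays the role of the explicit `word 0` lines, so each value stays aligned with its own topic's weight even when a word is absent from some blocks.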
