Assign 0 to certain words when the words are not present


Question

This is my first ever post on Stack Overflow and I am a total beginner at coding, so please bear with me.

I am working on an experiment that has two sets of data documents. Doc1 is as follows:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464

TOPIC:topic_2 ....
.....
.....

TOPIC:topic_3 1066.0
say 0.062
word 0.182

and so on, up to 100 topics.

In this document, some words are present in all topics while others appear in only a few. If a word is not present in a topic, I would like its value in that topic to be 0. For example, the word BBC is present in topic_1 but not in topic_0, so I would like my list to look like this:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0

I have to multiply these values by another set of values stored in another document. For that, I am using the following code:

from collections import defaultdict

d = defaultdict(list)  # word -> values from the topics that contain it
with open("doc1") as f, open("doc2") as f2:
    # doc2 holds one weight per topic; list() so it can be reused for every word
    values = list(map(float, f2.read().split()))
    for line in f:
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))

My doc2 has the format:

0.566667 0.0333333 0.133333 0 0 0 2.43333 0 0.13333 ......... and so on, up to 100 values.

Consider the word "say". The code above finds that the word is in 3 topics and gathers their values in a list like [0.015, 0.045, 0.062]. This list is multiplied with the values in doc2 position by position: 0.015 is multiplied by the 0th value in doc2, 0.045 by the 1st value, and 0.062 by the 2nd value. But this is not what I want. There is no word "say" in topic_2, so the list must be [0.015, 0.045, 0, 0.062]. Then, when these values are multiplied by the values at the corresponding positions in doc2, they give:

P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)

So the code works as it stands; only this one modification is required.
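The misalignment described above can be reproduced with a small sketch. It uses only the rounded values for "say" and the first four doc2 values quoted in the question; the names `compressed`, `aligned`, and `weighted_sum` are made up for illustration and are not part of the original code.

```python
# Rounded values for "say" and the first four doc2 weights, taken from the question.
weights = [0.566667, 0.0333333, 0.133333, 0.0]

# What the current code builds: absent topics are skipped, so later values shift left.
compressed = [0.015, 0.045, 0.062]

# What is wanted: one slot per topic, with 0 where "say" is absent (topic_2).
aligned = [0.015, 0.045, 0.0, 0.062]


def weighted_sum(values, weights):
    """Sum of position-wise products; zip stops at the shorter sequence."""
    return sum(v * w for v, w in zip(values, weights))


misaligned_result = weighted_sum(compressed, weights)
aligned_result = weighted_sum(aligned, weights)

print(misaligned_result)  # 0.062 is wrongly paired with the topic_2 weight
print(aligned_result)     # corresponds to the P(SAY) expression above
```

Because `zip` pairs items purely by position, the only way to get the intended result is to keep one slot per topic, filled with 0 when the word is missing.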

Recommended answer

As a concise way of rewriting the blocks, you can store all the names in a set, build an OrderedDict that maps each topic header to its block of lines, find the names missing from each block using set.difference against the full set (the set words), and then write them at the end of the block:

from collections import OrderedDict

blocks = OrderedDict()  # topic header line -> list of "word value" lines
words = set()           # every word seen in any topic

with open('input.txt') as f:
    key = None
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith('TOPIC'):
            key = line
            blocks[key] = []
        elif key is not None:
            blocks[key].append(line)
            words.add(line.split()[0])

with open('new', 'w') as new:
    for key, lines in blocks.items():
        block_words = {line.split()[0] for line in lines}
        # words that appear in some other topic but not in this one
        missing = words.difference(block_words)
        new.write('\n'.join([key] + lines + ['{} 0'.format(w) for w in missing]) + '\n')

Result:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
president 0
s 0
BBC 0
tell 0
Mr 0
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
web 0
image 0
online 0
site 0
Internet 0
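
If the end goal is only the weighted sum from the question, the zero-filling can also be done in memory instead of rewriting the file. The following is a minimal sketch under that assumption: `parse_topics` and `word_scores` are hypothetical helper names, and the sample text and weights are made-up values chosen so the arithmetic is easy to check by hand.

```python
def parse_topics(lines):
    """Return a list with one {word: value} dict per TOPIC block, in file order."""
    topics = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # blank separator line
        if parts[0].startswith("TOPIC"):
            topics.append({})  # start a new topic block
        elif topics and len(parts) == 2:
            topics[-1][parts[0]] = float(parts[1])
    return topics


def word_scores(topics, weights):
    """Weighted sum per word; a word absent from a topic contributes 0 there."""
    words = set().union(*topics)  # every word seen in any topic
    return {
        word: sum(topic.get(word, 0.0) * w for topic, w in zip(topics, weights))
        for word in words
    }


# Made-up sample in the same layout as doc1 (not the real data).
sample = """\
TOPIC:topic_0 5892.0
site 0.5
say 0.25

TOPIC:topic_1 12366.0
Mr 0.5

TOPIC:topic_2 1066.0
say 0.125
"""

topics = parse_topics(sample.splitlines())
scores = word_scores(topics, [2.0, 4.0, 8.0])  # one made-up weight per topic
print(scores)
```

Here `dict.get(word, 0.0)` plays the role of the appended `word 0` lines: each topic keeps its own position, so every value is multiplied by the weight of the topic it came from.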
