Assign 0 to certain words when the words are not present
Problem description
This is my first ever post on Stack Overflow and I am completely new to coding, so please bear with me.
I am working on an experiment that has two sets of data documents. Doc1 is as follows:
TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
TOPIC:topic_2 ....
.....
.....
TOPIC:topic_3 1066.0
say 0.062
word 0.182
and so on, up to 100 topics.
In this document, some words are present in all the topics and others in only a few. If a word is not present in a topic, I would like that word to appear in that topic with the value 0. For example, the word BBC is present in topic 1 but not in topic 0, so I would like my list to look like this:
TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0
I have to multiply these values with another set of values present in another document. For that, I use the following code:
from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # doc2 holds one weight per topic, separated by whitespace;
    # a list (not a lazy map) so it can be reused for every word
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and the TOPIC header lines
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))
My doc2 is of the format:
0.566667 0.0333333 0.133333 0 0 0 2.43333 0 0.13333......... till 100 values.
Consider the word "say". The above code finds the word in 3 topics and gathers its values in a list like [0.015, 0.045, 0.062]. This list is then multiplied with the values in doc2 position by position: 0.015 by the 0th value in doc2, 0.045 by the 1st value and 0.062 by the 2nd value. But this is not what I want. The word "say" does not appear in topic_2, so the list must be [0.015, 0.045, 0, 0.062]. Then, when these values are multiplied with the values at the corresponding positions in doc2, they would give
P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)
So the code works fine otherwise; only this modification is required.
Recommended answer
As a concise alternative for rewriting the blocks, you can store all the names in a set, then build an OrderedDict of your blocks. For each block, get the missing names by calling set.difference on the full set of words (the set words) with the block's own words, then write them at the end of the block:
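from itertools import tee
from collections import OrderedDict

d = OrderedDict()
with open('input.txt') as f, open('new', 'w') as new:
    # three independent iterators over the same file
    f2, f3, f = tee(f, 3)
    next(f3)  # keep f3 one line ahead of f2
    # every word that occurs in any topic
    words = {line.split()[0] for line in f if not line.startswith('TOPIC') and line.strip()}
    for line in f2:
        if line.startswith('TOPIC'):
            key = line
            try:
                next_line = next(f3)
                # collect the lines of this block until the next TOPIC header
                while not next_line.startswith('TOPIC'):
                    d.setdefault(key, []).append(next_line)
                    next_line = next(f3)
            except StopIteration:
                pass  # end of file reached
    for k, v in d.items():
        block_words = {line.split()[0] for line in v if line.strip()}
        # words seen elsewhere in the file but missing from this block
        insec = words.difference(block_words)
        new.writelines([k] + v + ['{} {}\n'.format(i, 0) for i in insec])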
Result:
TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
president 0
s 0
BBC 0
tell 0
Mr 0
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
web 0
image 0
online 0
site 0
Internet 0
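
If only the final weighted sum is needed, the zero padding can also be skipped: a missing word contributes 0 to the sum, so it is enough to pair each topic with the doc2 value at the same position. The following is a minimal sketch of that idea, not the code from the answer above. The helper names parse_topics and word_probabilities and the short sample data are made up for illustration, and the sketch assumes every block starts with a line beginning with TOPIC and that doc2 has one value per topic.

```python
from collections import defaultdict


def parse_topics(lines):
    """Return one {word: value} dict per TOPIC block, in file order."""
    topics = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0].startswith("TOPIC"):
            topics.append({})  # a new block starts here
        elif len(parts) == 2 and topics:
            topics[-1][parts[0]] = float(parts[1])
    return topics


def word_probabilities(topics, weights):
    """Sum value * weight over topics; a word missing from a topic adds 0."""
    probs = defaultdict(float)
    for topic, weight in zip(topics, weights):
        for word, value in topic.items():
            probs[word] += value * weight
    return dict(probs)


# Hypothetical sample data (not the values from the question)
doc1 = """TOPIC:topic_0 5892.0
site 0.5
say 0.25
TOPIC:topic_1 12366.0
Mr 0.5
TOPIC:topic_2 1066.0
say 0.5
""".splitlines()
weights = [2.0, 4.0, 8.0]  # stands in for the values read from doc2

probs = word_probabilities(parse_topics(doc1), weights)
print(probs["say"])  # 0.25 * 2.0 + 0.5 * 8.0
```

Because each topic keeps its own position in the list, "say" in the third block is multiplied by the third weight even though the word is absent from the second block, which is the alignment the question asks for.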