Building a nested dictionary in Python by reading a file line by line


Problem description

This is how I build a nested dictionary:

dicty = dict()
tmp = dict()
tmp["a"] = 1
tmp["b"] = 2
dicty["A"] = tmp

dicty == {"A": {"a": 1, "b": 2}}

The problem starts when I try to do this for a big file, reading it line by line. Printing each line's content as a list gives:

['proA', 'macbook', '0.666667']
['proA', 'smart', '0.666667']
['proA', 'ssd', '0.666667']
['FrontPage', 'frontpage', '0.710145']
['FrontPage', 'troubleshooting', '0.971014']

I would like to end up with a nested dictionary (ignore decimals):

{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
 'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}

Because I am reading line by line, I have to check whether the first word is still the same as on the previous line (lines with the same first word are all grouped together) before I add the finished inner dict to the outer dict.

This is my implementation:

def doubleDict(filename):
    dicty = dict()
    with open(filename, "r") as f:
        row = 0
        tmp = dict()
        oldword = ""
        for line in f:
            values = line.rstrip().split(" ")
            print(values)
            if oldword == values[0]:
                tmp[values[1]] = values[2]
            else:
                if oldword is not "":
                    dicty[oldword] = tmp
                tmp.clear()
                oldword = values[0]
                tmp[values[1]] = values[2]
            row += 1
            if row % 25 == 0:
                print(dicty)
                break #print(row)
    return(dicty)

I would actually like to have this in pandas, but for now I would be happy if this would work as a dict. For some reason after reading in just the first 5 lines, I end up with:

{'proA': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}},

which is clearly incorrect. What is wrong?

Solution

Use a collections.defaultdict() object to auto-instantiate nested dictionaries:

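from collections import defaultdict

def doubleDict(filename):
    # Missing outer keys get a fresh empty dict automatically.
    dicty = defaultdict(dict)
    with open(filename, "r") as f:
        # Count from 1 so the debug check fires after 25 lines, not on the first one.
        for i, line in enumerate(f, 1):
            outer, inner, value = line.split()
            dicty[outer][inner] = value
            if i % 25 == 0:
                print(dicty)
                break  # debugging aid: stop after the first 25 lines
    return dicty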

I used enumerate() to generate the line count here; much simpler than keeping a separate counter going.
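To try the defaultdict approach without a file on disk, here is a minimal sketch that feeds the five sample lines from the question through io.StringIO. The helper name build_nested, the SAMPLE constant, and the decision to skip lines that do not have exactly three fields are assumptions made for illustration; they are not part of the original answer.

```python
import io
from collections import defaultdict

# Sample lines copied from the question; io.StringIO stands in for the real file.
SAMPLE = """\
proA macbook 0.666667
proA smart 0.666667
proA ssd 0.666667
FrontPage frontpage 0.710145
FrontPage troubleshooting 0.971014
"""

def build_nested(lines):
    """Group whitespace-separated 'outer inner value' lines into a nested dict."""
    nested = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # assumption: silently skip blank or malformed lines
        outer, inner, value = parts
        nested[outer][inner] = value
    # Convert to a plain dict so later lookups of missing keys raise KeyError again.
    return dict(nested)

result = build_nested(io.StringIO(SAMPLE))
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed in the same way as the StringIO object here.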

Even without a defaultdict, you can let the outer dictionary keep the reference to the nested dictionary, and retrieve it again by using values[0]; there is no need to keep the temp reference around:

>>> dicty = {}
>>> dicty['A'] = {}
>>> dicty['A']['a'] = 1
>>> dicty['A']['b'] = 2
>>> dicty
{'A': {'a': 1, 'b': 2}}
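As for why the temporary dictionary misbehaves in the question's loop: assigning tmp into the outer dictionary stores a reference, not a copy, so a later tmp.clear() also empties the dictionary that was just stored. That would explain 'proA' ending up with the FrontPage entries. The sketch below reproduces the effect in isolation; the name fixed is only illustrative.

```python
dicty = {}
tmp = {"macbook": "0.666667"}

dicty["proA"] = tmp           # stores a reference to tmp, not a copy
tmp.clear()                   # empties the one dict that both names point at
tmp["frontpage"] = "0.710145"

print(dicty)                  # {'proA': {'frontpage': '0.710145'}}
print(dicty["proA"] is tmp)   # True

# Rebinding tmp to a fresh dict leaves the stored dict untouched.
fixed = {}
tmp = {"macbook": "0.666667"}
fixed["proA"] = tmp
tmp = {}                      # new object; fixed["proA"] keeps its contents
tmp["frontpage"] = "0.710145"
print(fixed)                  # {'proA': {'macbook': '0.666667'}}
```

Note that the question's loop only stores a group when the next first word appears, so even with this change the final group would still need to be stored after the loop ends.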

All the defaultdict then does is keep us from having to test if we already created that nested dictionary. Instead of:

if outer not in dicty:
    dicty[outer] = {}
dicty[outer][inner] = value

we simply omit the if test as defaultdict will create a new dictionary for us if the key was not yet present.
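For comparison, here is a small sketch that builds the same structure three ways: with the explicit membership test, with dict.setdefault() (a standard dict method not mentioned in the answer), and with defaultdict. The rows list is made-up sample data in the question's format.

```python
from collections import defaultdict

# Illustrative (outer, inner, value) triples in the question's format.
rows = [
    ("proA", "macbook", "0.666667"),
    ("proA", "ssd", "0.666667"),
    ("FrontPage", "frontpage", "0.710145"),
]

# 1. Plain dict with an explicit membership test.
explicit = {}
for outer, inner, value in rows:
    if outer not in explicit:
        explicit[outer] = {}
    explicit[outer][inner] = value

# 2. Plain dict using setdefault, which inserts {} only if the key is missing.
with_setdefault = {}
for outer, inner, value in rows:
    with_setdefault.setdefault(outer, {})[inner] = value

# 3. defaultdict, which creates the inner dict on first access.
automatic = defaultdict(dict)
for outer, inner, value in rows:
    automatic[outer][inner] = value

print(explicit == with_setdefault == dict(automatic))  # True
```

All three produce equal dictionaries; the difference is only in how much bookkeeping the loop body has to do.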
