Pythonic search of list of dictionaries


Question



Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o = {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Zarzadzanie systemem'}

I put those dictionaries into a list:

oldl = [x for x in orig]  # where orig = csv.DictReader(ofile ...

...and then search for matching source terms in two loops:

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            ...

Now, this works. However, not only is this very un-Pythonic, it is also
very inefficient: the complexity is O(n**2), so it scales up very
badly.

What I want to know is whether there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dictionary dx_1.

I HAVE to search for values corresponding to the key 'English', since
there are big gaps in both files (i.e. there are a lot of rows
in the old file that do not correspond to rows in the new
file, and vice versa). I don't want to do ugly things like converting a
dictionary to a string so I could use the string.find() method.

Obviously it does not have to be implemented this way. If the
data structures here could be designed in a proper
(Pythonesque ;-) way, great.

I do realize that this resembles doing some operations on
matrices. But I have never tried doing something like this in
Python.
#---------- Code follows ---------

import sys
import csv

class excelpoldialect(csv.Dialect):
    delimiter = ';'
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = 0
    skipinitialspace = False

epdialect = excelpoldialect()
csv.register_dialect('excelpol', epdialect)

try:
    ofile = open(sys.argv[1], 'rb')
except IOError:
    print "Old file %s could not be opened" % (sys.argv[1])
    sys.exit(1)

try:
    tfile = open(sys.argv[2], 'rb')
except IOError:
    print "New file %s could not be opened" % (sys.argv[2])
    sys.exit(1)

titles = csv.reader(ofile, dialect='excelpol').next()
orig = csv.DictReader(ofile, titles, dialect='excelpol')
transl = csv.DictReader(tfile, titles, dialect='excelpol')

cfile = open('cmpfile.csv', 'wb')
titles.append('New')
titles.append('RowChanged')
cm = csv.DictWriter(cfile, titles, dialect='excelpol')
cm.writerow(dict(zip(titles, titles)))
print titles
print "-------------"

oldl = [x for x in orig]
newl = [x for x in transl]

all = []

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            if n['Polish'] == o['Polish']:
                status = ''
            else:
                status = 'CHANGED'
            combined = {'TermID': o['TermID'], 'English': o['English'],
                        'Polish': o['Polish'], 'New': n['Polish'],
                        'RowChanged': status}
            cm.writerow(combined)
            all.append(combined)

# duplicates

dfile = open('dupes.csv', 'wb')
dupes = csv.DictWriter(dfile, titles, dialect='excelpol')
dupes.writerow(dict(zip(titles, titles)))

"""for i in xrange(0, len(all)-2):
    for j in xrange(i+1, len(all)-1):
        if (all[i]['English'] == all[j]['English']) and \
           all[i]['RowChanged'] == 'CHANGED':
            dupes.writerow(all[i])
            dupes.writerow(all[j])"""

cfile.close()
ofile.close()
tfile.close()
dfile.close()




--

Real world is perfectly indifferent to lies that
are the foundation of leftist "thinking".

Solutions

Bulba! wrote:

Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o = {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Zarzadzanie systemem'}

I put those dictionaries into the list:

oldl = [x for x in orig]  # where orig = csv.DictReader(ofile ...

..and then search for matching source terms in two loops:

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            ...

Now, this works. However, not only is this very un-Pythonic, it is also
very inefficient: the complexity is O(n**2), so it scales up very
badly.

What I want to know is whether there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dictionary dx_1.
Sure, just do a little preprocessing. Something like (untested):

####

def make_map(l):
    # This assumes that each English key is unique in a given l;
    # if it's not, you'll have to use a list of o instead of o itself.
    map = {}
    for d in l:
        if 'English' in d:
            key = d['English']
            map[key] = d
    return map

old_map = make_map(oldl)
new_map = make_map(newl)

for engphrase in old_map:
    if engphrase in new_map:
        o = old_map[engphrase]
        n = new_map[engphrase]
        if n['Polish'] == o['Polish']:
            status = ''
        else:
            status = 'CHANGED'
        # process....

####

I've assumed that the English key is unique in both the old and new
lists. If it's not, this will need some adjustment. However, your
original algorithm is going to behave weirdly in that case anyway
(spitting out multiple lines with the same id, but potentially different
new terms and update status).

Hope that's useful.

-tim
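For readers on modern Python, the same preprocessing idea can be sketched with dict comprehensions; building the index is O(n) and each lookup is O(1), eliminating the inner loop entirely. The sample rows below are invented for illustration, not the poster's actual files:

```python
# Index each row list by its 'English' value once, then join by key lookup.
old_rows = [
    {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Zarzadzanie systemem'},
    {'TermID': '7', 'English': 'File', 'Polish': 'Plik'},
]
new_rows = [
    {'TermID': '4', 'English': 'System Administration',
     'Polish': 'Administracja systemem'},
]

# O(n) preprocessing: one dict per file, keyed by the English term.
old_map = {d['English']: d for d in old_rows if 'English' in d}
new_map = {d['English']: d for d in new_rows if 'English' in d}

changed = []
for phrase, o in old_map.items():
    n = new_map.get(phrase)      # O(1) lookup instead of scanning newl
    if n is not None and n['Polish'] != o['Polish']:
        changed.append(o['TermID'])

print(changed)  # TermIDs whose Polish translation changed
```

As Tim notes, this assumes the English term is unique within each file; duplicate keys would silently keep only the last row, so a dict of lists would be needed in that case.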






Bulba> I put those dictionaries into the list:

Bulba> oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

Bulba> ..and then search for matching source terms in two loops:

Bulba> for o in oldl:
Bulba>     for n in newl:
Bulba>         if n['English'] == o['English']:
Bulba>             ...

Bulba> Now, this works. However, not only this is very un-Pythonic, but
Bulba> also very inefficient: the complexity is O(n**2), so it scales up
Bulba> very badly.

How about using sets?

oenglish = set([item['English'] for item in oldl])
nenglish = set([item['English'] for item in newl])

matching = oenglish & nenglish

Once you have those that match, you can constrain your outer loop to just
those cases where

o['English'] in matching

If you're not using 2.4 yet, then get sets via:

from sets import Set as set

That's still not all that Pythonic, but should be a bit faster.

You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a, b: cmp(a['English'], b['English']))
newl.sort(lambda a, b: cmp(a['English'], b['English']))

Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.

Skip
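Skip's set-intersection filter can be sketched in modern syntax with set comprehensions (the `sets` module import he mentions was only needed before Python 2.4). The sample rows here are invented for illustration:

```python
# Build the set of English terms present in each file, intersect them,
# and keep only the old rows whose term appears in both files.
oldl = [{'English': 'File'}, {'English': 'Edit'}, {'English': 'View'}]
newl = [{'English': 'Edit'}, {'English': 'View'}, {'English': 'Help'}]

oenglish = {item['English'] for item in oldl}
nenglish = {item['English'] for item in newl}
matching = oenglish & nenglish          # terms present in both files

# Constrain the outer loop to just the matching cases.
shared = [o for o in oldl if o['English'] in matching]
print(sorted(matching))
```

Note this only prunes the outer loop; pairing each old row with its new counterpart still needs a dict index or a sorted merge to reach O(n).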


Skip Montanaro wrote:

...lotsa great stuff ...

You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a, b: cmp(a['English'], b['English']))
newl.sort(lambda a, b: cmp(a['English'], b['English']))

To complete the thought, for 2.4 and after, the new-fashioned way is:

import operator

oldl.sort(key=operator.itemgetter('English'))
newl.sort(key=operator.itemgetter('English'))

Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.

But overall you will have O(n log n) because of the sorts.

--Scott David Daniels
Sc***********@Acm.Org
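The "march through the lists in parallel" step the replies allude to is a classic two-pointer merge over the sorted lists; a minimal sketch, with invented sample rows, might look like this:

```python
from operator import itemgetter

oldl = [{'English': 'Edit', 'Polish': 'Edytuj'},
        {'English': 'File', 'Polish': 'Plik'}]
newl = [{'English': 'File', 'Polish': 'Plik'},
        {'English': 'Help', 'Polish': 'Pomoc'}]

# O(n log n): sort both lists by the join key.
oldl.sort(key=itemgetter('English'))
newl.sort(key=itemgetter('English'))

# O(n): advance two cursors, emitting matches and skipping unpaired rows.
i = j = 0
matches = []
while i < len(oldl) and j < len(newl):
    a, b = oldl[i]['English'], newl[j]['English']
    if a == b:
        matches.append(a)   # same source term in both files
        i += 1
        j += 1
    elif a < b:
        i += 1              # old-only row: no counterpart in the new file
    else:
        j += 1              # new-only row
print(matches)
```

This handles the "big gaps" the poster describes naturally, since unpaired rows on either side are simply stepped over.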

