存储一组四个(或多个)值的最佳数据结构是什么? [英] What is the best data structure for storing a set of four (or more) values?

查看:153
本文介绍了存储一组四个(或多个)值的最佳数据结构是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我有以下变量及其相应的,表示记录

  name ='abc'
age = 23
weight = 60
height =

请注意,可能是不同的类型 string integer float ,引用到任何其他对象等)。



将有许多记录(至少> 100,000)。当所有这四个变量时,每个记录将是唯一 >(实际上它的)放在一起。换句话说,没有记录与所有4 是相同的。



我试图在 Python 中找到一个有效的数据结构,这将允许我(存储和)检索记录基于 log(n)时间复杂度中的任何一个变量



例如:

  def retrieve(name = None,age =无,weight = None,height = None)
如果名称不是None,年龄是None,weight是None,height是None:
/ *获取所有具有给定名称的记录* /
如果姓名为无,年龄不为无,体重为无,身高为无:
/ *获取所有具有给定年龄的记录* /
....
返回记录

应该调用检索的方式是如下:

  retrieve(name ='abc')

以上内容应返回 [{name:'abc',age:23,wight:50,height = 175},{name:'abc' ,年龄:28,wight:55,height = 170}等]

  = 23)

上面应该返回 [{name:'abc'年龄:23​​,wight:50,height = 175},{name:'def',age:23,wight:65,height = 180}等]



而且,将来可能需要再添加一个或两个变量到此记录。例如,说 sex ='m'。因此,检索函数必须是可扩展的。



所以简而言之:在$ $ c中是否有数据结构$ c> Python 这将允许存储记录 n 数量(名称,年龄,性别,体重,身高等)和检索记录根据列在对数(或理想情况下常量 - O(1) - 时间)复杂性?

解决方案

Python中没有一个内置的数据结构,可以完成您所需要的一切,但使用组合非常简单为了实现你的目标而做到这一点。



例如,你的输入是以逗号分隔值文件名为 employees.csv ,字段名称由第一行定义:

 名称,年龄,体重,身高
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,​​5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

以下是工作代码,它说明了如何将这些数据读取并存储到记录列表中,并自动创建单独的查找表,用于查找与这些记录中每个字段中包含的值相关联的记录。



线是由 namedtuple 创建的类的实例,这是一个非常高效的内存,因为每个缺少一个 __ dict __ 属性实例通常包含。使用它们将可以使用点语法访问每个字段,如 record.fieldname



查找表是 defaultdict(list)实例,它们平均提供类似字典的O(1)查找时间,并且还允许多个值相关联与每一个。因此,查找关键字是正在寻求的字段值的值,与其相关联的数据将是存储在 Person 中的整数索引的列表, code> employees 列出了这个值 - 所以他们都会相对较小。



请注意,该类的代码是完全数据驱动,因为它不包含任何硬编码的字段名称,而是在读入时从csv数据输入文件的第一行获取。当使用实例时,任何实际的方法调用必须包含有效的字段名称关键字参数。



更新



修改为在首次读取数据文件时不为每个字段的每个唯一值创建查找表。现在, retrieve()方法仅在需要时创建它们(并保存/缓存结果以供将来使用)。

  from collections import defaultdict,namedtuple 
import csv

class DataBase(object):
def __init __(self,csv_filename,recordname):
#将数据从csv格式文件读取到名为$ t $ b的列表中,打开(csv_filename,'rb')作为输入文件:
csv_reader = csv.reader(inputfile,delimiter =',')
self.fields = csv_reader.next()#读头行
self.Record = namedtuple(recordname,self.fields)
self.records = [self.Record(* row)for csv_reader中的行]
self.valid_fieldnames = set(self.fields)
#为每个字段名称创建一个空表查找表,将每个唯一字段值映射到记录列表一个包含它的
#的列表索引
self.lookup_tables = d efaultdict(lambda:defaultdict(list))

def retrieve(self,** kwargs):
获取带有值的字段名称的记录列表
作为关键字arg(或返回None,如果没有)。 $
如果len(kwargs)!= 1:raise ValueError(
'完全一个函数'
'(%s指定)'%','所需的一个字段名称/关键字参数。 join([krk(k)for k in kwargs.keys()]))
field,value = kwargs.items()[0]#只得到关键字arg和值
如果字段不在自己.valid_fieldnames:
raise ValueError('keyword arg%sisn\'ta有效字段名称'%字段)
如果字段不在self.lookup_tables中:#必须创建字段查找表
for index,record in enumerate(self.records):
value = getattr(record,field)
self.lookup_tables [field] [value] .append(index)
matches = [self.records [index]
for self.lookup_tables [field] .get(value,[])]
返回匹配,如果匹配else无

如果__name__ = ='__main__':
empdb = DataBase('employees.csv','Person )
printretrieve(name ='Ted Kingston'):,empdb.retrieve(name ='Ted Kingston')
printretrieve(age = '27'):,empdb.retrieve (age = '27')
printretrieve(weight ='150'):,empdb.retrieve(weight ='150')
try:
print '5ft 6in'):,empdb.retrieve(hight ='5ft 6in')
除了ValueError作为e:
printValueError('{}')按预期提出.format(e)
else:
raise type('NoExceptionError',(Exception,),{})(
'从retrieve(hight = \'5ft\') 。')

输出:

  retrieve(name ='Ted Kingston'):[Person(name ='Ted Kingston',age = '28',weight ='163',height = 5ft 10in')] 
retrieve(age = '27'):[Person(name ='Mary Manson',age = '27',weight ='140',height ='5ft 6in'),
Person(name ='Sue Sommers',age = '27',weight ='132',height ='5ft 8in']]
retrieve(weight ='150'):无
retrieve ='5ft 6in'):ValueError('keyword arghight是一个无效的字段名)
按预期方式生成


Say I have the following variables and its corresponding values which represents a record.

name = 'abc'
age = 23
weight = 60
height = 174

Please note that the value could be of different types (string, integer, float, reference-to-any-other-object, etc).

There will be many records (at least >100,000). Each and every record will be unique when all these four variables (actually its values) are put together. In other words, there exists no record with all 4 values are the same.

I am trying to find an efficient data structure in Python which will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.

For example:

def retrieve(name=None,age=None,weight=None,height=None) 
    if name is not None and age is None and weight is None and height is None:
        /* get all records with the given name */
    if name is None and age is not None and weight is None and height is None:
        /* get all records with the given age */
    ....
    return records

The way the retrieve should be called is as follows:

retrieve(name='abc') 

The above should return [{name:'abc', age:23, wight:50, height=175}, {name:'abc', age:28, wight:55, height=170}, etc]

retrieve(age=23) 

The above should return [{name:'abc', age:23, wight:50, height=175}, {name:'def', age:23, wight:65, height=180}, etc]

And, I may need to add one or two more variables to this record in future. For example, say, sex = 'm'. So, the retrieve function must be scalable.

So in short: Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?

解决方案

There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it has to achieve your goals and do so fairly efficiently.

For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:

name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values of contained in the fields each of these record.

The records are instances of a class created by namedtuple which is a very memory efficient because each one lacks a __dict__ attribute that class instances normally contain. Using them will make it possible to access the fields of each by name using dot syntax, like record.fieldname.

The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average, and also allow multiple values to be associated with each one. So the look-up key is the value of the field value being sought and the data associated with it will be a list of the integer indices of the Person records stored in the employees list with that value — so they'll all be relatively small.

Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names which instead are all taken from the first row of csv data input file when it's read in. When an instance is used, any actual retrieve() method calls must, of course, contain valid field name keyword arguments.

Update

Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method creates them only as needed (and saves/caches the result for future use).

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # read data from csv format file into a list of namedtuples
        with open(csv_filename, 'rb') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = csv_reader.next()  # read header row
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)
        # create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it
        self.lookup_tables = defaultdict(lambda: defaultdict(list))

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any). """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname/keyword argument required for function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.items()[0]  # get only keyword arg and value
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # must create field look up table
            for index, record in enumerate(self.records):
                value = getattr(record, field)
                self.lookup_tables[field][value].append(index)
        matches = [self.records[index]
                    for index in self.lookup_tables[field].get(value, [])]
        return matches if matches else None

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')
    print "retrieve(name='Ted Kingston'):", empdb.retrieve(name='Ted Kingston')
    print "retrieve(age='27'):", empdb.retrieve(age='27')
    print "retrieve(weight='150'):", empdb.retrieve(weight='150')
    try:
        print "retrieve(hight='5ft 6in'):", empdb.retrieve(hight='5ft 6in')
    except ValueError as e:
        print "ValueError('{}') raised as expected".format(e)
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')

Output:

retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
                           raised as expected

这篇关于存储一组四个(或多个)值的最佳数据结构是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆