Mapping lots of similar tables in SQLAlchemy


Question


I have many (~2000) locations with time series data. Each time series has millions of rows. I would like to store these in a Postgres database. My current approach is to have a table for each location time series, and a meta table which stores information about each location (coordinates, elevation etc). I am using Python/SQLAlchemy to create and populate the tables. I would like to have a relationship between the meta table and each time series table to do queries like "select all locations that have data between date A and date B" and "select all data for date A and export a csv with coordinates". What is the best way to create many tables with the same structure (only the name is different) and have a relationship with a meta table? Or should I use a different database design?


Currently I am using this type of approach to generate a lot of similar mappings:

from sqlalchemy import create_engine, MetaData
from sqlalchemy.types import Float, String, DateTime, Integer
from sqlalchemy import Column, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()


def make_timeseries(name):
    class TimeSeries(Base):

        __tablename__ = name
        table_name = Column(String(50), ForeignKey('locations.table_name'))
        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        location = relationship('Location', backref=backref('timeseries',
                                lazy='dynamic'))

        def __init__(self, table_name, datetime, value):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

        def __repr__(self):
            return "{}: {}".format(self.datetime, self.value)

    return TimeSeries


class Location(Base):

    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)

if __name__ == '__main__':
    connection_string = 'postgresql://user:pw@localhost/location_test'
    engine = create_engine(connection_string)
    metadata = MetaData(bind=engine)
    Session = sessionmaker(bind=engine)
    session = Session()

    TS1 = make_timeseries('ts1')
    # TS2 = make_timeseries('ts2')   # this breaks because of the foreign key
    Base.metadata.create_all(engine)
    # a matching Location row so the relationship query below has data
    session.add(Location(table_name='ts1', lon=1.0, lat=5.0))
    session.add(TS1("ts1", "2001-01-01", 999))
    session.add(TS1("ts1", "2001-01-02", -555))
    session.commit()

    qs = session.query(Location).first()
    print(qs.timeseries.all())


This approach has some problems, most notably that if I create more than one TimeSeries the foreign key doesn't work. Previously I've used some workarounds, but it all seems like a big hack and I feel there must be a better way of doing this. How should I organise and access my data?

Answer


Alternative-1: Table Partitioning

Partitioning immediately comes to mind when reading "exactly the same table structure". I am not a DBA and do not have much production experience using it (even less so on PostgreSQL), but please read the PostgreSQL - Partitioning documentation. Table partitioning seeks to solve exactly the problem you have, but over 1K tables/partitions sounds challenging; therefore please do more research on forums/SO for scalability-related questions on this topic.


Given that the datetime component is central to both of your most-used search criteria, a solid indexing strategy on it is essential. If you decide to go down the partitioning route, the obvious strategy would be to partition on date ranges. This would allow you to keep older data in different chunks than the most recent data: assuming old data is (almost) never updated, its physical layout would stay dense and efficient, while you could employ another strategy for more "recent" data.
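
As a rough sketch of such a date-range layout (my addition for illustration, not from the original answer: it assumes PostgreSQL 11 or later with declarative partitioning, and the table name timeseries, the yearly ranges and the index name are made up for the example), a single partitioned table could replace the ~2000 per-location tables:

from sqlalchemy import create_engine, text

# A minimal sketch of date-range partitioning, assuming PostgreSQL 11+
# declarative partitioning; "timeseries", the yearly ranges and the
# index name are illustrative assumptions.
engine = create_engine('postgresql://user:pw@localhost/location_test')

statements = [
    # One parent table keyed by location and time; the location becomes
    # a plain foreign-key column instead of a table name.
    """CREATE TABLE timeseries (
           location_id INTEGER NOT NULL REFERENCES locations (id),
           datetime    TIMESTAMP NOT NULL,
           value       DOUBLE PRECISION,
           PRIMARY KEY (location_id, datetime)
       ) PARTITION BY RANGE (datetime)""",
    # One partition per year; old partitions stay dense because old data
    # is (almost) never updated.
    """CREATE TABLE timeseries_2001 PARTITION OF timeseries
           FOR VALUES FROM ('2001-01-01') TO ('2002-01-01')""",
    """CREATE TABLE timeseries_2002 PARTITION OF timeseries
           FOR VALUES FROM ('2002-01-01') TO ('2003-01-01')""",
    # Index backing the date-range filters; defined on the parent, it
    # cascades to every partition.
    "CREATE INDEX ix_timeseries_datetime ON timeseries (datetime)",
]

with engine.begin() as conn:
    for stmt in statements:
        conn.execute(text(stmt))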

Alternative-2: Concrete Table Inheritance

This basically makes your sample code work by tricking SA into assuming that all those TimeSeries tables are children of one entity, using Concrete Table Inheritance. The code below is self-contained and creates 50 tables with minimal data in them. But if you already have a database, it should allow you to check the performance rather quickly, so that you can decide whether this is even a feasible approach.

from datetime import date, datetime

from sqlalchemy import create_engine, Column, String, Integer, DateTime, Float, ForeignKey, func
from sqlalchemy.orm import sessionmaker, relationship, configure_mappers
from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.ext.declarative import AbstractConcreteBase


engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base(engine)


# MODEL
class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


class TSBase(AbstractConcreteBase, Base):
    @declared_attr
    def table_name(cls):
        return Column(String(50), ForeignKey('locations.table_name'))


def make_timeseries(name):
    class TimeSeries(TSBase):
        __tablename__ = name
        __mapper_args__ = { 'polymorphic_identity': name, 'concrete':True}

        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        def __init__(self, datetime, value, table_name=name ):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

    return TimeSeries


def _test_model():
    _NUM = 50
    # 0. generate classes for all tables
    TS_list = [make_timeseries('ts{}'.format(1+i)) for i in range(_NUM)]
    TS1, TS2, TS3 = TS_list[:3] # just to have some named ones
    Base.metadata.create_all()
    print('-'*80)

    # 1. configure mappers
    configure_mappers()

    # 2. define relationship
    Location.timeseries = relationship(TSBase, lazy="dynamic")
    print('-'*80)

    # 3. add some test data
    session.add_all([Location(table_name='ts{}'.format(1+i), lat=5+i, lon=1+i*2)
        for i in range(_NUM)])
    session.commit()
    print('-'*80)

    session.add(TS1(datetime(2001,1,1,3), 999))
    session.add(TS1(datetime(2001,1,2,2), 1))
    session.add(TS2(datetime(2001,1,2,8), 33))
    session.add(TS2(datetime(2002,1,2,18,50), -555))
    session.add(TS3(datetime(2005,1,3,3,33), 8))
    session.commit()


    # Query-1: get all timeseries of one Location
    #qs = session.query(Location).first()
    qs = session.query(Location).filter(Location.table_name == "ts1").first()
    print(qs)
    print(qs.timeseries.all())
    assert 2 == len(qs.timeseries.all())
    print('-'*80)


    # Query-2: select all location with data between date-A and date-B
    dateA, dateB = date(2001,1,1), date(2003,12,31)
    qs = (session.query(Location)
            .join(TSBase, Location.timeseries)
            .filter(TSBase.datetime >= dateA)
            .filter(TSBase.datetime <= dateB)
            ).all()
    print(qs)
    assert 2 == len(qs)
    print('-'*80)


    # Query-3: select all data (including coordinates) for date A
    dateA = date(2001,1,1)
    qs = (session.query(Location.lat, Location.lon, TSBase.datetime, TSBase.value)
            .join(TSBase, Location.timeseries)
            .filter(func.date(TSBase.datetime) == dateA)
            ).all()
    print(qs)
    # @note: qs is list of tuples; easy export to CSV
    assert 1 == len(qs)
    print('-'*80)


if __name__ == '__main__':
    _test_model()
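
One caveat worth measuring before committing to this approach: with AbstractConcreteBase, querying through TSBase selects from a polymorphic UNION ALL over every concrete table, so with ~2000 tables each cross-table query unions ~2000 SELECTs. The fragment below (my addition; it is meant to run at the end of _test_model() above, not standalone) prints the SQL such a query compiles to:

# Fragment for the end of _test_model() above: show the SQL that a
# cross-table query compiles to. With many concrete tables this is a
# large UNION ALL, which is the main scalability cost to measure.
print(session.query(TSBase)
      .filter(TSBase.datetime >= datetime(2001, 1, 1))
      .statement)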


Alternative-3: a-la BigData

If you do get into performance problems using the database, I would probably try:


    • still keep the data in separate tables/databases/schemas like you do right now
    • bulk-import data using "native" solutions provided by your database engine (see the sketch after this list)
    • use MapReduce-like analysis.
      • Here I would stay with python and sqlalchemy and implement my own distributed query and aggregation (or find something existing). This, obviously, only works if you are not required to produce those results directly on the database.
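
To make the bulk-import bullet concrete, here is a minimal sketch (my addition, not part of the original answer; it assumes the psycopg2 driver and the ts1 table from the question) that streams rows through PostgreSQL's COPY instead of per-row ORM inserts:

import io

from sqlalchemy import create_engine

# A sketch of bulk import via PostgreSQL's COPY, assuming the psycopg2
# driver; the connection string and the ts1 layout come from the question.
engine = create_engine('postgresql://user:pw@localhost/location_test')

# Tab-separated rows matching the (table_name, datetime, value) columns.
rows = io.StringIO(
    "ts1\t2001-01-01 00:00:00\t999\n"
    "ts1\t2001-01-02 00:00:00\t-555\n"
)

raw = engine.raw_connection()
try:
    cursor = raw.cursor()
    # copy_expert is psycopg2-specific and bypasses per-row INSERT overhead.
    cursor.copy_expert(
        "COPY ts1 (table_name, datetime, value) FROM STDIN WITH (FORMAT text)",
        rows,
    )
    raw.commit()
finally:
    raw.close()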


I have no experience using those on a large scale, but it is definitely an option worth considering.

It would be great if you could later share your findings and the whole decision process on this.
