Renaming columns when querying with SQLAlchemy into Pandas DataFrame

Question

Is there a way to retain the SQLAlchemy attribute names when you query the data into a pandas DataFrame?

Here's a simple mapping of my database. For the school table, I've renamed 'SchoolDistrict', the DB name, to a shorter 'district'. I'm several layers removed from the DBA, so changing the names at the source isn't feasible.

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    SchoolKey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    PointsPossible = Column('PointsPossible', Integer)
    PointsReceived = Column('PointsReceived', Integer)

    school = relationship("School", backref='studentscore')

So when I query something like:

query = session.query(StudentScore, School).join(School)
df = pd.read_sql(query.statement, query.session.bind)

In the returned DataFrame df, I end up with the underlying 'SchoolDistrict' name for the column, not my attribute name.

An even more annoying case is when duplicate column names exist across tables. For example:

class Teacher(Base):
    __tablename__ = 'DimTeacher'

    id = Column('TeacherKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('LastName', String)

class Student(Base):
    __tablename__ = 'DimStudent'

    id = Column('StudentKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('LastName', String)

So a query across both tables (like the one below) produces a DataFrame with duplicate FirstName and LastName columns.

query = session.query(StudentScore, Student, Teacher).join(Student).join(Teacher)

Would it be possible to rename these columns at the moment of the query? Right now I'm having trouble keeping these two systems of column names straight in my head.

Recommended answer

This is the kind of solution I would bitterly complain about if I had to maintain the code afterwards. But your question has so many constraints that I cannot find anything better.

First, use introspection to build dictionaries that map the schema's table and column names to the class and attribute names, like this (I'm using the first example you posted):

In [132]:

def add_to_dict(c_map, t_map, table):
    name = table.__tablename__
    t_map[name] = table.__name__
    #print name
    c_map[name] = {}
    for column in dir(table):
        c_schema_name = table.__mapper__.columns.get(column)
        if isinstance(c_schema_name, Column):
            #print column, c_schema_name.name
            c_map[name][c_schema_name.name] = column

c_map = {}
t_map = {}
add_to_dict(c_map, t_map, School)
add_to_dict(c_map, t_map, StudentScore)
print(c_map['DimSchool']['SchoolKey'])
print(c_map['FactStudentScore']['SchoolKey'])
print(t_map['DimSchool'])
id
SchoolKey
School

[Explanation of how the dictionaries are built with introspection:

  • c_map is the dictionary of column-name correspondences
  • t_map is the dictionary of table-name correspondences
  • add_to_dict needs to be called once for each table class
  • for table names the correspondence is easy, since both names are attributes of the table class
  • for the column names of the class, first iterate over the attributes of the class using dir
  • for each attribute of the class (these include the table's columns, but also many other things), try to get the database column using the SQLAlchemy mapper
  • the mapper returns a Column object only if the attribute really is a column
  • for each Column object, add an entry to the column-name dictionary: the database name is obtained with .name, and the other name is just the attribute

Run this just once after creating all the objects in the database, calling it once per table class.]
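As a possible variation on the introspection step, the mapper can be asked for its column attributes directly instead of scanning dir(). The following is only a sketch, assuming SQLAlchemy 1.4 or later; column_map is a hypothetical helper name, not part of the answer above, and the School class is repeated so the snippet runs on its own.

```python
from sqlalchemy import Column, Integer, String, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


def column_map(model):
    """Hypothetical helper: map database column names to attribute names."""
    mapper = inspect(model)  # the Mapper for a declarative class
    return {
        column.name: prop.key          # DB column name -> Python attribute name
        for prop in mapper.column_attrs
        for column in prop.columns
    }


# Same shape as the c_map / t_map dictionaries built by add_to_dict.
c_map = {School.__tablename__: column_map(School)}
t_map = {School.__tablename__: School.__name__}
print(c_map)
print(t_map)
```

This avoids probing every attribute returned by dir(), at the cost of relying on the mapper's column_attrs collection.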

Then take your SQL statement and build the list of translated names for the columns you are going to get:

In [134]:

df_columns = []
for column in str(query.statement).split('FROM')[0].split('SELECT')[1].split(','):
    table = column.split('.')[0].replace('"', '').strip()
    c_schema = column.split('.')[1].replace('"', '').strip()
    df_columns += [t_map[table] + '.' + c_map[table][c_schema]]
print(df_columns)
['StudentScore.SchoolKey', 'StudentScore.PointsPossible', 'StudentScore.PointsReceived', 'School.id', 'School.name', 'School.district']
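The string handling in this step can be tried without a database. The sketch below wraps the same split-based parsing in a function and feeds it hand-written stand-ins: translate_columns is a hypothetical name, and the c_map, t_map and sql values are typed in by hand for illustration rather than produced by SQLAlchemy.

```python
def translate_columns(sql, c_map, t_map):
    """Turn '"Table"."Column"' entries of a SELECT list into 'Class.attribute'."""
    # Keep only the text between the first SELECT and the first FROM.
    select_list = sql.split('FROM')[0].split('SELECT')[1]
    names = []
    for entry in select_list.split(','):
        # Each entry looks like ' "DimSchool"."SchoolKey"'.
        table, column = (part.replace('"', '').strip() for part in entry.split('.'))
        names.append(t_map[table] + '.' + c_map[table][column])
    return names


# Hand-written stand-ins for the introspected dictionaries and the statement text.
c_map = {
    'DimSchool': {'SchoolKey': 'id', 'SchoolName': 'name', 'SchoolDistrict': 'district'},
    'FactStudentScore': {'SchoolKey': 'SchoolKey'},
}
t_map = {'DimSchool': 'School', 'FactStudentScore': 'StudentScore'}
sql = ('SELECT "FactStudentScore"."SchoolKey", "DimSchool"."SchoolKey", '
       '"DimSchool"."SchoolDistrict" \nFROM "FactStudentScore" JOIN "DimSchool"')

print(translate_columns(sql, c_map, t_map))
```

Note that this parsing is fragile: it assumes every selected item is a plain table.column pair, so expressions, aliases, or names containing a comma or the text FROM would break it.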

Finally, read the DataFrame as in your question and change the names of the columns:

In [137]:

df.columns = df_columns
In [138]:

df
Out[138]:
   StudentScore.SchoolKey  StudentScore.PointsPossible  StudentScore.PointsReceived  School.id  School.name  School.district
0                       1                            1                         None          1      School1             None

(The data is just a dummy record I created.)
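A different technique, not the one used in this answer, is to label each column when building the query, so that the SQL itself carries the attribute names. The sketch below assumes SQLAlchemy 1.4 or later; the label names are arbitrary choices for illustration, and the School class is repeated so the snippet runs on its own.

```python
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


# Each label becomes an "AS <name>" alias in the generated SQL.
stmt = select(
    School.id.label('school_id'),
    School.name.label('school_name'),
    School.district.label('district'),
)

print(stmt)                                  # the SELECT with its aliases
print(list(stmt.selected_columns.keys()))    # the names a result set would expose
```

Passing such a statement to pd.read_sql should give DataFrame columns named after the labels, since pandas takes the names from the result set. The trade-off is that every column has to be listed and labeled by hand.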

Hope it helps!
