Workflow for adding new columns from Pandas to SQLite tables


Problem description



Setup

Two tables: schools and students. The index (or keys) in SQLite will be id and time for the students table and school and time for the schools table. My dataset is about something different, but I think the school-student example is easier to understand.

import pandas as pd
import numpy as np
import sqlite3

df_students = pd.DataFrame(
{'id': list(range(0,4)) + list(range(0,4)),
'time': [0]*4 + [1]*4, 'school': ['A']*2 + ['B']*2 + ['A']*2 + ['B']*2,
'satisfaction': np.random.rand(8)} )
df_students.set_index(['id', 'time'], inplace=True)

        satisfaction    school
id  time        
0   0   0.863023    A
1   0   0.929337    A
2   0   0.705265    B
3   0   0.160457    B
0   1   0.208302    A
1   1   0.029397    A
2   1   0.266651    B
3   1   0.646079    B

df_schools = pd.DataFrame({'school': ['A']*2 + ['B']*2, 'time': [0, 1]*2, 'mean_scores': np.random.rand(4)})
df_schools.set_index(['school', 'time'], inplace=True)
df_schools


               mean_scores
school  time    
A       0     0.358154
A       1     0.142589
B       0     0.260951
B       1     0.683727

Send to SQLite3

conn = sqlite3.connect('schools_students.sqlite')

df_students.to_sql('students', conn)
df_schools.to_sql('schools', conn)
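
As a quick check of what to_sql stores, the sketch below writes the same kind of students frame and reads it back. It uses an in-memory database rather than the file above (an assumption, so that nothing is written to disk). By default to_sql writes the index levels as ordinary columns, and the index_col argument of read_sql_query rebuilds the MultiIndex on the way back.

```python
import sqlite3

import numpy as np
import pandas as pd

# In-memory database used only for illustration (assumption).
conn = sqlite3.connect(":memory:")

df_students = pd.DataFrame(
    {"id": list(range(4)) * 2,
     "time": [0] * 4 + [1] * 4,
     "school": (["A"] * 2 + ["B"] * 2) * 2,
     "satisfaction": np.random.rand(8)}
).set_index(["id", "time"])

# The MultiIndex levels are written as regular columns named id and time.
df_students.to_sql("students", conn)

# index_col rebuilds the MultiIndex when reading the table back.
students_back = pd.read_sql_query(
    "SELECT * FROM students", conn, index_col=["id", "time"]
)
print(students_back.head())
conn.close()
```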

What do I need to do?

I have a bunch of functions that operate over pandas dataframes and create new columns that should then be inserted in either the schools or the students table (depending on what I'm constructing). A typical function does, in order:

  1. Queries columns from both SQL tables
  2. Uses pandas functions such as groupby, apply of custom functions, rolling_mean, etc. (many of them not available on SQL, or difficult to write) to construct a new column. The return type is either pd.Series or np.array
  3. Adds the new column to the appropriate dataframe (schools or students)

These functions were written when I had a small database that fit in memory, so they are pure pandas.

Here's an example in pseudo-code:

def example_f(satisfaction, mean_scores):
    """Silly function that divides mean satisfaction per school by mean score"""
    # here go the pandas functions I already wrote
    mean_satisfaction = satisfaction.mean()
    return mean_satisfaction / mean_scores

satisf_div_score = example_f(satisfaction, mean_scores)
# Here push satisf_div_score to `schools` table
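
For concreteness, here is one way the pseudo-code above could look as runnable code. It is only a sketch under assumptions: the table contents are made up, the database is in memory, and "per school" is read as "per (school, time) pair" so that the result lines up with the keys of the schools table.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database

# Made-up data with the same columns as the students and schools tables.
pd.DataFrame(
    {"id": [0, 1, 2, 3] * 2,
     "time": [0] * 4 + [1] * 4,
     "school": ["A", "A", "B", "B"] * 2,
     "satisfaction": [0.8, 0.6, 0.4, 0.2, 0.5, 0.7, 0.9, 0.3]}
).to_sql("students", conn, index=False)
pd.DataFrame(
    {"school": ["A", "A", "B", "B"],
     "time": [0, 1, 0, 1],
     "mean_scores": [0.5, 0.4, 0.2, 0.8]}
).to_sql("schools", conn, index=False)


def example_f(satisfaction, mean_scores):
    """Divide mean satisfaction per (school, time) by the school's mean score."""
    mean_satisfaction = satisfaction.groupby(level=["school", "time"]).mean()
    # Both Series are indexed by (school, time), so the division aligns on keys.
    return mean_satisfaction / mean_scores


# Step 1: query the needed columns from both tables.
satisfaction = pd.read_sql_query(
    "SELECT school, time, id, satisfaction FROM students",
    conn, index_col=["school", "time", "id"],
)["satisfaction"]
mean_scores = pd.read_sql_query(
    "SELECT school, time, mean_scores FROM schools",
    conn, index_col=["school", "time"],
)["mean_scores"]

# Step 2: build the new column, indexed like the schools table.
satisf_div_score = example_f(satisfaction, mean_scores).rename("satisf_div_score")
print(satisf_div_score)
conn.close()
```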

Because my dataset is really large, I'm not able to call these functions in memory. Imagine that schools are located in different districts. Originally I only had one district, so I know these functions can work with data from each district separately.

A workflow that I think would work is:

  • Query relevant data for district i
  • Apply function to data for district i and produce new columns as np.array or pd.Series
  • Insert this column into the appropriate table (this would fill in that column's data for district i)
  • Repeat for districts from i = 1 to K

Although my dataset is in SQLite (and I'd prefer it to stay that way!) I'm open to migrating it to something else if the benefits are large.


I realize there are different reasonable answers, but it would be great to hear something that has proved useful and simple for you. Thanks!

Solution

There are several approaches; pick whichever suits your particular task:

  1. Move all the data to a "bigger" database. Personally I prefer PostgreSQL, which handles big datasets very well. Fortunately pandas supports SQLAlchemy, a cross-database toolkit and ORM, so you can use the same queries with different databases.

  2. Split the data into chunks and run the calculation on each chunk separately. I'll demo it with PostgreSQL, but you may use any DB.

    import pandas as pd
    import psycopg2  # PostgreSQL driver used by SQLAlchemy
    from sqlalchemy import create_engine

    mydb = create_engine('postgresql://user@host.domain:5432/database')
    # Let's select some groups of data into a first dataframe;
    # you may use school ids instead of my sections.
    df = pd.read_sql_query('''SELECT sections, count(id)
                              FROM table
                              WHERE created_at < '2016-01-01'
                              GROUP BY sections ORDER BY 2 DESC LIMIT 10''',
                           con=mydb)
    print(df)  # don't worry about the strange output: sections has type int[] and it's supported well!
    
       sections     count
    0  [121, 227]  104583
    1  [296, 227]   48905
    2  [121]        43599
    3  [302, 227]   29684 
    4  [298, 227]   26814
    5  [294, 227]   24071
    6  [297, 227]   23038
    7  [292, 227]   22019
    8  [282, 227]   20369
    9  [283, 227]   19908
    
    # Now we have some sections and we can select only data related to them
    for section in df['sections']:
        df2 = pd.read_sql_query('''SELECT sections, name, created_at, updated_at, status
                                   FROM table
                                   WHERE created_at < '2016-01-01'
                                       AND sections = %(section)s
                                   ORDER BY created_at''',
                                con=mydb, params=dict(section=section))
        # numeric_only=True restricts std() to numeric columns such as status
        print(section, df2.std(numeric_only=True))
    
    [121, 227] status    0.478194
    dtype: float64
    [296, 227] status    0.544706
    dtype: float64
    [121] status    0.499901
    dtype: float64
    [302, 227] status    0.504573
    dtype: float64
    [298, 227] status    0.518472
    dtype: float64
    [294, 227] status    0.46254
    dtype: float64
    [297, 227] status    0.525619
    dtype: float64
    [292, 227] status    0.627244
    dtype: float64
    [282, 227] status    0.362891
    dtype: float64
    [283, 227] status    0.406112
    dtype: float64
    

    Of course this example is synthetic: it's quite ridiculous to compute the standard deviation of article statuses :) But it demonstrates how to split a large amount of data and process it in portions.

  3. Use database-specific features of PostgreSQL (or Oracle, MS SQL Server, or whichever you like) for the statistics. The PostgreSQL documentation on window functions is excellent. You can perform some of the calculations in the database and then load the precomputed data into a DataFrame as above.
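
SQLite itself also supports window functions (from SQLite 3.25, so this assumes the SQLite library bundled with your Python is at least that recent), which means part of a groupby-style calculation can stay in the database without migrating. A minimal sketch with a made-up students table:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
pd.DataFrame(
    {"id": [0, 1, 2, 3],
     "school": ["A", "A", "B", "B"],
     "satisfaction": [0.8, 0.6, 0.4, 0.2]}
).to_sql("students", conn, index=False)

# AVG(...) OVER (PARTITION BY ...) attaches the per-school mean to every row,
# much like groupby("school")["satisfaction"].transform("mean") in pandas.
query = """
    SELECT id, school, satisfaction,
           AVG(satisfaction) OVER (PARTITION BY school) AS school_mean
    FROM students
    ORDER BY id
"""
df = pd.read_sql_query(query, conn)
print(df)
conn.close()
```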

UPDATE: How to load the information back into the database.

Fortunately, DataFrame has a to_sql method that makes this easy:

from sqlalchemy import create_engine
mydb = create_engine('postgresql://user@host.domain:5432/database')
df2.to_sql('tablename', mydb, if_exists='append', chunksize=100)

You can specify the action you need: if_exists='append' adds the rows to an existing table. If you have a lot of rows, chunksize splits them into batches so the database can insert them.
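
Note that if_exists='append' adds rows, not columns. If the goal is a new column on an existing SQLite table, filled one district at a time as in the question, one possible approach is ALTER TABLE ... ADD COLUMN followed by per-district UPDATE statements. The sketch below is only an illustration: the district column, the new_col name and the computation itself are hypothetical and do not appear in the original data.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
pd.DataFrame(
    {"school": ["A", "B", "C", "D"],
     "district": [1, 1, 2, 2],  # hypothetical chunking key
     "mean_scores": [0.5, 0.25, 0.8, 0.4]}
).to_sql("schools", conn, index=False)

# Create the empty column once; existing rows get NULL.
conn.execute("ALTER TABLE schools ADD COLUMN new_col REAL")

districts = [row[0] for row in conn.execute("SELECT DISTINCT district FROM schools")]
for district in districts:
    # Load only one district into memory.
    chunk = pd.read_sql_query(
        "SELECT school, mean_scores FROM schools WHERE district = ?",
        conn, params=(district,),
    )
    # Placeholder for any pandas computation: score relative to the district mean.
    chunk["new_col"] = chunk["mean_scores"] / chunk["mean_scores"].mean()

    # Write the values back, matching rows on the key column.
    rows = [(float(v), s) for v, s in zip(chunk["new_col"], chunk["school"])]
    with conn:  # commits this district's batch of updates
        conn.executemany("UPDATE schools SET new_col = ? WHERE school = ?", rows)

result = pd.read_sql_query("SELECT * FROM schools ORDER BY school", conn)
print(result)
conn.close()
```

With many rows per district, an index on the key column used in the WHERE clause of the UPDATE (school here) keeps each update from scanning the whole table.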
