Adding an extra column to a (big) SQLite database from a Pandas dataframe


Question

I feel like I'm overlooking something really simple, but I can't make it work. I'm using SQLite for now, but a solution in SQLAlchemy would also be very helpful.

Let's create our original dataset:

### This is just the setup part
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.sqlite')

orig = pd.DataFrame({'COLUPC': [100001, 100002, 100003, 100004],
                     'L5': ['ABC ALE', 'ABC MALT LIQUOR', 'ABITA AMBER', 'ABITA AMBER'],
                     'attr1': [0.25, 0.25, 0.041, 0.041]})

orig.to_sql("UPCs", conn, if_exists='replace', index=False)

# Create an index just in case it's needed
conn.execute("""CREATE INDEX upc_index
                ON UPCs (COLUPC);""")

Now suppose I take that orig dataframe and add a column called 'L5_lower'. Then I create the column in the SQLite database:

# Create new variable
orig['L5_lower'] = orig.L5.str.lower()
conn.execute("alter table UPCs add column L5_lower TEXT;")

Now suppose I want to fill in this single column, L5_lower, in the SQLite table without having to pass the other columns (I explain below why I need this).

I tried passing the index and the new column as tuples:

query='''insert or replace into UPCs (COLUPC, L5_lower) values (?,?) '''
conn.executemany(query, orig[['COLUPC', 'L5_lower']].to_records(index=False))
conn.commit() 

# But then:
df = pd.read_sql("SELECT * FROM UPCs;", conn)
conn.close()

This gives a messy result:

                                COLUPC               L5  attr1         L5_lower
0                               100001          ABC ALE  0.250             None
1                               100002  ABC MALT LIQUOR  0.250             None
2                               100003      ABITA AMBER  0.041             None
3                               100004      ABITA AMBER  0.041             None
4  b'\xa1\x86\x01\x00\x00\x00\x00\x00'             None    NaN          abc ale
5  b'\xa2\x86\x01\x00\x00\x00\x00\x00'             None    NaN  abc malt liquor
6  b'\xa3\x86\x01\x00\x00\x00\x00\x00'             None    NaN      abita amber
7  b'\xa4\x86\x01\x00\x00\x00\x00\x00'             None    NaN      abita amber

Instead, the expected output is:

   COLUPC               L5  attr1         L5_lower
0  100001          ABC ALE  0.250          abc ale
1  100002  ABC MALT LIQUOR  0.250  abc malt liquor
2  100003      ABITA AMBER  0.041      abita amber
3  100004      ABITA AMBER  0.041      abita amber

So why am I trying to pass a single column? I have a very big dataset and I won't be able to hold the whole dataframe in memory. My intended workflow is to construct one column at a time and then update or insert it into the SQLite database.

Recommended answer

AFAIK you can't add COLUMNS using Pandas to_sql, only ROWS. One solution is to write the new column to a temporary table (with the same index as the original table) and then update the source table on the SQLite side.

Here is a working example:

Setup:

Assuming we have the following original DF:

In [79]: orig
Out[79]:
   COLUPC               L5  attr1
0  100001          ABC ALE  0.250
1  100002  ABC MALT LIQUOR  0.250
2  100003      ABITA AMBER  0.041
3  100004      ABITA AMBER  0.041

In [80]: orig.set_index('COLUPC', inplace=True)

In [81]: conn = sqlite3.connect('d:/temp/test.sqlite')

In [82]: orig.to_sql('upcs', conn, if_exists='replace', index=True)

In [83]: conn.close()

Solution:

In [84]: conn = sqlite3.connect('d:/temp/test.sqlite')

In [85]: df = pd.read_sql('select * from upcs', conn, index_col='COLUPC')

In [86]: df
Out[86]:
                     L5  attr1
COLUPC
100001          ABC ALE  0.250
100002  ABC MALT LIQUOR  0.250
100003      ABITA AMBER  0.041
100004      ABITA AMBER  0.041

Create a temporary table:

In [87]: tmp = orig.L5.str.lower().to_frame('L5_lower')

In [88]: tmp
Out[88]:
               L5_lower
COLUPC
100001          abc ale
100002  abc malt liquor
100003      abita amber
100004      abita amber

In [89]: tmp.to_sql('tmp', conn, if_exists='replace', index=True)

Add the new column to the SQLite table and fill it from the temporary table:

In [90]: conn.execute('alter table UPCs add column L5_lower varchar(50)')
Out[90]: <sqlite3.Cursor at 0xa558c00>

In [91]: qry = 'update upcs set L5_lower = (select L5_lower from tmp where tmp.COLUPC = upcs.COLUPC) where L5_lower is NULL'

In [92]: conn.execute(qry)
Out[92]: <sqlite3.Cursor at 0xa593570>

In [93]: conn.commit()

In [94]: conn.execute('drop table tmp')
Out[94]: <sqlite3.Cursor at 0xa5930a0>
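The steps above can be collected into one function so they can be repeated for each new column. This is only a sketch: the helper name `add_column_via_temp_table` and its parameters are made up for illustration, and it assumes the Series carries the key values in its index and that no other table called `tmp` is in use.

```python
import sqlite3
import pandas as pd


def add_column_via_temp_table(conn, table, key, new_col, sql_type="TEXT"):
    """Hypothetical helper: copy the Series `new_col` (indexed by `key`) into `table`."""
    name = new_col.name
    # Stage the key and the new values in a scratch table called tmp.
    new_col.to_frame().to_sql("tmp", conn, if_exists="replace",
                              index=True, index_label=key)
    # Create the empty column in the target table.
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{name}" {sql_type}')
    # Fill it row by row from the scratch table, matching on the key.
    conn.execute(
        f'UPDATE "{table}" SET "{name}" = '
        f'(SELECT "{name}" FROM tmp WHERE tmp."{key}" = "{table}"."{key}")'
    )
    conn.commit()
    # The scratch table is no longer needed.
    conn.execute("DROP TABLE tmp")
    conn.commit()
```

With the data from this example, a call might look like `add_column_via_temp_table(conn, 'upcs', 'COLUPC', orig.L5.str.lower().rename('L5_lower'))`. Only the key and the one new column are ever held in memory on the Pandas side.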

Check:

In [95]: pd.read_sql('select * from upcs', conn, index_col='COLUPC')
Out[95]:
                     L5  attr1         L5_lower
COLUPC
100001          ABC ALE  0.250          abc ale
100002  ABC MALT LIQUOR  0.250  abc malt liquor
100003      ABITA AMBER  0.041      abita amber
100004      ABITA AMBER  0.041      abita amber

In [96]: conn.close()
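As a side note on why the attempt in the question produced eight rows: INSERT OR REPLACE only replaces a row when the insert hits a UNIQUE or PRIMARY KEY conflict, and the index created in the question is a plain one, so the new rows were simply appended. Even with a unique key, REPLACE deletes the old row first, so L5 and attr1 would be lost. The byte strings in COLUPC look like numpy.int64 values from to_records being stored as raw bytes rather than as integers. A possible alternative to the temporary table, sketched below under those assumptions, is a plain UPDATE through executemany with native Python values. The helper name `update_column` is hypothetical, and the column is assumed to exist already (created with ALTER TABLE as above).

```python
import sqlite3
import pandas as pd


def update_column(conn, table, key, values):
    """Hypothetical helper: write the Series `values` (indexed by `key`)
    into an already existing column named after the Series."""
    # .tolist() converts numpy scalars to plain Python ints, floats and strs,
    # which sqlite3 binds as INTEGER, REAL and TEXT.
    params = zip(values.tolist(), values.index.tolist())
    # UPDATE touches only the named column; the other columns stay as they are.
    conn.executemany(
        f'UPDATE "{table}" SET "{values.name}" = ? WHERE "{key}" = ?', params
    )
    conn.commit()
```

An index on the key column (such as upc_index in the question) should keep each UPDATE from scanning the whole table. For a table too large for memory, the same helper could be called once per chunk of keys and values.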
