使用 pyodbc 和 pandas 将 CSV 加载到 .mdb [英] Load CSV to .mdb using pyodbc and pandas

查看:35
本文介绍了使用 pyodbc 和 pandas 将 CSV 加载到 .mdb的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景故事:我从事财务工作(不是开发人员,因此非常感谢帮助),我的部门目前严重依赖 excel 和 vba 来尽可能多地自动化我们的任务.该公司刚刚验证了一个 Python 发行版,我们现在可以使用它,所以我只是想尝试一下.

挑战:我的第一个挑战是将 CSV 文件加载到 MSAcess 数据库中(因为并非所有人都精通技术,可以纯粹使用开发工具和数据库进行工作,因此需要让每个人都轻松完成工作).

我可以在互联网上找到零碎的不同 ppl 代码,我可以将它们放在一起,它运行良好,但结果却变成了弗兰肯斯坦.

它在做什么以及为什么:

  1. 将 CSV 加载到变量
  2. 去掉第一行(因为源文件不是真正的 CSV,文件开头有垃圾行)
  3. 导出到临时驱动器中的 CSV(因为无法弄清楚如何从变量加载到 panda)
  4. 使用 panda 将 CSV 加载到 SQLite(因为 panda 能够推断每列的数据类型)
  5. 将创建表"语句导出到变量
  6. 使用 pyodbc 在 .mdb 文件中创建表
  7. 将数据逐行加载到 .mdb 表中(非常慢)

TL;DR:
当前代码是不同代码的拼凑而成,既丑陋又缓慢,您会更改什么以使其更高效/优化它?

目标是编写一个将 CSV 加载到 .mdb 的代码,可能使用正确的数据类型来创建表.

导入csv导入pyodbc进口大熊猫导入 pandas.io.sql导入 sqlite3导入临时文件导入时间导入字符串def load_csv_to_access(access_path, table_name, csv_path, skip_rows):# 打开 CSV 文件,加载到一个变量,输出到一个临时文件,不包括第一个非 csv 行#文件名 = csv_pathcsv_file = 打开(文件名)txt = ""对于索引,枚举中的行(csv_file,start=0):#跳过第一行如果索引>跳过行:txt += 行csv_file.close()temp_filename = time.strftime("%y%m%d%H%M%S") + '.csv'temp_filepath = tempfile.gettempdir() + '\\' + temp_filename文件 = 打开(临时文件路径,'w+')file.write(txt) # 创建临时 csv文件.close()打印1:创建的临时文件:"+ temp_filepath# 使用panda和SQLite推断CSV字段的数据类型#df = pandas.read_csv(temp_filepath, delimiter=';', index_col=0, engine='python')df.columns = df.columns.str.replace(' ', '_')# 连接内存数据库进行测试;用文件路径替换`:memory:`con = sqlite3.connect('db.sqlite')df.to_sql(table_name, con, if_exists='replace')sqlite_query_string = "SELECT sql FROM sqlite_master where name = '" + table_name + "'"create_table_tuple = con.execute(sqlite_query_string).fetchone()关闭()create_table_string = create_table_tuple[0]打印2:推断的数据类型"#连接到AccessDB并加载临时CSV#access_string = "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + access_path + "; Provider=MSDASQL;"打印访问字符串con = pyodbc.connect(access_string)cur = con.cursor()cur.execute(create_table_string)提交()打印3:已创建 MS Access 表:" + table_name打印4:加载数据行:"使用 open(temp_filepath, 'r') 作为 f:reader = csv.reader(f, delimiter=';')列 = 下一个(读者)查询=插入"+表名+({0})个值({1})"query = query.format(','.join(columns).replace(' ', '_'), ','.join('?* len(columns))) #创建插入查询(用下划线替换空格以避免数据库问题)对于索引,枚举中的数据(reader,start=0):cur.execute(query, data) #逐行插入打印索引 # 用于调试cur.commit()关闭()

谢谢,因为你们比我好得多,希望得到任何建议.

解决方案

MS Access 可以直接查询 CSV 文件并运行

清理 CSV

MS 访问表

Notice Access 可以推断列类型,例如第一列中的日期.

Background story: I work on finance (not a developer, so help is very appreciated), my department currently relies heavily on excel and vba to automate as much as possible of our tasks. The company just validated a python distribution and we're now allowed to use it, so I just thought on giving a try.

Challenge: My first challenge was to load a CSV file into a MSAcess database (because not all of us are tech savy enough to work purely using dev tools and DBs, so need to make things easy for everybody).

I could find bits and pieces of different ppl's code around the internet that I could put together, it's working, but turn out it became a Frankenstein.

What it's doing and why:

  1. Load CSV to variable
  2. Strip out first rows (because source file is not realy a CSV, has rubbish rows at the start of the file)
  3. Export to a CSV in temp drive (because could not figure out how to load to panda from a variable)
  4. Load CSV to SQLite using panda (because panda is able to infer data type of each column)
  5. Export "create table" statement to variable
  6. Create table in .mdb file using pyodbc
  7. Load data to .mdb table row by row (it's very slow)

TL;DR:
Current code is a patchwork of different codes, it's ugly and slow, what would you change to make it more efficient / to optimize it?

The goal is to have a code that loads CSV to .mdb, possibly using correct data type to create table.

import csv
import pyodbc
import pandas
import pandas.io.sql
import sqlite3
import tempfile
import time
import string


def load_csv_to_access(access_path, table_name, csv_path, skip_rows):


# open CSV file, load to a variable, output to a temp file excluding first non csv rows
#
filename = csv_path
csv_file = open(filename)
txt = ""
for index, line in enumerate(csv_file, start=0):  #Skip first rows
    if index > skip_rows:
        txt += line
csv_file.close()
temp_filename = time.strftime("%y%m%d%H%M%S") + '.csv'
temp_filepath = tempfile.gettempdir() + '\\' + temp_filename
file = open(temp_filepath, 'w+')
file.write(txt)  # create temp csv
file.close()
print "1: temp file created: " + temp_filepath

# Use panda and SQLite to infer data type of CSV fields
#
df = pandas.read_csv(temp_filepath, delimiter=';', index_col=0, engine='python')
df.columns = df.columns.str.replace(' ', '_')
# connect to in-memory database for testing; replace `:memory:` w/ file path
con = sqlite3.connect('db.sqlite')
df.to_sql(table_name, con, if_exists='replace')
sqlite_query_string = "SELECT sql FROM sqlite_master where name = '" + table_name + "'"
create_table_tuple = con.execute(sqlite_query_string).fetchone()
con.close()
create_table_string = create_table_tuple[0]
print "2: Data type inferred"

#Connect to AccessDB and load temp CSV
#
access_string = "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + access_path + "; Provider=MSDASQL;"
print access_string
con = pyodbc.connect(access_string)
cur = con.cursor()
cur.execute(create_table_string)
con.commit()
print "3: MS Access table created: " + table_name

print "4: Loading data rows:"
with open(temp_filepath, 'r') as f:
    reader = csv.reader(f, delimiter=';')
    columns = next(reader)
    query = "insert into " + table_name + "({0}) values ({1})"
    query = query.format(','.join(columns).replace(' ', '_'), ','.join(
        '?' * len(columns)))  #Create insert query (replace empty space by underscore to avoid db issues)
    for index, data in enumerate(reader, start=0):
        cur.execute(query, data)  #Insert row by row
        print index # For debugging
    cur.commit()
con.close()

Thanks, as you guys are much better then me, would appreciate any suggestions.

解决方案

MS Access can directly query CSV files and run a Make-Table Query to produce a resulting table. However, some cleaning is needed to remove the rubbish rows. Below opens two files one for reading and other for writing. Assuming rubbish is in first column of csv, the if logic writes any line that has some data in second column (adjust as needed):

import os
import csv
import pyodbc

# TEXT FILE CLEAN
with open('C:\Path\To\Raw.csv', 'r') as reader, open('C:\Path\To\Clean.csv', 'w') as writer:
    read_csv = csv.reader(reader); write_csv = csv.writer(writer, lineterminator='\n')

    for line in read_csv:
        if len(line[1]) > 0:            
            write_csv.writerow(line)

# DATABASE CONNECTION
access_path = "C:\Path\To\Access\\DB.mdb"
con = pyodbc.connect("DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={};" \
                     .format(access_path))

# RUN QUERY
strSQL = "SELECT * INTO [TableName] FROM [text;HDR=Yes;FMT=Delimited(,);" + \
         "Database=C:\Path\To\Folder].Clean.csv;"    
cur = con.cursor()
cur.execute(strSQL)
con.commit()

con.close()                            # CLOSE CONNECTION
os.remove('C\Path\To\Clean.csv')       # DELETE CLEAN TEMP 

Raw CSV

Clean CSV

MS Access Table

Notice Access can infer column types such as the Date in first column.

这篇关于使用 pyodbc 和 pandas 将 CSV 加载到 .mdb的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆