How to generate a schema from a CSV for a PostgreSQL COPY

Problem description

Given a CSV with several dozen or more columns, how can a 'schema' be created that can be used in a CREATE TABLE SQL expression in PostgreSQL for use with the COPY tool?

I see plenty of examples for the COPY tool, and basic CREATE TABLE expressions, but nothing goes into detail about cases when you have a potentially prohibitive number of columns for manual creation of the schema.

Recommended answer

If the CSV is not excessively large and is available on your local machine, then csvkit is the simplest solution. It also contains a number of other utilities for working with CSVs, so it is a useful tool to know in general.

At its simplest, typing into the shell

$ csvsql myfile.csv

will print out the required CREATE TABLE SQL command, which can be saved to a file using output redirection.
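
For example, a minimal sketch of the full workflow, assuming $MY_DB_URI holds a PostgreSQL connection URI that psql also accepts and that the generated table keeps csvsql's default name myfile (the filename minus its extension), might be:

$ csvsql -i postgresql myfile.csv > create_myfile.sql
$ psql "$MY_DB_URI" -f create_myfile.sql
$ psql "$MY_DB_URI" -c "\copy myfile FROM 'myfile.csv' WITH (FORMAT csv, HEADER)"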

If you also provide a connection string, csvsql will create the table and upload the file in one go:

$ csvsql --db "$MY_DB_URI" --insert myfile.csv

There are also options to specify the flavor of SQL and CSV you are working with. They are documented in the built-in help:

$ csvsql -h
usage: csvsql [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
              [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-H] [-v]
              [--zero] [-y SNIFFLIMIT]
              [-i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}]
              [--db CONNECTION_STRING] [--query QUERY] [--insert]
              [--tables TABLE_NAMES] [--no-constraints] [--no-create]
              [--blanks] [--no-inference] [--db-schema DB_SCHEMA]
              [FILE [FILE ...]]

Generate SQL statements for one or more CSV files, create execute those
statements directly on a database, and execute one or more SQL queries.

positional arguments:
  FILE                  The CSV file(s) to operate on. If omitted, will accept
                        input on STDIN.

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        Delimiting character of the input CSV file.
  -t, --tabs            Specifies that the input CSV file is delimited with
                        tabs. Overrides "-d".
  -q QUOTECHAR, --quotechar QUOTECHAR
                        Character used to quote strings in the input CSV file.
  -u {0,1,2,3}, --quoting {0,1,2,3}
                        Quoting style used in the input CSV file. 0 = Quote
                        Minimal, 1 = Quote All, 2 = Quote Non-numeric, 3 =
                        Quote None.
  -b, --doublequote     Whether or not double quotes are doubled in the input
                        CSV file.
  -p ESCAPECHAR, --escapechar ESCAPECHAR
                        Character used to escape the delimiter if --quoting 3
                        ("Quote None") is specified and to escape the
                        QUOTECHAR if --doublequote is not specified.
  -z MAXFIELDSIZE, --maxfieldsize MAXFIELDSIZE
                        Maximum length of a single field in the input CSV
                        file.
  -e ENCODING, --encoding ENCODING
                        Specify the encoding the input CSV file.
  -S, --skipinitialspace
                        Ignore whitespace immediately following the delimiter.
  -H, --no-header-row   Specifies that the input CSV file has no header row.
                        Will create default headers.
  -v, --verbose         Print detailed tracebacks when errors occur.
  --zero                When interpreting or displaying column numbers, use
                        zero-based numbering instead of the default 1-based
                        numbering.
  -y SNIFFLIMIT, --snifflimit SNIFFLIMIT
                        Limit CSV dialect sniffing to the specified number of
                        bytes. Specify "0" to disable sniffing entirely.
  -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
                        Dialect of SQL to generate. Only valid when --db is
                        not specified.
  --db CONNECTION_STRING
                        If present, a sqlalchemy connection string to use to
                        directly execute generated SQL on a database.
  --query QUERY         Execute one or more SQL queries delimited by ";" and
                        output the result of the last query as CSV.
  --insert              In addition to creating the table, also insert the
                        data into the table. Only valid when --db is
                        specified.
  --tables TABLE_NAMES  Specify one or more names for the tables to be
                        created. If omitted, the filename (minus extension) or
                        "stdin" will be used.
  --no-constraints      Generate a schema without length limits or null
                        checks. Useful when sampling big tables.
  --no-create           Skip creating a table. Only valid when --insert is
                        specified.
  --blanks              Do not coerce empty strings to NULL values.
  --no-inference        Disable type inference when parsing the input.
  --db-schema DB_SCHEMA
                        Optional name of database schema to create table(s)
                        in.

Several other tools also do schema inference, including:

  • Apache Spark
  • Pandas (Python)
  • Blaze (Python)
  • read.csv + your favorite db package in R

Each of these has functionality to read a CSV (and other formats) into a tabular data structure, usually called a DataFrame or similar, inferring the column types in the process. They then have other commands to either write out an equivalent SQL schema or upload the DataFrame directly into a specified database. The choice of tool will depend on the volume of data, how it is stored, the idiosyncrasies of your CSV, the target database, and the language you prefer to work in.
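
As a rough sketch of the same idea using pandas (the file name, table name, and connection string below are placeholders, and it assumes pandas, SQLAlchemy, and a PostgreSQL driver are installed):

import pandas as pd
from sqlalchemy import create_engine

# Read the CSV; pandas infers the column types while parsing.
df = pd.read_csv("myfile.csv")

# Print an equivalent CREATE TABLE statement without touching a database
# (get_schema lives in pandas' semi-public io.sql module).
print(pd.io.sql.get_schema(df, "myfile"))

# Or create the table and insert the data in one go.
engine = create_engine("postgresql://user:password@localhost/mydb")
df.to_sql("myfile", engine, index=False, if_exists="fail")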
