Linux上的文本文件的SQL查询引擎? [英] SQL query engine for text files on Linux?

查看:158
本文介绍了Linux上的文本文件的SQL查询引擎?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们使用grep,cut,sort,uniq和join在命令行中进行数据分析。他们工作伟大,虽然有缺点。例如,您必须为每个工具指定列编号。我们经常有宽文件(许多列)和列标题给出列名。事实上,我们的文件看起来很像SQL表。我确定有一个驱动程序(ODBC?)将操作分隔的文本文件和一些查询引擎将使用该驱动程序,所以我们可以只使用SQL查询我们的文本文件。因为做分析通常是临时的,它必须是最小的设置来查询新文件(只使用我在这个目录中指定的文件),而不是在一些配置中声明特定的表。

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.

实际上,最简单的是什么?也就是说,最容易设置和使用的SQL引擎和驱动程序应用于文本文件?

Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?

推荐答案

else的建议,这里是一个Python脚本sqlite3。有点冗长,但它的工作原理。

Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.

我不喜欢完全复制文件以删除标题行,但我不知道如何说服sqlite3的.import跳过它。

I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.

示例调用:


$ sql.py --file foo --sql "select count(*) from data"


#!/usr/bin/env python

"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile
import re

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code

def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"

def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o in ("--file"):
            filename = a
        elif o in ("--sql"):
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header.  If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo.readlines():
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)
    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile,create_statement
    print >>cmdfile,".separator ' '"
    print >>cmdfile,".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)

if __name__ == "__main__":
    sys.exit(main())

这篇关于Linux上的文本文件的SQL查询引擎?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆