How to use awk/shell scripting to do SQL WHERE-clause and SQL JOIN-like filtering and merging of rows and columns?


Question

I have a huge data set, say 15-20 GB, stored as a tab-delimited file. While I could process it in Python or in SQL, it would be easier and simpler to do it in a shell script and avoid moving the CSV files around.

For example, take this pipe-delimited input file:

----------------------------------------
Col1 | Col2 | Col3 | Col4 | Col5 | Col6
----------------------------------------
 A   |  H1  | 123  | abcd | a1   | b1   
----------------------------------------
 B   |  H1  | 124  | abcd | a2   | b1   
----------------------------------------
 C   |  H2  | 127  | abd  | a3   | b1   
----------------------------------------
 D   |  H1  | 128  | acd  | a4   | b1   
----------------------------------------

The SQL query would look like:

SELECT Col1, Col4, Col5, Col6 FROM WHERE col2='H1'

Output:

--------------------------
Col1 | Col4 | Col5 | Col6
--------------------------
 A   | abcd | a1   | b1   
--------------------------
 B   | abcd | a2   | b1   
--------------------------
 D   | acd  | a4   | b1   
--------------------------

Then I need to take only Col4 of this result, do some string parsing on it, and produce OutputFile1 below:

--------------------------------
Col1 | Col4 | Col5 | Col6 | New1
--------------------------------
 A   | abcd | a1   | b1   | a,b,c,d
--------------------------------
 B   | abcd | a2   | b1   | a,b,c,d
--------------------------------
 D   | acd  | a4   | b1   | a,c,d
--------------------------------

Col4 is a URL, and I need to parse its URL parameters. See the question "How to parse URL params in shell script".

Now suppose I also have another file, File2:

--------------
ColA | ColB | 
--------------
 A   | abcd | 
--------------
 B   | abcd | 
--------------
 D   | qst  | 
--------------

I need to generate a similar parsed output for ColB.

OutputFile2:

--------------
ColA | ColB | New1
--------------
 A   | abcd | a,b,c,d
--------------
 B   | abcd | a,b,c,d
--------------
 D   | qst  | q,s,t
--------------

The SQL query to merge OutputFile1 and OutputFile2 would do an inner join on

OutputFile1.Col1 = OutputFile2.ColA and OutputFile1.New1 = OutputFile2.New1

Final output:

--------------------------------
Col1 | Col4 | Col5 | Col6 | New1
--------------------------------
 A   | abcd | a1   | b1   | a,b,c,d
--------------------------------
 B   | abcd | a2   | b1   | a,b,c,d
--------------------------------

Please share suggestions on how to implement this.

The major constraint is the size of the file.

Thanks

Recommended answer

There's a very simple database management program named "unity" available for UNIX at http://open-innovation.alcatel-lucent.com/projects/unity/. In unity you have 2 main files:

  1. A data file named whatever you like, e.g. "foo", and
  2. A descriptor file named the same as the data file but with a "D" prefix, e.g. "Dfoo"

These are both simple text files that you can edit with whatever editor you like (unity also has its own database-aware editor named uedit).

Dfoo would have one row for each column in foo, describing the attributes of the data that appears in that column and its separator from the next column.

foo would have the data.

It's been a while since I used unity in the raw (I have scripts that use it behind the scenes), but for the first table you show above:

----------------------------------------
Col1 | Col2 | Col3 | Col4 | Col5 | Col6
----------------------------------------
 A   |  H1  | 123  | abcd | a1   | b1   
----------------------------------------
 B   |  H1  | 124  | abcd | a2   | b1   
----------------------------------------
 C   |  H2  | 127  | abd  | a3   | b1   
----------------------------------------
 D   |  H1  | 128  | acd  | a4   | b1   
----------------------------------------

the descriptor file (Dfoo) would be something like:

Col1 | 5c
Col2 | 6c
Col3 | 6c
Col4 | 6c
Col5 | 6c
Col6 \n 6c

and the data file (foo) would be:

A|H1|123|abcd|a1|b1
B|H1|124|abcd|a2|b1
C|H2|127|abd|a3|b1
D|H1|128|acd|a4|b1
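
If the data only exists in the boxed layout shown in the question, it could be reduced to this compact form first. The following is a minimal awk sketch, not part of unity; the file name table.txt and the temporary directory are assumptions made for illustration.

```shell
# Assumption: the boxed table from the question is saved as table.txt.
tmp=$(mktemp -d)
cat > "$tmp/table.txt" <<'EOF'
----------------------------------------
Col1 | Col2 | Col3 | Col4 | Col5 | Col6
----------------------------------------
 A   |  H1  | 123  | abcd | a1   | b1
----------------------------------------
 B   |  H1  | 124  | abcd | a2   | b1
----------------------------------------
 C   |  H2  | 127  | abd  | a3   | b1
----------------------------------------
 D   |  H1  | 128  | acd  | a4   | b1
----------------------------------------
EOF

awk -F'|' -v OFS='|' '
  /^-+$/         { next }   # skip rule lines made only of dashes
  !seen_header++ { next }   # skip the first remaining line (the header)
  {
    # trim the padding around every field, then print the compact row
    for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
    print
  }
' "$tmp/table.txt" > "$tmp/foo"

cat "$tmp/foo"
```

The script reads the input once, line by line, so memory use does not grow with the file size.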

You can then run unity commands like:

uprint -d- foo

to print the table with rows separated by lines of underscores and cells of the width specified in your descriptor file (e.g. 6c = 6 characters Centered, while 6r = 6 characters Right-justified).

uselect Col2 from foo where Col3 leq abd

to select the values from column Col2 where the corresponding value in Col3 is Lexically EQual to the string "abd".
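
For comparison, the selection from the question (Col1, Col4, Col5, Col6 where Col2 is H1, plus the derived New1 column) could be approximated in plain awk on the same data file. This is only a sketch under assumptions: parse() is a hypothetical stand-in that joins the characters of Col4 with commas, as in the question's sample output, and real URL-parameter parsing would replace it.

```shell
tmp=$(mktemp -d)
# Sample data file in the compact pipe-delimited form.
printf '%s\n' \
  'A|H1|123|abcd|a1|b1' \
  'B|H1|124|abcd|a2|b1' \
  'C|H2|127|abd|a3|b1' \
  'D|H1|128|acd|a4|b1' > "$tmp/foo"

awk -F'|' -v OFS='|' '
  # Hypothetical placeholder parser: "abcd" becomes "a,b,c,d".
  function parse(s,    i, out, sep) {
    for (i = 1; i <= length(s); i++) {
      out = out sep substr(s, i, 1)
      sep = ","
    }
    return out
  }
  # WHERE Col2 = H1, then SELECT Col1, Col4, Col5, Col6 plus New1
  $2 == "H1" { print $1, $4, $5, $6, parse($4) }
' "$tmp/foo" > "$tmp/OutputFile1"

cat "$tmp/OutputFile1"
```

For the tab-delimited file mentioned in the question, -F'\t' and OFS='\t' would take the place of the pipe. The filter streams the input in one pass, so a 15-20 GB file does not need to fit in memory.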

There are unity commands to let you do joins, merges, inserts, deletes, etc. Basically, you can do whatever you'd expect to be able to do with a relational database, but it's all just based on simple text files.

In unity you can specify a different separator between each pair of columns, but if all of the separators are the same (except the final one, which will be '\n'), then you can run awk scripts on the file too, just by using awk -F with the separator.
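
As an illustration of running awk directly on such files, the inner join from the question could be sketched as below. This is plain awk rather than a unity command, and it assumes that OutputFile2 is the smaller file and that its join keys fit in memory; the file contents are the sample rows from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  'A|abcd|a1|b1|a,b,c,d' \
  'B|abcd|a2|b1|a,b,c,d' \
  'D|acd|a4|b1|a,c,d' > "$tmp/OutputFile1"
printf '%s\n' \
  'A|abcd|a,b,c,d' \
  'B|abcd|a,b,c,d' \
  'D|qst|q,s,t' > "$tmp/OutputFile2"

awk -F'|' '
  # First file (OutputFile2): remember each (ColA, New1) pair.
  NR == FNR { keys[$1 FS $3]; next }
  # Second file (OutputFile1): print rows whose (Col1, New1) pair was seen.
  ($1 FS $5) in keys
' "$tmp/OutputFile2" "$tmp/OutputFile1" > "$tmp/joined"

cat "$tmp/joined"
```

Only the keys of the first file are held in memory while the large file is streamed. Note that this prints each matching row of OutputFile1 once, so it behaves like a SQL inner join only when the key pairs in OutputFile2 are unique.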

A couple of other toolsets you could look at are recutils (from GNU) and csvDB. They might be easier to install but probably don't have as much functionality as unity (which has been around since the 1970s!). So your full homework/research list is:

  1. unity
  2. recutils
  3. csvDB

Note that recutils has rec2csv and csv2rec tools for converting between the recutils and CSV formats.
