Replacing an SQL query with unix sort, uniq and awk

This article describes how to replace an SQL query with the unix tools sort, uniq, and awk.

Problem Description


We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is being decommissioned, and we are left with the task of coming up with an alternative way of generating reports on the data (which we imported as tab-separated files into our new environment).

Assuming we have a table with the following fields.

  • Query
  • IPAddress
  • LocationCode

The original SQL query we used to run on Hive was (well, not exactly... but something similar):

select
  COUNT(DISTINCT Query, IPAddress) as c1,
  LocationCode as c2,
  Query as c3
from table
group by Query, LocationCode

I was wondering if someone could provide me with the most efficient script using standard unix/linux tools such as sort, uniq and awk which can act as a replacement for the above query.

Assume the input to the script would be a directory of text files. The directory would contain about 2000 files. Each file would contain an arbitrary number of tab-separated records of the form:

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
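
To make the expected semantics concrete, here is a small hypothetical input (tabs shown as <TAB>; the queries and addresses are made up for illustration):

q1 <TAB> loc1 <TAB> 1.2.3.4
q1 <TAB> loc1 <TAB> 1.2.3.4
q1 <TAB> loc1 <TAB> 5.6.7.8
q1 <TAB> loc2 <TAB> 1.2.3.4

The duplicated first row counts only once, so the query would report c1=2 for (q1, loc1) and c1=1 for (q1, loc2).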

Solution

Once you have a sorted file containing all the unique

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

you could:

awk -F '\t' '
NR == 1 {q=$1; l=$2; count=0}   # initialize state from the first record
q == $1 && l == $2 {count++}    # still in the same (Query, LocationCode) group
q != $1 || l != $2 {printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}   # group changed: emit the finished group and start a new one
END {printf "%s\t%s\t%d\n", q, l, count}   # emit the last group
' sorted_uniq_file
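
As an aside (not part of the original answer): if the set of distinct records fits in memory, a single awk pass over the raw files can replace the external sort entirely. A minimal sketch, with output columns ordered as c1, c2, c3 like the SQL query:

awk -F '\t' '
!seen[$0]++ { count[$1 FS $2]++ }   # count each distinct record once per (Query, LocationCode)
END {
  for (k in count) {
    split(k, a, FS)
    printf "%d\t%s\t%s\n", count[k], a[2], a[1]   # c1=count, c2=LocationCode, c3=Query
  }
}
' dir/*

Note that the END loop emits groups in no particular order, and the seen array holds every distinct record, so this only pays off when the deduplicated data fits in RAM.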

To get this sorted_uniq_file, the naive way is:

sort -u dir/* > sorted_uniq_file

But this can be very slow and memory-hungry.
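
As a side note, the single big sort can also be tuned rather than replaced (assuming GNU coreutils sort; -S sets the in-memory buffer size, -T the temp directory, and LC_ALL=C selects fast byte-wise collation; /data/tmp is a placeholder path):

LC_ALL=C sort -u -S 1G -T /data/tmp dir/* > sorted_uniq_file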

A faster option (and one consuming less memory) is to eliminate duplicates as soon as possible: sort each file first and merge the sorted results later. sort -m merges inputs that are already sorted without re-sorting them, and -u drops duplicates during the merge. This needs temporary space for the sorted files; let us use a directory named sorted:

mkdir sorted
for f in dir/*; do
   sort -u "$f" > "sorted/${f##*/}"   # ${f##*/} strips the dir/ prefix so the output lands in sorted/
done
sort -mu sorted/* > sorted_uniq_file
rm -rf sorted

If the solution above hits a shell or sort limit (expansion of dir/*, or of sorted/*, or the maximum number of arguments to sort), the files can be sorted and then merged pairwise instead:

mkdir sorted
ls dir | while read f; do
  sort -u "dir/$f" > "sorted/$f"
done
# repeatedly merge pairs of sorted files until a single file is left
while [ $(ls sorted | wc -l) -gt 1 ]; do
  mkdir sorted_tmp
  ls sorted | while read f1; do
    if read f2; then
      # a pair is available: merge the two files, keeping unique lines
      sort -mu "sorted/$f1" "sorted/$f2" > "sorted_tmp/$f1"
    else
      # odd file out: carry it over to the next round unchanged
      mv "sorted/$f1" sorted_tmp
    fi
  done
  rm -rf sorted
  mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted

The solution above can be optimized to merge more than 2 files at a time. With roughly 2000 input files, the pairwise scheme needs about 11 merge rounds (2^11 = 2048), so merging in larger batches cuts the number of passes over the data, as sketched below.
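
A minimal sketch of one round of such a k-way variant, assuming bash (for arrays) and an arbitrary batch size of 16 files per sort -mu call; repeat rounds as in the pairwise loop until a single file remains:

mkdir sorted_tmp
i=0
batch=()
for f in sorted/*; do
  batch+=("$f")
  if [ "${#batch[@]}" -eq 16 ]; then
    sort -mu "${batch[@]}" > "sorted_tmp/batch_$i"   # merge up to 16 sorted files at once
    batch=()
    i=$((i+1))
  fi
done
if [ "${#batch[@]}" -gt 0 ]; then
  sort -mu "${batch[@]}" > "sorted_tmp/batch_$i"     # merge the last, possibly smaller, batch
fi

Each round shrinks the file count by a factor of 16 instead of 2, so three rounds suffice for 2000 files (16^3 = 4096).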
