Replacing an SQL query with unix sort, uniq and awk

This article describes how to replace an SQL query with the unix tools sort, uniq, and awk.

Problem Description


We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is being decommissioned, and we are left with the task of coming up with an alternative way of generating reports on the data (which we imported as tab-separated files into our new environment).

Assuming we have a table with the following fields.

  • Query
  • IPAddress
  • LocationCode

The original SQL query we used to run on Hive was (well, not exactly... but something similar):

select
  COUNT(DISTINCT Query, IPAddress) as c1,
  LocationCode as c2,
  Query as c3
from table
group by Query, LocationCode

I was wondering if someone could provide me with the most efficient script using standard unix/linux tools such as sort, uniq and awk which can act as a replacement for the above query.

Assume the input to the script would be a directory of text files. The directory would contain about 2000 files. Each file would contain an arbitrary number of tab-separated records of the form:

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
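
To make the expected semantics concrete, here is a small hypothetical input (tabs shown as <TAB>; the queries and addresses are made up for illustration):

q1 <TAB> loc1 <TAB> 1.2.3.4
q1 <TAB> loc1 <TAB> 1.2.3.4
q1 <TAB> loc1 <TAB> 5.6.7.8
q1 <TAB> loc2 <TAB> 1.2.3.4

The duplicated first row counts only once, so the query would report c1=2 for (q1, loc1) and c1=1 for (q1, loc2).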

Solution

Once you have a sorted file containing all the unique

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

you could:

awk -F '\t' '
NR == 1 {q=$1; l=$2; count=0}   # initialize state from the first record
q == $1 && l == $2 {count++}    # still in the same (Query, LocationCode) group
q != $1 || l != $2 {printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}   # group changed: emit the finished group and start a new one
END {printf "%s\t%s\t%d\n", q, l, count}   # emit the last group
' sorted_uniq_file
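
As an aside (not part of the original answer): if the set of distinct records fits in memory, a single awk pass over the raw files can replace the external sort entirely. A minimal sketch, with output columns ordered as c1, c2, c3 like the SQL query:

awk -F '\t' '
!seen[$0]++ { count[$1 FS $2]++ }   # count each distinct record once per (Query, LocationCode)
END {
  for (k in count) {
    split(k, a, FS)
    printf "%d\t%s\t%s\n", count[k], a[2], a[1]   # c1=count, c2=LocationCode, c3=Query
  }
}
' dir/*

Note that the END loop emits groups in no particular order, and the seen array holds every distinct record, so this only pays off when the deduplicated data fits in RAM.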

To get this sorted_uniq_file, the naive way is:

sort -u dir/* > sorted_uniq_file

But this can be very slow and memory-hungry.
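
As a side note, the single big sort can also be tuned rather than replaced (assuming GNU coreutils sort; -S sets the in-memory buffer size, -T the temp directory, and LC_ALL=C selects fast byte-wise collation; /data/tmp is a placeholder path):

LC_ALL=C sort -u -S 1G -T /data/tmp dir/* > sorted_uniq_file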

A faster option (and one consuming less memory) is to eliminate duplicates as soon as possible: sort each file first and merge the sorted results later. sort -m merges inputs that are already sorted without re-sorting them, and -u drops duplicates during the merge. This needs temporary space for the sorted files; let us use a directory named sorted:

mkdir sorted
for f in dir/*; do
   sort -u "$f" > "sorted/${f##*/}"   # ${f##*/} strips the dir/ prefix so the output lands in sorted/
done
sort -mu sorted/* > sorted_uniq_file
rm -rf sorted

If the solution above hits a shell or sort limit (expansion of dir/*, or of sorted/*, or the maximum number of arguments to sort), the files can be sorted and then merged pairwise instead:

mkdir sorted
ls dir | while read f; do
  sort -u "dir/$f" > "sorted/$f"
done
# repeatedly merge pairs of sorted files until a single file is left
while [ $(ls sorted | wc -l) -gt 1 ]; do
  mkdir sorted_tmp
  ls sorted | while read f1; do
    if read f2; then
      # a pair is available: merge the two files, keeping unique lines
      sort -mu "sorted/$f1" "sorted/$f2" > "sorted_tmp/$f1"
    else
      # odd file out: carry it over to the next round unchanged
      mv "sorted/$f1" sorted_tmp
    fi
  done
  rm -rf sorted
  mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted

The solution above can be optimized to merge more than 2 files at a time. With roughly 2000 input files, the pairwise scheme needs about 11 merge rounds (2^11 = 2048), so merging in larger batches cuts the number of passes over the data, as sketched below.
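
A minimal sketch of one round of such a k-way variant, assuming bash (for arrays) and an arbitrary batch size of 16 files per sort -mu call; repeat rounds as in the pairwise loop until a single file remains:

mkdir sorted_tmp
i=0
batch=()
for f in sorted/*; do
  batch+=("$f")
  if [ "${#batch[@]}" -eq 16 ]; then
    sort -mu "${batch[@]}" > "sorted_tmp/batch_$i"   # merge up to 16 sorted files at once
    batch=()
    i=$((i+1))
  fi
done
if [ "${#batch[@]}" -gt 0 ]; then
  sort -mu "${batch[@]}" > "sorted_tmp/batch_$i"     # merge the last, possibly smaller, batch
fi

Each round shrinks the file count by a factor of 16 instead of 2, so three rounds suffice for 2000 files (16^3 = 4096).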
