Optimizing loop, passing parameters from external file, naming array arguments within awk


Problem description

Am an awk newbie. Using Windows-GNU gawk in UNXUTILS.

Have 2 kinds of records arranged sequentially in date and time order in my file(s), 30-field Order records (start with "O") where quantity is the 15th field, and 18-field Trade records (start with "T") where quantity is the 8th field. The underlying research data is historical-archival Indian stock market data spanning 15 days in April 2006, about 1000 firms, and comprising in all about 100 million separate order or trade records. My test data is 500 records for 2 dates, and some 200 firms.

My goal at this point is only to compute, for each firm and each date, that firm's cumulative order quantity and trade quantity.

The raw data IS ordered by date and time (firms obviously jumbled up, just like voters who don't usually vote in alphabetical order!). And I do now have two separate text files, one containing a list of just the distinct firm symbols; and the other, the distinct dates, one per line.

I want to try to complete the computations in a way that does not require going through all of the records over and over again for each of the firms and dates. The basic computations given a firm=FIRM_1 and a date=DATE_1 are easy, e.g. what I have resembles

# For each order record with firm_symbol = FIRM_1, date = DATE_1,
# cumulate its Order quantity ($15).
# (The action must start on the same line as its pattern.)

/^O/ && $4 ~ /FIRM_1/ && $2 ~ /DATE_1/ { Order_Q["FIRM_1_DATE_1"] += $15 }

# For each trade record with firm_symbol = FIRM_1, date = DATE_1,
# cumulate its Trade quantity ($8).

/^T/ && $4 ~ /FIRM_1/ && $2 ~ /DATE_1/ { Trade_Q["FIRM_1_DATE_1"] += $8 }

END { print "FIRM_1 ", "DATE_1 ", Order_Q["FIRM_1_DATE_1"], Trade_Q["FIRM_1_DATE_1"] }

The question is how to construct an intelligent loop over all firms and dates, given the size of the underlying data. There are several related questions.


  1. I know the name FIRM_1 need not be hard-coded inside this awk script, but could be given as a command line parameter. But can one go one step further and get awk to take that name sequentially from a list of names in a separate file, one per line? (If that's feasible, then taking dates from a list of dates would also be possible.)

  2. I constructed the array argument names to hold Order quantity and Trade quantity knowing FIRM_1 and DATE_1. If we succeed in resolving 1 above, can one construct array argument names like FIRM_1_DATE_1 on the fly, inside awk, while it is running? Is string concatenation allowed to help form such a name?

  3. I realize that I could use an editor macro or some such method, to combine my 2 keys, FIRM (1000 values) and DATE (15 values) into one FIRM_DATE key (15000 values) before doing any of this, in a separate step. If 2 above is feasible, I'm assuming there's no value to doing this. Would it help anyway?

  4. In principle we are looking to hold in memory perhaps 1000 firms times 15 days times 2 variables = 30,000 cell entries in 2 arrays, ORDER_Q and TRADE_Q. Is this a lot? I use a modest Windows desktop with I think 8GB RAM.

Any suggestion or reference or example that will help reduce having to go over the original large input data several times will be very welcome. If something involves learning more not just about awk but about shell scripts, that will also be very welcome.

Recommended answer

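Use associative arrays. Assuming that $2 contains the name of the firm and $4 the date (the question's own code has these the other way round, with the firm in $4 and the date in $2, so swap the field numbers to match your data), then: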

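awk '/^O/ { order_qty[$2,$4] += $15 }   # key is firm SUBSEP date
     /^T/ { trade_qty[$2,$4] += $8  }
     END  { # split each key back into firm and date before printing
            for (key in order_qty) { split(key, k, SUBSEP); print k[1], k[2], "O", order_qty[key] }
            for (key in trade_qty) { split(key, k, SUBSEP); print k[1], k[2], "T", trade_qty[key] }
          }'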

This makes a single pass over the data, accumulating the results for all the companies and all the dates in one go. It does not give you a defined order for the companies or dates in the output, because `for (key in array)` visits the keys in an unspecified order. There are techniques to fix that; one is to record each firm and each date the first time it appears and loop over those lists at the end:

awk '     { if (date[$4]++ == 0) date_list[d++] = $4; # Dates appear in order
            if (firm[$2]++ == 0) firm_list[f++] = $2; # Firms appear out of order
          }
     /^O/ { order_qty[$2,$4] += $15 }
     /^T/ { trade_qty[$2,$4] += $8  }
     END  { for (i = 0; i < f; i++)
            {
                for (j = 0; j < d; j++)
                {
                    if ((qty = order_qty[firm_list[i],date_list[j]]) > 0)
                        print firm_list[i], date_list[j], "O", qty
                    if ((qty = trade_qty[firm_list[i],date_list[j]]) > 0)
                        print firm_list[i], date_list[j], "T", qty
                }
            }
          }'
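The question mentions two extra files, one listing the distinct firm symbols and one listing the distinct dates. Below is a minimal sketch of how awk could read those lists before the data, so that the output follows their order. Everything named here is an assumption for illustration: the file names firms.txt, dates.txt and data.txt, the `mode` variable, the helper functions mk_order and mk_trade, and the sample records (built in the answer's assumed layout, firm in $2 and date in $4).

```shell
#!/bin/sh
# Hypothetical helpers: build records in the assumed layout
# (firm in $2, date in $4, order quantity in $15, trade quantity in $8).
mk_order() { printf 'O %s x %s x x x x x x x x x x %s\n' "$1" "$2" "$3"; }
mk_trade() { printf 'T %s x %s x x x %s\n' "$1" "$2" "$3"; }

tmp=$(mktemp -d) || exit 1
printf 'ACME\nZETA\n' > "$tmp/firms.txt"              # one firm per line
printf '2006-04-03\n2006-04-04\n' > "$tmp/dates.txt"  # one date per line
{
  mk_order ACME 2006-04-03 100
  mk_trade ACME 2006-04-03 40
  mk_order ZETA 2006-04-03 10
  mk_order ACME 2006-04-03 50
  mk_order ACME 2006-04-04 7
  mk_trade ACME 2006-04-03 60
} > "$tmp/data.txt"

result=$(awk '
  # mode is set on the command line just before each file is read
  mode == "firms" { firm_list[n_firms++] = $1; next }
  mode == "dates" { date_list[n_dates++] = $1; next }
  /^O/ { order_qty[$2,$4] += $15 }
  /^T/ { trade_qty[$2,$4] += $8  }
  END {
    for (i = 0; i < n_firms; i++)
      for (j = 0; j < n_dates; j++) {
        f = firm_list[i]; d = date_list[j]
        # print only firm/date pairs that actually occurred in the data
        if (((f, d) in order_qty) || ((f, d) in trade_qty))
          print f, d, order_qty[f,d] + 0, trade_qty[f,d] + 0
      }
  }' mode=firms "$tmp/firms.txt" mode=dates "$tmp/dates.txt" mode=data "$tmp/data.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Variable assignments such as mode=firms placed among the file operands take effect when awk reaches them, which is what lets one script treat the three files differently while still reading the large data file only once.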

If you want the firms in a specific (e.g. sorted) order, sort the firm list before printing. GNU awk provides built-in sort functions (asort and asorti). Otherwise, you'll have to write an awk function to do it. (See Programming Pearls or More Programming Pearls (or both) for more information on writing sort functions in awk.)
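For an awk without the GNU extensions, a hand-written sort could look like the sketch below. It is a plain insertion sort, not code taken from the books mentioned, and the function name sort_list, the array names and the sample firm symbols are all made up for illustration. With around 1000 firms, the quadratic cost of insertion sort should not matter much.

```shell
#!/bin/sh
result=$(printf 'ZETA\nACME\nMIDCO\nACME\n' | awk '
  # Hypothetical helper: sort arr[0..n-1] in place, ascending string order.
  # The extra parameters i, j, tmp are the usual awk idiom for local variables.
  function sort_list(arr, n,    i, j, tmp) {
    for (i = 1; i < n; i++) {
      tmp = arr[i]
      # appending "" forces a string comparison even for numeric-looking names
      for (j = i - 1; j >= 0 && (arr[j] "") > (tmp ""); j--)
        arr[j + 1] = arr[j]
      arr[j + 1] = tmp
    }
  }
  # remember each firm the first time it appears
  { if (seen[$1]++ == 0) firm_list[n_firms++] = $1 }
  END {
    sort_list(firm_list, n_firms)
    for (i = 0; i < n_firms; i++) print firm_list[i]
  }')

printf '%s\n' "$result"
```

The sorted firm_list can then drive the nested printing loop in place of the first-seen order.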

Warning: untested code.
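Since the code above is untested, a quick check on a few hand-made records may be worthwhile before running it on the full data. The sketch below assumes the same layout as the answer (firm in $2, date in $4) and uses invented sample records built by the hypothetical helpers mk_order and mk_trade. It also shows that the subscript `[$2,$4]` is ordinary string concatenation with awk's SUBSEP separator, so a combined firm/date key can be built on the fly and split apart again.

```shell
#!/bin/sh
# Hypothetical helpers: build records in the assumed layout
# (firm in $2, date in $4, order quantity in $15, trade quantity in $8).
mk_order() { printf 'O %s x %s x x x x x x x x x x %s\n' "$1" "$2" "$3"; }
mk_trade() { printf 'T %s x %s x x x %s\n' "$1" "$2" "$3"; }

result=$(
  {
    mk_order ACME 2006-04-03 100
    mk_trade ACME 2006-04-03 40
    mk_order ZETA 2006-04-03 10
    mk_order ACME 2006-04-03 50
    mk_order ACME 2006-04-04 7
    mk_trade ACME 2006-04-03 60
  } | awk '
    /^O/ { order_qty[$2 SUBSEP $4] += $15 }   # same key as order_qty[$2,$4]
    /^T/ { trade_qty[$2 SUBSEP $4] += $8  }   # same key as trade_qty[$2,$4]
    END {
      for (key in order_qty) {
        split(key, part, SUBSEP)              # recover firm and date
        print part[1], part[2], "O", order_qty[key]
      }
      for (key in trade_qty) {
        split(key, part, SUBSEP)
        print part[1], part[2], "T", trade_qty[key]
      }
    }' | LC_ALL=C sort                        # external sort for a stable order
)

printf '%s\n' "$result"
```

Piping through sort is only there to make the small test output deterministic; the awk part still makes a single pass over the input.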
