Seeking a reference to understand the pattern "!_[$0]++"

Question

I am an AWK newbie, using GNU utilities ported to Windows (UNXUtils) and gawk instead of awk. A solution on this forum worked like absolute magic, and I'm trying to find a source I can read to better understand the pattern expression offered in that solution.

In "Select unique or distinct values from a list in UNIX shell script" (http://stackoverflow.com/questions/618378/select-unique-or-distinct-values-from-a-list-in-unix-shell-script), an answer by Dimitre Radoulov offers the following code

zsh-4.3.9[t]%   awk '!_[$0]++' file

as a solution for selecting elements of a list with repeated and jumbled elements, listing each element only once.

I had previously used sort | uniq to do this, which worked fine for small test files. For my actual problem (extracting the list of company symbols from archival order book research data from India's National Stock Exchange for 16 days in April 2006, with 129+ million records in multiple files), the sorting burden became too much. And uniq only eliminates adjacent duplicates.
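To make that last point concrete, here is a small sketch, assuming a POSIX shell with printf, sort, uniq and wc available. The symbols in the sample are made up for illustration and are not taken from the actual data set.

```shell
# Hypothetical sample data: three distinct symbols, repeated and jumbled.
sample=$(printf 'INFY\nTCS\nINFY\nINFY\nWIPRO\nTCS\n')

# uniq alone only collapses adjacent repeats, so non-adjacent copies survive.
uniq_only=$(printf '%s\n' "$sample" | uniq | wc -l)

# Sorting first makes all duplicates adjacent, at the cost of sorting every record.
sorted_uniq=$(printf '%s\n' "$sample" | sort | uniq | wc -l)

echo "uniq alone: $((uniq_only)) lines; sort | uniq: $((sorted_uniq)) lines"
```

On this sample, uniq alone still leaves repeated symbols in its output, while sort | uniq reduces the list to the three distinct ones.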

Copying the above line for my Win-GNU gawk, I used

C:\Users\PAPERS\>  cat ..\Full*_Symbols.txt | gawk "!_[$0]++"  | wc -l

946

suggesting that the 129+ million records pertained to 946 different firms, which is a VERY reasonable answer. And it took under 5 minutes on my modest Windows machine, after hours of trying to SORT wore me out.

I looked at all the awk texts I have and searched a bit online. Part of the pattern is clear (! serves as NOT, $0 is the whole current record), but I am not able to find any explanation for the underscore _, and I have seen ++ in examples only as "update the counter by 1."

Will be grateful for any appropriate text or web reference to understand this example fully, as I think it will help me in other related cases as well. Thanks. Best,

Recommended answer

This really is quite clever!

It creates an associative array (meaning the "index" can be anything, not just a number). The first time a line is seen, the element _[$0] doesn't exist yet, so it evaluates to zero; the post-increment ++ returns that old value and then bumps the stored count to 1. The ! negates the zero, which makes the pattern true, and since no action is given awk performs the default action (which is to print the input line). Once a value has been seen, _[$0] is non-zero, so if the same line is encountered again the expression is false and nothing is printed.
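As a rough illustration of that logic, the one-liner can be spelled out as an explicit awk program. This is only a sketch: it assumes a POSIX shell (under cmd.exe the awk program would need double quotes, as in the question), the sample input is invented, and the array name seen is an arbitrary choice.

```shell
# Hypothetical sample input: jumbled lines with repeats.
input=$(printf 'b\na\nb\nc\na\n')

# The terse one-liner from the linked answer.
short=$(printf '%s\n' "$input" | awk '!_[$0]++')

# A spelled-out equivalent; "seen" is an arbitrary array name.
long=$(printf '%s\n' "$input" | awk '{
    if (seen[$0] == 0)   # an unset element compares equal to 0: first occurrence
        print $0         # same effect as the default action
    seen[$0]++           # count this occurrence so later copies are skipped
}')

printf '%s\n' "$short"
```

Both forms keep the first occurrence of each line and preserve the original order, which is why no sorting is needed. The trade-off is memory: the array holds one entry per distinct line.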

I think the underscore is just a "vanilla" variable name (you need a name for your array, and underscore is as valid as monkey but more "anonymous"). A classic!
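A quick way to check both points, assuming a POSIX shell: the sketch below swaps the underscore for the arbitrarily chosen name seen, and separately shows what ++ returns when applied to a variable that has never been set (the names old and n are illustrative only).

```shell
# Same filter with a descriptive, arbitrarily chosen array name.
named=$(printf 'x\ny\nx\n' | awk '!seen[$0]++')

# Post-increment yields the old value: an unset n counts as 0, then becomes 1.
post=$(awk 'BEGIN { old = n++; print old, n }')

echo "$named"
echo "$post"
```

The second command prints 0 1, which is the behaviour the pattern relies on: the expression sees the old count, while the array keeps the new one.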
