Delete the row if it contains more than a specific number of non-numeric values
Problem description
I have a large (2 GB) comma-separated text file containing data from sensors. Sometimes the sensors are off and there is no data. I want to delete the rows that contain more than a specified number of No Data, Off, or any other non-numeric values, excluding the header. I only want to count from the 3rd column onwards. For example, my data looks like this:
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1871XYZR/KB.RAT,Data from process value,Off,No Data, No Data
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
1871XKZR/KB.VAT,Data from process value,No Data,Off,2
Here the first row is the header, and I want to keep it as is. But I want to delete the rows that have 2 or more No Data, Off, or other non-numeric fields anywhere from the 3rd column onwards. In other words, rows in which 4 or more of the five fields are text. In the example above, the 3rd and 6th rows have 2 or more No Data or Off fields, and I want to delete them. My preferred output would therefore be:
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
I can do this for a specific case with a loop:
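awk -F, '{
    non_numeric = 0
    for (i = 1; i <= NF; i++) {
        # count fields that contain "No Data" or "Off"
        if ($i ~ /No Data|Off/) non_numeric++
    }
    # keep rows with fewer than two such fields
    if (non_numeric < 2) print $0
}' testfile.txt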
Here, I am considering only No Data and Off. How can I count all non-numeric strings? If I change the if statement to
if($i ~ /[^0-9]/ ) non_numeric++
it does not work and gives no output. Also, since I am using a loop, I reckon it is going to be slow. Can we speed this up somehow? Any command-line solution is OK.
Recommended answer
You could do this with grep:
grep -vP '((?<=,|^)(No Data|Off)(?=,|$).*){2,}' input
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
Explanation: (No Data|Off) matches either No Data or Off. We surround it with (?<=,|^) and (?=,|$); these are a zero-width lookbehind and lookahead that match a comma or the beginning (or the end) of the line. This ensures that only whole fields are matched. Since we want to match such a field at least twice, we put everything inside a quantified (...){2,} and add a .* to account for whatever lies between the fields. The -P option enables Perl-compatible regular expressions, which the lookarounds require, and -v inverts the match so that only the non-matching lines are printed.
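The grep command above only looks for whole fields equal to No Data or Off, and it would miss a field written with a leading space such as " No Data". For the more general "any non-numeric value from the 3rd column onwards" case, an awk pass may be easier to adapt. The attempt with /[^0-9]/ in the question most likely prints nothing because the loop starts at field 1, so Tag and Description always count as two non-numeric fields, and because the decimal point in values such as 2.1 also matches [^0-9]. What follows is only a sketch under some assumptions: the function name filter_rows and its two arguments are made up for illustration, a "number" is taken to be a plain decimal with an optional sign (no exponents), and empty fields are counted as non-numeric.

```shell
# Hypothetical helper (name and arguments are assumptions): filter_rows MAX FILE
# Prints the header unchanged, then every data row that has fewer than MAX
# non-numeric fields, counting from column 3 onwards.
filter_rows() {
    awk -F, -v max="$1" '
        NR == 1 { print; next }                 # header passes through as is
        {
            bad = 0
            for (i = 3; i <= NF; i++) {
                f = $i
                gsub(/^[ \t]+|[ \t]+$/, "", f)  # trim blanks around the field
                # a field that is not a plain decimal number counts as non-numeric
                if (f !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/ && ++bad >= max)
                    next                        # limit reached: drop the row early
            }
            print
        }
    ' "$2"
}
```

It could be called as filter_rows 2 testfile.txt > filtered.txt to drop rows with two or more non-numeric values. The file is read once, and the inner loop stops as soon as the limit is reached, so the per-field loop should not be a serious cost; how it compares with grep on a 2 GB file would have to be measured.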