Delete the row if it contains more than a specific number of non-numeric values

Question

I have a large (2 GB) comma-separated text file containing data from sensors. Sometimes the sensors are off and there is no data. I want to delete a row if it contains more than a specified number of No Data, Off, or any other non-numeric values; the header is excluded. I am only interested in counting from the 3rd column onwards. For example, my data looks like:

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1871XYZR/KB.RAT,Data from process value,Off,No Data, No Data
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
1871XKZR/KB.VAT,Data from process value,No Data,Off,2

Here the first row is the header and I want to keep it as is. But I want to delete those rows that have 2 or more No Data, Off, or any other non-numeric fields from the 3rd column onwards. In other words, rows having 4 or 5 text fields out of five. In the example above, the 3rd and 6th rows have 2 or more No Data or Off fields and I want to delete them. Therefore, my preferred output would be:

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80  

I can do this for the specific case with a loop:

awk -F, '{ non_numeric=0;
  for(i=1;i<=NF;i++){
    if($i ~ /No Data|Off/ ) non_numeric++
  }
  if(non_numeric<2) print $0
}' testfile.txt

Here, I am considering only No Data and Off. How can I count all non-numeric strings? If I change the if statement to

if($i ~ /[^0-9]/ ) non_numeric++

it does not work and gives no output. Also, since I am using a loop, I reckon it is going to be slow. Can we speed this up somehow? Any command-line solution is OK.

Recommended answer

You could do this with grep:

grep -vP '((?<=,|^)(No Data|Off)(?=,|$).*){2,}' input

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80

Explanation: (No Data|Off) matches either No Data or Off. We surround it with (?<=,|^) and (?=,|$); these are a zero-width lookbehind and lookahead that match a , or the beginning (or the end) of the line. This ensures that we match whole fields only. Since we want to match a field multiple times, we put everything inside a quantified (...){2,} and add .* to account for the content between the fields.
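
The grep pattern above only recognises the literal strings No Data and Off. For the more general "any non-numeric field" case asked about in the question, a single-pass awk filter is one possible approach. The sketch below is not part of the original answer: the function name filter_rows, the variable max, and the definition of "numeric" (optional sign, digits, optional decimal part, no exponent) are assumptions made for illustration. It also shows why the /[^0-9]/ test in the question printed nothing: that pattern flags any field containing a single non-digit character, including the decimal point in 2.1, and because the loop started at column 1, the two text columns alone already pushed every row to the limit of 2.

```shell
#!/bin/sh
# filter_rows MAX [FILE...]   (hypothetical helper name)
# Keeps the header, then drops every row that has MAX or more
# non-numeric fields, counting from the 3rd column onwards.
filter_rows() {
  max=$1
  shift
  awk -F, -v max="$max" '
    NR == 1 { print; next }            # always keep the header line
    {
      bad = 0
      for (i = 3; i <= NF; i++) {      # skip the two descriptive columns
        f = $i
        gsub(/^[ \t]+|[ \t]+$/, "", f) # ignore padding around the value
        # Assumed definition of numeric: optional sign, then digits with
        # an optional decimal part, or a bare fraction such as .5
        if (f !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/ && ++bad >= max)
          next                         # limit reached: drop the row early
      }
      print
    }' "$@"
}

# Demo on the sample data from the question.
filter_rows 2 <<'EOF'
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1871XYZR/KB.RAT,Data from process value,Off,No Data, No Data
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
1871XKZR/KB.VAT,Data from process value,No Data,Off,2
EOF
```

On the sample data this keeps the header and the 1827XYZR, 1962XYMK, and 1867XYST rows, which is the preferred output from the question. The loop stops scanning a row as soon as the limit is reached, so rows that will be dropped cost less than a full scan; whether that is fast enough for a 2 GB file would need to be measured on the real data.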
