Validating unique values of a column in shell
Problem description
I get an input file vendor.csv which has a column called retailer. I have a predefined list of valid retailer values, which are a, b, c. If 'd' appears in the retailer column I have to take some action: mostly echo it to a log, stop the processing, and notify the user.
I have done the following so far:
f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k
I get
a b c d
The above output is not comma separated.
For the above case, I can store the valid values a b c in a file or a string.
How do I make the check now? Is this the best way to go about this?
The valid values are
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
The existing data contains the following unique data points:
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
So "Hy-Vee Total Ecom TA" is an invalid value.
Here is my attempt with grep:
$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
grep -v "$s" "$k"
Error:
grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory
Some of the solutions have pointed me in the right direction. In R I would go about the above task as
valid_values <- c('a', 'b', 'c')
invalid_retailer <- setdiff(unique(vendorfile$retailer), valid_values)
I am trying to replicate this in shell.
Recommended answer
Maybe something like this?
awk -F, 'NR==FNR { ++a[$1]; next }
!a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final*.csv
where valid.txt contains your valid labels, one per line.
The general pattern awk 'NR==FNR { ++a[$1] }' is a common way to read the first of a set of files into an array in memory and then, in the remainder of the script, perform some sort of join (in the database sense) with fields in the other input files. Awk processes one line at a time, so the other files can be arbitrarily large. You do need to be able to store the data from the first file in memory, though.
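As a rough illustration of that two-file pattern, here is a minimal sketch on made-up data. The scratch directory, the file name final_sample.csv, and the numeric second column are assumptions for the demo, not part of the original question; only the labels come from it.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory; file names are illustrative.
tmp=$(mktemp -d)
printf '%s\n' 'ALB/SFY Total Ecom TA' 'Peapod Total Ecom TA' \
  'Target Total Ecom TA' > "$tmp/valid.txt"
printf '%s\n' \
  'ALB/SFY Total Ecom TA,100' \
  'Hy-Vee Total Ecom TA,200' \
  'Target Total Ecom TA,300' > "$tmp/final_sample.csv"

# NR==FNR holds only while awk reads the first file: remember each label.
# For every later file, print the line number and the whole line whenever
# the first field was never seen in the first file.
invalid=$(awk -F, 'NR==FNR { ++a[$1]; next }
  !a[$1] { print FNR ": " $0 }' "$tmp/valid.txt" "$tmp/final_sample.csv")

printf '%s\n' "$invalid"
rm -rf "$tmp"
```

One caveat with this idiom: if the first file is empty, NR==FNR is also true for the second file, so an empty valid.txt would silently load the data file as the list of valid labels.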
The advantage over your basic cut + grep attempt is that we can print the entire input line, rather than just tell you which labels are invalid and have you go back and manually find out which lines in which files actually contained the violation.
Tangentially, your grep attempt has a number of issues. Firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tell grep to match text literally (-F; without this, a.c matches abc because the dot is a regex wildcard character, for example) and to require that matches cover an entire line (-x; without this, b matches abc because it is a substring).
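The effect of the two options can be checked with a small experiment along these lines. The variable names are only for the demo.

```shell
#!/bin/sh
# Without -F the dot is a regex wildcard, so the pattern a.c matches abc.
regex_hit=$(printf 'abc\n' | grep 'a.c')

# With -F the pattern is literal text, so nothing matches (grep exits 1).
literal_hit=$(printf 'abc\n' | grep -F 'a.c' || true)

# Without -x a substring is enough, so b matches abc.
substring_hit=$(printf 'abc\n' | grep -F 'b')

# With -x the pattern must cover the whole line, so nothing matches.
whole_line_hit=$(printf 'abc\n' | grep -Fx 'b' || true)

printf 'regex=%s literal=%s substring=%s whole=%s\n' \
  "$regex_hit" "$literal_hit" "$substring_hit" "$whole_line_hit"
```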
cut -d, -f1 final*.csv | sort -u |
grep -vxFf valid.txt
The -f filename option says to read the patterns from a file, and without another file name, grep processes standard input (from the pipe, in this case).
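To get from this pipeline to the "log it and stop" behaviour the question asks for, one possible sketch is below. The function name check_retailers, the log file name, and the sample data are all hypothetical; a real script would point at the actual final*.csv files and would exit where the comment indicates.

```shell
#!/bin/sh
# Hypothetical sample data and log location, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' a b c > "$tmp/valid.txt"
printf '%s\n' 'a,1' 'd,2' 'b,3' 'd,4' > "$tmp/final_sample.csv"
log="$tmp/validate.log"
: > "$log"

# check_retailers VALID_FILE CSV_FILE LOG_FILE
# Appends one line per invalid label to the log and returns 1 if any exist.
check_retailers() {
  # Unique first-column values that are not an exact, literal match
  # for any line of the valid-label file.
  bad=$(cut -d, -f1 "$2" | sort -u | grep -vxFf "$1" || true)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad" | sed 's/^/Invalid retailer: /' >> "$3"
    return 1
  fi
  return 0
}

status=0
check_retailers "$tmp/valid.txt" "$tmp/final_sample.csv" "$log" || status=$?
if [ "$status" -ne 0 ]; then
  echo "Validation failed; see $log" >&2
  # A real script would stop processing here, for example with: exit 1
fi

log_contents=$(cat "$log")
rm -rf "$tmp"
```

Relying on the captured output rather than on grep's exit status keeps the check readable, since grep -v returns 1 precisely in the good case where every label is valid.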