Validating unique values of a column in shell


Problem description

I get an input file, vendor.csv, which has a column called retailer. I have a predefined list of valid retailer values: a, b and c. If 'd' appears in the retailer column, I have to take some action: mostly echo it to a log, stop the processing, and notify the user.

Here is what I have done so far:

f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k

a b c d

The above output is not comma-separated.

For the above case, I can store the valid values a, b, c in a file or a string.

How do I make the check now? Is this the best way to go about it?

The valid values are ALB/SFY Total Ecom TA, Peapod Total Ecom TA and Target Total Ecom TA.

The existing data contains the following unique data points: ALB/SFY Total Ecom TA, Hy-Vee Total Ecom TA, Peapod Total Ecom TA and Target Total Ecom TA.

So "Hy-Vee Total Ecom TA" is an invalid value.

Here is my attempt with grep:

$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

$ echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

$ grep -v "$s" "$k"

Error:

grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory

Some of the solutions have pointed me in the right direction. In R, I would go about the above task as follows:

valid_values <- c('a', 'b', 'c')
# labels present in the data but not in the list of valid values
invalid_retailer <- setdiff(unique(vendorfile$retailer), valid_values)

I am trying to replicate the same in shell.

Recommended answer

Maybe something like this?

awk -F, 'NR==FNR { ++a[$1]; next }
    !a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final*.csv

where valid.txt contains your valid labels, one per line.

The general pattern awk 'NR==FNR { ++a[$1] }' is a common way to read the first of a set of files into an array in memory and then, in the remainder of the script, perform some sort of join (in the database sense) with fields in the other input files. Awk processes one line at a time, so the other files can be arbitrarily large. You do need to be able to store the data from the first file in memory, though.
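To make the join idea concrete, here is a minimal sketch of the same NR==FNR pattern used as a lookup. The file names names.csv and sales.csv and their contents are hypothetical sample data, not from the question, and mktemp -d is assumed to be available on your system.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory (not from the question).
tmp=$(mktemp -d)
printf '%s\n' 'a,Alpha' 'b,Beta' > "$tmp/names.csv"
printf '%s\n' 'a,10' 'b,20' 'a,30' > "$tmp/sales.csv"

# NR==FNR is true only while awk reads the first file, so the first rule
# builds the lookup table; the second rule runs for the remaining files.
joined=$(awk -F, '
    NR==FNR { name[$1] = $2; next }
    ($1 in name) { print $1 "," $2 "," name[$1] }
' "$tmp/names.csv" "$tmp/sales.csv")

echo "$joined"
rm -rf "$tmp"
```

Only names.csv is held in memory; sales.csv is streamed one line at a time.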

The advantage over your basic cut+grep attempt is that we can report the file name and line number (or print the entire input line) rather than just tell you which labels are invalid and leave you to go back and find out manually which lines in which files actually contained the violation.
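As a rough illustration of that advantage, the sketch below prints each offending line in full and also makes awk exit non-zero so that a calling script can stop. The sample files, the variable name bad and the exit code convention are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical sample data; valid.txt holds one valid label per line.
tmp=$(mktemp -d)
printf '%s\n' a b c > "$tmp/valid.txt"
printf '%s\n' 'a,10,x' 'd,20,y' > "$tmp/final1.csv"

# Print file name, line number and the whole offending line ($0),
# and remember that at least one violation was seen.
lines=$(awk -F, '
    NR==FNR { ++a[$1]; next }
    !a[$1] { print FILENAME ":" FNR ": " $0; bad = 1 }
    END { exit (bad ? 1 : 0) }
' "$tmp/valid.txt" "$tmp/final1.csv")
rc=$?

echo "$lines"
echo "awk exit status: $rc"
rm -rf "$tmp"
```

A caller could test the exit status and abort, for example with `[ "$rc" -eq 0 ] || exit 1`.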

Tangentially, your grep attempt has a number of issues. The immediate error arises because grep treats its second argument as a file name, not as text to search. Beyond that, firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tweak your options to tell grep that you want to match text literally (-F; without this, a.c matches abc because the dot is a regex wildcard character, for example) and that you want matches to cover an entire line (-x; without this, b matches abc because it is a substring).
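The effect of the two options can be checked with a short experiment. This is only a sketch; the variable names are made up for the demonstration.

```shell
#!/bin/sh
# Without -F the dot is a regex wildcard, so the pattern a.c matches abc.
regex_hit=$(printf 'abc\n' | grep 'a.c')
# With -F the pattern is a literal string, so a.c no longer matches abc.
fixed_hit=$(printf 'abc\n' | grep -F 'a.c')
# Without -x a substring is enough, so b matches abc.
substr_hit=$(printf 'abc\n' | grep 'b')
# With -x the pattern has to match the whole line, so b no longer matches.
line_hit=$(printf 'abc\n' | grep -x 'b')

printf 'regex=[%s] fixed=[%s] substring=[%s] whole-line=[%s]\n' \
    "$regex_hit" "$fixed_hit" "$substr_hit" "$line_hit"
# prints: regex=[abc] fixed=[] substring=[abc] whole-line=[]
```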

cut -d, -f1 final*.csv | sort -u |
grep -vxFf valid.txt

The -f filename option tells grep to read the patterns from a file, and with no other file name given, grep processes standard input (from the pipe, in this case).
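To tie this back to the requirement of logging the problem and stopping, one possible wrapper is sketched below. It relies on grep exiting with status 0 when it selects at least one line, 1 when it selects none, and a higher value on error. The sample data, the log path and the function name check_retailers are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data and log location, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'ALB/SFY Total Ecom TA' 'Peapod Total Ecom TA' \
    'Target Total Ecom TA' > "$tmp/valid.txt"
printf '%s\n' 'ALB/SFY Total Ecom TA,1' 'Hy-Vee Total Ecom TA,2' \
    'Peapod Total Ecom TA,3' > "$tmp/final1.csv"
log="$tmp/validate.log"

check_retailers() {
    # Prints labels that are not in valid.txt; the status is that of grep.
    cut -d, -f1 "$tmp"/final*.csv | sort -u | grep -vxFf "$tmp/valid.txt"
}

invalid=$(check_retailers)
rc=$?
case $rc in
    0)  # grep selected at least one line: invalid labels exist
        echo "$invalid" | sed 's/^/Invalid retailer: /' >> "$log"
        status=1 ;;   # a real script would probably exit 1 here
    1)  status=0 ;;   # nothing selected: every label is valid
    *)  echo "grep failed with status $rc" >> "$log"
        status=2 ;;
esac

echo "status=$status"
[ -f "$log" ] && cat "$log"
rm -rf "$tmp"
```

Treating statuses above 1 separately avoids mistaking a missing valid.txt for a clean run.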
