Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

Problem description

I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem.

Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.

From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.

Here's a sample of my input data:

 2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,

This is what I have so far:

#!/bin/bash

#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"

# Remove commas in between quotes.

#Loop through CSV file

OLDIFS=$IFS
IFS=,
[ ! -f "$filename" ] && { echo "$filename not found"; exit 99; }
# -r keeps backslashes in the log data from being treated as escapes
while read -r time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
        #Remove headers
        if [  "$user" != "" ] && [ "$user" != "User" ]
        then
                #Remove any file name with an apostrophe

                if [[ "$document" =~ "'" ]];
                then
                        document="REDACTED"; # Lazy. Need to figure out a proper solution later.
                fi

                echo "$time"
                #Save results to database
                mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
        fi
done < "$filename"
IFS=$OLDIFS

Which option is more suitable for this task? Will I have to create a second temporary file to get this done?

Thanks in advance!

Recommended answer

As I wrote in another answer:

Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:

s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g

And then splitting on | (assuming none of your data has | in it).

Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern: http://stackoverflow.com/questions/33054559/is-it-possible-to-write-a-regular-expression-that-matches-a-particular-pattern-a/33056188#33056188
