How do I convert a tab-separated values (TSV) file to a comma-separated values (CSV) file in BASH?
Question
I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this, but I am worried it will make some mistakes:
sed 's/\t/,/g' file.tsv > file.csv
- Quotes do not need to be added.
How can I convert a TSV to a CSV?
Recommended answer
Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.
To summarize the options (interestingly, they all perform about the same):
tr:
devnull's solution (provided in a comment on the question) is the simplest:
tr '\t' ',' < file.tsv > file.csv
sed:
The OP's own sed solution is perfectly fine, provided that the input contains no quoted strings (with potentially embedded \t chars.):
sed 's/\t/,/g' file.tsv > file.csv
The only caveat is that on some platforms (e.g., macOS) sed does not support the escape sequence \t, so a literal tab char. must be spliced into the command string using ANSI C quoting ($'\t'):
sed 's/'$'\t''/,/g' file.tsv > file.csv
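For shells that lack $'...' quoting, one possible alternative is to keep the tab in a shell variable. This is only a sketch: the variable names tab and tmp and the sample data are made up for illustration, and it assumes a POSIX printf.

```shell
# Hypothetical sample input: two tab-separated records in a temporary file.
tmp=$(mktemp)
printf 'a\tb\tc\n1\t2\t3\n' > "$tmp"

# Store a literal tab in a variable; printf expands \t on every platform.
tab=$(printf '\t')

# Double quotes let the shell expand $tab before sed sees the script,
# so sed itself never has to interpret \t.
out=$(sed "s/$tab/,/g" "$tmp")
printf '%s\n' "$out"

rm -f "$tmp"
```

Because the shell substitutes the tab, the same command line should behave the same with GNU and BSD sed.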
awk:
The caveat with awk is that FS (the input field separator) must be set to \t explicitly; the default behavior would otherwise strip leading and trailing tabs and replace interior spans of multiple tabs with only a single , char.:
awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS (the output field separator); this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
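To make both points concrete, here is a small sketch that runs three variants on one made-up record containing empty fields. The variable names and the sample record are assumptions for illustration only.

```shell
# Hypothetical record: empty first field, then "a", an empty field, then "b".
line=$(printf '\ta\t\tb')

# Default FS: runs of whitespace count as one separator and leading
# whitespace is ignored, so the empty fields are lost.
default_fs=$(printf '%s\n' "$line" | awk 'BEGIN { OFS="," } { $1=$1; print }')

# Explicit FS="\t": every tab delimits a field, so empty fields survive.
tab_fs=$(printf '%s\n' "$line" | awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }')

# Without the $1=$1 assignment the record is never rebuilt,
# so OFS is not applied and the tabs remain.
no_rebuild=$(printf '%s\n' "$line" | awk 'BEGIN { FS="\t"; OFS="," } { print }')

printf 'default FS: %s\n' "$default_fs"
printf 'FS=tab:     %s\n' "$tab_fs"
```

With the default FS the record collapses to two fields, whereas FS="\t" keeps all four, including the empty ones.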
Robust awk solution:
As A. Rabus points out, the above solutions do not correctly handle unquoted input fields that themselves contain , characters: you'll end up with extra CSV fields.
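The effect can be seen by counting fields before and after a naive conversion. The record below is invented for illustration, and the variable names are hypothetical.

```shell
# Hypothetical TSV record with two fields; the first contains a comma.
line=$(printf 'Smith, John\t42')

# Naive conversion: every tab becomes a comma.
converted=$(printf '%s\n' "$line" | tr '\t' ',')

# Count the fields as a tab-splitting and a comma-splitting reader would see them.
tsv_fields=$(printf '%s\n' "$line" | awk 'BEGIN { FS="\t" } { print NF }')
csv_fields=$(printf '%s\n' "$converted" | awk 'BEGIN { FS="," } { print NF }')

printf 'TSV fields: %s, CSV fields: %s\n' "$tsv_fields" "$csv_fields"
```

The two-field TSV record turns into a three-field CSV record, because the embedded comma is indistinguishable from a separator.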
The following awk solution fixes this by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).
If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. Thanks, Wyatt Israel.
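awk 'BEGIN { FS="\t"; OFS="," } {
  rebuilt=0
  for(i=1; i<=NF; ++i) {
    # Quote fields that contain , or " unless they are already enclosed in "..."
    if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
      gsub("\"", "\"\"", $i)   # escape embedded " chars. by doubling them
      $i = "\"" $i "\""        # assigning to $i also rebuilds the record with OFS
      rebuilt=1
    }
  }
  if (!rebuilt) { $1=$1 }      # force a rebuild when no field was modified
  print
}' file.tsv > file.csv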
- $i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes
- gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them
- $i = "\"" $i "\"" updates the result by enclosing it in double quotes
- As stated before, updating any field causes awk to rebuild the line from the fields with the OFS value, i.e., , in this case, which amounts to the effective TSV -> CSV conversion; the flag rebuilt is used to ensure that each input record is rebuilt at least once.