Linux join utility complains about input file not being sorted
Question
I have two files:
file1 has the format:
field1;field2;field3;field4
(file1 is initially unsorted)
file2 has the format:
field1
(file2 is sorted)
I run the following 2 commands:
sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2
I get the following message:
join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order
Why is this happening?
(I also tried to sort file1 taking into consideration the entire line, not only the first field of the line, but with no success.)
sort -t\; -c file1
doesn't output anything. Around line 27497, the situation is indeed strange, which means that sort doesn't do its job correctly:
XYZ113017;...
line 27497--> XYZ11301;...
XYZ11301;...
Recommended answer
To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem).
- When using join, the input files must be sorted by the join field ONLY; otherwise you may see the warning reported by the OP.
There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:
- If you DO specify a field, it's easy to forget that you must also specify a STOP field - even if you target only 1 field - because sort uses the remainder of the line if only a START field is specified; e.g.:
sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
sort -t, -k1,1 ... # Field 1 only
- If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.
  - However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
sort ... # NOT always the same as 'sort -k1,1'! see below for example
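Applied to the setup from the question, the two points above might look like the following minimal sketch. The sample rows and the temporary directory are made up for illustration, and pinning LC_ALL=C is an assumption made here so that sort and join agree on a plain byte-order collation; it is not something the question itself did.

```shell
#!/usr/bin/env bash
# Sketch only: sample data is invented; files live in a temp dir.
export LC_ALL=C   # assumption: use byte-order collation for both sort and join

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# file1: unsorted; one key (XYZ11301) is a prefix of another (XYZ113017).
printf '%s\n' 'XYZ113017;a;b;c' 'XYZ2;d;e;f' 'XYZ11301;g;h;i' > "$tmp/file1"
# file2: a single field per line, already sorted.
printf '%s\n' 'XYZ11301' 'XYZ113017' > "$tmp/file2"

# Sort by field 1 ONLY (-k1,1), not from field 1 to the end of the line (-k1).
sort -t';' -k1,1 -o "$tmp/file1" "$tmp/file1"

# Check the order using the same key definition that join relies on.
sort -t';' -k1,1 -c "$tmp/file1" && echo 'file1 is sorted on field 1'

joined=$(join -t';' -1 1 -2 1 -o 1.1,1.2,1.3,1.4 "$tmp/file1" "$tmp/file2")
printf '%s\n' "$joined"
```

Note that the check in the question, sort -t\; -c file1, verifies whole-line order, which is why it stayed silent while join still complained; adding -k1,1 to the check makes it test the order join actually needs.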
Example of the pitfall:
#!/usr/bin/env bash
# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF
# NOTE: "WRONG" below refers to deviation from the expected outcome
# of sorting by field 1 only, based on mistaken assumptions.
# The commands do work correctly in a technical sense.
echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first
Explanation:
When NOT limiting sorting to the first field, it is the relative sort order of the chars ^ and A (column index 6) that matters in this example. In other words: the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A, and therefore sorts after 'A', resulting in the line starting with nameAA^ sorting BEFORE the one with nameA^.
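The comparison described above can be reproduced with a short sketch. It assumes the C locale (set explicitly here), in which sort compares byte values; under other locales the whole-line result may differ.

```shell
#!/usr/bin/env bash
# Sketch: pin the locale to C so that comparison is by byte value (assumption).
export LC_ALL=C

# Byte values of the two characters that meet at column 6.
printf '%d\n' "'A"   # 65
printf '%d\n' "'^"   # 94

input=$(printf '%s\n' 'nameA^other1' 'nameAA^other2' 'nameZ^other3')

# Whole-line sort: the separator '^' is compared against the data char 'A'.
whole_line=$(printf '%s\n' "$input" | sort | head -n 1)
# Field-1-only sort: the separator never takes part in the comparison.
field_only=$(printf '%s\n' "$input" | sort -t'^' -k1,1 | head -n 1)

echo "whole line: $whole_line"   # nameAA^other2
echo "field 1:    $field_only"   # nameA^other1
```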
Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:
- sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
- sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before , [1]
[1] I don't know why - if anyone knows, please let me know. Test with sort <<<$'-\n,'
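One way to take the platform difference out of the picture is to compare the two characters under a fixed locale. The sketch below assumes that forcing LC_ALL=C is acceptable for the data at hand; the file names in the trailing comments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: order of ',' and '-' under the C locale versus the current locale.
c_order=$(printf '%s\n' '-' ',' | LC_ALL=C sort | tr -d '\n')
echo "C locale order:       $c_order"     # ,-  (byte 44 before byte 45)

cur_order=$(printf '%s\n' '-' ',' | sort | tr -d '\n')
echo "current locale order: $cur_order"   # may differ by platform and locale

# Running sort and join under the same fixed locale keeps them consistent
# (hypothetical file names):
#   LC_ALL=C sort -t, -k1,1 file1 > file1.sorted
#   LC_ALL=C join -t, file1.sorted file2.sorted
```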