Linux join utility complains about input file not being sorted


Problem description

I have two files:

file1 has the format:

field1;field2;field3;field4

(file1 is initially unsorted)

file2 has the format:

field1

(file2 is sorted)

I run the following 2 commands:

sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2

I get the following message:

join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order

Why is this happening?

(I also tried sorting file1 by the entire line rather than only its first field, with no success.)

sort -t\; -c file1 doesn't output anything. Around line 27497 the situation is indeed strange, which means that sort doesn't do its job correctly:

              XYZ113017;...
line 27497--> XYZ11301;...
              XYZ11301;...

Recommended answer

To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem).

  • When using join, the input files must be sorted by the join field ONLY; otherwise you may see the warning reported by the OP.

There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:

  • If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:

    sort -t, -k1 ...   # !! FROM field 1 THROUGH THE REST OF THE LINE
    sort -t, -k1,1 ... # Field 1 only

  • If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.

  • However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as sorting by the 1st field only:

    sort ... # NOT always the same as 'sort -k1,1'! see below for example
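Applied to the commands in the question, the two points above suggest sorting file1 on field 1 only before joining. The following is a minimal sketch rather than a definitive fix: the sample rows are made up for illustration, and pinning LC_ALL=C for both sort and join is an added assumption so that both tools compare keys by byte value.

```shell
#!/bin/sh
# Hypothetical sample data; only the file names and the ';' separator
# come from the question.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

printf '%s\n' 'XYZ113017;a;b;c' 'XYZ11301;d;e;f' 'ABC1;g;h;i' > "$tmpdir/file1"
printf '%s\n' 'XYZ113017' 'ABC1' 'XYZ11301' > "$tmpdir/file2"

# Sort on field 1 ONLY (-k1,1), using the same collation that join will use.
LC_ALL=C sort -t';' -k1,1 -o "$tmpdir/file1" "$tmpdir/file1"
# file2 has a single field, so a plain sort already sorts by the join field.
LC_ALL=C sort -o "$tmpdir/file2" "$tmpdir/file2"

joined=$(LC_ALL=C join -t';' -1 1 -2 1 -o 1.1,1.2,1.3,1.4 \
  "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$joined"
```

With this data, XYZ11301 sorts before XYZ113017 because the key comparison stops at the end of field 1 instead of continuing into the separator.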

Example of the pitfall:

#!/usr/bin/env bash

# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF

# NOTE: "WRONG" below refers to deviation from the expected outcome
#       of sorting by field 1 only, based on mistaken assumptions.
#       The commands do work correctly in a technical sense.

echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first

Explanation:

  • When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words, the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after A, resulting in the line starting with nameAA^ sorting BEFORE the one starting with nameA^.
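The character codes behind this explanation can be checked directly. This is a small sketch, assuming a POSIX shell whose printf supports the leading-quote numeric form and assuming LC_ALL=C so that sort compares bytes; the variable names are illustrative.

```shell
#!/bin/sh
# A leading single quote in the argument makes printf print the numeric
# code of the character that follows it.
sep_code=$(printf '%d' "'^")
a_code=$(printf '%d' "'A")
echo "code of ^: $sep_code, code of A: $a_code"

# Whole-line comparison reaches column 6, where 'A' beats '^'.
whole_first=$(printf 'nameA^other1\nnameAA^other2\n' | LC_ALL=C sort | head -n 1)
# Field-only comparison never looks at the separator.
field_first=$(printf 'nameA^other1\nnameAA^other2\n' | LC_ALL=C sort -t'^' -k1,1 | head -n 1)

echo "whole line: $whole_first"
echo "field 1 only: $field_first"
```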

Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:

  • sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
  • sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before ,[1]

[1] I don't know why; if somebody knows, please let me know. Test with sort <<<$'-\n,'
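One way to take the locale out of the picture is to force byte-order collation with LC_ALL=C. The sketch below compares that against en_US.UTF-8 when the locale happens to be installed; the locale check via locale -a is an assumption about the system, and the non-C result is printed without any claim about what it will be, since it depends on the platform's collation tables.

```shell
#!/bin/sh
# In the C locale, sort compares raw byte values: ',' (44) before '-' (45).
c_order=$(printf '%s\n' '-' ',' | LC_ALL=C sort | paste -sd' ' -)
echo "C locale order: $c_order"

# Only try the UTF-8 locale if the system reports it as available.
if locale -a 2>/dev/null | grep -Eqi '^en_US\.utf-?8$'; then
  utf8_order=$(printf '%s\n' '-' ',' | LC_ALL=en_US.UTF-8 sort | paste -sd' ' -)
  echo "en_US.UTF-8 order: $utf8_order"
else
  echo "en_US.UTF-8 not installed; skipping comparison"
fi
```

If sort and join are both run under the same LC_ALL setting, they at least agree with each other, whichever order the locale defines.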
