Globbing accented files in Bash

Question

I'm trying to verify a file exists in Bash. I know the file name (in a variable) but not the extension (can be .pmdl or .umdl).

This works on OSX:

$> ls
ecole.pmdl
$> filename="ecole"
$> ls "$filename."[pu]mdl
ecole.pmdl

But it doesn't when the file name contains an accent:

$> ls
école.pmdl
$> filename="école"
$> ls "$filename."[pu]mdl
ls: école.[pu]mdl: No such file or directory

However it works if I don't use globbing:

$> ls "$filename."pmdl
école.pmdl

I'm looking for a simple solution that works in both Linux & OSX. This is the closest question I found on that topic.

$> bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

Short version to prove that the scenario fails (systematically) with same é char on OSX Bash v3.2.57. The same scenario on Linux Bash 4.3.30 works systematically (found).

$> touch é.txt
$> ls é*
ls: é*: No such file or directory

Recommended answer

tl;dr

Either: use one of the following workarounds:

  • ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl - most generic, but cumbersome.
  • ls $'e\xcc\x81cole'.[pu]mdl - hard to remember, and specific to the diacritic at hand (acute accent, ´).
  • ls e?cole.[pu]mdl - simple to type and remember, but limited to 1 combining diacritic and can yield false positives.

Or: install Bash 4.3.30 or higher via Homebrew and use it instead of the Bash 3.x that macOS still comes with: brew install bash.

Details below.

Regarding non-ASCII characters:

  • the macOS filesystem, HFS+, speaks only NFD (decomposed Unicode normalization form), where accented letters are represented by 2 or more Unicode codepoints: the base letter, followed by the combining diacritic(s) (accent mark(s)):

    • In the case of é:
      • The ASCII base letter - e (U+0065, UTF-8 encoding 0x65)
      • followed by the combining acute accent (the ´ above the preceding base letter, U+0301, UTF-8 encoding 0xcc 0x81).

      Typically, however - such as when you type characters in a terminal or in most editors - NFC (composed Unicode normalization form) is used, where (customary) accented letters are represented by 1 Unicode codepoint:

      • In the case of é: single Unicode character U+00E9, UTF-8 encoding 0xc3 0xa9.
      • NFD and NFC should be treated as equivalent, but Bash 3.x, as found on macOS, does not do so: when globbing, it takes NFC (and also NFD) input as-is (whether typed in the terminal or saved by most editors in UTF-8-encoded scripts) and matches it codepoint by codepoint against the filesystem's NFD representation, without recognizing equivalent NFC and NFD representations.
        In effect, accented NFC characters typed in the terminal or produced by most editors do NOT match their NFD equivalents in the HFS+ filesystem.
      • By contrast, specifying literal filenames - without globbing - is not affected: ls école, expressed as NFC, does find the file named école, which is stored in NFD - presumably, because Bash just passes the NFC representation to a system function that does recognize the equivalence.
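
Because literal (non-glob) filenames are unaffected, one way to sidestep the problem when the set of extensions is known in advance is to test each candidate name explicitly. A minimal sketch, assuming only the two extensions from the question; the function name model_exists is hypothetical:

```shell
# Hypothetical helper: succeed if "<base>.pmdl" or "<base>.umdl" exists.
# No globbing is involved, so the shell never has to compare NFC against NFD.
model_exists() {
  local base=$1 ext
  for ext in pmdl umdl; do
    if [ -e "$base.$ext" ]; then
      return 0
    fi
  done
  return 1
}

# Usage: create a scratch file and look it up by its base name.
dir=$(mktemp -d)
touch "$dir/école.pmdl"
if model_exists "$dir/école"; then
  found=yes
else
  found=no
fi
echo "found: $found"
rm -rf "$dir"
```

This only helps when the extensions can be enumerated; for open-ended patterns a glob is still needed.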

      Read about these Unicode normal (normalization) forms here.

      In short: Bash should recognize NFD and NFC representations as equivalent, but doesn't, as of the obsolete version that macOS 10.12.1 comes with - Bash 3.2.57.

      While the problem has been fixed as of at least Bash 4.3.30 when run on macOS, Apple isn't updating to Bash 4.x versions for licensing reasons (see below for a solution).

      See the bottom of this post for a look at the Linux world.

      There are workarounds for globbing filenames with accented characters on macOS:

      • [if feasible] Using Homebrew, install the latest 4.x Bash version and use it instead of the one that comes with macOS: brew install bash.

      • Note that if you use such a Bash version (>= 4.3.30), not only are the other workarounds described below no longer necessary, they actually stop working, because Bash then only supports NFC input as part of globbing patterns (but maps it correctly onto NFD equivalents in the filesystem).
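
Whether the other workarounds are needed therefore depends on the running Bash version. A minimal sketch of such a check, assuming 4.3.30 as the threshold only because that is the version reported above as working; the helper name bash_at_least is hypothetical:

```shell
# Hypothetical helper: succeed if the running Bash is at least major.minor.patch.
# Relies on the BASH_VERSINFO array, so it must run under Bash itself.
bash_at_least() {
  local major=$1 minor=$2 patch=$3
  local have_major=${BASH_VERSINFO[0]}
  local have_minor=${BASH_VERSINFO[1]}
  local have_patch=${BASH_VERSINFO[2]}
  (( have_major > major )) && return 0
  (( have_major < major )) && return 1
  (( have_minor > minor )) && return 0
  (( have_minor < minor )) && return 1
  (( have_patch >= patch ))
}

if bash_at_least 4 3 30; then
  echo "Bash $BASH_VERSION: plain NFC glob patterns are expected to work"
else
  echo "Bash $BASH_VERSION: an NFD workaround may be needed on macOS"
fi
```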

      [robust, but more elaborate] Use iconv -t UTF-8-MAC to convert your Bash string literal from NFC to NFD so that it matches the filesystem representation:
      ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl
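
To apply this to a filename held in a variable, as in the question, the conversion can be wrapped in a small function. A sketch, assuming that where iconv does not know the UTF-8-MAC encoding (which is common outside macOS) the name should simply be left unchanged; to_nfd is a hypothetical name:

```shell
# Hypothetical helper: print $1 converted to NFD where iconv supports UTF-8-MAC;
# otherwise print $1 unchanged (assumption: no conversion is wanted there).
to_nfd() {
  local converted
  if converted=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-8-MAC 2>/dev/null); then
    printf '%s' "$converted"
  else
    printf '%s' "$1"
  fi
}

filename="école"
# Without a match the pattern stays literal, so -e filters that case out.
for f in "$(to_nfd "$filename")."[pu]mdl; do
  if [ -e "$f" ]; then
    echo "found: $f"
  fi
done
```

As noted above, this is only appropriate for the Bash 3.x that ships with macOS; with Bash >= 4.3.30 on macOS the unconverted NFC name should be used instead.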

      • Alternatively, it is possible, but obscure and cumbersome, to use an ANSI C-quoted string to represent the exact NFD UTF-8 byte sequence:
        ls $'e\xcc\x81cole'.[pu]mdl

      [simpler, but suboptimal] Represent each accented character as <base-char>?, because, from Bash's perspective, the accented character, as reported by the filesystem, amounts to the base character e followed by another character (the combining diacritic; adjust accordingly for multiple combining diacritics). (This approach is obviously suboptimal, because it won't match just é, but any two-character sequence starting with e):
      ls e?cole.[pu]mdl
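
The false-positive risk is easy to reproduce with plain ASCII names, independently of any Unicode handling. A small sketch using a scratch directory; the filenames are made up for illustration:

```shell
dir=$(mktemp -d)
touch "$dir/excole.pmdl" "$dir/ecole.pmdl"

matches=""
for f in "$dir"/e?cole.[pu]mdl; do
  if [ -e "$f" ]; then
    matches="$matches${f##*/} "   # keep only the filename part
  fi
done
echo "matched: $matches"
rm -rf "$dir"
```

Here excole.pmdl is matched although it contains no accented character, while ecole.pmdl is not, because ? requires exactly one character between e and cole.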

      The ext filesystem used by many Linux distros stores filenames exactly as specified:

      In other words: a file created with an NFC name is stored as such, as is a file with an NFD name.

      Therefore, ext considers NFC and NFD distinct forms, because their byte-level representations differ, so it even allows files of the (conceptually) same name that differ only in Unicode normal form - for instance, files named $'e\xcc\x81cole' and $'\xc3\xa9cole' are indistinguishable when printed by ls (école), but are distinct files(!).
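
The byte-level difference between the two names can be made visible with od. A sketch using the two spellings from the paragraph above; the hex helper is hypothetical, and the entry count printed at the end depends on the filesystem (two entries are expected on ext, per the description above):

```shell
# Hypothetical helper: print the bytes of $1 as one run of hex digits.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d '[:space:]'; }

nfd=$'e\xcc\x81cole'   # e followed by the combining acute accent (U+0301)
nfc=$'\xc3\xa9cole'    # precomposed é (U+00E9)

nfd_hex=$(hex "$nfd")
nfc_hex=$(hex "$nfc")
echo "NFD: $nfd_hex"
echo "NFC: $nfc_hex"

# Create both names in a scratch directory and count the resulting entries.
dir=$(mktemp -d)
touch "$dir/$nfd" "$dir/$nfc"
ls "$dir" | wc -l
rm -rf "$dir"
```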

      Consequently - and appropriately - Bash versions on Linux do not recognize NFC / NFD equivalence, even in versions >= 4.3.30 (unlike on macOS).

      Caveat: dash, which acts as /bin/sh on Ubuntu, for instance, as of Ubuntu 16.04 is not locale-aware (multi-byte character-encoding aware), at least when globbing: globbing symbol ? matches a single byte rather than a single character (as defined by the active locale's character encoding, as reflected in locale category LC_CTYPE, which is typically UTF-8). Thus, in order to match a single non-ASCII character, you need to know how many bytes the UTF-8 encoding of that character is composed of, and use a ? for each byte; for instance, NFC é (2 bytes) would have to be matched with ??.[1]

      This may matter when you use globbing inside scripts whose shebang line is #!/bin/sh.
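
When a script has to run under such a shell, the number of ? symbols needed for a given character can be derived from its byte length. A sketch, assuming the character is supplied in the same encoding that is used on disk; byte_wildcards is a hypothetical name:

```shell
# Hypothetical helper: print one "?" per byte of $1.
byte_wildcards() {
  # wc -c counts bytes regardless of the locale; $(( )) strips any padding.
  n=$(( $(printf '%s' "$1" | wc -c) ))
  printf '%*s' "$n" '' | tr ' ' '?'
}

e_acute=$(printf '\303\251')   # NFC é, i.e. UTF-8 bytes 0xc3 0xa9, in octal
pattern="$(byte_wildcards "$e_acute")cole.[pu]mdl"
echo "$pattern"
```

The resulting pattern is only meaningful for shells that match ? against bytes; in a locale-aware shell such as Bash with a UTF-8 locale, a single ? per character is the right choice.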

      In practice, NFD strings are rarely encountered, so with NFC strings used both to create files and match them later by globs, the problem with differing Unicode normal forms that macOS experiences rarely surfaces on Linux.

      [1] dash aims to be a fast, POSIX-compliant shell implementation (that is largely confined to POSIX features), but in this case it appears to fall short: the part of the POSIX spec. describing the pattern-matching notation clearly talks about characters, not bytes: A <question-mark> is a pattern that shall match any character.
      Support for multi-byte character encodings is described in the section on Character Sets.
