git, msysgit, accents, utf-8, the definitive answers


Question

I've read in some places that there are problems with git (or just msysgit?) and character encoding - I believe it's only a problem in file names.

What I'd like is some 'definitive' (or at least authoritative) information about:

  1. What exactly are the 'problems'? (The symptoms)
  2. What are the causes? (Briefly)
  3. In what scenarios is this a show stopper?
  4. Is there any resolution in sight, or failing that any workarounds?

I hope this question isn't too vague, I think it would be good to have all of this information in one place to be able to point people to it...

Answer

Update Oct. 2021: With Git 2.34 (Q4 2021), the unicode character width table (used for output alignment) has been updated.

See commit 187fc8b (17 Sep 2021) by Carlo Marcelo Arenas Belón (carenas).
(Merged by Junio C Hamano -- gitster -- in commit 3d875f9, 28 Sep 2021)

unicode: update the width tables to Unicode 14

Update Feb. 2017 (Git 2.12): The character width table has been updated to match Unicode 9.0.
The update_unicode.sh script was moved into contrib/update-unicode: see its README.

Update August 2014 (git 2.1): commit a67c821 (Torsten Bögershausen (tboegi)) adds support for Unicode 7.0.

Update April 2014: commit d813ab9 (Torsten Bögershausen (tboegi)) adds support for Unicode 6.3
(git 1.9.2):

Unicode 6.3 defines more code points as combining or accents.
For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above).
We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them.

Affected codepoints are:

U+0358..U+035C
U+0487
U+05A2, U+05BA, U+05C5, U+05C7
U+0604, U+0616..U+061A, U+0659..U+065F

Earlier unicode standards had defined these as "reserved".

Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed.
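The zero-width rule described in that commit message can be illustrated with a short Python sketch. This is not git's implementation: git_wcwidth() is C code driven by generated tables, whereas the hypothetical display_width() below approximates the same idea with Python's unicodedata module, assuming that general categories Mn and Me stand in for "combining" and that East Asian wide/fullwidth characters take two columns.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns `text` occupies.

    Hypothetical helper: an approximation of the idea behind
    git_wcwidth(), not a port of git's width tables.
    """
    width = 0
    for ch in text:
        # Combining marks (e.g. U+0308) attach to the previous character
        # and take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        # East Asian wide and fullwidth characters take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


if __name__ == "__main__":
    # Precomposed and decomposed forms should align identically.
    print(display_width("\u00f6"), display_width("o\u0308"))
```

With this rule, the precomposed "ö" and the two-codepoint sequence "o" + U+0308 both count as one column, which is what keeps aligned output from drifting.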


Update April 2012: Unicode support was released in version 1.7.10. See this page for notes and the settings you should apply.

Namely:

git config [--global] core.quotepath off
git config [--global] i18n.logoutputencoding utf8
git config [--global] i18n.commitencoding utf8
git config [--global] --unset svn.pathnameencoding

The recodetree check command scans the entire history of a git repository and prints all non-ASCII file names. If the output is empty, no migration is necessary.
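To see why core.quotepath matters: with the default setting, git prints bytes above 0x80 in path names as octal escapes inside double quotes, so a file named "é.txt" is listed as "\303\251.txt". The sketch below is a hypothetical Python helper (unquote_git_path is not part of git) that reverses that quoting, assuming the path is UTF-8 and only the octal, \n, \t, \\ and \" escapes occur.

```python
import re

# Single-character escapes handled by this sketch (an assumed subset).
_SIMPLE = {"n": 0x0A, "t": 0x09, "\\": 0x5C, '"': 0x22}
_ESCAPE = re.compile(r'\\([0-7]{3}|[nt\\"])')


def unquote_git_path(quoted: str) -> str:
    """Turn a C-style quoted path back into a readable UTF-8 name.

    Hypothetical helper for illustration only.
    """
    # Paths that needed no quoting are printed without surrounding quotes.
    if not (len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"'):
        return quoted
    body = quoted[1:-1]
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(body):
        # Copy the literal text between escapes unchanged.
        out += body[pos:match.start()].encode("utf-8")
        token = match.group(1)
        # Three octal digits encode one raw byte; otherwise a simple escape.
        out.append(int(token, 8) if len(token) == 3 else _SIMPLE[token])
        pos = match.end()
    out += body[pos:].encode("utf-8")
    return out.decode("utf-8")


if __name__ == "__main__":
    # ascii() keeps the demo printable on non-UTF-8 consoles.
    print(ascii(unquote_git_path(r'"\303\251.txt"')))
```

Setting core.quotepath to off makes this kind of post-processing unnecessary, because git then prints such names verbatim.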


Update February 2012: patches for UTF-8 support are coming in the 'devel' branch of the msysgit repo on GitHub, including "Update less settings for UTF-8".

The Git for Windows Google+ page mentions:

Karsten Blees' UTF-8 patches for Git for Windows have now been merged to 'devel'.
This means the upcoming release will support Unicode filenames!


May 2011

I believe the msysgit issue 80 has the latest on that bug.
Also described in issue 376.

For example:

This is what happens:

  1. git on Windows operates on file names and treats them essentially as byte streams. In your case, the streams happen to be UTF8 encoded text.

  2. git on Windows asks the runtime to create a file, and passes it the byte stream.

  3. Since internally on Windows everything is Unicode, the runtime converts the byte stream to UTF16 using the currently set locale (aka "codepage").
    That is, it effectively interprets the byte stream as CP949 (Korean) encoded text.
    Apparently, some of the UTF8 byte sequences are invalid CP949 sequences, and the conversion fails ("Invalid argument"); or if the UTF8 sequences happen to be correct CP949 sequences, the result is (most likely) a different character.
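The three steps above can be reproduced in a few lines of Python. This is only a simulation of the misinterpretation, not what the Windows runtime literally executes: the hypothetical misread_utf8_name() encodes a name as UTF-8 and then decodes those bytes with a legacy codepage (CP949 by default), returning None when the bytes are not valid in that codepage.

```python
from typing import Optional


def misread_utf8_name(name: str, codepage: str = "cp949") -> Optional[str]:
    """Simulate a runtime that treats UTF-8 bytes as `codepage` text.

    Hypothetical helper. Returns the misread name, or None when the byte
    sequence is invalid in the codepage (the "Invalid argument" case).
    """
    raw = name.encode("utf-8")  # what git hands to the runtime
    try:
        return raw.decode(codepage)  # what the runtime thinks it received
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # ascii() keeps the output printable regardless of console encoding.
    for candidate in ("readme.txt", "\u00e9.txt", "\u20ac.txt"):
        print(ascii(candidate), "->", ascii(misread_utf8_name(candidate)))
```

Pure ASCII names survive because UTF-8 and CP949 agree on that range; any name with non-ASCII characters either fails to convert or comes back as different characters.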

The true fix should be made in MinGW, though:

It occurs to me that one solution would be this: solve it at the GCC C run-time library level.
That is, for the mingw GCC run-time library on Windows, make it possible via build-time options to be in a mode where the command-line parameters (passed to main()) and file I/O functions use the underlying Windows Unicode API calls, and translate to/from UTF-8 encoding in C's standard function APIs that use byte-strings.
That would "just work" for git perhaps, and could be useful for other Linux-originated open source projects running in the Windows environment.
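As a rough illustration of the translation layer proposed in that comment, the Python sketch below converts between UTF-8 byte strings and the UTF-16 (little-endian) code units that the Windows wide-character API expects. The function names are hypothetical, and a real fix would live in C inside the runtime's file and argument handling rather than in Python.

```python
def utf8_to_wide(raw: bytes) -> bytes:
    """Convert a UTF-8 byte string to UTF-16LE, as a wide API would need.

    Hypothetical helper for illustration only.
    """
    return raw.decode("utf-8").encode("utf-16-le")


def wide_to_utf8(wide: bytes) -> bytes:
    """Convert UTF-16LE data from a wide API back to UTF-8 bytes."""
    return wide.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    name = "\u00e9.txt".encode("utf-8")
    wide = utf8_to_wide(name)
    # The round trip is lossless, independent of the active codepage.
    print(wide_to_utf8(wide) == name)
```

Because neither direction consults the active codepage, the same bytes map to the same file name on every Windows locale.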

ak2 comments that MinGW isn't the right place for this fix:

"MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes.
MinGW, being Minimalist, does not, and never will, attempt to provide a POSIX runtime environment for POSIX application deployment on MS-Windows.
If you want POSIX application deployment on this platform, please consider Cygwin instead."

There is some work in progress on an msysgit variant to support Unicode.
