freebcp: "Unicode data is odd byte size for column. Should be even byte size"


Question

This file works fine (UTF-8):

$ cat ok.txt
291054  Ţawī Rifā

This file causes an error (UTF-8):

$ cat bad.txt
291054  Ţawī Rifā‘

Here are the messages:

$ freebcp 'DB.dbo.table' in bad.txt ... -c
Starting copy...
Msg 20050, Level 4
Attempt to convert data stopped by syntax error in source field

Msg 4895, Level 16, State 2
Server '...', Line 1
    Unicode data is odd byte size for column 2. Should be even byte size.
Msg 20018, Level 16
General SQL Server error: Check messages from the SQL Server

The only difference is the last character, which is unicode 2018 (left single quotation mark)

Any idea what is causing this error?

The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)

The column in question is nvarchar(200).

Here's the packet sent right before the error:

packet.c:741:Sending packet
0000 07 01 00 56 00 00 01 00-81 02 00 00 00 00 00 08 |...V.... ........|
0010 00 38 09 67 00 65 00 6f-00 6e 00 61 00 6d 00 65 |.8.g.e.o .n.a.m.e|
0020 00 69 00 64 00 00 00 00-00 09 00 e7 90 01 09 04 |.i.d.... ...ç....|
0030 d0 00 34 04 6e 00 61 00-6d 00 65 00 d1 ee 70 04 |Ð.4.n.a. m.e.Ñîp.|
0040 00 13 00 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18      -                        |.f....|


Answer

This issue has nothing to do with UTF-8 given that the data being transmitted, as shown in the transmission packet (bottom of the question) is UTF-16 Little Endian (just as SQL Server would be expecting). And it is perfectly good UTF-16LE, all except for the missing final byte, just like the error message implies.

The problem is most likely a minor bug in freetds that incorrectly applies logic meant to strip trailing spaces from variable-length string fields. There are no trailing spaces, you say? Well, if the final character hadn't gotten chopped off it would be a little clearer (but, if it hadn't gotten chopped off, there wouldn't be this error). So, let's look at the packet to see if we can reconstruct it.

The error in the data is probably being overlooked because the packet contains an even number of bytes. But not all fields are double-byte, so the packet doesn't need to be an even number of bytes. If we know what the good data is (prior to the error), then we can find a starting point in the data and move forwards. It is best to start with Ţ as it will hopefully be above the 255 / FF value and hence take 2 bytes. Anything below that will have a 00, and many of the characters have 00 on both sides. While we should be able to assume Little Endian encoding, it is best to know for certain. To that end, we need at least one character that has two non-00 bytes, and bytes that are different (one of the characters is 01 for both bytes, which does not help determine ordering). The first character of this string field, Ţ, confirms this as it is Code Point 0162 yet shows up as 62 01 in the packet.
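The byte-order check above can be reproduced in a few lines of Python (a quick sketch for illustration, not part of the original answer):

```python
# Encode each character of the failing value to UTF-16LE and print its bytes.
for ch in "Ţawī Rifā‘":
    print(f"U+{ord(ch):04X} -> {' '.join(f'{b:02X}' for b in ch.encode('utf-16-le'))}")

# U+0162 (Ţ) encodes as 62 01 -- low byte first -- confirming Little Endian.
assert "Ţ".encode("utf-16-le") == b"\x62\x01"
# U+2018 (left single quotation mark) encodes as 18 20; its final byte is 0x20.
assert "‘".encode("utf-16-le") == b"\x18\x20"
```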

Below are the characters, in the same order as the packet, their UTF-16 LE values, and a link to their full details. The first character's byte sequence of 62 01 gives us our starting point, and so we can ignore the initial 00 13 00 of line 0040 (they have been removed in the copy below for readability). Please note that the "translation" shown to the right does not interpret Unicode, so the 2-byte sequence of 62 01 is displayed as 62 by itself (i.e. lower-case Latin "b") and 01 by itself (i.e. non-printable character; displayed as ".").

0040 xx xx xx 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|  
0050 00 66 00 01 01 18 ??   -                        |.f....|




  • Ţ -- 62 01 -- http://unicode-table.com/en/0162/
  • a -- 61 00 -- http://unicode-table.com/en/0061/
  • w -- 77 00 -- http://unicode-table.com/en/0077/
  • ī -- 2B 01 -- http://unicode-table.com/en/012B/
  • (space) -- 20 00 -- http://unicode-table.com/en/0020/
  • R -- 52 00 -- http://unicode-table.com/en/0052/
  • i -- 69 00 -- http://unicode-table.com/en/0069/
  • f -- 66 00 -- http://unicode-table.com/en/0066/
  • ā -- 01 01 -- http://unicode-table.com/en/0101/
  • ‘ -- 18 20 -- http://unicode-table.com/en/2018/
As you can see, the last character is really 18 20 (i.e. a byte-swapped 20 18 due to the Little Endian encoding), not 01 18 as it might appear if reading the packet starting at the end. Somehow, the final byte -- hex 20 -- is missing, hence the Unicode data is odd byte size error.

Now, 20 by itself, or followed by 00, is a space. This would explain why @GordThompson was able to get it working by adding an additional character to the end (the final character was no longer trimmable). This could be further proven by ending with another character that is a U+20xx Code Point. For example, if I am correct about this, then ending with ⁄ -- Fraction Slash U+2044 -- would have the same error, while ending with ⅄ -- Turned Sans-Serif Capital Y U+2144 -- even with the ‘ just before it, should work just fine (@GordThompson was kind enough to prove that ending with ⅄ did work, and that ending with ⁄ resulted in the same error).
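The suspected failure mode can be sketched as follows. This is a hypothetical reproduction of the trimming logic described above, not the actual freetds code: a byte-wise right-trim of 0x20, blind to the fact that the buffer is UTF-16LE, eats the final byte of U+2018 and leaves an odd-sized buffer.

```python
def naive_rtrim(data: bytes) -> bytes:
    """Strip trailing 0x20 bytes the way a single-byte-aware rtrim
    would -- ignoring that the data is two-byte UTF-16LE code units."""
    return data.rstrip(b"\x20")

utf16 = "Ţawī Rifā‘".encode("utf-16-le")  # ends in 18 20 (U+2018)
trimmed = naive_rtrim(utf16)

assert len(utf16) % 2 == 0    # valid UTF-16LE: even byte count
assert len(trimmed) % 2 == 1  # after the naive trim: odd byte count
assert trimmed.endswith(b"\x18")  # the stray 18, matching the packet dump
```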

If the input file is null (i.e. 00) terminated, then it could simply be the 20 00 ending sequence that does it, in which case ending with a newline might fix it. This can also be proven by testing a file with two lines: line 1 is the existing row from bad.txt, and line 2 is a line that should work. For example:

      291054  Ţawī Rifā‘
      999999  test row, yo!
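
The two-line experiment above can be sketched in Python (a quick sketch; the tab delimiter and the file name are assumptions, tab being the freebcp -c default field terminator):

```python
# Build the two-line test file described above: row 1 is the failing
# row from bad.txt, row 2 is a row that should load fine.
rows = [("291054", "Ţawī Rifā‘"), ("999999", "test row, yo!")]
with open("two-line.txt", "w", encoding="utf-8", newline="\n") as f:
    for row_id, name in rows:
        f.write(f"{row_id}\t{name}\n")
```

Loading this file with freebcp in character mode (-c) would then distinguish the two outcomes discussed next.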
      

If the two-line file shown directly above works, that proves that it is the combination of a U+20xx Code Point and that Code Point being the last character (of the transmission more than of the file) that exposes the bug. BUT, if this two-line file also gets the error, then it proves that having a U+20xx Code Point as the last character of a string field is the issue (and it would be reasonable to assume that this error would happen even if the string field were not the final field of the row, since the null terminator for the transmission has already been ruled out in this case).

It seems like either this is a bug with freetds / freebcp, or perhaps there is a configuration option to not have it attempt trimming trailing spaces, or maybe a way to get it to see this field as being NCHAR instead of NVARCHAR.

Both @GordThompson and the O.P. (@NeilMcGuigan) have tested and confirmed that this issue exists regardless of where the string field is in the file: in the middle of a row, at the end of the row, on the last row, and not on the last row. Hence it is a general issue.

And in fact, I found the source code, and it makes sense that the issue would happen since there is no consideration for multi-byte character sets. I will file an Issue on the GitHub repository. The source for the rtrim function is here:

https://github.com/FreeTDS/freetds/blob/master/src/dblib/bcp.c#L2267

Regarding this statement:


      The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)

From an encoding stand-point, there is really no difference between UCS-2 and UTF-16. The byte sequences are identical. The only difference is in the interpretation of Surrogate Pairs (i.e. Code Points above U+FFFF / 65535). UCS-2 has the Code Points used to construct Surrogate Pairs reserved, but there was no implementation at that time of any Surrogate Pairs. UTF-16 simply added the implementation of the Surrogate Pairs in order to create Supplementary Characters. Hence, SQL Server stores and retrieves UTF-16 LE data without a problem. The only issue is that the built-in functions don't know how to interpret Surrogate Pairs unless the Collation ends with _SC (for Supplementary Characters), and those Collations were introduced in SQL Server 2012.
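The distinction can be illustrated with a short Python sketch (for illustration only, not from the original answer):

```python
# BMP character: one UTF-16 code unit (2 bytes) -- byte-identical in
# UCS-2 and UTF-16, so either interpretation round-trips the data.
assert "Ţ".encode("utf-16-le") == b"\x62\x01"

# Supplementary character U+1F600: UTF-16 encodes it as a surrogate pair
# (high surrogate D83D, low surrogate DE00), i.e. 4 bytes in UTF-16LE.
# A UCS-2 reader would see two reserved code units instead of one character.
encoded = "\U0001F600".encode("utf-16-le")
assert len(encoded) == 4
assert encoded == b"\x3d\xd8\x00\xde"
```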
