Best way to shorten UTF8 string based on byte length


Question

A recent project called for importing data into an Oracle database. The program that does this is a C# .NET 3.5 application, and I'm using the Oracle.DataAccess connection library to handle the actual inserts.

I ran into a problem where I'd receive this error message when inserting a particular field:

ORA-12899 Value too large for column X

I used Field.Substring(0, MaxLength); but still got the error (though not for every record).

Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8. The field's length is defined in bytes, not characters.
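To make the mismatch concrete, here is a minimal sketch comparing the two lengths. The class and method names are hypothetical and the sample value is only an illustration; it assumes the column limit is counted in UTF-8 bytes, as described above.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class ByteLengthDemo
{
    // What Substring and String.Length work with: UTF-16 code units.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // What a byte-limited UTF-8 column has to store.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

For the sample value "caf\u00E9" (the word café), CharCount returns 4 while Utf8ByteCount returns 5, so a value trimmed to MaxLength characters can still exceed MaxLength bytes.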

This gets me to my question. What is the best way to trim my string so that it fits MaxLength?

My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?

Recommended answer

Here are two possible solutions: a LINQ one-liner that processes the input left to right, and a traditional for-loop that processes the input right to left. Which direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general recommendation. The choice between LINQ and traditional code is probably a matter of taste (or maybe speed).

If speed matters, one could think about accumulating the byte length of each character until the maximum length is reached, instead of calculating the byte length of the whole string in each iteration. But I am not sure this will work, because I don't know UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters.

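using System;
using System.Linq;
using System.Text;

// Left to right: keep characters while the UTF-8 size of the prefix still fits.
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
        {
            // Measure a high surrogate together with its low surrogate,
            // so a surrogate pair is either kept whole or dropped whole.
            Boolean pair = Char.IsHighSurrogate(c)
                && i + 1 < input.Length
                && Char.IsLowSurrogate(input[i + 1]);
            return Encoding.UTF8.GetByteCount(input.Substring(0, pair ? i + 2 : i + 1)) <= maxLength;
        })
        .ToArray());
}

// Right to left: drop trailing characters until the UTF-8 size of the prefix fits.
public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        // Never end the result between the two halves of a surrogate pair.
        if (i + 1 < input.Length && Char.IsHighSurrogate(input[i]) && Char.IsLowSurrogate(input[i + 1]))
        {
            continue;
        }

        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}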
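
The accumulating idea mentioned above can be sketched as follows. This is not part of the original answer; the class and method names are hypothetical. It relies on the fact that UTF-8 encodes every code point independently, so the byte length of a string is the sum of the byte lengths of its code points. The catch in .NET is that a char is a UTF-16 code unit, not a code point, so a surrogate pair has to be counted (and kept or dropped) as one 4-byte unit.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class Utf8Truncation
{
    // Returns the longest prefix of input whose UTF-8 encoding fits in maxLength bytes.
    // Single pass, no intermediate substrings.
    public static string LimitByteLengthAccumulating(string input, int maxLength)
    {
        int bytes = 0;
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            int units = 1; // UTF-16 code units consumed by this code point
            int size;      // UTF-8 bytes needed by this code point

            if (c < 0x80)
            {
                size = 1;
            }
            else if (c < 0x800)
            {
                size = 2;
            }
            else if (char.IsHighSurrogate(c)
                     && i + 1 < input.Length
                     && char.IsLowSurrogate(input[i + 1]))
            {
                size = 4;  // a valid surrogate pair is one code point
                units = 2;
            }
            else
            {
                // Any other BMP character takes 3 bytes. An unpaired surrogate is
                // assumed to be encoded as the 3-byte replacement character U+FFFD.
                size = 3;
            }

            if (bytes + size > maxLength)
            {
                break;
            }

            bytes += size;
            i += units;
        }

        return input.Substring(0, i);
    }
}
```

With the input "caf\u00E9" and a limit of 4 bytes this returns "caf", because the final character needs 2 bytes and would bring the total to 5. Note that this only protects code points; a base character followed by combining marks can still be separated at the cut.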
