How do I reverse a UTF-8 string in place?


Question

Recently, someone asked about an algorithm for reversing a string in place in C (http://stackoverflow.com/questions/198199/how-do-you-reverse-a-string-in-place-in-c-or-c). Most of the proposed solutions had trouble with strings that contain multibyte characters. So I was wondering what a good algorithm would be for dealing specifically with UTF-8 strings.

I came up with some code, which I'm posting as an answer, but I'd be glad to see other people's ideas or suggestions. I preferred to use actual code, so I've chosen C#, as it seems to be one of the most popular languages on this site, but I don't mind if your code is in another language, as long as it can be reasonably understood by anyone who is familiar with an imperative language. And, since the point is to see how such an algorithm could be implemented at a low level (by low level I just mean dealing with bytes), the idea is to avoid using libraries for the core code.

Notes:

I'm interested in the algorithm itself, its performance, and how it could be optimized (I mean algorithm-level optimization, not replacing i++ with ++i and such; I'm not really interested in actual benchmarks either).

I don't mean to actually use it in production code or to "reinvent the wheel". This is just out of curiosity and as an exercise.

I'm using C# byte arrays, so I'm assuming you can get the length of the string without running through the string until you find a NUL. That is, I'm not accounting for the complexity of finding the length of the string. But if you're using C, for instance, you could factor that out by calling strlen() before calling the core code.

Edit:

As Mike F points out, my code (and other people's code posted here) is not dealing with composite characters. Some info about those is here: http://www.unicode.org/faq/char_combmark.html. I'm not familiar with the concept, but if that means that there are "combining characters", i.e., characters / code points that are only valid in combination with other "base" characters / code points, a look-up table of such characters could be used to preserve the order of the "global" character ("base" + "combining" characters) when reversing.
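A rough sketch of that look-up-table idea, in C rather than C#. It is not the code from any answer here, and the names (utf8_reverse_clusters, is_combining, and the other helpers) are made up for illustration. The "table" is a stand-in that only knows the Combining Diacritical Marks block (U+0300 to U+036F); a real table would have to come from the Unicode data. It assumes valid UTF-8. The trick is to reverse the bytes of each base + combining group first, then reverse the whole buffer, so every group comes out with its bytes back in their original order.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bytes in buf[0..len) in place. */
static void reverse_bytes(unsigned char *buf, size_t len)
{
    if (len < 2)
        return;
    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        unsigned char tmp = buf[i];
        buf[i] = buf[j];
        buf[j] = tmp;
    }
}

/* Byte length of the UTF-8 sequence starting with lead byte b.
   Assumes valid input; any unexpected byte is treated as length 1. */
static size_t utf8_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Decode the code point stored in the n bytes at p. */
static uint32_t utf8_decode(const unsigned char *p, size_t n)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cp = p[0] & lead_mask[n];
    for (size_t k = 1; k < n; k++)
        cp = (cp << 6) | (p[k] & 0x3F);
    return cp;
}

/* ASSUMPTION: stand-in for a real look-up table. It only recognizes
   the Combining Diacritical Marks block, U+0300..U+036F. */
static int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Reverse s in place, keeping each base character together with the
   combining marks that follow it (hypothetical helper). */
void utf8_reverse_clusters(char *s, size_t len)
{
    unsigned char *buf = (unsigned char *)s;
    size_t i = 0;

    /* Step 1: reverse the bytes inside each base + combining group. */
    while (i < len) {
        size_t start = i;
        size_t n = utf8_len(buf[i]);
        if (n > len - i)
            n = len - i;              /* clamp a truncated tail */
        i += n;                       /* the base character */
        while (i < len) {             /* absorb trailing combining marks */
            size_t m = utf8_len(buf[i]);
            if (m > len - i || !is_combining(utf8_decode(buf + i, m)))
                break;
            i += m;
        }
        reverse_bytes(buf + start, i - start);
    }

    /* Step 2: reverse everything; each group is restored to its
       original byte order, and the groups end up in reverse order. */
    reverse_bytes(buf, len);
}
```

With this, "e" followed by U+0301 (combining acute accent) and then "x" should come out as "x", "e", U+0301, so the accent stays on the "e" instead of landing on the "x".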

Recommended answer

I'd make one pass reversing the bytes, then a second pass that reverses the bytes in any multibyte characters (which are easily detected in UTF-8) back to their correct order.
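A minimal sketch of those two passes in C, assuming the input is valid UTF-8. The function names (utf8_reverse, utf8_reverse_cstr, and the helpers) are hypothetical, and this is one way to read the description, not a reference implementation. After the first pass every multibyte character sits backwards, as a run of continuation bytes (bit pattern 10xxxxxx) followed by its lead byte, which is what makes the second pass easy.

```c
#include <stddef.h>
#include <string.h>

/* Reverse the bytes in buf[0..len) in place. */
static void reverse_bytes(unsigned char *buf, size_t len)
{
    if (len < 2)
        return;
    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        unsigned char tmp = buf[i];
        buf[i] = buf[j];
        buf[j] = tmp;
    }
}

/* True for UTF-8 continuation bytes, which look like 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Reverse a valid UTF-8 string of len bytes in place. */
void utf8_reverse(char *s, size_t len)
{
    unsigned char *buf = (unsigned char *)s;

    /* Pass 1: reverse every byte. Multibyte characters are now in the
       right position but their own bytes are backwards. */
    reverse_bytes(buf, len);

    /* Pass 2: each character is now zero or more continuation bytes
       followed by a lead byte; flip each such run back. */
    size_t i = 0;
    while (i < len) {
        size_t start = i;
        while (i < len && is_continuation(buf[i]))
            i++;
        if (i < len)
            i++;                      /* include the lead (or ASCII) byte */
        reverse_bytes(buf + start, i - start);
    }
}

/* Convenience wrapper for NUL-terminated C strings: pays for one
   strlen() call up front, then runs the core routine. */
void utf8_reverse_cstr(char *s)
{
    utf8_reverse(s, strlen(s));
}
```

Both passes touch each byte a constant number of times, so the whole thing is linear in the byte length and needs no extra buffer. Like the other code discussed here, it treats every code point as its own character, so combining marks end up attached to the wrong base.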

You can definitely handle this inline in a single pass, but I wouldn't bother unless the routine became a bottleneck.
