How do I accomplish random reads of a UTF8 file

Question

My understanding is that reads of a UTF8 or UTF16 encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages, for example).

How can I use .NET to skip to an approximate position within the file and read the Unicode text from a semi-random position?

Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for before I start decoding?

Recommended answer

Easy: UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip over all bytes whose leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the starting byte of a proper UTF-8 character, and from there you can decode the following bytes with a regular UTF-8 decoder.
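
The skip-and-resynchronize step can be sketched as follows. This is a minimal illustration in Python rather than .NET, assuming the file contains valid UTF-8; the names `is_continuation`, `read_from_offset`, and the `size` parameter are hypothetical and not part of the original answer. The same byte test, `(b & 0xC0) == 0x80`, applies in any language once you have seeked to the offset.

```python
import codecs


def is_continuation(byte: int) -> bool:
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def read_from_offset(path: str, offset: int, size: int = 64) -> str:
    """Read up to `size` bytes starting at `offset` and decode them as UTF-8.

    Hypothetical helper: assumes the file is valid UTF-8.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)

    # Skip continuation bytes until we reach the start of a character.
    # In valid UTF-8 this skips at most three bytes.
    start = 0
    while start < len(data) and is_continuation(data[start]):
        start += 1

    # An incremental decoder with final=False holds back a character that
    # was cut off at the end of the chunk instead of raising an error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    return decoder.decode(data[start:], final=False)
```

Because the chunk can also end in the middle of a character, the sketch uses an incremental decoder so that a truncated trailing sequence is dropped rather than reported as an error. A real reader would probably carry those leftover bytes into the next read.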
