How to read unicode characters from data source

Question

The code below reads a data source (following all the reading rules of io.Reader) whose text uses only one-byte UTF-8 encodings:

package main

import (
    "fmt"
    "io"
)

type MyStringData struct {
    str       string
    readIndex int
}

func (myStringData *MyStringData) Read(p []byte) (n int, err error) {

    // convert `str` string to slice of bytes
    strBytes := []byte(myStringData.str)

    // if `readIndex` is GTE source length, return `EOF` error
    if myStringData.readIndex >= len(strBytes) {
        return 0, io.EOF // `0` bytes read
    }

    // get next readable limit (exclusive)
    nextReadLimit := myStringData.readIndex + len(p)

    if nextReadLimit >= len(strBytes) {
        nextReadLimit = len(strBytes)
        err = io.EOF
    }

    // get next bytes to copy and set `n` to its length
    nextBytes := strBytes[myStringData.readIndex:nextReadLimit]
    n = len(nextBytes)

    // copy all bytes of `nextBytes` into `p` slice
    copy(p, nextBytes)

    // increment `readIndex` to `nextReadLimit`
    myStringData.readIndex = nextReadLimit

    // return values
    return
}

func main() {

    // create data source
    src := MyStringData{str: "Hello Amazing World!"} // 学中文

    p := make([]byte, 3) // slice of length `3`

    // read `src` until an error is returned
    for {
        // read up to len(p) bytes from `src`
        n, err := src.Read(p)
        fmt.Printf("%d bytes read, data:%s\n", n, p[:n])

        // handle error
        if err == io.EOF {
            fmt.Println("--end-of-file--")
            break
        } else if err != nil {
            fmt.Println("Oops! some error occurred!", err)
            break
        }
    }
}


Output:

$
$
$ go run src/../Main.go
3 bytes read, data:Hel
3 bytes read, data:lo 
3 bytes read, data:Ama
3 bytes read, data:zin
3 bytes read, data:g W
3 bytes read, data:orl
2 bytes read, data:d!
--end-of-file--
$
$


But the code above cannot correctly read a data source whose text contains UTF-8 encodings longer than one byte, as shown below:

  src := MyStringData{str: "Hello Amazing World!学中文"} 

Output:

$
$
$ go run src/../Main.go
3 bytes read, data:Hel
3 bytes read, data:lo 
3 bytes read, data:Ama
3 bytes read, data:zin
3 bytes read, data:g W
3 bytes read, data:orl
3 bytes read, data:d!�
3 bytes read, data:���
3 bytes read, data:���
2 bytes read, data:��
--end-of-file--
$
$


Following the comments that suggested strings.NewReader(), here is the modified code:

// create data source
src := strings.NewReader("Hello Amazing World!学中文")

// read `src` one rune at a time until an error is returned
for {
    // ReadRune decodes one whole UTF-8 character and reports its size in bytes
    ch, n, err := src.ReadRune()

    // handle the error before printing: at EOF, ReadRune returns no data
    if err == io.EOF {
        fmt.Println("--end-of-file--")
        break
    } else if err != nil {
        fmt.Println("Oops! some error occurred!", err)
        break
    }

    fmt.Printf("%d bytes read, data:%c\n", n, ch)
}


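How can I read unicode characters without a character (one of the multi-byte characters above, for example) being split across two Read calls?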

Recommended answer

Use something from bufio, e.g. a bufio.Reader's ReadRune method, or a bufio.Scanner with a split function that only returns one or more complete runes (using DecodeRune and FullRune from unicode/utf8 to validate, as the stdlib bufio.ScanRunes does).
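As a rough illustration of both suggestions, the sketch below wraps a source in bufio.Reader and in bufio.Scanner with bufio.ScanRunes. The chunkReader type and the helpers runesViaReader and runesViaScanner are hypothetical names introduced only for this example. chunkReader stands in for the question's MyStringData and deliberately hands out at most 3 bytes per Read, so multi-byte characters arrive split across calls.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// chunkReader is a hypothetical stand-in for the question's MyStringData.
// It returns at most `size` bytes per Read call (size is assumed to be > 0),
// so a 3-byte character such as 学 can be split across two calls.
type chunkReader struct {
	data []byte
	size int
}

func (c *chunkReader) Read(p []byte) (int, error) {
	if len(c.data) == 0 {
		return 0, io.EOF
	}
	n := c.size
	if n > len(p) {
		n = len(p)
	}
	if n > len(c.data) {
		n = len(c.data)
	}
	copy(p, c.data[:n])
	c.data = c.data[n:]
	return n, nil
}

// runesViaReader collects runes using bufio.Reader.ReadRune, which keeps
// filling its internal buffer until a full rune is available.
func runesViaReader(src io.Reader) ([]rune, error) {
	br := bufio.NewReader(src)
	var out []rune
	for {
		r, _, err := br.ReadRune()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, r)
	}
}

// runesViaScanner collects one token per rune using bufio.Scanner with the
// stdlib split function bufio.ScanRunes.
func runesViaScanner(src io.Reader) ([]string, error) {
	sc := bufio.NewScanner(src)
	sc.Split(bufio.ScanRunes)
	var out []string
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	return out, sc.Err()
}

func main() {
	const text = "Hello Amazing World!学中文"

	rs, err := runesViaReader(&chunkReader{data: []byte(text), size: 3})
	fmt.Println(string(rs), len(rs), err)

	toks, err := runesViaScanner(&chunkReader{data: []byte(text), size: 3})
	fmt.Println(strings.Join(toks, "|"), err)
}
```

Either wrapper works with any io.Reader, so the original MyStringData could be passed in place of chunkReader without changing its Read method.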

You could do it yourself by buffering incomplete runes in a slice and appending to it with successive reads, but that would just duplicate what Scanner does.
