How to transform HTML entities via io.Reader


Question

My Go program makes HTTP requests whose response bodies are large JSON documents in which strings encode the ampersand character & as &amp; (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with json.Decoder.

An example response looks like this:

{"name":"A&amp;B","comment":"foo&amp;bar"}

The corresponding object would be:

pkg.Object{Name:"A&B", Comment:"foo&bar"}

The documents come in various shapes, so it is not feasible to convert the HTML entities after decoding. Ideally the conversion would be done by wrapping the response body reader in another reader that performs the transformation.

Is there an easy way to wrap http.Response.Body in some io.ReadCloser that replaces all instances of &amp; with & (or, in the general case, replaces any string X with string Y)?

I suspect this is possible with x/text/transform but don't immediately see how. In particular, I'm concerned about edge cases in which an entity spans batches of bytes: for example, one batch ends with &am and the next batch starts with p;. Is there some library or idiom that handles that situation gracefully?
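To make the edge case concrete, here is a minimal sketch of what goes wrong when each batch is processed on its own. The helper name naiveReplace and the hand-split chunks are assumptions made up for this illustration; they are not part of the question.

```go
package main

import (
	"bytes"
	"fmt"
)

// naiveReplace applies the replacement to each chunk independently,
// the way a wrapper with no carry-over state between Reads would.
func naiveReplace(chunks [][]byte, find, repl []byte) string {
	var out bytes.Buffer
	for _, c := range chunks {
		out.Write(bytes.ReplaceAll(c, find, repl))
	}
	return out.String()
}

func main() {
	find, repl := []byte("&amp;"), []byte("&")

	// The entity arrives in a single chunk: it is replaced.
	whole := [][]byte{[]byte(`{"name":"A&amp;B"}`)}
	// The entity is split across two chunks: neither chunk contains
	// the full sequence, so nothing is replaced.
	split := [][]byte{[]byte(`{"name":"A&am`), []byte(`p;B"}`)}

	fmt.Println(naiveReplace(whole, find, repl)) // {"name":"A&B"}
	fmt.Println(naiveReplace(split, find, repl)) // {"name":"A&amp;B"}
}
```

Any wrapper therefore has to carry the trailing bytes of a possible match over to the next Read.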

Answer

If you don't want to rely on an external package such as golang.org/x/text/transform (and its transform.Reader), you can write a custom io.Reader wrapper.

The following handles the edge case where the find sequence spans two Read() calls:

import (
    "bytes"
    "io"
)

type fixer struct {
    r        io.Reader // source reader
    fnd, rpl []byte    // find & replace sequences
    partial  int       // length of a partial find match held back by the previous Read()
}

// Read satisfies the io.Reader interface.
// The replacement is written back into b, so this assumes len(f.rpl) <= len(f.fnd)
// (true for "&amp;" -> "&") and that len(b) is larger than len(f.fnd).
func (f *fixer) Read(b []byte) (int, error) {
    off := f.partial
    if off > 0 {
        copy(b, f.fnd[:off]) // re-insert the partial match held back by the previous Read
    }

    n, err := f.r.Read(b[off:])
    n += off
    f.partial = 0 // the held-back bytes are now part of b[:n]

    if err != io.EOF {
        // at EOF there is no next Read, so a trailing partial match is emitted as-is
        f.partial = partialFind(b[:n], f.fnd)
        n -= f.partial // hold back any partial match until the next Read
    }

    fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)

    return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}

Along with this helper (which could probably use some optimization):

// returns number of matched bytes, if byte-slice ends in a partial-match
func partialFind(b, find []byte) int {
    for n := len(find) - 1; n > 0; n-- {
        if bytes.HasSuffix(b, find[:n]) {
            return n
        }
    }
    return 0 // no match
}

Working playground example.

Note: to test the edge-case logic, one could use a narrowReader to force short Reads so that a match is split across Reads, as in the validation playground example.
