How to transform HTML entities via io.Reader
Problem description
My Go program makes HTTP requests whose response bodies are large JSON documents whose strings encode the ampersand character & as &amp; (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with json.Decoder.
An example response looks like this:
{"name":"A&amp;B","comment":"foo&amp;bar"}
The corresponding object would be:
pkg.Object{Name:"A&B", Comment:"foo&bar"}
The documents come in various shapes so it's not feasible to convert the HTML entities after decoding. Ideally it would be done by wrapping the response body reader in another reader that performs the transformation.
Is there an easy way to wrap the http.Response.Body in some io.ReadCloser which replaces all instances of &amp; with & (or in the general case, replaces any string X with string Y)?
I suspect this is possible with x/text/transform but don't immediately see how. In particular, I'm concerned about edge cases wherein an entity spans batches of bytes. That is, one batch ends with &am and the next batch starts with p;, for example. Is there some library or idiom that gracefully handles that situation?
Recommended answer
If you don't want to rely on an external package such as golang.org/x/text/transform (which provides transform.Reader), you can write a custom io.Reader wrapper.
The following will handle the edge case where the find element may span two Read() calls:
import (
	"bytes"
	"io"
)

type fixer struct {
	r        io.Reader // source reader
	fnd, rpl []byte    // find & replace sequences
	partial  int       // length of a partial find match held back from the previous Read()
}

// Read satisfies the io.Reader interface.
// The result is copied back into b, so this assumes rpl is no longer than fnd
// (true for "&amp;" -> "&") and that len(b) is at least len(fnd).
func (f *fixer) Read(b []byte) (int, error) {
	off := f.partial
	if off > 0 {
		copy(b, f.fnd[:off]) // copy any partial match from previous `Read`
	}
	n, err := f.r.Read(b[off:])
	n += off
	f.partial = 0 // reset; recomputed below unless this is the final Read
	if err != io.EOF {
		// no need to check for partial match, if EOF, as that is the last Read!
		f.partial = partialFind(b[:n], f.fnd)
		n -= f.partial // lop off any partial bytes
	}
	fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)
	return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}
Along with this helper (which could probably use some optimization):
// partialFind returns the number of matched bytes if the byte slice
// ends in a partial match of find, and 0 otherwise.
func partialFind(b, find []byte) int {
	for n := len(find) - 1; n > 0; n-- {
		if bytes.HasSuffix(b, find[:n]) {
			return n
		}
	}
	return 0 // no match
}
Working playground example.
Note: to test the edge-case logic, one could use a narrowReader to ensure short Reads and force a match to be split across Reads, like this: validation playground example