How to request a page with a specific charset in Go?


Question

I am rewriting a piece of software from Python to Go. I have run into a problem with http.Get when fetching a page encoded in iso-8859-1: the Python version works, but the Go version does not.

This works (Python):

import requests

r = requests.get("https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=show_document&print=yes&highlight_docid=aza://27-01-2016-5A_718-2015")
r.encoding = 'iso-8859-1'  # override the encoding used to decode r.text
# The with-block closes the file even if the write fails.
with open('tmp_python.txt', 'w') as f:
    f.write(r.text.strip())

This does not work (Go):

package main

import (
    "golang.org/x/net/html/charset"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    link := "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=show_document&print=yes&highlight_docid=aza://27-01-2016-5A_718-2015"
    resp, err := http.Get(link)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    reader, err := charset.NewReader(resp.Body, "iso-8859-1")
    if err != nil {
        panic(err)
    }

    content, err := ioutil.ReadAll(reader)
    if err != nil {
        panic(err)
    }
    log.Println(string(content))
}

Both my browser and Python give the same result, but the Go version does not. How can I fix that?

Edit

I think a redirect is happening with Go. This does not happen with Python.

Edit 2

My question was badly written. I had two problems: 1) the encoding and 2) the wrong page being returned. I do not know if they are related.

I will open a new thread for the second problem.

Recommended answer

The second argument of NewReader is documented as contentType and not as a character encoding. This means it expects the value of the Content-Type field in the HTTP header instead. Thus, the proper usage would be:

reader, err := charset.NewReader(resp.Body, "text/html; charset=iso-8859-1")

This works as expected.
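The difference between a bare charset name and a full Content-Type value can be illustrated with the standard library alone. The sketch below uses mime.ParseMediaType to extract the charset parameter; charsetParam is a hypothetical helper written for this illustration, and the idea that NewReader reads the charset out of contentType in a comparable way is an assumption based on its documentation, not something confirmed here.

```go
package main

import (
	"fmt"
	"mime"
)

// charsetParam is a hypothetical helper: it returns the charset parameter of a
// Content-Type style value, and false if the value has no such parameter or
// cannot be parsed as a media type.
func charsetParam(contentType string) (string, bool) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", false
	}
	cs, ok := params["charset"]
	return cs, ok
}

func main() {
	values := []string{
		"iso-8859-1",                    // bare charset name, as in the question
		"text/html; charset=iso-8859-1", // full Content-Type value, as in the answer
	}
	for _, v := range values {
		cs, ok := charsetParam(v)
		fmt.Printf("%q -> charset=%q found=%t\n", v, cs, ok)
	}
}
```

Only the second value carries a charset parameter, which is consistent with the bare name being ignored as an encoding hint.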

Note that if the given contentType contains no usable charset definition, NewReader looks at the body itself to determine the charset. And while the HTTP header of this page clearly states

Content-Type: text/html;charset=iso-8859-1

the HTML document actually returned declares a different charset:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

With the wrong contentType in your code, NewReader therefore uses the charset that is wrongly declared in the HTML.
