Python Requests URL with Unicode Parameters


Problem description

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases from Python using the requests library.

Here is an example:

http://translate.google.com/translate_tts?tl=ja&q=ひとつ

However, when I use the Python requests library to download the MP3 that the endpoint returns, the resulting MP3 is blank. I have verified that I can hit this URL from requests with non-Unicode characters (via romaji) and get correct responses back.

Here is a part of the code I am using to make the request

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):

    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.

Edit:

If I encode the Unicode text as UTF-8 and percent-quote it myself, I still get the same response.

# -*- coding: utf-8 -*-

from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):

    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

Which returns this:

http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

Which seems like it should work, but doesn't.
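One way to check whether the encoding step itself is at fault is to compare the hand-quoted query with what a standard encoder produces from a parameter dict. The sketch below is an offline check that sends nothing to the service. It assumes Python 3, where `quote` lives in `urllib.parse` rather than in `urllib` as in the Python 2 code above.

```python
from urllib.parse import quote, unquote, urlencode

text = 'ひとつ'

# Manual approach from the edit above: UTF-8 encode, then percent-quote.
manual_query = 'tl=ja&q=' + quote(text.encode('utf-8'))

# Dict approach: urlencode() percent-encodes str values as UTF-8 by default.
dict_query = urlencode({'tl': 'ja', 'q': text})

print(manual_query)
print(manual_query == dict_query)

# Decoding the escaped value gives back the original kana.
print(unquote('%E3%81%B2%E3%81%A8%E3%81%A4') == text)
```

Both forms yield the same query string, which is consistent with the observation that quoting by hand does not change the response.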

Edit 2:

If I attempt to use urllib/urllib2, I get a 403 error.

Edit 3:

So it seems that this behavior is limited to this endpoint. If I try the following URL, which is on a different endpoint:

http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D

I get the same response from requests and from my browser (they match). The same holds if I send only ASCII characters to the TTS server, as in this URL:

http://translate.google.com/translate_tts?tl=ja&q=sayonara

The responses match again. But if I send Unicode characters to this URL, I get a correct audio file in my browser but not from requests: there I get back an audio file, but it has no sound.

http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

So it seems like this behavior is limited to the Google TTS URL?

Solution

The user agent can be part of the problem; however, it is not the cause in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you are seeing an HTTP 403 response when using urllib2.urlopen(): it sets the user agent to Python-urllib/2.7 (the version might vary).

You found that setting the user agent to Mozilla/5.0 fixed the problem, but that may only work because the API might assume a particular encoding based on the user agent.
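For reference, here is a minimal sketch of how the user agent can be inspected and overridden. It assumes Python 3, where urllib2's functionality lives in `urllib.request`, and it only builds the request object without sending it. The `Mozilla/5.0` value is the one mentioned above, and the example URL is the ASCII one from the question.

```python
import urllib.request

# Default User-Agent that urllib's opener attaches, e.g. "Python-urllib/3.x".
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']

# Build (but do not send) a request that carries a browser-like User-Agent.
req = urllib.request.Request(
    'http://translate.google.com/translate_tts?tl=ja&q=sayonara',
    headers={'User-Agent': 'Mozilla/5.0'},
)

print(default_ua)
# urllib normalizes the stored header name to "User-agent".
print(req.get_header('User-agent'))
```

With requests, the same header can be passed as `requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})`. Overriding the user agent addresses the 403 only. It does not state the encoding explicitly.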

What you should actually do is explicitly specify the URL character encoding with the ie field. Your URL request should look like this:

http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
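As a rough illustration, the URL above can be assembled and checked offline with the standard library. This is a sketch assuming Python 3. `build_tts_url` and `BASE_URL` are hypothetical names introduced here and are not part of any library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'http://translate.google.com/translate_tts'


def build_tts_url(text, lang='ja'):
    """Hypothetical helper: return the request URL with ie=UTF-8 set."""
    params = {'ie': 'UTF-8', 'tl': lang, 'q': text}
    return BASE_URL + '?' + urlencode(params)


tts_url = build_tts_url('ひとつ')
print(tts_url)

# Round-trip the query string to confirm which fields the URL carries.
decoded = parse_qs(urlsplit(tts_url).query)
print(decoded)
```

The same dict can instead be handed to requests via `params=`, which leaves the percent-encoding to the library.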

The API supports kanji, hiragana, and katakana (possibly others?). The kanji, hiragana, and katakana inputs below all produce "nihongo", although the audio produced for the hiragana input has a slightly different inflection from the others.

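import requests

# Unicode escapes for the test phrases (u'' literals work on Python 2 and 3.3+).
one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'
url = 'http://translate.google.com/translate_tts'

for text in one, kanji, hiragana, katakana:
    # ie=UTF-8 tells the service how the percent-encoded q value is encoded.
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print(u"{} -> {}".format(text, r.url))
    # Save the returned MP3 bytes, closing the file when done.
    with open(u'/tmp/{}.mp3'.format(text), 'wb') as f:
        f.write(r.content)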
