Using BeautifulSoup4 with Google Translate

Question

I am currently working through the Web Scraping section of Automate the Boring Stuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.

I inspected the HTML of a page where 'Explanation' is the translated word:

<span id="result_box" class="short_text" lang="en">  
    <span class>Explanation</span>
</span>

Using BeautifulSoup4, I tried different selectors, but none of them returns the translated word. Here are a few examples I tried; they return no results at all:

soup.select('span[id="result_box"] > span')  
soup.select('span span') 

I even copied the selector directly from the browser's Developer Tools, which gave me #result_box > span. This also returns no results.

Can someone explain how to use BeautifulSoup4 for this purpose? This is my first time using BeautifulSoup4, but I think I am using it more or less correctly, because the selector

soup.select('span[id="result_box"]')

gets me the outer span element**

[<span class="short_text" id="result_box"></span>]

**I am not sure why the lang="en" attribute is missing, but I am fairly certain I have located the correct element regardless.

Here is the full code:

import bs4, requests

url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()  # must be called; without the parentheses no status check happens
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)

If I save the Google Translate page as an offline HTML file and then build the soup object from that file, the element is located without any problem.

import bs4

file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)

Recommended answer

The result_box span is the correct element, but your code only works against the saved file because saving from the browser captures what you see, including the dynamically generated content. With requests you get only the page source itself, without any dynamically generated content. The translation is produced by an AJAX call to the URL below:

"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"

For your request it returns:

[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
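Note that this response is not valid JSON, because empty array slots are written as bare commas. As a rough sketch of how it could be parsed, the Python below fills the empty slots with null and then reads the result with the json module. The helper name fill_empty_slots and the positions used to pick out the translation and the alternatives are assumptions inferred from this one sample response, not a documented format. The regular expression is also naive: it would corrupt a string value that itself contains ",," or "[,".

```python
import json
import re

# The response body quoted above, copied verbatim.
RAW = (
    '[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,'
    '[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],'
    '["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],'
    '[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]'
)


def fill_empty_slots(text):
    """Insert null into empty array slots so the text becomes valid JSON.

    Hypothetical helper: an empty slot is any position directly between
    "[" or "," and a following ",", or between "," and a following "]".
    """
    return re.sub(r'(?<=[\[,])(?=,)|(?<=,)(?=\])', 'null', text)


data = json.loads(fill_empty_slots(RAW))

# Assumed positions, based only on the sample response above.
primary = data[0][0][0]                          # main translation
source_text = data[0][0][1]                      # original text
alternatives = [alt[0] for alt in data[5][0][2]]  # other candidate translations

print(primary)
print(alternatives)
```
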

So you will either have to mimic the request, passing all the necessary parameters, or use something that supports dynamic content, such as Selenium.
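As a minimal sketch of the first option, the Python below rebuilds the query string of the AJAX URL shown above from its individual parameters. The function name build_translate_url is hypothetical, and the parameter list is copied from that one URL rather than from any documentation. The tk value appears to be a token tied to the specific request, so reusing it for a different query is an assumption that may well not hold. The sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Endpoint taken from the AJAX URL quoted in the answer.
BASE_URL = "https://translate.google.ca/translate_a/single"

# The repeated "dt" values seen in that URL, in the same order.
DT_VALUES = ("at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t")


def build_translate_url(text, source_lang, target_lang, token):
    """Rebuild the request URL from its parts (hypothetical helper).

    `token` is the "tk" parameter; how it is computed is not covered here.
    """
    # A list of pairs (not a dict) keeps the repeated "dt" keys and the order.
    params = [
        ("client", "t"),
        ("sl", source_lang),
        ("tl", target_lang),
        ("hl", "en"),
    ]
    params += [("dt", value) for value in DT_VALUES]
    params += [
        ("ie", "UTF-8"),
        ("oe", "UTF-8"),
        ("source", "bh"),
        ("ssel", "0"),
        ("tsel", "0"),
        ("kc", "1"),
        ("tk", token),
        ("q", text),  # urlencode percent-encodes the text as UTF-8
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_translate_url("治具", "zh-CN", "en", "902911.786207")
print(url)
```

The resulting string could then be passed to requests.get(url), typically with a browser-like User-Agent header, and the body handled as plain text rather than with res.json(), since it is not strict JSON.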
