Python requests.get(url) returning JavaScript code instead of the page HTML
Problem description
I have a very simple problem. I'm trying to get the job description from the HTML of a LinkedIn page, but instead of the page's HTML I get a few lines that look like JavaScript code. I'm very new to this, so any help will be greatly appreciated! Thanks
Here is my code:
import requests
url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
page_html = requests.get(url).text
print(page_html)
When I run this, I don't get the HTML I expect, containing the job description. I just get a few lines of JavaScript code instead.
Recommended answer
Some websites serve different content depending on the client that requests them, and LinkedIn is a good example. When the client is a full browser, the site can deliver "richer" content that is dynamic and styled, much of it built by JavaScript after the initial response. requests does not run JavaScript, so a plain requests.get call only sees that initial script and never the rendered page.
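To check whether a response is one of these script-only pages, a rough heuristic is to compare how much of the body sits inside script tags with how much is visible text. The sketch below uses only the standard library. The function name looks_script_only and the 0.8 threshold are assumptions chosen for illustration, not part of requests or of the original answer.

```python
from html.parser import HTMLParser


class _TextVsScript(HTMLParser):
    """Counts characters inside <script>/<style> versus visible text."""

    def __init__(self):
        super().__init__()
        self._hidden = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = False

    def handle_data(self, data):
        size = len(data.strip())
        if self._hidden:
            self.script_chars += size
        else:
            self.text_chars += size


def looks_script_only(html, threshold=0.8):
    """Return True when at least `threshold` of the non-blank characters
    are script or style content (hypothetical helper, arbitrary threshold)."""
    counter = _TextVsScript()
    counter.feed(html)
    counter.close()
    total = counter.script_chars + counter.text_chars
    if total == 0:
        return True  # nothing readable at all
    return counter.script_chars / total >= threshold
```

Calling looks_script_only(requests.get(url).text) on a page like the one in the question would probably return True, which is a hint that a real browser is needed to render it.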
To solve this problem, follow these steps:
- Download ChromeDriver. Choose the build that matches your OS and your installed Chrome version.
- Extract the driver and put it in a directory of your choice, for example /usr.
- Install Selenium, a Python module, by running pip install selenium. If the installation reports a missing msgpack package, install it first with pip install msgpack.
- Now we are ready to run the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def create_browser(webdriver_path):
    # Create a Selenium-driven Chrome instance that behaves like a real browser
    browser_options = Options()
    # --headless runs Chrome without opening a visible window
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 4 style: driver path goes through Service, options through options=
    browser = webdriver.Chrome(service=Service(webdriver_path), options=browser_options)
    print("Done Creating Browser")
    return browser

url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # change this to the path of your chromedriver
browser.get(url)
page_html = browser.page_source
print(page_html[-10:])  # prints dy></html>
browser.quit()  # close the browser once the page source has been read
Now you have the whole page. I hope this answers your question!
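With the rendered page in page_html, the job description still has to be pulled out of the markup. The sketch below is one possible way to do that with only the standard library: it collects the text inside the first element carrying a given CSS class. The class name "description" and the helper extract_text_by_class are hypothetical; inspect the actual page to find the real element that wraps the job description.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects text inside the first element whose class list contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # > 0 while inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text_by_class(html, class_name):
    """Hypothetical helper: return the text of the first matching element."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, extract_text_by_class(page_html, "description") would return the text of the first element whose class attribute includes "description", or an empty string if no such element exists.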