Python get request returning different HTML than view source

Problem description

I'm trying to extract the fanfiction from an Archive of Our Own URL so that I can run some linguistic analysis on it with the NLTK library. However, every attempt at scraping the HTML from the URL returns everything BUT the fanfic (and the comments form, which I don't need).

First I tried the built-in urllib library (with BeautifulSoup):

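from urllib import request  # "import urllib" alone does not define the name "request"
from bs4 import BeautifulSoup

# Fetch the raw HTML; no JavaScript is executed at this point.
html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html.parser")
soup.prettify()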

Then I found out about the Requests library, and that the User-Agent header could be part of the problem, so I tried this, with the same result:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
requests.get("http://archiveofourown.org/works/6846694", headers=headers, timeout=5).text

Then I found out about Selenium and PhantomJS, so I installed them and tried this, but again got the same result:

from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.PhantomJS()
browser.get("http://archiveofourown.org/works/6846694")
soup = BeautifulSoup(browser.page_source, "html.parser")
soup.prettify()

Am I doing something wrong in any of these attempts, or is this an issue with the server?

Recommended answer

The last approach is a step in the right direction if you need the complete page source with all the JavaScript executed and async requests made. You are just missing one thing: you need to give PhantomJS time to load the page before reading the source (pun intentional).

You also need to click "Proceed" to confirm that you agree to see the adult content:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.get("http://archiveofourown.org/works/6846694")

wait = WebDriverWait(driver, 10)

# click proceed
proceed = wait.until(EC.presence_of_element_located((By.LINK_TEXT, "Proceed")))
proceed.click()

# wait for the content to be present
wait.until(EC.presence_of_element_located((By.ID, "workskin")))

soup = BeautifulSoup(driver.page_source, "html.parser")
soup.prettify()
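
Once the page source is available, a quick way to tell whether it holds the story or only the "Proceed" page is to look for the element with id "workskin" and pull out its text. The following is a minimal sketch that uses only the standard library so it runs without a browser or network access. The names ElementTextExtractor and extract_text_by_id are made up for illustration, and the two HTML strings are hypothetical stand-ins, not real markup from the site.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None  # tag name of the matched element, once found
        self.depth = 0          # open tags with that name, including the match
        self.done = False       # True after the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.target_tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.target_tag = tag
                self.depth = 1
        elif tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.target_tag is None or self.done:
            return
        if tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep text only while we are inside the matched element.
        if self.target_tag is not None and not self.done:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text_by_id(html, element_id):
    """Return the element's text, or None if no element has that id."""
    parser = ElementTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    if parser.target_tag is None:
        return None
    return "\n".join(parser.chunks)


# Hypothetical stand-ins for the two kinds of page source (assumed markup).
gate_page = '<html><body><a href="#">Proceed</a></body></html>'
work_page = (
    '<html><body><div id="workskin">'
    '<div class="userstuff"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '</div><div id="comments">Comment form</div></body></html>'
)

print(extract_text_by_id(gate_page, "workskin"))
print(extract_text_by_id(work_page, "workskin"))
```

With the soup object built above, soup.find(id="workskin") locates the same element, and calling get_text() on the result gives its text. The point of the check is the None case: it shows that the page was read before the content had loaded.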
