Pagination giving the first page in every iteration

Question

I'm trying to scrape a paginated website, but my code gets the first page in every iteration. When I click the same link in the browser, the content is different.

url = "http://www.x.y/z/a-b#/page-%s"

for i in range(1, 10):
    url2 = url % str(i)
    # urlToSoup is my own helper (not shown): it fetches the URL and parses the HTML
    soup = urlToSoup(url2)
    print(url2)
    # url2 changes in every iteration,
    # but soup holds the same product list every time

This is the output:

http://www.x.y/z/a-b#/page-1
http://www.x.y/z/a-b#/page-2
http://www.x.y/z/a-b#/page-3
http://www.x.y/z/a-b#/page-4
http://www.x.y/z/a-b#/page-5
http://www.x.y/z/a-b#/page-6
http://www.x.y/z/a-b#/page-7
http://www.x.y/z/a-b#/page-8
http://www.x.y/z/a-b#/page-9

The pager item for page 2 (and similarly for 3, 4, ...) looks as follows:

<a rel="nofollow" href="http://www.x.y/z/a-b#/page-2"> <span>2</span> </a>

Why is the resulting page different when I open the URL in the browser (via a click or via the address bar) than when I fetch it via the code?

Recommended answer

You are adding text to the "fragment identifier" (i.e. the part after the #); see https://www.w3.org/DesignIssues/Fragment.html

The fragment identifier is a string after URI, after the hash, which identifies something specific as a function of the document. For a user interface Web document such as HTML page, it typically identifies a part or view. For example in the object

RFC 3986:

the fragment identifier is separated from the rest of the URI prior to a dereference, and thus the identifying information within the fragment itself is dereferenced solely by the user agent, regardless of the URI scheme. Although this separate handling is often perceived to be a loss of information, particularly for accurate redirection of references as resources move over time, it also serves to prevent information providers from denying reference authors the right to refer to information within a resource selectively. Indirect referencing also provides additional flexibility and extensibility to systems that use URIs, as new media types are easier to define and deploy than new schemes of identification.

So you are adding your index to a part of the URL that is not sent to the server. It is for client-side use only ("dereferenced solely by the user agent"). The server sees the same URL in every iteration.
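To make this concrete, here is a minimal sketch using only the Python 3 standard library. It splits each URL from the question into the part an HTTP client would request and the fragment that stays on the client. The helper name `request_url` is made up for this illustration and is not part of the asker's code.

```python
from urllib.parse import urldefrag

template = "http://www.x.y/z/a-b#/page-%s"


def request_url(url):
    """Return the URL without its fragment, i.e. what is sent to the server."""
    return urldefrag(url).url


# Collect the server-visible URL for every page in the asker's loop.
sent = {request_url(template % i) for i in range(1, 10)}

# The page number lives only in the fragment, which never leaves the client.
fragments = [urldefrag(template % i).fragment for i in range(1, 10)]

print(sent)
print(fragments[:3])
```

All nine iterations collapse to a single server-visible URL, which is why the same product list comes back each time.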

Most likely the page is rendered by JavaScript that reads the fragment identifier and then either makes another request to fetch the data or decides which part of the already loaded data to display.

I suggest examining all the requests the page makes, using Live HTTP Headers or some other tool, to see whether there is a second request you can use directly. Alternatively, use a JavaScript rendering technology such as Selenium, dryscrape or PyQt5; see my answer to Scraping Google Finance (BeautifulSoup) for details.
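If inspecting the traffic does reveal a second request, the loop can target that request instead of the fragment URL. The sketch below only shows the URL handling: it reads the page number out of the fragment and moves it into a query parameter. The `?page=N` form and the function names `page_from_fragment` and `data_url` are purely hypothetical; the real endpoint and parameter names have to come from what the browser's network tools show for this particular site.

```python
import re
from urllib.parse import urlencode, urlsplit, urlunsplit


def page_from_fragment(url):
    """Extract the page number from a fragment like '/page-3' (None if absent)."""
    match = re.fullmatch(r"/page-(\d+)", urlsplit(url).fragment)
    return int(match.group(1)) if match else None


def data_url(url, param="page"):
    """Build a server-visible URL that carries the page number in the query.

    ASSUMPTION: the '?page=N' shape is invented for illustration; replace it
    with the request actually observed in the browser.
    """
    parts = urlsplit(url)
    page = page_from_fragment(url)
    query = urlencode({param: page}) if page is not None else parts.query
    # Drop the fragment: it would not be sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


for i in range(1, 4):
    print(data_url("http://www.x.y/z/a-b#/page-%s" % i))
```

Each URL produced this way differs in a part the server actually receives, so fetching it with the same helper used in the question would give a different response per page, provided the site really serves its data from such an endpoint.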
