Scrapy, hash tag on URLs


Question

I'm in the middle of a scraping project using Scrapy.

I realized that Scrapy strips everything from the hash sign (#) to the end of the URL.

Here's the output from the shell:

[s]   request    <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s]   response   <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>

This really affects my scraping: after a couple of hours trying to find out why some items were not being selected, I realized that the HTML served for the long URL differs from the HTML served for the short one. Besides, after some observation, the content changes in some critical parts.

Is there a way to modify this behavior so Scrapy keeps the whole URL?

Thanks for your feedback and suggestions.

Recommended answer


This isn't something scrapy itself can change--the portion following the hash in the url is the fragment identifier which is used by the client (scrapy here, usually a browser) instead of the server.
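To see what the server actually receives, the URL can be split at the hash with Python's standard library. This is only a small illustration: the URL below is a shortened, made-up variant of the one in the question, and `urldefrag` is used here to mimic the split, not because Scrapy is known to call it.

```python
from urllib.parse import urldefrag

# Shortened, illustrative variant of the URL from the question (an assumption).
url = (
    "http://www.domain.com/b?ie=UTF8&node=3006339011"
    "#/ref=sr_nr_p_8_0?rh=n%3A165796011"
)

# urldefrag separates the part sent to the server from the fragment,
# which stays on the client side.
base, fragment = urldefrag(url)

print(base)      # the part an HTTP client puts in the request
print(fragment)  # the part only client-side code (e.g. JavaScript) sees
```

Everything in `fragment` never reaches the server, which is why the response URL in the shell output is the short one.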

What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
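As a first step toward emulating the browser, it can help to pull the fragment apart and see which parameters the page's JavaScript has available. The sketch below uses a shortened, illustrative version of the fragment from the question; whether the site's script really reads these values, and which request it then makes, is an assumption you would need to confirm in the developer tools.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened, illustrative variant of the URL from the question (an assumption).
url = (
    "http://www.domain.com/b?ie=UTF8&node=3006339011"
    "#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A3006339011&bbn=3006339011"
)

# The fragment itself looks like a path followed by a query string.
fragment = urlsplit(url).fragment
frag_path, _, frag_query = fragment.partition("?")

# parse_qs percent-decodes the values and groups them by key.
frag_params = parse_qs(frag_query)

print(frag_path)
print(frag_params)
```

With these values in hand, you can compare them against the requests listed in the browser's network panel to find the one that carries them.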

For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
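For the Twitter example, the translation from the hash-bang URL to the data URL could be sketched as follows. The helper name `profile_json_url` is hypothetical, and the endpoint path is taken from the answer as written; it may well no longer exist, so treat this as an illustration of the technique, not a working Twitter client.

```python
from urllib.parse import urlsplit, urlunsplit, urlencode


def profile_json_url(hashbang_url):
    """Map a '#!/name' style URL to the JSON URL the page would load.

    Hypothetical helper: the endpoint path comes from the example above
    and is not guaranteed to exist any more.
    """
    parts = urlsplit(hashbang_url)
    if not parts.fragment.startswith("!/"):
        raise ValueError("not a #!/ style URL: %r" % hashbang_url)

    # The fragment is '!/also'; drop the '!/' prefix to get the screen name.
    screen_name = parts.fragment[2:].strip("/")
    query = urlencode({"screen_name": screen_name})

    # Rebuild a URL on the same host that carries the name in the query string.
    return urlunsplit(
        (parts.scheme, parts.netloc, "/users/show_for_profile.json", query, "")
    )


print(profile_json_url("http://twitter.com/#!/also"))
```

In a Scrapy spider, the resulting URL would be the one to request, for instance with `scrapy.Request`, and the callback could decode the body with `json.loads(response.text)` instead of using HTML selectors.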
