Scrapy ignoring content after # tag in the URL


Question

Hi, I am scraping a site that has a URL like the one below:

http://www.example.com/categories-Mobile-Phones.aspx#RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03

I placed this in start_urls and requested it, but the response I received was:

<200 http://www.example.com/categories-Mobile-Phones.aspx>

That is, everything after the hash is simply ignored. After reading a few posts I learned that when a URL with a hash is requested, the hash fragment is not sent to the server; fragments are used on the client side to load extra information through AJAX or JavaScript requests. So I changed the URL in start_urls by adding an exclamation mark (!) right after the #, as below:

http://www.example.com/categories-Mobile-Phones.aspx#!RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03

Now the output is:

<GET http://www.example.com/categories-Mobile-Phones.aspx?_escaped_fragment_=RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03
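The rewrite seen in that output can be sketched with the standard library. This is only an illustration of the `#!` to `?_escaped_fragment_=` mapping described in the Google article; the function name `to_escaped_fragment` is hypothetical, the escaping is deliberately conservative, and Scrapy's own implementation may differ.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment(url):
    """Rewrite a '#!' URL into its '?_escaped_fragment_=' form (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        # Not a hash-bang URL: drop the fragment, since it is never sent to the server.
        return urlunsplit(parts._replace(fragment=""))
    # Percent-encode the fragment value, leaving '=' readable.
    value = quote(parts.fragment[1:], safe="=")
    extra = "_escaped_fragment_=" + value
    # Append to an existing query string if there is one.
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit(parts._replace(query=query, fragment=""))


url = ("http://www.example.com/categories-Mobile-Phones.aspx"
       "#!RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03")
print(to_escaped_fragment(url))
```

For the URL in the question this prints the same address that appears in the output above.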

I did this after reading https://developers.google.com/webmasters/ajax-crawling/docs/getting-started. According to that article, I need to convert the output URL containing ?_escaped_fragment_= back to the URL containing # (that is, the original URL) so that the page is parsed completely, without Scrapy ignoring the hash fragment. How can I convert it?
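As a rough sketch of the conversion being asked about, the `?_escaped_fragment_=` form can be turned back into a `#!` URL with plain string handling. The helper name `from_escaped_fragment` is an assumption made for illustration, not part of Scrapy. Note that the resulting URL is mainly useful for bookkeeping: any HTTP client that requests it will still leave the fragment out of the request.

```python
from urllib.parse import unquote, urlsplit, urlunsplit

PREFIX = "_escaped_fragment_="


def from_escaped_fragment(url):
    """Turn '?_escaped_fragment_=value' back into '#!value' (hypothetical helper)."""
    parts = urlsplit(url)
    kept = []        # query parameters that are not the escaped fragment
    fragment = None  # recovered '#!' fragment, if any
    for pair in parts.query.split("&") if parts.query else []:
        if pair.startswith(PREFIX):
            fragment = "!" + unquote(pair[len(PREFIX):])
        else:
            kept.append(pair)
    if fragment is None:
        # Nothing to convert; return the URL unchanged.
        return url
    return urlunsplit(parts._replace(query="&".join(kept), fragment=fragment))


crawled = ("http://www.example.com/categories-Mobile-Phones.aspx"
           "?_escaped_fragment_=RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03")
print(from_escaped_fragment(crawled))
```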

I hope I have explained this well; if not, please correct me. I would like to understand how to make Scrapy not ignore the hash fragment of a URL.

Thanks in advance.

Recommended answer

It doesn't matter. With or without the hash, the URI refers to exactly the same page.

The part after the hash is a fragment identifier. It is not sent to the server; your browser uses it to scroll to that specific part of the page.

Like this...

http://www.w3.org/TR/html4/intro/intro.html#h-2.1.2

...and this...

http://www.w3.org/TR/html4/intro/intro.html

...both retrieve the same page. The former simply tells you where on the page to start reading.
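A quick way to see this, sketched with Python's standard library: `urllib.parse.urldefrag` splits a URL into the part that is actually requested and the fragment that stays on the client. The variable names are only illustrative.

```python
from urllib.parse import urldefrag

with_fragment = "http://www.w3.org/TR/html4/intro/intro.html#h-2.1.2"
without_fragment = "http://www.w3.org/TR/html4/intro/intro.html"

# urldefrag returns the URL without its fragment, plus the fragment itself.
requested_url, fragment = urldefrag(with_fragment)

print(requested_url)                      # the address the server is asked for
print(fragment)                           # the part kept on the client side
print(requested_url == without_fragment)  # both URLs request the same page
```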

start_urls = ['themobilestore.in/home-mobiles-&-tablet/?page=1', 'themobilestore.in/home-mobiles-&-tablet/?page=2', ]

