Scrapy ignoring content after the # in the URL
Question
Hi, I am scraping a site that has a URL like the one below:
http://www.example.com/categories-Mobile-Phones.aspx#RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03
I placed this in start_urls and requested a response, but the response I received was:
<200 http://www.example.com/categories-Mobile-Phones.aspx>
That is, it simply ignores the content after the hash (#). After reading some posts, I learned that when a URL with a hash is requested, the server simply ignores the hash fragment; hash fragments are used to load extra information for AJAX or JavaScript requests. So I changed the URL in start_urls by adding an exclamation mark (!) after the #, as below:
http://www.example.com/categories-Mobile-Phones.aspx#!RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03
Now the output is:
<GET http://www.example.com/categories-Mobile-Phones.aspx?_escaped_fragment_=RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03>
I did this after reading https://developers.google.com/webmasters/ajax-crawling/docs/getting-started. According to the scheme described there, I need to convert the output URL containing ?_escaped_fragment_= back to the URL containing # (I mean the original URL) so that the page is parsed completely, without Scrapy ignoring the hash fragment. How do I convert it?
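For reference, the mapping between the two URL forms can be sketched in a few lines of Python. This is only a minimal sketch based on the URL shapes shown above; the helper name `to_hashbang_url` is made up for illustration and is not part of Scrapy or of any library.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

# Query parameter used by the AJAX crawling scheme in the URLs above.
PARAM = "_escaped_fragment_="


def to_hashbang_url(url):
    """Hypothetical helper: turn '...?_escaped_fragment_=X' back into '...#!X'."""
    parts = urlsplit(url)
    kept = []        # query parameters other than _escaped_fragment_
    fragment = None
    for piece in parts.query.split("&"):
        if piece.startswith(PARAM):
            # The value may be percent-encoded, so decode it.
            fragment = unquote(piece[len(PARAM):])
        elif piece:
            kept.append(piece)
    if fragment is None:
        return url   # nothing to convert
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "&".join(kept), "!" + fragment)
    )


escaped = (
    "http://www.example.com/categories-Mobile-Phones.aspx"
    "?_escaped_fragment_=RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03"
)
original = to_hashbang_url(escaped)
print(original)
```

Note that rebuilding the #! form only changes the string: an HTTP client still does not send anything after the # to the server, so requesting the rebuilt URL returns the same response as the URL without the fragment.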
I hope I explained this well. If not, please correct me and let me know how to make Scrapy not ignore the hash fragment of a URL.
Thanks in advance.
Recommended answer
It doesn't matter. With or without the hash, the URI refers to exactly the same page.
The part after the hash is a fragment identifier. Your browser uses it to scroll to that specific part of the page.
Like this...
http://www.w3.org/TR/html4/intro/intro.html#h-2.1.2
...and this...
http://www.w3.org/TR/html4/intro/intro.html
...both retrieve the same page. The former simply tells you where on the page to start reading.
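A quick way to see this is to split the fragment off with Python's standard library. This is just an illustrative sketch using the two W3C URLs above; the variable names are arbitrary.

```python
from urllib.parse import urldefrag

with_fragment = "http://www.w3.org/TR/html4/intro/intro.html#h-2.1.2"
without_fragment = "http://www.w3.org/TR/html4/intro/intro.html"

# urldefrag splits a URL into the part that identifies the resource
# and the fragment, which is only interpreted on the client side.
url, fragment = urldefrag(with_fragment)

print(url)                      # the URL that is actually requested
print(fragment)                 # the part used only by the client
print(url == without_fragment)  # both URLs name the same resource
```

Once the fragment is removed, the two URLs are identical, which is why the response shows the URL without anything after the #.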
start_urls = ['themobilestore.in/home-mobiles-&-tablet/?page=1', 'themobilestore.in/home-mobiles-&-tablet/?page=2', ]