Scrapy ignoring content after # tag in the URL

Views: 21 Published: 2021/7/16 22:26:03 python url scrapy


Problem description

Hi, I am scraping a site that has URLs like the one below:

http://www.example.com/categories-Mobile-Phones.aspx#RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03

I placed this in start_urls and requested a response, but the response I received was:

<200 http://www.example.com/categories-Mobile-Phones.aspx>
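The split between the part of a URL that is requested and the fragment can be illustrated with Python's standard library. This is only a minimal sketch and not Scrapy's own code; the variable names are chosen for illustration.

```python
from urllib.parse import urldefrag

# The URL from the question, including its fragment.
url = ("http://www.example.com/categories-Mobile-Phones.aspx"
       "#RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03")

# urldefrag separates the requestable URL from the fragment after '#'.
requested, fragment = urldefrag(url)

print(requested)  # the part that an HTTP request is made for
print(fragment)   # the part that stays on the client side
```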

That is, the content after the # is simply ignored. I read a few posts and learned that when a URL with a hash is requested, the fragment is not sent to the server; hash fragments are used on the client side, for example by JavaScript to load extra information through AJAX requests. So I changed the URL in start_urls by adding an exclamation mark (!) after the #, as below:

http://www.example.com/categories-Mobile-Phones.aspx#!RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03

Now the output is:

<GET http://www.example.com/categories-Mobile-Phones.aspx?_escaped_fragment_=RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03

I did this after reading https://developers.google.com/webmasters/ajax-crawling/docs/getting-started. According to the concept described there, I need to convert the output URL containing ?_escaped_fragment_= back to the URL containing # (that is, the original URL) so that the page is parsed completely, without Scrapy ignoring the hash fragment. How can I convert it?
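For reference, the mapping between the two URL forms can be sketched with the standard library alone. This is a hedged illustration of the string conversion only: the function names to_escaped_fragment and to_hashbang are hypothetical, the escaping rule is a simplification, and nothing here is Scrapy's own API. Note also that converting back to the #! form only restores the original string; an HTTP request for it still does not carry the fragment to the server.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

ESCAPED_KEY = "_escaped_fragment_="  # query key used by the AJAX crawling scheme


def to_escaped_fragment(url):
    """Hypothetical helper: '...#!X' -> '...?_escaped_fragment_=X'."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, leave it untouched
    # Assumption: percent-encode the fragment but keep '=' readable.
    token = ESCAPED_KEY + quote(parts.fragment[1:], safe="=")
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def to_hashbang(url):
    """Hypothetical helper: '...?_escaped_fragment_=X' -> '...#!X'."""
    parts = urlsplit(url)
    kept, fragment = [], None
    for pair in parts.query.split("&"):
        if pair.startswith(ESCAPED_KEY):
            fragment = unquote(pair[len(ESCAPED_KEY):])
        elif pair:
            kept.append(pair)  # preserve unrelated query parameters
    if fragment is None:
        return url  # no escaped fragment present
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(kept), "!" + fragment))


original = ("http://www.example.com/categories-Mobile-Phones.aspx"
            "#!RSS=pgZZ1QQdivZZctl00_ContentPlaceHolder1_ctl00_ctl03")
escaped = to_escaped_fragment(original)
print(escaped)
print(to_hashbang(escaped))
```

In a spider, a helper like to_hashbang could be applied to response.url to recover the #! form for logging or bookkeeping.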

I hope I have explained this well; if not, please correct me. Please let me know how to make Scrapy not ignore the hash fragment of a URL.

Thanks in advance.
