How to scrape address from websites using Scrapy?
Question
I am using Scrapy and I need to scrape the address from the "Contact Us" page of a given domain. The domains are provided as a result of the Google Search API, so I do not know what the exact structure of the web page is going to be. Is this kind of scraping possible? Any examples would be nice.
Answer
Providing a few examples would help to make a better answer, but the general idea could be to:
- find the "Contact Us" link
- follow the link and extract the address

assuming you don't have any information about the web-sites you'll be given.

Let's focus on the first problem.
The main problem here is that web-sites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But you can "cover" the most common cases:
- follow the `a` tag with the text "Contact Us", "Contact", "About Us", "About" etc
- check `/about`, `/contact_us` and similar endpoints, examples:
  - http://www.sample.com/contact.php
  - http://www.sample.com/contact

From these you can build a set of `Rules` for your `CrawlSpider`.
The second problem is no easier - you don't know where on the page an address is located (and maybe it doesn't exist on the page), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
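As a very rough illustration of why this is hard, a regex heuristic can catch simple US-style street addresses, but little else - anything more general needs NLP/NER. The pattern and function name below are my own, not part of Scrapy:

```python
import re

# Rough heuristic for US-style street addresses: a house number, one to
# three capitalized words, and a common street-type suffix. This is an
# assumption for illustration; real pages need far more robust extraction.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s){1,3}"
    r"(?:Street|St\.?|Avenue|Ave\.?|Road|Rd\.?|Boulevard|Blvd\.?"
    r"|Lane|Ln\.?|Drive|Dr\.?)\b"
)


def find_addresses(text):
    """Return candidate address strings found in free-form page text."""
    return ADDRESS_RE.findall(text)


sample = ("Visit us at 221 Baker Street, London, "
          "or write to 1600 Pennsylvania Avenue.")
print(find_addresses(sample))
# → ['221 Baker Street', '1600 Pennsylvania Avenue']
```

A heuristic like this immediately fails on international formats, PO boxes, or addresses split across elements - which is exactly why the answer points toward NLP and Machine Learning.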