How to scrape address from websites using Scrapy?


Question

I am using Scrapy, and I need to scrape the address from the Contact Us page of a given domain. The domains come from the Google Search API results, so I do not know in advance what the exact structure of each web page is going to be. Is this kind of scraping possible? Any examples would be nice.

Answer

Providing a few examples would help to make a better answer, but the general idea is to:

  • find the "Contact Us" link
  • follow the link and extract the address

assuming you don't have any information about the websites you'll be given.

Let's focus on the first problem.

The main problem here is that websites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. You can, however, cover the most common cases:

  • follow the a tag with the text "Contact Us", "Contact", "About Us", "About", etc.
  • check /about, /contact_us, and similar endpoints, for example:
    • http://www.sample.com/contact.php
    • http://www.sample.com/contact

From these you can build a set of Rules for your CrawlSpider.
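Those heuristics can be sketched as a small matcher in plain Python; the anchor texts and the URL pattern below are illustrative guesses, not an exhaustive list. In an actual CrawlSpider, equivalent patterns would typically go into a `LinkExtractor` inside a `Rule`:

```python
import re

# Anchor texts and URL path shapes that commonly mark a contact/about page.
# Both lists are illustrative examples, not a complete catalogue.
CONTACT_TEXTS = {"contact us", "contact", "about us", "about"}
CONTACT_PATH_RE = re.compile(
    r"/(contact([_-]?us)?|about([_-]?us)?)(\.\w+)?/?$", re.I
)

def looks_like_contact_link(anchor_text, href):
    """Return True if a link probably leads to a Contact Us / About page."""
    if anchor_text and anchor_text.strip().lower() in CONTACT_TEXTS:
        return True
    return bool(CONTACT_PATH_RE.search(href))
```

For instance, `looks_like_contact_link("", "http://www.sample.com/contact.php")` matches on the URL path alone, while `looks_like_contact_link("Contact Us", "/page?id=7")` matches on the anchor text.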

The second problem is no easier: you don't know where on the page an address is located (and it may not exist on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
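Before reaching for NLP, a regular expression can serve as a crude first pass for one narrow format. The pattern below is a hypothetical sketch aimed at US-style street addresses ending in a ZIP code; it will miss most international formats, which is exactly why the general case calls for NLP/ML:

```python
import re

# Crude first-pass pattern for US-style street addresses, e.g.
# "123 Main Street, Springfield, IL 62704". Real-world address
# formats vary far too widely for a single regex to cover.
ADDRESS_RE = re.compile(
    r"\d{1,5}\s+\w+(?:\s\w+)*\s"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b"
    r"[^\n]{0,60}?\b\d{5}(?:-\d{4})?",
    re.I,
)

def find_addresses(text):
    """Return candidate US-style address strings found in a page's text."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]
```

A hit from this pass could then be handed to a proper validator or a named-entity model for confirmation.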
