What's the best way of scraping data from a website?

Problem Description

I need to extract content from a website, but the application doesn't provide an application programming interface (API) or any other mechanism to access that data programmatically.

I found a useful third-party tool called Import.io that provides click-and-go functionality for scraping web pages and building data sets. The only thing is that I want to keep my data locally, and I don't want to subscribe to any subscription plan.

What kind of technique does this company use for scraping web pages and building its datasets? I found the web scraping frameworks pjscrape & Scrapy; could they provide such a feature?

Solution

You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and put together your own stack of libraries, but without a lot of scraping experience your design will be much worse than pjscrape or scrapy.

Note: I use the terms crawling and scraping basically interchangeably here. This is a copy of my answer to your Quora question; it's pretty long.

Tools

Get very familiar with either Firebug or Chrome dev tools depending on your preferred browser. This will be absolutely necessary as you browse the site you are pulling data from and map out which urls contain the data you are looking for and what data formats make up the responses.

You will need a good working knowledge of HTTP as well as HTML, and will probably want to find a decent piece of man-in-the-middle proxy software. You will need to be able to inspect HTTP requests and responses and understand how the cookies, session information and query parameters are being passed around. Fiddler (http://www.telerik.com/fiddler) and Charles Proxy (http://www.charlesproxy.com/) are popular tools. I use mitmproxy (http://mitmproxy.org/) a lot as I'm more of a keyboard guy than a mouse guy.
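
If you end up on mitmproxy, a tiny addon script is often more convenient than clicking around a UI. Here is a minimal sketch that logs every request/response pair passing through the proxy and flags anything that looks like a json api endpoint; the file name and the "json" filter are placeholders to adapt to your target.

```python
# log_api_calls.py - a minimal mitmproxy addon sketch; run with: mitmdump -s log_api_calls.py
# It prints every request/response pair so you can spot which urls carry the data you want.
from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    ctype = flow.response.headers.get("content-type", "")
    print(
        flow.request.method,
        flow.request.pretty_url,
        flow.response.status_code,
        ctype,
        f"cookies={dict(flow.request.cookies)}",
    )
    # Responses that look like json are usually the ajax endpoints worth replaying yourself.
    if "json" in ctype:
        print("  candidate api endpoint:", flow.request.pretty_url)
```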

Some kind of console/shell/REPL type environment where you can try out various pieces of code with instant feedback will be invaluable. Reverse engineering tasks like this are a lot of trial and error so you will want a workflow that makes this easy.

Language

PHP is basically out; it's not well suited for this task, and the library/framework support is poor in this area. Python (Scrapy is a great starting point) and Clojure/Clojurescript (incredibly powerful and productive, but a big learning curve) are great languages for this problem. Since you would rather not learn a new language and you already know Javascript, I would definitely suggest sticking with JS. I have not used pjscrape, but it looks quite good from a quick read of their docs. It's well suited to the problem I describe below and implements an excellent solution to it.
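
To give a feel for what starting with a framework buys you, here is a minimal Scrapy spider sketch; the start url, the css selectors and the field names are placeholders rather than any real site's layout.

```python
# A minimal Scrapy spider sketch; run with: scrapy runspider example_spider.py -o items.json
# The url and selectors are placeholders; swap in whatever your target site actually uses.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/foobar"]  # placeholder url

    def parse(self, response):
        # css/xpath selectors navigate the html; keep regexes for the text inside nodes only
        for row in response.css("div.item"):
            yield {
                "title": row.css("h2::text").get(),
                "price": row.css("span.price::text").get(),
            }
        # follow pagination if the site has it
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```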

A note on Regular expressions: DO NOT USE REGULAR EXPRESSIONS TO PARSE HTML. A lot of beginners do this because they are already familiar with regexes. It's a huge mistake; use xpath or css selectors to navigate the html, and only use regular expressions to extract data from the actual text inside an html node. This might already be obvious to you, and it becomes obvious quickly if you try it, but a lot of people waste a lot of time going down this road for some reason. Don't be scared of xpath or css selectors; they are WAY easier to learn than regexes, and they were designed to solve this exact problem.
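
As a concrete illustration of that division of labour, here is a short sketch using parsel (the selector library behind Scrapy): a css or xpath selector finds the node, and a regex only touches the plain text inside it. The html snippet and class names are made up for the example.

```python
# Selectors navigate the document; a regex only cleans up text inside one node.
import re

from parsel import Selector

html = '<div class="product"><span class="price">Price: $1,299.00 USD</span></div>'
sel = Selector(text=html)

# navigate with css (or the equivalent xpath), never with a regex
price_text = sel.css("span.price::text").get()
# price_text = sel.xpath('//span[@class="price"]/text()').get()  # same thing in xpath

# a regex is fine here because we are now working on plain text inside a single node
match = re.search(r"[\d,]+\.\d{2}", price_text)
price = float(match.group().replace(",", "")) if match else None
print(price)  # 1299.0
```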

Javascript-heavy sites

In the old days you just had to make an http request and parse the HTML response. Now you will almost certainly have to deal with sites that are a mix of standard HTML HTTP requests/responses and asynchronous HTTP calls made by the javascript portion of the target site. This is where your proxy software and the network tab of firebug/devtools come in very handy. The responses to these might be html or they might be json; in rare cases they will be xml or something else.

There are two approaches to this problem:

The low level approach:

You can figure out what ajax urls the site javascript is calling and what those responses look like, and make those same requests yourself. So you might pull the html from http://example.com/foobar and extract one piece of data, and then have to pull the json response from http://example.com/api/baz?foo=b... to get the other piece of data. You'll need to be aware of passing the correct cookies or session parameters. It's very rare, but occasionally some required parameters for an ajax call will be the result of some crazy calculation done in the site's javascript; reverse engineering this can be annoying.
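
A rough sketch of what the low level approach tends to look like in Python with the requests library: one call fetches the html page, a second call replays the ajax endpoint, and a shared session carries the cookies across. The urls, the query parameter and the json shape are illustrative, not a real api.

```python
# Low level approach sketch: fetch the html, then replay the ajax endpoint yourself.
import requests
from parsel import Selector

session = requests.Session()  # keeps cookies across the two requests
session.headers["User-Agent"] = "examplebot/1.0 (+http://example.com/bot-info)"  # placeholder

# 1. the normal html page, parsed for the first piece of data
page = session.get("http://example.com/foobar", timeout=30)
title = Selector(text=page.text).css("h1::text").get()

# 2. the ajax endpoint the page's javascript would call; the shape depends entirely on the site
api = session.get(
    "http://example.com/api/baz",
    params={"foo": "bar"},  # illustrative query parameter
    headers={"Accept": "application/json"},
    timeout=30,
)
detail = api.json()

print(title, detail)
```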

The embedded browser approach:

Why do you need to work out what data is in the html and what data comes in from an ajax call? Managing all that session and cookie data? You don't have to when you browse a site; the browser and the site javascript do that. That's the whole point.

If you just load the page into a headless browser engine like phantomjs it will load the page, run the javascript and tell you when all the ajax calls have completed. You can inject your own javascript if necessary to trigger the appropriate clicks or whatever is necessary to trigger the site javascript to load the appropriate data.

You now have two options: get it to spit out the finished html and parse it, or inject some javascript into the page that does your parsing and data formatting and spits the data out (probably in json format). You can freely mix these two options as well.
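
As a rough sketch of this workflow, here is the same idea using Playwright's sync Python API as a modern stand-in for the phantomjs setup described above; the url, the selectors and the injected javascript are placeholders, and either of the two options (or a mix) works.

```python
# Embedded browser sketch: load the page, let the javascript run, then extract.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # wait until the network goes quiet, i.e. the ajax calls have finished
    page.goto("http://example.com/foobar", wait_until="networkidle")

    # option 1: take the finished html and parse it with your usual selectors
    finished_html = page.content()

    # option 2: inject javascript that extracts inside the page and hands back json-able data
    data = page.evaluate(
        """() => Array.from(document.querySelectorAll('div.item')).map(el => ({
               title: el.querySelector('h2')?.innerText,
               price: el.querySelector('span.price')?.innerText,
           }))"""
    )
    browser.close()

print(len(finished_html), data)
```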

Which approach is best?

That depends; you will need to be familiar and comfortable with the low level approach for sure. The embedded browser approach works for anything, will be much easier to implement and will make some of the trickiest problems in scraping disappear. It's also quite a complex piece of machinery that you will need to understand: it's not just HTTP requests and responses, it's requests, embedded browser rendering, site javascript, injected javascript, your own code and 2-way interaction with the embedded browser process.

The embedded browser is also much slower at scale because of the rendering overhead, but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.

Rate Limiting/Bot behaviour

You need to be very aware of this. You need to make requests to your target domains at a reasonable rate. You need to write a well-behaved bot when crawling websites, and that means respecting robots.txt and not hammering the server with requests. Mistakes or negligence here are very unethical, since this can be considered a denial of service attack. The acceptable rate varies depending on who you ask; 1 req/s is the max that the Google crawler runs at, but you are not Google and you probably aren't as welcome as Google. Keep it as slow as reasonable. I would suggest 2-5 seconds between each page request.

Identify your requests with a user agent string that identifies your bot, and have a webpage for your bot explaining its purpose. This url goes in the agent string.
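
A minimal sketch of those bot-behaviour basics in Python: check robots.txt with the standard library's robotparser, identify yourself in the user agent, and sleep a few seconds between page requests. The bot name, the info url and the target urls are placeholders you would replace with your own.

```python
# Well-behaved bot basics: robots.txt, an identifying user agent, and a delay between requests.
import time
import urllib.robotparser

import requests

BOT_UA = "examplebot/1.0 (+http://example.com/bot-info)"  # placeholder identity page
DELAY_SECONDS = 3  # within the 2-5 second range suggested above

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

session = requests.Session()
session.headers["User-Agent"] = BOT_UA

for url in ["http://example.com/foobar", "http://example.com/other-page"]:
    if not robots.can_fetch(BOT_UA, url):
        print("robots.txt disallows", url, "- skipping")
        continue
    response = session.get(url, timeout=30)
    print(url, response.status_code)
    time.sleep(DELAY_SECONDS)  # rate limit: slower than you think you need
```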

You will be easy to block if the site wants to block you. A smart engineer on their end can easily identify bots, and a few minutes of work on their end can cause weeks of work changing your scraping code on your end, or just make it impossible. If the relationship is antagonistic then a smart engineer at the target site can completely stymie a genius engineer writing a crawler. Scraping code is inherently fragile and this is easily exploited. Something that would provoke this response is almost certainly unethical anyway, so write a well-behaved bot and don't worry about this.

Testing

Not a unit/integration test person? Too bad. You will now have to become one. Sites change frequently and you will be changing your code frequently. This is a large part of the challenge.

There are a lot of moving parts involved in scraping a modern website, and good test practices will help a lot. Many of the bugs you will encounter while writing this type of code will be the type that just return corrupted data silently. Without good tests to check for regressions, you will find out that you've been saving useless corrupted data to your database for a while without noticing. This project will make you very familiar with data validation (find some good libraries to use) and testing. There are not many other problems that combine requiring comprehensive tests with being very difficult to test.

The second part of your tests involves caching and change detection. While writing your code you don't want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to know if they are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the urls involved. A caching proxy is very useful here, but tricky to configure and use properly.
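
One way to set this up, sketched with pytest: save a copy of the page once, commit it as a test fixture, and run the parsing code against that instead of the live site. The fixture path and the parse_product function are assumptions standing in for your own extraction code.

```python
# Unit test against a cached copy of the page rather than the live site.
from pathlib import Path

from parsel import Selector


def parse_product(html: str) -> dict:
    """The extraction logic under test (placeholder selectors)."""
    sel = Selector(text=html)
    return {
        "title": sel.css("h1::text").get(),
        "price": sel.css("span.price::text").get(),
    }


def test_parse_product_against_cached_copy():
    # the cached copy was fetched once from the live site and committed as a fixture
    html = Path("tests/fixtures/foobar.html").read_text(encoding="utf-8")
    item = parse_product(html)
    # regression checks: silently empty fields are the most common scraping bug
    assert item["title"], "title came back empty, selectors probably broke"
    assert item["price"], "price came back empty, selectors probably broke"
```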

You also do want to know if the site has changed. If they redesigned the site and your crawler is broken, your unit tests will still pass because they are running against a cached copy! You will need either another, smaller set of integration tests that are run infrequently against the live site, or good logging and error detection in your crawling code that logs the exact issues, alerts you to the problem and stops crawling. Now you can update your cache, run your unit tests and see what you need to change.
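
A rough sketch of the "log, alert and stop" option: validate every scraped item and abort loudly once extraction starts coming back empty, which usually means the site has been redesigned. The field names and helper functions are illustrative, not a prescribed design.

```python
# Stop the crawl loudly when items stop validating instead of silently saving junk.
import logging

logger = logging.getLogger("crawler")


class SiteChangedError(RuntimeError):
    """Raised when scraped items stop validating, i.e. the layout probably changed."""


def validate(item: dict) -> list:
    # return the names of required fields that came back empty
    return [field for field in ("title", "price") if not item.get(field)]


def crawl(urls, fetch, parse, max_failures=3):
    failures = 0
    for url in urls:
        item = parse(fetch(url))
        missing = validate(item)
        if missing:
            failures += 1
            logger.error("extraction failed for %s, empty fields: %s", url, missing)
            if failures >= max_failures:
                raise SiteChangedError(f"{failures} bad pages in a row, stopping the crawl")
        else:
            failures = 0
            yield item
```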

Legal Issues

The law here can be slightly dangerous if you do stupid things. If the law gets involved you are dealing with people who regularly refer to wget and curl as "hacking tools". You don't want this.

The ethical reality of the situation is that there is no difference between using browser software to request a url and look at some data and using your own software to request a url and look at some data. Google is the largest scraping company in the world and they are loved for it. Identifying your bot's name in the user agent and being open about the goals and intentions of your web crawler will help here, as the law understands what Google is. If you are doing anything shady, like creating fake user accounts or accessing areas of the site that you shouldn't (either "blocked" by robots.txt or because of some kind of authorization exploit), then be aware that you are doing something unethical, and the law's ignorance of technology will be extraordinarily dangerous here. It's a ridiculous situation, but it's a real one.

It's literally possible to try and build a new search engine on the up and up as an upstanding citizen, make a mistake or have a bug in your software and be seen as a hacker. Not something you want considering the current political reality.

Who am I to write this giant wall of text anyway?

I've written a lot of web crawling related code in my life. I've been doing web-related software development for more than a decade as a consultant, employee and startup founder. The early days were writing perl crawlers/scrapers and php websites. Back then we were embedding hidden iframes that loaded csv data into webpages to do ajax, before Jesse James Garrett named it ajax and before XMLHttpRequest was an idea. Before jQuery, before json. I'm in my mid-30's; that's apparently considered ancient for this business.

I've written large scale crawling/scraping systems twice, once for a large team at a media company (in Perl) and recently for a small team as the CTO of a search engine startup (in Python/Javascript). I currently work as a consultant, mostly coding in Clojure/Clojurescript (a wonderful expert language in general, with libraries that make crawler/scraper problems a delight).

I've written successful anti-crawling software systems as well. It's remarkably easy to write nigh-unscrapable sites if you want to, or to identify and sabotage bots you don't like.

I like writing crawlers, scrapers and parsers more than any other type of software. It's challenging, fun and can be used to create amazing things.
