What's the best way of scraping data from a website?


Problem description

I need to extract content from a website, but the application doesn’t provide any API or mechanism to access that data programmatically.

I have found some useful third-party tools (Kimono Labs & Import.io) that provide click-and-go functionality for scraping webpages and building APIs, but I want to keep my data locally and I don't want to sign up for any subscription plans.

What kind of techniques do these startups use for scraping web pages and building their APIs?

I also found some web-scraping frameworks (pjscrape & Scrapy). Do they use the same techniques as the ones asked about above?

Answer

You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and you can put together your own stack of libraries, but without a lot of scraping experience your design will be much worse than pjscrape or Scrapy.

Note: I use the terms crawling and scraping basically interchangeably here. This is a copy of my answer to your Quora question; it's pretty long.

Tools

Get very familiar with either Firebug or Chrome dev tools, depending on your preferred browser. This will be absolutely necessary as you browse the site you are pulling data from and map out which URLs contain the data you are looking for and what data formats make up the responses.

You will need a good working knowledge of HTTP as well as HTML, and will probably want to find a decent piece of man-in-the-middle proxy software. You will need to be able to inspect HTTP requests and responses and understand how the cookies, session information and query parameters are being passed around. Fiddler (http://www.telerik.com/fiddler) and Charles Proxy (http://www.charlesproxy.com/) are popular tools. I use mitmproxy (http://mitmproxy.org/) a lot as I'm more of a keyboard guy than a mouse guy.

Some kind of console/shell/REPL environment where you can try out various pieces of code with instant feedback will be invaluable. Reverse engineering tasks like this involve a lot of trial and error, so you will want a workflow that makes this easy.
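
For example, a throwaway session in a Python shell is usually enough for this step; the snippet below is only a sketch of that kind of experiment (requests, lxml and the example.com URL are my assumptions here, not tools named in the answer):

    # Typical throwaway REPL experiment: fetch once, then keep adjusting the
    # XPath expressions until they return what you want (URL is illustrative).
    import requests
    from lxml import html

    r = requests.get("http://example.com/foobar", timeout=30)
    r.status_code                      # did the plain request work at all?
    r.headers.get("Content-Type")      # HTML? JSON? something else?
    tree = html.fromstring(r.content)
    tree.xpath("//title/text()")       # first guess at a selector
    tree.xpath("//div[@id='content']//a/@href")   # refine until it looks right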

Languages

PHP is basically out: it's not well suited for this task and the library/framework support is poor in this area. Python (Scrapy is a great starting point) and Clojure/ClojureScript (incredibly powerful and productive, but a big learning curve) are great languages for this problem. Since you would rather not learn a new language and you already know Javascript, I would definitely suggest sticking with JS. I have not used pjscrape but it looks quite good from a quick read of their docs. It's well suited and implements an excellent solution to the problem I describe below.

A note on regular expressions: DO NOT USE REGULAR EXPRESSIONS TO PARSE HTML. A lot of beginners do this because they are already familiar with regexes. It's a huge mistake: use XPath or CSS selectors to navigate the HTML, and only use regular expressions to extract data from the actual text inside an HTML node. This might already be obvious to you, and it becomes obvious quickly if you try it, but a lot of people waste a lot of time going down this road for some reason. Don't be scared of XPath or CSS selectors, they are WAY easier to learn than regexes and they were designed to solve this exact problem.
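
As a minimal sketch of that division of labour (Python with requests and lxml is just one possible toolset, and the URL, XPath and price field are made-up examples):

    # Navigate the document with a selector; apply a regex only to the plain
    # text inside the node you selected. URL and XPath are illustrative only.
    import re
    import requests
    from lxml import html

    page = requests.get("http://example.com/foobar", timeout=30)
    tree = html.fromstring(page.content)

    # XPath (or a CSS selector) gets you to the right node.
    node = tree.xpath("//div[@class='product']//span[@class='price']")[0]

    # The regular expression only ever sees the text of that single node.
    match = re.search(r"[\d.,]+", node.text_content())
    price = match.group(0) if match else None
    print(price)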

Javascript-heavy sites

In the old days you just had to make an HTTP request and parse the HTML response. Now you will almost certainly have to deal with sites that are a mix of standard HTML HTTP requests/responses and asynchronous HTTP calls made by the javascript portion of the target site. This is where your proxy software and the network tab of Firebug/devtools come in very handy. The responses to these might be HTML or they might be JSON; in rare cases they will be XML or something else.

There are two approaches to this problem:

The low level approach:

You can figure out what ajax URLs the site javascript is calling and what those responses look like, and make those same requests yourself. So you might pull the HTML from http://example.com/foobar and extract one piece of data, and then have to pull the JSON response from http://example.com/api/baz?foo=b... to get the other piece of data. You'll need to be aware of passing the correct cookies or session parameters. It's very rare, but occasionally some required parameters for an ajax call will be the result of some crazy calculation done in the site's javascript; reverse engineering this can be annoying.
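
A rough sketch of that flow, assuming Python with requests and lxml (the URLs are the example ones above; the query parameter and the X-Requested-With header are illustrative guesses, since the real values come from watching the site in your proxy or network tab):

    # Low-level approach: one Session carries cookies across the plain HTML
    # request and the ajax-style JSON request the site's javascript would make.
    import requests
    from lxml import html

    session = requests.Session()

    # 1. Fetch the normal HTML page and extract one piece of data from it.
    page = session.get("http://example.com/foobar", timeout=30)
    tree = html.fromstring(page.content)
    title = tree.xpath("//h1/text()")[0]

    # 2. Call the same JSON endpoint the site's own javascript calls,
    #    reusing the cookies the first request set.
    api = session.get(
        "http://example.com/api/baz",
        params={"foo": "b"},                             # illustrative parameter
        headers={"X-Requested-With": "XMLHttpRequest"},   # some sites check this
        timeout=30,
    )
    data = api.json()
    print(title, data)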

The embedded browser approach:

Why do you need to work out what data is in the HTML and what data comes in from an ajax call? Managing all that session and cookie data? You don't have to when you browse a site: the browser and the site javascript do that. That's the whole point.

If you just load the page into a headless browser engine like phantomjs, it will load the page, run the javascript and tell you when all the ajax calls have completed. You can inject your own javascript if necessary to trigger the appropriate clicks, or whatever is needed to get the site javascript to load the appropriate data.

You now have two options: get it to spit out the finished HTML and parse it, or inject some javascript into the page that does your parsing and data formatting and spits the data out (probably in JSON format). You can freely mix these two options as well.
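
The original answer is written around phantomjs/pjscrape; as one possible Python equivalent (Selenium driving headless Chrome is my substitution, not something the answer prescribes), a sketch of both options looks like this:

    # Embedded-browser approach: let a real browser engine run the site's
    # javascript, then either take the rendered HTML or inject your own JS.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("http://example.com/foobar")   # example URL from above

        # Option 1: spit out the finished, javascript-rendered HTML and parse
        # it yourself (with lxml, as in the earlier sketches).
        rendered_html = driver.page_source

        # Option 2: inject javascript that does the extraction in the page and
        # hands structured data back.
        headings = driver.execute_script(
            "return Array.from(document.querySelectorAll('h1'))"
            ".map(el => el.textContent);"
        )
        print(len(rendered_html), headings)
    finally:
        driver.quit()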

Which approach is best?

That depends; you will need to be familiar and comfortable with the low level approach for sure. The embedded browser approach works for anything, it will be much easier to implement and will make some of the trickiest problems in scraping disappear. It's also quite a complex piece of machinery that you will need to understand: it's not just HTTP requests and responses, it's requests, embedded browser rendering, site javascript, injected javascript, your own code, and two-way interaction with the embedded browser process.

The embedded browser is also much slower at scale because of the rendering overhead, but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.

Rate limiting / bot behaviour

You need to be very aware of this. You need to make requests to your target domains at a reasonable rate. You need to write a well-behaved bot when crawling websites, and that means respecting robots.txt and not hammering the server with requests. Mistakes or negligence here are very unethical, since this can be considered a denial of service attack. The acceptable rate varies depending on who you ask; 1 req/s is the maximum the Google crawler runs at, but you are not Google and you probably aren't as welcome as Google. Keep it as slow as reasonable. I would suggest 2-5 seconds between each page request.

Identify your requests with a user agent string that identifies your bot, and have a webpage for your bot explaining its purpose. This URL goes in the agent string.
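
A minimal sketch of a well-behaved bot along those lines (the bot name, the "about this crawler" URL and the target URLs are placeholders to replace with your own):

    # Identify yourself, respect robots.txt, and keep 2-5 seconds between
    # page requests. All names and URLs here are placeholders.
    import time
    import requests
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "MyCrawler/1.0 (+http://example.com/about-my-crawler)"
    DELAY_SECONDS = 3   # within the 2-5 second range suggested above

    robots = RobotFileParser()
    robots.set_url("http://example.com/robots.txt")
    robots.read()

    def polite_get(url):
        if not robots.can_fetch(USER_AGENT, url):
            return None   # robots.txt asks us to stay away from this URL
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        time.sleep(DELAY_SECONDS)
        return response

    for url in ["http://example.com/foobar", "http://example.com/page/2"]:
        polite_get(url)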

You will be easy to block if the site wants to block you. A smart engineer on their end can easily identify bots, and a few minutes of work on their end can cause weeks of work changing your scraping code on your end, or just make it impossible. If the relationship is antagonistic then a smart engineer at the target site can completely stymie a genius engineer writing a crawler. Scraping code is inherently fragile and this is easily exploited. Something that would provoke this response is almost certainly unethical anyway, so write a well-behaved bot and don't worry about this.

Testing

Not a unit/integration test person? Too bad; you will now have to become one. Sites change frequently and you will be changing your code frequently. This is a large part of the challenge.

There are a lot of moving parts involved in scraping a modern website, and good test practices will help a lot. Many of the bugs you will encounter while writing this type of code will be the type that just return corrupted data silently. Without good tests to check for regressions, you will find out that you've been saving useless, corrupted data to your database for a while without noticing. This project will make you very familiar with data validation (find some good libraries to use) and testing. There are not many other problems that combine requiring comprehensive tests with being very difficult to test.

The second part of your tests involves caching and change detection. While writing your code you don't want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to know whether your tests are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the URLs involved. A caching proxy is very useful here, but tricky to configure and use properly.
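
As an illustration (the fixture path, the extract_price helper and the expected value are all hypothetical stand-ins for your own scraper code), a unit test against a cached copy might look like:

    # Unit tests run against a cached copy of the page, never the live site.
    # extract_price(), the fixture path and the expected value are made up.
    import unittest
    from lxml import html

    def extract_price(tree):
        # in a real project this would live in your scraper module
        return tree.xpath("//span[@class='price']/text()")[0].strip()

    class TestExtraction(unittest.TestCase):
        def test_price_from_cached_page(self):
            with open("tests/fixtures/foobar.html", "rb") as f:
                tree = html.fromstring(f.read())
            self.assertEqual(extract_price(tree), "19.99")

    if __name__ == "__main__":
        unittest.main()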

You also do want to know if the site has changed. If they redesigned the site and your crawler is broken, your unit tests will still pass because they are running against a cached copy! You will need either another, smaller set of integration tests that are run infrequently against the live site, or good logging and error detection in your crawling code that logs the exact issues, alerts you to the problem and stops crawling. Now you can update your cache, run your unit tests and see what you need to change.
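
One cheap way to get that logging and error detection is a structural check against the live site before a crawl; this is only a sketch, and the XPaths and URL are illustrative:

    # Change detection: if the parts of the page the scraper depends on stop
    # matching, log the exact problem and stop crawling instead of saving junk.
    import logging
    import requests
    from lxml import html

    log = logging.getLogger("crawler")
    REQUIRED_XPATHS = ["//h1", "//span[@class='price']"]   # what the scraper relies on

    def site_still_looks_right(url="http://example.com/foobar"):
        tree = html.fromstring(requests.get(url, timeout=30).content)
        missing = [xp for xp in REQUIRED_XPATHS if not tree.xpath(xp)]
        if missing:
            log.error("Site layout changed; selectors no longer match: %s", missing)
            return False
        return True

    if not site_still_looks_right():
        raise SystemExit("Stopping crawl: update the cached pages and re-run the unit tests.")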

Legal issues

The law here can be slightly dangerous if you do stupid things. If the law gets involved you are dealing with people who regularly refer to wget and curl as "hacking tools". You don't want this.

The ethical reality of the situation is that there is no difference between using browser software to request a URL and look at some data, and using your own software to request a URL and look at some data. Google is the largest scraping company in the world and they are loved for it. Identifying your bot's name in the user agent and being open about the goals and intentions of your web crawler will help here, as the law understands what Google is. If you are doing anything shady, like creating fake user accounts or accessing areas of the site that you shouldn't (either "blocked" by robots.txt or because of some kind of authorization exploit), then be aware that you are doing something unethical and the law's ignorance of technology will be extraordinarily dangerous here. It's a ridiculous situation but it's a real one.

It's literally possible to try to build a new search engine on the up and up as an upstanding citizen, make a mistake or have a bug in your software, and be seen as a hacker. Not something you want, considering the current political reality.

Who am I to be writing this giant wall of text?

I've written a lot of web crawling related code in my life. I've been doing web related software development for more than a decade as a consultant, employee and startup founder. The early days were writing perl crawlers/scrapers and php websites. Back when we were embedding hidden iframes that loaded CSV data into webpages to do ajax before Jesse James Garrett named it ajax, before XMLHttpRequest was an idea. Before jQuery, before JSON. I'm in my mid-30's, which is apparently considered ancient for this business.

I've written large scale crawling/scraping systems twice: once for a large team at a media company (in Perl) and recently for a small team as the CTO of a search engine startup (in Python/Javascript). I currently work as a consultant, mostly coding in Clojure/ClojureScript (a wonderful expert language in general, with libraries that make crawler/scraper problems a delight).

I've written successful anti-crawling software systems as well. It's remarkably easy to write nigh-unscrapable sites if you want to or to identify and sabotage bots you don't like.

I like writing crawlers, scrapers and parsers more than any other type of software. It's challenging, fun and can be used to create amazing things.
