PHP Site Scraping With a Secure Login

Question

I am trying to scrape the quantity of items one of my distributors has in stock per product. They do not know how to export this data. So I am wondering if someone could help point me in the proper direction on how to scrape a site with PHP that you have to log into to get to the data? It's not a secure site with SSL.

Thanks for any tips,

Chris Edwards

Answer

The easiest way to get where you want is by utilizing cURL. cURL's core feature is that it lets you make an HTTP request configured exactly the way you need it and receive the response. This can be done at varying levels of detail, depending on your needs.
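
For illustration, a minimal sketch of such a request in PHP; the URL is a placeholder for your distributor's actual address:

```php
<?php
// Minimal sketch: fetch a page with cURL and get the HTML back as a string.
// The URL is a placeholder -- substitute the distributor's real address.
$ch = curl_init('http://distributor.example.com/products');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if the server sends any

$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

echo $html; // the raw HTML of the response
```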

What you want to do is basically make an HTTP request to get the page you want and scrape the data out of the response's HTML. This can be very easy to do, but in your case you will need to overcome some obstacles.
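
Once you have the HTML, something like DOMDocument with XPath can pull the numbers out. A sketch, assuming the stock quantity sits in a table cell with a class of stock-qty; you will need to inspect the real markup and adjust the expression:

```php
<?php
// Sketch only: extract values from the fetched HTML with DOMDocument/XPath.
// The XPath expression and the class name are assumptions about the page.
libxml_use_internal_errors(true); // scraped HTML is rarely well-formed; silence warnings

$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the response body from the cURL call above

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//td[@class="stock-qty"]') as $cell) {
    echo trim($cell->textContent), "\n"; // e.g. the quantity in stock
}
```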

I'm assuming that by saying "have to log in" you mean there's a login form you have to get past before being able to scrape anything. cURL can fake a login with a little help on your part.

First of all, you will need to "submit" the login form with cURL just as you would do by hand. To make sure you got it right, you will need to see the HTTP requests your browser makes when submitting the form by hand and construct identical requests with cURL. To see the HTTP requests in detail you can use Firebug, Chrome's Developer Tools or the absolutely fantastic Fiddler debugging proxy.
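
A sketch of what that submission might look like; the login URL and the field names username/password are assumptions, so copy the real ones from the request your browser sends:

```php
<?php
// Sketch of faking the login form submission. The URL and the field names
// ("username", "password") are placeholders -- replicate exactly what your
// browser sends, as seen in Firebug / Chrome DevTools / Fiddler.
$ch = curl_init('http://distributor.example.com/login.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'username' => 'your_user',
    'password' => 'your_pass',
]));
// Some sites also check the User-Agent or a hidden token field; mirror
// whatever the browser sent if the plain POST is rejected.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');

$response = curl_exec($ch);
curl_close($ch);
```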

Most probably, after submitting a valid login form the server will send you a cookie to be used in authenticating you on subsequent requests. This cookie will be part of the headers of the server's HTTP response (the Set-Cookie header). You will need to remember the value of that cookie and include a Cookie header on subsequent scraping requests to the server -- in essence you are doing exactly what your browser would do if you were logged in¹.
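
cURL can also keep track of that cookie for you. A sketch using a cookie file; the file path and URLs are placeholders:

```php
<?php
// Sketch: instead of parsing Set-Cookie headers yourself, let cURL keep the
// session cookie in a file. Add CURLOPT_COOKIEJAR to the login request shown
// above, then point later requests at the same file with CURLOPT_COOKIEFILE.
$cookieFile = sys_get_temp_dir() . '/scraper_cookies.txt'; // assumed location

// ... on the login handle:
// curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store Set-Cookie values

// Subsequent scrape, authenticated by sending the stored cookie back:
$ch = curl_init('http://distributor.example.com/stock.php'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // read cookies from the jar
$html = curl_exec($ch);
curl_close($ch);
```

Sharing one cookie file between the login request and the scraping requests makes them behave like a single browser session, which is exactly what the server expects.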

And finally, you may need to make more than one round-trip to find your target. Maybe the URL you need to scrape isn't known beforehand, and you need to scrape a "list" page to find out some variable part of the URL you want to scrape. This can be solved by simply tackling the problem in steps: first scrape the "list" page, find out what you need, then scrape the "details" page you really want.
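
A sketch of that two-step flow; the fetch() helper, the URLs, and the href pattern are all illustrative assumptions:

```php
<?php
// Sketch of the two-step approach: scrape the "list" page, pull out the
// product URLs, then fetch each "details" page while staying logged in.
function fetch($url, $cookieFile) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // reuse the session cookie
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$cookieFile = sys_get_temp_dir() . '/scraper_cookies.txt';
$list = fetch('http://distributor.example.com/products', $cookieFile);

// Find the links to the individual product pages (the pattern is an assumption).
preg_match_all('~href="(/product\.php\?id=\d+)"~', $list, $matches);

foreach ($matches[1] as $path) {
    $details = fetch('http://distributor.example.com' . $path, $cookieFile);
    // ... extract the stock quantity from $details as sketched earlier
}
```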

Beyond the rough sketches above, I'm not providing complete code, as there are tons of cURL tutorials on the web, but I believe that knowing what the plan is will make your work much, much easier.

¹ Another (faster, but crude) way to go about this is simply logging in yourself, looking at the value of the cookie you got, and pasting that into your scraper's request. The upside is that you no longer need to fake a login with cURL; the downside is that before each time your tool is used, someone has to log in manually and provide your tool with the credentials.
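
A sketch of that shortcut; the cookie name PHPSESSID and its value are placeholders for whatever your browser shows after a manual login:

```php
<?php
// Sketch of the "crude" approach: log in with your browser, copy the session
// cookie from its developer tools, and hard-code it into the request.
// "PHPSESSID" and its value are placeholders.
$ch = curl_init('http://distributor.example.com/stock.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=paste_the_value_here');
$html = curl_exec($ch);
curl_close($ch);
```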
