How to use Goutte

Question

Issue:
I cannot fully understand the Goutte web scraper.

Request:
Can someone please help me understand, or provide code that shows, how to use the Goutte web scraper? I have read the README.md. I am looking for more information than it provides, such as which options are available in Goutte and how to write them. For example, when working with forms, do you search by the name= or the id= of the form?

Layout of the webpage I am attempting to scrape:
Step 1:
The webpage has a form with a radio button to choose which kind of form to fill out (i.e. Name or License). It defaults to Name, with First Name and Last Name text boxes and a State drop-down select list. If you choose the License radio button, jQuery or JavaScript hides the First Name and Last Name text boxes and shows a License text box.

Step 2:
Once the form has been submitted successfully, it brings you to a page with multiple links. Either of two of them leads to the information we need.

Step 3:
Once we have clicked the link we want, the third page has the data we are looking for, and we want to store that data in a PHP variable.

Submitting incorrect information:
If wrong information is submitted, jQuery/JavaScript displays the message "No records were found." on the same page as the submission.

Note:
The preferred method would be to select the License radio button, fill in the license number, choose the state, and then submit the form. I have read many posts, blogs, and other material about Goutte, and nowhere can I find which options Goutte offers, how to discover them, or how to use them.

Recommended answer

The documentation you want to look at is the Symfony2 DomCrawler documentation.

Goutte is a client built on top of Guzzle that returns a Crawler every time you request or submit something:

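require 'vendor/autoload.php'; // Composer autoloader (assumes Goutte was installed via Composer)
use Goutte\Client;

$client = new Client();
// request() returns a Crawler for the fetched page
$crawler = $client->request('GET', 'http://www.symfony-project.org/');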

With this crawler you can do things like get all the <p> tags that are direct children of the body:

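use Symfony\Component\DomCrawler\Crawler; // needed for the type hint below

// Collect the text of every <p> that is a direct child of <body>
$nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
    return $node->text();
});
print_r($nodeValues);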

Fill in and submit a form:

// Find the form that owns the "sign in" button, then submit it with values keyed by field name
$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array(
    'username' => 'username',
    'password' => 'xxxxxx',
));

A selectButton() method is available on the Crawler; it returns another Crawler that matches a button (input[type=submit], input[type=image], or a button element) with the given text. [1]

You can also click on links, set options, select check-boxes, and more; see the Form and Link support in the DomCrawler documentation.
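As a rough sketch of how that form support could apply to the search page described in the question, the example below builds a Crawler from inline HTML so it runs without network access. The markup, the field names (searchType, license, state), and the example.com URL are all hypothetical, and it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (Goutte depends on both). It also illustrates the name= versus id= question: form fields are addressed by their name attribute.

```php
<?php
// Assumption: Composer has installed symfony/dom-crawler and symfony/css-selector.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the search page from the question.
$html = <<<'HTML'
<html><body>
<form action="/search" method="post">
  <input type="radio" name="searchType" value="name" checked="checked" />
  <input type="radio" name="searchType" value="license" />
  <input type="text" name="firstName" />
  <input type="text" name="lastName" />
  <input type="text" name="license" />
  <select name="state">
    <option value="CA">California</option>
    <option value="NY">New York</option>
  </select>
  <input type="submit" value="Search" />
</form>
</body></html>
HTML;

// The second argument is the page URI, used to resolve the form's action.
$crawler = new Crawler($html, 'http://example.com/lookup');

// Locate the form through its submit button, matched by the button's text.
$form = $crawler->selectButton('Search')->form();

// Fields are keyed by their name= attribute.
$form['searchType']->select('license'); // pick the License radio button
$form['license']->setValue('12345');    // fill in the License text box
$form['state']->select('NY');           // choose a state from the drop-down

// With a Goutte client you would now call: $crawler = $client->submit($form);
$values = $form->getValues();
print_r($values);
```

Because Goutte does not execute JavaScript, the script that swaps the text boxes never runs; what matters is that the License field exists in the HTML the server sends, which is an assumption to verify against the real page.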

To get data from the crawler, use the html() or text() methods:

echo $crawler->html();
echo $crawler->text();
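For steps 2 and 3 of the question (following a link and storing the data in a PHP variable), a minimal sketch might look like the following. It again works on inline HTML, and the markup, link text, selectors, and URL are hypothetical; symfony/dom-crawler and symfony/css-selector are assumed to be installed through Composer. The check for "No records were found." only helps if that message is present in the HTML returned by the server, since Goutte does not run JavaScript.

```php
<?php
// Assumption: Composer has installed symfony/dom-crawler and symfony/css-selector.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page combining the result links and the data to extract.
$html = <<<'HTML'
<html><body>
  <a href="/details/42">License details</a>
  <a href="/history/42">History</a>
  <table id="record">
    <tr><td class="holder">Jane Doe</td><td class="status">Active</td></tr>
  </table>
</body></html>
HTML;

$crawler = new Crawler($html, 'http://example.com/results');

// Detect the error message in the page text, if the server included it.
$noRecords = strpos($crawler->text(), 'No records were found.') !== false;

// Select a link by its text; getUri() resolves it against the page URI.
$link = $crawler->selectLink('License details')->link();
$detailsUrl = $link->getUri();
// With a Goutte client you would follow it: $crawler = $client->click($link);

// Store a single value, then every cell of the row, in PHP variables.
$holder = $crawler->filter('#record td.holder')->text();
$cells = $crawler->filter('#record td')->each(function (Crawler $node, $i) {
    return trim($node->text());
});

print_r(array($noRecords, $detailsUrl, $holder, $cells));
```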
