PHP Fast scraping

Question

My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an IF statement to check for keywords. It works, but it is very slow! The code is below. Is there a better way to go about this, and if so, how would it be written?

Thanks in advance.

<?php
require 'simple_html_dom.php';

// URL and keyword
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug
$i = 0;

// Checking for keyword "A" in the headtitles
foreach($syds->find($syds_key) as $element) {
   if (strpos($element, 'a') !== false || strpos($element, 'A') !== false) {
      echo $element->href . '<br>';
      $i++;
   }
} 

echo "<h1>$i were found</h1>";
?>

Recommended answer

How slow are we talking?

1-2 seconds would be fine.

If you are using this for a website, I'd advise splitting the crawling and the display into 2 separate scripts and caching the results of each crawl.

You could:

  • have a crawler.php file that runs periodically to update your links.
  • then have a display.php that reads the results of the last crawl and displays them however you need for your website.

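This way:

  • every time your web page is refreshed, it does not re-request the information from the news sites.
  • it does not matter if a news site takes a while to respond.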

You will want to decouple crawling and display 100%. Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated. Be warned: run it more often than once a minute and some news sites may get annoyed!

crawler.php

<?php
// Run this file from the CLI every 5-10 minutes.
// It doesn't matter if it takes 20-30 seconds.

require 'simple_html_dom.php';

$html_output = ""; // use this to build up the HTML output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $key = $site[1];

    // fetch the site; file_get_html() returns false on failure
    $dom = file_get_html($url);
    if ($dom === false) {
        continue; // skip sites that could not be fetched
    }

    // loop over each link
    foreach ($dom->find($key) as $element) {
        // add the link to $html_output (double quotes so \n is a real newline)
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
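
Because the crawler and the display page now run independently, the page could in principle be requested while the crawler is in the middle of writing the file. One possible way around that, sketched below, is to write to a temporary file and then rename it over the real one. The helper name write_cache_atomically is hypothetical and not part of the original answer; the sketch assumes the cache directory is writable and that the web server can read files with mode 0644.

```php
<?php
// Hypothetical helper (not part of the original answer): write the cache
// to a temporary file first, then rename it over the real cache file.
function write_cache_atomically(string $path, string $contents): bool
{
    // Create the temp file in the same directory so rename() stays on one filesystem.
    $tmp = tempnam(dirname($path), 'crawl_');
    if ($tmp === false) {
        return false;
    }
    if (file_put_contents($tmp, $contents) === false) {
        unlink($tmp);
        return false;
    }
    // tempnam() creates the file readable by its owner only; widen it so a
    // web server running as another user can read it (assumption: 0644 suits your setup).
    chmod($tmp, 0644);
    // rename() replaces the destination; on POSIX filesystems this is typically atomic.
    if (!rename($tmp, $path)) {
        unlink($tmp);
        return false;
    }
    return true;
}
```

In crawler.php, the final file_put_contents("links.php", $html_output) call could then be swapped for write_cache_atomically("links.php", $html_output).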

display.php

<?php
/* other display stuff here */

// include the file of links
include("links.php");
?>
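
Note that include runs the cached file as PHP, so anything scraped from a remote site ends up being interpreted by your server. A more defensive variant, sketched here under assumptions, stores the crawl results as data and escapes them when rendering. It also restores the keyword check from the question. The JSON cache format (objects with href and title keys) and the function names filter_by_keyword and render_links are hypothetical, and the arrow function requires PHP 7.4 or later.

```php
<?php
// Hypothetical helpers (not part of the original answer).

// Keep only links whose title contains $keyword, ignoring case.
function filter_by_keyword(array $links, string $keyword): array
{
    return array_values(array_filter(
        $links,
        fn(array $link): bool => stripos($link['title'], $keyword) !== false
    ));
}

// Turn the links into HTML, escaping HTML special characters in scraped values.
function render_links(array $links): string
{
    $html = '';
    foreach ($links as $link) {
        $href  = htmlspecialchars($link['href'], ENT_QUOTES, 'UTF-8');
        $title = htmlspecialchars($link['title'], ENT_QUOTES, 'UTF-8');
        $html .= "<a href=\"$href\">$title</a><br>\n";
    }
    return $html;
}

// Assumed cache format: a JSON array of {"href": ..., "title": ...} objects.
// In practice this string would come from file_get_contents() on the cache file.
$json  = '[{"href":"/a","title":"Alpha & Omega"},{"href":"/b","title":"Beta"}]';
$links = json_decode($json, true);
echo render_links(filter_by_keyword($links, 'alpha'));
```

Escaping covers HTML special characters only. It does not validate the URL scheme, so a real page may also want to accept only http and https links.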

Still want it faster?

If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.
