How to add scraped website data in database?


Problem Description


I want to store:

  1. Product Name
  2. Category
  3. Subcategory
  4. Price
  5. Product Company.

In my table named products_data, the fields are named PID, product_name, category, subcategory, product_price and product_company.

I am using the curl_init() function in PHP to first scrape the website URL; next I want to store the product data in my database table. Here is what I have done so far:

$sites[0] = 'http://www.babyoye.com/';

// Connect once, outside the loops, rather than reconnecting for every row.
$db_conn = mysql_connect('localhost', 'root', '') or die('error');
mysql_select_db('babyoye', $db_conn) or die(mysql_error());

foreach ($sites as $site)
{
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);

    $title_start = '<div class="info">';

    $parts = explode($title_start, $html);
    array_shift($parts); // drop everything before the first product block

    foreach ($parts as $part) {
        $link = explode('<a href="/d/', $part);
        if (!isset($link[1])) {
            continue; // no product link in this block
        }
        $link = explode('">', $link[1]);
        $url = 'http://www.babyoye.com/d/'.$link[0];

        // now for the title we need to follow a similar process:
        $title = explode('<h2>', $part);
        if (!isset($title[1])) {
            continue; // no title in this block
        }
        $title = explode('</h2>', $title[1]);
        $title = strip_tags($title[0]);

        // escape the scraped values before building the query, and note the
        // terminating semicolon that was missing from this statement
        $sql = "INSERT INTO products_data(PID, product_name) VALUES ('"
             . mysql_real_escape_string($url) . "', '"
             . mysql_real_escape_string($title) . "')";

        mysql_query($sql) or die(mysql_error());
    }
}

I am a little confused about the database part: how do I insert the data into the table? Any help?

Solution

There's a number of things you may wish to consider in your design phase prior to writing some code:

  • Generalise your solution as much as you can. If you have to write PHP code for every new scrape, then the development changes required when a target site changes its layout may be too slow to make, and may disrupt the enterprise you are building. This is extra-important if you intend to scrape a large number of sites, since the odds of a site restructuring are statistically greater.
  • One way to achieve this generalisation is to use off-the-shelf libraries that are good at this already. So, rather than using cURL, use Goutte or some other programmatic browser system. This will give you sessions for free, which on some sites are necessary to click from one page to another. You'll also get CSS selectors to specify what items of content you are interested in (a minimal sketch follows this list).
  • For tabular content, store a look-up database table on your local site that converts a heading title to a database column name. For product grids, you could use a table to convert a CSS selector (relative to each grid cell, say) to a column. Either of these will make it easier to respond to changes in the format of your target site(s); the sketch below combines this idea with Goutte.
  • If you are extracting text from a site, at a minimum you need to run it through a proper escape system, otherwise a target site could in theory add content on their site to inject SQL of their choosing into your database. In any case, an apostrophe on their side would certainly cause your call to fail, so you should use mysql_real_escape_string (see the second sketch below).
  • If you are extracting HTML from a site with a view to re-displaying it, always remember to clean it properly first. This means stripping tags that you don't want, removing attributes that may be unwelcome, and ensuring the structure is well-nested. HTMLPurifier is good for this, I've found.
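
To make the first three points concrete, here is a minimal sketch using Goutte with a selector-to-column lookup map. It is illustrative only: the selectors (div.info, h2, a) and the map itself are assumptions about the target markup, not something taken from the question.

    <?php
    require 'vendor/autoload.php'; // composer require fabpot/goutte

    use Goutte\Client;

    // Hypothetical lookup map: CSS selector (relative to each product cell)
    // => database column. This could equally live in a database table, so a
    // site redesign means editing a row rather than rewriting PHP.
    $columnMap = array(
        'h2' => 'product_name',
        'a'  => 'PID', // store the product link as the PID, as the question does
    );

    $client  = new Client();
    $crawler = $client->request('GET', 'http://www.babyoye.com/');

    // One CSS selector replaces the fragile explode() chains.
    $rows = $crawler->filter('div.info')->each(function ($cell) use ($columnMap) {
        $row = array();
        foreach ($columnMap as $selector => $column) {
            $node = $cell->filter($selector);
            if (count($node) === 0) {
                continue; // this cell lacks the element; leave the column unset
            }
            $row[$column] = ($column === 'PID')
                ? 'http://www.babyoye.com' . $node->attr('href')
                : trim($node->text());
        }
        return $row;
    });
    // $rows is now an array of column => value arrays, ready for insertion.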

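A minimal sketch of both sanitisation steps, assuming the same mysql_* extension the question already uses and HTMLPurifier's default configuration:

    <?php
    // SQL side: escape every scraped string before it goes anywhere near a
    // query (assumes a mysql_connect() connection is open, as in the question).
    $safe_title = mysql_real_escape_string($title);
    $sql = "INSERT INTO products_data (product_name) VALUES ('" . $safe_title . "')";
    mysql_query($sql) or die(mysql_error());

    // HTML side: clean scraped markup before re-displaying it.
    // composer require ezyang/htmlpurifier
    require 'vendor/autoload.php';
    $config   = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    $clean    = $purifier->purify($scraped_html); // strips unwanted tags/attributes, fixes nesting
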
When crawling, remember:

  • Be a good robot and define a unique USER_AGENT for yourself, so site operators can easily block you if they wish. It is poor etiquette to masquerade as a human using, say, Internet Explorer. Include a URL to a friendly help page in your user agent, like the GoogleBot does.
  • Don't crawl through proxies or other systems intended to hide your identity - crawl in the open.
  • Respect robots.txt; if a site wishes to block scrapers, they should be allowed to do so using respected conventions. If you are acting like a search engine, the odds of an operator wishing to block you are very low (don't most people want to be scraped by search engines?)
  • Always do some rate limiting, otherwise you will quickly find yourself blocked. On my development laptop through a slow connection, I can scrape a site at a rate of two pages a second, even without using multi_curl. On a real server, that's likely to be much faster - maybe 20? Either way, making that number of requests of one target IP/domain is a great way to find yourself in someone's blocklist. Thus, if you scrape, do it slowly.
  • I maintain a table of HTTP accesses, and have a rule that if I've made a request in the last 5 seconds, I "pause" this scrape and scrape something else instead. I come back to paused scrapes once sufficient time has passed. I may be inclined to increase this value, and hold the concurrent state of a larger number of paused operations in memory (a minimal version of this idea is sketched after this list).
  • If you are scraping a number of sites, one way to maintain performance without sleeping excessively is to interleave the requests you wish to make on a round-robin basis. So, do one HTTP operation each on 50 sites, retain the state of each scrape, and then go back to the first one.
  • If you implement the interleaving of many sites, you can use multi_curl to parallelise your HTTP requests (a sketch also follows this list). I wouldn't recommend using this on a single site, for reasons already stated (the remote server may well limit the number of connections you can separately open to them anyway).
  • Be careful about basing your entire enterprise on the scraping of a single site. If they block you, you're fairly stuck. If your business model can rely on the scraping of many sites, then being blocked by one becomes less of a risk.
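
To illustrate the pause-and-resume rule and the round-robin interleaving above, here is a minimal in-memory sketch. The five-second window comes from the answer; the function name and the queue contents are hypothetical.

    <?php
    $lastAccess = array(); // host => timestamp of our most recent request

    // May we fetch from $host now? If not, the caller should move on to
    // another site and retry this one later.
    function canFetch(array &$lastAccess, $host, $minInterval = 5)
    {
        $now = microtime(true);
        if (isset($lastAccess[$host]) && ($now - $lastAccess[$host]) < $minInterval) {
            return false; // this scrape is "paused"
        }
        $lastAccess[$host] = $now;
        return true;
    }

    // Round-robin over per-site queues, skipping any host that is paused.
    $queue = array(
        'www.babyoye.com' => array('/d/1', '/d/2'), // hypothetical paths
        'example.com'     => array('/'),
    );
    while (array_filter($queue)) {
        foreach ($queue as $host => &$paths) {
            if (empty($paths) || !canFetch($lastAccess, $host)) {
                continue;
            }
            $path = array_shift($paths);
            // ... fetch "http://$host$path" and record the scrape state here ...
        }
        unset($paths);
        usleep(100000); // don't spin the CPU while every host is paused
    }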

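And a sketch of parallelising the interleaved requests with curl_multi, one in-flight request per site as the bullet recommends. The second URL and the user-agent string are placeholders.

    <?php
    $urls = array(
        'http://www.babyoye.com/',
        'http://example.com/', // placeholder for a second target site
    );

    $mh      = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // a unique, honest user agent, per the crawling etiquette above
        curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraperBot/1.0 (+http://example.com/bot)');
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Standard curl_multi pump loop: drive all transfers until none are active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for socket activity instead of busy-looping
        }
    } while ($active && $status == CURLM_OK);

    foreach ($handles as $ch) {
        $html = curl_multi_getcontent($ch);
        // ... parse $html, save per-site state, queue that site's next URL ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
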
Also, it may be cost-effective to install third-party scraping software, or get a third-party service to do the scraping for you. My own research in this area has turned up very few organisations that appear to be capable (and bear in mind that, at the time of writing, I've not tried any of them). So, you may wish to look at these:
