Web crawler parsing PHP/JavaScript links?

Question

I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, JavaScript, etc. On some sites the links point to PHP pages, and when my web crawler tries to navigate to them, it fails. One example is a PHP/JavaScript accordion link page. How would I go about navigating/parsing these links?

Recommended answer

Let's see if I understood your question correctly. I'm aware that this answer is probably inadequate, but if you need a more specific answer I'd need more details.

You're trying to write a web crawler, but it can't crawl URLs that end in .php?

If that's the case, you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on a URI scheme.
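To illustrate the kind of filter I mean, here is a minimal sketch in Python (standard library only, used purely for illustration; the same idea carries over to C#). The function names and the regex are hypothetical, not taken from your crawler. It contrasts a filter that only lets ".html" paths through with one that accepts any http(s) URL and leaves the "is this a page?" decision for later.

```python
import re
from urllib.parse import urlparse

# Hypothetical extension-based filter: only paths ending in .html or .htm pass.
EXTENSION_FILTER = re.compile(r"\.html?$", re.IGNORECASE)


def passes_extension_filter(url):
    """Return True only if the URL path ends in .html or .htm."""
    return bool(EXTENSION_FILTER.search(urlparse(url).path))


def is_crawlable(url):
    """Accept any absolute http(s) URL, whatever its path looks like."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/page.php?id=3",
                "mailto:someone@example.com"):
        print(url, passes_extension_filter(url), is_crawlable(url))
```

With a filter like the first one, every ".php" link is silently dropped even though it most likely serves ordinary HTML.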

In most cases these URLs are just normal HTML, but they could also be a generated image (like a CAPTCHA) or a download link for a 700 MB ISO file - and there's no way to be certain without checking the headers of the HTTP response from that URL.

Note: If you're writing your own crawler from scratch, you're going to need a good understanding of HTTP.

The first thing your crawler is going to see when it gets a URL is the response header, which contains a MIME content type - it tells a browser/crawler how to process and open the data (is it HTML, plain text, an .exe, etc.). You'll probably want to download pages based on the MIME type instead of a URL scheme. The MIME type for HTML is text/html, and you should check for that using the HTTP library you're using before downloading the rest of the content of a URL.
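As a rough sketch of that check, again in Python with only the standard library: the helper names `is_html` and `fetch_if_html` and the User-Agent string are made up for this example, and a real crawler would also want error handling and a size limit. The idea is simply to look at the Content-Type header first and read the body only when it says text/html.

```python
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True if a Content-Type header value denotes text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_if_html(url, timeout=10):
    """Return the response body as bytes, or None if it is not HTML."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # The headers are available here; the body has not been read yet.
        if not is_html(response.headers.get("Content-Type")):
            return None  # e.g. a generated image or a large ISO download
        return response.read()
```

Calling `fetch_if_html("http://example.com/page.php")` would then return the page bytes for an HTML response and `None` for anything else, regardless of what the URL ends in.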

The JavaScript problem

Same as above, except that running JavaScript in the crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need JavaScript?

A different solution
If you're willing to learn Python (or already know it), I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use, and a lot of problems have already been solved, so it could be a good starting point if you're trying to learn more about the technology.
