Creating a bot/crawler
Question
I would like to make a small bot that automatically and periodically browses a few partner websites. This would save many of our employees several hours of work.
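The bot must be able to:
- Connect to the websites, log in as a user on some of them, and access and parse specific information on each site.
- Be integrated into our own website and change its settings using our website's data (the user to use…). In the end, it must summarize the parsed information.
- Preferably do all of this from the client side rather than on a server.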
I tried Dart last month and loved it, so I would like to do this in Dart.
But I am a bit lost: can I use a Document class object for each website I want to parse? Can it be headless, or should I use the Chrome/Dartium API to control the web browser (I'd like to avoid this)?
I've been reading this thread: https://groups.google.com/a/dartlang.org/forum/?fromgroups=#!searchin/misc/crawler/misc/TkUYKZXjoEg/Lj5uoH3vPgIJ
Would using https://github.com/dart-lang/html5lib be a good idea in my case?
Recommended answer
There are two parts to this:
- Getting the pages.
- Reading each page into a class that can be parsed.
For the first part, if you are planning to run this client-side, you are likely to run into cross-site issues: your page, served from server X, cannot request pages from server Y unless server Y sets the correct headers.
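See: "CORS with Dart, how do I get it to work?" and "Dart application and cross-domain policy".
Otherwise, the sites in question need to return the correct CORS headers.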
Assuming that you can actually get the pages from the remote site client-side, you can use HttpRequest to retrieve the actual content:
// snippet of code...
new HttpRequest.get("http://www.example.com", (req) {
  // process the req.responseText
});
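For reference, later Dart SDK releases dropped the callback-taking constructor in favour of Future-based static methods. The following is a minimal sketch of the same request, assuming a browser context with `dart:html` and its `HttpRequest.getString` method; the URL is the placeholder from the snippet above, and the request is still subject to the cross-origin restrictions already described.

```dart
import 'dart:html';

Future<void> main() async {
  try {
    // Resolves with the response body once the request completes.
    final body = await HttpRequest.getString('http://www.example.com');
    print('Fetched ${body.length} characters');
  } catch (e) {
    // The browser reports blocked cross-origin requests as generic errors.
    print('Request failed (possibly blocked by CORS): $e');
  }
}
```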
You can also use HttpRequest.getWithCredentials. If the site has some custom login, then you will probably have problems (as you will likely have to HTTP POST the username and password from your site to their server).
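To illustrate the login case, here is a rough sketch of posting a login form from the client, assuming `dart:html`'s `HttpRequest.postFormData`. The `login` helper, the endpoint URL, and the `username`/`password` field names are all hypothetical; a real partner site will define its own form fields and may not allow cross-origin requests at all.

```dart
import 'dart:html';

/// Hypothetical helper: the endpoint and form field names are placeholders,
/// not taken from any real partner site.
Future<String> login(String loginUrl, String user, String password) async {
  final request = await HttpRequest.postFormData(
    loginUrl,
    {'username': user, 'password': password},
    withCredentials: true, // include cookies for the remote site
  );
  return request.responseText ?? '';
}

Future<void> main() async {
  try {
    final page =
        await login('https://partner.example.com/login', 'alice', 'secret');
    print('Received ${page.length} characters');
  } catch (e) {
    // Typically reached when the remote site does not allow the request.
    print('Login request failed: $e');
  }
}
```

With `withCredentials: true`, the remote server must also opt in through its CORS headers, otherwise the browser rejects the response.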
This is where the second part comes in. You can process your HTML using the DocumentFragment.html(...) constructor, which gives you a collection of nodes that you can iterate over and recurse through. The example below shows this for a static block of HTML, but you could use the data returned from the HttpRequest above.
import 'dart:html';

void main() {
  var d = new DocumentFragment.html("""
<html>
  <head></head>
  <body>Foo</body>
</html>
""");

  // print the content of the top-level nodes
  d.nodes.forEach((node) => print(node.text)); // prints "Foo"

  // real-world - use recursion to go down the hierarchy.
}
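The recursion mentioned in the last comment could look roughly like this. It is only a sketch, assuming `dart:html` in a browser; the `walk` function and the sample markup are made up for illustration.

```dart
import 'dart:html';

/// Hypothetical helper: visits [node] and all of its descendants depth-first,
/// printing each element's tag name indented by its depth.
void walk(Node node, int depth) {
  if (node is Element) {
    print('${'  ' * depth}<${node.tagName.toLowerCase()}>');
  }
  for (final child in node.nodes) {
    walk(child, depth + 1);
  }
}

void main() {
  final fragment = DocumentFragment.html(
      '<div><p>Foo <a href="/next">next</a></p></div>');
  walk(fragment, 0);
}
```

Passing the depth along makes it easy to restrict extraction to tags at a particular level of the hierarchy.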
I'm guessing (not having written a spider before) that you'd want to pull out specific tags at specific locations or depths to aggregate as your results, and also add the URLs found in <a> hyperlinks to a queue of pages for your bot to visit next.
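The queue of URLs could be sketched as follows. This is an illustration rather than a complete crawler: `enqueueLinks` is a hypothetical helper, and it assumes the `href` values have already been extracted from the parsed page. It uses only `dart:core` and `dart:collection`, so it does not depend on the browser.

```dart
import 'dart:collection';

/// Hypothetical helper: resolves each href against [base] and queues
/// http(s) URLs that have not been seen before.
void enqueueLinks(
    Uri base, Iterable<String> hrefs, Queue<Uri> queue, Set<Uri> seen) {
  for (final href in hrefs) {
    // Resolve relative links and drop "#fragment" parts, which point
    // into the same page.
    final url = base.resolve(href).removeFragment();
    final isWeb = url.scheme == 'http' || url.scheme == 'https';
    // Set.add returns false when the URL was already recorded.
    if (isWeb && seen.add(url)) {
      queue.add(url);
    }
  }
}

void main() {
  final start = Uri.parse('http://www.example.com/');
  final queue = Queue<Uri>.of([start]);
  final seen = <Uri>{start};

  enqueueLinks(
      start, ['/a', 'b.html#top', 'mailto:x@example.com', '/a'], queue, seen);

  while (queue.isNotEmpty) {
    print(queue.removeFirst());
  }
}
```

Keeping a separate `seen` set prevents the bot from revisiting a page that is linked from several places.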