Data crawler or something else


Question

I'm looking for something, but I don't know exactly how it can be done. I don't have deep knowledge of crawling, scraping, and so on, but I believe that is the kind of technology I'm looking for.

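  1. I have a list of around 100 websites that I want to monitor continuously, at least once every 3 or 4 days. On these websites I would look for matches against some logical rules, such as: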

Text contains 'ABC' AND doesn't contain 'BCZ', OR text contains 'XYZ' AND doesn't contain 'ATM', and so on.

  2. The tool would have to look for matches on these websites in:

  • Web pages
  • DOC files
  • DOCX files
  • XLS files
  • XLSX files
  • TXT files
  • RTF files
  • PDF files
  • RAR and ZIP archives

The matches would have to be incremental (I only want the most recent ones, from the previous X days).

Most importantly, around 40 of these 100 websites require user authentication (which I already have).

Whenever there's a match, I'd like to download:

  • The file
  • The link
  • The date/time
  • A report of the match

I've been playing around with tools like import.io, but I haven't figured out how to do this properly.

Does anyone know exactly which kind of technology I am looking for? Who (what kind of specialist or programmer) could build this for me? Is it too hard for a programmer who understands data crawling to build?

Sorry for the long post.

Recommended answer

For the 60 websites that don't require authentication:

You can use a tool like backstitch to mark the websites you want to monitor and get an interactive thumbnail feed of pages whose content contains the keywords you want. Backstitch supports Boolean operators (the AND / OR functionality you described) and has an API that may allow you to export the results in the format you need.

Their support team (and CEO) have been very helpful in the past in describing how their API can be used for custom search cases. Good luck!
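
If the matching ends up being built in-house instead, the rule from the question ("contains 'ABC' AND doesn't contain 'BCZ', OR contains 'XYZ' AND doesn't contain 'ATM'") and the "previous X days" filter could be sketched roughly as below. This is only a minimal illustration in Python, not something Backstitch provides. The names `Clause`, `text_matches`, and `is_recent` are hypothetical, and plain case-sensitive substring matching is an assumption; text extraction from PDF, DOC, or archive files and authenticated fetching are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Clause:
    """One AND-group: the text must contain `include` and must not contain `exclude`."""
    include: str
    exclude: str

    def matches(self, text: str) -> bool:
        # Assumption: case-sensitive substring test, no word boundaries.
        return self.include in text and self.exclude not in text


# Clauses are OR-ed together, mirroring the example rule in the question.
RULES = [Clause("ABC", "BCZ"), Clause("XYZ", "ATM")]


def text_matches(text: str, rules=RULES) -> bool:
    """Return True if at least one clause accepts the text."""
    return any(rule.matches(text) for rule in rules)


def is_recent(seen_at: datetime, now: datetime, max_age_days: int) -> bool:
    """Keep only items first seen within the last `max_age_days` days."""
    return now - seen_at <= timedelta(days=max_age_days)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    first_seen = now - timedelta(days=2)
    page_text = "release notes mention ABC only"
    if text_matches(page_text) and is_recent(first_seen, now, max_age_days=4):
        print("match:", page_text)
```

Keeping the rules as data (a list of clauses) rather than hard-coded conditions means new keyword pairs can be added without touching the matching logic.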
