Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?


Question








I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program. (It has more options than the 'Crawl Web' operator in RapidMiner.)

I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and extract the XPath matches from each of them?

I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect the operators in the right order.

I need some input to keep my motivation going. I would like to know which operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input port, and it still needs to crawl. Any help is much appreciated.

Looking forward to your replies.

Solution

Web scraping without saving the HTML pages locally is a two-step process in RapidMiner:

Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference:

  • instead of the Crawl Web operator, use the Process Documents from Web operator. There will be no option to specify an output directory, because the results will be loaded into the ExampleSet.

The ExampleSet will contain the links matching the crawling rules.

Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 on, with the following difference:

  • put the Extract Information subprocess inside the Process Documents from Web operator created in Step 1.

The ExampleSet will contain the links and the attributes matching the XPath queries.
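For readers who want to see the underlying idea outside of RapidMiner, the two steps above boil down to one loop: fetch each page from the URL list, run an XPath query against it, and keep only the matches (never writing the HTML to disk). Here is a minimal sketch of that loop in Python using only the standard library. It is not RapidMiner's implementation; the function name, the inline sample page, and the assumption of well-formed XHTML (ElementTree supports only a limited XPath subset and cannot parse broken HTML) are all illustrative choices.

```python
# Conceptual sketch (not RapidMiner): extract XPath matches from pages
# in memory, without saving the HTML files first.
import xml.etree.ElementTree as ET

def extract_matches(page: str, xpath: str) -> list:
    """Return the text of every node in a well-formed page that matches
    the (limited, ElementTree-flavored) XPath query."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(xpath)]

# In a real run, each page would come from something like
# urllib.request.urlopen(url).read() for every url in the externally
# generated list; an inline page stands in here so the sketch is
# self-contained and runnable.
page = "<html><body><a href='/a'>First</a><a href='/b'>Second</a></body></html>"
print(extract_matches(page, ".//a"))  # -> ['First', 'Second']
```

Because each page is processed and discarded inside the loop, memory use stays proportional to one page rather than to the thousands of pages in the list, which is the same benefit Process Documents from Web gives over Crawl Web plus Process Documents from Files.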

