Dynamic scraping and parsing


Problem description


I know a good amount of PHP, Js, CSS, and OOP, and have recently honed my regex skills by using the vim editor's netrw and elinks plugins to download a series of web pages (about a million lines) that were parsed and made ready for uploading into my website. I work on a linux/ubuntu system with a localhost setup, and this particular project is implementing the Concrete5 CMS, which is written in PHP.


Seeing the benefits of scraping and parsing information, I would like my site to perform this function dynamically, though on a much smaller scale; for example, enabling a new user to transfer their personal information from another website into mine, which will typically be behind a secure connection (though not always) and a password.


Question: What is the best tool (scripting language) to use for this? I do not know either Perl or Ruby, but I believe either one would be a good choice. I have also heard of AWK and SED. I'm sure I can figure out HOW to do it once I begin studying the language; I would really appreciate some experienced input on which language would be the best investment of my learning time.

Thanks for your help.

Answer


Perl has two very nice ready-to-use tools for scraping that I know of: Web::Scraper and Scrappy. Both are able to work with CSS3 and XPath selectors for identifying elements; Scrappy builds on Web::Scraper and adds integrated scraping and crawling, with a nice URL-matching system to select the links to follow to gather more information (while Web::Scraper works with a single document). It moves between pages using the well-established and robust WWW::Mechanize library, which is smart, reliable, and aware of authentication and cookies.
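To make the selector-driven extraction pattern concrete without assuming Perl knowledge, here is a rough standard-library Python sketch of the kind of thing Web::Scraper expresses declaratively (name an element, name a field). The class name and sample HTML are invented for illustration; this is not the API of either Perl module:

```python
from html.parser import HTMLParser

class LinkAndTitleScraper(HTMLParser):
    """Collects the page <title> and every <a href> value -- a crude
    stand-in for what Web::Scraper does with CSS3/XPath selectors."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()

html = ('<html><head><title>Demo</title></head>'
        '<body><a href="/a">A</a><a href="/b">B</a></body></html>')
scraper = LinkAndTitleScraper()
scraper.feed(html)
# scraper.title == "Demo", scraper.links == ["/a", "/b"]
```

The Perl modules hide this event-driven bookkeeping behind selectors, which is exactly why they are worth reaching for instead of hand-rolling a parser.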


If you want to work at a lower level yourself, there are a lot of good tools to build on, including the aforementioned WWW::Mechanize, HTML::TreeBuilder, HTML::TreeBuilder::XPath, HTML::TableExtract, and more.
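The job these lower-level table tools do (HTML::TableExtract in particular) is to turn `<table>` markup into rows of cell text. A minimal standard-library Python sketch of that idea, purely for illustration and not the Perl module's API:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Accumulates each <tr> as a list of its <td>/<th> cell texts,
    similar in spirit to what HTML::TableExtract does in Perl."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = None     # cells of the row being built
        self._cell = None    # text fragments of the cell being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

html = ("<table><tr><th>Name</th><th>Age</th></tr>"
        "<tr><td>Ada</td><td>36</td></tr></table>")
t = TableExtractor()
t.feed(html)
# t.rows == [["Name", "Age"], ["Ada", "36"]]
```

Real pages nest tables and sprinkle attributes everywhere, which is why a battle-tested module beats a sketch like this for production use.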
