Screen scraping pages that use CSS for layout and formatting: how to scrape the CSS applicable to the HTML?

Question

I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).

So I have the code working perfectly for scraping the HTML, but my problem is that I want to scrape not just the raw HTML but also the CSS styles used to format the section of the page I am extracting, so that I can display it on a new page with its original formatting intact.

If you are familiar with Firebug, it can display which CSS styles apply to the specific subset of the page you have highlighted. If I could figure out a way to do that, I could just use those styles when displaying the content on my new page. But I have no idea how to do this.

Recommended answer

Today I needed to scrape Facebook share dialogs to use as dynamic preview samples in our app builder for Facebook apps. I took the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I copied their getElementHTML function from lib.js and modified it to do the following:


  • remove class, id and style attributes
  • remove onclick and similar JavaScript handlers
  • remove all data-* attributes
  • remove explicit hrefs and replace them with "#"
  • replace all block-level elements with div and inline elements with span (to prevent inheriting styles on the target page)
  • absolutize relative URLs
  • inline all applied non-default CSS attributes into a brand-new style attribute
  • reduce inline style bloat by accounting for parent/child style inheritance while traversing up the DOM tree
  • indent the output
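As an illustration only, here is a minimal sketch of two of the steps above: dropping declarations that match a default or an inherited parent value, and absolutizing relative URLs. This is not the Firebug patch itself. The function names, the `INHERITED` sample set, and the idea of passing styles in as plain objects are all assumptions made for the example.

```javascript
// Small, non-exhaustive sample of CSS properties that inherit by default
// (an assumption for this sketch; a real tool needs the full list).
const INHERITED = new Set([
  "color", "font-family", "font-size", "font-weight", "line-height", "text-align",
]);

// Keep only the declarations that must be inlined on an element.
// computed:       { property: value } for the element
// defaults:       { property: value } for an unstyled element of the same kind
// parentComputed: { property: value } for the parent, or null at the root
function pruneStyles(computed, defaults, parentComputed) {
  const kept = {};
  for (const [name, value] of Object.entries(computed)) {
    // An inherited property is compared with the parent's value, because the
    // child receives that value automatically; anything else is compared
    // with the default.
    const inherits = parentComputed !== null && INHERITED.has(name);
    const baseline = inherits ? parentComputed[name] : defaults[name];
    if (value !== baseline) kept[name] = value;
  }
  return kept;
}

// Serialize a style object into the text of a style attribute.
function toStyleAttribute(styles) {
  return Object.entries(styles)
    .map(([name, value]) => `${name}: ${value}`)
    .join("; ");
}

// Resolve a possibly relative URL against the URL of the scraped page.
function absolutizeUrl(href, baseUrl) {
  return new URL(href, baseUrl).href;
}
```

In a browser, the `computed` values could be read from `window.getComputedStyle(element)`; how the defaults are obtained (for example from a freshly created element of the same tag) is left open here.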

It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). But it is definitely usable when operated by a web developer who can debug and fix all quirks.

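Problems I've found so far:

  • sometimes the clear CSS property is not emitted (which breaks the layout)
  • :hover and other pseudo-classes cannot be captured this way
  • Firefox keeps only Mozilla-specific CSS properties/values, so for example you lose -webkit-border-radius because the CSS parser skips it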
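Since the styles of :hover and similar states do not show up when inlining computed styles, one possible complement (not part of the approach described above) is to look for such rules in the stylesheets directly so they can be handled by hand. The sketch below assumes the rules are given as objects with a `selectorText` string, as CSSOM style rules have; the function name and the choice of pseudo-classes are illustrative.

```javascript
// Pseudo-classes whose styles only apply in a particular interaction state.
const DYNAMIC_PSEUDO = /:(hover|focus|active|visited)\b/;

// Return the rules whose selector mentions one of those pseudo-classes.
// Rules without a selectorText (for example @media rules) are ignored.
function findPseudoClassRules(rules) {
  return rules.filter(
    (rule) =>
      typeof rule.selectorText === "string" &&
      DYNAMIC_PSEUDO.test(rule.selectorText)
  );
}
```

In a browser, the input could come from `Array.from(sheet.cssRules)` for each sheet in `document.styleSheets`; note that reading `cssRules` of a cross-origin stylesheet throws a security error.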

Anyway, this solution saved me a lot of time. Originally I was manually selecting pieces of their stylesheets and post-processing them by hand. It was slow, boring, and polluted our class namespace. Now I'm able to scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.
