Tool needed: to strip some HTML


Question

G'day

I have some pages written by a bot and much of the code does not
concern the visible content on the site. I'd like to strip all the
codes that do not affect or influence the visible stuff (although I'd
like to keep the nested tables, if possible). Some of this can be
stripped using Search/Replace, but some of it contains codes which
differ from page to page.

How many pages? About 750, totalling 80 megabytes of data, which I'm
hoping to reduce when I "clean" the code.

Do you know of any tool that can do this? A tool that can be set to
strip all codes except HTML 2.0 would, for example, also be useful
except I'll lose the nested tables (which is not a *gigantic*
loss...).

I tried converting everything to TXT but most HTML2TXT programs
deliver very poor results. I did find some code strippers that
attempt to maintain the tables layout (but that is even less
preferred). If the stuff is gonna be in plaintext, then there should
be an intelligent way of dealing with nested tables.

Any advice, people? What tool can you recommend? Preferably for W95x
(but Linux would be fine too as long as it is newbie-friendly),
preferably freeware (or shareware, but I don't intend buying).

Recommended answer

On Wed, 07 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry wrote:
> Any advice, people? What tool can you recommend?




If you only knew Perl...

--

..


On 7 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry
<ca******@websurfer.co.za> wrote:
> G'day
>
> I have some pages written by a bot and much of the code does not
> concern the visible content on the site. I'd like to strip all the
> codes that do not affect or influence the visible stuff (although I'd
> like to keep the nested tables, if possible). Some of this can be
> stripped using Search/Replace, but some of it contains codes which
> differ from page to page.
>
> How many pages? About 750, totalling 80 megabytes of data, which I'm
> hoping to reduce when I "clean" the code.
>
> Do you know of any tool that can do this? A tool that can be set to
> strip all codes except HTML 2.0 would, for example, also be useful
> except I'll lose the nested tables (which is not a *gigantic*
> loss...).




HTML Tidy might help a lot. It can be set to 'clean' the pages; it will
then drop all presentational markup.

http://tidy.sf.net/

--
Rijk van Geijtenbeek

The Web is a procrastination apparatus:
It can absorb as much time as is required to ensure that you
won't get any real work done. - J.Nielsen
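
Where a ready-made tool does not fit, the idea of keeping structural tags and dropping everything else can also be scripted. The following is a minimal sketch in Python using only the standard library's `html.parser`. It is not how HTML Tidy works internally, and the whitelists `KEEP_TAGS` and `KEEP_ATTRS` are assumptions chosen for illustration (roughly HTML 2.0 plus table tags, so that nested tables survive).

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists for illustration; adjust to the pages at hand.
KEEP_TAGS = {
    "html", "head", "title", "body", "h1", "h2", "h3", "h4", "h5", "h6",
    "p", "br", "hr", "a", "img", "ul", "ol", "li", "dl", "dt", "dd",
    "pre", "blockquote", "b", "i", "em", "strong",
    "table", "tr", "td", "th",
}
KEEP_ATTRS = {"href", "src", "alt", "colspan", "rowspan"}
SKIP_CONTENT = {"script", "style"}  # drop these tags and their contents
VOID_TAGS = {"br", "hr", "img"}     # tags that take no closing tag


class Stripper(HTMLParser):
    """Re-emit only whitelisted tags and attributes, plus visible text."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&{};".format(name))

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#{};".format(name))


def strip_html(source):
    """Return source with non-whitelisted markup removed.

    Comments and doctype declarations are dropped as well, because the
    parser's default handlers for them do nothing.
    """
    parser = Stripper()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_html(text)` on the contents of each page and writing the result back out would cover the batch case. Testing on copies of a few pages first seems prudent, since bot-generated markup may rely on tags that this whitelist drops.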

"Voetleuce en f?nsievry" <ca******@websurfer.co.za> wrote in message
news:f0**************************@posting.google.com
> I have some pages written by a bot and much of the code does not
> concern the visible content on the site. I'd like to strip all the
> codes that do not affect or influence the visible stuff (although I'd
> like to keep the nested tables, if possible). Some of this can be
> stripped using Search/Replace, but some of it contains codes which
> differ from page to page.




Can you give a URL for a sample? The answer will depend on where the text
to delete is located in your HTML pages...
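
One way to find out where the unwanted markup sits, before choosing a tool, is to count which tags and attributes actually occur across the pages. The following is a minimal sketch in Python using the standard library; the names `TagSurvey` and `survey` are hypothetical, and it only counts start tags, so it says nothing about how many bytes each one accounts for.

```python
from collections import Counter
from html.parser import HTMLParser


class TagSurvey(HTMLParser):
    """Count start tags and tag@attribute pairs in one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs["{}@{}".format(tag, name)] += 1


def survey(pages):
    """Return (tag counts, attribute counts) summed over an iterable of HTML strings."""
    tags, attrs = Counter(), Counter()
    for text in pages:
        parser = TagSurvey()
        parser.feed(text)
        parser.close()
        tags += parser.tags
        attrs += parser.attrs
    return tags, attrs
```

Feeding it the text of each page and looking at `tags.most_common()` gives a rough picture of which markup dominates, which in turn suggests whether plain search/replace, HTML Tidy, or a custom script is the better fit.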

