Extract Content from HTML?


Problem description


Hello,

Are there any utilities to help me extract content from HTML?

I'd like to store this data in a database.

The HTML consists of about 10,000 files with a total size of
about 160 MB. Each file is a thread from a message forum. Each
thread has several contributions. The threads are in linear
order of date posted with filenames such as 000125633.html. The
HTML is marked up with <table> and similar tags. This HTML is very
badly formed with crucial tags missing (such as <TR>, <BODY>,
etc.). There is no coherence to this; no system - sometimes tags
are missing and sometimes they are present. Despite this, the
threads seem to render correctly; such is the forgiving nature
of modern browsers.

Fields for each post are usually identified by an attribute
(usually an attribute of a <TD> or <SPAN>).

Sometimes I need to actually store HTML with the content (for
instance when a post includes a link, colored writing or text
formatted with <PRE> tags).
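
The two requirements above (fields marked by an attribute on a <TD> or <SPAN>, and inner HTML kept for some fields) can be sketched with an event-driven parser. This is a minimal sketch in Python using the standard-library html.parser, not the poster's actual solution. The post does not say which attribute or values mark the fields, so the class values "author", "date" and "body" and the sample markup are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical field markers: the real files may use other attributes/values.
FIELD_CLASSES = {"author", "date", "body"}
KEEP_MARKUP = {"body"}                 # fields stored with their inner HTML
CELL_TAGS = {"td", "span"}
BOUNDARY_TAGS = {"td", "tr", "table"}  # these end a field even if </td> is missing


class PostExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []      # one dict per post, in document order
        self._field = None   # name of the field being captured, if any
        self._tag = None     # tag that opened the current field
        self._buf = []

    def _close_field(self):
        if self._field is None:
            return
        # A repeated field name is taken as the start of the next post.
        if not self.posts or self._field in self.posts[-1]:
            self.posts.append({})
        self.posts[-1][self._field] = "".join(self._buf).strip()
        self._field, self._tag, self._buf = None, None, []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        is_field = tag in CELL_TAGS and cls in FIELD_CLASSES
        if is_field or tag in BOUNDARY_TAGS:
            self._close_field()          # implicit close for missing end tags
        if is_field:
            self._field, self._tag = cls, tag
        elif self._field in KEEP_MARKUP:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._field is None:
            return
        # Nested tags with the same name as the field tag are not tracked.
        if tag == self._tag or tag in BOUNDARY_TAGS:
            self._close_field()
        elif self._field in KEEP_MARKUP:
            self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._field is None:
            return
        # Text is re-escaped when it will be stored next to retained markup.
        keep = self._field in KEEP_MARKUP
        self._buf.append(escape(data, quote=False) if keep else data)

    def close(self):
        super().close()
        self._close_field()              # flush a field left open at end of file


def extract_posts(html_text):
    parser = PostExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.posts


# Made-up sample with missing <tr> and </td> tags.
SAMPLE = """<table>
<td class=author>alice<td class=date>2004-11-02
<td class=body>See <a href="http://example.com/">this</a><pre>x &lt; y</pre>
<tr><td class=author>bob</td><td class=date>2004-11-03
<td class=body>Thanks!
</table>"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE):
        print(post)
```

The parser never needs a well-formed tree: each start tag of a new cell is treated as the end of the previous one, which is how the missing end tags are absorbed.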

My purpose in storing this in a database is to (a) make the
content easier to search and (b) use a more efficient storage
medium.
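
As a rough illustration of goals (a) and (b), the sketch below stores posts in one table and searches the body text. It uses Python's standard-library sqlite3 purely as a stand-in for whatever database is chosen. The table layout, the column names and the functions open_store, save_thread and search are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               thread_file TEXT    NOT NULL,  -- e.g. 000125633.html
               position    INTEGER NOT NULL,  -- order of the post in its thread
               author      TEXT,
               posted      TEXT,
               body        TEXT,              -- may contain retained HTML
               PRIMARY KEY (thread_file, position)
           )"""
    )
    return conn


def save_thread(conn, thread_file, posts):
    """Insert the posts of one thread; posts is a list of dicts."""
    rows = [
        (thread_file, i, p.get("author"), p.get("date"), p.get("body"))
        for i, p in enumerate(posts)
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)", rows
        )


def search(conn, term):
    """Substring search on the body; % and _ in term act as LIKE wildcards."""
    cur = conn.execute(
        "SELECT thread_file, position, author FROM posts "
        "WHERE body LIKE ? ORDER BY thread_file, position",
        (f"%{term}%",),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = open_store()
    save_thread(conn, "000125633.html", [
        {"author": "alice", "date": "2004-11-02", "body": "A Perl tip"},
        {"author": "bob", "date": "2004-11-03", "body": "Thanks!"},
    ])
    print(search(conn, "perl"))
```

Keying each row on the file name plus the post's position preserves the thread order without relying on the post dates being parseable.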

The original database from which these web-forum posts were
taken is no longer available on the web nor does it look like it
ever will be again. Nor can I contact the person who 'owns' it.
If I did contact them, they would be unlikely to release the
data.

Despite this, there are no copyright issues here. Every single
post made to the forum was made using an alias and no forum
poster wants to be identified, nor do any posters wish to claim
"ownership" of their contributions.

Solution

mark4 wrote:

> Are there any utilities to help me extract content from HTML?
> I'd like to store this data in a database.

Looks to me like you'd have to write your own customised program to
extract the data.

To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.
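
For readers who do not use Perl, the same event-driven idea can be sketched in Python, whose standard-library html.parser fires a callback per tag and does not require balanced markup. This is an analogue of the approach, not HTML::Parser itself. The names TextCollector, html_to_text and iter_threads are hypothetical, and the latin-1 decoding is an assumption, since the thread does not state the files' encoding.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextCollector(HTMLParser):
    """Collect visible text; callbacks fire per tag, so missing tags are tolerated."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html_text):
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


def iter_threads(directory):
    """Yield (file name, text) for every .html file, in file-name order."""
    for path in sorted(Path(directory).glob("*.html")):
        # Assumed encoding: latin-1 maps every byte, so decoding cannot fail.
        yield path.name, html_to_text(path.read_bytes().decode("latin-1"))


if __name__ == "__main__":
    print(html_to_text("<table><td>Hello<td>wor&amp;ld<p>bye"))
```

Sorting by file name is enough to process the threads in posting order when the names are zero-padded numbers such as 000125633.html.
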
> Nor can I contact the person who 'owns' it. If I did contact them, they
> would be unlikely to release the data.
>
> Despite this, there are no copyright issues here. Every single post made
> to the forum was made using an alias and no forum poster wants to be
> identified, nor do any posters wish to claim "ownership" of their
> contributions.



Sounds to me like there are *major* copyright issues!

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact


On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
<us**********@tobyinkster.co.uk> wrote:

> mark4 wrote:
>
>> Are there any utilities to help me extract content from HTML?
>> I'd like to store this data in a database.
>
> Looks to me like you'd have to write your own customised program to
> extract the data.



I expected as much.

> To do that, I recommend using Perl. Perl has a module called HTML::Parser
> which is apparently pretty good at extracting information from malformed
> HTML files. What's more, it is generally very good at text handling and has
> decent database modules too.



Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions. Fortunately I already have a top-level
design and it looks pretty simple. I may look into this
Perl module but it will probably be easier to use the microserf
technology with which I'm intimate. I shall probably store
it in MSSQL.

>> Nor can I contact the person who 'owns' it. If I did contact them, they
>> would be unlikely to release the data.
>>
>> Despite this, there are no copyright issues here. Every single post made
>> to the forum was made using an alias and no forum poster wants to be
>> identified, nor do any posters wish to claim "ownership" of their
>> contributions.



> Sounds to me like there are *major* copyright issues!



I can't see what those issues are. Who owns the data? Not the
original forum provider. The data posted to a forum is the copyright
of the original author - no matter what ToS may be specified in
the forum. All those original authors have an alias and don't
actually want to be identified. What I'm doing is no more a
violation of copyright than someone keeping newspaper clippings.

So long as I don't republish it.


mark4 wrote:

> On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
> <us**********@tobyinkster.co.uk> wrote:
>
>> To do that, I recommend using Perl. Perl has a module called HTML::Parser
>> which is apparently pretty good at extracting information from malformed
>> HTML files. What's more, it is generally very good at text handling and has
>> decent database modules too.


Mark's right. I don't do the whole "language cheerleader" thing - but for
this particular problem, Perl's an ideal fit.

> Thanks. Being a microserf, I don't normally code in Perl but I
> may look into this. It's either that or WSH JavaScript with
> its regular expressions.



There's Perl for Windows, you know. It integrates nicely with WSH too.

<http://www.activestate.com>

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org

