ASP Question: Parse HTML file?


Question

Hi all,

I'm working on a project where there are just under 1300 course files, these
are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

I need to get the content for the Title, Summary, Topics, Technical
Requirements and lose the Copyright and Terms of use...in addition I need to
squeeze in a new section which will display pricing information and a link
to "Add to cart" etc....

My "plan" (if you can call it that) was to have 1 asp page which can parse
the appropriate HTML file based on the asp page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).

What I then need to do is go through the content of the HTML....this is
where I am currently stuck....

I have pasted an example of one of these pages below - if anyone can suggest
to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?

Any information/suggestions would be most appreciated.

Thanks in advance for your help,

Regards

Rob
Example file:

<html>
<head>
<title>Novell 560 CNE Series: File System</title>
<meta name="Description" content="">
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
</head>
<body class="MlCatPage">
<table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="Logo" colspan="2">
<img class="Logo" src="../images/logo.gif">
</td>
</tr>
<tr>
<td class="Title">
<div class="ProductTitle">
<span class="CoCat">Novell 560 CNE Series: File System</span>
</div>
<div class="ProductDetails">
<span class="SmallText">
<span class="BoldText"> Product Code: </span>
560c04<span class="BoldText"> Time: </span>
4.0 hour(s)<span class="BoldText"> CEUs: </span>
Available</span>
</div>
</td>
<td class="Back">
<div class="BackButton">
<a href="javascript:history.back()">
<img src="../images/back.gif" align="right" border="0">
</a>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="sectiontext">Summary:</h3>
</td>
</tr>
<tr>
<td class="Overview">
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<br>
<h3 class="Sectiontext">Objectives:</h3>
<div class="FreeText">After completing this course, students will
be able to: </div>
<div class="ObjectiveList">
<ul class="listing">
<li class="ObjectiveItem">Explain the relationship of the file
system and login scripts</li>
<li class="ObjectiveItem">Create login scripts</li>
<li class="ObjectiveItem">Manage file system directories and
files</li>
<li class="ObjectiveItem">Map network drives</li>
</ul>
</div>
<br></br>
<h3 class="Sectiontext">Topics:</h3>
<div class="OutlineList">
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Volume Space</li>
<li class="OutlineItem">Examining Login Scripts</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
<li class="OutlineItem">Drive Mappings</li>
<li class="OutlineItem">Login Scripts and Resources</li>
</ul>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Technical Requirements:</h3>
</td>
</tr>
<tr>
<td class="Details">
<div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
connection speed, broadband (256 kbps or greater) connection recommended.
Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
supported.</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Copyright Information:</h3>
</td>
</tr>
<tr>
<td class="Copyright">
<div class="ProductRequirements">Product names mentioned in this
catalog may be trademarks/servicemarks or registered trademarks/servicemarks
of their respective companies and are hereby acknowledged. All product
names that are known to be trademarks or service marks have been
appropriately capitalized. Use of a name in this catalog is for
identification purposes only, and should not be regarded as affecting the
validity of any trademark or service mark, or as suggesting any affiliation
between MindLeaders.com, Inc. and the trademark/servicemark
proprietor.</div>
<br>
<h3 class="Sectiontext">Terms of Use:</h3>
<div class="ProductUsenote"></div>
</td>
</tr>
</table>
<p align="center">
<span class="SmallText">Copyright &copy; 2006 MindLeaders. All rights
reserved.</span>
</p>
</body>
</html>

Recommended answer




Rob Meade wrote:
Hi all,

I'm working on a project where there are just under 1300 course files, these
are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use





If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.

--
Mike Brind
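
Mike's suggestion can be sketched in a few lines. The example below is in Python rather than ASP, purely to illustrate the pattern matching. The class names (`ProductSummary`, `OutlineItem`, `ProductRequirements`) come from the example file; `SAMPLE`, `clean`, `div_text`, `list_items` and `parse_course` are hypothetical names made up for this sketch. It assumes each target div holds plain text with no nested divs, as in the sample.

```python
import re
from html import unescape

# Trimmed-down excerpt of the example file, used only for demonstration.
SAMPLE = """<html><head><title>Novell 560 CNE Series: File System</title></head>
<body>
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
</ul>
<div class="ProductRequirements">JavaScript must be enabled.</div>
<div class="ProductRequirements">Product names may be trademarks.</div>
</body></html>"""

FLAGS = re.IGNORECASE | re.DOTALL


def clean(text):
    """Decode entities and collapse line breaks and runs of whitespace."""
    return " ".join(unescape(text).split())


def div_text(html, css_class):
    """Text of the FIRST <div class="css_class">, or None if absent."""
    pattern = r'<div\s+class="%s"\s*>(.*?)</div>' % re.escape(css_class)
    match = re.search(pattern, html, FLAGS)
    return clean(match.group(1)) if match else None


def list_items(html, item_class):
    """Text of every <li class="item_class"> in document order."""
    pattern = r'<li\s+class="%s"\s*>(.*?)</li>' % re.escape(item_class)
    return [clean(item) for item in re.findall(pattern, html, FLAGS)]


def parse_course(html):
    """Collect the wanted sections; copyright and terms are simply not read."""
    title = re.search(r"<title>(.*?)</title>", html, FLAGS)
    return {
        "title": clean(title.group(1)) if title else None,
        "summary": div_text(html, "ProductSummary"),
        "topics": list_items(html, "OutlineItem"),
        # The copyright text reuses class ProductRequirements, so only the
        # first match (the technical requirements) is taken.
        "requirements": div_text(html, "ProductRequirements"),
    }


course = parse_course(SAMPLE)
print(course["title"])
print(course["topics"])
```

The same patterns should carry over to the VBScript `RegExp` object in classic ASP, with one caveat: as far as I know it has no "dot matches newline" option, so `[\s\S]*?` would stand in for `.*?` there.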



I have pasted an example of one of these pages below - if anyone can suggest to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?





These pages will have been generated via an XSLT transform. The transform
will have made use of these namespaces. However, unless told otherwise,
XSLT will output the xmlns declarations for these namespaces even though no
element belonging to them is output, which is the case here.

That's a long-winded way of saying they don't do anything; ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML; it
would be a lot easier to do what you need. Still, it's a good sign that the
content of the other 1299 pages is likely to be consistent, so Mike's idea
of scanning with RegExp should work.

Anthony.

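Anthony's two points, that the xmlns declarations can be ignored and that the pages are probably consistent, can both be checked mechanically before relying on regex scanning. The following is a rough Python sketch, not ASP code; `strip_xmlns`, `missing_classes` and `REQUIRED_CLASSES` are hypothetical names, and the list of required classes is only a guess based on the one example file.

```python
import re

# Matches one namespace declaration such as xmlns:fo="...", including the
# whitespace (or line break) in front of it.
XMLNS_ATTR = re.compile(r'\s+xmlns:\w+="[^"]*"')

# Classes the scraper depends on; assumed from the single sample page.
REQUIRED_CLASSES = ("ProductSummary", "OutlineItem", "ProductRequirements")


def strip_xmlns(html):
    """Remove the unused xmlns:* declarations left behind by the transform."""
    return XMLNS_ATTR.sub("", html)


def missing_classes(html, required=REQUIRED_CLASSES):
    """Return the required class names that do not occur in this page."""
    return [
        name
        for name in required
        if not re.search(r'class="%s"' % re.escape(name), html, re.IGNORECASE)
    ]


tag = ('<br xmlns:fo="http://www.w3.org/1999/XSL/Format"\n'
       'xmlns:fn="http://www.w3.org/2005/xpath-functions">')
print(strip_xmlns(tag))
print(missing_classes('<div class="ProductSummary">x</div>'))
```

Running `missing_classes` over all the course files once and listing any page with a non-empty result would show how far the "consistent" assumption actually holds.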
"Rob Meade" <ro********@NO-SPAM.kingswoodweb.net> wrote in message
news:PS*****************@text.news.blueyonder.co.uk...
Hi all,

I'm working on a project where there are just under 1300 course files, these are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

I need to get the content for the Title, Summary, Topics, Technical
Requirements and lose the Copyright and Terms of use...in addition I need to squeeze in a new section which will display pricing information and a link
to "Add to cart" etc....

My "plan" (if you can call it that) was to have 1 asp page which can parse
the appropriate HTML file based on the asp page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).

What I then need to do is go through the content of the HTML....this is
where I am currently stuck....

I have pasted an example of one of these pages below - if anyone can suggest to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?

Any information/suggestions would be most appreciated.




[snip]


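Consider displaying the page within an <iframe> inside a page that contains
your own content.

"The iframe element creates an inline frame that contains another document."
http://www.w3schools.com/tags/tag_iframe.asp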
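
As a rough illustration of the iframe idea combined with Rob's querystring plan, the sketch below builds a wrapper fragment for a given course code. It is written in Python for illustration only; `wrapper_page`, the `courses/` folder, the `cart.asp` link and the `Pricing` class are all assumptions invented here, not anything from the original pages.

```python
import re
from html import escape

# Accept only plain alphanumeric course codes such as "560c04", so a value
# from the querystring can never point outside the course folder.
CODE_PATTERN = re.compile(r"[A-Za-z0-9]+")


def wrapper_page(code, price):
    """Return a pricing block plus an iframe for the course page.

    Returns None when the code is not a plain alphanumeric string.
    """
    if not CODE_PATTERN.fullmatch(code):
        return None
    src = "courses/%s.html" % code  # hypothetical location of the HTML files
    return (
        '<div class="Pricing">Price: %s '
        '<a href="cart.asp?add=%s">Add to cart</a></div>\n'
        '<iframe src="%s" width="100%%" height="600"></iframe>'
        % (escape(price), code, src)
    )


print(wrapper_page("560c04", "$99"))
print(wrapper_page("../secret", "$99"))
```

One trade-off of this approach: the iframe shows the supplier's page unchanged, so the copyright and terms-of-use sections would still appear unless the files themselves are edited or parsed.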

