What does HTML Parsing mean?


Question

I have heard of HTML Parser libraries like Simple HTML DOM and HTML Parser. I have also heard of questions containing HTML Parsing. What does it mean to parse HTML?

Solution

Contrary to what Spudley said, to parse basically means to resolve (a sentence) into its component parts and describe their syntactic roles.

According to Wikipedia, parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).

In your case, HTML parsing basically means taking in HTML code and extracting relevant information from it, such as the title of the page, its paragraphs, its headings, links, bold text, etc.
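To make the "extracting relevant information" part concrete, here is a minimal sketch using the `html.parser` module from the Python 3 standard library. The class name `PageInfoParser`, the function `extract_page_info` and the sample markup are made up for illustration; real pages are messier, and a dedicated library may be a better fit for them.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Hypothetical helper: collects the page title and all link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_title:
            self.title += data


def extract_page_info(html):
    """Return (title, list of href values) for the given HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.links


SAMPLE = ('<html><head><title>Test</title></head>'
          '<body><h1>Parse me!</h1>'
          '<a href="/about">About</a> <a href="https://example.com">Ex</a>'
          '</body></html>')

title, links = extract_page_info(SAMPLE)
print(title)
print(links)
```

The parser itself only reports tags and text as it meets them; deciding which of those pieces count as "relevant" is up to the handler methods.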

Parsers:

A computer program that parses content is called a parser. In general, there are two kinds of parsers:

Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.

Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.
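As a rough illustration of the top-down approach, here is a small recursive descent parser sketch in Python. The grammar (simple arithmetic with `+ - * /` and parentheses) and all names (`tokenize`, `Parser`, `parse_expression`) are assumptions chosen for this example; it does not parse HTML, and it does not show the bottom-up (shift-reduce) approach.

```python
import re

# Assumed toy grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens, left to right."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """One method per grammar rule, starting from the top rule (expr)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse_expression(text):
    """Return a nested tuple (operator, left, right) describing the input."""
    return Parser(tokenize(text)).parse()


print(parse_expression("1 + 2 * (3 - 4)"))
```

The parser starts from the start symbol (`expr`) and works down towards the individual tokens, which is what makes it top-down. A bottom-up parser would instead begin with the tokens and combine them into larger and larger pieces until it reaches the start symbol.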

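A few example parsers:

Top-down parsers: recursive descent parsers, LL parsers

Bottom-up parsers: LR parsers (including SLR and LALR), operator-precedence parsers

Example parser: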

Here's an example HTML parser in Python:

# Python 3: HTMLParser lives in html.parser
# (in Python 2 it was "from HTMLParser import HTMLParser")
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # called for every opening tag, e.g. <html>
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        # called for every closing tag, e.g. </html>
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        # called for the text between tags
        print("Encountered some data  :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here's the output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
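The events above can also be assembled into a tree, which is closer to the idea of resolving the input into its component parts. The following is only a sketch: `TreeBuilder` and `build_tree` are hypothetical names, nodes are plain dicts, and the code assumes well-formed input in which every start tag has a matching end tag (an unclosed `<br>`, for example, would nest incorrectly).

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Hypothetical helper: turns start/end/data events into nested dicts."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self._stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._stack[-1]["children"].append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: the end tag closes the innermost open element
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1]["children"].append(data)


def build_tree(html):
    """Return the root node of a simple tree for the given HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = build_tree('<html><head><title>Test</title></head>'
                  '<body><h1>Parse me!</h1></body></html>')
html_node = tree["children"][0]
print(html_node["tag"])
print([child["tag"] for child in html_node["children"]])
```

Libraries such as the ones mentioned in the question do this kind of tree building for you and cope with malformed markup as well.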
