Detecting HTML5 paragraph endings (HTML serialization related)


Question


    I would like to write a Perl program that makes reasonable HTML5 markup in $_ better ("more valid"; I know, it sounds like "more pregnant"). Specifically, I want to try to properly close paragraphs with </p> tags, just where browsers would close them. It's a step on the way to converting HTML to XHTML. This helps me in subsequent text analysis of full paragraphs.

    The HTML5 spec says that

    1. A p element must have a start tag.

    2. A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element,

    3. or if there is no more content in the parent element and the parent element is not an a element.

    Problems:

    1. I believe it is possible to see paragraphs where this is not true. HTML browsers infer and insert <p> by themselves. For example, <h1>HEADER</h1> Now is… will insert a <p> just before the Now is…. Am I mistaken?

    2. Let’s assume the HTML content creator has already inserted the <p> correctly. I now need to search forward until it ends. Detecting an opening from the list of 26 tags which close a paragraph is easy.

    3. But how can I detect if there is more content in the parent paragraph? Can I just search for the next </…> from the set of the above 26 tags, or do I need to code a full stack machine (assuming all contents in paragraphs themselves are valid XHTML) to detect the end of enclosing container?

    Thanks to @Palec I now understand that paragraphs are an odd concept in HTML. Try this:

    <!DOCTYPE html>
    <html>
    <head>
    <style>
        p { color: blue; }
        p:before { content:"[SP]"; }
        p:after { content:"[EP]"; }
    </style>
    </head>
    
    <body>
    
    l0
    
    <h1> h1 </h1>
    
    l0
    
    <p> para
    
    <p> para </p>
    
    l0
    
    <p>para
    <ol>
    <li> l0 <p> para </li>
    </ol>
    l0
    
    </body>
    </html>
    

    This shows that not all text is at least a paragraph. I did confuse it with the LaTeX concept… and thought that whatever was at "level 0" was a paragraph by default. It is not.

    Answer

    Three concepts of paragraph

    HTML 5 has two separate concepts: the p element and the paragraph. I will call the latter a structural paragraph. In the real world I found at least two other related concepts: the logical paragraph and the typographical paragraph.

    The p element is clear. You know it; you already quoted its description from the spec.

    The (structural) paragraph is a somewhat strange concept to me. Maybe it is used by screen readers or the like. Its definition basically says that it is a non-empty run of phrasing content not interrupted by other types of content (not taking a, ins, del and map into account).

    The logical paragraph is what I think human beings consider a paragraph. It is a unit of text that carries a single thought. When another (probably related) thought begins, the paragraph breaks and a new one begins. It is composed of a sequence of sentences.

    Each sentence not only has its linguistic structure but can also contain formatting. Formatting is not limited to what HTML calls phrasing content; I would add at least multi-line preformatted code snippets, lists, math formulas (possibly spanning multiple lines, like display math in TeX) and anything else that can be used in the middle of a sentence or between sentences without breaking the train of thought. This big difference between the logical paragraph and the other two concepts can be seen in my question "List or longer code snippet inside paragraph".

    The typographical paragraph consists of a sequence of lines, not sentences, and can contain whatever the typographical system can handle. I originally thought it was exactly the same concept as the logical paragraph, but it is not.

    It came to my mind when thinking about TeX. You may know it from LaTeX, which is just a large set of definitions for TeX and has the same notion of paragraph. Content is buffered until \par (or an empty line, which translates to \par internally) is met; then it is flushed to the output as a single paragraph. What looks like one (logical) paragraph can internally be several paragraphs, as this mechanism has to be used to implement some more complicated behavior of the typesetting algorithm. From this point of view it more closely resembles a structural paragraph.

    Answers to your questions

    1. A (structural) paragraph begins after the h1 element if just a text node is present. But this is not a p element: it cannot be styled in CSS using the p selector, it is not present in the DOM tree of the document, and so on.

      There are certain places where element tags are not in the markup but the elements are still created. This is the case for those elements whose start tag can be omitted: html, head, body, colgroup and tbody. (At least tbody used to behave differently in HTML 4; this behavior comes from XHTML. In HTML it just need not exist.) The p element is not one of them, however.

    2. If the content creator did not insert <p> correctly (so the input was not valid HTML 5), how would you be supposed to correct it? Once the markup is incorrect, you cannot generally assume anything about it. Also, omitting the end tag is not incorrect! There is not really a question in this list item, so going further…

    3. Are you really assuming valid XHTML 5 (i.e. the XML serialization of HTML 5, specifically with all tags closed)? OK, then you need to track document tree depth (or keep a stack if you need the data in structured form). Otherwise you would have to implement full HTML 5 parsing, as there might be, e.g., an option with an omitted end tag inside (within a select). This would break your depth tracking.

      The paragraph closes when one of the named elements starts, when a </p> end tag is met, or when the end of the parent element is met. Mmmm. Even when you assume valid XHTML only inside the paragraph, you still need to implement closing rules for all elements to be able to detect the end of the parent element… This will not be easy.
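The three closing conditions can be illustrated with a small stack, purely as a sketch of the bookkeeping and not as a parser. It assumes the input has already been split into tokens, that every element other than p is explicitly closed (void elements such as hr included, as a start/end pair), and that inline elements inside a paragraph are closed before the paragraph ends. The names close_paragraphs and serialize and the token format are made up for this illustration.

```perl
use strict;
use warnings;

# Start tags that implicitly end an open p (the 26 names quoted from the spec).
my %CLOSES_P = map { $_ => 1 } qw(
    address article aside blockquote dir div dl fieldset footer form
    h1 h2 h3 h4 h5 h6 header hr menu nav ol p pre section table ul
);

# Hypothetical helper: takes tokens [type, value] where type is 'start',
# 'end' or 'text', and returns a new token list with missing </p> added.
sub close_paragraphs {
    my @tokens = @_;
    my (@stack, @out);
    for my $tok (@tokens) {
        my ($type, $value) = @$tok;
        if ($type eq 'start') {
            # Condition 1: one of the named elements starts inside an open p.
            if (@stack && $stack[-1] eq 'p' && $CLOSES_P{$value}) {
                pop @stack;
                push @out, ['end', 'p'];
            }
            push @stack, $value;
        }
        elsif ($type eq 'end') {
            # Condition 3: the parent element ends while p is still open.
            if (@stack && $stack[-1] eq 'p' && $value ne 'p') {
                pop @stack;
                push @out, ['end', 'p'];
            }
            # Condition 2 (an explicit </p>) and all other end tags.
            pop @stack if @stack && $stack[-1] eq $value;
        }
        push @out, $tok;
    }
    # End of input counts as the end of the parent as well.
    push @out, ['end', 'p'] if @stack && $stack[-1] eq 'p';
    return @out;
}

# Hypothetical helper: turns a token list back into markup.
sub serialize {
    return join '', map {
        my ($type, $value) = @$_;
        $type eq 'start' ? "<$value>" : $type eq 'end' ? "</$value>" : $value;
    } @_;
}

my @item = (['start', 'li'], ['text', 'l0'], ['start', 'p'],
            ['text', 'para'], ['end', 'li']);
print serialize(close_paragraphs(@item)), "\n";   # prints <li>l0<p>para</p></li>
```

The sketch only works because of the stated assumptions. As soon as other end tags may be omitted (option, li, td and so on), the stack no longer reflects the real tree, which is the reason a full HTML 5 parser is the safer tool.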

    HTML to XML serialization conversion of HTML 5

    In a comment you said that converting HTML 5 to XHTML 5 is your use case.

    Do not use regexes!

    Regexes were not designed for tasks as complicated as parsing HTML. Anything you try would be just a heuristic. True regular expressions cannot parse HTML at all, because HTML is not a regular language. Let's put aside that perlre is much more powerful; with great power comes great responsibility, and you should not use that power where it is the wrong tool. There is an extremely famous answer to a question on this topic here on SO, a real piece of art. Jeff Atwood wrote more on the topic, quoting this answer at the beginning and explaining the importance of understanding your tools in the rest of the article.

    I believe that a text-level approach to this goal is bad. HTML is often referred to as tag soup and, in contrast with what Wikipedia says, I have met this term used in reference to the text-level approach to creating and amending HTML in general (namely document.write() and element.innerHTML).

    By the way, this is one thing that XHTML solved really well, by abolition. In JavaScript you can't use document.write() with XHTML. If it works, you are using an HTML parser with an XHTML document; serve it with the Content-Type HTTP header set to application/xhtml+xml; charset=utf-8 instead of the text/html MIME type you use now.

    Use DOM

    The Clean Solution™ is DOM.

    I believe you should implement (or use someone else's implementation of) an HTML parser, get the DOM tree, and write a serializer to XHTML. If the input document is not valid, refuse to process it. Or add switches to your program that tell it how to fix certain errors that the parsing algorithm is not designed to handle. There could be many ways.

    I am not sure which parts of the spec you are free to ignore if you are not interested in them. The parsing algorithm is standardized, and the error handling is specified too. You could find a shortcut where you don't need to create a part of the DOM tree and just leave the corresponding part of the input unparsed, but you have to be sure that you continue parsing at the right position in the input. This could get messy and is definitely error-prone. Therefore I recommend that you not do that.

    Practical solution

    In practice, it seems you can use at least two existing modules.

    Mojolicious is a web framework that contains the Mojo::DOM module. If you do not need DOM manipulation and just want parsing and serialization, you could use the underlying Mojo::DOM::HTML. HTML can be parsed by Mojo::DOM using my $dom = Mojo::DOM->new($html_markup);, the resulting DOM object can be set to use XML serialization with $dom->xml(1); and the serialization can be obtained as $xhtml_markup = "$dom"; or $xhtml_markup = $dom->to_string();. From the Mojo::DOM POD: "Mojo::DOM is a minimalistic and relaxed HTML/XML DOM parser with CSS selector support. It will even try to interpret broken XML, so you should not use it for validation." There is an example of its use in the answer by amon. You may want to use this solution if you already use Mojolicious; otherwise installing a whole big framework is overkill for this job.

    The HTML::HTML5::Parser and HTML::HTML5::Writer modules can be used for parsing and serialization of HTML 5 respectively. They seem to have only a few dependencies. Nice code using them can be found in the answer by tobyink, their author. This should be the solution for those not already using Mojolicious.
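Put together, the Mojo::DOM calls mentioned above might look like the following minimal sketch. It assumes Mojolicious is installed; the wrapper name html_to_xhtml and the sample markup are invented for illustration, and the exact output depends on how relaxed the installed Mojo::DOM version is with the input.

```perl
use strict;
use warnings;
use Mojo::DOM;   # not a core module; ships with Mojolicious

# Hypothetical wrapper around the three calls named in the answer.
sub html_to_xhtml {
    my ($html_markup) = @_;
    my $dom = Mojo::DOM->new($html_markup);   # parse as HTML
    $dom->xml(1);                             # switch to XML serialization
    return $dom->to_string;                   # same as "$dom"
}

# Sample input with omitted </p> and </li> end tags.
print html_to_xhtml('<p>first<p>second<ol><li>item</ol>'), "\n";
```

Since the POD itself warns that the parser is relaxed, this is a conversion aid rather than a validator: invalid input is interpreted somehow instead of being rejected.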
