Extracting table cell text contents with XPath in rows for consumption?


Question

I have HTML along the following lines. I would like to extract the contents of the table cells, but I discovered that some cells occasionally contain embedded divs, and perhaps other oddities that I'm not sure of yet:

<p align="center">
    <img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
  <TD ALIGN=center>Title2</TD>
  <TD ALIGN=center></TD>
  <TD ALIGN=center><div class=redtext>----</div></TD>
  <TD>&nbsp;</TD>
</TR><TR>
  <TD ALIGN=center>Title3</TD>
  <TD ALIGN=center><div class=yellowtext>value</div></TD>
  <TD ALIGN=center><div class=redtext>value</div></TD>
  <TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
  <TD ALIGN=center>Title4</TD>
  <TD ALIGN=center><div class=bluetext>value</div></TD>
  <TD ALIGN=center><div class=redtext>value</div></TD>
  <TD>&nbsp;</TD>
</TR></TABLE>

<blockquote>
    <p class="textstyle">
        Text.
    </p>
</blockquote>

My first impulse was to extract ALL element texts and slice them up programmatically. I would watch for Title1, Title2, etc. to know when a row starts, and if a "----" is found, meaning no value, skip that row and move on. However, I realized that there is probably a better way of handling this with XPath directly.

How could this be solved with XPath so that it essentially gives each cell's final child text content, without having to walk into each div if one exists? Or is there a more XPath-like way to approach this?

Obviously I'm looking for the most flexible solution, one that will not be brittle if other unexpected elements crop up, even though they are unlikely.

Recommended answer

The provided text is not a well-formed XML document, so XPath cannot be applied to it as it stands.
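To see why the original markup is rejected, here is a minimal sketch that feeds one fragment of it to Python's standard-library XML parser. The choice of Python and `xml.etree.ElementTree` is an assumption made for illustration; the question does not say which tooling is in use. The unquoted attribute values (`WIDTH=500`, `ALIGN=center`) are enough to make the parser refuse the input.

```python
import xml.etree.ElementTree as ET

# A fragment of the original HTML: attribute values are unquoted,
# which is legal in old-style HTML but not in XML.
snippet = "<TABLE WIDTH=500 BORDER=1><TR><TD ALIGN=center>Title</TD></TR></TABLE>"

try:
    ET.fromstring(snippet)
    well_formed = True
except ET.ParseError as err:
    # The XML parser stops at the first unquoted attribute value.
    well_formed = False
    print("Not well-formed XML:", err)

print(well_formed)
```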

If you correct it and convert it to a well-formed XML document like the one below, an expression like this might be useful:

/*/TABLE//TD//text()

Or even:

//TABLE//TD//text()
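As a rough illustration of what these expressions select, the sketch below approximates them with Python's standard library. This is an assumption-laden stand-in, not the answer's own method: `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` node test, so the code selects the `TD` elements with `.//TABLE//TD` and then uses `itertext()` to walk every descendant text node. The sample document is a trimmed-down version of the table, made up for the example.

```python
import xml.etree.ElementTree as ET

# A trimmed-down, well-formed sample of the table (illustrative only).
doc = """<html>
  <TABLE>
    <TR>
      <TD ALIGN="center">Title3</TD>
      <TD ALIGN="center"><div class="yellowtext">value</div></TD>
      <TD ALIGN="center">value<SUP>6</SUP></TD>
    </TR>
  </TABLE>
</html>"""

root = ET.fromstring(doc)

texts = []
# ".//TABLE//TD" mirrors //TABLE//TD; itertext() stands in for //text().
for td in root.findall(".//TABLE//TD"):
    for piece in td.itertext():
        if piece.strip():  # drop whitespace-only text nodes
            texts.append(piece.strip())

print(texts)
```

Note that the result is one flat list: the text inside the `div` and the `SUP` is picked up without any special handling, but the row and cell boundaries are lost. A full XPath engine would behave the same way for `//TABLE//TD//text()`.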

Here is a well-formed XML document, constructed from the provided HTML:

<html>
    <p align="center">
        <img src="some_image.gif" alt="Some Title"/>
    </p>
    <TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
        <TR>
            <TD colspan="4" ALIGN="center">
                <b>Title</b>
            </TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title</TD>
            <TD ALIGN="center">date</TD>
            <TD ALIGN="center">value</TD>
            <TD ALIGN="center">value</TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title2</TD>
            <TD ALIGN="center"></TD>
            <TD ALIGN="center">
                <div class="redtext">----</div>
            </TD>
            <TD>&#xA0;</TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title3</TD>
            <TD ALIGN="center">
                <div class="yellowtext">value</div>
            </TD>
            <TD ALIGN="center">
                <div class="redtext">value</div>
            </TD>
            <TD ALIGN="center">value
                <SUP>6</SUP>
            </TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title4</TD>
            <TD ALIGN="center">
                <div class="bluetext">value</div>
            </TD>
            <TD ALIGN="center">
                <div class="redtext">value</div>
            </TD>
            <TD>&#xA0;</TD>
        </TR>
    </TABLE>
    <blockquote>
        <p class="textstyle">         Text.     </p>
    </blockquote>
</html>
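Since the question asks for cell contents grouped by row, with "----" rows skipped, here is one possible sketch of that step over a document shaped like the one above. It assumes Python's standard-library `xml.etree.ElementTree`; the helper name `rows_of_cells` and the shortened sample are hypothetical, introduced only for this example. Each cell's text is taken from all of its descendants, so embedded `div` or `SUP` elements need no special casing.

```python
import xml.etree.ElementTree as ET

def rows_of_cells(xml_text):
    """Return one list of cell strings per TR (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iter("TR"):
        cells = []
        for td in tr.findall("TD"):
            # Join all descendant text, then collapse runs of whitespace
            # (str.split() also treats the non-breaking space as whitespace).
            cells.append(" ".join("".join(td.itertext()).split()))
        rows.append(cells)
    return rows

# Two rows taken from the document above (shortened for the example).
sample = """<html>
  <TABLE>
    <TR>
      <TD ALIGN="center">Title2</TD>
      <TD ALIGN="center"></TD>
      <TD ALIGN="center">
        <div class="redtext">----</div>
      </TD>
      <TD>&#xA0;</TD>
    </TR>
    <TR>
      <TD ALIGN="center">Title3</TD>
      <TD ALIGN="center">
        <div class="yellowtext">value</div>
      </TD>
      <TD ALIGN="center">
        <div class="redtext">value</div>
      </TD>
      <TD ALIGN="center">value
        <SUP>6</SUP>
      </TD>
    </TR>
  </TABLE>
</html>"""

rows = rows_of_cells(sample)
# Skip any row containing the "----" placeholder, as the question describes.
kept = [row for row in rows if "----" not in row]
print(kept)
```

Empty cells and cells holding only a non-breaking space come back as empty strings, so each row keeps its four columns and positions stay meaningful.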
