HTML::TableExtract: applying the right attribs to specify the attributes of interest


Question

I tried to run the following Perl script on the HTML shown further below. My problem is how to define the correct hash reference for attribs, so that it specifies the attributes of interest in the HTML <table> tag itself.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;


my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 ); 

$table->parse($html);
foreach my $row ($table->rows) 

sub cleanup {
    for ( @_ ) {
        s/\s+//;
        s/[\xa0 ]+\z//;
        s/\s+/ /g;
    }
}

{ print join("\t", @$row), "\n"; }

I want to apply this code to the HTML document shown further below.

My first approach was to use the columns method, but I cannot figure out how to use it on the HTML file below. My intuition says it should be something like the following, but my intuition is wrong:

foreach my $column ($table->columns) { 
    print join("\t", @$column), "\n"; 
}

The HTML::TableExtract documentation doesn't shed much light (for me, anyway).

I can see in the module's code that the columns method belongs to HTML::TableExtract::Table, but I can't figure out how to use it. I would appreciate any help.

I have a very small document containing tables that I want to parse with the HTML::TableExtract module. I am trying to search for keywords in the HTML so that I can use them in attribs and print only the necessary data.

I looked on CPAN but could not find out how to search the parsed result for particular keywords. One way to do this would be HTML::TableExtract; the other would be to parse with HTML::TokeParser, with which I have very little experience.

Either way, I need to do this parsing: I want to write the parsed tables to a text file or, even better, store them in a database. The problem is that I can't find any way to search through the resulting parsed table and get the necessary data.

The HTML:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">

<link rel="stylesheet" href="jspsrc/css/bp_style.css" type="text/css">

<title>Weitere Schulinformationen</title>
</head>

<body class="bodyclass">
<div style="text-align:center;"><center>
<!-- <fieldset><legend> general information  </legend>
-->
<br/>

<table border="1" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_result_tab_info'>
<!-- <table border="0" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_search_info'>
-->  
  <tr>
    <td width="100%" colspan="2" class="ldstabTitel"><strong>data_one </strong></td>
  </tr>
  <tr>
    <td width="27%"><strong>data_two</strong></td>
    <td width="73%">&nbsp;116439
  </td>
  </tr>
  <tr>
    <td width="27%"><strong>official_description</strong></td>
    <td width="73%">the name </td>
  </tr>
  <tr>
    <td width="27%"><strong>name of the street</strong></td>
    <td width="73%">champs elysee</td>
  </tr>
  <tr>
    <td width="27%"><strong>number and town</strong></td>
    <td width="73%"> 75000 paris </td>
  </tr>
  <tr>
    <td width="27%"><strong>telefon</strong></td>

    <td width="73%">&nbsp;000241 49321
</td>
  </tr>
  <tr>
    <td width="27%"><strong>fax</strong></td>
    <td width="73%">&nbsp;000241 4093287
</td>
  </tr>
  <tr>
  <td width="27%"><strong>e-mail-adresse</strong></td>
  <td width="73%">&nbsp;<a href=mailto:1111116439@my_domain.org>1222216439@site.org</a>
</td>
  </tr>
  <tr>
    <td width="27%"><strong>internet-site</strong></td>
    <td width="73%">&nbsp;<a href=http://www.thesite.org>http://www.thesite.org</td>
 </tr>
<!--  
<tr>
    <td width="27%">&nbsp;</td>
    <td width="73%" align="right"><a href="schule_aeinfo.php?SNR=<? print $SCHULNR ?>" target="_blank">
    [Schuldaten &auml;ndern]&nbsp;&nbsp;</a>
</tr>
</td> -->
<tr>
  <td width="27%">&nbsp;</td>
  <td width="73%">the department</td>
 </tr> 

  <tr>
    <td width="100%" colspan=2><strong>&nbsp;</strong></td>
 </tr> 
 <tr>
    <td width="27%"><strong>number of indidviduals</strong></td>
    <td width="73%">&nbsp;192</td>
<tr>
    <td width="100%" colspan=2><strong>&nbsp;</strong></td>
   </tr>
  <!-- if (!fsp.isEmpty()){
 ztext = "&nbsp;";

 int i = 0;
 Iterator it = fsp.iterator();
 while (it.hasNext()){
  String[] zwert = new String[2];
  zwert = (String[])it.next();

  if (i==0){
   if (zwert[1].equals("0")){
    ztext = ztext+zwert[0];
   }else{
    ztext = ztext+zwert[0]+" mit "+zwert[1];
    if (zwert[1].equals("1")){
     ztext = ztext+" Sch&uuml;ler";
    }else{
     ztext = ztext+" Sch&uuml;lern";
    }
   } 
   i++;
  }else{
   if (zwert[1].equals("0")){
    ztext = ztext+"<br>&nbsp;"+zwert[0];
   }else{
    ztext = ztext+"<br>&nbsp;"+zwert[0]+" mit "+zwert[1];
    if (zwert[1].equals("1")){
     ztext = ztext+" Sch&uuml;ler";
    }else{
     ztext = ztext+" Sch&uuml;lern";
    }
   } 
  }  
 } 

-->





</table>
<!--  </fieldset>  -->
<br>

</body>
</html>

Thanks for any and all help.

Recommended answer

You need to provide something that uniquely identifies the table in question. This can be the content of its headers or its HTML attributes. In this case there is only one table in the document, so you don't even need to do that. But if I were to provide anything to the constructor, I would provide the class of the table.
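As a rough illustration of the point about identifying the table, the sketch below compares three constructor settings on a small, made-up stand-in for the page in the question (the $html string and the %selectors and %matched names are assumptions for this example, not part of the original post). It relies on depth and count being zero-based in HTML::TableExtract, which would explain why the depth => 1, count => 1 values from the question find nothing in a document whose only table is at the top level.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the page in the question: one top-level table.
my $html = <<'HTML';
<html><body>
<table class="bp_result_tab_info">
  <tr><td>telefon</td><td>000241 49321</td></tr>
  <tr><td>fax</td><td>000241 4093287</td></tr>
</table>
</body></html>
HTML

# Three ways to select a table. depth and count both start at 0.
my %selectors = (
    by_class    => { attribs => { class => 'bp_result_tab_info' } },
    by_position => { depth => 0, count => 0 },
    as_asked    => { depth => 1, count => 1 },   # values from the question
);

# Count how many tables each selector matches.
my %matched;
for my $name ( sort keys %selectors ) {
    my $te = HTML::TableExtract->new( %{ $selectors{$name} } );
    $te->parse($html);
    my @tables = $te->tables;
    $matched{$name} = scalar @tables;
    printf "%-12s matched %d table(s)\n", $name, $matched{$name};
}

Selecting by class tends to be more robust than selecting by position, since it keeps working if another table is later added earlier in the page.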

Also, I do not think you want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, you should process the table row by row.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_result_tab_info' },
);

$te->parse_file('t.html');

for my $table ( $te->tables ) {
    print Dump $table->columns;
}

Output:

---
- 'data_one '
- data_two
- official_description
- name of the street
- number and town
- telefon
- fax
- e-mail-adresse
- internet-site
- á
- á
- number of indidviduals
- á
---
- ~
- "á116439\r\n  "
- 'the name '
- champs elysee
- ' 75000 paris '
- "á000241 49321\r\n"
- "á000241 4093287\r\n"
- "á1222216439@site.org\r\n"
- áhttp://www.thesite.org
- the department
- ~
- á192
- ~
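
The row-by-row processing recommended above could look roughly like the following. This is a minimal sketch, not the answer author's code: the $html excerpt, the clean helper, and the @pairs array are assumptions made for illustration. The "á" characters in the output above appear to be the no-break spaces from &nbsp; shown in a console code page, so the helper strips \xa0 explicitly along with ordinary whitespace.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical excerpt of the table from the question.
my $html = <<'HTML';
<table class="bp_result_tab_info">
  <tr><td colspan="2"><strong>data_one </strong></td></tr>
  <tr><td><strong>data_two</strong></td><td>&nbsp;116439
  </td></tr>
  <tr><td><strong>name of the street</strong></td><td>champs elysee</td></tr>
  <tr><td><strong>number and town</strong></td><td> 75000 paris </td></tr>
</table>
HTML

# Normalise one cell: undef becomes '', runs of whitespace and
# no-break spaces collapse to one space, and the ends are trimmed.
sub clean {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/[\s\xa0]+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_result_tab_info' },
);
$te->parse($html);

# Collect [label, value] pairs, one per table row.
my @pairs;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = map { clean($_) } @{$row}[ 0, 1 ];
        # Skip title and spacer rows that lack a label or a value.
        next unless length $label and length $value;
        push @pairs, [ $label, $value ];
    }
}

# One tab-separated line per pair.
print join( "\t", @$_ ), "\n" for @pairs;

Note that this filter also drops a row whose label cell is empty, such as the "the department" row in the full page. If those values matter, relax the condition to require only a value. The same @pairs list could be written to a text file or inserted into a database instead of being printed.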

Finally, a word of advice: It is clear that you do not have much of an understanding of Perl (or HTML, for that matter). It would be better for you to try to learn some of the basics first. As it stands, all you are doing is incorrectly copying and pasting code from one answer into another and not learning anything.
