Get contents between table tags in every file in a directory and output to one file


Problem description

I have a directory with about 900 HTML documents in it. Each document contains the same table (easily identified by its tags), and that table holds data which I need to extract and output in CSV format. What is the best way to do this, and how can I do it?

Here is an example of what is in each HTML file and needs to be extracted:

<table class="datalogs" cellspacing="5px">
                        <tr>< th>Data1</th><th>Data 2</th><th>Data 3</th><th>Data 4</th><th>Data 4< /th>< th>Data 5</th><th>Data 6</th></tr>
<tr class="odd"><td valign="top"><h4>123<br/></h4></td><td valign="top">AAA</td><td valign="top"><b>url here</b></td><td valign="top">Yes</td><td valign="top">None</td><td valign="top"></td><td valign="top"></td></tr><tr class="even">...
                        </table>

The ideal outcome would be "123", "AAA", "url here", "Yes", "None", "", ""

If this can't be achieved in one go, then just extracting the data between the table tags (identified by class="datalogs") and putting all the results into one file would do. This would come from a loop that goes through the directory and gets this table from every file.

Thanks for your help.

Recommended answer

This is doable in Perl with the help of HTML::TableExtract and Text::CSV:

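#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Match tables whose header row contains these column names.
my $te = 'HTML::TableExtract'
         ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                           'Data 4', 'Data 5', 'Data 6']);

# Quote every field and end each record with a newline.
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

# Feed every file named on the command line to the same extractor.
while (@ARGV) {
    my $file = shift;
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };   # slurp the whole file
    $te->parse($html);
}
# Print each row of each matched table as one CSV record.
for my $table ($te->tables) {
    $csv->print(*STDOUT{IO}, $_) for $table->rows;
}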

I had to fix some errors in your sample input: there should be no spaces between < and the tag name or /.
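
The sample input in the question has stray spaces such as `< th>` and `< /th>`. If the real files contain the same damage, one way to repair it before parsing might look like the sketch below. The helper name `fix_tag_spaces` is hypothetical, and the substitution assumes that every `<` in the files starts a tag; a literal `<` in running text followed by a word would be altered too.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: drop whitespace between '<' (or '</') and the tag name.
# Assumption: every '<' in the input opens a tag.
sub fix_tag_spaces {
    my ($html) = @_;
    $html =~ s{<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)}{<$1$2}g;
    return $html;
}

# Part of the header row from the question, with the stray spaces left in.
my $broken = '<tr>< th>Data1</th><th>Data 4< /th>< th>Data 5</th></tr>';
print fix_tag_spaces($broken), "\n";
# prints: <tr><th>Data1</th><th>Data 4</th><th>Data 5</th></tr>
```

The repaired string could then be passed to `$te->parse` in place of the raw file contents.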

To add the file name as the first column, create a new TableExtract object for each file:

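#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Quote every field and end each record with a newline.
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };   # slurp the whole file
    # A fresh extractor per file, so the rows can be tied to the file name.
    my $te = 'HTML::TableExtract'
             ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                               'Data 4', 'Data 5', 'Data 6']);
    $te->parse($html);
    # Prepend the file name to each row of the first matching table
    # (this assumes every file contains such a table).
    $csv->print(*STDOUT{IO}, [$file, @$_]) for ($te->tables)[0]->rows;
}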
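
The question also mentions a fallback: copy everything between the `class="datalogs"` table tags from each file into one output. A rough sketch of that with core Perl only is shown below. It uses a regular expression, not an HTML parser, so it assumes the tables are not nested and that the class attribute is written exactly as `class="datalogs"`. The sub name `datalogs_tables`, the `.html`/`.htm` filter, and the comment line that records each file name are illustrative choices, not part of the original answer.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Return the raw text of every <table class="datalogs"> ... </table> block.
# Assumption: tables are not nested, so the first </table> closes the block.
sub datalogs_tables {
    my ($html) = @_;
    return $html =~ m{(<table\b[^>]*\bclass="datalogs"[^>]*>.*?</table>)}gis;
}

# Directory to scan: first command-line argument, else the current directory.
my $dir = @ARGV ? $ARGV[0] : '.';

opendir my $DH, $dir or die "$dir: $!";
my @files = sort grep { /\.html?\z/i } readdir $DH;
closedir $DH;

for my $name (@files) {
    my $path = "$dir/$name";
    open my $IN, '<', $path or die "$path: $!";
    my $html = do { local $/; <$IN> };   # slurp the whole file
    close $IN;
    # Mark where each table came from, then print the table text itself.
    print "<!-- $name -->\n", $_, "\n" for datalogs_tables($html);
}
```

Saved under a name such as `extract_tables.pl` (hypothetical), it could be run as `perl extract_tables.pl /path/to/html > all_tables.html`, with the shell redirect collecting every table into one file.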
