Get contents between table tags in every file in a directory and output to one file
Question
I have a directory with about 900 HTML documents in it. Each document contains the same table (easily identified by its tag), and that table holds the data I need to extract and output in CSV format. What is the best way to do this, and how can I do it?
Here is an example of the content in each HTML file that I need to extract:
<table class="datalogs" cellspacing="5px">
<tr>< th>Data1</th><th>Data 2</th><th>Data 3</th><th>Data 4</th><th>Data 4< /th>< th>Data 5</th><th>Data 6</th></tr>
<tr class="odd"><td valign="top"><h4>123<br/></h4></td><td valign="top">AAA</td><td valign="top"><b>url here</b></td><td valign="top">Yes</td><td valign="top">None</td><td valign="top"></td><td valign="top"></td></tr><tr class="even">...
</table>
The ideal outcome would be "123", "AAA", "url here", "Yes", "None", "", ""
If this can't be achieved in one go, then just extracting the data between the table tags (identified by class="datalogs") and putting all the results into one file would do. This would come from a loop that goes through the directory and gets this table from every file.
Thanks for your help.
Recommended answer
This is doable in Perl with the help of HTML::TableExtract and Text::CSV:
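#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Select tables by the text of their header cells.
my $te = 'HTML::TableExtract'
    ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                      'Data 4', 'Data 5', 'Data 6']);

# Quote every field and end each record with a newline.
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

# Parse every file named on the command line with the same extractor.
while (@ARGV) {
    my $file = shift;
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };    # slurp the whole file
    $te->parse($html);
}

# Write the rows of all collected tables to standard output as CSV.
for my $table ($te->tables) {
    $csv->print(*STDOUT{IO}, $_) for $table->rows;
}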
I had to fix some errors in your sample input (there should be no space between < and the tag name or /).
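The question identifies the table by class="datalogs" rather than by its header text. As a minimal sketch of that fallback, HTML::TableExtract can also match tables on their attributes through its attribs option. Everything else here follows the script above; the one assumption is that each table of interest carries exactly the attribute value class="datalogs". Because no header matching takes place, the header row is expected to appear in the output along with the data rows.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Same CSV settings as in the script above.
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };    # slurp the whole file

    # Match tables on the class attribute instead of on header text
    # (assumes the attribute value is exactly "datalogs").
    my $te = 'HTML::TableExtract'
        ->new(attribs => { class => 'datalogs' });
    $te->parse($html);

    # Print every row of every matching table; with no header
    # matching, the <th> row is expected to be printed as well.
    for my $table ($te->tables) {
        $csv->print(*STDOUT{IO}, $_) for $table->rows;
    }
}
```

Matching on the attribute avoids listing every header, at the cost of relying on the class name staying the same across all files.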
To add the file name as the first column, create a new TableExtract object for each file:
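#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Quote every field and end each record with a newline.
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };    # slurp the whole file

    # A fresh extractor per file keeps each file's table separate.
    my $te = 'HTML::TableExtract'
        ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                          'Data 4', 'Data 5', 'Data 6']);
    $te->parse($html);

    # Skip files without a matching table instead of dying on undef.
    my ($table) = $te->tables;
    unless ($table) {
        warn "No matching table in $file\n";
        next;
    }

    # Prepend the file name to each row of the first matching table.
    $csv->print(*STDOUT{IO}, [$file, @$_]) for $table->rows;
}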