Scraping a plain text file with no HTML?

Question

I have the following data in a plain text file:

1.  Value
Location :  Value
Owner:  Value
Architect:  Value

2.  Value
Location :  Value
Owner:  Value
Architect:  Value

... up to 200+ ...

The number and the word Value change for each segment.

Now I need to insert this data into a MySQL database.

Do you have a suggestion on how I can traverse and scrape it so that I get the text beside the number, and the values of "Location", "Owner", and "Architect"?

This seems hard to do with a DOM scraping class, since there are no HTML tags present.

Recommended answer

This can be done with a very simple stateful, line-oriented parser. On every line you accumulate the parsed data into an array(). When something tells you that you are on a new record, you dump what you have parsed and start over.

Line-oriented parsers have a great property: they require little memory and, most importantly, constant memory. They can process gigabytes of data without breaking a sweat. I manage a bunch of production servers, and there's nothing worse than scripts that slurp whole files into memory (and then stuff arrays with the parsed content, which requires more than twice the original file size in memory).

This works and is mostly unbreakable:

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("cannot open '$in_name'\n");

function dump_record($r) {
    print_r($r);
}

$current = array();
while (($line = fgets($in)) !== false) {
    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for '123. <value>' stanzas */
    if (preg_match('/^\s*(\d+)\.\s+(.*?)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record, keeping the text beside the number */
        $current = array( 'id' => $start[1], 'name' => $start[2] );
    }
    else if (preg_match('/^\s*(.*?)\s*:\s+(.*?)\s*$/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza, trimming both sides */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}

/* Don't forget to dump the last parsed record, situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);
?>

Obviously, you'll need to put something suited to your needs in the dump_record function, such as printing a correctly formatted INSERT SQL statement.
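
As a rough illustration of that last step, here is one possible dump_record that turns a parsed record into a parameterized INSERT. It is only a sketch under assumptions: the table name `buildings`, the column names, and the `name` key for the text beside the number are hypothetical and not part of the original answer. Keys are trimmed before lookup, so "Location " and "Location" are treated alike.

```php
<?php
/* Hypothetical table name; adjust to the real schema. */
const TABLE = 'buildings';

/* Map keys found in the text file to (assumed) column names. */
const COLUMN_MAP = array(
    'id'        => 'id',
    'name'      => 'name',
    'Location'  => 'location',
    'Owner'     => 'owner',
    'Architect' => 'architect',
);

/* Build a parameterized INSERT for one parsed record.
 * Returns array($sql, $params); keys not in COLUMN_MAP are ignored. */
function build_insert(array $r) {
    /* Normalize keys so stray whitespace around them does not matter */
    $clean = array();
    foreach ($r as $key => $value) {
        $clean[trim($key)] = $value;
    }

    $columns = array();
    $params  = array();
    foreach (COLUMN_MAP as $key => $column) {
        if (array_key_exists($key, $clean)) {
            $columns[] = $column;
            $params[]  = $clean[$key];
        }
    }

    $placeholders = implode(', ', array_fill(0, count($columns), '?'));
    $sql = 'INSERT INTO ' . TABLE . ' (' . implode(', ', $columns) . ')'
         . ' VALUES (' . $placeholders . ')';
    return array($sql, $params);
}

function dump_record($r) {
    list($sql, $params) = build_insert($r);
    /* Here the statement and its values are only printed */
    echo $sql, ' -- ', json_encode($params), "\n";
}
```

With a real PDO connection in `$pdo`, dump_record could call `$pdo->prepare($sql)->execute($params)` instead of printing. Binding the values through placeholders avoids escaping them by hand.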
