Best way to parse multiple plain text websites that contain no HTML


Question

I am looking for a way to read multiple (over 50) plain text websites and parse only certain information into an HTML table, or a CSV file. When I say "plain text" I mean that while it is a web address, it does not have any HTML associated with it. This would be an example of the source. I am pretty new to this, and was looking for help in seeing how this could be done.

update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000000 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}

Thanks for any suggestions.

Recommended answer

I may be completely missing the point here - but here is how you could take the contents (assuming you had them as a string) and put them into a PHP key/value array. I "hard-coded" the string you had, and changed one value (the key ais-367000000 seemed to repeat, and that makes the second object overwrite the first).

This is a very basic parser that assumes a format like you described above. I give the output below the code:

<?php
echo "<html>";
$s="update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000001 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}";
$lines = explode("\n", $s);
$output = Array();
$thisElement = & $output;
foreach($lines as $line) {
  $elements = explode(":", $line);
  if (count($elements) > 1) {
    $thisElement[trim($elements[0])] = $elements[1];
  }
  if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $output[$key] = Array();
      $thisElement = & $output[$key];
  }
  if(strstr($line, "}")) {
      $thisElement = & $output;
  }
}
echo '<pre>';
print_r($output);
echo '</pre>';
echo '</html>';
?>

Output of the above (can be seen working at http://www.floris.us/SO/ships.php):

Array
(
    [update-token] => 179999210
    [vessel-name] => Name Here
    [vessel-length] => 57.30
    [vessel-beam] => 14.63
    [vessel-draft] => 3.35
    [vessel-airdraft] => 0.00
    [time] => 20140104T040648.259Z
    [position] => 25.04876667 -75.57001667 GPS
    [river-mile] => sd 178.71
    [rate-of-turn] => 0.0
    [course-over-ground] => 58.5
    [speed-over-ground] => 0.0
    [ais-367000000] => Array
        (
            [pos] => 45.943912 -87.384763 DGPS
            [cog] => 249.8
            [sog] => 0.0
            [name] => name here
            [call] => 1113391
            [imo] => 8856857
            [type] => 31
            [dim] => 10 20 4 5
            [draft] => 3.8
            [destination] => 
        )

    [ais-367000001] => Array
        (
            [pos] => 25.949652 -86.384535 DGPS
            [cog] => 105.6
            [sog] => 0.0
            [name] => CHRISTINE
            [call] => 5452438
            [type] => 52
            [status] => 0
            [dim] => 1 2 3 4
            [draft] => 3.0
            [destination] => IMTT ST.ROSE
            [eta] => 06
        )

)
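
One detail in the output above: `eta` comes out as `06` rather than `06:00`, because `explode(":", $line)` splits on every colon and only the second piece is kept. A minimal sketch of one way around it, assuming each line holds a single `key:value` pair, is to pass a limit of 2 to `explode` so that only the first colon acts as the separator. The helper name `splitKeyValue` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: split "key:value" on the first colon only,
// so values that themselves contain colons (such as "06:00") stay intact.
function splitKeyValue($line) {
    $parts = explode(":", $line, 2);   // limit 2 => at most two pieces
    if (count($parts) < 2) {
        return null;                   // no colon: not a key/value line
    }
    return array(trim($parts[0]), trim($parts[1]));
}

print_r(splitKeyValue("    eta:06:00"));
```

The same limit could be added to the `explode(":", $line)` calls in the parsers shown here; the rest of their logic would not need to change.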

A better approach would be to turn the string into "properly formed JSON", then use json_decode. That might look like the following:

<?php
echo "<html>";
$s="update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000001 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}";

echo '<pre>';
print_r(parseString($s));
echo '</pre>';

function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}
echo '</html>';
?>

Demo at http://www.floris.us/SO/ships2.php ; note that I use the variable $comma to make sure that commas are either included, or not included, at various points in the string.

Output of this code looks similar to what we had before:

stdClass Object
(
    [update-token] => 179999210
    [vessel-name] => Name Here
    [vessel-length] => 57.30
    [vessel-beam] => 14.63
    [vessel-draft] => 3.35
    [vessel-airdraft] => 0.00
    [time] => 20140104T040648.259Z
    [position] => 25.04876667 -75.57001667 GPS
    [river-mile] => sd 178.71
    [rate-of-turn] => 0.0
    [course-over-ground] => 58.5
    [speed-over-ground] => 0.0
    [ais-367000000] => stdClass Object
        (
            [pos] => 45.943912 -87.384763 DGPS
            [cog] => 249.8
            [sog] => 0.0
            [name] => name here
            [call] => 1113391
            [imo] => 8856857
            [type] => 31
            [dim] => 10 20 4 5
            [draft] => 3.8
            [destination] => 
        )

    [ais-367000001] => stdClass Object
        (
            [pos] => 25.949652 -86.384535 DGPS
            [cog] => 105.6
            [sog] => 0.0
            [name] => CHRISTINE
            [call] => 5452438
            [type] => 52
            [status] => 0
            [dim] => 1 2 3 4
            [draft] => 3.0
            [destination] => IMTT ST.ROSE
            [eta] => 06
        )

)
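
Assembling the JSON text by hand works for this sample, but it depends on the values never containing a double quote, a backslash or a stray carriage return, any of which would make `json_decode` return `null`. A hedged alternative sketch: build the nested array directly (as in the first version) and call `json_encode` only when a JSON string is actually needed, so that PHP does the escaping. `parseToArray` is a hypothetical name, and the code assumes the same single level of `name { ... }` nesting as the sample data.

```php
<?php
// Hypothetical variant of the parser: builds a nested array instead of JSON text.
// Assumes one level of "name { ... }" nesting, as in the sample data.
function parseToArray($s) {
    $output = array();
    $block = null;                                  // name of the open block, if any
    foreach (preg_split('/\r\n|\r|\n/', $s) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (substr($line, -1) === '{') {            // "ais-367000000 {" opens a block
            $block = trim(substr($line, 0, -1));
            $output[$block] = array();
        } elseif ($line === '}') {                  // "}" closes it
            $block = null;
        } else {
            $parts = explode(':', $line, 2);        // split on the first colon only
            if (count($parts) === 2) {
                if ($block === null) {
                    $output[trim($parts[0])] = trim($parts[1]);
                } else {
                    $output[$block][trim($parts[0])] = trim($parts[1]);
                }
            }
        }
    }
    return $output;
}

// A value with quotes and Windows line endings, which the hand-built JSON would choke on.
$sample = "vessel-name:The \"Big\" One\r\nais-1 {\r\n    eta:06:00\r\n}";
echo json_encode(parseToArray($sample));
```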

But maybe your question is "how do I get the text into php in the first place". In that case, you might look at something like this:

<?php
$urlstring = file_get_contents('/path/to/urlFile.csv');
$urls = explode("\n", $urlstring); // one url per line

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
  $responses[$i] = parseString(file_get_contents($url));
  $i = $i + 1;
}


function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}
?>

I include the same parsing function as before; it's possible to make it much better, or leave it out altogether. Hard to know from your question.
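
One practical wrinkle with `explode("\n", ...)` on a URL file: a trailing newline yields an empty last entry, and Windows line endings can leave a `\r` on each URL. A small sketch that sidesteps both, using PHP's `file()` with its skip-empty flag plus `trim`; the function name `readUrlList` is an assumption made for illustration.

```php
<?php
// Hypothetical helper: read one URL per line, dropping blank lines
// and stray whitespace such as a trailing "\r".
function readUrlList($path) {
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array();                    // file missing or unreadable
    }
    $urls = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '') {
            $urls[] = $line;
        }
    }
    return $urls;
}
```

With this in place, the first two lines of the script above could become `$urls = readUrlList('/path/to/urlFile.csv');`.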

Feel free to ask questions.

UPDATE

Based on the comments, I have added a function that fetches each URL with curl; let me know if this works for you. I created a file http://www.floris.us/SO/ships.txt that is an exact copy of the file you showed above, and a script http://www.floris.us/SO/ships3.php containing the source code below, which you can run to see that it works. Note that this version doesn't read anything from a .csv file (you already know how to do that); it just takes the array of URLs, fetches each text file, and converts it into a data structure you can use for display or anything else:

<?php
$urls = Array();
$urls[0] = "http://www.floris.us/SO/ships.txt";

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
//  $responses[$i] = parseString(file_get_contents($url));
  $responses[$i] = parseString(myCurl($url));
  $i = $i + 1;
}
echo '<html><body><pre>';
print_r($responses);
echo '</pre></body></html>';

function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}

function myCurl($f) {
// create curl resource 
   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

// $output contains the output string 
   $output = curl_exec($ch); 

// close curl resource to free up system resources 
   curl_close($ch);    
   return $output;
}
?>

Note - because two entries have the same "tag", the second one overwrites the first when using the original source data. If that is a problem let me know. Also if you have ideas on how you actually want to display the data, try to mock up something and I can help you get it right.
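
If the overwriting matters, one option is to store each `ais-...` block as an entry in a list instead of keying it by its label. The sketch below assumes the same line format as the sample; `parseKeepingDuplicates` and the `blocks` / `id` keys are names made up for illustration, and a real feed that used `blocks` as a top-level key or `id` inside a block would need different names.

```php
<?php
// Hypothetical parser variant: every "label { ... }" block is appended to a list,
// so two blocks with the same label are both kept.
function parseKeepingDuplicates($s) {
    $result = array('blocks' => array());
    $current = -1;                                   // index of the open block, -1 if none
    foreach (explode("\n", $s) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (substr($line, -1) === '{') {             // start a new block, remember its label
            $result['blocks'][] = array('id' => trim(substr($line, 0, -1)));
            $current = count($result['blocks']) - 1;
        } elseif ($line === '}') {
            $current = -1;
        } else {
            $parts = explode(':', $line, 2);
            if (count($parts) === 2) {
                if ($current < 0) {
                    $result[trim($parts[0])] = trim($parts[1]);
                } else {
                    $result['blocks'][$current][trim($parts[0])] = trim($parts[1]);
                }
            }
        }
    }
    return $result;
}

$sample = "vessel-name:Name Here\nais-367000000 {\n    name:name here\n}\nais-367000000 {\n    name:CHRISTINE\n}";
$parsed = parseKeepingDuplicates($sample);
echo count($parsed['blocks']);   // both blocks are kept
```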

There are several possible timeout mechanisms that can be causing you problems; depending on which it is, one of the following solutions may help you:

  1. If the browser doesn't get any response from the server, it will eventually time out. This is almost certainly not your problem right now; but it might become your issue if you fix the other problems
  2. php scripts typically have a built in "maximum time to run" before they decide you sent them into an infinite loop. If you know you will be making lots of requests, and these requests will take a lot of time, you may want to set the time-out higher. See http://www.php.net/manual/en/function.set-time-limit.php for details on how to do this. I would recommend setting the limit to a "reasonable" value inside the curl loop - so the counter gets reset for every new request.
  3. Your attempt to connect to the server may take too long (this is the most likely problem as you said). You can set the value (time you expect to wait to make the connection) to something "vaguely reasonable" like 10 seconds; this means you won't wait forever for the servers that are offline. Use

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

for a 10 second wait. See Setting Curl's Timeout in PHP.

Finally, you will want to handle errors gracefully: if the connection did not succeed, you don't want to process the response. Putting all this together gets you something like this:

$i = 0;
foreach($urls as $url) {
  $temp = myCurl($url);
  if (strlen($temp) == 0) {
    echo 'no response from '.$url.'<br>';
  }
  else {
    $responses[$i] = parseString($temp);   // reuse the response already fetched above
    $i = $i + 1;
  }
}

echo '<html><body><pre>';
print_r($responses);
echo '</pre></body></html>';

function myCurl($f) {
// create curl resource 
   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
   curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
   curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // try for 10 seconds to get a connection
   curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // try for 30 seconds to complete the transaction

// $output contains the output string 
   $output = curl_exec($ch); 

// see if any error was set:
   $curl_errno = curl_errno($ch);

// close curl resource to free up system resources 
   curl_close($ch);    

// make response depending on whether there was an error
   if($curl_errno > 0) {
      return '';
   }
   else {
      return $output;
  }
}
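
The second suggestion in the list, resetting PHP's own time limit inside the loop, is not shown in the code above. A rough sketch of how it might look follows. `fetchAll` is a hypothetical wrapper, `$fetch` stands for any function that returns the body as a string or `''` on failure (such as `myCurl`), and the default of 45 seconds is an arbitrary choice slightly above the 30 second `CURLOPT_TIMEOUT`.

```php
<?php
// Hypothetical loop wrapper: $fetch is any callable that returns the body
// as a string, or '' on failure (for example the myCurl function above).
function fetchAll(array $urls, $fetch, $secondsPerUrl = 45) {
    $bodies = array();
    foreach ($urls as $url) {
        set_time_limit($secondsPerUrl);    // restart PHP's execution timer for this URL
        $body = call_user_func($fetch, $url);
        if ($body !== '' && $body !== false && $body !== null) {
            $bodies[$url] = $body;         // keep only URLs that actually responded
        }
    }
    return $bodies;
}

// Demo with a stand-in fetcher so the sketch runs without network access.
$fake = function ($url) {
    return $url === 'bad' ? '' : "vessel-name:Test";
};
print_r(fetchAll(array('good', 'bad'), $fake));
```

In the real script the call would be something like `fetchAll($urls, 'myCurl')`, followed by `parseString` on each returned body.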

Last update? I have updated the code one more time. It now

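  1. Reads a list of URLs from a file (one URL per line, fully formed)
  2. Tries to fetch the contents from each URL in turn, handling timeouts and echoing progress to the screen
  3. Creates a table with some of the information from the files (including a reformatted time stamp)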

To make this work, I had the following files:

www.floris.us/SO/ships.csv containing three lines:

http://www.floris.us/SO/ships.txt
http://floris.dnsalias.com/noSuchFile.html
http://www.floris.us/SO/ships2.txt

Files ships.txt and ships2.txt at the same location (almost identical copies but for name of ship) - these are like your plain text files.

File ships3.php in the same location. This contains the following source code, that performs the various steps described earlier, and attempts to string it all together:

<?php
$urlstring = file_get_contents('http://www.floris.us/SO/ships.csv');
$urls = array_filter(array_map('trim', explode("\n", $urlstring))); // one url per line; skip blank lines

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
 $temp = myCurl($url);
  if(strlen($temp) > 0) {
    $responses[$i] = parseString($temp);
    $i = $i + 1;
  }
  else {
    echo "URL ".$url." did not respond<br>";
  }
}

// produce the actual output table:
echo '<html><body>';
writeTable($responses);
echo '</body></html>';

// ------------ support functions -------------
function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring, true);
}

function myCurl($f) {
// create curl resource 

   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
   curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
   curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // try for 10 seconds to get a connection
   curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // try for 30 seconds to complete the transaction

// $output contains the output string 
   $output = curl_exec($ch); 

// see if any error was set:
   $curl_errno = curl_errno($ch);
   $curl_error = curl_error($ch);

// close curl resource to free up system resources 
   curl_close($ch);    

// make response depending on whether there was an error
   if($curl_errno > 0) {
      echo 'Curl reported error '.$curl_error.'<br>';
      return '';
   }
   else {
      echo 'Successfully fetched '.$f.'<br>';
      return $output;
  }
}

function writeTable($r) {
  echo 'The following ships reported: <br>';
  echo '<table border=1>';
  foreach($r as $value) {
    // skip responses that could not be parsed or that lack a vessel name
    if (isset($value["vessel-name"]) && strlen($value["vessel-name"]) > 0) {
      echo '<tr><td><table border=1><tr>';
      echo '<td>Vessel Name</td><td>'.$value["vessel-name"].'</td></tr>';
      echo '<tr><td>Time:</td><td>'.dateFormat($value["time"]).'</td></tr>';
      echo '<tr><td>Position:</td><td>'.$value["position"].'</td></tr>';
      echo '</table></td></tr>';
    }
  }
  // close the outer table once, after all rows have been written
  echo '</table>';
}

function dateFormat($d) {
  // with input yyyymmddThhmmss.sssZ (e.g. 20140104T040648.259Z)
  // return dd/mm/yy hh:mm
  $date = substr($d, 6, 2) ."/". substr($d, 4, 2) ."/". substr($d, 2, 2) ." ". substr($d, 9, 2) . ":" . substr($d, 11, 2);
  return $date;
}
?>

The output of this is a table giving the vessel name, time and position of each ship that responded.

You can obviously make this prettier, and include other fields etc. I think this should get you a long way there, though. You might consider (if you can) having a script run in the background to create these tables every 30 minutes or so, and saving the resulting html tables to a local file on your server; then, when people want to see the result, they would not have to wait for the (slow) responses of the different remote servers, but get an "almost instant" result.
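
As a rough illustration of that caching idea, the sketch below saves generated HTML to a local file and reads it back if it is fresh enough. The function names, the `.tmp` suffix and the 1800 second (30 minute) default are all assumptions for the example, not something taken from the answer.

```php
<?php
// Hypothetical cache writer: save generated HTML to a local file that a page can serve later.
// Writing to a temporary file and renaming it avoids readers seeing a half-written file.
function saveCachedHtml($html, $target) {
    $tmp = $target . '.tmp';
    if (file_put_contents($tmp, $html) === false) {
        return false;
    }
    return rename($tmp, $target);
}

// Return the cached copy if it exists and is at most $maxAge seconds old, otherwise null.
function readCachedHtml($target, $maxAge = 1800) {
    if (!is_file($target) || time() - filemtime($target) > $maxAge) {
        return null;
    }
    return file_get_contents($target);
}
```

A background job (for example one started by cron) could capture the table with `ob_start()` and `ob_get_clean()` around the `writeTable` call and pass the result to `saveCachedHtml`; the page that visitors load would then only call `readCachedHtml`.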

But that's somewhat far removed from the original question. If you are able to implement all this in a workable fashion, and then want to come back and ask a follow-up question (if you're still stuck / not happy with the outcome), that is probably the way to go. I think we've pretty much beaten this one to death now.
