Extracting data between two tags in HTML file

Question

I've got a HUUUGE HTML file here saved on my system, which contains data from a product catalogue. The data is structured such that for each product record the name is between two tags, <name> and </name>.

Each product has up to 3 attributes: name, productID, and color, but not all products will have all these attributes.

How would I go about extracting this data for each product without mixing up the product attributes? The file is also 50 megabytes!

Code example....

<name>'hat'</name>
blah blah blah
<prodId>'1829493'</prodId>
blah blah blah
<color>'cyan'</color>

blah blah 
blah blah blah
blah blah blah

<name>'shirt'</name>
blah blah blahblah blah blah
<prodId>'193'</prodId>

<name>'dress'</name>
blah blah blah
blah blah blah
<prodId>'18'</prodId>
<color>'dark purple'</color>

Recommended answer

A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function fileread:

strContents = fileread('yourfile.html');

Assuming the file format you have above, you can then parse the contents with the function regexp (using named token capture):

expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
tokens = regexp(strContents,expr,'tokens');
tokens = vertcat(tokens{:});

And the contents of tokens using your sample file contents will be:

tokens = 

    'name'      'hat'        
    'prodId'    '1829493'    
    'color'     'cyan'       
    'name'      'shirt'      
    'prodId'    '193'        
    'name'      'dress'      
    'prodId'    '18'         
    'color'     'dark purple'
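
To try the pattern without loading the whole file, it can be run on a short in-memory string first. This is only a sketch: the variable sample is made up for illustration, and it assumes every value is wrapped in single quotes exactly as in the question's sample. Inside a MATLAB character literal, '' stands for one literal quote character, and \k<tag> is a backreference that forces the closing tag to match the opening one.

```matlab
% Hypothetical test string imitating the file layout from the question
sample = sprintf(['<name>''hat''</name>\nblah blah blah\n', ...
                  '<prodId>''1829493''</prodId>\nblah blah blah\n', ...
                  '<color>''cyan''</color>\n']);

% Same pattern as in the answer: a tag name, a quoted value, the same tag closed
expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';

tokens = regexp(sample, expr, 'tokens');  % one 1-by-2 cell per match
tokens = vertcat(tokens{:});              % stack the matches into an N-by-2 cell array
disp(tokens)
```

If the real file turns out to use unquoted values or attributes inside the tags, the pattern would need adjusting before it is run on all 50 MB.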

You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:

s = struct('name',[],'prodId',[],'color',[]);  %# Initialize structure
nTokens = size(tokens,1);                      %# Get number of tokens
nameIndex = find(strcmp(tokens(:,1),'name'));  %# Find indices of 'name'
[s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2});  %# Fill 'name' field

%# Find and fill 'prodId' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
[s(index).prodId] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
[s(index).color] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'prodId':
index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
[s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});

And the contents of s using your sample file contents will be:

>> s(1)

      name: 'hat'
    prodId: '1829493'
     color: 'cyan'

>> s(2)

      name: 'shirt'
    prodId: '193'
     color: []

>> s(3)

      name: 'dress'
    prodId: '18'
     color: 'dark purple'
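
The vectorized code above relies on each 'name' being followed by a 'prodId', a 'color', or both, in that order. If some products might have only a name, or list color before prodId, a plain loop over the token list is one possible alternative. The sketch below is not part of the original answer; it assumes only that each record starts with its 'name' tag, and it hard-codes the token list from the sample so it can be run on its own.

```matlab
% Token list copied from the sample output (N-by-2: tag, value)
tokens = {'name','hat'; 'prodId','1829493'; 'color','cyan'; ...
          'name','shirt'; 'prodId','193'; ...
          'name','dress'; 'prodId','18'; 'color','dark purple'};

% Preallocate one empty record per 'name' tag
nProducts = sum(strcmp(tokens(:,1), 'name'));
s = repmat(struct('name',[],'prodId',[],'color',[]), 1, nProducts);

k = 0;                                % index of the current product
for i = 1:size(tokens,1)
    field = tokens{i,1};
    if strcmp(field, 'name')
        k = k + 1;                    % each 'name' starts a new record
    end
    if k > 0                          % ignore tags seen before the first 'name'
        s(k).(field) = tokens{i,2};   % dynamic field name: name, prodId or color
    end
end
```

Fields that never appear for a product stay empty ([]), which matches the color field of the 'shirt' record in the output above.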
