Extracting data between two tags in HTML file
I've got a HUUUGE HTML file saved on my system, which contains data from a product catalogue. The data is structured such that for each product record the name is between two tags, <name> and </name>.
Each product has up to 3 attributes: name, productID, and color, but not all products will have all these attributes.
How would I go about extracting this data for each product without mixing up the product attributes? The file is also 50 megabytes!
Code example ....
<name>'hat'</name>
blah blah blah
<prodId>'1829493'</prodId>
blah blah blah
<color>'cyan'</color>
blah blah
blah blah blah
blah blah blah
<name>'shirt'</name>
blah blah blahblah blah blah
<prodId>'193'</prodId>
<name>'dress'</name>
blah blah blah
blah blah blah
<prodId>'18'</prodId>
<color>'dark purple'</color>
A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function FILEREAD:
strContents = fileread('yourfile.html');
Assuming the file format you have above, you can then parse the contents with the function REGEXP (using named token capture):
expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
tokens = regexp(strContents,expr,'tokens');
tokens = vertcat(tokens{:});
And the contents of tokens using your sample file contents will be:
tokens =
'name' 'hat'
'prodId' '1829493'
'color' 'cyan'
'name' 'shirt'
'prodId' '193'
'name' 'dress'
'prodId' '18'
'color' 'dark purple'
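If you want to check the pattern before running it on the full 50 MB file, a short inline string is enough. The sketch below is only an illustration: the sample string is made up from the first few records above, and it swaps the named token for a plain numbered token with a \1 backreference, which matches the same text and keeps every token unnamed.

```matlab
% Small inline sample standing in for the file contents (illustrative only).
% Inside a char literal, '' is an escaped single quote.
strContents = ['<name>''hat''</name> blah <prodId>''1829493''</prodId> ' ...
               'blah <color>''cyan''</color> blah <name>''shirt''</name>'];

% Token 1 is the tag name, token 2 is the quoted value;
% \1 requires the closing tag to repeat the opening tag name.
expr = '<(name|prodId|color)>''([^<>]+)''</\1>';

tokens = regexp(strContents,expr,'tokens');  % 1-by-N cell of 1-by-2 cells
tokens = vertcat(tokens{:});                 % N-by-2 cell array: tag, value
```

For this sample, tokens should come out as a 4-by-2 cell array in the same tag/value layout as the listing above.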
You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:
s = struct('name',[],'prodId',[],'color',[]); %# Initialize structure
nTokens = size(tokens,1); %# Get number of tokens
nameIndex = find(strcmp(tokens(:,1),'name')); %# Find indices of 'name'
[s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2}); %# Fill 'name' field
%# Find and fill 'prodId' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
[s(index).prodId] = deal(tokens{nameIndex(index)+1,2});
%# Find and fill 'color' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
[s(index).color] = deal(tokens{nameIndex(index)+1,2});
%# Find and fill 'color' that follows a 'prodId':
index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
[s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});
And the contents of s using your sample file contents will be:
>> s(1)
name: 'hat'
prodId: '1829493'
color: 'cyan'
>> s(2)
name: 'shirt'
prodId: '193'
color: []
>> s(3)
name: 'dress'
prodId: '18'
color: 'dark purple'