How to extract text from such type of html source?
Problem description
I have HTML source containing about 1000 microblog posts (one tweet per line). Most of the tweets look like the one below. I loaded the source into a Delphi memo and tried to strip the HTML markup with the Pos and Delete functions, but failed.
<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>
I want to strip the HTML markup and keep only:
RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150
How can I extract such text in Delphi?
Thank you very much in advance.
Update:
Cosmin Prund is right. I mistakenly skipped a part. What I want is:
RT @HighfashionUK RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150
Cosmin Prund is great.
Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you see in my comment, there's an issue with @HighfashionUK: your example skipped that, and I don't know why.
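function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  // The result can never be longer than the input, so allocate once up front.
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
    begin
      // Inside a tag: skip everything up to and including the closing '>'.
      if source[i] = '>' then InTag := False;
    end
    else
      if source[i] = '<' then InTag := True
      else
      begin
        // Outside a tag: copy the character to the output buffer.
        P[count] := source[i];
        Inc(count);
      end;
  // Shrink the result to the number of characters actually copied.
  SetLength(Result, count);
end;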
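Since the question has one tweet per line, the routine can be applied line by line. The following is a minimal sketch under a few assumptions, not a definitive implementation: the file names tweets.html and tweets.txt are hypothetical, the program name StripTweets is made up for illustration, and StripHtmlMarkup is repeated from above only so the program compiles on its own. Trim is added to drop the leading space that remains after the opening div tag in the sample line.

```pascal
program StripTweets;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Same routine as in the answer above, repeated so this sketch is self-contained.
function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
    begin
      if source[i] = '>' then InTag := False;
    end
    else
      if source[i] = '<' then InTag := True
      else
      begin
        P[count] := source[i];
        Inc(count);
      end;
  SetLength(Result, count);
end;

var
  Lines: TStringList;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    // Hypothetical input file: one tweet (one <div>...</div>) per line.
    Lines.LoadFromFile('tweets.html');
    for i := 0 to Lines.Count - 1 do
      // Strip the tags, then trim the whitespace left at both ends.
      Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
    // Hypothetical output file holding the plain-text tweets.
    Lines.SaveToFile('tweets.txt');
  finally
    Lines.Free;
  end;
end.
```

In a form application the same loop should work on Memo1.Lines, since a memo's Lines property is also a TStrings; wrapping the loop in BeginUpdate/EndUpdate avoids repainting the memo on every line. Note that this approach only removes tags: HTML entities such as &amp; are left as they are.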