How to extract text from this type of HTML source?


Problem description


I have HTML source containing about 1000 microblog posts (one tweet per line). Most of the tweets look like the one below. I loaded the source into a Delphi memo and tried to strip the HTML markup with the Pos and Delete functions, but failed.

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I want to strip the HTML markup and keep only:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

How can I extract such text in Delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is:

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

Cosmin Prund is great.

Solution

Since all HTML markup sits between < and >, a routine to strip it can be written trivially, like this. Hopefully this is what you want because, as you can see in my comment, there's an issue with @HighfashionUK: your example skipped it, and I don't know why.

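function StripHtmlMarkup(const source:string):string;
var i, count: Integer;
    InTag: Boolean;
    P: PChar;
begin
  // The result can never be longer than the source, so allocate once up front
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i:=1 to Length(source) do
    if InTag then
      begin
        // Inside a tag: skip everything up to and including the closing '>'
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          // Outside a tag: copy the character (P is zero-based)
          P[count] := source[i];
          Inc(count);
        end;
  // Trim the result to the number of characters actually copied
  SetLength(Result, count);
end;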
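
Since the question has one tweet per line in a memo, the routine would typically be applied to every line of a TStrings. The following is only a sketch under a few assumptions: StripAllLines is a hypothetical helper name, the sample tweet is a shortened version of the one in the question, and StripHtmlMarkup is restated with a result index instead of a PChar so the program is self-contained. It does not decode HTML entities such as &amp;, and a literal < or > inside an attribute value or in the tweet text would confuse it.

```pascal
program StripTweets;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Same state machine as the routine above: copy characters only while
// outside a <...> tag. Uses a 1-based index into Result instead of a PChar.
function StripHtmlMarkup(const Source: string): string;
var
  i, Count: Integer;
  InTag: Boolean;
begin
  SetLength(Result, Length(Source));
  Count := 0;
  InTag := False;
  for i := 1 to Length(Source) do
    if InTag then
    begin
      if Source[i] = '>' then
        InTag := False;
    end
    else if Source[i] = '<' then
      InTag := True
    else
    begin
      Inc(Count);
      Result[Count] := Source[i];
    end;
  SetLength(Result, Count);
end;

// Hypothetical helper (not from the original answer): strips the markup from
// every line in place and trims the whitespace the removed tags leave behind.
procedure StripAllLines(Lines: TStrings);
var
  i: Integer;
begin
  Lines.BeginUpdate; // avoids repainting a visual control on every assignment
  try
    for i := 0 to Lines.Count - 1 do
      Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
  finally
    Lines.EndUpdate;
  end;
end;

var
  Tweets: TStringList;
begin
  Tweets := TStringList.Create;
  try
    // Shortened sample line; the real lines carry extra attributes
    Tweets.Add('<div id=''tweetText''> RT <a href="http://twitter.com/HighfashionUK">' +
      '@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok.</div>');
    StripAllLines(Tweets);
    WriteLn(Tweets[0]); // RT @HighfashionUK RT: Surprise goody bag up 4 grabs, Ok.
  finally
    Tweets.Free;
  end;
end.
```

In a form, the analogous call would pass the memo's Lines property (for example a memo named Memo1, which is an assumption here) to StripAllLines.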
