How to extract text from this type of HTML source?

Views: 164 Posted: 2018/6/23 14:04:26 Tags: html delphi parsing

Problem description

I have HTML source containing about 1000 microblog posts (one tweet per line). Most of the tweets look like the one below. I am working in a Delphi memo and tried to strip the HTML markup with the Pos and Delete functions, but failed.
<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>
I want to strip the HTML markup and keep only:
RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150
How can I extract such text in Delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is:
RT @HighfashionUK RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150
Cosmin Prund is great.
Solution
Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you can see in my comment, there's an issue with @HighfashionUK: your example skipped it, and I don't know why.
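    function StripHtmlMarkup(const source: string): string;
    var
      i, count: Integer;
      InTag: Boolean;
      P: PChar;
    begin
      // The result can never be longer than the input, so allocate once.
      SetLength(Result, Length(source));
      P := PChar(Result);
      InTag := False;
      count := 0;
      for i := 1 to Length(source) do
        if InTag then
        begin
          // Inside a tag: skip everything up to the closing '>'.
          if source[i] = '>' then InTag := False;
        end
        else if source[i] = '<' then
          InTag := True
        else
        begin
          // Outside a tag: copy the character (P is indexed from zero).
          P[count] := source[i];
          Inc(count);
        end;
      // Shrink the result to the number of characters actually kept.
      SetLength(Result, count);
    end;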
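
Since the question mentions roughly 1000 tweets, one per line, here is a minimal sketch of how the routine above might be applied to every line. It is an illustration under assumptions, not part of the original answer: the file names tweets.html and tweets.txt are hypothetical, the routine is repeated (with plain string indexing instead of PChar) only so the program compiles on its own, and the unit names SysUtils and Classes assume either an older Delphi or a project whose unit scope names include System.

```pascal
program StripTweets;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Same scanning idea as the routine above, repeated so this sketch is
// self-contained. It writes through Result[Count] rather than a PChar.
function StripHtmlMarkup(const Source: string): string;
var
  i, Count: Integer;
  InTag: Boolean;
begin
  SetLength(Result, Length(Source));
  Count := 0;
  InTag := False;
  for i := 1 to Length(Source) do
    if InTag then
    begin
      if Source[i] = '>' then
        InTag := False;
    end
    else if Source[i] = '<' then
      InTag := True
    else
    begin
      Inc(Count);
      Result[Count] := Source[i];
    end;
  SetLength(Result, Count);
end;

var
  Lines: TStringList;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    // Hypothetical input file: one tweet (one <div>...</div>) per line.
    Lines.LoadFromFile('tweets.html');
    for i := 0 to Lines.Count - 1 do
      // Trim removes the blank that follows the opening <div ...> tag.
      Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
    // Hypothetical output file holding the plain-text tweets.
    Lines.SaveToFile('tweets.txt');
  finally
    Lines.Free;
  end;
end.
```

Because TMemo.Lines is also a TStrings, the same loop should work directly on a memo (for example, a Memo1 on a form) in place of the TStringList. Note that this approach only drops the tags; character entities such as &amp;amp; are left untouched.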
