How to extract blocks of text from an HTML page?


Question

I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> doesn't matter. I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration.

How can this be done?

Recommended answer

I use phpQuery. Are you familiar with jQuery? They share the same syntax. You might be concerned about installing a new library, but trust me, this library is well worth the extra overhead.

You can then access it like this:

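// Assumes phpQuery is already loaded and $html holds the page markup.
$doc = phpQuery::newDocumentHTML($html);
foreach ($doc->find('p') as $element) {
    $element = pq($element);  // wrap the raw DOMElement in a phpQuery object
    echo str_word_count($element->text()), "\n";  // word count of this paragraph
}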
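The loop above only visits <p> elements and prints word counts, while the question also asks about text outside paragraphs and about a 100-word threshold. The following is a minimal sketch of one way to cover both, using only built-in PHP string functions rather than phpQuery. The helper name extractTextBlocks(), the list of tags treated as block boundaries, and the $minWords parameter are assumptions made for illustration. Note also that str_word_count() counts alphabetic words only, so numbers are not counted.

```php
<?php
// Hypothetical helper: returns every text block with more than $minWords words.
// A "block" is assumed to be the text between two block-level tag boundaries.
function extractTextBlocks(string $html, int $minWords = 100): array
{
    // Drop script and style elements together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Collapse all whitespace so that source line breaks do not split a block.
    $html = preg_replace('/\s+/', ' ', $html);
    // Turn block-level tags into newlines (this tag list is an assumption).
    $html = preg_replace(
        '#</?(p|div|br|li|ul|ol|td|tr|table|h[1-6]|blockquote|section|article|body)\b[^>]*>#i',
        "\n",
        $html
    );
    // Remove the remaining inline tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    $blocks = [];
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);
        // Keep only blocks that exceed the word threshold.
        if (str_word_count($line) > $minWords) {
            $blocks[] = $line;
        }
    }
    return $blocks;
}
```

It could be called as extractTextBlocks($html), where $html is the page markup as a string. Regular expressions are not a full HTML parser, so for badly malformed pages a DOM-based approach may be more robust.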
