How to extract blocks of text from an HTML page?
Question
I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> tags doesn't matter. I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration.
How can this be done?
Recommended answer
I use phpQuery. Are you familiar with jQuery? They share the same syntax. You might be concerned about installing a new library, but trust me, this library is well worth the extra overhead.
You can then access it like this:
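// Assumes phpQuery is already loaded and $html holds the page markup.
$doc = phpQuery::newDocumentHTML($html);
foreach ($doc->find('p') as $element) {
    // Iteration yields raw DOM nodes; pq() wraps each one in a phpQuery object.
    $element = pq($element);
    // Print the word count of this paragraph's text.
    echo str_word_count($element->text()), PHP_EOL;
}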
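The loop above only counts words inside <p> elements, while the question also asks about text outside paragraphs. As a rough sketch of one way to cover that case, the example below uses PHP's bundled DOM extension rather than phpQuery. It groups every text node under its nearest non-inline ancestor and keeps the groups with more than 100 words. The function name extractTextBlocks, the $minWords parameter, and the list of inline tags are assumptions made for this illustration, not part of phpQuery or of the original answer.

```php
<?php
// Sketch only: extractTextBlocks() and $minWords are hypothetical names.
function extractTextBlocks(string $html, int $minWords = 100): array
{
    // Tags treated as inline; their text is merged into the enclosing block.
    // This list is an assumption and can be adjusted.
    $inline = ['a', 'b', 'i', 'em', 'strong', 'span', 'u', 'small', 'code', 'sub', 'sup'];

    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';

    $groups = [];
    foreach ($xpath->query($query) as $textNode) {
        // Climb past inline elements to find the block that owns this text.
        $owner = $textNode->parentNode;
        while ($owner instanceof DOMElement && in_array($owner->nodeName, $inline, true)) {
            $owner = $owner->parentNode;
        }
        // The node path serves as a stable key for the owning element.
        $key = $owner->getNodePath();
        $groups[$key] = ($groups[$key] ?? '') . ' ' . $textNode->nodeValue;
    }

    $blocks = [];
    foreach ($groups as $text) {
        // Collapse runs of whitespace before counting words.
        $text = trim(preg_replace('/\s+/', ' ', $text));
        if (str_word_count($text) > $minWords) {
            $blocks[] = $text;
        }
    }
    return $blocks;
}
```

Calling extractTextBlocks($html) returns the qualifying blocks in document order. Two limitations of this sketch: loose text in a container is merged into one block even when child paragraphs sit between its parts, and str_word_count() counts words made of ASCII letters by default, so non-English text would need a different counter.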