Parsing large HTML files with Nokogiri

Views: 38 Published: 2021/6/8 18:48:29 Tags: ruby nokogiri

Question

I'm trying to parse http://www.pro-medic.ru/index.php?ht=246&perpage=all with Nokogiri, but unfortunately I can't get all of the items from the page.

My simple test code is:

require 'open-uri'
require 'nokogiri'

html = Nokogiri::HTML open('http://www.pro-medic.ru/index.php?ht=246&perpage=all')
p html.css('ul.products-grid-compact li .goods_container').count

It returns only 83 items, but the real count is about 186.

I thought the problem could be in open, but that function seems to read the HTML page correctly.

Has anybody faced the same problem?

Recommended answer

The file seems to exceed Nokogiri's parser limits. You can relax the limits by adding the HUGE flag:

require 'open-uri'
require 'nokogiri'

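url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# On Ruby 3.0+, open-uri no longer lets Kernel#open fetch URLs; use URI.open instead.
html = Nokogiri::HTML(URI.open(url)) do |config|
  # Add the HUGE flag to the default options to relax the parser limits.
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186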

Note that |= is the bitwise OR assignment operator; don't confuse it with the logical OR assignment operator ||=.
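
To make the difference concrete, here is a minimal sketch in plain Ruby. The flag values and variable names below are made up for illustration and are not Nokogiri's real constants; the point is only how the two operators treat a value that is already set.

```ruby
# Illustrative flag values only (hypothetical, not Nokogiri's actual constants).
recover_flag = 0b0001
huge_flag    = 0b0100

# Bitwise OR assignment: keeps the bits already set and adds the new one.
options = recover_flag
options |= huge_flag
bitwise_result = options

# Logical OR assignment: assigns only when the variable is nil or false,
# so an existing integer is left untouched and the new flag is never added.
options = recover_flag
options ||= huge_flag
logical_result = options

puts bitwise_result   # 5 (both flags set)
puts logical_result   # 1 (huge_flag was silently ignored)
```

With ||= the parser options would stay at their defaults, so the parser limits would not be relaxed and the item count would not change.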

According to the ParseOptions documentation, you can also set this flag with config.huge.
