HTML Agility Pack Screen Scraping XPATH isn't returning data

Question

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability, and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath I see in Chrome DevTools and in Firebug on Firefox, and what my C# program sees.

The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND

The code I'm currently using is pretty quick and dirty...

   using System.Collections.Generic;
   using HtmlAgilityPack;

   // Retrieves product data from a Digikey part detail page.
   private static List<string> ExtractProductInfo(HtmlDocument doc)
   {
       List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
       List<string> m_unparsedProductInfo = new List<string>();

       // Base node for part info
       string m_baseNode = @"//html[1]/body[1]/div[2]";

       // Write part info to list. This path retrieves the Digikey part number.
       // SelectSingleNode returns null when the path matches nothing.
       m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
       // More lines of similar form will go here for more info

       foreach (HtmlNode node in m_unparsedProductInfoNodes)
       {
           // Skip unmatched paths instead of throwing a NullReferenceException.
           if (node != null)
           {
               m_unparsedProductInfo.Add(node.InnerText);
           }
       }

       return m_unparsedProductInfo;
   }

Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes".

Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode, it returns only a div whose only significant child is "cs=####", where the number seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set. Leaving it out doesn't help either: all data past div[2] is still returned as NULL.

Recommended answer

Just an update:

I switched from C# to the somewhat friendlier Python (my programming experience is asm, C, and Python; the whole OO thing was totally new) and managed to correct my XPath issues. The "cs=####" tag was indeed the problem, but luckily it's unique, so a small regular expression and one removed line later, I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight, I'd like to hear it.
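
The regex clean-up described above might look roughly like the sketch below. It is not the author's actual code: it assumes the offending tag appears in the raw page source as something like `<cs=1234>` (the exact form is not shown in the question), and the names `CS_TAG`, `strip_cs_tag`, and the `sample` fragment are hypothetical. One part of the puzzle is certain: `=` is the XPath equality operator, so a path step such as `/cs=0` is parsed as a comparison that yields a boolean rather than a node set, which matches the error reported in the question. How a given HTML parser nests the rest of the page under such a tag can vary, so removing it before parsing sidesteps the issue.

```python
import re

# Assumption: the tag looks like <cs=1234>, with digits that vary by user agent.
# A closing form </cs=1234> is matched too, in case the page contains one.
CS_TAG = re.compile(r"</?cs=\d+[^>]*>", re.IGNORECASE)


def strip_cs_tag(html: str) -> str:
    """Return the HTML with every <cs=####> tag removed before parsing."""
    return CS_TAG.sub("", html)


# Hypothetical fragment standing in for the real page source.
sample = (
    "<body><div>header</div><div><cs=0>"
    "<table><tr><td>296-12602-1-ND</td></tr></table></div></body>"
)
cleaned = strip_cs_tag(sample)
print(cleaned)
```

The cleaned string can then be handed to whichever HTML parser is in use, and the XPath from the browser tools has a better chance of lining up with the parsed tree.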
