How to extract source from Google search result "20-pack" entry?


Problem description

The search results page for a local Google search typically contains 20 results.

To get the full contact details for any given result on the left-hand side, the result needs to be clicked. After a lengthy wait, this brings up an overlay (not sure of the technical term) over the Google Maps pane. That is the behaviour on Firefox; other web browsers do something different.

I am extracting the business name, address, phone and website with Python and WebDriver thus:

address = driver.find_element_by_xpath("//div[@id='akp_uid_0']/div/div/ol/li/div/div/div/ol/table/tbody/tr[2]/td/li/div/div/span[2]").text

name = driver.find_element_by_css_selector(".kno-ecr-pt").text.encode('raw_unicode_escape')
phone = driver.find_element_by_css_selector("div._mr:nth-child(2) > span:nth-child(2)").text

website = driver.find_element_by_css_selector("a.lua-button:nth-child(1)").get_attribute("href")

This works reliably, but it is extremely slow: loading each Maps overlay can take tens of seconds. I've tried PhantomJS via WebDriver, but quickly got blocked by Google's bot detection.

If my reading of Firebug is correct, each of these links on the left hand side is defined like so:

<a data-ved="0CA4QyTMwAGoVChMIj66ruJHGxwIVTKweCh03Sgw0" data-async-trigger="" data-height="0" data-cid="11660382088875336582" data-akp-stick="H4sIAAAAAAAAAGOovnz8BQMDgycHm5SIoaGZmYGxhZGBhYWFuamxsZmphZESVtEoyeSMzKL8gqLE5JL8omLtvNRyhcr8omztvMrkA51e-lt5XiW0n3kw-e7MFfkJwUIAxqbXGGYAAAA" data-akp-oq="Body in Balance Chiropractic New York, NY" jsl="$x 3;" data-rtid="ifLMvGmjeYOk" jsaction="r.UQJvbqFUibg" class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" tabindex="0" role="link"><div class="_Ml"><div class="_pl _ki"><div role="heading" aria-level="3" style="margin-right:0px" class="_rl">Body in Balance <wbr></wbr>Chiropractic</div><div class="_lg"><span aria-hidden="true" class="rtng" style="margin-right:5px">5.0</span><g-review-stars><span aria-label="Rated 5.0 out of 5" class="_pxg _Jxg"><span style="width:70px"></span></span></g-review-stars><div style="display:inline;font-size:13px;margin-left:5px"><span>20 reviews</span></div></div><div class="_tf"><span>Chiropractor</span>&nbsp;·&nbsp;W 45th St</div><div class="_CRe"><div><span>Opens at 8:00 am</span></div></div></div></div></a>
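
For the fields that are already present in the result link itself (the data-cid, the business name, the rating), no click is needed at all. As a minimal sketch, assuming the markup still looks like the anchor quoted above, the page source can be parsed with Python's standard html.parser. The names ResultLinkParser, parse_result_links and the abbreviated SAMPLE string are made up for illustration; the class names rllt__link and rtng come from the quoted markup and may well have changed since. Phone and website are not in this markup, so this does not replace the overlay for those fields.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collects data-cid, name and rating from each result <a> tag."""

    def __init__(self):
        super().__init__()
        self.results = []     # one dict per result link
        self._current = None  # dict for the <a> being parsed, if any
        self._field = None    # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "rllt__link" in classes:
            self._current = {
                "cid": attrs.get("data-cid"),
                "query": attrs.get("data-akp-oq"),
                "name": "",
                "rating": None,
            }
        elif self._current is not None:
            if tag == "div" and attrs.get("role") == "heading":
                self._field = "name"
            elif tag == "span" and "rtng" in classes:
                self._field = "rating"

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "a":
            # Collapse the whitespace left around <wbr> inside the heading.
            self._current["name"] = " ".join(self._current["name"].split())
            self.results.append(self._current)
            self._current = None
        elif tag in ("div", "span"):
            self._field = None

    def handle_data(self, data):
        if self._current is None or self._field is None:
            return
        if self._field == "name":
            self._current["name"] += data
        elif self._field == "rating" and data.strip():
            self._current["rating"] = data.strip()


def parse_result_links(html):
    """Return a list of dicts, one per result link found in html."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


# Abbreviated copy of the anchor quoted in the question (illustration only).
SAMPLE = (
    '<a data-cid="11660382088875336582" '
    'data-akp-oq="Body in Balance Chiropractic New York, NY" '
    'class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" role="link">'
    '<div class="_Ml"><div class="_pl _ki">'
    '<div role="heading" aria-level="3" class="_rl">'
    'Body in Balance <wbr></wbr>Chiropractic</div>'
    '<div class="_lg"><span aria-hidden="true" class="rtng">5.0</span></div>'
    '</div></div></a>'
)

if __name__ == "__main__":
    for entry in parse_result_links(SAMPLE):
        print(entry["cid"], entry["name"], entry["rating"])
```

With WebDriver, the input would be driver.page_source rather than SAMPLE, so all 20 links are read in a single pass.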

My knowledge of CSS and JavaScript is practically nil, so I may not be asking the right question. But is there a way to get at the underlying source of what eventually hovers over the Maps pane (there's probably a more technical term for it), without having to click on the link on the left-hand side to bring it up? My thinking is that if I can get at that HTML and parse it without actually having to trigger the overlay, I can save a lot of time.

Solution

I have tried to inspect the DOM structure of the page you provided. IE and Firefox handle this page very differently (IE navigates to another page once you click one of the left-hand-side items).

Because of limitations in my environment, I could only check this in IE. For Firefox, you can try the following code. There may be minor issues (apologies, I am unable to test it).

Note: I wrote the demo in Java (it only looks up the phone number) because I am familiar with Java. I am also not good with CSS selectors, so I used XPath instead. Hope it helps.

        driver.get("https://www.google.com/search?q=chiropractors%2Bnew%20york%2Bny&rflfq=1&tbm=lcl&tbs=lf:1,lf_ui:2&oll=40.754671143320074,-73.97722375000001&ospn=0.017814865199625274,0.040340423583984375&oz=15&fll=40.75807315356519,-73.99290368792725&fspn=0.01641614335274255,0.040340423583984375&fz=15&ved=0CJIBENAnahUKEwj1jtnmtcbHAhVTCo4KHfkkCYM&bav=on.2,or.r_cp.&biw=1360&bih=608&dpr=1&sei=y4LdVYvcFsa7uATo_LngCQ&ei=4YTdVbWaENOUuAT5yaSYCA&emsg=NCSR&noj=1&rlfi=hd:;si:#emsg=NCSR&rlfi=hd:;si:&sei=y4LdVYvcFsa7uATo_LngCQ");

        // 0. Not strictly needed unless your connection to Google is slow.
        Thread.sleep(5000);

        // 1. The XPath on class '_gt' selects the div of each of the 20 results on the left-hand side. This works in both IE and Firefox.
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='_gt']"));

        // 2. Iterate over the results, using 'data-cid' as the identifier. Note: Firefox only; in IE there are no data-cid attributes.
        for (WebElement e : elements) {
            WebElement aTag = e.findElement(By.tagName("a"));
            String dataCid = aTag.getAttribute("data-cid");

            // 3. In Firefox, the div that contains the details can be identified by its 'data-cid'.
            //    (Untested; note that the question's own XPath matches this div by @id='akp_uid_0', not @class.)
            WebElement parentDivOfTable = driver.findElement(By.xpath("//div[@class='akp_uid_0' and @data-cid='" + dataCid + "']"));

            // 4. Get the information table. The leading '.' keeps the search inside parentDivOfTable.
            WebElement table = parentDivOfTable.findElement(By.xpath(".//table[@class='_B5g']"));

            // Get the phone number: the first sibling element after the 'Phone:' label.
            String phoneNum = table.findElement(By.xpath(".//span[text()='Phone:']/following-sibling::*[1]")).getText();
        }
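
Since the question uses Python, the same loop can be sketched with the Python bindings. This is a direct port of the untested Java above, so the same caveats apply: the class names _gt, akp_uid_0 and _B5g are taken from the answer, and it is only an assumption that the detail panels are already in the DOM before a result is clicked. The function name collect_phone_numbers and the constant names are hypothetical. It uses find_elements (plural) so that a missing link or panel yields an empty list instead of an exception.

```python
# "xpath" is the string value of selenium.webdriver.common.by.By.XPATH,
# so no selenium import is needed in this sketch.
XPATH = "xpath"

# Selectors taken from the answer above; all of them are untested assumptions.
RESULT_XPATH = "//div[@class='_gt']"
LINK_XPATH = ".//a"
PANEL_XPATH = "//div[@class='akp_uid_0' and @data-cid='%s']"
PHONE_XPATH = (
    ".//table[@class='_B5g']"
    "//span[text()='Phone:']/following-sibling::*[1]"
)


def collect_phone_numbers(driver):
    """Return {data-cid: phone text} for every result whose panel is found."""
    phones = {}
    for result in driver.find_elements(XPATH, RESULT_XPATH):
        links = result.find_elements(XPATH, LINK_XPATH)
        if not links:
            continue
        cid = links[0].get_attribute("data-cid")
        if not cid:
            continue  # e.g. IE, where the answer says data-cid is absent
        panels = driver.find_elements(XPATH, PANEL_XPATH % cid)
        if not panels:
            continue  # detail panel for this result is not in the DOM
        matches = panels[0].find_elements(XPATH, PHONE_XPATH)
        if matches:
            phones[cid] = matches[0].text
    return phones
```

If this returns an empty dict on the live page, the panels are most likely only created after a click, in which case this approach cannot avoid the wait.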
