How to collect data of Google Search with Beautiful Soup using Python


Problem Description


I want to know how I can collect all the URLs from the page source using Beautiful Soup, visit each of them one by one in the Google search results, and move on to the next Google index pages.

Here is the URL https://www.google.com/search?q=site%3Awww.rashmi.com&rct=j that I want to collect from, and a screenshot here: http://www.rashmi.com/blog/wp-content/uploads/2014/11/screencapture-www-google-com-search-1433026719960.png
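Since the goal is to move through Google's result pages, it helps to know that Google paginates via the start query parameter, ten results per page. A minimal sketch of generating the page URLs up front (the query string is taken from the URL above; the page count of 5 is arbitrary):

```python
# Google shows 10 results per page, so page n begins at result n * 10.
base = "https://www.google.com/search?q=site:www.rashmi.com"
page_urls = [base + "&start=" + str(n * 10) for n in range(5)]

for u in page_urls:
    print(u)
```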

Here is the code I'm trying:

import time
from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

def getPageLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)[0]

def PagesVisit(browser, printInfo):
    pageIndex = 1
    visited = []
    time.sleep(5)
    while True:
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=50hqVdCqJozEogS7uoKADg" + str(pageIndex) + "&start=10&sa=N")
        pList = []
        count = 0

        pageIndex += 1
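For what it's worth, the href filter in getPageLinks can be tried out offline. The sketch below applies the same filter using the standard library's html.parser instead of BeautifulSoup (the HTML snippet is made up for illustration; with BeautifulSoup the same hrefs come back from page.find_all('a')):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values that point at www.rashmi.com/."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            url = dict(attrs).get('href')
            if url and 'www.rashmi.com/' in url:
                self.links.append(url)

# Hypothetical snippet standing in for a Google results page
html = """
<a href="http://www.rashmi.com/blog/post1">Post 1</a>
<a href="http://example.org/other">Other</a>
<a>no href</a>
"""

parser = LinkCollector()
parser.feed(html)
print(parser.links)
```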

Solution

Try this; it should work.

import time
import random
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

def getPageLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)

def PagesVisit(browser, printInfo):
    start = 0
    visited = []
    time.sleep(5)
    while True:
        # Step through the result pages via the "start" query parameter
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=V896VdiLEcPmUsK7gdAH&start=" + str(start) + "&sa=N")
        pList = []
        count = 0
        # Random sleep to make sure everything loads
        time.sleep(random.randint(1, 5))
        page = BeautifulSoup(browser.page_source, "html.parser")
        start += 10
        if start == 500:
            browser.close()
            break
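One note on the Links helper: Google often wraps result links as /url?q=&lt;real URL&gt;&..., so the actual destination sits under the 'q' key of the dictionary that parse_qs returns. A small sketch (the href value is invented for illustration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical redirect-style href as it appears in Google's result HTML
href = "/url?q=http://www.rashmi.com/blog/sample-post&sa=U&ei=abc123"

pUrl = urlparse(href)
params = parse_qs(pUrl.query)   # e.g. {'q': [...], 'sa': ['U'], 'ei': ['abc123']}
target = params['q'][0]         # the actual result URL

print(target)
```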
