For loop for web scraping in Python


Problem description

I have a small project that scrapes Google search results for a list of keywords. I built a nested for loop to scrape the results. The problem is that the outer loop over the keywords does not work as I intended, which was to scrape the data from each search result page. The output contains only the result for the last keyword, except for the first two search results.

Here is the code:

browser = webdriver.Chrome(r"C:\...\chromedriver.exe")

df = pd.DataFrame(columns = ['ceo', 'value'])

baseUrl = 'https://www.google.com/search?q='

html = browser.page_source
soup = BeautifulSoup(html)

ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
values =[]


for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')

    df = pd.DataFrame()
    for i in r:

        value = i.select_one('div.Z1hOCe').text                     
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text   
        values = [ceo, value]

    s = pd.Series(values)
    df = df.append(s,ignore_index=True)


print(df)

Output:

              0                                                  1
0  Warren Buffet  Born: October 28, 1955 (age 64 years), Seattle...

The output I expect is:

              0                                                  1
0  Bill Gates      Born:..........
1  Elon Musk       Born:...........
2  Warren Buffett  Born: August 30, 1930 (age 89 years), Omaha, N...


Any suggestions or comments are welcome here.

Recommended answer

Declare df = pd.DataFrame() outside the for loop.

Currently it is defined inside the loop, so each keyword in your list initializes a new DataFrame and the previous one is replaced. That is why you only get the result for the last keyword.
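The effect can be reproduced without a browser. The following is a minimal sketch that uses a plain list as a stand-in for the DataFrame; the function names `collect_reset_inside` and `collect_declared_outside` are hypothetical and only illustrate the two placements of the initialisation.

```python
keywords = ["Bill Gates", "Elon Musk", "Warren Buffet"]


def collect_reset_inside(items):
    """Buggy pattern: the container is recreated on every iteration."""
    rows = []
    for item in items:
        rows = []  # re-initialised here, so rows from earlier iterations are lost
        rows.append(item)
    return rows


def collect_declared_outside(items):
    """Fixed pattern: the container is created once, before the loop."""
    rows = []
    for item in items:
        rows.append(item)
    return rows


print(collect_reset_inside(keywords))      # only the last keyword survives
print(collect_declared_outside(keywords))  # all keywords are kept
```

The same reasoning applies to `df = pd.DataFrame()`: whatever was appended during the previous iteration is discarded as soon as the name is rebound to a fresh object.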

Try this:

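import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
baseUrl = 'https://www.google.com/search?q='
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]

# Create the DataFrame once, outside the loop, so rows accumulate across keywords
df = pd.DataFrame(columns=['ceo', 'value'])

for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    # Parse the page after navigating; parsing once before the loop reuses stale HTML
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        # Use a separate name so the loop variable ceo is not overwritten
        name = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        # Add the row inside the inner loop; df.append was removed in pandas 2.0
        df.loc[len(df)] = [name, value]

print(df)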
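The search URL in the code is built by concatenating the base URL with the raw keyword, spaces included. Browsers usually tolerate this, but encoding the keyword explicitly is more predictable. The following is a small sketch using the standard library's `urllib.parse.quote_plus`; the helper name `build_search_url` is hypothetical and not part of the original answer.

```python
from urllib.parse import quote_plus

base_url = 'https://www.google.com/search?q='


def build_search_url(keyword):
    """Return the search URL with the keyword safely encoded as a query value."""
    return base_url + quote_plus(keyword)


for keyword in ["Bill Gates", "Elon Musk", "Warren Buffet"]:
    print(build_search_url(keyword))
```

In the loop, `browser.get(build_search_url(ceo))` could then stand in for `browser.get(baseUrl + ceo)`, which also keeps characters such as `&` in a keyword from being read as a new query parameter.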
