Webscrape Multiple Pages with Python - output issue


Problem description

Happy New Year, Python community,

I am trying to extract a table from a website using Python and BeautifulSoup4.

I am struggling to see the results in my output file. The code runs smoothly, but nothing is written to the file.

My code is below:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value='
r = rq.get(base_url)

soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=&page=.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because Python range.
url_list = ["{}&page={}".format(base_url, str(page)) for page in range(1, 3)]

# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

Recommended answer

soup_new.find_all('tr', align='center') returns an empty list, so the loop body never runs and nothing is written.

Try switching it to for tr in soup_new.find_all('tr'):
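A minimal sketch of why the attribute filter can come back empty. The markup below is a made-up sample (the real page's HTML is not shown in the question), and it assumes the rows carry no align attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the real site.
SAMPLE_HTML = """
<table>
  <tr><td>Alpha SARL</td><td>Dakar</td></tr>
  <tr><td>Beta SA</td><td>Thies</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only <tr> tags that literally have align="center" match this filter.
filtered_rows = soup.find_all("tr", align="center")

# Without the attribute filter, every <tr> is returned.
all_rows = soup.find_all("tr")

print(len(filtered_rows), len(all_rows))
```

Printing the length of the result before looping is a quick way to check whether a selector matches anything at all.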

Secondly, since you're writing strings, switch the mode from with open("results.txt","wb") to with open("results.txt","w")
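The difference between the two modes can be sketched as follows, assuming Python 3 (where binary mode only accepts bytes). The temporary path and the sample row are placeholders for illustration:

```python
import os
import tempfile

# Placeholder row, shaped like the joined table cells in the scraper.
row = ", ".join(["Alpha SARL", "Dakar"]) + "\n"

# Write into a throwaway directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "results.txt")

# Binary mode expects bytes; in Python 3, writing a str raises TypeError.
try:
    with open(path, "wb") as f:
        f.write(row)
    binary_write_failed = False
except TypeError:
    binary_write_failed = True

# Text mode accepts str directly.
with open(path, "w", encoding="utf-8") as f:
    f.write(row)

with open(path, encoding="utf-8") as f:
    written = f.read()

print(binary_write_failed, repr(written))
```

If binary mode is really needed, the string would have to be encoded first, for example with row.encode("utf-8").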

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=&page=2'
r = rq.get(base_url)

soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=&page=.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Only pages 1 and 2 are requested here; num_pages above is computed but not used.
url_list = ["{}&page={}".format(base_url, str(page)) for page in range(1, 3)]

# Open the text file. Use with to save self from grief.
with open("results.txt","w") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
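One side effect of looping over every tr is that rows without td cells (such as a header row made of th cells) produce empty lines in the output. A possible variation, not part of the original answer, is sketched below: it skips such rows and uses the standard csv module so that commas inside cells are quoted. The function name extract_rows and the sample markup are hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_rows(html):
    """Return one list of cell strings per <tr> that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # get_text(strip=True) trims surrounding whitespace from each cell.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows with only <th> cells yield no <td>, so skip them
            rows.append(cells)
    return rows


# Hypothetical sample markup, not taken from the real site.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td> Alpha SARL </td><td>Dakar</td></tr>
  <tr><td>Beta, SA</td><td>Thies</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.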
