Is there a better approach to use BeautifulSoup in my python web crawler codes?


Problem Description


I'm trying to crawl information from urls in a page and save them in a text file.

I received great help in the question How to get the right source code with Python from the URLs using my web crawler?, and I am trying to use what I learned about BeautifulSoup to finish my codes based on that question.

But when I look at my codes, although they satisfy my need, they look pretty messed up. Can anyone help me optimize them a little, especially the BeautifulSoup part, such as the infoLists part and the saveinfo part? Thanks!

Here are my codes:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

#To get the source code from the url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]

#To save the info in the info.txt file
def saveinfo(infoLists):
    f = open('info.txt', 'a')
    for each in infoLists:
        f.writelines('Job Title: ' + str(each['title'].encode('utf-8')) + '\n')
        f.writelines('Company Name: ' + str(each['companyName'].encode('utf-8')) + '\n')
        f.writelines('Company Address: ' + str(each['address'].encode('utf-8')) + '\n')
        f.writelines('Job Position: ' + str(each['position'].encode('utf-8')) + '\n')
        f.writelines('Salary: ' + str(each['salary'].encode('utf-8')) + '\n')
        f.writelines('Full/Part time: ' + str(each['jobType'].encode('utf-8')) + '\n')
        f.writelines('Company Tel: ' + str(each['tel'].encode('utf-8')) + '\n')
        f.writelines('Company Email: ' + str(each['email'].encode('utf-8')) + '\n')
        f.writelines('WorkTime: ' + str(each['workTime'].encode('utf-8')) + '\n\n')
    f.close()

sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
linkNum=1
infoLists=[]
for eachLink in allLinksinPage:
    print('Now downloading link '+str(linkNum))
    url = 'http://bbs.skykiwi.com/'
    realUrl=urljoin(url, eachLink)
    html = getsourse(realUrl)
    soup= BeautifulSoup(html)
    infoList={} #To save the following info,such as title companyName etc
    infoList['title']=soup.find(attrs={'id':'thread_subject'}).string
    infoList2=[] #To temporarily save info except 'title'
    #FROM HERE IT GETS MESSY...
    for line in soup.find_all(attrs={'class':'typeoption'}): # first locate the bigClass
        for td in line.find_all('td'):  # then locate all the 'td's
            infoList2.append(td.string)
        try:
            for eachInfo in infoList2:
                infoList['companyName'] = infoList2[0]
                infoList['address'] = infoList2[1]
                infoList['position'] = infoList2[2]
                infoList['salary'] = infoList2[3]
                infoList['jobType'] = infoList2[4]
                infoList['tel'] = infoList2[5]
                infoList['email'] = infoList2[6]
                infoList['workTime'] = infoList2[7]
        finally:
            linkNum += 1 # To print link number
    infoLists.append(infoList)

saveinfo(infoLists)

Solution

Using zip() and a list comprehension would dramatically improve readability:

headers = ['companyName', 'address', 'position', 'salary', 'jobType', 'tel', 'email', 'workTime']

infoLists = [dict(zip(headers, [item.string for item in line.find_all('td')[:8]])) 
             for line in soup.select(".typeoption")]
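
For context, zip() pairs each header name with the td value in the same position, and dict() turns those pairs into the mapping the original code built with eight indexed assignments. Below is a minimal sketch of one way to fold this into the crawling loop from the question; it assumes the same url, getsourse() and getallLinksinPage() helpers defined there, and is an illustration rather than the answerer's exact code:

headers = ['companyName', 'address', 'position', 'salary',
           'jobType', 'tel', 'email', 'workTime']

infoLists = []
for linkNum, eachLink in enumerate(getallLinksinPage(getsourse(url)), 1):
    print('Now downloading link ' + str(linkNum))
    soup = BeautifulSoup(getsourse(urljoin('http://bbs.skykiwi.com/', eachLink)))
    # Start from the title, then merge in the eight labelled fields
    infoList = {'title': soup.find(attrs={'id': 'thread_subject'}).string}
    for line in soup.select('.typeoption'):
        # zip() lines each header up with the td in the same position
        infoList.update(zip(headers, [td.string for td in line.find_all('td')[:8]]))
    infoLists.append(infoList)

enumerate() replaces the hand-maintained linkNum counter, and update(zip(...)) removes both the try/finally block and the redundant inner loop over infoList2.

The saveinfo part the question also asks about can be tightened the same way, by looping over (label, key) pairs instead of repeating nine writelines calls. A sketch, again assuming the dictionary keys used above:

fields = [('Job Title', 'title'), ('Company Name', 'companyName'),
          ('Company Address', 'address'), ('Job Position', 'position'),
          ('Salary', 'salary'), ('Full/Part time', 'jobType'),
          ('Company Tel', 'tel'), ('Company Email', 'email'),
          ('WorkTime', 'workTime')]

def saveinfo(infoLists):
    with open('info.txt', 'a') as f:  # 'with' closes the file even if a write fails
        for each in infoLists:
            for label, key in fields:
                f.write(label + ': ' + each[key].encode('utf-8') + '\n')
            f.write('\n')  # blank line between records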
