Is there a better approach to use BeautifulSoup in my python web crawler codes?

Question
I'm trying to crawl information from the URLs in a page and save it in a text file.
I received great help in the question How to get the right source code with Python from the URLs using my web crawler? and I tried to use what I learned about BeautifulSoup to finish my code based on that question.
But when I look at my code, although it satisfies my needs, it looks pretty messed up. Can anyone help me optimize it a little, especially the BeautifulSoup part, such as the infoLists part and the saveInfo part? Thanks!
Here is my code:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
#To get the source code from the url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content
#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]
#To save the info in the info.txt file
def saveinfo(infoLists):
    f = open('info.txt', 'a')
    for each in infoLists:
        f.writelines('Job Title: ' + str(each['title'].encode('utf-8')) + '\n')
        f.writelines('Company Name: ' + str(each['companyName'].encode('utf-8')) + '\n')
        f.writelines('Company Address: ' + str(each['address'].encode('utf-8')) + '\n')
        f.writelines('Job Position: ' + str(each['position'].encode('utf-8')) + '\n')
        f.writelines('Salary: ' + str(each['salary'].encode('utf-8')) + '\n')
        f.writelines('Full/Part time: ' + str(each['jobType'].encode('utf-8')) + '\n')
        f.writelines('Company Tel: ' + str(each['tel'].encode('utf-8')) + '\n')
        f.writelines('Company Email: ' + str(each['email'].encode('utf-8')) + '\n')
        f.writelines('WorkTime: ' + str(each['workTime'].encode('utf-8')) + '\n\n')
    f.close()
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
linkNum=1
infoLists=[]
for eachLink in allLinksinPage:
    print('Now downloading link ' + str(linkNum))
    url = 'http://bbs.skykiwi.com/'
    realUrl = urljoin(url, eachLink)
    html = getsourse(realUrl)
    soup = BeautifulSoup(html)
    infoList = {}  # To save the following info, such as title, companyName etc
    infoList['title'] = soup.find(attrs={'id': 'thread_subject'}).string
    infoList2 = []  # To temporarily save info except 'title'
    # FROM HERE IT GETS MESSY...
    for line in soup.find_all(attrs={'class': 'typeoption'}):  # first locate the bigClass
        for td in line.find_all('td'):  # then locate all the 'td's
            infoList2.append(td.string)
    try:
        for eachInfo in infoList2:
            infoList['companyName'] = infoList2[0]
            infoList['address'] = infoList2[1]
            infoList['position'] = infoList2[2]
            infoList['salary'] = infoList2[3]
            infoList['jobType'] = infoList2[4]
            infoList['tel'] = infoList2[5]
            infoList['email'] = infoList2[6]
            infoList['workTime'] = infoList2[7]
    finally:
        linkNum += 1  # To print link number
        infoLists.append(infoList)
saveinfo(infoLists)
Using zip() and a list comprehension would dramatically improve readability:
headers = ['companyName', 'address', 'position', 'salary', 'jobType', 'tel', 'email', 'workTime']
infoLists = [dict(zip(headers, [item.string for item in line.find_all('td')[:8]]))
for line in soup.select(".typeoption")]
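To see why the zip() pattern replaces the eight indexed assignments, here is a minimal, self-contained sketch with plain lists standing in for the td.string values scraped from a job page. The header names mirror the question's; the sample row values are made up purely for illustration:

```python
headers = ['companyName', 'address', 'position', 'salary',
           'jobType', 'tel', 'email', 'workTime']

# Hypothetical sample row standing in for the <td> strings that
# BeautifulSoup would extract from one job posting page.
rows = [
    ['Acme Ltd', '1 Queen St', 'Chef', '$25/h',
     'Full time', '021-000-0000', 'hr@acme.example', '9am-5pm'],
]

# zip() pairs each header with the value at the same position, and
# dict() turns those pairs into a mapping, so one expression does
# the work of eight separate infoList[...] = infoList2[i] lines.
infoLists = [dict(zip(headers, row[:8])) for row in rows]

print(infoLists[0]['position'])
```

The slice row[:8] guards against pages whose typeoption table carries extra cells; if a page has fewer than eight cells, the resulting dict will simply be missing the trailing keys, which may be preferable to the IndexError the original indexed version would raise.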