Is there a better approach to use BeautifulSoup in my python web crawler codes?

Question
I'm trying to crawl information from the URLs in a page and save it in a text file.
I received great help in the question How to get the right source code with Python from the URLs using my web crawler? and I tried to use what I learned about BeautifulSoup to finish my code based on that question.
But when I look at my code, although it satisfies my needs, it looks pretty messed up. Can anyone help me optimize it a little, especially the BeautifulSoup part, such as the infoLists part and the saveInfo part? Thanks!
Here is my code:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
#To get the source code from the url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content
#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]
#To save the info in the info.txt file
def saveinfo(infoLists):
    f = open('info.txt', 'a')
    for each in infoLists:
        f.writelines('Job Title: ' + str(each['title'].encode('utf-8')) + '\n')
        f.writelines('Company Name: ' + str(each['companyName'].encode('utf-8')) + '\n')
        f.writelines('Company Address: ' + str(each['address'].encode('utf-8')) + '\n')
        f.writelines('Job Position: ' + str(each['position'].encode('utf-8')) + '\n')
        f.writelines('Salary: ' + str(each['salary'].encode('utf-8')) + '\n')
        f.writelines('Full/Part time: ' + str(each['jobType'].encode('utf-8')) + '\n')
        f.writelines('Company Tel: ' + str(each['tel'].encode('utf-8')) + '\n')
        f.writelines('Company Email: ' + str(each['email'].encode('utf-8')) + '\n')
        f.writelines('WorkTime: ' + str(each['workTime'].encode('utf-8')) + '\n\n')
    f.close()
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
linkNum=1
infoLists=[]
for eachLink in allLinksinPage:
    print('Now downloading link ' + str(linkNum))
    url = 'http://bbs.skykiwi.com/'
    realUrl = urljoin(url, eachLink)
    html = getsourse(realUrl)
    soup = BeautifulSoup(html)
    infoList = {}  # To save the following info, such as title, companyName etc
    infoList['title'] = soup.find(attrs={'id': 'thread_subject'}).string
    infoList2 = []  # To temporarily save info except 'title'
    # FROM HERE IT GETS MESSY...
    for line in soup.find_all(attrs={'class': 'typeoption'}):  # first locate the bigClass
        for td in line.find_all('td'):  # then locate all the 'td's
            infoList2.append(td.string)
    try:
        for eachInfo in infoList2:
            infoList['companyName'] = infoList2[0]
            infoList['address'] = infoList2[1]
            infoList['position'] = infoList2[2]
            infoList['salary'] = infoList2[3]
            infoList['jobType'] = infoList2[4]
            infoList['tel'] = infoList2[5]
            infoList['email'] = infoList2[6]
            infoList['workTime'] = infoList2[7]
    finally:
        linkNum += 1  # To print link number
        infoLists.append(infoList)
saveinfo(infoLists)
Using zip() and a list comprehension would dramatically improve readability:
headers = ['companyName', 'address', 'position', 'salary', 'jobType', 'tel', 'email', 'workTime']
infoLists = [dict(zip(headers, [item.string for item in line.find_all('td')[:8]]))
for line in soup.select(".typeoption")]
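To see why the zip() pattern replaces the eight indexed assignments, here is a minimal, self-contained sketch with plain lists standing in for the td.string values scraped from a job page. The header names mirror the question's; the sample row values are made up purely for illustration:

```python
headers = ['companyName', 'address', 'position', 'salary',
           'jobType', 'tel', 'email', 'workTime']

# Hypothetical sample row standing in for the <td> strings that
# BeautifulSoup would extract from one job posting page.
rows = [
    ['Acme Ltd', '1 Queen St', 'Chef', '$25/h',
     'Full time', '021-000-0000', 'hr@acme.example', '9am-5pm'],
]

# zip() pairs each header with the value at the same position, and
# dict() turns those pairs into a mapping, so one expression does
# the work of eight separate infoList[...] = infoList2[i] lines.
infoLists = [dict(zip(headers, row[:8])) for row in rows]

print(infoLists[0]['position'])
```

The slice row[:8] guards against pages whose typeoption table carries extra cells; if a page has fewer than eight cells, the resulting dict will simply be missing the trailing keys, which may be preferable to the IndexError the original indexed version would raise.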