Making my own web crawler in Python which shows the main idea of PageRank
Question
I'm trying to make a web crawler that shows the basic idea of PageRank. The code seems fine to me, but it gives me back errors, e.g.:
Traceback (most recent call last):
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 89, in <module>
    webpages()
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
    get_single_item_data(href)
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 23, in get_single_item_data
    source_code = requests.get(item_url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python34\lib\site-packages\requests\models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "C:\Python34\lib\site-packages\requests\models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL '//www.hm.com/lv/logout': No schema supplied. Perhaps you meant http:////www.hm.com/lv/logout?
The last row of output Python gives me back after I run it is:
//www.hm.com/lv/logout
Maybe the problem is the two //, but I'm not sure. Anyway, when I try to crawl other web pages, e.g. http://en.wikipedia.org/wiki/Wiki, it gives me back None and the same errors.
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from operator import itemgetter

all_links = defaultdict(int)

def webpages():
    url = 'http://www.hm.com/lv/'
    source_code = requests.get(url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        print(href)
        get_single_item_data(href)
    return all_links

def get_single_item_data(item_url):
    #if not item_url.startswith('http'):
    #    item_url = 'http' + item_url
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
                print(href)

def sort_algorithm(list):
    for index in range(1, len(list)):
        value = list[index]
        i = index - 1
        while i >= 0:
            if value < list[i]:
                list[i+1] = list[i]
                list[i] = value
                i = i - 1
            else:
                break

vieni = ["", "viens", "divi", "tris", "cetri", "pieci",
         "sesi", "septini", "astoni", "devini"]
padsmiti = ["", "vienpadsmit", "divpadsmit", "trispadsmit", "cetrpadsmit",
            "piecpadsmit", "sespadsmit", "septinpadsmit", "astonpadsmit", "devinpadsmit"]
desmiti = ["", "desmit", "divdesmit", "trisdesmit", "cetrdesmit",
           "piecdesmit", "sesdesmit", "septindesmit", "astondesmit", "devindesmit"]

def num_to_words(n):
    words = []
    if n == 0:
        words.append("zero")
    else:
        num_str = "{}".format(n)
        groups = (len(num_str) + 2) // 3
        num_str = num_str.zfill(groups * 3)
        for i in range(0, groups * 3, 3):
            h = int(num_str[i])
            t = int(num_str[i + 1])
            u = int(num_str[i + 2])
            print()
            print(vieni[i])
            g = groups - (i // 3 + 1)
            if h >= 1:
                words.append(vieni[h])
                words.append("hundred")
                if int(num_str) % 100:
                    words.append("and")
            if t > 1:
                words.append(desmiti[t])
                if u >= 1:
                    words.append(vieni[u])
            elif t == 1:
                if u >= 1:
                    words.append(padsmiti[u])
                else:
                    words.append(desmiti[t])
            else:
                if u >= 1:
                    words.append(vieni[u])
    return " ".join(words)

webpages()
for k, v in sorted(webpages().items(), key=itemgetter(1), reverse=True):
    print(k, num_to_words(v))
Answer
The links coming from the loop in the webpages function may start with two slashes. Such a protocol-relative link uses the scheme of the current page. For example, on https://en.wikipedia.org/wiki/Wiki the link "//en.wikipedia.org/login" resolves to "https://en.wikipedia.org/login", while on http://en.wikipedia.org/wiki/Wiki it resolves to http://en.wikipedia.org/login.
A better way to open a URL from an HTML "a" tag is to use the urljoin function (urlparse.urljoin in Python 2, urllib.parse.urljoin in Python 3). It joins the target URL against the current URL, regardless of whether the path is absolute or relative.
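A minimal sketch of how urljoin resolves the problem cases above. The base URL is the hm.com page from the question; the relative path 'shop/index.html' is a made-up example for illustration:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

base = 'http://www.hm.com/lv/'

# Protocol-relative link: inherits the scheme of the base page.
print(urljoin(base, '//www.hm.com/lv/logout'))
# -> http://www.hm.com/lv/logout

# Relative path: resolved against the base URL.
print(urljoin(base, 'shop/index.html'))
# -> http://www.hm.com/lv/shop/index.html

# Absolute URL: returned unchanged.
print(urljoin(base, 'http://en.wikipedia.org/wiki/Wiki'))
# -> http://en.wikipedia.org/wiki/Wiki
```

In the question's code, calling get_single_item_data(urljoin(url, href)) instead of get_single_item_data(href) would avoid the MissingSchema error, since every href is turned into an absolute URL first.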
Hope it helps.