使用python登录网站和网页抓取 [英] Login a Website and Web Scraping using python

查看:56
本文介绍了使用python登录网站和网页抓取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找出通过网络抓取房地产网站的方法 https://www.brickz.my/用于我的研究项目.我一直在硒和美丽汤之间进行尝试,并决定选择美丽汤对我来说是最好的方法,因为每个房地产的网址结构都使我的代码可以轻松快捷地浏览网站

I am trying to figure out ways to web scraping a real estate website https://www.brickz.my/ for my research project. I have been trying between selenium and beautiful soup and decide to choose beautiful soup was the best way for me since the structure of url for each real estate allow my code to navigate the website easily and faster

我正在尝试为每个房地产建立数据库交易.如果不登录,则仅显示10个针对特定属性的最新交易.通过登录,我可以访问特定类型的属性的整个交易.这是例子

I am trying to build a database transaction for each real estate'. Without login, only 10 latest transactions will be displayed for a particular property. By login, I can access to the whole transaction for a particular type of property. here is the example

没有登录,我每个属性只能访问10个交易

登录后,我可以访问10个以上的交易以及以前晦涩的财产地址

我尝试使用python中的请求登录,但是它一直将我带到该页面,而无需登录并最终退出,我只是设法取消了10个最新的交易,而不是整个交易.这是我的python登录代码示例

i try to login using request in python, yet it keep bringing me to the page without login and end up, i just manage to scrap the 10 latest transaction instead of whole transaction. here is the example of my login code in python

import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.brickz.my/login/", auth=
('email', 'password'))

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}

soup = BeautifulSoup(page.content, 'html.parser')
#I put one of the property url to be scrapped inside response
response = get("https://www.brickz.my/transactions/residential/kuala- 
           lumpur/titiwangsa/titiwangsa-sentral-condo/non-landed/?range=2012+Oct-", 
           headers = headers)

这是我用来刮桌子的东西

Here is what I used to scrape the table

  table = BeautifulSoup(response.text, 'html.parser')
  table_rows = table.find_all('tr')

  names = []   
  for tr in table_rows:
      td = tr.find_all('td')
      row = [i.text for i in td]
      names.append(row)

我如何能够成功登录并访问整个交易?我听说过Mechanize库,但是它不适用于python 3.

How am I able to successfully login and get access to the whole transaction? I heard about Mechanize library but it is not available for python 3.

很抱歉,如果我的问题不清楚,这是我第一次发帖,而我只是几个月前才学会使用python.

I am sorry if my question is not clear, this is my first time posting, and i just learn to use python only a couple of months ago.

推荐答案

尝试以下代码.打印时(更改电子邮件密码),您看到什么?它不会打印登出作为结果吗?

Try the below code. What do you see when you print it (changing email and password)? Doesn't it print Logoutas result?

import requests
from bs4 import BeautifulSoup

URL = "https://www.brickz.my/login/"

payload = {
'email': 'your_email',
'pw': 'your_password',
'submit': 'Submit'
}

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    s.post(URL,data=payload)
    res = s.get("https://www.brickz.my/")
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("select#menu_select .nav2"):
        data = [' '.join(item.text.split()) for item in items.select("option")[-1:]]
        print(data)

这篇关于使用python登录网站和网页抓取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆