How to scrape specific IDs from a webpage


Question

I need to do some real estate market research, and for this I need the prices and other values of new houses.

So my idea was to go to the website where I get the information: open the main search page and scrape all the RealEstateIDs, which would take me directly to the individual page for each house, where I can then extract the information I need.

My problem is how to get all the real estate IDs from the main page and store them in a list, so I can use them in the next step to build the URLs of the actual listing pages.

I tried it with BeautifulSoup but failed, because I don't understand how to search for a specific word and extract what comes after it.

The HTML code looks like this:

""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"@xsi.type":"search:ApartmentBuy","@id":"110356727","title":"

Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it each time (here: 110356727) and store the numbers in a list, so that I can use them later.

    import time
    import urllib.request
    from urllib.request import urlopen
    import bs4 as bs
    import datetime as dt
    import matplotlib.pyplot as plt
    from matplotlib import style
    import numpy as np
    import os
    import pandas as pd
    import pandas_datareader.data as web
    import pickle
    import requests
    from requests import get 
    url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
    response = get(url)
    from bs4 import BeautifulSoup
    html_soup = BeautifulSoup(response.text, 'html.parser')
    type(html_soup)

    def expose_IDs():
        resp = requests.get(url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        # This is where the attempt fails: find() looks for an HTML tag named
        # 'resultListModel', but that name only occurs as text inside a
        # <script>, so find() returns None and the loop below raises an error.
        table = soup.find('resultListModel')
        tickers = []
        for row in table.findAll('realestateID')[1:]:
            ticker = row.findAll(',')[0].text
            tickers.append(ticker)
        with open("exposeID.pickle", "wb") as f:
            pickle.dump(tickers, f)
        return tickers

    expose_IDs()

Recommended answer

Something like this? There are 68 keys in a dictionary that are IDs. I use a regex to grab the same script you are after, trim off an unwanted trailing character, then load the result with json.loads and access the JSON object as shown in the image at the bottom of the original answer.

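import json
import re

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
res = requests.get(url)
soup = bs(res.content, 'lxml')

# The listing data sits in an inline <script> on a line "resultListModel: {...},"
r = re.compile(r'resultListModel:(.*)')
# string= is the current name for the older text= argument of find()
data = soup.find('script', string=r).string
# Capture the rest of that line and trim the trailing comma so it parses as JSON
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)

# Each key of entryInformation is one real estate ID
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)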
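The regex-plus-json.loads step can be tried offline, without hitting the site. The sketch below is only an illustration: `SAMPLE_SCRIPT` is a made-up, heavily shortened stand-in for the page's inline script, and `extract_model` is a hypothetical helper name, not part of the original answer.

```python
import json
import re

# Hypothetical, shortened stand-in for the text of the page's inline <script>.
SAMPLE_SCRIPT = """
var IS24 = IS24 || {};
IS24.resultList = {
    resultListModel: {"searchResponseModel": {"entryInformation": {"110356727": {}, "110356728": {}}}},
    isUserLoggedIn: false
};
"""


def extract_model(script_text):
    """Return the JSON object that follows 'resultListModel:' in script_text."""
    match = re.search(r'resultListModel:(.*)', script_text)
    if match is None:
        raise ValueError("resultListModel not found in script text")
    # '.' does not cross newlines, so the capture is the rest of that one line;
    # it ends with the comma that separates the JavaScript properties.
    raw = match.group(1).strip().rstrip(',')
    return json.loads(raw)


model = extract_model(SAMPLE_SCRIPT)
ids = list(model['searchResponseModel']['entryInformation'].keys())
print(ids)
```

This only works while the JSON object stays on a single line after `resultListModel:`; if the site starts pretty-printing it across lines, the capture would need a different pattern.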

Edit: since the website has been updated:

import json
import re

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
res = requests.get(url)
soup = bs(res.content, 'lxml')

# Same extraction as before: find the inline script and parse the JSON after "resultListModel:"
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', string=r).string
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)

# After the site update, the IDs are the '@id' field of each result list entry
entries = results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
ids = [item['@id'] for item in entries]
print(ids)
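For the literal wording of the question (search for a specific word and extract what comes after it), a plain regex over the raw page text may also be enough. This is a different technique from the JSON approach in the answer, shown here only as a sketch: `SAMPLE_HTML` is modeled on the excerpt in the question with a made-up second ID, `find_real_estate_ids` is a hypothetical helper, and the `EXPOSE_URL` pattern for single listing pages is an assumption that should be checked against the site.

```python
import re

# Fragment modeled on the HTML excerpt in the question; the second ID is made up.
SAMPLE_HTML = (
    '"realEstateId":110356727,"newHomeBuilder":"false",'
    '"resultlist.realEstate":{"@id":"110356727"},'
    '"realEstateId":110356800,"newHomeBuilder":"false"'
)

# Assumed URL pattern for a single listing page; not confirmed by the source.
EXPOSE_URL = 'https://www.immobilienscout24.de/expose/{}'


def find_real_estate_ids(html):
    """Return each number that follows "realEstateId": once, in page order."""
    found = re.findall(r'"realEstateId"\s*:\s*(\d+)', html)
    # dict.fromkeys drops repeated IDs while keeping first-seen order.
    return list(dict.fromkeys(found))


ids = find_real_estate_ids(SAMPLE_HTML)
urls = [EXPOSE_URL.format(real_estate_id) for real_estate_id in ids]
print(ids)
print(urls)
```

On the live page, `html` would be `res.text` from the request. The regex is quicker to write but more brittle than parsing the JSON, since it depends on the exact key spelling and on the IDs staying numeric.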
