Scraping a table from a page using BeautifulSoup: table is not found

Question

I've been trying to scrape the table from here but it seems to me that BeautifulSoup doesn't find any table.

I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table   #prints nothing..

Based on other similar questions, I assume that the HTML is broken in some way, but I'm not an expert. I couldn't find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage)

Thanks a bunch!

Recommended answer

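You are parsing HTML, but you used the XML parser.
You should use soup = BeautifulSoup(data, "html.parser") instead.
The data you need is inside a script tag; the page contains no table tag at all, so find_all('table') has nothing to return. You need to extract the values from the text inside the script.
N.B.: The parser name "html.parser" is the same under Python 2 and Python 3. "HTMLParser" is only the name of the Python 2 standard-library module, not a parser name that BeautifulSoup recognizes. The script below is written for Python 3, since it passes newline="" to open().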
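
Before writing any extraction code, it can help to confirm where the data actually lives. The following is a minimal sketch of that check, not the page's real markup: SAMPLE_HTML is a hypothetical stand-in for the downloaded page, and the variable name inside the script is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: no <table>, data inside <script>.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>var rows = [{"School Name":"Example College","Rank":"1"}];</script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# An empty result here means the table is not present in the static HTML.
tables = soup.find_all("table")
print("tables found:", len(tables))

# Keep only the scripts that mention a column name expected in the table.
data_scripts = [
    s for s in soup.find_all("script")
    if s.string and '"School Name"' in s.string
]
print("scripts containing the data:", len(data_scripts))
```

If the first count is zero and the second is not, the table is built in the browser from the script's data, and the script text is what needs to be parsed.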

Here is the complete script:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

# Keys in the order in which they are searched for within each record.
keys = ["School Name", "Early Career Median Pay", "Mid-Career Median Pay",
        "Rank", "% High Job Meaning", "School Type", "% STEM"]
# Column order of the CSV file.
header = ["Rank", "School Name", "School Type", "Early Career Median Pay",
          "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]

rows = [header]
for script in scripts:
    text = script.text
    # Skip small scripts; the data sits in one large script block.
    if len(text) <= 10000:
        continue
    start = 0
    while True:
        record = {}
        for key in keys:
            marker = '"' + key + '":"'
            start = text.find(marker, start)
            if start == -1:
                # No further (complete) record in this script.
                break
            start += len(marker)
            end = text.find('"', start)
            record[key] = text[start:end]
        if start == -1:
            break
        rows.append([record[column] for column in header])

# newline="" (Python 3) keeps csv from writing blank lines on Windows;
# the with block closes the file even if writing fails.
with open("table.csv", "w", newline="") as csv_file:
    csv.writer(csv_file).writerows(rows)

This writes the table to table.csv in the current directory. Make sure the file is closed when you are done, so that all rows are flushed to disk.
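
If the script text turns out to embed the records as a JSON array, decoding it with the standard json module may be less fragile than slicing strings by hand. The sketch below assumes exactly that and is not taken from the original answer: SCRIPT_TEXT, the variable name collegeData, and the helper names extract_records and to_csv are all hypothetical, and the real page may lay its data out differently.

```python
import csv
import io
import json

# Hypothetical script text; the real page's variable name and layout may differ.
SCRIPT_TEXT = (
    'var collegeData = [{"School Name":"Example College","Rank":"1",'
    '"Early Career Median Pay":"$50,000"},'
    '{"School Name":"Sample University","Rank":"2",'
    '"Early Career Median Pay":"$48,000"}];'
)

COLUMNS = ["Rank", "School Name", "Early Career Median Pay"]


def extract_records(script_text):
    """Decode the first JSON array of objects found in script_text."""
    start = script_text.find("[{")
    if start == -1:
        return []
    # raw_decode parses one JSON value beginning at `start` and
    # ignores whatever JavaScript follows it.
    records, _end = json.JSONDecoder().raw_decode(script_text, start)
    return records


def to_csv(records, columns):
    """Render records as CSV text; fields not listed in columns are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = extract_records(SCRIPT_TEXT)
csv_text = to_csv(records, COLUMNS)
print(csv_text)
```

Because each record is decoded as a dictionary, the order of the keys inside the script no longer matters, and the column order is set in one place (COLUMNS).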
