BeautifulSoup table to dataframe


Problem Description


I can't seem to copy the values from the table into a dataframe correctly. If you run row_data, it outputs a flat list of all the values. Any idea how to make it structured?

import requests
import pandas as pd
from bs4 import BeautifulSoup

pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text

soup = BeautifulSoup(pop_source, 'html5lib')    
source = soup.find_all('td',class_ = 'report_data')


pop = pd.DataFrame(columns=['Zip Code','Population'])


row_data = [data.text for data in source]

temp_df = pd.DataFrame([row_data], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])   

temp_df = temp_df[['Zip Code','Population']]   
pop = pop.append(temp_df).reset_index(drop=True)    
pop

Answer

A few issues with the code:


  1. The code you have above doesn't actually loop through anything. In fact, if you try to run it you should get tons of errors; there's no way to get the output you describe from just the code you provided.

  2. Your indentation is off: you reference a variable source which hasn't been defined, and a variable cols which is also not defined.


  3. You initialize a dataframe pop with 2 columns, then try to append a dataframe with 7 columns.

All sorts of issues going on here.


Have you considered just using pandas directly to form the dataframe? You could still use BeautifulSoup, but pandas can do the work for you, and it seems like you could use a little more practice iterating through elements with BeautifulSoup (you never actually used .find or .find_all to locate the tags associated with the table you want).


If you NEED to do it the BeautifulSoup way, just let me know and I can offer that as well, but honestly it's a lot more work than just using .read_html() with pandas here.

import pandas as pd

url = 'http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm'

tables = pd.read_html(url)

# Table 11 on this page holds the ranking; drop its first row and
# promote it to the column header.
df = tables[11][1:]
df.columns = tables[11].iloc[0]

Output:

print (df)
0     # Zip Code      ...      People / Sq. Mile National Rank
1    1.    78705      ...               11008.66          #519
2    2.    78751      ...                5822.28        #1,374
3    3.    78752      ...                5435.92        #1,528
4    4.    78741      ...                5346.47        #1,562
5    5.    78723      ...                5175.95        #1,640
6    6.    78704      ...                5001.96        #1,713
7    7.    78758      ...                4954.80        #1,730
8    8.    78702      ...                4501.98        #2,015
9    9.    78757      ...                4380.92        #2,087
10  10.    78756      ...                4298.80        #2,139
11  11.    78745      ...                4063.22        #2,295
12  12.    78753      ...                3973.96        #2,350
13  13.    78703      ...                3491.54        #2,753
14  14.    78731      ...                3031.63        #3,167
15  15.    78759      ...                2998.68        #3,199
16  16.    78727      ...                2856.67        #3,371
17  17.    78749      ...                2795.02        #3,438
18  18.    78728      ...                2640.31        #3,614
19  19.    78721      ...                2568.43        #3,690
20  20.    78722      ...                2567.53        #3,692
21  21.    78729      ...                2366.94        #3,944
22  22.    78701      ...                2326.65        #3,995
23  23.    78748      ...                1961.73        #4,504
24  24.    78750      ...                1731.01        #4,870
25  25.    78744      ...                1464.78        #5,311
26  26.    78746      ...                1152.39        #5,971
27  27.    78717      ...                1081.05        #6,119
28  28.    78739      ...                 768.80        #7,006
29  29.    78734      ...                 698.96        #7,267
30  30.    78724      ...                 555.85        #7,870
31  31.    78726      ...                 543.24        #7,940
32  32.    78733      ...                 510.92        #8,116
33  33.    78754      ...                 484.73        #8,255
34  34.    78735      ...                 474.14        #8,318
35  35.    78732      ...                 416.13        #8,702
36  36.    78742      ...                 321.40        #9,467
37  37.    78730      ...                 257.86       #10,189
38  38.    78738      ...                 213.29       #10,829
39  39.    78747      ...                 194.02       #11,173
40  40.    78736      ...                 187.88       #11,301
41  41.    78737      ...                 143.90       #12,372
42  42.    78725      ...                 116.87       #13,282
43  43.    78719      ...                  93.88       #14,377

[43 rows x 7 columns]
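One caveat with the hardcoded tables[11]: if the page layout ever changes, that index shifts. read_html takes a match parameter that keeps only tables containing a given string, so you can select by content instead of position. A minimal sketch with inline stand-in HTML (the table contents here are illustrative, not the real page):

```python
from io import StringIO
import pandas as pd

html = """
<table><tr><td>sidebar layout junk</td></tr></table>
<table>
  <tr><td>#</td><td>Zip Code</td><td>Population</td></tr>
  <tr><td>1.</td><td>78705</td><td>31,340</td></tr>
  <tr><td>2.</td><td>78751</td><td>14,817</td></tr>
</table>
"""

# match= keeps only tables containing 'Zip Code', so the layout
# table above is skipped and we don't have to guess an index.
tables = pd.read_html(StringIO(html), match='Zip Code', header=0)
df = tables[0]
print(df)
```

On the real URL, pd.read_html(url, match='Zip Code') would narrow the list the same way.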

With BeautifulSoup


This is not the ideal way to do it, although this site is pretty straightforward with its table, tr, and td tags. What you would probably want is to first grab all the rows, then iterate through each row to grab its <td> tags. Instead you grabbed all the <td> tags in one big sweep, which is still OK, but we need to break that up into rows.
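The row-first approach described above can be sketched on a small inline table (the HTML below is a stand-in for the zipatlas page; report_data mirrors its cell class):

```python
import bs4

html = """
<table>
  <tr><td class="report_data">1.</td><td class="report_data">78705</td></tr>
  <tr><td class="report_data">2.</td><td class="report_data">78751</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Grab each <tr> first, then pull that row's <td> cells, so the
# cells stay grouped by row instead of ending up in one flat list.
rows = []
for tr in soup.find_all('tr'):
    cells = [td.text for td in tr.find_all('td', class_='report_data')]
    if cells:
        rows.append(cells)

print(rows)  # [['1.', '78705'], ['2.', '78751']]
```

With the rows grouped like this, pd.DataFrame(rows, columns=...) builds the frame in one call, no chunking needed.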


All I did then was break that flat list into groups of 7, since that's how many columns there are. Note that I'm making a big assumption that all the data is there; if it isn't, the rows and columns would be shifted.

import requests
import pandas as pd
import bs4


# Create a function called "chunks" with two arguments, l and n:
def chunks(l, n):
    # For item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # Create an index range for l of n items:
        yield l[i:i+n]



pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text

soup = bs4.BeautifulSoup(pop_source, 'html5lib')


source = soup.find_all('td',class_ = 'report_data')

pop = pd.DataFrame(columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])

row_data = [data.text for data in source]

rows_data = list(chunks(row_data, 7))

for ele in rows_data:
    temp_df = pd.DataFrame([ele], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])
    # DataFrame.append was removed in pandas 2.0; concat does the same job.
    pop = pd.concat([pop, temp_df], ignore_index=True)
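Once pop is built, getting back to the two columns the question asked for is a plain column selection; a small sketch with made-up stand-in data (the values below are illustrative, not scraped):

```python
import pandas as pd

# Stand-in for the pop dataframe built above.
pop = pd.DataFrame(
    [['1.', '78705', 'Austin, TX', 'Austin', '31,340', '11008.66', '#519']],
    columns=['#', 'Zip Code', 'Location', 'City',
             'Population', 'People/Sq.Mile', 'National Rank'])

result = pop[['Zip Code', 'Population']].copy()
# Strip the thousands separators so Population is numeric, not a string.
result['Population'] = pd.to_numeric(result['Population'].str.replace(',', ''))
print(result)
```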

