BeautifulSoup table to dataframe
Problem description
I can't seem to copy the values from the table into a dataframe correctly. If you run row_data, it outputs a flat list of all the values. Any idea how to make it structured?
pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text
soup = BeautifulSoup(pop_source, 'html5lib')
source = soup.find_all('td',class_ = 'report_data')
pop = pd.DataFrame(columns=['Zip Code','Population'])
row_data = [data.text for data in source]
temp_df = pd.DataFrame([row_data], columns=['#','Zip Code','Location','City', 'Population','People/Sq.Mile','National Rank'])
temp_df = temp_df[['Zip Code','Population']]
pop = pop.append(temp_df).reset_index(drop=True)
pop
Recommended answer
A few issues with your code first:
The code you have above doesn't actually loop through anything. In fact, if you try to run it you should get tons of errors, so there's no way it produces the output of just looping over the same values with what you provided.
The indentation is off: you reference a variable source which hasn't been defined, and a variable cols which is also not defined.
You initialize a dataframe pop with 2 columns, then try to append a dataframe with 7 columns.
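To see why that column mismatch is a problem, here is a minimal sketch (the values are made up) of what concatenating mismatched frames does in current pandas, which no longer has DataFrame.append:

```python
import pandas as pd

# A 2-column frame like the asker's `pop`, plus one 7-column row (made-up values).
pop = pd.DataFrame(columns=['Zip Code', 'Population'])
row = pd.DataFrame([['1.', '78705', 'Austin', 'Austin', '26972', '11008.66', '#519']],
                   columns=['#', 'Zip Code', 'Location', 'City',
                            'Population', 'People/Sq.Mile', 'National Rank'])

# pd.concat unions the two column sets and fills any gaps with NaN,
# so the result silently grows to 7 columns instead of staying at 2.
combined = pd.concat([pop, row], ignore_index=True)
print(combined.shape)   # (1, 7)
```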
All sorts of issues going on here.
Have you considered just using straight-up pandas to form the dataframe? You could still use BeautifulSoup, but pandas can do the work for you, and it seems like you need a little more practice iterating through elements with BeautifulSoup (you actually never even used .find or .find_all to locate the tags associated with the table you want).
If you NEED to do it the BeautifulSoup way, just let me know and I can offer that as well, but honestly it's a lot more work than just using .read_html() with pandas here.
import pandas as pd

url = 'http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm'
tables = pd.read_html(url)        # parses every <table> on the page into a list of DataFrames
df = tables[11][1:]               # table 11 holds the zip-code data; drop its header row
df.columns = tables[11].iloc[0]   # promote that first row to column names
Output:
print (df)
0 # Zip Code ... People / Sq. Mile National Rank
1 1. 78705 ... 11008.66 #519
2 2. 78751 ... 5822.28 #1,374
3 3. 78752 ... 5435.92 #1,528
4 4. 78741 ... 5346.47 #1,562
5 5. 78723 ... 5175.95 #1,640
6 6. 78704 ... 5001.96 #1,713
7 7. 78758 ... 4954.80 #1,730
8 8. 78702 ... 4501.98 #2,015
9 9. 78757 ... 4380.92 #2,087
10 10. 78756 ... 4298.80 #2,139
11 11. 78745 ... 4063.22 #2,295
12 12. 78753 ... 3973.96 #2,350
13 13. 78703 ... 3491.54 #2,753
14 14. 78731 ... 3031.63 #3,167
15 15. 78759 ... 2998.68 #3,199
16 16. 78727 ... 2856.67 #3,371
17 17. 78749 ... 2795.02 #3,438
18 18. 78728 ... 2640.31 #3,614
19 19. 78721 ... 2568.43 #3,690
20 20. 78722 ... 2567.53 #3,692
21 21. 78729 ... 2366.94 #3,944
22 22. 78701 ... 2326.65 #3,995
23 23. 78748 ... 1961.73 #4,504
24 24. 78750 ... 1731.01 #4,870
25 25. 78744 ... 1464.78 #5,311
26 26. 78746 ... 1152.39 #5,971
27 27. 78717 ... 1081.05 #6,119
28 28. 78739 ... 768.80 #7,006
29 29. 78734 ... 698.96 #7,267
30 30. 78724 ... 555.85 #7,870
31 31. 78726 ... 543.24 #7,940
32 32. 78733 ... 510.92 #8,116
33 33. 78754 ... 484.73 #8,255
34 34. 78735 ... 474.14 #8,318
35 35. 78732 ... 416.13 #8,702
36 36. 78742 ... 321.40 #9,467
37 37. 78730 ... 257.86 #10,189
38 38. 78738 ... 213.29 #10,829
39 39. 78747 ... 194.02 #11,173
40 40. 78736 ... 187.88 #11,301
41 41. 78737 ... 143.90 #12,372
42 42. 78725 ... 116.87 #13,282
43 43. 78719 ... 93.88 #14,377
[43 rows x 7 columns]
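Since the original goal was a frame with just Zip Code and Population, those two columns can simply be selected from df. A self-contained sketch, using two made-up rows in place of the frame read from the page:

```python
import pandas as pd

# Stand-in for the df built above with read_html (made-up sample values).
df = pd.DataFrame({'#': ['1.', '2.'],
                   'Zip Code': ['78705', '78751'],
                   'Population': ['26972', '18226']})

# Keep only the two columns the question asked for.
pop = df[['Zip Code', 'Population']].reset_index(drop=True)
print(pop)
```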
with BeautifulSoup
This is not the ideal way to do it, although this site is pretty straightforward with its table, tr, and td tags. What you would probably want is to first grab all the rows, then iterate through each row to grab its <td> tags. Instead, you grabbed all the <td> tags in one big swoop. That's still okay, but we need to break them up into rows.
All I did then was break that flat list into groups of 7, since that's how many columns there are. Note that I'm making a huge assumption that all the data is present; if any cell were missing, the rows and columns would be shifted.
import requests
import pandas as pd
import bs4

# Yield successive n-sized chunks from list l:
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

pop_source = requests.get("http://zipatlas.com/us/tx/austin/zip-code-comparison/population-density.htm").text
soup = bs4.BeautifulSoup(pop_source, 'html5lib')

# Every data cell in the table, as one flat list of strings
source = soup.find_all('td', class_='report_data')
row_data = [data.text for data in source]

# Regroup the flat list into rows of 7 cells each
rows_data = list(chunks(row_data, 7))

# Build the frame in one go (DataFrame.append was removed in pandas 2.0)
pop = pd.DataFrame(rows_data, columns=['#', 'Zip Code', 'Location', 'City',
                                       'Population', 'People/Sq.Mile', 'National Rank'])
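The chunks helper on its own, applied to a toy flat list, shows the regrouping step; the last group is simply shorter if the list length isn't a multiple of n:

```python
# Yield successive n-sized slices of l (same helper as above).
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

flat = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(list(chunks(flat, 3)))   # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h']]
```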