Web scraping a table to a list


Question

I'm trying to extract a table from a webpage. I have managed to get all the data in the table into a list. However, all the table data is being put into one list element. I need help getting the 'clean' data (i.e. the strings, without all the HTML packaging) from the rows of the table into their own list elements.

So, instead of...

list  = [<tr>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"/></a>
         </th>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
         </th>
         <td>58
         </td>
         <td>12
         </td>
         <td>32]

I would like...

list  = ['href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"',
         'href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS',
         '58',
         '12',
         '32']

My code and list can be replicated using the following.

#Import Modules
import re
import requests
from bs4 import BeautifulSoup

#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
cartridge_page = requests.get(cartridge_url)
cartridge_soup = BeautifulSoup(cartridge_page.content, 'html.parser')

#This gets the rows of the table I want
list = cartridge_soup.find_all(lambda t: t.name =='tr')

#This gets rid of an element which is not useful
list = [n for n in list if 'class="va-navbox' not in str(n)]

#I had hoped this might assemble a list..  
list = [str(n) for n in list]


I'm learning Python and I think I grasp HTML, but I cannot get Python to interact with my bs4.element.ResultSet. I know this is not a sophisticated solution, but I have hit a brick wall after trying a number of different approaches. My 'true' end goal is a list like the following...

list  = ['7.62x25mm_TT_AKBS',
         '58',
         '12',
         '32']


Attempting to implement suggested solutions:

---> As per AzyCrw4282

Incredible username, by the way.

(i)

I [think I] can see roughly what I'm supposed to do, but I'm failing to implement it properly.

Using...

cartridge_table = cartridge_soup.find_all('table')

I get what looks to be all the right data in HTML format stored inside cartridge_table. However, running...

for row in cartridge_table.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...returns...

ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

... and replacing find_all with find doesn't remedy the issue.
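A note on the error in (i): find_all() returns a ResultSet, which behaves like a list of elements, so find_all() or find() has to be called on one of its items rather than on the ResultSet itself. Swapping the second call for find() would hit the same limitation, which may be why that change did not help. A minimal sketch, using a made-up HTML snippet in place of the real page (the snippet and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the wiki page: a header row plus one data row.
html = """
<table class="wikitable sortable">
  <tr><th>Icon</th><th>Name</th><th>Stat 1</th></tr>
  <tr><th></th><th>7.62x25mm TT AKBS</th><td>58</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")   # ResultSet: behaves like a list of Tag objects
first_table = tables[0]           # a single Tag, which does have find_all()
same_table = soup.find("table")   # find() returns the first match directly

rows = first_table.find_all("tr")
print(len(tables), len(rows))
```

Either indexing into the ResultSet or using find() for the first lookup gives a single element that the row loop can work on.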

(ii)

I absent-mindedly ran...

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...but this returns an empty list.
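A likely reason for the empty list in (ii): the [:1] slice keeps only the first row, and if that row is a header row made of th cells, find_all("td") has nothing to match. The sketch below shows the difference on a made-up snippet (the HTML and the "Stat" column names are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: header cells are <th>, data cells mix <th> and <td>.
html = """
<table>
  <tr><th>Icon</th><th>Name</th><th>Stat 1</th></tr>
  <tr><th></th><th>7.62x25mm TT AKBS</th><td>58</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
header_row, data_row = soup.find_all("tr")

# Only <td> cells: the header row has none, so this list is empty.
td_only = [c.get_text(strip=True) for c in header_row.find_all("td")]

# Passing a list of names matches both kinds of cell.
header_cells = [c.get_text(strip=True) for c in header_row.find_all(["th", "td"])]
data_cells = [c.get_text(strip=True) for c in data_row.find_all(["th", "td"])]

print(td_only, header_cells, data_cells)
```

Dropping the [:1] slice, or matching both th and td, should give non-empty rows on a table laid out like this.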

(iii)

The question you originally linked to defines a variable called header before filling the table variable with the necessary data...

header = soup.find("b", text="Payable")
table = header.find_parent("table")

I'm not grasping what to replace "Payable" with to get this to work for me.

(iv)

I tried to sidestep the problem in (iii) by giving this a stab...

cartridge_table = cartridge_soup.find_parent("table")

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

But it returns an empty list. When I checked, it's because nothing gets stored under the cartridge_table variable.
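On (iv): find_parent() searches upward from the element it is called on, and the top-level soup object has no parent, so the call returns None. It needs to start from something inside the table. A small sketch on a made-up snippet (the HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell table standing in for the real page.
html = "<div><table><tr><td>58</td></tr></table></div>"
soup = BeautifulSoup(html, "html.parser")

# The soup object is the root of the tree, so there is nothing above it.
from_root = soup.find_parent("table")

# Starting from a cell inside the table, the search can climb to the table.
cell = soup.find("td")
from_cell = cell.find_parent("table")

print(from_root, from_cell.name)
```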

(v)

I tried running...

header = cartridge_soup.find("b", text="Payable")

... and replacing "Payable" with a variety of seemingly sensible alternatives to see what would happen, but I got nowhere. Ultimately, the header variable always seemed to remain empty.

Examples: "Icon", "Name", "Fragmentation Chance", "wikitable sortable", "7.62x25mm TT LRN", "7.62x25mm_TT_AKBS".
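On (iii) and (v): find("b", text="Payable") looks for a b tag whose text is "Payable", which fits the linked question's page but not necessarily this one. Two alternatives are sketched below on a made-up snippet: matching a header cell by partial text (an exact string match can fail if the cell text carries surrounding whitespace) and selecting the table by class. The HTML, including the "other" class, is an assumption for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in: two tables, the second being the one of interest.
html = """
<table class="other"><tr><th>Navigation</th></tr></table>
<table class="wikitable sortable">
<tr><th>Icon
</th><th>Name
</th></tr>
<tr><th></th><th>7.62x25mm TT AKBS
</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Route 1: find a header cell by partial text, then climb to its table.
header = soup.find("th", string=re.compile("Name"))
table_from_header = header.find_parent("table")

# Route 2: select the table directly by one of its CSS classes.
table_by_class = soup.find("table", class_="wikitable")

print(table_from_header is table_by_class)
```

Note that "wikitable sortable" is a pair of classes, so it is matched here through class_ rather than through the cell text.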

Recommended answer

I have played around with the problem, but there seems to be something wrong with the table on the page, or at least that's what I think. Extracting the table should yield n elements for its n rows, but for some reason it returns all of the rows as a single element in the array. I did look into it but didn't get far (and I am also short of time).

Given that you are only interested in the cells in the first rows, you could do this easily by targeting those elements with XPath. This would locate the elements and yield the values you require. XPath, however, doesn't work with BeautifulSoup.
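To give a rough idea of what the XPath route looks like, here is a sketch that uses only the standard library's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed markup. Real pages are rarely well-formed XML, so a dedicated HTML parser with full XPath support (lxml is a common choice) would be the more realistic tool; the snippet, its "Stat" column names and the paths are assumptions for illustration, not the real page structure:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one table on the page.
markup = """
<table>
  <tr><th>Icon</th><th>Name</th><th>Stat 1</th><th>Stat 2</th><th>Stat 3</th></tr>
  <tr>
    <th><a href="/7.62x25mm_TT_AKBS"><img alt="TTAKBS.png"/></a></th>
    <th><a href="/7.62x25mm_TT_AKBS">7.62x25mm TT AKBS</a></th>
    <td>58
    </td>
    <td>12
    </td>
    <td>32
    </td>
  </tr>
</table>
"""
root = ET.fromstring(markup.strip())

# Second row, second header cell, link text: the cartridge name.
name = root.find("./tr[2]/th[2]/a").text

# Second row, every data cell, with the trailing newline stripped.
values = [td.text.strip() for td in root.findall("./tr[2]/td")]

result = [name.replace(" ", "_")] + values
print(result)
```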

To solve this problem, I ended up using a hardcoded approach to select the required elements in the array. This targets the first extraction of the name column, followed by the other columns.

Code

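import urllib.request
from bs4 import BeautifulSoup

# Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
page = urllib.request.urlopen(cartridge_url)
# No parser is named here, so bs4 picks the best one installed
cartridge_soup = BeautifulSoup(page.read())
tables = cartridge_soup.findChildren('table')
my_table = tables[0]

# Collect the nested table, th and tr elements of the first table
cartridge_table = my_table.findChildren(['table', 'th', 'tr'])
dataArray = []
# Hardcoded: element 13 contains the AKBS row; characters 45-62 of the
# text before the first </a> hold the cartridge name
dataArray.append(str(cartridge_table[13]).split('</a>')[0][45:62].replace(" ", "_"))
# Split the same element on </td> to reach the numeric cells
splitChar = str(cartridge_table[13]).split("</td>")

# The two characters before each trailing newline are the cell value
for data in splitChar[:3]:
    dataArray.append(data[-3:-1])

print(dataArray)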

Gives

['7.62x25mm_TT_AKBS', '58', '12', '32']

Let me know if it solves your problem or if it needs adapting for other use cases.
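The hardcoded indices above depend on the exact page layout at the time of writing. For comparison, here is a less brittle sketch that walks the rows and keeps the text of each non-empty cell. It assumes rows are laid out like the sample in the question; the HTML snippet, the "Stat" column names and the helper row_to_list are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in modelled on the row quoted in the question.
html = """
<table class="wikitable sortable">
<tr><th>Icon</th><th>Name</th><th>Stat 1</th><th>Stat 2</th><th>Stat 3</th></tr>
<tr>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png"/></a>
</th>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
</th>
<td>58
</td>
<td>12
</td>
<td>32
</td>
</tr>
</table>
"""


def row_to_list(row):
    """Return the text of every non-empty th/td cell in one table row."""
    texts = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    return [text for text in texts if text]  # drops the image-only icon cell


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = [row_to_list(tr) for tr in table.find_all("tr")[1:]]  # skip header row
for row in rows:
    row[0] = row[0].replace(" ", "_")  # match the underscore style of the goal

print(rows[0])
```

Each data row becomes its own list, so the same loop should extend to tables with more rows without changing any indices.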
