在循环中查找漂亮的汤返回TypeError [英] Find on beautiful soup in loop returns TypeError

查看:185
本文介绍了在循环中查找漂亮的汤返回TypeError的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Beautiful Soup在ajax页面上刮一张桌子,然后用表格形式用TextTable库打印出来。

I'm trying to scrape a table on an ajax page with Beautiful Soup and print it out in table form with the TextTable library.

import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

...

def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s

...

虽然我试图抓取的HTML的URL显示在代码中,这是一个示例:

Although the URL for the HTML in question I'm trying to scrape is shown in the code, here is an example of it:

<table cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
                        <th>Artist - Title</th>
            <th>Album</th>
            <th>Album Type</th>
            <th>Series</th>
            <th>Duration</th>
            <th>Type of Play</th>
            <th>
                <span title="...">Time to play</span>
            </th>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 1</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 1</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                5:43
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:00:00
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song2</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 2</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                6:16
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:05:43
            </td>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 3</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 3</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                4:13
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:11:59
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song 4</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 4</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                5:34
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:16:12
            </td>
                    </tr>
                <tr>
                        <td class="row1"><a href="..." class="songinfo">Song 5</a>

            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 5</a>
            </td>
            <td class="row1">...</td>
            <td class="row1"></td>
            <td class="row1" style="text-align: center">
                4:23
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:21:46
            </td>
                    </tr>
                <tr>
            <td style="height: 5px;">
        </td></tr>
        <tr>
            <td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
        </tr>
    </tbody>
</table>

每当我尝试运行此脚本函数时,它会以中止:TypeError:find ()在 header_append = th.find(text = True)行上没有关键字参数。我有点难过,因为我似乎正在做代码示例中显示的内容,它似乎应该可以工作,但事实并非如此。

Whenever I try to run this script function, it aborts with TypeError: find() takes no keyword arguments on the line header_append = th.find(text=True). I'm sort of stumped, as it seems that I'm doing what is shown in code examples and it seems it should work, yet it doesn't.

In简而言之,我如何修复代码,以便没有TypeError以及我做错了什么?

In short, how do I fix the code so that there is no TypeError and what am I doing wrong?

编辑:
我提到的文章和文档写剧本:

Articles and documentation that I referred to when writing the script:

  • http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/
  • http://oneau.wordpress.com/2010/05/30/simple-formatted-tables-in-python-with-texttable/

推荐答案

基本问题



解析器表现正常。您只是使用相同的表达式来解析不同类型的元素。

The Basic Issue

The parser is behaving correctly. You are just using the same expressions to parse different types of elements.

这是一个片段,仅关注返回已删除的列表。获得列表后,您可以轻松地格式化文本表:

Here is a snippet, focusing only on returning scraped lists. Once you have the lists, you can format the text table easily:

import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')

    header = stable.findAll('th')
    headers = [ th.text for th in header ]

    cells = [ ]
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Process the body of the table
        row = []
        td = tr.findAll('td')
        row.append( td[0].find('a').text )
        row.append( td[1].find('a').text )
        row.extend( [ td.text for td in td[2:] ] )
        cells.append( row )

    footer = rows[-1].find('td').text
    return headers, cells, footer



输出



headers cells ,和页脚,单元格现在可以输入 texttable 格式化函数:

Output

headers, cells, and footer, cells can now be fed into a texttable formatting function:

import texttable
def show_table(headers, cells, footer):
    retval = ''
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

print show_table(headers, cells, footer)



+----------+----------+----------+----------+----------+----------+----------+
| Artist - |  Album   |  Album   |  Series  | Duration | Type of  | Time to  |
|  Title   |          |   Type   |          |          |   Play   |   play   |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1   | Album 1  | ...      |          | 5:43     | S.A.M.   | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2    | Album 2  | ...      |          | 6:16     | S.A.M.   | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3   | Album 3  | ...      |          | 4:13     | S.A.M.   | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4   | Album 4  | ...      |          | 5:34     | S.A.M.   | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5   | Album 5  | ...      |          | 4:23     | S.A.M.   | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.

这篇关于在循环中查找漂亮的汤返回TypeError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆