Trying to use Beautiful Soup (Python) to find 2 partial matches in an attribute's value

Question

(This is a follow-up question to a previous post, which user http://stackoverflow.com/users/771848/alecxe helped me with. Makes more sense to post this follow-up as an independent question though, so it is more searchable for others.)

I have a python script using Beautiful Soup to locate some web reports on a hosting service.

Right now the script is pretty exacting. I would like to make it a bit more flexible. I feel like reg-ex is what I need, but maybe some nested searches would work too. I'm open to suggestion.

My current code works like:

def search_table_for_report(table, report_name, report_type):
    # Search the rows of the table for the given report name,
    # then grab the download URL for the given type.
    # The [1:] slice skips the first row, i.e. the headers.
    for row in table.find_all('tr')[1:]:
        col = row.find_all('td')

        if report_name in col[0].string:
            print("----- parse out file type request url")
            report_type = report_type.upper()
            # this works, using exact match
            label = row.find("input", {"aria-label": "Select " + report_name + " I format " + report_type})
            # this doesn't work, using reg-ex
            #label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})

            print("----- okay found the right checkbox, now grab the href link ----")
            link_url = label.find_next_sibling("a", href=True)["href"]
            return link_url

Which would search through a table like this:

<tr class="odd">
 <td header="c1">
  Report Download
 </td>
 <td header="c2">
  <input aria-label="Select Report I format PDF" id="documentChkBx0" name="documentChkBx" type="checkbox" value="5446"/>
  <a href="/a/document.html?key=5446">
   <img alt="Portable Document Format" src="/img/icons/icon_PDF.gif">
   </img>
  </a>
  <input aria-label="Select Report I format XLS" id="documentChkBx1" name="documentChkBx" type="checkbox" value="5447"/>
  <a href="/a/document.html?key=5447">
   <img alt="Excel Spreadsheet Format" src="/img/icons/icon_XLS.gif">
   </img>
  </a>
 </td>
 <td header="c4">
  04/27/2015
 </td>
 <td header="c5">
  05/26/2015
 </td>
 <td header="c6">
  05/26/2015 10:00AM EDT
 </td>
</tr>

I'd like to search the "aria-label" value for two values, or two partial matches within it. Essentially, sometimes instead of finding "Select Report format XLS", I may need to find "Select Matrix format PDF". Pretty sure the "select" and "format" bit will always be there but can't be sure, so just need to make the 2nd word and final extension type be partial match searches. The partial bit (instead of exact) is important because sometimes the "report" word may have trailing words I don't expect, like "Select Report II format XLS", etc, which would fail if it was an exact search for "Select Report format XLS".

So I need code (regex presumably) that will search for a given name (in place of Report) and a given type (in place of XLS). This is what I've tried, but it's not working. I think the reg-ex syntax is good, but I think I'm jamming the re.compile in the wrong spot, using it in a way that Beautiful Soup does not expect.

label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")}) 

Hope I explained that well. Happy to clarify any confusion.

Recommended answer

I figured out the issue. My BS4 search technique was fine; the regex pattern just needed to be a bit smarter. It works great using the code below. I'm not sure how to make this search case-insensitive, but for now it's okay.

import re

# build the pattern to search on,
# where report_name and report_type are strings passed into the function
regex_criteria = r'.*' + report_name + r'.*' + report_type

# search the value of the "aria-label" attribute
# across all the inputs in the current row
target_input = row.find("input", {"aria-label": re.compile(regex_criteria)})
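
On the open point about case sensitivity, here is a minimal sketch of one way to build the pattern, not necessarily what the author ended up using. The helper name `build_label_pattern` and the sample `labels` list are made up for illustration; only the standard `re` module is used. It also shows the likely reason the attempt in the question never matched: in a plain string literal `"\b"` is a backspace character, while in a raw string `r"\b"` is the regex word boundary.

```python
import re

# In a plain string literal "\b" is a single backspace character (0x08);
# only the raw string r"\b" reaches the regex engine as a word boundary.
assert "\b" == "\x08"
assert len(r"\b") == 2


def build_label_pattern(report_name, report_type):
    """Hypothetical helper: name first, then type, matched case-insensitively."""
    # re.escape keeps characters such as "." or "(" in the inputs literal.
    return re.compile(
        r"\b" + re.escape(report_name) + r".*\b" + re.escape(report_type),
        re.IGNORECASE,
    )


# aria-label values in the style of the table from the question
labels = [
    "Select Report I format PDF",
    "Select Report I format XLS",
    "Select Matrix format PDF",
]

pattern = build_label_pattern("report", "xls")
matches = [label for label in labels if pattern.search(label)]
print(matches)
```

A pattern compiled this way could be passed to `row.find("input", {"aria-label": pattern})` in place of `re.compile(regex_criteria)`. Beautiful Soup filters attribute values with the pattern's `search()` method, so a leading `.*` should not be needed.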
