Beautiful Soup Can't Find the First Tag (XML)

Question

I am using BeautifulSoup 4 (with the lxml parser) to parse an XML file from the MLB API. The API generates a scoreboard of the games for a particular day, and I'm having trouble getting Beautiful Soup to recognize a particular tag.

For instance, I am looking at today's games and trying to extract the scores and names for a certain team based on its away_file_code or home_file_code. For the Baltimore Orioles vs. Toronto Blue Jays game, the scoreboard XML looks like this:

<games year="2017" month="04" day="16" modified_date="2017-04-17T01:42:57Z" next_day_date="2017-04-17">
<game id="2017/04/16/balmlb-tormlb-1" venue="Rogers Centre" game_pk="490271" time="1:07" time_date="2017/04/16 1:07" time_date_aw_lg="2017/04/16 1:07" time_date_hm_lg="2017/04/16 1:07" time_zone="ET" ampm="PM" first_pitch_et="" away_time="1:07" away_time_zone="ET" away_ampm="PM" home_time="1:07" home_time_zone="ET" home_ampm="PM" game_type="R" tiebreaker_sw="N" resume_date="" original_date="2017/04/16" time_zone_aw_lg="-4" time_zone_hm_lg="-4" time_aw_lg="1:07" aw_lg_ampm="PM" tz_aw_lg_gen="ET" time_hm_lg="1:07" hm_lg_ampm="PM" tz_hm_lg_gen="ET" venue_id="14" scheduled_innings="9" description="" away_name_abbrev="BAL" home_name_abbrev="TOR" away_code="bal" away_file_code="bal" away_team_id="110" away_team_city="Baltimore" away_team_name="Orioles" away_division="E" away_league_id="103" away_sport_code="mlb" home_code="tor" home_file_code="tor" home_team_id="141" home_team_city="Toronto" home_team_name="Blue Jays" home_division="E" home_league_id="103" home_sport_code="mlb" day="SUN" gameday_sw="P" double_header_sw="N" game_nbr="1" tbd_flag="N" away_games_back="-" home_games_back="6.5" away_games_back_wildcard="" home_games_back_wildcard="5.5" venue_w_chan_loc="CAXX0504" location="Toronto, Canada" gameday="2017_04_16_balmlb_tormlb_1" away_win="8" away_loss="3" home_win="2" home_loss="10" game_data_directory="/components/game/mlb/year_2017/month_04/day_16/gid_2017_04_16_balmlb_tormlb_1" league="AA">
<status status="Final" ind="F" reason="" inning="9" top_inning="N" b="0" s="0" o="3" inning_state="" note="" is_perfect_game="N" is_no_hitter="N"/>
<linescore>...</linescore>
<home_runs>...</home_runs>
<winning_pitcher id="605164" last="Bundy" first="Dylan" name_display_roster="Bundy" number="37" era="1.86" wins="2" losses="1"/>
<losing_pitcher id="457918" last="Happ" first="J.A." name_display_roster="Happ" number="33" era="4.50" wins="0" losses="3"/>
<save_pitcher id="" last="" first="" number="" name_display_roster="" era="0" wins="0" losses="0" saves="0" svo="0"/>
<links mlbtv="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'video'})" wrapup="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=wrap&c_id=mlb" home_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" away_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" home_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" away_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" tv_station="SNET-1"/>
<broadcast>...</broadcast>
<alerts text="Final score in Toronto: Baltimore 11, Toronto 4" brief_text="At TOR: Final - BAL 11, TOR 4" type="status"/>
<game_media>...</game_media>
<video_thumbnail>...</video_thumbnail>
<video_thumbnails>...</video_thumbnails>
</game>
<game>...</game> (etc...)

Below is the snippet I am using to try to find the game (not games) tag and its attributes. The issue is that when I request game, it returns None. However, I can retrieve any other tag without a problem; status, for example, works perfectly fine.

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
# Supposed to find the <game> tags whose home_file_code matches the home team's abbreviation
tags = soup.findAll('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status'])  # works without an issue
    print(games.find('game')['home_file_code'])  # throws the error below, because games.find('game') is None

TypeError: 'NoneType' object is not subscriptable

Also, if I print the list of children (print(list(games.children))), it returns everything except game.

Is there something I'm missing about the XML that explains why it can't grab that first tag? I'm pretty confused, because this was working for me not too long ago and I'm not sure what I changed to cause the error.

Recommended answer

I'm not the greatest of programmers, but I'm pretty sure you're not finding the first tag because it is incorrectly defined. XML tags, if they contain anything, must have an opening and a closing part, like this: <games>year="2017" month="04" day="16"</games>, and not like this: <games year="2017" month="04" day="16">. So the first thing you need to do is fix your XML formatting, and then take it from there.
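
A note on the diagnosis: attributes written inside a start tag, as in <games year="2017" month="04" day="16">, are well-formed XML, so the markup itself probably does not need to change. A more likely cause, offered as an assumption and not a confirmed diagnosis, is that each item returned by findAll('game', ...) is already the <game> tag. Calling find('game') on it searches only its descendants, which would return None, and .children lists only the nested elements, which matches what the question reports. The sketch below reads the attributes directly from the tag. SAMPLE_XML is a trimmed, hypothetical stand-in for the real feed (the second game entry is invented), and home_game_summaries is an illustrative helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the scoreboard XML shown in the question.
# The second <game> entry is invented purely for illustration.
SAMPLE_XML = """<games year="2017" month="04" day="16">
  <game id="2017/04/16/balmlb-tormlb-1" away_file_code="bal" home_file_code="tor">
    <status status="Final" ind="F" inning="9"/>
  </game>
  <game id="2017/04/16/example-2" away_file_code="nyy" home_file_code="bos">
    <status status="Final" ind="F" inning="9"/>
  </game>
</games>"""


def home_game_summaries(xml_text, home_code):
    """Return (home_file_code, away_file_code, status) for each matching <game>."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser requires lxml
    summaries = []
    for game in soup.find_all("game", {"home_file_code": home_code}):
        # `game` is already the <game> tag, so its attributes are read directly.
        # game.find("game") would search only its descendants and return None.
        status = game.find("status")["status"]
        summaries.append((game["home_file_code"], game["away_file_code"], status))
    return summaries


print(home_game_summaries(SAMPLE_XML, "tor"))  # [('tor', 'bal', 'Final')]
```

In the question's loop, the equivalent change would be to replace games.find('game')['home_file_code'] with games['home_file_code'].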
