get title inside link tag in HTML using beautifulsoup


Question


I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and get the output I want, but the problem is that the text is truncated: I get Business Support an... and Reserve Bank of Aus... instead of the complete text, and I want the whole text, not "..." everywhere. I replaced lines 9 and 10 of the answer by jezrael (see Fetching content from html and write fetched content in a specific format in CSV) with org = soup.find_all('a', {'class':'nav-item active'})[0].get('title') and groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title'), but running that separately gives the error list index out of range. What should I use to extract the complete text? I also tried org = soup.find_all('span', class_="filtered pill"); run on its own it returned a string, but it would not run as part of the whole code.

Answer


All the labels with longer text are stored in the title attribute; the shorter ones only appear in the element text. So add a double if that falls back from the attribute to the text.
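The attempt from the question, soup.find_all('a', {'class':'nav-item active'})[0], raises list index out of range because the nav-item active classes sit on the <li> elements, not on the nested <a>, so that call returns an empty list. Here is a minimal sketch of the selector and the title/text fallback on an assumed, simplified fragment of the facet markup (for illustration only; the real page may differ in detail):

from bs4 import BeautifulSoup

# assumed, simplified facet markup from the dataset page (illustration only)
html = '''
<li class="nav-item active">
  <a href="/dataset?organization=reservebankofaustralia" title="Reserve Bank of Australia">
    <span class="filtered pill">Reserve Bank of Aus...</span>
  </a>
</li>
'''
soup = BeautifulSoup(html, "lxml")

print(soup.find_all('a', {'class': 'nav-item active'}))  # [] -> indexing this empty list raises IndexError

li = soup.find_all('li', class_="nav-item active")[0]
label = li.a.get('title')        # full label when the visible text is truncated
if label == '':                  # short labels carry an empty title, so fall back to the span text
    label = li.span.get_text()
print(label)                     # Reserve Bank of Australia

The full loop then becomes: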

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# webpage_urls (the list of result pages to scrape) is assumed to be defined earlier, as in the question's code
dfs = []

for wiki2 in webpage_urls:
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}
    # always only 2 active li elements: [0] is the organisation facet, [1] is the group facet
    l = soup.find_all('li', class_="nav-item active")

    # long labels live in the title attribute; if it is empty, fall back to the span text
    org = l[0].a.get('title')
    if org == '':
        org = l[0].span.get_text()

    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()

    # one entry per dataset heading on the page
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}

    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups

    # build one DataFrame per page and collect it
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)


# combine all pages and drop duplicate titles
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)

# strip the "(n)" dataset count from the facet labels
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)

print(df1.head())

                                              Titles  \
0                                     Banks – Assets   
1  Consolidated Exposures – Immediate and Ultimat...   
2  Foreign Exchange Transactions and Holdings of ...   
3  Finance Companies and General Financiers – Sel...   
4                   Liabilities and Assets – Monthly   

                                                link  \
0           https://data.gov.au/dataset/banks-assets   
1  https://data.gov.au/dataset/consolidated-expos...   
2  https://data.gov.au/dataset/foreign-exchange-t...   
3  https://data.gov.au/dataset/finance-companies-...   
4  https://data.gov.au/dataset/liabilities-and-as...   

                Organisation                            Group  
0  Reserve Bank of Australia  Business Support and Regulation  
1  Reserve Bank of Australia  Business Support and Regulation  
2  Reserve Bank of Australia  Business Support and Regulation  
3  Reserve Bank of Australia  Business Support and Regulation  
4  Reserve Bank of Australia  Business Support and Regulation  
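
Since the linked question ultimately writes the scraped content to CSV, the cleaned frame can be exported directly; the file name below is only an example:

# write the deduplicated, cleaned result to CSV (example file name)
df1.to_csv('datasets.csv', index=False)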
