Web Scraping with Python and newspaper3k lib does not return data

Problem Description

I have installed the newspaper3k lib on my Mac with sudo pip3 install newspaper3k. I'm using Python 3. I want to return the data supported by the Article object, that is the url, date, title, text, summary, and keywords, but I do not get any data:

import newspaper
from newspaper import Article

# build a news source for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)

# I have tried https://www.euronews.com/, https://edition.cnn.com/, and https://www.bbc.com/


for article in cnn_paper.articles:

    article_url = article.url  # works

    news_article = Article(article_url)  # works

    print("OBJECT:", news_article, '\n')  # works
    print("URL:", article_url, '\n')  # works
    print("DATE:", news_article.publish_date, '\n')  # does not work
    print("TITLE:", news_article.title, '\n')  # does not work
    print("TEXT:", news_article.text, '\n')  # does not work
    print("SUMMARY:", news_article.summary, '\n')  # does not work
    print("KEYWORDS:", news_article.keywords, '\n')  # does not work
    print()
    input()

I get the Article object and the URL, but everything else is ''. I have tried this on different websites, but the result is the same.

Then I tried adding:

    news_article.download()
    news_article.parse()
    news_article.nlp()
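
A minimal sketch of where these calls need to sit, assuming the same news_article loop as above: the Article fields are only populated after download() and parse() have run, and nlp() additionally requires nltk's punkt tokenizer to be installed:

    news_article = Article(article_url)
    news_article.download()  # fetch the HTML first
    news_article.parse()     # then populate title, text, publish_date, ...
    news_article.nlp()       # then populate summary and keywords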

I have also tried setting a Config with custom HEADERS and TIMEOUT values, but the results are the same.

When I do that, for each website I get only 16 articles with date, title, and body values. That is very strange to me: for every website I'm getting the same amount of data, but for more than 95% of the news articles I'm getting None.

Could Beautiful Soup help me here?

Can someone help me understand what the problem is, why I'm getting so many Null/NaN/'' values, and how I can fix that?

This is the documentation for the lib:

https://newspaper.readthedocs.io/en/latest/

Answer

I would recommend that you review the newspaper overview document that I published on GitHub. The document has multiple extraction examples and other techniques that might be useful.

Regarding your question...

Newspaper3K will parse certain websites nearly flawlessly. But there are plenty of websites that will require reviewing a page's navigational structure to determine how to parse the article elements correctly.

For instance, https://www.marketwatch.com has individual article elements, such as the title, publish date, and other items, stored within the meta tag section of the page.

The newspaper example below will parse the elements correctly. Note that you might need to do some data cleaning of the keyword or tag output.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.marketwatch.com'
article_urls = set()
marketwatch = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in marketwatch.articles:
    article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
    article.download()
    article.parse()
    if article.url not in article_urls:
        article_urls.add(article.url)

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'parsely-pub-date'}
        article_published_date = " ".join(str(x) for x in published_date)

        authors = sorted({value for (key, value) in article_meta_data.items() if key == 'parsely-author'})
        article_author = ', '.join(authors)

        title = {value for (key, value) in article_meta_data.items() if key == 'parsely-title'}
        article_title = " ".join(str(x) for x in title)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list)

        tags = ''.join({value for (key, value) in article_meta_data.items() if key == 'parsely-tags'})
        tag_list = sorted(tags.lower().split(','))
        article_tags = ', '.join(tag_list)

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
        print(article_text)
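
If you are not sure which meta keys a given site exposes, it can help to dump article.meta_data first and then adapt the set comprehensions above to the keys you find. A small inspection sketch, reusing article_meta_data from the loop above; the key names used for MarketWatch, such as parsely-pub-date, are site-specific:

    # Inspect the raw meta tag dictionary to discover which keys
    # (e.g. 'parsely-pub-date', 'description') a site exposes.
    for key, value in article_meta_data.items():
        print(key, '->', value)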

https://www.euronews.com is similar to https://www.marketwatch.com, except some of the article elements are located in the main body and other items are within the meta tag section.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.euronews.com'
article_urls = set()
euronews = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in euronews.articles:
    if sub_article.url not in article_urls:
        article_urls.add(sub_article.url)
        article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
        article.download()
        article.parse()

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'date.created'}
        article_published_date = " ".join(str(x) for x in published_date)

        article_title = article.title

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list).strip()

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
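
If a site does not expose description or keywords meta tags at all, a fallback is to let newspaper generate them itself. A minimal sketch: nlp() must run after download() and parse(), and it requires nltk's punkt tokenizer to be available:

    # Fallback: generate summary and keywords with newspaper's own NLP
    # instead of reading them from meta tags. Requires nltk's punkt
    # tokenizer (python3 -m nltk.downloader punkt).
    article.nlp()
    print("SUMMARY:", article.summary)
    print("KEYWORDS:", article.keywords)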
