YouTube Data API to crawl all comments and replies

Problem description

I have been desperately seeking a solution to crawl all comments and the corresponding replies for my research. I am having a very hard time creating a data frame that includes the comment data in the correct, corresponding order.

I will share my code here so you professionals can take a look and give me some insights.

import pandas as pd

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            # Fields of the top-level comment in this thread
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    # Note: this dict is rebound below before it is ever
                    # written out, so reply rows never reach the CSV
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment': [comment], 'Date': [comment2], 'ID': [comment3], 'Likes': [comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a', header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

When I do this, my crawler collects comments, but it doesn't collect some of the replies under certain comments.

How can I make it collect the comments and their corresponding replies and put them in a single data frame?

So, I somehow managed to pull the information I wanted into the output section of my Jupyter Notebook. All I have to do now is append the results to a data frame.

Here is my updated code:

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            # Fields of the top-level comment in this thread
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                # Print whichever replies the API returned inline with the thread
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)

            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)

            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

The result looks like this:

As you can see, the comments grouped under the ============ lines are a comment and its corresponding replies underneath.

What would be a good way to append these results to a data frame?
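For a quick test, one minimal sketch of that idea (the column names and the ParentAuthor link are my own choices, not part of the original code) is to collect every printed record as a dict in a list and build a single data frame once at the end:

import pandas as pd

rows = []

# Call where the updated code prints a top-level comment:
def add_comment_row(author, time, text, likes):
    rows.append({'Author': author, 'Time': time, 'Text': text,
                 'Likes': likes, 'ParentAuthor': None})

# Call where it prints a reply; ParentAuthor links the reply to its thread:
def add_reply_row(author, time, text, likes, parent_author):
    rows.append({'Author': author, 'Time': time, 'Text': text,
                 'Likes': likes, 'ParentAuthor': parent_author})

# After the crawl, a single data frame holds comments and replies in order
df = pd.DataFrame(rows, columns=['Author', 'Time', 'Text', 'Likes', 'ParentAuthor'])
df.to_csv('youtube.csv', index=False)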

Answer

According to the official documentation (the CommentThreads resource representation), the property replies.comments[] of the CommentThreads resource has the following specification:

replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.

The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.

Consequently, if you want to obtain all reply entries associated with a given top-level comment, you will have to query the Comments.list API endpoint appropriately.
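Concretely, in the Python client that is a comments().list call keyed on the thread's comment ID; a bare sketch of the request shape follows (the paginated version is the get_comment_replies helper further below, and service and comment_id are assumed to be defined):

# 'service' is an authorized API client, 'comment_id' a top-level comment's ID
request = service.comments().list(
    parentId = comment_id,
    part = 'id,snippet',
    maxResults = 100
)
response = request.execute()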

I recommend that you read my answer to a closely related question; it has three sections:

  • Top-level comments and associated replies,
  • The property nextPageToken and the parameter pageToken (see the sketch after this list), and
  • API limitations imposed by design.
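To make the second of those points concrete, here is a minimal manual-pagination sketch, assuming service is an authorized YouTube Data API v3 client; it does by hand what the list_next convenience method used further below does for you:

def iterate_comment_threads(service, video_id):
    # Follow nextPageToken from page to page until it disappears
    kwargs = {
        'videoId': video_id,
        'part': 'id,snippet,replies',
        'maxResults': 100,
    }
    while True:
        response = service.commentThreads().list(**kwargs).execute()
        for item in response['items']:
            yield item
        token = response.get('nextPageToken')
        if token is None:
            break
        kwargs['pageToken'] = token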

From the get-go, you'll have to acknowledge that the API (as currently implemented) does not allow you to obtain all top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.

As far as a Python implementation is concerned, I would suggest structuring the code as follows:

def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' whose
            # 'replies.comments' is an array of 'Comments Resource' items

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

Note that the ellipsis dots above -- ... -- must be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.

The simplest way (useful for quick testing) would be to replace ... with comments.append(comment) and then have the caller of get_video_comments simply pretty-print (using json.dump) the object obtained from that function.
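Such a quick-test harness could look like the sketch below; the developerKey argument and the placeholder values are assumptions for illustration, and json.dumps stands in for json.dump since the output goes to the console rather than a file (build comes from the google-api-python-client package):

import json
from googleapiclient.discovery import build

# Build the Data API v3 client; the key and video ID are placeholders
service = build('youtube', 'v3', developerKey='YOUR_API_KEY')

# With the ellipsis replaced by 'comments.append(comment)', every element
# is a full CommentThreads resource with its complete list of replies
comments = get_video_comments(service, 'VIDEO_ID')

print(json.dumps(comments, indent=2))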
