How to pull headings from a Google document using the API

Question

I'm currently trying to create a Python script that checks a Google document for various on-page SEO metrics.

The Google Docs API has a good sample showing how to extract all the text from a Google document. However, it returns only plain text with no formatting.

To perform my checks I need to split out the H1, H2-H4, bold text and so on, but after two hours of experimenting and searching the API docs and the web, I can't figure out how to edit the following loop to get (for example) all the HEADING_2 elements.

    text = ''
    for value in elements:
        if 'paragraph' in value:
            elements = value.get('paragraph').get('elements')
            for elem in elements:
                text += read_paragraph_element(elem)
        elif 'table' in value:
            # The text in table cells are in nested Structural Elements and tables may be
            # nested.
            table = value.get('table')
            for row in table.get('tableRows'):
                cells = row.get('tableCells')
                for cell in cells:
                    text += read_strucutural_elements(cell.get('content'))
        elif 'tableOfContents' in value:
            # The text in the TOC is also in a Structural Element.
            toc = value.get('tableOfContents')
            text += read_strucutural_elements(toc.get('content'))
    return text

Any help appreciated. Thanks.

Recommended answer

I believe your goal and current situation are as follows.

  • You want to retrieve the text of paragraphs whose paragraph style is HEADING_2.
  • You want to achieve this using googleapis for Python.
  • You want to achieve this using the script in your question.
  • You have already been able to get values from the Google document using the Docs API.

In this case, the text needs to be retrieved when the value of namedStyleType is HEADING_2.
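To make it clearer where namedStyleType sits, here is a minimal sketch of one structural element and a lookup that tolerates missing keys. The sample dictionary and the helper name named_style are hypothetical and only mimic the shape of an entry in the document body returned by the Docs API.

```python
# Hypothetical sample shaped like one entry of the document body content.
sample_element = {
    'paragraph': {
        'paragraphStyle': {'namedStyleType': 'HEADING_2'},
        'elements': [
            {'textRun': {'content': 'Keyword research\n', 'textStyle': {}}},
        ],
    }
}


def named_style(element):
    """Return the paragraph's namedStyleType, or None if it is not available."""
    paragraph = element.get('paragraph')
    if paragraph is None:
        # Tables, section breaks and so on have no paragraph style here.
        return None
    return paragraph.get('paragraphStyle', {}).get('namedStyleType')


print(named_style(sample_element))
```

Using .get with defaults avoids a KeyError on elements that are not paragraphs or that lack a paragraphStyle.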

To reflect this in your script, change the following:

    for value in elements:
        if 'paragraph' in value:
            elements = value.get('paragraph').get('elements')

To:

    for value in elements:
        if 'paragraph' in value and value['paragraph']['paragraphStyle']['namedStyleType'] == 'HEADING_2':  # Modified
            elements = value.get('paragraph').get('elements')
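The same check can be generalised to the other heading levels and to the bold text mentioned in the question. The following is only a sketch under assumptions: the function names and sample_content are hypothetical, it looks at top-level paragraphs only (not tables or the table of contents), and it relies on the namedStyleType and textStyle.bold fields of the Docs API response.

```python
from collections import defaultdict

HEADING_STYLES = {'HEADING_1', 'HEADING_2', 'HEADING_3', 'HEADING_4'}


def paragraph_text(paragraph):
    """Join the content of every textRun in a paragraph."""
    return ''.join(
        elem.get('textRun', {}).get('content', '')
        for elem in paragraph.get('elements', [])
    )


def collect_headings(elements, styles=HEADING_STYLES):
    """Group heading text by namedStyleType (top-level paragraphs only)."""
    headings = defaultdict(list)
    for value in elements:
        paragraph = value.get('paragraph')
        if paragraph is None:
            continue
        style = paragraph.get('paragraphStyle', {}).get('namedStyleType')
        if style in styles:
            headings[style].append(paragraph_text(paragraph).strip())
    return dict(headings)


def collect_bold_runs(elements):
    """Return the text of every textRun explicitly marked as bold."""
    bold = []
    for value in elements:
        for elem in value.get('paragraph', {}).get('elements', []):
            run = elem.get('textRun', {})
            if run.get('textStyle', {}).get('bold'):
                bold.append(run.get('content', '').strip())
    return bold


# Hypothetical data shaped like the document body content.
sample_content = [
    {'paragraph': {'paragraphStyle': {'namedStyleType': 'HEADING_1'},
                   'elements': [{'textRun': {'content': 'SEO checklist\n', 'textStyle': {}}}]}},
    {'paragraph': {'paragraphStyle': {'namedStyleType': 'NORMAL_TEXT'},
                   'elements': [{'textRun': {'content': 'Use the ', 'textStyle': {}}},
                                {'textRun': {'content': 'main keyword', 'textStyle': {'bold': True}}},
                                {'textRun': {'content': ' early.\n', 'textStyle': {}}}]}},
    {'paragraph': {'paragraphStyle': {'namedStyleType': 'HEADING_2'},
                   'elements': [{'textRun': {'content': 'Title tag\n', 'textStyle': {}}}]}},
    {'sectionBreak': {}},
]

print(collect_headings(sample_content))
print(collect_bold_runs(sample_content))
```

Note that this only reports bold that is set on the text run itself; text that appears bold purely because of its named style may not carry an explicit bold flag, so treat the result as a starting point.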

Reference:

  • NamedStyleType