Extract text from Google Scholar

Question

I am trying to extract the text snippet that Google Scholar shows for a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but the result contains a lot of extra text such as

</div><div class="gs_fl"... and so on.

Is there an easy way, or some code, that can help me get the text without this redundant markup?

Recommended answer

You need an HTML parser:

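import lxml.html

# `html` is assumed to hold the saved Google Scholar page source as a string
doc = lxml.html.fromstring(html)
# xpath() returns a list of elements, so call text_content() on each match
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_fl"]')]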

You can install lxml with "pip install lxml". If pip has to build it from source on your platform, you will also need its C dependencies (libxml2 and libxslt), and the details differ from platform to platform.
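If installing lxml is inconvenient, Python's built-in html.parser module can do a similar job without extra dependencies. The following is a minimal sketch of that alternative, not the answer's own method. The helper name extract_text_by_class, the sample markup, and the use of gs_rs as the class wrapping the snippet are all assumptions for illustration; check the real page source for the class you need.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []     # text pieces of the current match
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            # Nested tag inside a match: track depth to find the real end.
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            # Match on a class token, so class="gs_rs extra" also counts.
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace left over from the markup.
            self.results.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text_by_class(html, target_class):
    """Hypothetical helper: return the text of each element with target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup; a real Google Scholar page may differ.
sample = (
    '<div class="gs_ri"><h3 class="gs_rt"><a href="#">A title</a></h3>'
    '<div class="gs_rs">First line of the <b>snippet</b> text.</div>'
    '<div class="gs_fl"><a href="#">Cited by 12</a> '
    '<a href="#">Related articles</a></div></div>'
)
snippets = extract_text_by_class(sample, "gs_rs")
print(snippets)
```

The depth counter assumes every non-void tag inside the target element is properly closed. For badly nested markup, a tree-building parser such as lxml is the more forgiving choice.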
