Extract the whole text contained in a webpage using a Chrome extension


Problem description


I'm developing a Chrome extension for text parsing of Google search results. I want the user to enter some text in the omnibox and then be directed to a Google search page.

function navigate(url) {
    chrome.tabs.query({active: true, currentWindow: true}, function(tabs) {
        chrome.tabs.update(tabs[0].id, {url: url});
    });
}

chrome.omnibox.onInputEntered.addListener(function(text) {
    navigate("https://www.google.com.br/search?hl=pt-BR&lr=lang_pt&q=" + text + "%20%2B%20cnpj");
});

alert('Here is where the text will be extracted');

After directing the current tab to the search page, I want to get the plain text form of the page, to parse it afterwards. What is the most straightforward way to accomplish this?

Solution

Well, parsing the webpage is probably going to be easier to do as a DOM instead of plain text. However, that is not what your question asked.
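To illustrate the DOM route, here is a minimal sketch that returns the text of selected elements instead of the whole page text. The 'h3' selector is only an assumption about where result titles live in the page markup, and extractHeadings and injectHeadingExtractor are hypothetical names, not part of any API.

```javascript
// Collects the trimmed text of every <h3> element in a document-like object.
// ASSUMPTION: 'h3' is a guess at where result titles live; adjust as needed.
function extractHeadings(doc) {
    var nodes = doc.querySelectorAll('h3');
    return Array.prototype.map.call(nodes, function (node) {
        return node.textContent.trim();
    });
}

// Injects extractHeadings into the tab by serializing its source. The array
// it returns is handed to the callback as the result for the top frame.
function injectHeadingExtractor(tabId, callback) {
    chrome.tabs.executeScript(tabId, {
        code: '(' + extractHeadings.toString() + ')(document);'
    }, callback);
}
```

Only values that can be serialized, such as strings and arrays of strings, come back through the callback, so the sketch returns text rather than DOM nodes.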

Your code also has issues with how it navigates to the page and how it deals with the asynchronous nature of web navigation. That is not what your question asked either, but it affects how the thing you did ask about, getting text from a webpage, is implemented.
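For reference, one way to deal with the asynchronous navigation is to wait for chrome.tabs.onUpdated to report status 'complete' before touching the page. This is a sketch under assumptions: buildSearchUrl and navigateAndWait are hypothetical helpers, the tabs API is passed in as a parameter (pass chrome.tabs in the extension), and injecting into the search page afterwards would probably need host permission for it in the manifest.

```javascript
// Builds the search URL from the question, encoding the user's text so that
// characters such as '&' or '#' cannot break the query string.
function buildSearchUrl(text) {
    return 'https://www.google.com.br/search?hl=pt-BR&lr=lang_pt&q=' +
        encodeURIComponent(text) + '%20%2B%20cnpj';
}

// Navigates the tab and calls onLoaded(tabId) once the tab reports that
// loading is complete. tabsApi is expected to be chrome.tabs.
function navigateAndWait(tabsApi, tabId, url, onLoaded) {
    function listener(updatedTabId, changeInfo) {
        if (updatedTabId === tabId && changeInfo.status === 'complete') {
            tabsApi.onUpdated.removeListener(listener);
            onLoaded(tabId);
        }
    }
    // Register before navigating so the 'complete' event cannot be missed.
    tabsApi.onUpdated.addListener(listener);
    tabsApi.update(tabId, {url: url});
}
```

In the omnibox listener this could be called as navigateAndWait(chrome.tabs, tabs[0].id, buildSearchUrl(text), extractText), where extractText is whatever function injects the content script. If the previous page was still loading, its own 'complete' event might arrive first, so a stricter version would also compare the tab's URL.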

As such, to answer your question of how to extract the plain text from a webpage, I implemented doing so when the user clicks a browser_action button. This keeps the answer to that question separate from the other issues in your code.

As wOxxOm mentioned in a comment, you have to use a content script to access the DOM of a webpage. As he did, I suggest you read the Overview of Chrome extensions.

You can inject a content script using chrome.tabs.executeScript. Normally, you would inject a script contained in a separate file using the file property of the details parameter. When the code only has to send back the text of the webpage (without parsing, etc.), it is reasonable to insert just the single line of code that this requires, using the code property of the details parameter. Given that you have said nothing about your requirements for the text, the code below returns document.body.innerText.

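To get the text back to the background script, the code below relies on the callback of chrome.tabs.executeScript: the value of the last expression evaluated in the injected code is passed to the callback, receiveText, in an array of results.

For a more complex content script, an alternative is to send the text with chrome.runtime.sendMessage() and receive it in the background script with a listener added to chrome.runtime.onMessage.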
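The message-passing route mentioned above can be sketched as follows. This is an illustration and not part of the extension shown below: the 'pageText' message type and all four function names are hypothetical, and the two halves would normally live in separate files (a content script and background.js).

```javascript
// Hypothetical message type; any string both scripts agree on would do.
var PAGE_TEXT_MESSAGE = 'pageText';

// Content-script side: builds the message from a document-like object.
function buildPageTextMessage(doc) {
    return {type: PAGE_TEXT_MESSAGE, text: doc.body.innerText};
}

// Background side: returns the text of a recognized message, otherwise null.
function extractPageText(message) {
    if (message && message.type === PAGE_TEXT_MESSAGE &&
            typeof message.text === 'string') {
        return message.text;
    }
    return null;
}

// Call this from the content script to send the page text.
function sendTextFromContentScript() {
    chrome.runtime.sendMessage(buildPageTextMessage(document));
}

// Call this once from the background script. onText receives the text and
// the tab that sent it.
function listenInBackground(onText) {
    chrome.runtime.onMessage.addListener(function (message, sender) {
        var text = extractPageText(message);
        if (text !== null) {
            onText(text, sender.tab);
        }
    });
}
```

Checking the message type keeps the listener from reacting to unrelated messages if the extension later sends others.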

background.js:

chrome.browserAction.onClicked.addListener(function(tab) {
    console.log('Injecting content script(s)');
    //On Firefox document.body.textContent is probably more appropriate
    chrome.tabs.executeScript(tab.id,{
        code: 'document.body.innerText;'
        //If you had something somewhat more complex you can use an IIFE:
        //code: '(function (){return document.body.innerText})();'
        //If your code was complex, you should store it in a
        // separate .js file, which you inject with the file: property.
    },receiveText);
});

//tabs.executeScript() returns the results of the executed script
//  in an array of results, one entry per frame in which the script
//  was injected.
function receiveText(resultsArray){
    console.log(resultsArray[0]);
}
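receiveText logs only the first entry of the results array. A slightly more defensive callback might check chrome.runtime.lastError, which Chrome sets when the injection fails (for example on pages an extension is not allowed to script), and join the text of all frames when allFrames: true is used. collectText and receiveTextSafely are hypothetical names for this sketch.

```javascript
// Joins the string results of all frames, skipping frames that returned
// nothing usable. Returns '' when there are no results at all.
function collectText(resultsArray) {
    if (!Array.isArray(resultsArray)) {
        return '';
    }
    return resultsArray.filter(function (entry) {
        return typeof entry === 'string';
    }).join('\n');
}

// Drop-in alternative to receiveText for use as the executeScript callback.
function receiveTextSafely(resultsArray) {
    if (chrome.runtime.lastError) {
        console.error('Injection failed: ' + chrome.runtime.lastError.message);
        return;
    }
    console.log(collectText(resultsArray));
}
```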

manifest.json:

{
    "description": "Gets the text of a webpage and logs it to the console",
    "manifest_version": 2,
    "name": "Get Webpage Text",
    "version": "0.1",

    "permissions": [
        "activeTab"
    ],

    "background": {
        "scripts": [
            "background.js"
        ]
    },

    "browser_action": {
        "default_icon": {
            "32": "myIcon.png"
        },
        "default_title": "Get Webpage Text",
        "browser_style": true
    }
}
