PHP - extract text from HTML, translate and put it back

Problem description

I'm using an API to translate my blog, but it sometimes messes up my HTML, which leaves me with extra work to fix everything.

What I'm now trying to do is to extract the content from the html, translate it and put it back where it was.

I first tried to do this with preg_replace: I would replace every tag with something like ##a_number## and then revert to the original tag once the text had been translated. Unfortunately, this is very difficult to manage because every tag has to be replaced by a unique value.
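For reference, the placeholder idea described above could be sketched roughly as follows. This is only an illustration under assumptions: protectTags() and restoreTags() are hypothetical helper names, strtoupper stands in for the translation call, and the simple pattern <[^>]+> would misfire on attribute values that contain a ">" character. A real translation service might also alter the placeholders themselves.

```php
<?php
// Hypothetical helper: replace each tag with a unique numbered placeholder.
// Returns the masked string and the list of original tags.
function protectTags(string $html): array
{
    $tags = [];
    $masked = preg_replace_callback('/<[^>]+>/', function (array $m) use (&$tags): string {
        $tags[] = $m[0];                          // remember the original tag
        return '##' . (count($tags) - 1) . '##';  // its index is the unique value
    }, $html);

    return [$masked, $tags];
}

// Hypothetical helper: put the original tags back in place of the placeholders.
function restoreTags(string $text, array $tags): string
{
    return preg_replace_callback('/##(\d+)##/', function (array $m) use ($tags): string {
        // Unknown placeholder numbers are left untouched.
        return $tags[(int) $m[1]] ?? $m[0];
    }, $text);
}

$html = '<p>go to <a href="/">home page</a>.</p>';
[$masked, $tags] = protectTags($html);

// strtoupper is a stand-in for the real translation call (an assumption).
$translated = strtoupper($masked);

echo restoreTags($translated, $tags);
```

Using the array index as the placeholder number takes care of uniqueness, but the approach still depends on the translator leaving the ##n## markers intact.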

I then tried it with "simple html dom", which can be found here: http://simplehtmldom.sourceforge.net/manual.htm

// Parse the content and print every <div> element (tags included)
$html = str_get_html($content);
$ret = $html->find('div');
foreach ($ret as $key => $value) {
    echo $value;
}

This way I get all the text, but there is still some HTML in each value (divs inside divs), and I don't know how to put the translated text back into the original object. The structure of this object is so complex that my browser crashes when I try to display it.

I'm running out of options, and there are probably more straightforward ways of doing this. What I'd like to find is a way to get an object or array containing all the HTML on one side and all the text on the other. I would loop through the text to get it translated and then merge everything back together to avoid breaking the HTML.

Do you see better options to achieve this?

Thanks,
Laurent

Solution

For example, I have the following HTML, where all the words are lowercase:

<div>
    <h2>page not found!</h2>
    <p>go to <a href="/">home page</a> or use the <a href="/search">search</a>.</p>
</div>

My task is to capitalize every word in the text. To solve it, I fetch all text nodes and convert them using the ucwords function (of course, you should use your translation function instead).

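libxml_use_internal_errors(true); // suppress warnings about imperfect markup
$dom = new DOMDocument();
// $html holds the markup shown above; the flags stop libxml from adding
// <html>/<body> wrappers and a doctype
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Visit every text node, skipping whitespace-only ones
foreach ($xpath->query('//text()') as $text) {
    if (trim($text->nodeValue) !== '') {
        $text->nodeValue = ucwords($text->nodeValue);
    }
}

echo $dom->saveHTML();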

The above outputs the following:

<div>
    <h2>Page Not Found!</h2>
    <p>Go To <a href="/">Home Page</a> Or Use The <a href="/search">Search</a>.</p>
</div>
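
Building on the approach above, the following is a minimal sketch of how the text nodes might be collected into an array, translated in one pass, and written back. It is an illustration under assumptions, not a definitive implementation: translateBatch() and translateHtml() are hypothetical names, and strtoupper stands in for a real translation API call. The XPath filter that skips script and style contents is an addition that the original answer does not include.

```php
<?php
// Hypothetical stand-in for a real translation API call: takes an array of
// strings and returns the converted strings under the same keys.
function translateBatch(array $texts): array
{
    return array_map('strtoupper', $texts);
}

// Hypothetical helper: translate only the text of an HTML fragment.
function translateHtml(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // Collect non-blank text nodes, skipping script and style contents.
    $nodes = [];
    $texts = [];
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        if (trim($node->nodeValue) !== '') {
            $nodes[] = $node;
            $texts[] = $node->nodeValue;
        }
    }

    // Translate only the text, then write each result back into its node.
    foreach (translateBatch($texts) as $i => $translated) {
        $nodes[$i]->nodeValue = $translated;
    }

    return $dom->saveHTML();
}

$html = '<div><h2>page not found!</h2><p>go to <a href="/">home page</a>.</p></div>';
$result = translateHtml($html);
echo $result;
```

Keeping the nodes and their texts in two parallel arrays gives the "HTML on one side, text on the other" split the question asks for, and the markup itself never passes through the translator. Depending on the PHP and libxml versions, DOMDocument::loadHTML may misread UTF-8 input unless the encoding is declared, so non-ASCII content is worth testing separately.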
