PostgreSQL - Replace HTML Entities
Question
I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like this:
UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the HTML entity with a regex like /&#x(..);/, then passing just the %1 part to the ASCII decoder and reconstructing the string... or something.
Shall I just press on with the queries? There will probably only be 40 or so of them.
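For illustration, the regex idea above could be sketched in plain PL/pgSQL, without any extension. This is only a minimal sketch under stated assumptions: the function name decode_hex_entities is hypothetical, the database encoding is assumed to be UTF8, and only hexadecimal references of the form &#x...; are handled (no named entities such as &amp;, no decimal references). A reference to code point 0 or to a value beyond the Unicode range would make chr() raise an error.

```sql
-- Hypothetical helper (not a built-in): decodes hexadecimal character
-- references such as &#xe0; and leaves everything else untouched.
CREATE OR REPLACE FUNCTION decode_hex_entities(input text) RETURNS text AS $$
DECLARE
    result text := input;
    r record;
BEGIN
    -- Visit each distinct run of hex digits found inside a &#x...; reference.
    FOR r IN
        SELECT DISTINCT (regexp_matches(input, '&#x([0-9a-fA-F]{1,6});', 'g'))[1] AS hex
    LOOP
        -- 'x' || zero-padded hex digits, cast to bit(32) and then to int,
        -- yields the code point; chr() turns it into a character (UTF8 assumed).
        result := replace(
            result,
            '&#x' || r.hex || ';',
            chr(('x' || lpad(r.hex, 8, '0'))::bit(32)::int)
        );
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- Example call; in a UTF8 database this should yield: café à la carte
SELECT decode_hex_entities('caf&#xe9; &#xe0; la carte');
```

With something like this, the 40 or so separate queries would collapse into a single UPDATE that calls the function on name.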
Recommended answer
Write a function in PL/PerlU (plperlu) and use the HTML::Entities module: https://metacpan.org/pod/HTML::Entities
Of course, you need to have Perl installed and PL/Perl available.
1) First of all, create the procedural language plperlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)
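Applied to the table from the question, the function could replace the whole series of per-entity updates. The following is a sketch rather than a tested migration: it assumes the nodes table and name column from the question, that decode_html_entities has been created as in step 2, and that a loose LIKE '%&%;%' filter is good enough to find candidate rows (that pattern is an assumption, not something from the original answer).

```sql
BEGIN;

-- Preview a sample of what would change before touching any data.
SELECT name, decode_html_entities(name) AS decoded
FROM nodes
WHERE name LIKE '%&%;%'
LIMIT 20;

-- Rewrite only rows whose decoded value actually differs,
-- so unchanged rows are not updated needlessly.
UPDATE nodes
SET name = decode_html_entities(name)
WHERE name LIKE '%&%;%'
  AND name IS DISTINCT FROM decode_html_entities(name);

COMMIT;
```

Running the preview inside the same transaction makes it easy to issue ROLLBACK instead of COMMIT if the decoded values look wrong.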