PostgreSQL - Replace HTML Entities

Question

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like this:

UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.
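The regex idea described above (capture the hex digits, convert them, rebuild the string) can be sketched in plain PL/pgSQL. This is only an illustration under assumptions: the function name decode_hex_entities is hypothetical, the database encoding is assumed to be UTF8, and only hexadecimal numeric entities of the form &#x..; are handled (named entities such as &amp; are left untouched).

```sql
-- Hypothetical helper: decodes hexadecimal numeric entities only.
-- Assumes a UTF8 database; chr() raises an error for code point 0
-- and for values above the Unicode range.
CREATE FUNCTION decode_hex_entities(input text) RETURNS text AS $$
DECLARE
    result text := input;
    r record;
BEGIN
    -- Collect each distinct hex code that appears in the input.
    FOR r IN
        SELECT DISTINCT (regexp_matches(input, '&#x([0-9a-fA-F]{1,6});', 'g'))[1] AS hex
    LOOP
        -- Convert the hex digits to an integer, then to a character,
        -- and replace every occurrence of that entity.
        result := replace(
            result,
            '&#x' || r.hex || ';',
            chr(('x' || lpad(r.hex, 8, '0'))::bit(32)::int)
        );
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

SELECT decode_hex_entities('voil&#xe0;');
```

This avoids one UPDATE per entity, but it covers less ground than a full entity decoder.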

Recommended answer

Write a function in PL/PerlU (untrusted PL/Perl) that uses the HTML::Entities module: https://metacpan.org/pod/HTML::Entities

Of course you need to have perl installed and pl/perl available.
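One way to check whether PL/Perl is available on the server is to query the pg_available_extensions system view. This is a minimal sketch; it only reports what the server has packaged and does not verify that the HTML::Entities Perl module itself is installed.

```sql
-- Lists the PL/Perl variants the server knows about.
-- installed_version is NULL until CREATE EXTENSION has been run
-- in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name LIKE 'plperl%';
```

If no rows come back, the PL/Perl package for your PostgreSQL installation is probably missing from the server.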

1) First of all create the procedural language pl/perlu:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&amp;.... asasdasdasd &#x2026;');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)
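Applied to the table from the question, the function could replace the whole series of per-entity UPDATE statements. The following is a sketch, assuming the nodes table and name column from the question; the regular expression used as a filter is an illustrative assumption, not part of the original answer.

```sql
BEGIN;

-- Preview a sample of the rows that would change before touching anything.
SELECT name, decode_html_entities(name) AS decoded
FROM nodes
WHERE name ~ '&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);'
LIMIT 20;

-- Decode every entity in one pass, only on rows that appear to contain one.
UPDATE nodes
SET name = decode_html_entities(name)
WHERE name ~ '&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);';

COMMIT;
```

Running the preview first inside the transaction makes it possible to ROLLBACK instead of COMMIT if the decoded values look wrong.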
