Automatically determine the natural language of a website page given its URL


Question

I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

def LanguageUsed(url):
    # stuff: fetch the page at `url` and work out its language
    pass

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.).

Summary of Results: I have a reasonable solution working in Python using the oice.langdet package from PyPI. It does a decent job of discriminating English from non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also, oice.langdet is GPL-licensed.
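The fetching step mentioned above could look roughly like the sketch below. It assumes Python 3 (`urllib.request`) and uses only the standard library. The names `fetch_page_text` and `html_to_text` are made up for illustration, and the call into oice.langdet is deliberately left out because its API is not shown here. Stripping markup first matters because tags, scripts and styles would otherwise skew any statistical detector towards English-looking tokens.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page_text(url, timeout=10):
    """Download `url` and return its visible text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

The string returned by `fetch_page_text(url)` is what you would then hand to whichever language detector you pick.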

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.

The Google Natural Language Detection API works very well (if not the best I've seen). However, it is a JavaScript API, and its terms of service forbid automated use.

Recommended answer

This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.
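To make the character n-gram idea concrete, here is a minimal sketch in Python. It is not a port of the Java identifier mentioned above. It builds a trigram count profile per language from a tiny hard-coded sample, then picks the language whose profile has the highest cosine similarity to the input. The names `char_ngrams`, `cosine_similarity`, `SAMPLES`, `PROFILES` and `detect_language` are all made up for illustration, and the three sample sentences are toy data. A usable detector would need much larger training text for each language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of lowercased, space-padded text."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy training text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a "
          "sentence written in the english language",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et "
          "ceci est une phrase écrite dans la langue française",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "dies ist ein satz in der deutschen sprache",
}

PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the code of the profile most similar to `text`."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(grams, PROFILES[lang]))
```

For example, `detect_language("this is the language that the page is written in")` picks the English profile. Cosine similarity over raw counts is only one possible scoring choice. Rank-based profile comparison is another common option.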
