URI encoding in UNICODE for apache httpclient 4

Problem description

I am working with Apache HttpClient 4 for all of my web access. This means that every query I make has to pass the URI syntax checks. One of the sites that I am trying to access uses UNICODE as the URL GET parameter encoding, i.e.:

http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

The problem is that URI doesn't support UNICODE encoding (it only supports UTF-8). The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF8")) results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding that it expects.

By the way, a URL object can be created and even used to connect to the web site using the non-converted URL. Is there any way of creating a URI in a non-UTF-8 encoding? Is there any way of working with Apache HttpClient 4 with a regular URL (and not a URI)?

thanks, Niv

Solution

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
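This can be checked against java.net.URI directly. The following is a minimal sketch, assuming a placeholder host (example.com) and a hypothetical helper class name; it only shows that the JDK's URI parser rejects a % that is not followed by two hex digits, while the UTF-8 form parses.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of HttpClient.
class PercentUCheck {
    // Returns true if java.net.URI accepts the string, false if it throws.
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // %u is not a valid escape: '%' must be followed by two hex digits.
        System.out.println(parses("http://example.com/?q=%u05E0%u05D9%u05D1")); // false
        // Ordinary percent-encoded bytes are accepted.
        System.out.println(parses("http://example.com/?q=%D7%A0%D7%99%D7%91")); // true
    }
}
```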

%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
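To make the format concrete, here is a rough Java sketch of what JavaScript's escape produces. The class and method names are made up for illustration, and the set of characters left unescaped is written from memory of the JavaScript specification, so treat it as an approximation rather than a reference implementation.

```java
// Hypothetical re-implementation of JavaScript's legacy escape(), for illustration only.
class JsEscapeSketch {
    // Characters that escape() leaves as-is (assumed: letters, digits and @*_+-./).
    private static final String UNESCAPED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (UNESCAPED.indexOf(c) >= 0) {
                out.append(c);                                 // left unchanged
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));  // %XX for code units below 256
            } else {
                out.append(String.format("%%u%04X", (int) c)); // %uXXXX per UTF-16 code unit
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u05E0\u05D9\u05D1" is the Hebrew word from the question, written with escapes.
        System.out.println(escape("\u05E0\u05D9\u05D1")); // %u05E0%u05D9%u05D1
    }
}
```

The output is not a valid URI component, so a string built this way cannot be handed to java.net.URI.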

(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
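On the Java side, the closest standard-library call is URLEncoder.encode, which is what the question already uses. A minimal sketch follows; the wrapper class is hypothetical. Note that URLEncoder implements form encoding, so unlike encodeURIComponent it turns a space into + rather than %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper for illustration.
class Utf8QueryValue {
    // Percent-encodes a query value as UTF-8 bytes.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hebrew word from the question, written with escapes to avoid source-encoding issues.
        System.out.println(encode("\u05E0\u05D9\u05D1")); // %D7%A0%D7%99%D7%91
    }
}
```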

If a site requires %u#### sequences in its query string, it is very badly broken.

Is there any way of creating URI in non UTF-8 encoding?

Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.

So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
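In Java, the same bytes can be produced by passing a different charset name to URLEncoder. This is a sketch under a few assumptions: the class name is hypothetical, the JDK in use ships the windows-1255 charset (standard JDK builds do, minimal runtime images may not), and the query string is trimmed to two of the parameters from the question purely for brevity; whether the site accepts the shortened form is untested.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper for illustration.
class LegacyCharsetQuery {
    // Percent-encodes a query value using the bytes of the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String term = "\u05E0\u05D9\u05D1"; // Hebrew word from the question
        String encoded = encode(term, "windows-1255");
        System.out.println(encoded); // %F0%E9%E1

        // Shortened parameter list (assumption); the result is a syntactically valid URI.
        URI uri = URI.create(
            "http://maya.tase.co.il/bursa/index.asp?view=search&srh_txt=" + encoded);
        System.out.println(uri);
    }
}
```

Because the result passes the URI syntax checks, it should be usable with HttpClient 4 in the usual way, for example through the HttpGet constructor that takes a URI or a string.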
