file_get_contents on a file with Cyrillic characters and undefined encoding


Problem description

I cannot read Cyrillic characters in PHP from a .txt file with unknown encoding. I have tried almost everything I could find on the web. What PHP function do I need to use to get the contents of this file?

https://www.dropbox.com/s/w7cex4wiogyytvm/100004- 6.txt

Edit

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    debug($string);

Output: the debug output is broken, and if I try to save the value to the database it fails (the BOM causes trouble and the value cannot be saved).

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = mb_convert_encoding ($string , 'utf-8');
    debug($string);

Output:

    '????? ???:300/500V
    ???? ???:2000V
    ????? ???? ??????: ? +70??
    ?? ??? ?? (????? 5 ??.): ? +160??
    ????? ?????? ?? ?????: ? +5??   '

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("UTF-16", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

췮㌰〯㔰ざഊ죱㈰〰嘍્⃰⃲㨠‫㜰냑ഊ쿰⃱밠⣭㔠⤺⃤⬱㘰냑ഊ췠볭

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("ISO-8859-5", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

    Эюьшэрыхэ эряюэ:300/500V
    Шёяшђхэ эряюэ:2000V
    ЭрМтшёюър №рсюђэр ђхьях№рђѓ№р: фю +70Аб
    Я№ш ъ№рђюъ ёяюМ (эрМьэюуѓ 5 ёхъ.): фю +160Аб
    ЭрМэшёър ђхьях№рђѓ№р я№ш шэёђрырішМр: фю +5Аб

Now that I have tested multiple files, I no longer think the input file is Unicode encoded. I succeeded in reading my test file, but the one that matters (whose encoding I don't know) still does not work. So I changed the question: the encoding still seems to be undefined.

A little more for clarity: I can open this file and see it normally in Notepad. It contains the Cyrillic characters that cause this problem.

Recommended answer

The file is encoded in CP1251, a.k.a. MS-CYRL, a.k.a. "Cyrillic (Windows)".

    $string = file_get_contents($path);
    $string = iconv('CP1251', 'UTF-8', $string);

How did I figure this out? I opened the file in a text editor and tried a few relevant encodings until it looked right. There is hardly anything else you can do if the file encoding is unknown.
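
The same trial-and-error can be done from PHP instead of a text editor. The following is only a minimal sketch: `previewEncodings()` and the candidate list are hypothetical names chosen for illustration, not part of the original answer, and it assumes the local iconv build recognises the encoding names used. It converts the same raw bytes from each candidate encoding so the results can be compared by eye.

```php
<?php
// Hypothetical helper (not from the original answer): convert the same raw
// bytes from each candidate source encoding to UTF-8 for visual comparison.
function previewEncodings(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // iconv() returns false (and raises a notice) when the bytes are not
        // valid in the source encoding; @ silences that notice in this sketch.
        $converted = @iconv($encoding, 'UTF-8', $bytes);
        $previews[$encoding] = ($converted === false) ? null : $converted;
    }
    return $previews;
}

// Sample bytes: the word "Привет" as it is stored in CP1251.
// With a real file these would come from file_get_contents($path).
$sample = "\xCF\xF0\xE8\xE2\xE5\xF2";

// The candidate list is an assumption: common single-byte Cyrillic encodings.
foreach (previewEncodings($sample, ['CP1251', 'KOI8-R', 'ISO-8859-5']) as $encoding => $text) {
    echo $encoding, ': ', $text ?? '(conversion failed)', PHP_EOL;
}
```

Only a human reader can decide which line "looks right": every byte sequence is valid in these single-byte encodings, so a conversion that succeeds does not prove the guess was correct.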
