fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?


Question


Similar questions:

Some characters in a CSV file are not read by PHP fgetcsv()

fgetcsv() ignores special characters when they are at the beginning of a line

My application has a form where users can upload a CSV file (its 5 internal users have always uploaded valid files: comma-delimited, quoted, records ending with LF), and the file is then imported into a database using PHP:

$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}

For reasons I cannot change, the users upload the file encoded in the Windows-1250 charset, a single-byte, 8-bit character encoding.

The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped by fgetcsv(). Example data:

"15","Ústav"
"420","Špičák"
"7","Tmaň"

becomes

Array (
  0 => 15
  1 => "stav"
)
Array (
  0 => 420
  1 => "pičák"
)
Array (
  0 => 7
  1 => "Tma"
)

(Note that č is kept, but Ú, Š and ň are dropped.)

The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken, so that I should look for a different way to parse CSV?

Recommended answer

It turns out that I didn't read the documentation carefully enough: fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (byte values up to 127), but the documentation also says:


Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.

In other words, fgetcsv() tries to be binary-safe, but in practice it is not, because it also interprets the bytes according to the current locale's character set. It will probably mangle the data it reads, and this setting is not configured in php.ini but taken from the $LANG environment variable.
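To see which locale is actually in effect before parsing, a quick diagnostic along the following lines may help. This is only a sketch: the variable names and the UTF-8 check are illustrative assumptions, not part of the original answer, and the values printed depend entirely on the server.

```php
<?php
// Passing '0' as the locale asks setlocale() to return the current
// setting for the category without changing it.
$ctype = setlocale(LC_CTYPE, '0');

// getenv() returns false when the variable is not set.
$lang = getenv('LANG');

printf("LC_CTYPE: %s\n", $ctype);
printf("LANG:     %s\n", $lang === false ? '(not set)' : $lang);

// Hypothetical heuristic: a UTF-8 locale combined with single-byte
// input (such as Windows-1250) is the combination the docs warn about.
$isUtf8Locale = (bool) preg_match('/utf-?8/i', (string) $ctype);
printf("UTF-8 locale: %s\n", $isUtf8Locale ? 'yes' : 'no');
```

If the check reports a UTF-8 locale while the uploaded files are single-byte, the two disagree, which matches the situation the documentation note describes.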

I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comments in the docs to parse them into an array:

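$fhandle = fopen($uploaded_file, 'r');
// fgets() works on bytes; the length argument is omitted so it reads a whole line
while (($raw_row = fgets($fhandle)) !== false) {
    // csvstring_to_array() is the user-contributed function from the docs comments, not a PHP built-in
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}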
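A variation on the same idea, as a minimal sketch rather than the answer's exact method: read each line as bytes with fgets(), convert it from Windows-1250 to UTF-8 with iconv(), and only then parse it with the built-in str_getcsv() in place of the user-contributed csvstring_to_array(). The helper name read_cp1250_csv() is hypothetical. The sketch assumes the iconv extension is available, that UTF-8 is the encoding wanted downstream, and that no quoted field contains an embedded line break (line-by-line reading would split such a record).

```php
<?php
/**
 * Hypothetical helper: read a Windows-1250 encoded CSV file line by line
 * and return its rows as arrays of UTF-8 strings.
 */
function read_cp1250_csv(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    // fgets() returns raw bytes, so no character is dropped at this stage.
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue; // skip blank lines
        }

        // Convert the single-byte input to UTF-8 before parsing.
        $utf8 = iconv('WINDOWS-1250', 'UTF-8', $line);
        if ($utf8 === false) {
            fclose($handle);
            throw new RuntimeException('Line is not valid Windows-1250');
        }

        // Same delimiter, enclosure and escape as in the question's fgetcsv() call.
        $rows[] = str_getcsv($utf8, ',', '"', '\\');
    }

    fclose($handle);
    return $rows;
}
```

Calling read_cp1250_csv($uploaded_file) on the example data from the question should then give rows such as ['15', 'Ústav'], with the diacritics intact and encoded as UTF-8.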
