When internal_encoding is set (directly, say, internal_encoding=UTF-8, or indirectly through default_charset) to a value different from the current console charset, PHP will change the console charset to that value while executing the script, leading to visual corruption of the console buffer.
This may not look like much of a problem, since it will try to revert the change before the script ends, but several problems remain:
1. If a startup error occurs (e.g. extension not found), the console CP will not be restored.
2. While the script runs, the console remains in a corrupted state.
3. This is RATHER surprising behavior, especially when you have both input_encoding and output_encoding set to expected (and desired) values.
For my example, I have a Russian Windows with the default console charset being CP866.
I set the settings for PHP CLI as follows:
default_mimetype = "text/plain" ; superfluous for CLI, but still
default_charset = "CP866" ; -- // --
input_encoding = "CP866" ; Presume I gonna read console input directly.
output_encoding = "CP866" ; I want program messages to be readable.
Then I have common shared configuration that sets
internal_encoding = "UTF-8"
...since I'm not stupid and prefer to work with data from different sources with least possible chances for corruption.
Boom! Console corruption the moment I try this config with newly installed PHP 7.1. Since half the extensions can't be loaded.
Yes, a quick run of "chcp" shows it has turned into CP65001 AKA UTF-8, and a run of "chcp 866" quickly puts it back in place, but that's me. I know what I'm looking at.
If you would like to convert input/output encoding, you need to use mbstring/iconv module's conversion feature.
These aren't enabled by default, so you need to enable them yourself.
BTW, I don't think mbstring converts console input. You might want to create feature request for this.
Reproduce script: (Windows 10, en_US)
Active code page: 437
>php -r "system('taskkill /f /pid '.getmypid());"
Active code page: 65001
I wouldn't call this a PHP bug because the "normal" exit situations I tried, including fatal errors, would revert the codepage correctly. The problem comes when the PHP process itself quits, so PHP doesn't have a chance to recover and that could cause other issues on its own. Setting the codepage to match the internal_encoding is a net benefit.
That said, I think this could be controlled by an INI setting, default enabled. The normal reasons not to add one don't apply here: for example, the value of the setting does not impact code so PHP libraries don't need to detect the setting's value to alter their behavior.
If disabled then code still works the same way, but output may be mojibake-d. And if the output is redirected to a file then the setting doesn't even matter at all.
So I'm going to convert this to a feature request.
A bit more background regarding these behaviors. The world is UTF-8 today. The Windows path issue, both long paths and UTF-8, was long standing. UTF-8 is now the default in PHP on Windows, as it became relatively long ago on Linux and other platforms, and in PHP 5.6 not very long ago. Still, Windows is very different from other internationalization approaches, although some improvements in system locale handling can be seen. Many APIs are still codepage bound, including console, path, I/O, etc. That is unlikely to change soon, if at all. This makes the portability of PHP on Windows lag behind. So for one, the behavior in PHP needs to be more consolidated across platforms; for the other, it needs to be done in a simple way. But even then there are various platform issues, so the actual thing is sometimes tricky to straighten out.
@anrdaemon, the INI configuration listed is inconsistent for 7.1 and even for earlier versions. For 7.1, it diverges from what was documented in UPGRADING in the first place. Then, the output_encoding directive is only useful if the iconv/mbstring ob handler is intended to be used. As Yasuo mentioned, this ob handler has no effect on CLI. Furthermore, the *_encoding directives are deprecated, see http://php.net/manual/en/iconv.configuration.php
Here's what I get with a nonexistent extension DLL:
Active code page: 437
$ x64\Release\php.exe -n -d extension_dir=nowhere -d extension=notfound -v
PHP Warning: PHP Startup: Unable to load dynamic library 'nowhere\notfound' - The specified module could not be found.
in Unknown on line 0
PHP 7.1.1-dev (cli) (built: Dec 12 2016 15:15:56) ( NTS MSVC14 (Visual C++ 2015) x64 )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2016 Zend Technologies
Active code page: 437
The encoding determination sequence, as specifically documented in UPGRADING, is kept the same, even though internal_encoding is used. Otherwise, there is no behavior difference, neither from earlier PHP versions nor from other platforms. What is new: yes, the console codepage is switched automatically, but that is not without a reason. The console is UTF-8 on the overwhelming number of platforms.
@requinix, that's a good catch. Of course, if a process is sent a KILL, it won't be able to handle it. The same will happen on SIGSEGV and in several other situations. This kind of behavior is what I expect @anrdaemon experiences. Any controlled exit will surely restore the console, but if -9 is sent, there's nothing that can be done. So it has to be a real crash in a C program that is happening. As for me, adding an INI setting just to work around a crash is not sensible. And doing so could probably even be misleading.
Imagine, you output a cyrillic string with 7.1, while the console codepage is 437 -
Active code page: 437
$ x64\Release\php.exe -n -d default_charset="CP866" -r "var_dump(sapi_windows_cp_get(), 'привет');"
Active code page: 437
If a default_charset=cp437 directive were set in 7.1, the result would be string(6) "??????", just like 7.0 would give by default. With default_charset=UTF-8 as the default in 7.0 it was the same, but in 7.1, it's
Now, if one wanted to output another one, say a Czech string, this is broken again. The only thing that works is using UTF-8 console output. The same concerns direct input, warnings, error messages, open and output paths, getting various data like user names, et cetera. Even if there were an INI setting and the switching were turned off, one or the other of the cases would put mojibake onto the console. On the other side, the input will be in an incompatible encoding. Say the console is cp866 but PHP uses UTF-8 internally, and you want to read a filename from the console to pass it to an I/O function. PHP will internally try to convert the cp866 chars into wchar_t's using UTF-8. Now, we can of course say: let's use a different codepage for input, then convert it to internal, then convert again into output. Diverging codepages for PHP internal, PHP input, PHP output, console input, console output ... well, hopefully one can see where it leads. Subsequent weirdness and bug reports are guaranteed :)
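To make the mojibake case above concrete, here is a minimal sketch (not PHP's internal logic) that simulates what a CP866 console displays when it is handed UTF-8 bytes. It assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Sketch only: simulate what a CP866 console shows when it receives UTF-8 bytes.
// Assumes the mbstring extension is loaded; $utf8 and $seen are illustrative names.

$utf8 = "\u{43F}\u{440}\u{438}\u{432}\u{435}\u{442}"; // "привет" as UTF-8, 12 bytes

// A CP866 console treats every byte as one character, so reinterpret the
// UTF-8 bytes as CP866 and convert back to UTF-8 to make the result printable.
$seen = mb_convert_encoding($utf8, 'UTF-8', 'CP866');

printf("bytes sent: %d, characters shown: %d\n", strlen($utf8), mb_strlen($seen, 'UTF-8'));
echo $seen, PHP_EOL; // mojibake rather than the original six-letter word
```

Six letters become twelve unrelated characters, which is the kind of corruption being discussed whenever the bytes written and the console code page disagree.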
The way to do it, if UTF-8 is not desired, is simply a setting like internal_encoding=cp1251, which will keep the current console codepage but also disable the multibyte path and other feature support. In this case, all the behavior reverts to what it was before 7.1. This is documented and done this way to explicitly keep backward compatibility for older apps or for scripts requiring the old behavior.
Sure, no implementation can be perfect; this one exhausts most of the possibilities the systems provide. If interested, it would also be worth checking how this topic is handled in other languages, Python for example :)
@anrdaemon, the reported issue is caused by not following the recommended UPGRADING way. The INI configuration is not supposed to deliver the expected result in any case. As for the crash behavior described by @requinix: yeah, that's a known one, but it is something different. It is noticeable in an abnormal crash situation and is impossible to catch. No other program would behave differently in this case. Now, if we say PHP crashes so often that it becomes an issue of this kind, then we have something else to fix.
UTF-8 and wide API usage is what really matters for the future and is the goal worth striving for. The old PHP on Windows behavior is still available with the corresponding configuration. UTF-8 became the default in 5.6, and now that's reflected on the Windows side as well. There are a number of factors that can make UTF-8 usage not as easy as on other platforms. There are Windows specific things, and there is some learning curve, but there is no reason to mix the old and new behavior. Either an app has clean UTF-8 support, or it goes by the legacy behavior. "Half UTF-8" support with an overcomplicated implementation would be something weird, as for me. If there's a crash scenario to investigate, that's what should be done. But otherwise, I'd see it as "not a bug", same as Yasuo.
The world doesn't revolve around PHP, and there are other programs writing to the same console at the same time.
Which totally do not expect random CP changes.
Least of all, I totally do not expect CP changes from INTERNAL program settings.
That said, let's take *NIX as an example.
When you are writing to a terminal in *NIX, you don't suddenly change the terminal codepage; you translate your data from your program's internal codepage to the terminal's one.
Why on earth does the Windows terminal have to be any different?
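The "translate, don't switch" approach can be sketched in userland roughly as follows. This is only an illustration under stated assumptions: to_console() is a hypothetical helper, the mbstring extension is assumed to be loaded, and CP866 stands in for whatever code page the console actually uses.

```php
<?php
// Sketch only: transcode text from the script's internal encoding to the
// console's encoding instead of changing the console code page.
// to_console() is a hypothetical helper, not part of PHP.

function to_console(string $text, string $consoleEncoding, string $internalEncoding = 'UTF-8'): string
{
    return mb_convert_encoding($text, $consoleEncoding, $internalEncoding);
}

$utf8  = "\u{43F}\u{440}\u{438}\u{432}\u{435}\u{442}"; // "привет", 12 bytes in UTF-8
$cp866 = to_console($utf8, 'CP866');                     // one byte per letter in CP866

echo strlen($utf8), ' -> ', strlen($cp866), ' bytes', PHP_EOL;
```

Characters that have no representation in the target code page cannot survive such a conversion, which is the limitation raised elsewhere in this thread.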
I can imagine the time it took for you to write all that text, but it's senseless.
How's "UTF-8 is undesirable"? It IS desirable. Internally. I want my application to use UTF-8 wherever possible.
Emphasis on "possible". As opposed to "wherever it want regardless of my expectations".
I'm reading your argumentations and all my reaction is an urge to shake my head in an attempt to get the wrongs out of it.
Since you seem to be using an ISO-8859 compatible encoding, your situation is better than with CP932 (SJIS). You can simply use CP866, but CP932 cannot be internal_encoding.
Therefore, we (Japanese) have to use UTF-8/EUC-JP for internal_encoding on Windows. Although web input/output conversion can be handled by mbstring/iconv, other inputs/outputs have to be converted manually, filenames especially. For this reason, I suppose most Japanese PHP users do not use multibyte filenames with PHP.
If you can use your code page without problems, I suggest using it as internal_encoding with CLI. It's a lot easier.
You may try
cmd.exe /f:on /k "chcp 65001"
to use UTF-8 with cmd.exe.
I tried it on my Windows. Japanese filenames (CP932) raised an error and "dir" stopped listing. If all of your filenames are UTF-8, it may work. However, other programs like Explorer may have problems with UTF-8 filenames. (I don't know how your version of Windows behaves.)
Anyway, a feasible resolution for a mixed encoding environment that treats various encodings automagically is a very tough subject, and use of UTF-8 could be problematic as described above.
That's just reinforcing my point, that single-minded pseudosolutions are not going to cut it.
Terminal encoding is a known complex problem, and by now people developed a treasure trove of knowledge in dealing with it.
Why does PHP have to "invent" its own ways? The "one size fits all" approach hasn't quite worked for the last several centuries, and that's just in my memory. I don't want to see anyone trying to walk that road only to be covered in shame. Yet again.
@anrdaemon, it would be nice if you could bring your language down to the technical level from your shame theory. You say "half the extensions can't be loaded" - could you post a reproduction case for this that can be debugged?
Get Far manager (+FAR Commands plugin, which is part of standard package - farmanager.com/download.php ), start PHP 7.1 with internal redirect…
view:<? php -d internal_encoding=utf-8 -r "sleep(3);"
…observe console corruption for the duration of script execution.
Maybe not the most self-contained demonstration, but very visual.
Works equally well with CP437.
So it doesn't load extensions while under Far Manager? What is an internal redirect? Preferably it should be reproducible with plain cmd.
"Extensions not loaded" with subsequent process interruption was how I first stumbled upon this issue. A worst case scenario, if you wish.
I have since found a way to reliably trigger it under normal circumstances.
Asking for "reproduction with cmd" isn't fair: CMD itself doesn't update the console buffer in realtime (nor is CMD strictly required to be present, to begin with… 99% of the time I don't use it at all, running PHP scripts straight from other programs through its association).
> console corruption the moment I try this config with newly installed PHP 7.1. Since half the extensions can't be loaded.
1) I guess you installed half of the extensions under a multibyte pathname. Is this correct?
2) I suppose if you move these extensions to a single byte pathname, then they are loaded. Is this correct?
3) I use the Windows version only on occasion so I could be wrong, but doesn't a multibyte pathname cause file access problems on older PHPs?
4) Your problem is the code page change on Windows, so are you suggesting enabling it like
php --chcp some.php
on Windows? (or --chcp-off to disable)
5) In case of abnormal exit, user may execute "chcp CODEPAGE". This could be documentation problem. Is this enough for you?
@anrdaemon, i'm really not sure about Far manager - it is a good software, and it also can support Unicode. Any program can update codepage at runtime, but that's not going to work well, especially if no TrueType font is used. Nevertheless, I've pushed a patch to support output_encoding different from the internal one, please test latest snapshots. There is no default behavior change. You can experience any kind of mojibake, if you work internally with the encoding different from the output, but that is your responsibility then.
Yasuo, I've also asked the reporter in bug #73716 to check the use case with a Japanese locale. At the time I was testing the multibyte implementation, I was able to verify it, and it seems to be a quite weird issue with both the codepage and the font. I guess this can also fix issues with similar multibyte codepages, however I have seen no reports yet. I'm going to document the crashing case in UPGRADING, and also the changes done in this patch, if it goes well.
Related To: Bug #73716
Oops, bug #72555 is what I wanted to mention.
It is known among Japanese users that the console (cmd.exe and PowerShell) font should be "MS ゴシック" (MS Gothic) to work with UTF-8/UTF-16. So I tried to set it to "MS ゴシック" on my Windows 10/7, but there is no choice for "MS ゴシック", only "Consolas" (TrueType), "Lucida Console" (TrueType) and "ラスターフォント" (Raster Font - non TrueType). In recent versions of Windows, it seems only fonts predefined and associated with the codepage are selectable.
Raster Font doesn't work (I got a "The system cannot write to the specified device." error with a Japanese file name). TrueType fonts work but have no Japanese glyphs, so I got □ (tofu - glyph not found) for Japanese characters. However, the text is encoded correctly, so text copied and pasted from the console is readable (correct, not broken) in a UTF aware editor etc.
Even if there is a font issue for the console, it seems PHP 7.1 should work well. (Much better than older PHP, at least for Japanese users.)
On my side, when I switch to "Japanese (Japan)" as the system codepage, so it's 932, "MS ゴシック" is the default console font. The empty boxes would indeed appear in cmd if the system codepage is an incompatible one and the font contains no glyphs. For example, with chcp 437 it won't even let me chcp 932 or other multibyte codepages, and with 65001 the Japanese glyphs are still missing.
For that case, however, a custom font that contains more glyphs can be installed for the console. Or some more advanced console emulator can be used. It can probably take more effort with Far, as it seems, but ConEmu showed me most of the glyphs I required while writing tests on the machine with the 437 system codepage. There are many other good terminal emulators out there, but we still should be cmd.exe oriented.
Thanks for checking, Yasuo. @algo13 was also supporting the development by checks and hints, that's where #72555 came from. One or another not critical issue is still present unfortunately, but the main goal is still to improve and extend the UTF-8 support, not moving back into the stone age of ANSI.
@yohgaki, no, the answer is way simpler.
The path to the PHP binaries is not only non-Unicode, it is rather short and contains no spaces.
I just didn't bother to properly install extensions for the first run. Extensions relying on external DLLs crashed the party and I noticed the broken console.
@ab, you are not sure about Far Manager because it's a good program? :)
Jokes aside, I checked 7.1.r33e96c9 and it no longer corrupts console when changing internal_encoding.
There's some related behavior I'm chasing, but present issue is now resolved.
As a fallout of our discussion, I feel the issue needs a deeper revision and a clearer definition of meaning for each of the four settings.
Ideally, an automatic conversion between encodings, where it is sensible.
Where it is more appropriate to discuss these parts of PHP behavior?
> As a fallout of our discussion, I feel the issue needs a deeper revision
> and a clearer definition of meaning for each of the four settings.
> Ideally, an automatic conversion between encodings, where it is sensible.
input_encoding/output_encoding are for the web environment now.
mbstring/iconv may convert input encoding except uploaded files.
mbstring/iconv may convert output encoding if output's MIME type is text/*.
Under web environment, we can safely determine what's text and what's not.
Unlike the web environment, we cannot assume/determine what the input and output are under the CLI environment, e.g. a pipe. We may add an optional conversion feature for CLI, but I don't feel it is mandatory. It should be an option at least. However, if encoding conversion is optional, we may use other tools like the "iconv", "nkf" or "lv" commands for pipes.
IIRC, someone is trying to detect whether STDIN/STDOUT is a tty or not. I don't know the status. If the STDIN/STDOUT device (and locale) can be detected in a reliable manner, we may do something for STDIN/STDOUT.
BTW, I didn't know the details about Windows CLI changes until now, but I think new CLI is a lot nicer for multibyte char users. I haven't tried it yet, though.
> Where it is more appropriate to discuss these parts of PHP behavior?
If you have concrete idea about encoding handling improvement, firstname.lastname@example.org would be the place.
Sounds good then. Thanks for the test, @anrdaemon. The console codepage is now handled separately, but is still the same as the internal one if input_encoding and output_encoding are empty, which is the default case. I'd still like to know more about the crashes you mentioned, as that should be fixed in the first place, for more stability at least. If you're an active Windows user, please test the release candidates and give feedback, that's essential.
With Far - i was actively using it in the old dark times, when moved away from DOS and Norton Commander. Really didn't touch it for a while, but I know the community is still active and the tool itself is very powerful.
There is an accepted RFC, coincidentally written by Yasuo https://wiki.php.net/rfc/default_encoding :) The change for console encoding is IMO in the scope of this RFC. Though, the Windows console itself is likely to behave unpredictably if only one of the in or out codepages is changed; it's likely to require both. Plus, the mojibake mentioned. Most other platforms have UTF-8 by default on the console, anyway. I still recommend using one encoding for all the things on the console, not necessarily UTF-8, though that's the most modern and stable variant. For 7.2 - yeah, there's already a portable stream_isatty() in userspace, so redirection is detectable. That also gives the base for further console improvements on Windows, at least.
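As a rough userland illustration of how redirection detection could feed into output conversion: the sketch below transcodes only when STDOUT is an interactive terminal. It assumes PHP 7.2+ for stream_isatty() and a loaded mbstring extension; for_stdout() is a hypothetical helper and CP866 is just the reporter's console code page, neither is anything PHP does by itself.

```php
<?php
// Sketch only, assuming PHP 7.2+ (stream_isatty) and the mbstring extension.
// for_stdout() is a hypothetical helper; CP866 is the reporter's console code page.

function for_stdout(string $utf8Text, bool $isTty, string $consoleEncoding): string
{
    // Redirected output (file, pipe) keeps the internal UTF-8 bytes;
    // only text headed for an interactive console is transcoded.
    return $isTty ? mb_convert_encoding($utf8Text, $consoleEncoding, 'UTF-8') : $utf8Text;
}

// STDOUT is only defined by the CLI SAPI, so guard both lookups.
$isTty = defined('STDOUT') && function_exists('stream_isatty') && stream_isatty(STDOUT);

$line = "\u{43F}\u{440}\u{438}\u{432}\u{435}\u{442}"; // "привет" in UTF-8
echo for_stdout($line, $isTty, 'CP866'), PHP_EOL;
```

Passing the tty flag in as a parameter keeps the conversion decision separate from the detection, so either part can be swapped out.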
I will wait for the other ticket yet, and put some docs and news then.
Marking as duplicate, as it is basically same fix as bug #73594 which came first.
Mistake, dup of #72555.