Request #30800 UNICODE support to name variables and other PHP labels
Submitted: 2004-11-15 20:03 UTC Modified: 2005-03-07 09:35 UTC
Votes:7
Avg. Score:2.0 ± 1.6
Reproduced:2 of 7 (28.6%)
Same Version:1 (50.0%)
Same OS:1 (50.0%)
From: jmmolina at free dot fr Assigned:
Status: Wont fix Package: Feature/Change Request
PHP Version: 5.0.2 OS: Windows 2000 Pro
Private report: No CVE-ID: None

 [2004-11-15 20:03 UTC] jmmolina at free dot fr
Description:
------------
From the variables chapter we can read :

« Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

    Note: For our purposes here, a letter is a-z, A-Z, and the ASCII characters from 127 through 255 (0x7f-0xff). »
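The rule quoted above works on bytes, not on characters. A minimal sketch of that behaviour, assuming PHP 7 or later for the type declarations; `is_valid_label()` is a hypothetical helper written for this illustration, not a PHP function, and it only applies the pattern from the manual without the `/u` modifier:

```php
<?php
// Hypothetical helper: applies the manual's label pattern byte by byte.
// Without the /u modifier, PCRE treats the subject as a sequence of bytes.
function is_valid_label(string $name): bool
{
    return preg_match('/\A[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/', $name) === 1;
}

$candidates = [
    "nom",           // plain ASCII letters
    "_tmp1",         // leading underscore, trailing digit
    "9lives",        // starts with a digit
    "human class",   // contains a space
    "n\xE6ud",       // single byte 0xE6 (the ISO-8859-1 code for "æ")
    "n\xC5\x93ud",   // "nœud" with "œ" encoded as UTF-8 (bytes 0xC5 0x93)
];

foreach ($candidates as $name) {
    // bin2hex() makes the raw bytes visible regardless of terminal encoding.
    echo bin2hex($name), ' => ', var_export(is_valid_label($name), true), "\n";
}
```

Under this pattern, any byte from 0x7f through 0xff counts as a "letter", so a multi-byte UTF-8 sequence passes as long as every one of its bytes falls in that range.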

As many languages use other character sets, I was wondering whether there is any plan to support UNICODE in the names of variables and other PHP labels. As I told Zend support: « It would allow French and Asian developers to write their scripts in their own language. Not sure about the impact on performance though. » For example, in French we often use the « œ » character. We call it « e in o » because it looks like « oe », and it is used in words like « cœur » (heart) or « nœud » (node). The problem is that some characters are supported and others are not. For example, the « æ » character, « e in a », is part of the extended ASCII (ISO-8859-1) character set, where its character code is 0xE6. This means PHP does support scripts using this character, because 0xE6 is between 0x7f and 0xff. But « œ » is not in that character set; it is the « Latin Small Ligature Oe » UNICODE character.

To sum things up, the idea is to allow developers to write PHP scripts in their natural language. French developers would be able to write scripts in French, using our weird « Latin Small Ligature » characters, and Chinese and Japanese developers would be able to use their favourite kanji to name their variables, classes...

I think the PHP team chose this regular expression to improve script parsing performance, but I'm sure there is a way to support UNICODE. It could be an option enabled from the PHP configuration file, for example, or from an Apache .htaccess file. Besides the performance penalty, there might be another problem. Allowing the whole UNICODE character set means we would be able to name our variables « cœur♥ » (the last character is U+2665, a black heart) or « ♀♂ » (female and male symbols) instead of « human_class ». I'm sure the PHP team will point out other issues, but as I'm not a hardcore Zend engine developer, that's all I can think of :).

I attach the « nœud » class script if you want to try it. The PHP parser returns a parse error at line 9 (« private $nom; »).

Jean-Marc Molina.

Reproduce code:
---------------
<?php

/**
Classe nœud.
*/

class nœud
{
	private $nom;
	
	public function __construct ()
	{
		$this->nom = "Nœud sans nom";
	}
}

?>


 [2004-11-15 20:39 UTC] derick@php.net
Something like this is under consideration, most likely for PHP 5.2.
 [2005-03-07 09:35 UTC] derick@php.net
You can already do this, just encode your script as UTF-8.
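
A minimal sketch of what this suggestion amounts to, assuming the file is saved as UTF-8 without a byte-order mark and run on PHP 7 or later. The class mirrors the reproduce code above; the `nom()` accessor is added here for illustration only and is not part of the original script:

```php
<?php
// Assumption: this file is saved as UTF-8, so "œ" is stored as the two
// bytes 0xC5 0x93, both of which fall inside the 0x7f-0xff "letter" range.
class nœud
{
    private $nom;

    public function __construct()
    {
        $this->nom = "Nœud sans nom";
    }

    // Hypothetical accessor, added only so the example has something to print.
    public function nom(): string
    {
        return $this->nom;
    }
}

$n = new nœud();
echo $n->nom(), "\n";

// Shows the raw bytes of the class name as hex digits.
echo bin2hex(get_class($n)), "\n";
```

The parser never interprets these bytes as UNICODE characters. It only sees that each byte is allowed in a label, which is why the encoding of the source file matters.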
 