| Bug #49687 | utf8_decode xml_utf8_decode vuln | ||||
|---|---|---|---|---|---|
| Submitted: | 27 Sep 11:20am UTC | Modified: | 19 Oct 9:37am UTC | ||
| From: | sird at rckc dot at | Assigned to: | scottmac | ||
| Status: | Assigned | Category: | *Unicode Issues | ||
| Version: | 5.2.11 | OS: | * | ||
| Votes: | 5 | Avg. Score: | 5.0 ± 0.0 | Reproduced: | 5 of 5 (100.0%) |
| Same Version: | 3 (60.0%) | Same OS: | 2 (40.0%) | ||
[27 Sep 11:20am UTC] sird at rckc dot at
[28 Sep 7:38pm UTC] sjoerd@php.net
Is this a bug in PHP or in scripts which do utf8_decode(addslashes()) instead of addslashes(utf8_decode())? What do you propose to solve this bug?
[29 Sep 1:58am UTC] sird at rckc dot at
It is a PHP bug: the function is not decoding correctly. Check the ppt and the Acunetix blog for details.
There are several bugs in the code. One of them is that the variable holding the value of the char overflows (it tries to put 21 bits into a 16-bit int). The code also does not check whether the result is a valid Unicode char (reading the Unicode specification should explain it). The example root@80sec gave you was an overlong UTF-8 representation of a single quote. That is forbidden by Unicode, and the char should be transformed to ?.
Also, the code is not checking whether the bytes are valid UTF-8, so stuff like:
<img alt="\x90" title=" src=x:x onerror=alert(1)//">
is going to be transformed to:
<img alt="? title=" src=x:x onerror=alert(1)//">
This is a very serious vulnerability, and there are several bugs in the same function (there's even unreachable code). You can check the UTF-8 implementations in Mozilla or WebKit; they do it right. Don't use Java as a reference, since it is also flawed.
Because PHP is for web applications, UTF-8 is widely used, and this allows an attacker to do all types of attacks (from SQL injection to XSS), it's imperative to fix that function. Greetings!!
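To make the overlong case concrete, here is a minimal sketch of a naive two-byte decoder. The helper names `naive_decode2` and `is_overlong2` are hypothetical and this is not PHP's code; the bytes 0xC0 0xA7 are one possible overlong form of a single quote, not necessarily the ones used in the original report.

```c
/* Naive two-byte decode: masks the payload bits and never checks that
 * the second byte is a continuation byte or that the value really
 * needed two bytes. Hypothetical helper, not taken from PHP. */
static unsigned int naive_decode2(const unsigned char *s)
{
    return ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
}

/* A two-byte sequence is overlong when it encodes a value below 0x80,
 * because such values must be written as a single byte. */
static int is_overlong2(unsigned int c)
{
    return c < 0x80;
}
```

Fed the bytes 0xC0 0xA7, `naive_decode2` yields 0x27, the single quote, so a filter that only looked for the byte 0x27 before decoding would have missed it. A decoder that applies a check like `is_overlong2` can emit ? instead.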
[29 Sep 4:56am UTC] rasmus@php.net
> there are several bugs in the code, one of them is that a variable holding the value of the char is overflowed (trying to put 21 bits in a 16 bits int)
That was fixed in 5.2.11
[29 Sep 5:29am UTC] sird at rckc dot at
The rest is still dangerous: consuming bytes that do not have the 10xx xxxx form is against the spec, and so is accepting overlong UTF-8.
[16 Oct 1:36am UTC] sird at rckc dot at
rasmus@php.net: It has come to my attention that this hasn't been fixed. unsigned int has a size of 16 bits; don't take my word for it:
http://www.acm.uiuc.edu/webmonkeys/book/c_guide/1.2.html
Section: 1.2.2 Variables
unsigned int 16 bits
I just downloaded PHP 5.2.11, and I quote the code:
// php-5.2.11.tar.bz2/php-5.2.11/ext/xml/xml.c#558
PHPAPI char *xml_utf8_decode( // ...
{
    int pos = len;
    char *newbuf = emallo // ...
    unsigned int c; // sizeof(unsigned int)==16 bits
    char (*decoder)(unsig // ...
    xml_encoding *enc = x // ...
    // ...
    // #580
    c = (unsigned char)(*s);
    if (c >= 0xf0) { /* four bytes encoded, 21 bits */
        if(pos-4 >= 0) {
            c = ((s[0]&7)<<18) | ((s[1]&63)<<12) | ((s[2]&63)<<6) | (s[3]&63);
        } else {
            c = '?';
        }
        s += 4;
        pos -= 4;
    // ...
Also, no checking at ALL is made on the trailing bytes (they should be in the form 10xx xxxx). The check is very easy: to test whether s[1] has the correct form, AND it with 1100 0000 and compare the result with 1000 0000:
(s[1]&0xC0)==0x80
Also, overlong UTF-8 is not being taken care of; that's, yeah, yet another vulnerability. Greetings!!
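The 10xx xxxx test described above can be sketched as a small helper. The name `is_continuation` is an assumption for illustration and does not appear in xml.c.

```c
/* True when b has the form 10xx xxxx, i.e. a UTF-8 continuation byte.
 * Hypothetical helper; not part of PHP's xml.c. */
static int is_continuation(unsigned char b)
{
    /* The parentheses matter: in C, == binds tighter than &, so
     * b & 0xC0 == 0x80 parses as b & (0xC0 == 0x80), which is b & 0
     * and therefore always false. */
    return (b & 0xC0) == 0x80;
}
```

Bytes 0x80 through 0xBF pass the test; an ASCII byte such as 0x27 or a lead byte such as 0xC0 does not.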
[16 Oct 3:32am UTC] scottmac@php.net
On a 16-bit processor an int might be 16-bit; if you can get PHP to compile on one, then well done :-) Did you even try running the test code?
[16 Oct 3:41am UTC] sird at rckc dot at
Oops! You are right :) The code before was unsigned short. Still, the other vulnerabilities remain. I've made a blog post that explains the other issues ;)
http://sirdarckcat.blogspot.com/2009/10/couple-of-unicode-issues-on-php-and.html
I updated the post to note that the last bug was fixed in 5.2.11. Greetings!!
[16 Oct 4:01am UTC] scottmac@php.net
PHP 5 has binary strings, not utf-8 strings. It does not attempt to do any validation on input, so expecting addslashes to magically validate things as utf-8 is wrong, simple as. I agree that utf8_decode should do proper validation here, though that validation is going to add overhead. So I've coded up a utf8_validate function. Still need to sort out some of the behaviour first.
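For readers wondering what such a validation pass involves, here is a rough sketch. It is not the utf8_validate function mentioned above, whose behaviour was still undecided; the name `utf8_is_valid` and its signature are assumptions for illustration only.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8, else 0.
 * Hypothetical sketch; name and signature are assumptions. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned int c, min;
        size_t n, k;

        if (b < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            n = 2; c = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            n = 3; c = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            n = 4; c = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead byte */
        }

        if (len - i < n)
            return 0;                   /* sequence is truncated */

        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not of the form 10xx xxxx */
            c = (c << 6) | (s[i + k] & 0x3F);
        }

        if (c < min)
            return 0;                   /* overlong encoding */
        if (c > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                   /* UTF-16 surrogate */

        i += n;
    }
    return 1;
}
```

The per-byte cost is a mask, a compare, and a shift, so the pass is linear in the input length. Whether that cost is acceptable inside utf8_decode is the design question being argued in this thread.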
[16 Oct 4:41am UTC] sird at rckc dot at
I disagree.. how slow can it be to add 2 bit operations..
    } else if (c < 0x800) {
change to
    } else if (c < 0x800) {
        if ((s[1] & 0xC0) != 0x80) { // this is a new operation
            newbuf[(*newlen)++] = '?'; // these are not new operations
            pos--; // these are not new operations
            s++; // these are not new operations
            continue;
        }
    }
Besides, all real implementations do what the spec says they should do.
This is not about validating that the input is valid Unicode; it's that
Unicode says the decoding algorithm SHOULD do the check. Not doing it in
PHP is just nuts.
[16 Oct 4:45am UTC] sird at rckc dot at
Oh, my mistake:
    else if (c < 0x800) {
        newbuf[(*newlen)++] = (0xc0 | (c >> 6));
        newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
    }
should be:
    else if (c < 0x800) {
        if ((s[1] & 0xC0) != 0x80) {
            newbuf[(*newlen)++] = '?';
        } else {
            newbuf[(*newlen)++] = (0xc0 | (c >> 6));
            newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
        }
    }
[16 Oct 4:52am UTC] sird at rckc dot at
Oh, duh! I'm reading the wrong function.. :( Sorry
    if (pos-2 >= 0 && (s[1]&0xC0) == 0x80) {
        c = ((s[0]&7)<<18) | ((s[1]&63)<<12) | ((s[2]&63)<<6) | (s[3]&63);
    } else {
        c = '?';
    }
[16 Oct 4:53am UTC] sird at rckc dot at
My last post, I promise.. it should say:
    c = ((s[0]&63)<<6) | (s[1]&63);
Greetz!
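Putting the pieces of this exchange together, a checked two-byte decode step might look like the sketch below. The function `decode2_checked` and its `avail` and `out` parameters are hypothetical, and the overlong check goes beyond what was proposed in the posts above. The lead byte is masked with 0x1F here; for lead bytes 0xC0 to 0xDF that gives the same result as the &63 mask quoted in the thread.

```c
/* Decodes one two-byte sequence at s, where avail bytes remain.
 * Stores the code point (or '?') in *out and returns the number of
 * input bytes consumed. Hypothetical helper, not PHP's code. */
static int decode2_checked(const unsigned char *s, int avail, unsigned int *out)
{
    /* Need two bytes, and the second must have the form 10xx xxxx. */
    if (avail >= 2 && (s[1] & 0xC0) == 0x80) {
        unsigned int c = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        if (c >= 0x80) {        /* values below 0x80 would be overlong */
            *out = c;
            return 2;
        }
    }
    *out = '?';
    return 1;                   /* consume only the offending lead byte */
}
```

Consuming a single byte on failure means a following quote or angle bracket is never swallowed, which is the behaviour the `<img alt="\x90" ...>` example earlier in the thread depends on.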
