Bug #65544 mb title case conversion-first word in quotation isn't capitalized
Submitted: 2013-08-25 01:25 UTC Modified: 2017-07-28 12:09 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:0 (0.0%)
From: ww dot galen at gmail dot com Assigned:
Status: Closed Package: mbstring related
PHP Version: 5.4 and later OS:
Private report: No CVE-ID: None
 [2013-08-25 01:25 UTC] ww dot galen at gmail dot com
Description:
------------
When `mb_convert_case()` is used in MB_CASE_TITLE mode, the first letter character in a quotation immediately 
following the quotation mark isn't uppercased.

According to the Unicode 4 standard, § 3.13, when title-casing a string the first *cased character* in a word is 
supposed to be converted to title case. The PHP implementation (`php_unicode_convert_case()` in 
php_unicode.c) instead converts the first *character* of the word. Coupled with the fix for #46626 (which treats 
characters in "Punctuation, Other" as word characters), the quotation mark is what gets title-cased, rather than 
the first letter following it.


Tested with PHP 5.4, but the same code is present in every version of PHP since 5.2.7 (including 5.5).

Test script:
---------------
<?php
echo mb_convert_case("\"or else it doesn't, you know. the name of the song is called 'haddocks' eyes.'\"\n", MB_CASE_TITLE);


Expected result:
----------------
"Or Else It Doesn't, You Know. The Name Of The Song Is Called 'Haddocks' Eyes.'"


Actual result:
--------------
"or Else It Doesn't, You Know. The Name Of The Song Is Called 'haddocks' Eyes.'"


 [2013-08-25 09:49 UTC] yohgaki@php.net
-Status: Open
+Status: Verified
-PHP Version: 5.4.19
+PHP Version: 5.4 and later
 [2013-08-25 09:49 UTC] yohgaki@php.net
verified with PHP-5.5 and master branch
 [2013-08-25 09:51 UTC] yohgaki@php.net
The question is: "is this supposed to work?"
There is a " directly before 'or', with no space between the quotation mark and the word.

Expected result:
----------------
"Or Else It Doesn't, You Know. The Name Of The Song Is Called 'Haddocks' Eyes.'"

Any comments?
 [2017-07-28 12:09 UTC] nikic@php.net
Unicode specifies the following algorithm for title-casing:

toTitlecase(X): Find the word boundaries in X according to Unicode Standard Annex #29, "Unicode Text Segmentation." For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

The part about title-casing the first cased character would be simple enough to implement. However, correct detection of word boundaries is pretty tricky and I don't think we're going to support this.
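For illustration only, the toTitlecase rule quoted above can be sketched in userland PHP. This is a rough sketch, not what mbstring does: it assumes the intl extension is loaded and relies on `IntlBreakIterator` for UAX #29 word segmentation. The function name `uax29_title_case` is hypothetical, and "cased" is approximated by the general categories Lu, Ll and Lt.

```php
<?php
// Hypothetical helper: title-case $s segment by segment, where the segments
// come from ICU's word break iterator (requires the intl extension).
function uax29_title_case(string $s): string {
    $it = IntlBreakIterator::createWordInstance('en');
    $it->setText($s);
    $out = '';
    foreach ($it->getPartsIterator() as $segment) {
        // Split the segment at its first cased character F, if there is one.
        // "Cased" is approximated here as general category Lu, Ll or Lt.
        if (preg_match('/^(.*?)([\p{Lu}\p{Ll}\p{Lt}])(.*)$/us', $segment, $m)) {
            $out .= $m[1]                                        // before F: unchanged
                . mb_convert_case($m[2], MB_CASE_TITLE, 'UTF-8') // F: title-cased
                . mb_strtolower($m[3], 'UTF-8');                 // after F: lower-cased
        } else {
            $out .= $segment; // no cased character in this segment
        }
    }
    return $out;
}
```

Because the segmentation is delegated to ICU, the tricky part (finding word boundaries) is not reimplemented here; that is exactly the part that would be hard to do inside mbstring.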

There is also an older obsolete algorithm from UTR #21:

    For each character C, find the preceding character B.
        ignore any intervening case-ignorable characters when finding B.
    If B exists, and is cased
        map C to UCD_lower(C)
    Otherwise,
        map C to UCD_title(C)

Using this algorithm would also resolve the issue, because:

 * At the "o" after '"' the previous non-case-ignorable character is '"', which is not cased, so we convert to title-case.
 * At the "t" after "'" the previous non-case-ignorable character is "n" (because "'" is case-ignorable), which is cased, so we convert to lower-case.
 * At the "h" after "'" the previous non-case-ignorable character is " " (because "'" is case-ignorable), which is not cased, so we convert to title-case.

This would also fix bug #71298, because U+2019 is case-ignorable.

It would not resolve the "5th" issue in #36311, because at the "t" the previous non-case-ignorable character is "5", which is not cased, so this would be converted to "5Th". However, the result would be the same using the current Unicode title-casing algorithm (toTitlecase above): under UAX #29 rule WB 10 there is no word boundary inside "5th", so the first cased character after the word boundary before "5" would be "t", giving the same result. As such, I would say that the current result in bug #36311 is correct.
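The UTR #21 rule and the three cases walked through above can be tried out with a small userland sketch. It is only an approximation and not the code from the actual fix: the names `is_cased`, `is_case_ignorable` and `utr21_title_case` are hypothetical, "cased" is approximated by the categories Lu, Ll and Lt, and "case-ignorable" by a handful of mid-word punctuation marks plus the categories Mn, Me, Cf, Lm and Sk.

```php
<?php
// Approximation of "cased": general category Lu, Ll or Lt.
function is_cased(string $ch): bool {
    return preg_match('/^[\p{Lu}\p{Ll}\p{Lt}]$/u', $ch) === 1;
}

// Approximation of "case-ignorable": apostrophe, full stop, colon, middle dot,
// U+2018, U+2019, plus general categories Mn, Me, Cf, Lm and Sk.
function is_case_ignorable(string $ch): bool {
    return preg_match(
        '/^[\'.:\x{00B7}\x{2018}\x{2019}\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}]$/u',
        $ch
    ) === 1;
}

function utr21_title_case(string $s): string {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    // True when the preceding non-case-ignorable character B exists and is cased.
    $prevCased = false;
    foreach ($chars as $c) {
        $out .= $prevCased
            ? mb_strtolower($c, 'UTF-8')                   // B is cased: lower-case C
            : mb_convert_case($c, MB_CASE_TITLE, 'UTF-8'); // otherwise: title-case C
        if (!is_case_ignorable($c)) {
            $prevCased = is_cased($c); // C becomes B for the next character
        }
    }
    return $out;
}

echo utr21_title_case("'haddocks' eyes"), "\n"; // 'Haddocks' Eyes
echo utr21_title_case("5th"), "\n";             // 5Th
```

Tracking a single boolean is enough because the rule only ever looks at the nearest preceding character that is not case-ignorable.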
 [2017-07-28 12:58 UTC] nikic@php.net
Automatic comment on behalf of nikita.ppv@gmail.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f4a1d9c8211fa7878af14d0bd94b2deaab19ae21
Log: Fixed bug #65544 and #71298
 [2017-07-28 12:58 UTC] nikic@php.net
-Status: Verified +Status: Closed
 