PHP :: Bug #65544 :: mb title case conversion-first word in quotation isn't capitalized

Bug #65544

mb title case conversion-first word in quotation isn't capitalized

Submitted:

2013-08-25 01:25 UTC

Modified:

2017-07-28 12:09 UTC

Votes:	1
Avg. Score:	5.0 ± 0.0
Reproduced:	1 of 1 (100.0%)
Same Version:	1 (100.0%)
Same OS:	0 (0.0%)

From:

ww dot galen at gmail dot com

Assigned:

Status:

Closed

Package:

mbstring related

PHP Version:

5.4 and later

OS:

Private report:

CVE-ID:

None


[2013-08-25 01:25 UTC] ww dot galen at gmail dot com

Description:
------------
When `mb_convert_case()` is used in MB_CASE_TITLE mode, the first letter character in a quotation immediately 
following the quotation mark isn't uppercased.

According to the Unicode 4 standard, § 3.13, when title-casing a string the first *cased character* in a word is 
supposed to be converted to title case. In the PHP implementation (`php_unicode_convert_case()` in 
php_unicode.c), the first *character* is converted to title case. Coupled with the fix for #46626 (which makes 
characters in "Punctuation, Other" word characters), quotation marks are converted to title case, rather than 
the first letter following.
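
The interaction described above can be modelled in a few lines. This is only a rough sketch, not the code from php_unicode.c: the function name `legacy_title_case_model` is hypothetical, and the word-character class (letters, marks, and "Punctuation, Other") is an assumed simplification of the real property check. It upper-cases the first *character* of each run of word characters, so a leading quotation mark absorbs the title-casing.

```php
<?php
// Simplified model of the reported behaviour (assumption: word characters are
// approximated as letters, marks and "Punctuation, Other"; the real property
// set in php_unicode.c may differ).
function legacy_title_case_model(string $s): string
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    $inWord = false;
    foreach ($chars as $ch) {
        $isWordChar = preg_match('/^[\p{L}\p{M}\p{Po}]$/u', $ch) === 1;
        if (!$isWordChar) {
            // Anything else ends the current word and is copied unchanged.
            $inWord = false;
            $out .= $ch;
            continue;
        }
        // The first *character* of the word gets the case change, even if it
        // is a quotation mark that has no case; the rest is lower-cased.
        $out .= $inWord ? mb_strtolower($ch, 'UTF-8') : mb_strtoupper($ch, 'UTF-8');
        $inWord = true;
    }
    return $out;
}

echo legacy_title_case_model("\"or else it doesn't\""), "\n";
```

In this model the opening `"` counts as the first character of the word `"or`, so the `o` is treated as a later character and lower-cased.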


Tested with PHP 5.4, but the same code is present in every version of PHP since 5.2.7 (including 5.5).

Test script:
---------------
<?php
echo mb_convert_case("\"or else it doesn't, you know. the name of the song is called 'haddocks' eyes.'\"\n", MB_CASE_TITLE);


Expected result:
----------------
"Or Else It Doesn't, You Know. The Name Of The Song Is Called 'Haddocks' Eyes.'"


Actual result:
--------------
"or Else It Doesn't, You Know. The Name Of The Song Is Called 'haddocks' Eyes.'"


[2013-08-25 09:49 UTC] yohgaki@php.net

-Status: Open
+Status: Verified
-PHP Version: 5.4.19
+PHP Version: 5.4 and later

[2013-08-25 09:49 UTC] yohgaki@php.net

verified with PHP-5.5 and master branch

[2013-08-25 09:51 UTC] yohgaki@php.net

The question is: is this supposed to work?
There is a " directly before 'or', with no space before the word.

Expected result:
----------------
"Or Else It Doesn't, You Know. The Name Of The Song Is Called 'Haddocks' Eyes.'"

Any comments?

[2017-07-28 12:09 UTC] nikic@php.net

Unicode specifies the following algorithm for title-casing:

toTitlecase(X): Find the word boundaries in X according to Unicode Standard Annex #29, "Unicode Text Segmentation." For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

The part about title-casing the first cased character would be simple enough to implement. However, correct detection of word boundaries is pretty tricky and I don't think we're going to support this.

There is also an older obsolete algorithm from UTR #21:

For each character C, find the preceding character B.
    ignore any intervening case-ignorable characters when finding B.
If B exists, and is cased
    map C to UCD_lower(C)
Otherwise,
    map C to UCD_title(C)

Using this algorithm would also resolve the issue, because:

* At the "o" after '"' the previous non-case-ignorable character is '"', which is not cased, so we convert to title-case.
* At the "t" after "'" the previous non-case-ignorable character is "n" (because "'" is case-ignorable), which is cased, so we convert to lower-case.
* At the "h" after "'" the previous non-case-ignorable character is " " (because "'" is case-ignorable), which is not cased, so we convert to title-case.

This would also fix bug #71298, because U+2019 is case-ignorable.

It would not resolve the "5th" issue in #36311, because at the "t" the previous non-case-ignorable character is "5", which is not cased, so this would be converted to "5Th". However, the result would be the same using the current title-casing algorithm: Under UAX #29 rule WB 10 there is no word-boundary inside "5th", so that the first cased character after the word boundary before "5" would be "t", giving the same result. As such, I would say that the current result in bug #36311 is correct.
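
The UTR #21 rule quoted above can be sketched in userland PHP. This is a minimal illustration, not the code of the eventual fix: the names `is_cased_approx`, `is_case_ignorable_approx` and `utr21_title_case` are hypothetical, and both character classes are rough approximations of the Unicode "Cased" and "Case_Ignorable" properties (assumed here to be Lu/Ll/Lt, and a handful of mid-word punctuation marks plus Mn/Me/Cf/Lm/Sk, respectively).

```php
<?php
// Approximation of the Unicode "Cased" property (assumption: Lu, Ll, Lt only).
function is_cased_approx(string $ch): bool
{
    return preg_match('/^[\p{Lu}\p{Ll}\p{Lt}]$/u', $ch) === 1;
}

// Approximation of "Case_Ignorable" (assumption: apostrophes, full stop,
// colon, soft hyphen, middle dot, plus the Mn, Me, Cf, Lm and Sk categories).
function is_case_ignorable_approx(string $ch): bool
{
    return preg_match(
        '/^[\'.:\x{00AD}\x{00B7}\x{2018}\x{2019}\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}]$/u',
        $ch
    ) === 1;
}

function utr21_title_case(string $s): string
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    // True when B, the nearest preceding non-case-ignorable character,
    // exists and is cased.
    $prevCased = false;
    foreach ($chars as $ch) {
        $out .= $prevCased
            ? mb_strtolower($ch, 'UTF-8')
            : mb_convert_case($ch, MB_CASE_TITLE, 'UTF-8');
        // Case-ignorable characters are skipped when looking for B.
        if (!is_case_ignorable_approx($ch)) {
            $prevCased = is_cased_approx($ch);
        }
    }
    return $out;
}

echo utr21_title_case("\"or else it doesn't, you know. 'haddocks' eyes.'\""), "\n";
echo utr21_title_case("5th"), "\n";
```

Under these assumptions the first call capitalizes the letter after each opening quotation mark while leaving the "t" in "doesn't" lower-case, and the second call yields "5Th", as the comment above predicts for bug #36311.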

[2017-07-28 12:11 UTC] nikic@php.net

Related To: Bug #36311

[2017-07-28 12:58 UTC] nikic@php.net

Automatic comment on behalf of nikita.ppv@gmail.com
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f4a1d9c8211fa7878af14d0bd94b2deaab19ae21
Log: Fixed bug #65544 and #71298

[2017-07-28 12:58 UTC] nikic@php.net

-Status: Verified
+Status: Closed

[2017-07-28 12:59 UTC] nikic@php.net

Related To: Bug #71298


