[2013-08-25 01:25 UTC] ww dot galen at gmail dot com
Description:
------------
When `mb_convert_case()` is used in MB_CASE_TITLE mode, the first letter character in a quotation immediately
following the quotation mark isn't uppercased.
According to the Unicode 4 standard, § 3.13, when title-casing a string the first *cased character* in a word is
supposed to be converted to title case. In the PHP implementation (`php_unicode_convert_case()` in
php_unicode.c), the first *character* is converted to title case. Coupled with the fix for #46626 (which treats
characters in "Punctuation, Other" as word characters), the quotation mark itself is what gets converted to title
case, rather than the first letter following it.
Tested with PHP 5.4, but the same code is present in every version of PHP since 5.2.7 (including 5.5) as well.
Test script:
---------------
<?php
echo mb_convert_case("\"or else it doesn't, you know. the name of the song is called 'haddocks' eyes.'\"\n", MB_CASE_TITLE);
Expected result:
----------------
"Or Else It Doesn't, You Know. The Name Of The Song Is Called 'Haddocks' Eyes.'"
Actual result:
--------------
"or Else It Doesn't, You Know. The Name Of The Song Is Called 'haddocks' Eyes.'"
Unicode specifies the following algorithm for title-casing:

toTitlecase(X): Find the word boundaries in X according to Unicode Standard Annex #29, "Unicode Text Segmentation." For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

The part about title-casing the first cased character would be simple enough to implement. However, correct detection of word boundaries is pretty tricky and I don't think we're going to support this.

There is also an older obsolete algorithm from UTR #21:

For each character C, find the preceding character B.
    ignore any intervening case-ignorable characters when finding B.
If B exists, and is cased
    map C to UCD_lower(C)
Otherwise,
    map C to UCD_title(C)

Using this algorithm would also resolve the issue, because:

* At the "o" after '"' the previous non-case-ignorable character is '"', which is not cased, so we convert to title-case.
* At the "t" after "'" the previous non-case-ignorable character is "n" (because "'" is case-ignorable), which is cased, so we convert to lower-case.
* At the "h" after "'" the previous non-case-ignorable character is " " (because "'" is case-ignorable), which is not cased, so we convert to title-case.

This would also fix bug #71298, because U+2019 is case-ignorable.

It would not resolve the "5th" issue in #36311, because at the "t" the previous non-case-ignorable character is "5", which is not cased, so this would be converted to "5Th". However, the result would be the same using the current title-casing algorithm: under UAX #29 rule WB 10 there is no word boundary inside "5th", so the first cased character after the word boundary before "5" would be "t", giving the same result. As such, I would say that the current result in bug #36311 is correct.
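
For illustration, here is a rough userland sketch of the UTR #21 rule quoted above. It is not the code in php_unicode.c and not a proposed patch. The function names (utr21_is_cased, utr21_is_case_ignorable, utr21_title_case) are made up for this example, and the two character classes only approximate the Unicode Cased and Case_Ignorable properties using PCRE general categories plus a handful of mid-word punctuation marks.

```php
<?php
// Sketch of the UTR #21 title-casing rule. All names here are hypothetical.

// ASSUMPTION: approximates "Cased" as Lu/Ll/Lt only; the real property also
// covers Other_Uppercase and Other_Lowercase characters.
function utr21_is_cased(string $ch): bool
{
    return preg_match('/^[\p{Lu}\p{Ll}\p{Lt}]\z/u', $ch) === 1;
}

// ASSUMPTION: approximates "Case_Ignorable" as Mn/Me/Cf/Lm/Sk plus a few
// common mid-word punctuation marks (' . : U+00B7 U+2018 U+2019).
function utr21_is_case_ignorable(string $ch): bool
{
    return preg_match(
        '/^[\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}\'.:\x{00B7}\x{2018}\x{2019}]\z/u',
        $ch
    ) === 1;
}

function utr21_title_case(string $s): string
{
    // Split the UTF-8 string into individual code points.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // True when B (the nearest preceding non-case-ignorable character)
    // exists and is cased. Initially there is no B.
    $prevCased = false;
    $out = '';

    foreach ($chars as $c) {
        // B cased: lower-case C. Otherwise: title-case C.
        $out .= $prevCased
            ? mb_convert_case($c, MB_CASE_LOWER, 'UTF-8')
            : mb_convert_case($c, MB_CASE_TITLE, 'UTF-8');

        // Case-ignorable characters are skipped when looking for B,
        // so they leave the state untouched.
        if (!utr21_is_case_ignorable($c)) {
            $prevCased = utr21_is_cased($c);
        }
    }

    return $out;
}

echo utr21_title_case("\"or else it doesn't, you know. the name of the song is called 'haddocks' eyes.'\""), "\n";
echo utr21_title_case("5th"), "\n";
```

With these approximations the first call should print the expected result from the report, and the second should print 5Th, as described for #36311.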