WordPress.org

Taxonomy Short Description
[resolved] Multibyte support and breaks within words (7 posts)

  1. toscho
    Member
    Plugin Contributor

    Posted 3 years ago #

    Hi,

    I want to suggest three enhancements.

    1. The current version 1.0 breaks words in the middle, and it uses strlen(), which is not multibyte-aware.
    Test it with a description that has an odd number of bytes, like:

    1öööööööööööööööööööööööööööööööööööööööööö

    The result will contain invalid UTF-8.

    2. Languages like Chinese don’t use white space (or only rarely). In all other languages, I would rather not cut words in the middle and instead break at the last white space. Otherwise you may end up with a single character from the last word, which isn’t very useful.

    3. The last two characters should be a non-breaking space and a real ellipsis (…), not just three dots (...).
    Details matter. :)
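
The first point can be reproduced with a short script. This is a minimal sketch, not code from the plugin: the helper name `is_valid_utf8()` is made up for illustration, and the cut-off of 10 bytes is an arbitrary value chosen so that the cut lands inside a character.

```php
<?php
// Hypothetical helper: with the /u modifier, preg_match() returns false
// (not 1) when the subject is not valid UTF-8.
function is_valid_utf8( $string ) {
	return 1 === preg_match( '//u', $string );
}

// "1" is one byte in UTF-8, each "ö" is two bytes (0xC3 0xB6).
$description = '1' . str_repeat( "\xC3\xB6", 42 );

$byte_count = strlen( $description );                   // counts bytes
$char_count = preg_match_all( '/./us', $description );  // counts characters

// Byte-based cut at an arbitrary limit of 10: keeps "1", four whole "ö",
// and only the first byte of the fifth "ö".
$cut = substr( $description, 0, 10 );

printf( "bytes: %d, characters: %d\n", $byte_count, $char_count );
printf( "byte-based cut is valid UTF-8: %s\n", is_valid_utf8( $cut ) ? 'yes' : 'no' );
```

Any byte-oriented function (strlen(), substr(), substr_replace()) behaves this way on UTF-8 input, so whether a given limit produces a broken character depends on where the multibyte characters happen to fall.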

    So I made some changes to the function taxonomy_short_description_shorten():

    function taxonomy_short_description_shorten( $string, $length = 23, $append = '…' ) {
    	$string = strip_tags( $string );
    	$string = trim( $string );
    	$string = html_entity_decode( $string, ENT_QUOTES, 'UTF-8' );
    	$string = rtrim( $string, '-' );
    
    	// toscho edit
    
    	if ( ! function_exists( 'mb_substr' ) )
    	{// original return call
    		return ( strlen( $string ) > absint( $length ) )
    			? substr_replace( $string, $append, absint( $length ) ) : $string;
    	}
    	// enhancements
    	// count the real characters
    	$s_length = strlen( utf8_decode( $string ) );
    
    	if ( $s_length <= $length )
    	{
    		return $string;
    	}
    	// shorten the string to max-length
    	$string = mb_substr( $string, 0, $length, 'utf-8' );
    
    	// avoid breaks within words
    	// find the last white space (offset is the third argument, encoding the fourth)
    	$pos = mb_strrpos( $string, ' ', 0, 'utf-8' );
    
    	// No space? One long word. Or Chinese/Korean/Japanese text.
    	if ( $pos !== FALSE )
    	{
    		// shorten the string to the last space
    		$string = mb_substr( $string, 0, $pos, 'utf-8' )
    			// no break space, verbose notation for readability.
    			// plus a real ellipsis
    			. "\xC2\xA0" . $append;
    	}
    
    	return $string;
    }

    Regards
    Thomas Scholz

    http://wordpress.org/extend/plugins/taxonomy-short-description/

  2. Michael Fields
    Themer
    Plugin Author

    Posted 3 years ago #

    toscho,
    Hi, thanks for the suggestions! I'm going to test this and include it in the plugin, and add you as a contributor. As you probably already know, I'm not very experienced with coding for other languages, and I really appreciate your insight.

  3. Michael Fields
    Themer
    Plugin Author

    Posted 3 years ago #

    Do you think that it's worth including the original return call? Currently the minimum required version of PHP to run WordPress is 4.3, and it looks like PHP has supported mb_substr() since 4.0.6. It seems to me that if a user can run WordPress, they will have access to mb_substr(), but I could be wrong.

    if ( ! function_exists( 'mb_substr' ) )
    {// original return call
    	return ( strlen( $string ) > absint( $length ) )
    		? substr_replace( $string, $append, absint( $length ) ) : $string;
    }

  4. Michael Fields
    Themer
    Plugin Author

    Posted 3 years ago #

    Here are my proposed edits to your modifications. I noticed that while using the string you provided:

    1öööööööööööööööööööööööööööööööööööööööööö

    that the ellipses were not appended so this had to be moved outside the conditional.

    I also moved a few things around and reduced the number of return statements to 1. Please let me know your thoughts. Thanks again for your contribution!

    function taxonomy_short_description_shorten( $string, $max_length = 23, $append = '…' ) {
    	$string = strip_tags( $string );
    	$string = trim( $string );
    	$string = html_entity_decode( $string, ENT_QUOTES, 'UTF-8' );
    	$string = rtrim( $string, '-' );
    
    	/* Count how many characters are in the string. */
    	$length = strlen( utf8_decode( $string ) );
    
    	/* String is longer than max-length. It needs to be shortened. */
    	if ( $length > $max_length ) {
    
    		/* Shorten the string to max-length */
    		$string = mb_substr( $string, 0, $max_length, 'utf-8' );
    
    		/* Avoid breaks within words: find the last white space.
    		The offset is the third argument, the encoding the fourth. */
    		$pos = mb_strrpos( $string, ' ', 0, 'utf-8' );
    
    		/* If a space was found, shorten the string to the last space.
    		No space means one long word or Chinese/Korean/Japanese text. */
    		if ( false !== $pos ) {
    			$string = mb_substr( $string, 0, $pos, 'utf-8' );
    		}
    
    		/* Append the value of $append, preceded by a non-breaking space. */
    		$string .= "\xC2\xA0" . $append;
    	}
    
    	return $string;
    }
  5. toscho
    Member
    Plugin Contributor

    Posted 3 years ago #

    Mbstring is a non-default extension. I had cases where it was missing, even on PHP 5. As far as I know WordPress doesn’t require the extension. There is a fallback for mb_substr() in /wp-includes/compat.php, but mb_strrpos() may be a little bit risky.
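
One way to avoid a hard dependency on mb_strrpos() is to guard the call and fall back to byte functions plus a character count. A rough sketch, assuming UTF-8 input; the function name `tsd_last_space_position()` is hypothetical and is not part of the plugin or of WordPress.

```php
<?php
/**
 * Hypothetical helper: character position of the last space in a UTF-8
 * string, or false if there is none. Uses mb_strrpos() when mbstring is
 * loaded and a PCRE-based fallback otherwise.
 */
function tsd_last_space_position( $string ) {
	if ( function_exists( 'mb_strrpos' ) ) {
		return mb_strrpos( $string, ' ', 0, 'UTF-8' );
	}

	// strrpos() is safe for finding an ASCII space: in UTF-8, the byte
	// 0x20 never occurs inside a multibyte sequence.
	$byte_pos = strrpos( $string, ' ' );
	if ( false === $byte_pos ) {
		return false;
	}

	// Convert the byte offset into a character offset by counting the
	// characters in front of the space.
	return preg_match_all( '/./us', substr( $string, 0, $byte_pos ) );
}
```

The result can be passed to mb_substr() as the length argument in the same way as the return value of mb_strrpos().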

    > I noticed that while using the string you provided […] that the ellipses were not appended so this had to be moved outside the conditional.

    I wouldn’t append an ellipsis to such a string. More than 20 characters without white space occur mostly in languages like Chinese or Korean where typography follows other rules. You could try to be very cool and look into WPLANG to set the correct ellipsis (and white space). There was an interesting discussion in a Mozilla group on this topic.
    But doing nothing is probably better than doing it wrong.

    > I also moved a few things around and reduced the number of return statements to 1.

    I prefer early exits to avoid unnecessary indentations. But that’s a matter of style.

  6. Michael Fields
    Themer
    Plugin Author

    Posted 3 years ago #

    > Mbstring is a non-default extension. I had cases where it was missing, even on PHP 5. As far as I know WordPress doesn’t require the extension. There is a fallback for mb_substr() in /wp-includes/compat.php, but mb_strrpos() may be a little bit risky.

    That's really good to know. Thanks for the info.

    > I wouldn’t append an ellipsis to such a string. More than 20 characters without white space occur mostly in languages like Chinese or Korean where typography follows other rules. You could try to be very cool and look into WPLANG to set the correct ellipsis (and white space). There was an interesting discussion in a Mozilla group on this topic.
    > But doing nothing is probably better than doing it wrong.

    I really didn't know this could be such an involved thing. I think it just might be a good idea to have $append default to a translatable string. I'll have to dig through core and see if I can figure out language detection... I just downloaded the Japanese version of WordPress and took a peek at /wp-content/languages/ja.po, and it uses three periods in all instances of "more...". Not sure what the best way to handle this is. I definitely want it included for all languages that support it.

  7. toscho
    Member
    Plugin Contributor

    Posted 3 years ago #

    > I think it just might be a good idea to have $append default to a translatable string.

    Unfortunately, WP doesn’t offer the […] as a separate string in the language files. The core uses some … and, worse, sometimes [...] (three dots). Adding your own POT file for this one string looks like overkill to me.
    In most languages, U+2026 will be the correct character. Three separate dots are probably always wrong (at least in the languages I can read). Use just […] and encode the PHP file in UTF-8.

    > I just downloaded the Japanese version of WordPress and took a peek at /wp-content/languages/ja.po and it uses three periods in all instances of "more...".

    Japanese text may use […] or [……]; but [...] looks wrong. The language files aren’t written by typographers. ;)
    In Chinese, it is [……]. I couldn’t find a reference for Korean. Maybe they don’t cut their sentences? :)
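
The per-language choice discussed in this thread could be sketched as a small lookup instead of a separate translation file. This is only an illustration: the function name `tsd_ellipsis_for_locale()` and the locale codes are assumptions, and the mapping covers only what is stated above (two ellipsis characters for Chinese, U+2026 everywhere else).

```php
<?php
/**
 * Hypothetical helper: pick an ellipsis for a locale code.
 * Only the Chinese case from the discussion is special-cased.
 */
function tsd_ellipsis_for_locale( $locale ) {
	$ellipsis = "\xE2\x80\xA6"; // U+2026 HORIZONTAL ELLIPSIS, UTF-8 encoded

	// Locale codes such as "zh_CN" or "zh_TW" are assumed here.
	if ( 0 === strpos( strtolower( $locale ), 'zh' ) ) {
		return $ellipsis . $ellipsis;
	}

	return $ellipsis;
}

echo tsd_ellipsis_for_locale( 'de_DE' ), "\n";
echo tsd_ellipsis_for_locale( 'zh_CN' ), "\n";
```

Inside WordPress, the locale string could come from get_locale(). Whether a non-breaking space should precede the ellipsis in each language is a separate decision that this sketch does not cover.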

Topic Closed

This topic has been closed to new replies.
