WordPress.org

[Resolved] [Plugin: Taxonomy Short Description] Multibyte support and breaks within words

  • Plugin Contributor toscho

    @toscho

    WCEU 2016 Contributor

    Hi,

    I want to suggest three enhancements.

    1. The current version 1.0 breaks words in the middle, and it uses strlen(), which is not multibyte-aware.
    Test it with a description that has an odd number of bytes, like:

    1öööööööööööööööööööööööööööööööööööööööööö

    The result will contain invalid UTF-8.

    2. Languages like Chinese don’t use white space (or only rarely). In all other languages, I would rather not cut words off, but break at the last white space instead. Otherwise you may end up with a single character from the last word, which isn’t very useful.

    3. The last two characters should be a non-breaking space and a real ellipsis (…), not just three dots (...).
    Details matter. 🙂
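
    The first point can be checked with a short sketch. It compares a byte-based cut with a character-based cut on the test string above. The helper name sketch_is_valid_utf8() and the limit of 24 are assumptions made for illustration; whether a byte cut lands inside a character depends on the limit and on the string. The character-based cut uses a /u regular expression here so the sketch runs without the mbstring extension.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// preg_match() with the /u flag returns false on malformed input.
function sketch_is_valid_utf8( $s ) {
	return 1 === preg_match( '//u', $s );
}

// The test string: one ASCII byte followed by 42 two-byte characters.
$description = '1' . str_repeat( "\xC3\xB6", 42 ); // "\xC3\xB6" is "ö" in UTF-8

// Assumption: a limit of 24 is chosen so the byte cut lands mid-character.
$limit = 24;

// Byte-based cut: keeps 24 bytes, the last one is half of an "ö".
$byte_cut = substr( $description, 0, $limit );

// Character-based cut: with /u, "." matches whole code points.
preg_match( '/^.{0,' . $limit . '}/us', $description, $m );
$char_cut = $m[0];

var_dump( strlen( $description ) );              // int(85)
var_dump( sketch_is_valid_utf8( $byte_cut ) );   // bool(false)
var_dump( sketch_is_valid_utf8( $char_cut ) );   // bool(true)
```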

    So I made some changes to the function taxonomy_short_description_shorten():

    function taxonomy_short_description_shorten( $string, $length = 23, $append = '…' ) {
    	$string = strip_tags( $string );
    	$string = trim( $string );
    	$string = html_entity_decode( $string, ENT_QUOTES, 'UTF-8' );
    	$string = rtrim( $string, '-' );
    
    	// toscho edit
    
    	if ( ! function_exists( 'mb_substr' ) )
    	{// original return call
    		return ( strlen( $string ) > absint( $length ) )
    			? substr_replace( $string, $append, absint( $length ) ) : $string;
    	}
    	// enhancements
    	// count the real characters
    	$s_length = strlen( utf8_decode( $string ) );
    
    	if ( $s_length <= $length )
    	{
    		return $string;
    	}
    	// shorten the string to max-length
    	$string = mb_substr( $string, 0, $length, 'utf-8' );
    
    	// avoid breaks within words
    	// find the last white space
    	// (the offset argument comes before the encoding)
    	$pos = mb_strrpos( $string, ' ', 0, 'utf-8' );
    
    	// No space? One long word. Or chinese/korean/japanese text.
    	if ( $pos !== FALSE )
    	{
    		// shorten the string to the last space
    		$string = mb_substr( $string, 0, $pos, 'utf-8' )
    			// no break space, verbose notation for readability.
    			// plus a real ellipsis
    			. "\xC2\xA0" . $append;
    	}
    
    	return $string;
    }

    Regards
    Thomas Scholz

    http://wordpress.org/extend/plugins/taxonomy-short-description/

Viewing 6 replies - 1 through 6 (of 6 total)
  • Plugin Author Michael Fields

    @mfields

    toscho,
    Hi, thanks for the suggestions! I’m going to test this out, include it in the plugin, and add you as a contributor. As you probably already know, I’m not very experienced with coding for other languages and I really appreciate your insight.

    Plugin Author Michael Fields

    @mfields

    Do you think that it’s worth including the original return call? Currently the minimum required version of PHP to run WordPress is 4.3, and it looks like PHP has supported mb_substr() since 4.0.6. It seems to me that if a user can run WordPress, they will have access to mb_substr(), but I could be wrong.

    if ( ! function_exists( 'mb_substr' ) )
    {// original return call
    	return ( strlen( $string ) > absint( $length ) )
    		? substr_replace( $string, $append, absint( $length ) ) : $string;
    }

    Plugin Author Michael Fields

    @mfields

    Here are my proposed edits to your modifications. I noticed that while using the string you provided:

    1öööööööööööööööööööööööööööööööööööööööööö

    that the ellipses were not appended so this had to be moved outside the conditional.

    I also moved a few things around and reduced the number of return statements to 1. Please let me know your thoughts. Thanks again for your contribution!

    function taxonomy_short_description_shorten( $string, $max_length = 23, $append = '…' ) {
    	$string = strip_tags( $string );
    	$string = trim( $string );
    	$string = html_entity_decode( $string, ENT_QUOTES, 'UTF-8' );
    	$string = rtrim( $string, '-' );
    
    	/* Count how many characters are in the string. */
    	$length = strlen( utf8_decode( $string ) );
    
    	/* String is longer than max-length. It needs to be shortened. */
    	if ( $length > $max_length ) {
    
    		/* Shorten the string to max-length */
    		$string = mb_substr( $string, 0, $max_length, 'utf-8' );
    
    		/* Avoid breaks within words: find the last white space.
    		The offset argument comes before the encoding. */
    		$pos = mb_strrpos( $string, ' ', 0, 'utf-8' );

    		/* A space was found: shorten the string to the last space.
    		No space means one long word or Chinese/Korean/Japanese text,
    		which is left as cut at max-length. */
    		if ( false !== $pos ) {
    			$string = mb_substr( $string, 0, $pos, 'utf-8' );
    		}

    		/* Append the value of $append, preceded by a non-breaking space
    		(verbose byte notation for readability). */
    		$string .= "\xC2\xA0" . $append;
    	}
    
    	return $string;
    }

    Plugin Contributor toscho

    @toscho

    Mbstring is a non-default extension. I had cases where it was missing, even on PHP 5. As far as I know WordPress doesn’t require the extension. There is a fallback for mb_substr() in /wp-includes/compat.php, but mb_strrpos() may be a little bit risky.
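
    One possible way around the missing extension, sketched under assumptions: a space is a single byte in UTF-8, and that byte never occurs inside a multibyte character, so strrpos() and substr() can cut safely at the last space once the string has been limited to whole characters. The function name sketch_shorten_no_mbstring() is hypothetical and not part of the plugin. The character limit is taken with a /u regular expression instead of mb_substr(), and a string without any space is returned without an ellipsis.

```php
<?php
// Hypothetical sketch, not the plugin's code: shorten UTF-8 text at the
// last space without calling any mb_* function.
function sketch_shorten_no_mbstring( $string, $max_chars = 23, $append = "\xE2\x80\xA6" ) {
	// Take at most $max_chars characters from the start of the string.
	// preg_match() returns false here when $string is not valid UTF-8.
	if ( ! preg_match( '/^.{0,' . (int) $max_chars . '}/us', $string, $match ) ) {
		return $string; // leave malformed input untouched
	}
	$head = $match[0];

	// Nothing was cut off, so the string is already short enough.
	if ( $head === $string ) {
		return $string;
	}

	// Byte position of the last space. This is always a character
	// boundary, because 0x20 never appears inside a multibyte sequence.
	$pos = strrpos( $head, ' ' );

	// No space: one long word, or text in a script without spaces.
	if ( false === $pos ) {
		return $head;
	}

	// Cut at the last space, then add a no-break space and the ellipsis.
	return substr( $head, 0, $pos ) . "\xC2\xA0" . $append;
}

// "Kategorie für schöne Übungen", written with byte escapes.
echo sketch_shorten_no_mbstring( "Kategorie f\xC3\xBCr sch\xC3\xB6ne \xC3\x9Cbungen" ), "\n";
```

    With the default limit of 23 characters, the example string is cut after "schöne" instead of in the middle of "Übungen".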

    I noticed that while using the string you provided […] that the ellipses were not appended so this had to be moved outside the conditional.

    I wouldn’t append an ellipsis to such a string. More than 20 characters without white space occur mostly in languages like Chinese or Korean where typography follows other rules. You could try to be very cool and look into WPLANG to set the correct ellipsis (and white space). There was an interesting discussion in a Mozilla group on this topic.
    But doing nothing is probably better than doing it wrong.

    I also moved a few things around and reduced the number of return statements to 1.

    I prefer early exits to avoid unnecessary indentation. But that’s a matter of style.

    Plugin Author Michael Fields

    @mfields

    Mbstring is a non-default extension. I had cases where it was missing, even on PHP 5. As far as I know WordPress doesn’t require the extension. There is a fallback for mb_substr() in /wp-includes/compat.php, but mb_strrpos() may be a little bit risky.

    That’s really good to know. Thanks for the info.

    I wouldn’t append an ellipsis to such a string. More than 20 characters without white space occur mostly in languages like Chinese or Korean where typography follows other rules. You could try to be very cool and look into WPLANG to set the correct ellipsis (and white space). There was an interesting discussion in a Mozilla group on this topic.
    But doing nothing is probably better than doing it wrong.

    Really didn’t know this could be such an involved thing. I think it just might be a good idea to have $append default to a translatable string. I’ll have to dig through core and see if I can figure out language detection… I just downloaded the Japanese version of WordPress and took a peek at /wp-content/languages/ja.po and it uses three periods in all instances of “more...”. Not sure what the best way to handle this is. I definitely want it included for all languages that support it.

    Plugin Contributor toscho

    @toscho

    I think it just might be a good idea to have $append default to a translatable string.

    Unfortunately, WP doesn’t offer the […] as a separate string in the language files. The core uses some … and, worse, sometimes [...] (three dots). Adding your own POT file for this one string looks like overkill to me.
    In most languages, U+2026 will be the correct character. Three separate dots are probably always wrong (at least in the languages I can read). Use just […] and encode the PHP file in UTF-8.

    I just downloaded the Japanese version of WordPress and took a peek at /wp-content/languages/ja.po and it uses three periods in all instances of “more...”.

    Japanese text may use […] or [……], but [...] looks wrong. The language files aren’t written by typographers. 😉
    In Chinese, it is [……]. I couldn’t find a reference for Korean. Maybe they don’t cut their sentences? 🙂
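
    If someone wants to try the locale idea anyway, a minimal sketch could look like the following. The function name sketch_ellipsis_for_locale() is hypothetical, and the mapping covers only the Chinese case mentioned above; every other locale, including Japanese, falls back to a single U+2026. The ellipsis is written as byte escapes so the result does not depend on the encoding of the PHP file.

```php
<?php
// Hypothetical helper, not part of the plugin or of WordPress.
function sketch_ellipsis_for_locale( $locale ) {
	$ellipsis = "\xE2\x80\xA6"; // U+2026 HORIZONTAL ELLIPSIS in UTF-8

	// Assumption: locale codes look like "zh_CN" or "de_DE".
	if ( 0 === strpos( $locale, 'zh' ) ) {
		return $ellipsis . $ellipsis; // doubled form, as noted for Chinese
	}

	// Default for all other locales.
	return $ellipsis;
}

echo sketch_ellipsis_for_locale( 'zh_CN' ), "\n";
echo sketch_ellipsis_for_locale( 'de_DE' ), "\n";
```

    In WordPress the locale string could come from get_locale(), and the result could be passed as the $append argument.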

  • The topic ‘[Resolved] [Plugin: Taxonomy Short Description] Multibyte support and breaks within words’ is closed to new replies.