• Hi. I’d like to report some problems with Asian text. I noticed that WP 2.0 and 2.0.1 started using mb_substr instead of the regular substr, but that change alone is not enough to handle Asian text well, so I’d like to describe the problems I have noticed here.

    The first problem is the use of white space. WP is designed to treat white space as the word separator, and it extracts excerpts automatically from the content using white space as the word delimiter. However, Asian text generally doesn’t put white space between words. As a result, Asian text archives look like this bad example http://www25.big.jp/~jam/unix/wordpress/example-bad-excerpt.html instead of this better example http://www25.big.jp/~jam/unix/wordpress/example-good-excerpt.html.
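
To illustrate the white space problem, here is a minimal sketch in Python (not WP's actual PHP code). The function names `excerpt_by_words` and `excerpt_by_chars` and the sample strings are made up for this example. It only shows why a word-count excerpt returns the whole text when there are no spaces, while a character-count excerpt still shortens it.

```python
def excerpt_by_words(text: str, max_words: int) -> str:
    """Cut the text after max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        # Text without spaces is a single "word", so it is never shortened.
        return text
    return " ".join(words[:max_words]) + " ..."


def excerpt_by_chars(text: str, max_chars: int) -> str:
    """Cut the text after max_chars characters, ignoring word boundaries."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + " ..."


english = "WordPress builds excerpts by counting words separated by spaces"
japanese = "これは空白を含まない日本語の文章です"  # sample text with no spaces

print(excerpt_by_words(english, 3))   # shortened to three words
print(excerpt_by_words(japanese, 3))  # comes back unshortened
print(excerpt_by_chars(japanese, 5))  # shortened to five characters
```

A character-based cut is only one possible fallback; it can still break in the middle of a word, but at least the excerpt stays short.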

    The second problem is the difference between mb_substr and substr. substr extracts n bytes of data; mb_substr extracts n characters, which in UTF-8 can be as many as n*4 bytes. I don’t know the RSS specification or the others involved, so I’m not sure whether the length in bytes matters in them. I just thought you should be aware of the difference between substr and mb_substr, since you have started using the latter.
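
The byte versus character difference can be seen in a small Python sketch. This is an analogy, not PHP: slicing a Python `str` counts characters (like mb_substr), and slicing the encoded `bytes` counts bytes (like substr). The helper names `char_substr` and `byte_substr` and the sample string are made up for this example.

```python
def char_substr(text: str, n: int) -> str:
    """Take the first n characters (the mb_substr-like behaviour)."""
    return text[:n]


def byte_substr(data: bytes, n: int) -> bytes:
    """Take the first n bytes (the substr-like behaviour)."""
    return data[:n]


text = "日本語のテキスト"      # each of these characters is 3 bytes in UTF-8
encoded = text.encode("utf-8")

three_chars = char_substr(text, 3)
print(len(three_chars), len(three_chars.encode("utf-8")))  # characters vs bytes

four_bytes = byte_substr(encoded, 4)  # one whole character plus a fragment
# The cut falls inside the second character, so the tail decodes to U+FFFD.
print(four_bytes.decode("utf-8", errors="replace"))
```

So a limit of n characters can produce output well beyond n bytes, and a limit of n bytes can leave a broken character at the end.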

    I wrote my own substr clone for UTF-8 Asian text and made several modifications to WP 2.0.1 so that it uses it. They are available for download at http://www25.big.jp/~jam/unix/wordpress/#patches. (Note: my own substr works well only with UTF-8; mb_substr works with other encodings as well.)
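
The patch itself is not reproduced here. As a rough idea of what a UTF-8-aware substring routine has to do, the following Python sketch cuts a string to a byte limit without splitting a character. It is an assumption for illustration only and is not taken from the patch; the name `truncate_utf8` is hypothetical. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text that fits in max_bytes of UTF-8."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is inside a character, so
    # move back until the cut sits on a character boundary.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


print(truncate_utf8("日本語", 4))  # keeps only the characters that fit whole
```

Like the clone described above, this approach works only for UTF-8; other encodings mark character boundaries differently.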

    I’m not sure how to submit issues or patches, so I’m simply describing them here in the hope that this helps WP development a little. Please let me know if you need a more detailed description. Thanks.
