Some sticky points on encoding in WP and elsewhere. Any experts around? Here's my current understanding.
I receive various international emails that I would like to copy and paste into WP. I thought I had been doing this correctly until I restored a recent backup: the posts were filled with question marks in place of diacriticals and even quotation marks.
First, I realized the problem had to do with non-Unicode encoding of backups. So I tried using the iconv utility on a MySQL command-line backup, specifying UTF-8 encoding. It worked for the most part, but not perfectly.
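For anyone who wants to see what that iconv step amounts to, here is a minimal sketch in Python. It assumes the dump really is ISO-8859-1 on disk, which is exactly the part I'm unsure of; the function name and file paths are made up for illustration. It does roughly what `iconv -f ISO-8859-1 -t UTF-8` does to a file.

```python
def reencode_file(src_path, dst_path, src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Rewrite a text file from one encoding to another.

    Hypothetical helper: the source encoding is an assumption the caller
    must get right. If the file is not really in src_encoding, the output
    will be garbled rather than rejected.
    """
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

Usage would be something like `reencode_file("backup-latin1.sql", "backup-utf8.sql")`. If the dump is actually a mix of encodings, a single pass like this can't fix every character, which might explain the "not perfectly" result.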
Then I realized there was another piece, and I think this is it. Maybe WP doesn't really convert pasted emails into Unicode, but leaves them as they were unless the text is typed in manually? In that case, emails that were actually Latin1 would get converted to Unicode only on a backup, and would then be restored that way. I would think Latin1-ISO or Latin1-Windows text should convert to Unicode nicely, and that is what my test with sample emails showed.
Emails coming in coded as Latin1 (either Windows or ISO versions) would convert to Unicode properly, even right at the email screen!
How can Latin1 and other schemes be converted to UTF-8 properly before being brought into WordPress?
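One possible approach, sketched in Python under some loud assumptions: the raw bytes of the message and its declared charset are both available, and text labelled ISO-8859-1 may really be Windows-1252 (the "Latin1-Windows" variant, which puts curly quotes in the 0x80-0x9F range). The function name `to_utf8` is hypothetical, not anything WP provides.

```python
def to_utf8(raw: bytes, declared_charset: str = "iso-8859-1") -> bytes:
    """Convert message bytes to UTF-8 based on the charset the email declares.

    Assumption: anything labelled Latin-1 is tried as Windows-1252 first,
    because the two agree on 0xA0-0xFF and Windows-1252 additionally maps
    0x80-0x9F to printable characters such as curly quotes.
    """
    charset = declared_charset.strip().lower()
    if charset in ("iso-8859-1", "latin1", "latin-1", "us-ascii"):
        try:
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in cp1252; fall back to
            # strict ISO-8859-1, which accepts every byte value.
            text = raw.decode("iso-8859-1")
    else:
        # Trust the declared charset; an unknown name raises LookupError.
        text = raw.decode(charset)
    return text.encode("utf-8")
```

The idea is to do the conversion once, explicitly, before the text ever reaches the WP editor, so that what gets stored is unambiguously UTF-8.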
Help much appreciated.
WP went from Latin-1 encoding (ISO-8859-1) early in '04 over to 8-bit Unicode (UTF-8), basically to allow characters from multiple foreign languages to be encoded in a universal way. Fine.
Using command-line syntax to back up the database, UTF-8 can be specified (with iconv), and this encoding is then maintained. If not, MySQL, phpMyAdmin, and the 1-ClickBackup all default to Latin-1 encoding and apparently will convert all the data on backup, and WP will not change it back on a restore. This apparently won't affect regular (plain ASCII) characters, as they're encoded the same way in each scheme. (Am I correct so far?)
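To check the "regular characters are encoded the same way" part, here is a small Python sketch. It only shows how the two encodings represent the same text as bytes; it says nothing about what MySQL or the backup tools actually do.

```python
ascii_text = "Hello, WordPress!"
accented = "é"

# Plain ASCII text produces identical bytes in Latin-1 and UTF-8.
same_for_ascii = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# An accented letter does not: one byte in Latin-1, two bytes in UTF-8.
latin1_bytes = accented.encode("latin-1")  # b"\xe9"
utf8_bytes = accented.encode("utf-8")      # b"\xc3\xa9"
```

So unaccented English text survives any mix-up between the two schemes, while anything accented is where the damage would show.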
With international characters, assuming WP was left with its default UTF-8 encoding, and posts were either written inside the WP editor or pasted there after already being in UTF-8, then a non-command-line backup (and restore) would mess up any character not within the Latin-1 subset, noticeable after the restore. Solution: go back and use command-line utils to do the backup in Unicode, or edit all the problem posts?! (How's my arithmetic?)
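Here is a rough Python simulation of that kind of loss. It is only a sketch: it assumes the backup step forces text through Latin-1 and substitutes "?" for anything Latin-1 cannot represent, which is my guess at what is happening, and the function name is made up.

```python
def simulate_latin1_backup(text: str) -> str:
    """Push text through Latin-1, replacing unrepresentable characters with '?'.

    Hypothetical stand-in for a lossy backup; not the actual tool behavior.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


# Characters inside Latin-1 survive the round trip unchanged.
survives = simulate_latin1_backup("café")

# Curly quotes and letters such as Ł or ź are outside Latin-1 and are lost.
damaged = simulate_latin1_backup("“naïve” Łódź")  # "?naïve? ?ód?"
```

Note that curly quotation marks are not in ISO-8859-1 at all, so under this assumption they would turn into question marks just like the less common diacriticals.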
What about attempting to paste non-UTF-8 (non-'Unicode') docs into the WP editor (WP with default Unicode)? This is where I had problems. I have emails that I just noticed came encoded in Latin1-Windows, Latin1-ISO, and a few other schemes. If I grabbed those texts and pasted them into WP as is, they would generally post okay. After doing a backup, though, and restoring the database with these posts, most of the diacriticals and even some quotation marks turned into question marks. If these were proper Unicode posts and I did a non-command-line backup, then I could understand the regression. There's something else here though.