Character encoding hell

  • Hello everyone.

    I’ve been having lots of problems setting up a new WordPress installation. I want to switch to UTF-8 for the character encoding, but most of my MySQL database is latin-1 encoded, so I need to convert my existing posts.

    I found a GNU utility called iconv that can convert between encodings. Unfortunately, it was only then that I found out that some of my postings are encoded in latin-1, while others are already encoded in UTF-8. This might be because I switched from another web host that probably had a different default encoding.
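
    A minimal sketch of one way to handle such a mix, assuming each post body is available as raw bytes: try a strict UTF-8 decode first and fall back to latin-1 only when that fails. The name `normalize_to_utf8` is made up for illustration, and the check is a heuristic, since a few latin-1 byte sequences also happen to be valid UTF-8.

```python
def normalize_to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding."""
    try:
        # Text that is already valid UTF-8 is kept as it is.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything else is assumed to be latin-1 (an assumption, not a fact).
        text = raw.decode("latin-1")
    return text.encode("utf-8")


# Two copies of the same word, one per encoding found in the dump.
already_utf8 = "caf\u00e9".encode("utf-8")
still_latin1 = "caf\u00e9".encode("latin-1")

print(normalize_to_utf8(already_utf8) == normalize_to_utf8(still_latin1))
```

    Unlike running a single latin-1 to UTF-8 conversion over the whole dump, this leaves the posts that are already UTF-8 untouched.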

    I’m scared at the thought of having to go through all 300 of my postings manually to fix the weird characters, so if anyone has any suggestions for solving this problem, it would make me very happy.

Viewing 12 replies - 1 through 12 (of 12 total)
  • Just a thought: if you convert it, would the process affect the posts that already are utf-8 encoded? I’d try it anyway – of course, after making a backup copy in case anything goes wrong.

    Yeah, I tried that. The problem is that converting text that is already UTF-8 as if it were latin-1 double-encodes it: every non-ASCII character is encoded a second time and turns into garbage.
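
    To make the double-encoding concrete, here is a small sketch (the variable names are only for illustration). It takes a single ë, treats its UTF-8 bytes as latin-1 and converts them to UTF-8 again, then reverses the mistake.

```python
original = "\u00eb"  # the character ë

# Correct UTF-8 encoding: two bytes.
utf8_bytes = original.encode("utf-8")

# The mistake: read those bytes as latin-1 and encode to UTF-8 once more.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

# What a UTF-8 page would now display instead of ë.
garbled = double_encoded.decode("utf-8")

# Reversing the mistake: undo the extra latin-1 -> UTF-8 step.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(utf8_bytes, double_encoded, repr(garbled), repr(repaired))
```

    The repair step only works on text that really was double-encoded; applied to clean text it raises an error or produces new garbage, so it needs the same per-post care as the conversion itself.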

    Oh, I see.
    How come you didn’t notice the encoding mismatch during your previous host migration? If something had been wrong, all your posts should have shown a lot of garbage characters…

    I’m out of my depth here and admit it. However, I don’t…get it.
    My database is latin-1 and my blog is UTF-8 and everything works fine.
    Should it not be?

    Moshu, I’m a bit puzzled by that as well. It was only after I exported the SQL file and stepped through it in an editor that I noticed the differences. Some posts are encoded in UTF-8 (with weird characters), while others are in latin-1 (where the accented characters remain ok).

    Whenever I test it on a local installation of WP and set my character encoding to UTF-8, those first postings appear correct and the latin-1 encoded ones do not, and vice versa. It could also have something to do with a difference between how I set things up locally in MySQL and how things are set up at my host.

    And samboll, I guess that works fine for most of us (it did for me too), but it seems weird to me that the database is latin-1 and the character encoding is UTF-8. I don’t know, maybe I still don’t know enough about character encodings 🙂

    Well, in MySQL there are two things that are related to the character set:
    – the charset
    – the connection collation
    (quite often they are mixed…)
    Furthermore, I have 4 DBs (standard setup by my host) for which the entry page of phpMyAdmin shows the charset as utf-8, while the collation varies but is mostly “latin1_swedish_ci”. It works with all kinds of accented Latin characters and even non-Latin alphabets.
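
    As a rough analogy for the two concepts (this is plain Python, not MySQL's implementation): the charset decides which bytes a character is stored as, while the collation decides how strings compare and sort. The case-insensitive sort below only imitates what a `_ci` collation does.

```python
text = "\u00eb"  # the character ë

# "Charset": the same character becomes different bytes per encoding.
latin1_bytes = text.encode("latin-1")  # one byte
utf8_bytes = text.encode("utf-8")      # two bytes

# "Collation": the same strings sort differently under different rules.
words = ["zebra", "Apple", "apple", "Zoo"]
binary_order = sorted(words)                         # code point order
case_insensitive_order = sorted(words, key=str.casefold)  # like a _ci rule

print(latin1_bytes, utf8_bytes)
print(binary_order)
print(case_insensitive_order)
```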

    Ok. I’m getting a bit confused now 🙂 What is the difference between a ‘collation’ and a ‘charset’ in MySQL? And which of the two relates to the character encoding in the HTML file?

    This page explains it better than I ever could…
    http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

    Thanks for that link, moshu. Now I have a huge headache.

    Well, I think I fixed the bug. Here’s what I did:

    1) Change MySQL’s default character set to utf8 and its default collation to utf8_general_ci
    2) Export the whole wp_posts table and remove all references to latin1 (so that the db automatically uses the now-default utf8 encoding)
    3) Re-import the whole thing (make sure to DROP the table ‘wp_posts’ first)
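
    Step 2 above could be sketched roughly as follows, assuming the export contains the usual `DEFAULT CHARSET=latin1`, `CHARACTER SET latin1` and `COLLATE latin1_...` clauses. The function name and the sample table definition are made up for illustration, and the patterns should only be run over the CREATE TABLE part of a dump, not over the post data.

```python
import re


def strip_latin1_references(ddl: str) -> str:
    """Remove latin1 charset and collation clauses from a table definition."""
    # Drop "DEFAULT CHARSET=latin1" and "CHARACTER SET latin1" clauses.
    ddl = re.sub(
        r"\s+(?:DEFAULT\s+)?(?:CHARSET|CHARACTER\s+SET)\s*=?\s*latin1\b",
        "",
        ddl,
        flags=re.IGNORECASE,
    )
    # Drop "COLLATE latin1_..." clauses such as latin1_swedish_ci.
    ddl = re.sub(r"\s+COLLATE\s*=?\s*latin1_\w+", "", ddl, flags=re.IGNORECASE)
    return ddl


# Hypothetical, heavily shortened table definition.
sample = (
    "CREATE TABLE wp_posts (\n"
    "  post_title text character set latin1 collate latin1_swedish_ci NOT NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=latin1;"
)
cleaned = strip_latin1_references(sample)
print(cleaned)
```

    With the explicit latin1 clauses gone, the re-imported table picks up whatever default was set in step 1.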

    This led to two problems. First, the ë (e with an ‘umlaut’) still displayed as ?, so I did a search-and-replace across all ë characters, converting them to the HTML entity &euml;. Second, the text editor I was using (Programmer’s Notepad) had some problems with UTF-8 too, so I used MadEdit instead to do the search and replace. Everything seems to be working fine now. Thanks for your input!

    Sam, I *think* that if one’s only generally using English for posting, then latin-1 in the db and utf-8 in the blog isn’t a problem.

    Thing I want to know is, why have these later versions of mysql done this with the collation? Somewhere back maybe 8 months to a year, the collation was utf-8 in the db if the blog was set to utf-8. Then it changed.

    I’m as likely to see latin1_swedish_ci in the db collation now as anything…. VERY weird, but it works so I guess I shouldn’t complain.

    My turn to raise this issue, as I’m seeing in my error logs that folks (well, spammers) are leaving comments in UTF-8 and they are being rejected with errors complaining that the charsets don’t match up.

    Should I be concerned? Granted, it’s spammers having these issues, but I would still hate to see actual folks getting hit by this error.

    Thanks,
    -drmike
