Support » Plugin: Slimstat Analytics » Foreign languages Encoding support issue.

  • Resolved JakeM

    (@jakem)


    Hi,

    I’ve got tons of referrers from Asian countries that don’t seem to use UTF-8 encoding, so their URLs all show up garbled, like this one:

    http://www.baidu.com/s?wd=%3F%3F¶«%3F%3F%3F%3F%3F«%3F%3F»%3F%3Fµ%3F%3F%3F%3F%3F«%3F%3F

    I’ve done some investigation and found out that the problem occurs mostly with Big5 encoding.

    Is it possible to detect and display the proper encoding instead of this broken text?

    Thanks!

    http://wordpress.org/extend/plugins/wp-slimstat/

Viewing 15 replies - 1 through 15 (of 28 total)
  • Just to add that Google Analytics displays the referrers without a problem.

    Plugin Author Jason Crouse

    (@coolmann)

    Hi JakeM,

    I thought I had already addressed this issue, but apparently there are still some modules that need to be fixed. Where did you notice the issue? Could you post a screenshot? Does Spy View display the URL correctly?

    Camu

    Hi Camu,

    You can check the screenshots here:

    1st: Right Now | 2nd: Spy View = http://imgur.com/a/L6kIk
    3rd: Recent Search Terms = http://imgur.com/6rmIp
    Let me know if you need me to help you debug and track down that issue.

    PS: Thanks for your great plugin!

    Plugin Author Jason Crouse

    (@coolmann)

    Oh I see, so it’s the keywords that are not translated, not the actual URL, right?

    I don’t have any Big5 strings in my database to run some tests, would you be able to provide some rows from your wp_slim_stats table to recreate the issue on my test environment?

    Contact me at

    the mailbox

    Thank you,
    Camu

    http://www.baidu.com/s?wd=%3F%3F%3F%3F%3F%3F%3F%3F%3F«%3F%3F%3F%3F%3F%3F%3F«%3F%3F

    %3F are in fact question marks (?)

    The same referrer is sometimes displayed correctly. That happens when the link carries the utf-8 parameter (ie=utf-8) at its end, though I have no idea why.

    http://www.baidu.com/baidu?word=公司&ie=utf-8

    On your test website, you can create a page that contains Big5-encoded paragraphs, then search for some of that text through baidu.com.

    When the link works, it shows up like this:

    http://www.baidu.com/s?wd=%E5%B9%BF%E4%B8%9C%E6%98

    And it shows up like this when it doesn’t:

    http://www.baidu.com/s?wd=%3F%3F%3F%3F%3F%3F%3F%3F
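For readers following along, here is a minimal sketch of what the two links above contain once percent-decoded. It uses Python's standard `urllib.parse.unquote`; the helper name `decode_utf8_query` is made up for illustration and is not part of Slimstat. The working value is shortened to its first two complete characters, because the link quoted above is cut off mid-character.

```python
from urllib.parse import unquote


def decode_utf8_query(value):
    """Percent-decode a query value as UTF-8 (hypothetical helper).

    Bytes that are not valid UTF-8 are replaced with U+FFFD instead of raising.
    """
    return unquote(value, encoding="utf-8", errors="replace")


# First two complete characters of the working link quoted above.
working = decode_utf8_query("%E5%B9%BF%E4%B8%9C")

# The broken link: %3F is simply the ASCII question mark.
broken = decode_utf8_query("%3F%3F%3F%3F")

print(working)
print(broken)
```

The broken value decodes to plain question marks, which suggests the original characters were already replaced before the string was stored or displayed; no choice of charset at display time can bring them back from that string alone.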

    Plugin Author Jason Crouse

    (@coolmann)

    Yes, but I can’t wait for Baidu to index my test site 🙂 If you could export some rows from your database, or give me access to your site to do some tests, I could see the problem right where it happens.

    Thanks,
    Camu

    I just sent you some rows from wp_slim_stats by email.

    Thanks!

    Plugin Author Jason Crouse

    (@coolmann)

    Thank you. Also, what is the encoding of your pages?

    Strictly UTF-8

    Plugin Author Jason Crouse

    (@coolmann)

    Thank you.

    Unfortunately, my experience with encodings and charsets is mostly limited to Latin and Russian charsets. From what I understand of PHP’s manual pages, there seems to be no way to detect the original charset of a string like, for example,

    http://www.baidu.com/s?wd=俄文翻译

    In my case, for example, my test environment tells me that this string is compatible with EUC-JP (Japanese), but the right charset is EUC-CN. If you or someone else could point me in the right direction to reliably detect the charset of a string, I can definitely implement that in WP SlimStat’s code.

    The only way I see at the moment is to check, for EVERY search engine out there, which charset it uses to encode the URLs of its result pages, and that list would have to be kept up to date. This is the list of search engines currently detected by WP SlimStat (if none matches, a heuristic match is done to find the search string):

    daum
    eniro
    naver
    google
    http://www.google
    yahoo
    msn
    bing
    aol
    lycos
    ask
    cnn
    about
    mamma
    voila
    virgilio
    baidu
    yandex
    najdi
    seznam
    search
    onet
    yam
    pchome
    kvasir
    mynet
    rambler
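The detection problem described above can be illustrated with a short Python sketch. It is not how WP SlimStat works; the function name `accepting_charsets` and the candidate list are assumptions chosen for the example. The point it tries to show is that "these bytes decode without error under charset X" is a much weaker statement than "these bytes were written in charset X".

```python
def accepting_charsets(raw, candidates):
    """Return the candidate charsets that decode `raw` without an error.

    Hypothetical helper: success only means the bytes are structurally
    valid in that charset, not that the charset is the right one.
    """
    accepted = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# The keyword from the example URL above, as EUC-CN (GB2312) bytes.
raw = "俄文翻译".encode("gb2312")

# Illustrative candidate list; a real list would be longer.
CANDIDATES = ["utf-8", "gb2312", "euc_jp", "big5", "latin-1"]

accepted = accepting_charsets(raw, CANDIDATES)
print(accepted)
```

More than one candidate accepts the same bytes (a single-byte charset such as latin-1 accepts any byte sequence at all), so a purely structural check cannot pick the right charset on its own. Some extra hint is needed, such as a parameter in the URL or knowledge of the search engine.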

    Baidu works with UTF-8 only and converts Big5 characters on the fly when you click search.

    Copy and paste this 聯 into Baidu and hit search. You’ll see this Traditional Chinese (Big5) character changing to 联 for Simplified Chinese (handled by UTF-8).

    Check the last row and you’ll see that the link is encoded with utf-8 at its end but still doesn’t show up correctly in SlimStat.

    The problem doesn’t seem to occur on Baidu’s side.

    http://www.j4.com.tw/big-gb/

    Plugin Author Jason Crouse

    (@coolmann)

    If I put 聯 in the search field, the URL of the result page has a blank space, for me, right after wd 🙁 Probably because my computer cannot handle this charset correctly, or something like that?

    From the data you sent me, I see for example

    http://www.baidu.com/s?wd=%B9%E3%D6%DD%D0%C7%C1%AA%BE%AB%C3%DC%D3%D0%CF%DE%B9%AB%CB%BE

    which is coming directly from Baidu, if I’m not mistaken. It should be displayed as

    http://www.baidu.com/s?wd=广州星联精密有限公司

    (charset: EUC-CN), but PHP thinks it’s EUC-JP, and messes it up 🙂

    Is there something I’m missing somewhere?

    Camu
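The decoding described in this post can be reproduced with a short Python sketch. It assumes the charset is already known to be EUC-CN (Python's `gb2312` codec) and only shows what the percent-encoded bytes turn into under that assumption; it does not solve the detection problem.

```python
from urllib.parse import unquote

# The wd value from the Baidu referrer quoted above.
REFERRER_QUERY = "%B9%E3%D6%DD%D0%C7%C1%AA%BE%AB%C3%DC%D3%D0%CF%DE%B9%AB%CB%BE"

# Decoded with the charset the post identifies as correct (EUC-CN / GB2312).
as_gb2312 = unquote(REFERRER_QUERY, encoding="gb2312")

# Decoded as UTF-8 instead; invalid byte sequences become U+FFFD.
as_utf8 = unquote(REFERRER_QUERY, encoding="utf-8", errors="replace")

print(as_gb2312)
print(as_utf8)
```

Treating the same bytes as UTF-8 yields replacement characters, which is one plausible way for a UTF-8 site to end up showing broken keywords.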

    Plugin Author Jason Crouse

    (@coolmann)

    It looks like Baidu is using two different encodings for the result page URLs: one has ie=utf-8 and the other doesn’t. The latter is encoded as EUC-CN, from what I can see.

    That’s correct.

    The 1st link is what is sent as the referrer when you click one of the results in Baidu. It is then supposed to be translated back into Chinese characters (same as in the 2nd link) when received by WP and/or SlimStat.

    I’m not really sure which of the two is in charge of the conversion.
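One possible way to handle the two Baidu URL forms discussed above is sketched below in Python. This is an illustration under stated assumptions, not Slimstat's code: the function names are hypothetical, the `wd` and `ie` parameter names are taken from the URLs in this thread, and the `gb2312` fallback reflects the observation that links without `ie=utf-8` appear to be EUC-CN. The idea is to trust a declared charset first, then try strict UTF-8, then the per-engine fallback.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def raw_query_params(url):
    """Split the query string into {name: raw bytes} without guessing a charset."""
    params = {}
    for pair in urlsplit(url).query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        name = unquote_to_bytes(key).decode("ascii", "replace")
        params[name] = unquote_to_bytes(value.replace("+", " "))
    return params


def decode_keyword(url, key="wd", fallback="gb2312"):
    """Decode a search keyword (hypothetical helper).

    Order of attempts: the charset named in `ie`, strict UTF-8, then
    `fallback`. Returns None when the keyword parameter is absent.
    """
    params = raw_query_params(url)
    raw = params.get(key)
    if raw is None:
        return None
    candidates = []
    if "ie" in params:
        candidates.append(params["ie"].decode("ascii", "replace"))
    candidates += ["utf-8", fallback]
    for charset in candidates:
        try:
            return raw.decode(charset)
        except (ValueError, LookupError):
            continue  # wrong or unknown charset, try the next one
    return raw.decode("utf-8", errors="replace")


print(decode_keyword("http://www.baidu.com/s?wd=%B9%AB%CB%BE"))
print(decode_keyword("http://www.baidu.com/baidu?word=%E5%85%AC%E5%8F%B8&ie=utf-8", key="word"))
```

Trying strict UTF-8 before the fallback works here because EUC-CN bytes are rarely valid UTF-8, but that is a heuristic and not a guarantee; short keywords in particular could be misread.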

  • The topic ‘Foreign languages Encoding support issue.’ is closed to new replies.