WordPress.org

WP SlimStat
[resolved] Foreign languages Encoding support issue. (29 posts)

  1. JakeM
    Member
    Posted 1 year ago #

    Hi,

    I've got tons of referrers from Asian countries that don't seem to use UTF-8 encoding, so their URLs all show up like this one, for example:

    http://www.baidu.com/s?wd=%3F%3F¶«%3F%3F%3F%3F%3F«%3F%3F»%3F%3Fµ%3F%3F%3F%3F%3F«%3F%3F

    I've done some investigation and found out that the problem occurs mostly with Big5 encoding.

    Is it possible to detect and display the proper encoding instead of this broken text?

    Thanks!

    http://wordpress.org/extend/plugins/wp-slimstat/

  2. JakeM
    Member
    Posted 1 year ago #

    Just to add that Google Analytics displays the referrers without a problem.

  3. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Hi JakeM,

    I thought I had already addressed this issue, but apparently there are still some modules that need to be fixed. Where did you notice the issue? Could you post a screenshot? Does Spy View display the URL correctly?

    Camu

  4. JakeM
    Member
    Posted 1 year ago #

    Hi Camu,

    You can check the screenshots here:

    1st: Right Now | 2nd: Spy View = http://imgur.com/a/L6kIk
    3rd: Recent Search Terms = http://imgur.com/6rmIp
    Let me know if you need me to help you debug and track down that issue.

    PS: Thanks for your great plugin!

  5. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Oh I see, so it's the keywords that are not translated, not the actual URL, right?

    I don't have any Big5 strings in my database to run tests with. Would you be able to provide some rows from your wp_slim_stats table so I can recreate the issue in my test environment?

    Contact me at

    http://www.duechiacchiere.it/contatto

    Thank you,
    Camu

  6. JakeM
    Member
    Posted 1 year ago #

    http://www.baidu.com/s?wd=%3F%3F%3F%3F%3F%3F%3F%3F%3F«%3F%3F%3F%3F%3F%3F%3F«%3F%3F

    %3F is in fact a percent-encoded question mark (?).

    The same referrer is sometimes displayed correctly. The reason is that it contains the utf-8 extension (ie=utf-8) at the end of the link, though I have no idea why it is there.

    http://www.baidu.com/baidu?word=公司&ie=utf-8

  7. JakeM
    Member
    Posted 1 year ago #

    On your test website, you can try to create a page that contains Big5-encoded paragraphs, and then search for some of that text through baidu.com.

    When the link works, it shows up like this:

    http://www.baidu.com/s?wd=%E5%B9%BF%E4%B8%9C%E6%98

    And it shows up like this when it doesn't:

    http://www.baidu.com/s?wd=%3F%3F%3F%3F%3F%3F%3F%3F
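
One plausible way a referrer ends up as a run of %3F, offered as an illustration rather than a diagnosis of this thread: if the search term is forced into a charset that cannot represent it before it is percent-encoded, every unmappable character becomes a literal question mark and the original text cannot be recovered afterwards. A minimal Python sketch (the plugin itself is PHP; Latin-1 is only a stand-in for "a charset without Chinese characters", and the variable names are made up for the example):

```python
from urllib.parse import quote, unquote

term = "广东"

# Intact path: encode the term as UTF-8, then percent-encode the bytes.
utf8_url_part = quote(term.encode("utf-8"))

# Lossy path (assumed for illustration): the term is first converted to a
# charset with no Chinese characters, so each character is replaced by "?".
lossy_bytes = term.encode("latin-1", errors="replace")
lossy_url_part = quote(lossy_bytes)

print(utf8_url_part)            # %E5%B9%BF%E4%B8%9C
print(lossy_url_part)           # %3F%3F
print(unquote(lossy_url_part))  # ??
```

Once the URL carries only %3F, no charset detection on the receiving side can bring the characters back, which is different from the case where the bytes are intact but in an unexpected charset.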

  8. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Yes, but I can't wait for Baidu to index my test site :) If you could export some rows from your database, or give me access to your site to do some tests, I could see the problem right where it happens.

    Thanks,
    Camu

  9. JakeM
    Member
    Posted 1 year ago #

    I just sent you some rows from wp_slim_stats by email.

    Thanks!

  10. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Thank you. Also, what is the encoding of your pages?

  11. JakeM
    Member
    Posted 1 year ago #

    Strictly UTF-8

  12. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Thank you.

    Unfortunately, my experience with encodings and charsets is limited mostly to Latin and Russian charsets. From what I understand from PHP's manual pages, there seems to be no way to detect the original charset of a string like, for example:

    http://www.baidu.com/s?wd=俄文翻译

    In my case, for example, my test environment tells me that this string is compatible with EUC-JP (Japanese), but the right charset is EUC-CN. If you or someone else could point me in the right direction to reliably detect the charset of a string, I can definitely implement that in WP SlimStat's code.

    The only way I see at the moment is to check, for EVERY search engine out there, which charset it uses to encode the URLs of its result pages. That list would also have to be kept up to date. This is the list of search engines currently detected by WP SlimStat (if none matches, a heuristic match is done to find the search string):

    daum
    eniro
    naver
    google
    http://www.google
    yahoo
    msn
    bing
    aol
    lycos
    ask
    cnn
    about
    mamma
    voila
    virgilio
    baidu
    yandex
    najdi
    seznam
    search
    onet
    yam
    pchome
    kvasir
    mynet
    rambler
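
The ambiguity Camu describes in post 12 can be reproduced outside PHP: a byte string that was written in one charset often decodes without any error under several others, so "it decodes cleanly" is not evidence that a charset is the right one. A small Python sketch of that idea, with the assumptions that the candidate list is arbitrary and that `charsets_that_decode` is a hypothetical helper, not something from WP SlimStat:

```python
# Arbitrary list of codecs to try; any real list would be a judgment call.
CANDIDATES = ["utf-8", "gb2312", "euc_jp", "big5", "latin-1"]


def charsets_that_decode(raw, candidates=CANDIDATES):
    """Return every candidate codec that decodes `raw` without an error."""
    accepted = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# The search term from the post, as the bytes a GB2312 (EUC-CN) page would send.
raw = "俄文翻译".encode("gb2312")
print(charsets_that_decode(raw))
```

Latin-1 accepts every possible byte, so it always appears in the result. That is the extreme case of the problem: strict UTF-8 decoding is a useful negative test, but for legacy charsets a successful decode says very little.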

  13. JakeM
    Member
    Posted 1 year ago #

    Baidu works with UTF-8 only and converts Big5 characters on the fly when you click search.

    Copy and paste this 聯 into Baidu and hit search. You'll see this Traditional Chinese (Big5) character changing to 联 for Simplified Chinese (handled by UTF-8).

    Check the last row and you'll see that the link is encoded with -utf-8 at its end but still doesn't show up correctly in SlimStat.

    The problem doesn't seem to occur on Baidu's side.

    http://www.j4.com.tw/big-gb/

  14. camu
    Member
    Plugin Author

    Posted 1 year ago #

    If I put 聯 in the search field, the URL of the result page has a blank space, for me, right after wd :( Probably because my computer cannot handle this charset correctly, or something like that?

    From the data you sent me, I see for example

    http://www.baidu.com/s?wd=%B9%E3%D6%DD%D0%C7%C1%AA%BE%AB%C3%DC%D3%D0%CF%DE%B9%AB%CB%BE

    which is coming directly from Baidu, if I'm not mistaken. It should be displayed as

    http://www.baidu.com/s?wd=广州星联精密有限公司

    (charset: EUC-CN), but PHP thinks it's EUC-JP, and messes it up :)

    Is there something I'm missing somewhere?

    Camu

  15. camu
    Member
    Plugin Author

    Posted 1 year ago #

    It looks like Baidu is using two different encodings for the result page URLs: one has ie=utf-8 and the other does not. The latter is encoded as EUC-CN, from what I can see.
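
If the `ie` parameter really does name the charset of the search term, as the URLs in this thread suggest (this is inferred from the examples here, not from any Baidu documentation), it can be read before deciding how to decode the keywords. A hedged Python sketch; `declared_charset` is a hypothetical helper name:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def declared_charset(referrer: str) -> Optional[str]:
    """Return the charset named by the `ie` query parameter, or None."""
    params = parse_qs(urlsplit(referrer).query)
    values = params.get("ie")
    return values[0].lower() if values else None


print(declared_charset("http://www.baidu.com/baidu?word=%E5%85%AC%E5%8F%B8&ie=utf-8"))
print(declared_charset("http://www.baidu.com/s?wd=%B9%E3%D6%DD"))
```

The first call returns "utf-8" and the second returns None, in which case the caller still has to fall back to trying charsets itself.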

  16. JakeM
    Member
    Posted 1 year ago #

    That's correct.

    The 1st link is what is sent as the referrer when you click one of the results in Baidu. Then it is supposed to be translated back into Chinese characters (same as in the 2nd link) when received by WP and/or SlimStat.

    I'm not really sure which of the two is in charge of the conversion.

  17. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Nope, from what your DB export tells me, WP SlimStat receives the URL encoded as EUC-CN, not UTF-8 (unless ie=utf-8 is present). And as far as I know, there's no reliable way to tell the charset of

    http://www.baidu.com/s?wd=%B9%E3%D6%DD%D0%C7%C1%AA%BE%AB%C3%DC%D3%D0%CF%DE%B9%AB%CB%BE

    What I can do is check whether the charset of a given search string is UTF-8. If not, I'll have to hardcode the 'alternative' charsets used by that specific search engine, and test against those until I find the right one. Then I need to convert to UTF-8 and save that into the database :)

    It will take me a while to implement this.

    Camu
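
The approach Camu outlines in post 17 (try UTF-8 first, then a hardcoded list of alternative charsets for that search engine, and store the result as UTF-8) can be sketched as follows. This is Python rather than the plugin's PHP, and it is not the code that shipped in WP SlimStat; the `FALLBACK_CHARSETS` table and the `decode_search_term` name are assumptions made for the example, and only the Baidu entry comes from this thread.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical per-engine table of charsets to try when UTF-8 fails.
# Only the Baidu entry (EUC-CN, i.e. Python's "gb2312") is from the thread.
FALLBACK_CHARSETS = {"baidu": ["gb2312"]}


def decode_search_term(encoded: str, engine: str) -> str:
    """Percent-decode a search term and return it as a Unicode string."""
    raw = unquote_to_bytes(encoded)
    # Strict UTF-8 goes first: non-UTF-8 multibyte text rarely passes it.
    for charset in ["utf-8"] + FALLBACK_CHARSETS.get(engine, []):
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep what we can and mark the undecodable bytes.
    return raw.decode("utf-8", errors="replace")


gb_term = "%B9%E3%D6%DD%D0%C7%C1%AA%BE%AB%C3%DC%D3%D0%CF%DE%B9%AB%CB%BE"
print(decode_search_term(gb_term, "baidu"))
print(decode_search_term("%E5%B9%BF%E4%B8%9C", "baidu"))
```

Trying UTF-8 before the legacy charsets matters because the reverse order would misread genuine UTF-8 terms: many UTF-8 byte sequences are also acceptable to a double-byte codec.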

  18. JakeM
    Member
    Posted 1 year ago #

    "WP SlimStat receives the URL encoded as EUC-CN, not UTF-8 (unless ie=utf-8 is present)."

    I'm confused now, because if you take a look at the last two rows, the referrers contain the -utf-8 at the end, but their search terms are still corrupted ('????????????').

  19. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Yes, you are correct for those. As I said above, if the URL contains ie=utf-8, the charset is indeed UTF-8. For those strings it's a bug in my code :) which I'm already fixing.

    Camu

  20. JakeM
    Member
    Posted 1 year ago #

    Let me know if you want me to test your bug fix. ;)

  21. camu
    Member
    Plugin Author

    Posted 1 year ago #

    I just sent you the new version which should fix the issue.

    Best
    Camu

  22. JakeM
    Member
    Posted 1 year ago #

    Got it installed a few hours ago, and it looks like you solved the bug.

    I haven't encountered a single corrupted referrer since.

    Thanks!

    P.S: I'll change the status of this thread to resolved by tomorrow.

  23. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Thank you for reporting back! This is great news, indeed.

    Best,
    Camu

  24. JakeM
    Member
    Posted 1 year ago #

    Camu,

    Please remove the baidu links you posted earlier in this thread.

    In the post that starts with:

    If I put 聯 in the search field

    and the next one:

    your DB export tells me

    I have encountered similar broken links in the last couple of hours, and I'm not sure if they come from Baidu results or from people who are clicking those links. :)

    This referrer showed up in the last hour, and the IP of the source was from Cambodia (different encoding?):

    http://www.baidu.com/s?wd=%3F%3F弘%3F诀瘪

  25. camu
    Member
    Plugin Author

    Posted 1 year ago #

    I'm not allowed to edit my earlier messages, you will need to ask a moderator to do that.

    As for the link, are you saying that Baidu may use multiple charsets for the URLs, not just UTF-8 and EUC-CN? If you can identify the original charset of that referrer URL, I'll add it to the detection engine I've developed to address the issue. Right now Baidu is only associated with EUC-CN (and UTF-8, of course).

    Thank you

  26. JakeM
    Member
    Posted 1 year ago #

    I'm gonna try to catch it through GA. I'll let you know.

    Can you contact a mod to remove the links?

    Thanks!

  27. camu
    Member
    Plugin Author

    Posted 1 year ago #

    I'll see what I can do about the moderator. Please keep me posted on the charset thing :)

    Cheers
    Camu

  28. JakeM
    Member
    Posted 1 year ago #

    Your fix is working.

    The problem mentioned in my last post is related to older IE versions. They don't support UTF-8 encoding in the URL.

    Earlier, I found the same broken links in GA logs. So it's not related to slimstat anymore.
    In the meantime, we're gonna have to wait for older IE versions to vanish. :D

    Thanks for your support!

  29. camu
    Member
    Plugin Author

    Posted 1 year ago #

    Thank you so much for your feedback and for helping me improve WP SlimStat!

    Best,
    Camu

Topic Closed

This topic has been closed to new replies.
