Forum Replies Created

Viewing 6 replies - 1 through 6 (of 6 total)
  • Thread Starter dpromies

    (@dpromies)

    That’s a good question. I see two possible enhancements. First, the punctuation function could check whether a non-empty incoming string comes out empty after passing through the punctuation regex. That should never happen, whatever the punctuation filtering is instructed to do. In that case a workaround for handling non-Unicode characters could be applied.
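
    A minimal sketch of such a check, assuming plain PHP with the mbstring extension. The function name safe_remove_punct is hypothetical and not part of Relevanssi. It relies on the fact that preg_replace() returns null (which looks like an empty string later on) when the /u modifier meets a subject that is not valid UTF-8.

```php
<?php
/**
 * Hypothetical wrapper, not part of Relevanssi: replaces punctuation and
 * retries on a sanitized copy when the Unicode regex fails.
 */
function safe_remove_punct( string $text, string $replacement = ' ' ): string {
    $result = preg_replace( '/[[:punct:]]+/u', $replacement, $text );

    // With the /u modifier, preg_replace() returns null if the subject
    // is not valid UTF-8; preg_last_error() reports the reason.
    if ( null === $result ) {
        error_log( 'Punctuation regex failed, PCRE error code: ' . preg_last_error() );

        // Workaround: converting UTF-8 to UTF-8 swaps invalid byte
        // sequences for mbstring's substitute character, then retry.
        $clean  = mb_convert_encoding( $text, 'UTF-8', 'UTF-8' );
        $result = preg_replace( '/[[:punct:]]+/u', $replacement, $clean );
    }

    // Never hand back null; an empty string is the last resort.
    return $result ?? '';
}
```

    Logging the PCRE error code at least leaves a hint in the error log that a post may end up indexed without terms.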

    More generally, you could tell the user about indexed posts that have no terms, either as a stat on the indexing tab (after index building) or as a function on the debugging tab (“check index status of posts”). I think this could be checked quite easily in the database. Of course it would prompt further questions from users, but in my case it would have helped to see that something was going wrong.

    Thread Starter dpromies

    (@dpromies)

    I have checked the encoding of the strings – it’s UTF-8. But somehow some non-UTF-8 characters are making their way through the string handling functions. I think they may come from editors working on a Mac, and the server settings are not able to cope with them.

    Thank you for the recommendation to use a filter – I will try this. Maybe in a future update you could add some error handling to the relevanssi_remove_punct function? In my case the function just returned an empty string without any hint that the indexed post would have no terms. Thanks for your help.

    David

    Thread Starter dpromies

    (@dpromies)

    No, it’s the Unicode modifier that is preventing the string from being processed. When I take it away, it nearly works correctly, but it can’t handle the German character “ö” (it replaces it with a question mark).

    With $a = preg_replace( '/[[:punct:]]+/', apply_filters( 'relevanssi_default_punctuation_replacement', ' ' ), $a ); it’s better. There is just one question mark left in the string, where the non-breaking space had been replaced by a � after $a = html_entity_decode( $a, ENT_QUOTES );

    Thread Starter dpromies

    (@dpromies)

    I think I have found the bug now. It seems to be a server-related charset problem that causes strings to be handled incorrectly within the function relevanssi_remove_punct.

    I could trace the problem step by step through this function:

    1) String in Post:
     Media

    2) String after $a = html_entity_decode( $a, ENT_QUOTES ):
    <p>�Media

    3) String after $a = preg_replace( '/[[:punct:]]+/u', apply_filters( 'relevanssi_default_punctuation_replacement', ' ' ), $a ):
    empty

    When I use a different regular expression instead of '/[[:punct:]]+/u', the function does not fail:

    4) String after $a = preg_replace('/\p{P}/', '', $a):
    <p>?Media
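
    The steps above can be approximated in plain PHP. This is only a sketch: it assumes the live server decodes entities as ISO-8859-1, which is one plausible explanation (html_entity_decode() defaulted to ISO-8859-1 before PHP 5.4 and follows the default_charset setting since PHP 5.6), not a confirmed diagnosis. The variable names are illustrative.

```php
<?php
// Decode the entity as if the server charset were ISO-8859-1
// (an assumption about the live server, see above).
$latin1 = html_entity_decode( '&nbsp;Media', ENT_QUOTES, 'ISO-8859-1' );
// $latin1 now starts with the single byte 0xA0, which is not valid UTF-8.

// The /u modifier requires valid UTF-8, so this call fails and returns null.
$with_u = preg_replace( '/[[:punct:]]+/u', ' ', $latin1 );
$error  = preg_last_error(); // PREG_BAD_UTF8_ERROR

// Decoding explicitly as UTF-8 yields the two-byte sequence 0xC2 0xA0,
// which the same regex handles without error.
$utf8  = html_entity_decode( '&nbsp;Media', ENT_QUOTES, 'UTF-8' );
$fixed = preg_replace( '/[[:punct:]]+/u', ' ', $utf8 );
```

    If this matches what happens on the server, passing 'UTF-8' explicitly as the third argument of html_entity_decode() would avoid the invalid byte in the first place.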

    • This reply was modified 4 years, 3 months ago by dpromies.
    Thread Starter dpromies

    (@dpromies)

    Honestly I don’t think that you can reproduce this behaviour. Could you give me a hint where to insert debugging code in the indexing functions to get more information about what is happening with the content?

    Thread Starter dpromies

    (@dpromies)

    Hi Mikko,

    thanks for your response. The error log doesn’t contain any Relevanssi-related errors. But I took a closer look at the source code of the unindexed pages. It seems that in some cases a hardcoded &nbsp; put in by the editors is causing the trouble. I can reproduce it only on my live server. Maybe there is a problem writing this to the database?
