Plugin: Relevanssi - A Better Search » Get results with accentuated or non accentuated characters highlighted

  • Hello,
    I saw that entering a word with or without its accents gives the same results, and that’s good behaviour (for instance, it doesn’t matter whether you type modèle or modele in a French search, the results will be the same). But the searched word isn’t highlighted if it doesn’t match exactly. For instance, if I search for “ingénieur”, the word is highlighted in the excerpt and the title. If I search for “ingenieur”, the word (“ingénieur”) isn’t highlighted.
    Is there a way to handle this? (I can imagine a JS script on the search page handling this, but I would prefer a server-side solution.)

    Thanks in advance.

Viewing 12 replies - 1 through 12 (of 12 total)
  • Plugin Author Mikko Saari

    (@msaari)

    No easy solution here… The database is flexible, the regexp used for highlighting less so.

    There are no filters in the highlighting code; I’m sure it could use some. But how exactly? That’s a good question. Any suggestions?

    OK, the function where you can find the regexp is relevanssi_highlight_terms() in the /lib/excerpts-highlights.php file, is that right? (By the way, is that function also used for terms in a title?)
    I think that in the foreach ($terms as $term) loop you should handle both cases, the accented and the non-accented characters, on each $term and on $excerpt by using a function that converts accented characters to non-accented characters. Here is an example of such a conversion that I picked up from a web page (I’m not sure it is complete, and you could also skip replacing uppercase accented characters with uppercase characters):

    // Lookup table: accented character => unaccented replacement.
    $unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                                'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                                'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                                'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                                'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
    // $str is the text to convert, e.g. a search term or an excerpt.
    $str = strtr( $str, $unwanted_array );

    Maybe there is a better way, with iconv for instance…
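
    To make the lookup-table idea concrete, here is a minimal sketch that wraps it in two helpers. The function names fold_accents() and same_ignoring_accents() are made up for this example and are not part of Relevanssi or WordPress; the table is deliberately short, and the mbstring extension is assumed to be available for the lowercasing.

```php
<?php
// Hypothetical helper, not part of Relevanssi: folds accented characters
// to unaccented ones using a deliberately short lookup table.
function fold_accents(string $text): string {
    $map = [
        'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i',
        'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'û' => 'u', 'ü' => 'u',
    ];
    // strtr() with an array never rescans text it has already replaced.
    return strtr($text, $map);
}

// Two spellings count as the same word if their lowercased, folded forms
// are identical. Lowercasing first means the table needs no uppercase rows.
function same_ignoring_accents(string $a, string $b): bool {
    $fold = function (string $s): string {
        return fold_accents(mb_strtolower($s, 'UTF-8'));
    };
    return $fold($a) === $fold($b);
}

echo fold_accents('ingénieur'), "\n";
var_dump(same_ignoring_accents('modèle', 'Modele'));
```

    This only answers whether two words are equivalent; it does not yet say where to put the highlight in the original, still accented, text.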

    • This reply was modified 3 years, 9 months ago by jojaba.

    Hey, WordPress already has a function to remove accents: remove_accents()
    See how WordPress handles this (more robustly, I guess): formatting.php file on Trac

    • This reply was modified 3 years, 9 months ago by jojaba.

    Concerning filtering, I’m not sure it is possible. I think you have to use the filter API to add a filter. Or, is this the answer: Custom Hooks | Plugin Developer Handbook | WordPress Developer Resources ?

    • This reply was modified 3 years, 9 months ago by jojaba.
    Plugin Author Mikko Saari

    (@msaari)

    relevanssi_highlight_terms() is the function that does all the highlighting, yes.

    I can add any number of filters at any point in the code, just tell me if there’s some data or value you want filtered, and it can be done. No problems about that.

    remove_accents() sounds useful, but as we don’t really want to remove the accents from the excerpt, it doesn’t automatically solve the problem. I think the best solution might be to remove the accents from the search term and the excerpt, then do the highlighting, then somehow get the highlighting at the same spot in the excerpt with the accents intact – but that starts to get a little bit complicated.
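
    The "highlight in the folded text, then transfer the positions back" idea can be sketched as follows. This is only an illustration under stated assumptions, not Relevanssi code: the function names are hypothetical, the mbstring extension is assumed, and the folding table maps exactly one character to one character so that character offsets in the folded text equal offsets in the original (which rules out pairs like ß and ss). It also matches substrings, without the word-boundary handling of the real regexp.

```php
<?php
// Hypothetical sketch. Folds character-for-character so that offsets in the
// folded string line up with offsets in the original string. Assumes that
// lowercasing does not change the character count, which holds for the
// Latin letters used here.
function fold_char_for_char(string $text): string {
    $from = ['à', 'â', 'ä', 'ç', 'è', 'é', 'ê', 'ë', 'î', 'ï', 'ô', 'ö', 'ù', 'û', 'ü'];
    $to   = ['a', 'a', 'a', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'o', 'o', 'u', 'u', 'u'];
    return str_replace($from, $to, mb_strtolower($text, 'UTF-8'));
}

function highlight_ignoring_accents(string $excerpt, string $term): string {
    $folded_excerpt = fold_char_for_char($excerpt);
    $folded_term    = fold_char_for_char($term);
    $term_length    = mb_strlen($folded_term, 'UTF-8');
    if ($term_length === 0) {
        return $excerpt;
    }

    $result = '';
    $cursor = 0; // character offset of the first character not copied yet
    while (($hit = mb_strpos($folded_excerpt, $folded_term, $cursor, 'UTF-8')) !== false) {
        // Copy the untouched text before the hit, then the original (still
        // accented) characters of the hit, wrapped in <strong>.
        $result .= mb_substr($excerpt, $cursor, $hit - $cursor, 'UTF-8');
        $result .= '<strong>' . mb_substr($excerpt, $hit, $term_length, 'UTF-8') . '</strong>';
        $cursor  = $hit + $term_length;
    }
    // Copy whatever follows the last hit.
    return $result . mb_substr($excerpt, $cursor, null, 'UTF-8');
}

echo highlight_ignoring_accents('Un ingénieur et une Ingenieur.', 'ingenieur'), "\n";
```

    The complication mentioned above shows up as soon as a folding rule changes the length of the text: the offsets then no longer line up and a separate offset map would be needed.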

    The regex that handles the matching is /(\b$pr_term|$pr_term\b)/iu or /(\b$pr_term\b)/iu where $pr_term is the search term. As far as I can tell there’s no simple way to tell the regex matching engine that we don’t want to care about accents; it’s quite literal (making it case-insensitive, on the other hand, is really simple).
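
    A small demonstration of how literal the engine is, using the whole-word form of the regex quoted above. The use of preg_quote() on the term is an assumption made for this example; the thread does not say how $pr_term is prepared.

```php
<?php
// The i modifier folds case and the u modifier makes PCRE treat pattern and
// subject as UTF-8, but neither makes a plain "e" match an "é".
$pr_term = preg_quote('ingenieur', '/'); // assumption: the term is escaped
$pattern = '/(\b' . $pr_term . '\b)/iu';

// Same letters, different case: this matches.
var_dump(preg_match($pattern, 'Un INGENIEUR confirmé'));

// Same word with an accent: this does not match.
var_dump(preg_match($pattern, 'Un ingénieur confirmé'));
```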

    I’m not sure we need a filter here.
    I think the regexp should be changed to match accented terms as well. I see it this way:

    1. First get an accented version of the term, let’s say $a_pr_term
    2. Then change the handling regexp to /(\b$pr_term|$pr_term\b|\b$a_pr_term|$a_pr_term\b)/iu

    Simple, but the problem is getting the accented term… I’m not sure, but maybe a “did you mean” search would be interesting here… Otherwise, it is indeed a bit complicated.
    You said it is easier to get this from the database (MySQL), so that would also be a path to investigate…
    An article from a French developer: http://patisserie.keensoftware.com/en/pages/gerer-les-accents-dans-les-recherches-textes
    I didn’t read everything, but it could be a good source of inspiration, no?

    • This reply was modified 3 years, 9 months ago by jojaba.
    Plugin Author Mikko Saari

    (@msaari)

    Sorry for the delay, the forum doesn’t lift the old threads on top of the list anymore when there are new replies, so I don’t notice them. Blah.

    In any case, it’s easy to make both accented and non-accented search terms match non-accented terms in the database. Making non-accented search terms match the accented words in the database is the problem.

    One approach is to use a regexp to change $pr_term so that it matches both accented and non-accented characters, i.e. ingénieur becomes ing[ée]nieur, but that’s also a somewhat complicated process. Still, I think this is the most feasible solution, and it would probably be fairly easy to cover the most obvious cases.
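
    A rough sketch of that approach, assuming a short French-oriented table and the mbstring extension. The function name term_to_accent_classes() is hypothetical. Note that a character class lists its members directly, so no | is needed between them (inside square brackets a | would be matched literally).

```php
<?php
// Hypothetical sketch: turn a search term into a regex fragment in which
// every letter that has accented variants becomes a character class.
function term_to_accent_classes(string $term): string {
    // Base letter => all spellings treated as equivalent (assumed table).
    $variants = [
        'a' => 'aàâ',
        'c' => 'cç',
        'e' => 'eéèêë',
        'i' => 'iîï',
        'o' => 'oô',
        'u' => 'uùûü',
    ];
    // Reverse lookup: each variant character => its base letter.
    $base_of = [];
    foreach ($variants as $base => $chars) {
        foreach (preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $base_of[$char] = $base;
        }
    }

    $pattern = '';
    $letters = preg_split('//u', mb_strtolower($term, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($letters as $char) {
        $pattern .= isset($base_of[$char])
            ? '[' . $variants[$base_of[$char]] . ']'  // letter with variants
            : preg_quote($char, '/');                  // any other character
    }
    return $pattern;
}

$pr_term = term_to_accent_classes('ingénieur');
echo $pr_term, "\n";
var_dump(preg_match('/(\b' . $pr_term . '\b)/iu', 'Un ingenieur'));
```

    Because the term is folded to base letters before the classes are built, the accented and the non-accented spelling of the term produce the same pattern.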

    (Though there we run into a linguistic minefield: what can be considered equivalent? In Finnish, for example, “ä” and “a” should never be considered equivalent.)

    “Sorry for the delay, the forum doesn’t lift the old threads on top of the list anymore when there are new replies, so I don’t notice them. Blah.”

    No problem, I was busy anyway…
    Looking at a character entities table (here for example: https://dev.w3.org/html5/html-author/charref), I think this could be a nice way to find out which non-accented character matches an accented one.
    First, you convert the term with htmlentities, then you look for the &xxxxx; patterns and replace each pattern with its second character (which is the non-accented character). For example, “é” becomes “&eacute;”, and you can see that right after the & you have the “e”…
    About the ä, what do you use in your URLs then? “ae” like in German?
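
    A sketch of the entity idea, limited to the entity names that follow the "base letter plus accent name" scheme. The function name fold_via_entities() and the list of accent suffixes are assumptions made for this example.

```php
<?php
// Hypothetical sketch of the entity trick: "é" becomes "&eacute;", and the
// letter right after the "&" is the unaccented base character.
function fold_via_entities(string $text): string {
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter of accent-style entities (&eacute; -> e).
    $folded = preg_replace(
        '/&([A-Za-z])(?:acute|grave|circ|uml|tilde|cedil|ring);/',
        '$1',
        $encoded
    );
    // Turn every remaining entity (&amp;, &quot;, ...) back into a character.
    return html_entity_decode($folded, ENT_QUOTES, 'UTF-8');
}

echo fold_via_entities('ingénieur'), "\n";
echo fold_via_entities('Ä & ä'), "\n";
```

    This treats every accent the same way in every language, so “ä” is folded to “a” as well.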

    • This reply was modified 3 years, 9 months ago by jojaba.
    Plugin Author Mikko Saari

    (@msaari)

    No, replacing “ä” with “ae” is even worse, it looks absolutely ridiculous in Finnish. In URLs, it’s “a”, but nobody would ever replace “ä” with “a” in a search query, unless they’re using a non-Finnish keyboard.

    The HTML entities approach has the same issue: based on that, you would replace “ä” with “a”, and it’d still be wrong.

    But the proper solution would be some kind of translation table and a filter hook that lets users adjust it.

    I think this is feasible, at least to some extent.
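
    What a translation table with a filter hook might look like, as a sketch only. The hook name demo_accent_variations and the function of the same name are invented for this example and are not Relevanssi's API. add_filter() and apply_filters() are the real WordPress functions; the simplified stand-ins below exist only so that the snippet also runs outside WordPress.

```php
<?php
// Minimal stand-ins so the sketch runs without WordPress. Inside WordPress
// these definitions are skipped and the real functions are used.
if (!function_exists('add_filter')) {
    $GLOBALS['demo_filters'] = [];
    function add_filter($hook, $callback) {
        $GLOBALS['demo_filters'][$hook][] = $callback;
        return true;
    }
    function apply_filters($hook, $value) {
        foreach ($GLOBALS['demo_filters'][$hook] ?? [] as $callback) {
            $value = $callback($value);
        }
        return $value;
    }
}

// Hypothetical default table: base letter => spellings treated as equal.
function demo_accent_variations(): array {
    $table = [
        'a' => ['a', 'á', 'à', 'â', 'ä'],
        'e' => ['e', 'é', 'è', 'ê', 'ë'],
        'o' => ['o', 'ó', 'ò', 'ô', 'ö'],
    ];
    // Site owners can adjust the table through the (hypothetical) hook.
    return apply_filters('demo_accent_variations', $table);
}

// A Finnish site could stop "ä" and "ö" from matching "a" and "o".
add_filter('demo_accent_variations', function (array $table): array {
    $table['a'] = array_values(array_diff($table['a'], ['ä']));
    $table['o'] = array_values(array_diff($table['o'], ['ö']));
    return $table;
});

print_r(demo_accent_variations());
```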

    Plugin Author Mikko Saari

    (@msaari)

    Please give this version of lib/excerpts-highlights.php a go: https://dl.dropboxusercontent.com/u/9585896/excerpts-highlights.php

    Oh, thanks a lot!
    I tried it out with my “ingenieur” word and it worked fine, every “ingénieur” was highlighted!
    I’m pasting below the function you created to get the accent variations; I’m sure people from other countries will be able to add some more variations…
    Here’s the function Saari added:

    function relevanssi_add_accent_variations($word) {
       $word = str_ireplace(
          ['a', 'c', 'e', 'i', 'o', 'u', 'n', 'ss'],
          ['(a|á|à|â)', '(c|ç)', '(e|é|è|ê|ë)', '(i|í|ì|î|ï)', '(o|ó|ò|ô|õ)', '(u|ú|ù|ü|û)', '(n|ñ)', '(ss|ß)'],
          $word);
       return $word;
    }

    For those who don’t have any programming skills, a little explanation. Each string in the first pair of brackets (['a', 'c', 'e', 'i', 'o', 'u', 'n', 'ss']) is replaced by its accented variations from the second pair of brackets. For instance, 'a' is replaced by (a|á|à|â). The | means ‘or’, so this expression means a or á or à or â in the regexp (the pattern used to find the words to highlight). I’m not sure that was a clear explanation :/ So what you should do is add the accented variations used in your language. For example, ‘ä’ would be a variation for ‘ae’ in German.
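
    To see the function at work, here is a small usage sketch. The function body is copied from the post above so that the example runs on its own; the whole-word regex shape is the one quoted earlier in the thread, while the <strong> markup and the sample sentence are assumptions made for this example.

```php
<?php
// Copied from the post above so that this example is self-contained.
function relevanssi_add_accent_variations($word) {
   $word = str_ireplace(
      ['a', 'c', 'e', 'i', 'o', 'u', 'n', 'ss'],
      ['(a|á|à|â)', '(c|ç)', '(e|é|è|ê|ë)', '(i|í|ì|î|ï)', '(o|ó|ò|ô|õ)', '(u|ú|ù|ü|û)', '(n|ñ)', '(ss|ß)'],
      $word);
   return $word;
}

// Expand the unaccented term into a pattern that accepts the variations.
$pr_term = relevanssi_add_accent_variations('ingenieur');
echo $pr_term, "\n";

// The outer parentheses are group 1, so $1 is the whole matched word.
$excerpt = 'Un ingénieur et un INGENIEUR.';
echo preg_replace('/(\b' . $pr_term . '\b)/iu', '<strong>$1</strong>', $excerpt), "\n";
```

    As far as I can tell, the expansion only goes one way: an accented letter typed in the search term is left as it is, so a term entered as “ingénieur” would still not match “ingenieur” in the text.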

    Plugin Author Mikko Saari

    (@msaari)

    Great. I’ll add this to the next version, and will include a filter that lets users adjust the list if they find it necessary. I think this list of accents covers most of the European language needs.
