Non-keyboard (unicode) characters
-
I inherited control of a website and, sadly, my predecessor passed away.
As well as my main site, I have a sandbox and the two sites are exhibiting different behaviours regarding non-keyboard characters. For example, the word Māori (New Zealand native) should be spelt with a macron over the “ā” – unicode 257 (hex 0101).
If I code it on my sandbox, the character shows correctly whereas on my live site, it displays as a question mark.
Clearly there must be a setting somewhere but I can’t work out where.
Does anyone have any ideas?
-
It would have been useful if you could have directed us to a specific place in the site where you’re seeing a question mark in place of the proper Unicode glyph.
The usual reason for such behaviour is that the font in use does not have the correct glyph defined. Switching to a font with broader Unicode support usually resolves the problem.
Your site’s regular body text has a long cascade list of possible fonts to use. It’s unclear which is the one you’re actually seeing. It depends on what system fonts you have installed. My system font that’s used for your site displays “Māori” just fine as body text. But I surely have different system fonts than you.
You can use your browser’s element inspector tool to limit which font can be used for the font-family in the current view. Changes in the tool will not persist, so feel free to experiment. Determine which font has the problem, then eliminate it from the cascade list. Body text font-family is declared as inline styling in a style block with ID “generate-style-inline-css” (line 46 in HTML source view). It’s likely something your theme provides. If you cannot determine where to edit this, I suggest seeking further guidance through your theme’s dedicated support channel.
It might be possible to override the style without editing the source by adding an override rule to the Additional CSS section of the customizer or style book.
Many thanks for the reply. It’s my first post here and I wasn’t sure of the protocol.
Take a look at comment #1 (and #2) here:
And here’s an example from my sandbox.
I hope this helps
“M?ori” in comment 1 is encoded with a “?”, not an “ā”. Similarly for comment 2, “Maori” is encoded with a normal “a”. This was likely typed in this way, since people without the proper keyboard often ignore diacritics because using the right diacritic is cumbersome. I don’t know why “?” got encoded instead of “ā”. It may have been inadvertently copied from elsewhere that way.
Using added CSS in my browser’s element inspector tool, I added “Māori” to the end of all of your site’s paragraphs with this:
p:after { content: 'Māori'; }
The “ā” displays correctly. I think the font used in my case is my browser’s default sans-serif, since I don’t have any of the system fonts named in the cascade before “sans-serif”.
There is a chance there’s something in your site that replaces “ā” with “?”, but it’s abnormal behavior. Using Māori in a comment on my site (2024 theme, no plugins) accurately displays the correct “ā”. Try deactivating all plugins and switching to 2024 theme. Add a test comment. If the problem still persists, I’m fairly certain nothing in WP is doing something odd and the issue is with the system font your browser is using. It could be any one of the ones listed in this CSS cascade:
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol";
You can edit this list in your browser’s element inspector tool. Try specifying each alone in turn to see which is problematic. Only test fonts you know are installed on your computer, or the browser will fall back to its default. If the issue is with your browser’s default font, you should select a different default in your browser settings. If you find a non-default font in the cascade list that’s an issue, I recommend permanently removing it from the source CSS.
Thanks bcworkz for your continued interest. I created a test post.
In the “text” edit panel, I entered 'Māori' as you suggested and it showed correctly in the “visual” edit panel.
But as soon as I clicked the “preview” button it changed to M?ori. And as you can see, it shows M?ori in my post.
Does “Māori” still appear correctly in the editor? I expect so, which means something is replacing the “ā” with a “?” on output. This is behavior I cannot replicate on my site, which indicates something in your theme or plugins is doing this.
We’re back to my earlier suggestion:
Try deactivating all plugins and switching to 2024 theme.
If “Māori” appears correctly in the editor, it should now also appear correctly on your site. Restore your normal theme and plugins, one at a time, until the problem returns. The last activated module would be the cause.
Be aware that caching can confuse this investigation. Either disable all caching or clear caches for each test step.
If by chance it’s now also wrong in the editor, then the substitution is being done on save, so the above investigation would involve re-editing the post instead of just checking the output.
It’s a complete mystery why any code would do this substitution on a UTF-8 page, yet it’s happening. It’s less of a mystery why “é” is OK but “ā” is not. The “ā” occurs in a lesser-used code block than the fairly common block where “é” appears. We used to see code like this back in the days before UTF-8 was common and many fonts had limited character support beyond basic Latin. There’s no place for this behavior with UTF-8 and modern fonts.
I tried disabling all plugins and it made no difference.
I then wondered if the character set and/or collation name were different, so I ran the following queries but the results are the same.
=======LIVE=SITE=======
Signon test/ https://www.fifteensquared.net/wp-content/plugins/wp-phpmyadmin-extension/lib/phpMyAdmin_J0g4m7YbnTB6RPXLEpOVizH/index.php?route=/server/sql
Showing rows 0 – 0 (1 total, Query took 0.0002 seconds.)
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'fifteens_wrdp1';
latin1 latin1_swedish_ci
========SANDBOX========
Signon test/ https://sandbox.fifteensquared.net/wp-content/plugins/wp-phpmyadmin-extension/lib/phpMyAdmin_rntL6JXzaQlk7e8mSUCDW0B/index.php?route=/server/sql
Showing rows 0 – 0 (1 total, Query took 0.0002 seconds.)
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'fifteens_wp01';
latin1 latin1_swedish_ci
That was a good thought! Latin1 actually is the wrong charset, but I think what has happened is WP is saving data as a UTF-8 byte stream and the DB engine is simply saving it verbatim as Latin1 text. Any characters greater than the basic ASCII values likely do not display correctly in phpMyAdmin, but when the data is sent back to WP, it’s treated like a UTF-8 byte stream by WP so any encoding errors cancel out.
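To illustrate the "errors cancel out" idea, here is a minimal Python sketch. It is only a model of the theory above, not a trace of what WP or the DB engine actually does: Python's "latin-1" codec stands in for the database charset, and the function names are made up for the example.

```python
# Sketch (assumption): Python's "latin-1" codec models the latin1 column.
# "M\u0101ori" is the word with an a-macron (code point 257).

def store_as_latin1(text: str) -> str:
    """What a latin1 viewer would show when handed UTF-8 bytes verbatim."""
    utf8_bytes = text.encode("utf-8")    # the byte stream the application sends
    return utf8_bytes.decode("latin-1")  # same bytes, labelled as latin1 text

def read_back_as_utf8(stored: str) -> str:
    """What the application sees when it treats the stored bytes as UTF-8."""
    return stored.encode("latin-1").decode("utf-8")

word = "M\u0101ori"
stored = store_as_latin1(word)

print(ascii(stored))                      # two odd characters replace the a-macron
print(read_back_as_utf8(stored) == word)  # True: the round trip is lossless
```

The stored form looks wrong in a tool that believes the latin1 label, but no bytes are changed, which is why the page can still render correctly.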
If the problem persists with a default theme and no plugins, there’s apparently something off with your installation’s core code. I recommend performing a manual update, even if it’s to the same version. The point is to replace all core files with those from a fresh download.
Once everything is resolved and working correctly, you might consider switching to utf8mb4 charset. The collation is much less important, but typically we’d want to see utf8mb4_unicode_ci. Collations can be freely altered, but be aware there are separate collation settings at the DB level, the table level, and column level. They don’t have to all be the same but unless you have clear reason to do otherwise they probably should all be the same.
Changing charsets is a different issue altogether. If WP saved UTF-8 byte streams into a Latin1 charset, you absolutely must not attempt to convert the data. That’ll make a big mess of everything. Instead, you want to change the charset setting without converting the data. I’m not sure how that is done, you’ll need to do some research. Most importantly, fully backup what you have prior to making changes like this.
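A small Python sketch of why a conversion would make a mess, assuming the stored bytes really are UTF-8 under a latin1 label. The helper name is hypothetical and Python's "latin-1" codec is only a stand-in for what a database conversion would do.

```python
# Sketch (assumption): a "convert latin1 to UTF-8" step applied to bytes
# that are already UTF-8. "M\u0101ori" is the word with an a-macron.

def convert_latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes in the belief that each byte is one latin1 character."""
    return raw.decode("latin-1").encode("utf-8")

original = "M\u0101ori".encode("utf-8")       # already valid UTF-8
converted = convert_latin1_to_utf8(original)  # each high byte is encoded again

print(original)                                   # b'M\xc4\x81ori'
print(converted)                                  # b'M\xc3\x84\xc2\x81ori'
print(converted.decode("utf-8") == "M\u0101ori")  # False: double-encoded text
```

Changing only the charset label leaves the bytes as they are, which is the outcome wanted here.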
The plot thickens. I am now able to add special characters in replies. There are so many variables here that I can’t be sure if it was always so. Take a look at this post, where you can see one reply with an a-macron and a couple of copyright symbols. These were entered as HTML entity codes for ā and © and ©, not as literal characters.
This reply was modified 1 year ago by kenmac54. Reason: Better grammar
I think I can explain that behavior. I’m going on the theory that some undesirable PHP code somewhere is replacing diacritic chars with a “?”. When you use an HTML entity code for “ā” (such as &#257;), this undesirable PHP code does not see it as a diacritic char. It’s just a string of normal chars. Eventually your browser replaces this entity code with an “ā”, long after this undesirable PHP code has done any replacements. The problem only occurs when you try to use a literal “ā” char.
You’ve discovered a usable workaround to use until the root cause is resolved, but obviously using entity codes is far from ideal.
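For anyone wanting to see why the entity-code workaround slips through, here is a rough Python sketch. It is an illustration only: the helper name is made up, and Python's `html.unescape` stands in for what the browser does with the entity.

```python
import html

# Sketch: a numeric entity is plain ASCII, so a latin1-limited step
# has nothing to replace. "M\u0101ori" is the word with an a-macron.

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

escaped = to_numeric_entities("M\u0101ori")

print(escaped)                                 # M&#257;ori
print(escaped.encode("latin-1"))               # encodes cleanly, nothing is lost
print(html.unescape(escaped) == "M\u0101ori")  # True: decoded back at display time
```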
Curiously, this undesirable PHP code seems to be allowing some diacritic chars like “é”. I think it’s allowing anything with an entity code less than that of “Ā” (&#256;), which coincidentally is the limit of the Latin1 charset. But as your sandbox site demonstrates, this is not a real limitation. But I think it points to what the intent of this code was.
Thanks for your help bcworkz.
I’ve finally got things sorted. It was all down to how some tables had been defined. The post and comment tables were defined as “latin1” and, using phpMyAdmin, these were changed to “utf8”.
We took a backup before the update and posted a notice warning that any changes made in a given two-hour timeframe could be lost.
All looks good now.
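For later readers, the symptom in this thread matches what happens when text is forced into latin1: characters it can represent survive, anything else becomes a question mark. A minimal Python sketch of that, assuming Python's "latin-1" codec with replacement as a stand-in for the table charset (the function name is made up for the example):

```python
# Sketch (assumption): forcing text through latin1 with replacement.
# "\u0101" is a-macron (code point 257); "\u00e9" is e-acute (code point 233).

def through_latin1(text: str) -> str:
    """Characters outside latin1 (code points above 255) become '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")

print(through_latin1("M\u0101ori"))         # M?ori
print(ascii(through_latin1("caf\u00e9")))   # 'caf\xe9', the e-acute survives
print(ord("\u0101"), ord("\u00e9"))         # 257 233
```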