[tech-vi Announce List] Unicode Roman Numerals and Screen Readers – Terence Eden’s Blog


farhan israk
 



---------- Forwarded message ---------
From: David Goldfield <david.goldfield@...>
Date: Wed, Mar 15, 2023 at 9:42 PM
Subject: [tech-vi Announce List] Unicode Roman Numerals and Screen Readers – Terence Eden’s Blog
To: List <tech-vi@groups.io>


Unicode Roman Numerals and Screen Readers

How would you read this sentence out aloud?

"In Hamlet, Act Ⅳ, Scene Ⅸ..."

Most people with a grasp of the interplay between English and Latin would say "In Hamlet, Act four, scene nine". And they'd be right! But screen-readers - computer programs which convert text into speech - often get this wrong.

Why? Well, because I didn't just type "Uppercase Letter i, Uppercase Letter v". Instead, I used the Unicode symbol for the Roman numeral 4 - Ⅳ. And, it turns out, lots of screen-readers have a problem with those characters.

Unicode contains the range of Roman numbers from 1 - 10, plus a couple of compound numbers, 50, 100, 500, and 1000 - in a variety of forms.
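
As a rough illustration of that range, the sketch below lists the uppercase symbols using Python's standard `unicodedata` module. The helper name `roman_numeral_table` is made up for this example; the code point range U+2160 to U+216F is the uppercase half of the Roman numerals in the Number Forms block.

```python
import unicodedata

def roman_numeral_table():
    """Return (code point, character, Unicode name, value) for U+2160..U+216F."""
    rows = []
    for cp in range(0x2160, 0x2170):
        ch = chr(cp)
        rows.append(
            (f"U+{cp:04X}", ch, unicodedata.name(ch), int(unicodedata.numeric(ch)))
        )
    return rows

# Print one line per symbol, e.g. the code point, glyph, name and value.
for row in roman_numeral_table():
    print(*row)
```

The lowercase forms follow directly after, at U+2170 to U+217F, and carry the same numeric values.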

Screenshot of a Table of Roman numerals in Unicode.

Why does Unicode contain these numbers which, to most people, are just squashed-together Latin letters? As ever with Unicode, it is a mix of legacy and practicality.

The Unicode standard says <https://www.unicode.org/versions/Unicode6.0.0/ch15.pdf>:

Roman Numerals. For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters. However, the uppercase and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded for compatibility with East Asian standards. Unlike sequences of Latin letters, these symbols remain upright in vertical layout. Additionally, in certain locales, compact date formats use Roman numerals for the month, but may expect the use of a single character.

Far be it from me to disagree with the learned authors of the spec, but I think they may have erred slightly on this one. While it may be preferable to re-use Latin letters, it leads to ambiguity which can be confusing for a screen-reader.
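
The trade-off can be seen in Unicode normalization. The symbols carry compatibility decompositions, so NFKC normalization folds each one into its Latin-letter spelling, and after that the numeral "Ⅰ" can no longer be told apart from the letter "I". A minimal sketch, where `to_latin_letters` is a name invented for this example:

```python
import unicodedata

def to_latin_letters(text: str) -> str:
    """Fold compatibility characters, including Roman numeral symbols, via NFKC."""
    return unicodedata.normalize("NFKC", text)

print(to_latin_letters("In Hamlet, Act Ⅳ, Scene Ⅸ..."))
```

Note that NFKC is not specific to Roman numerals: it also rewrites other compatibility characters such as ligatures and full-width forms, so it is a blunt tool for this job.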

Let's write out the numbers using regular letters. Suppose you were talking about "Romeo and Juliet, Act III, Scene I". Most screen readers will see the "III" and correctly speak aloud "Roman three" or similar. But when they get to the "I" it becomes ambiguous. Most will read out "Eye".

Screen-readers rarely look at the whole sentence for context. Which means they get confused. It's fairly obvious that XIV should be "fourteen" as there's no English word "xiv" [1]. But what about "MIX" - is that 1009 or the word "mix"?
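
To make the ambiguity concrete, here is a small sketch of a strict parser for Roman numerals written in plain Latin letters. It is not how any particular screen-reader works; the function name `roman_value` and the choice to accept only uppercase values up to 3999 are assumptions made for the example.

```python
import re

# Well-formed numerals from 1 to 3999: thousands, hundreds, tens, ones.
_ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_value(word: str):
    """Return the value of a well-formed uppercase Roman numeral, else None."""
    if not word or not _ROMAN.fullmatch(word):
        return None
    total = 0
    # A letter is subtracted when the letter after it is worth more (IV, IX, CM...).
    for ch, nxt in zip(word, word[1:] + " "):
        value = _VALUES[ch]
        total += -value if _VALUES.get(nxt, 0) > value else value
    return total

for word in ("XIV", "MIX", "I", "HAMLET"):
    print(word, roman_value(word))
```

The parser happily accepts both "MIX" and "I" as numbers, which is exactly the problem: being a valid numeral says nothing about whether the writer meant one.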

Anyone who has watched the BBC knows about their fondness for displaying in Latin the year a programme was made. MCMXCVI is particularly challenging for a screen-reader!

I took the following sample sentence - using both letters and Roman numerals.

Text. In Hamlet, Act I, Scene XI the year is MCMXCVI and they are watching Rocky V.
Roman. In Hamlet, Act Ⅰ, Scene Ⅺ the year is ⅯⅭⅯⅩⅭⅥ and they are watching Rocky Ⅴ.
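
For reference, the Unicode version of the sample can be decoded mechanically, because every symbol carries its own numeric value. The sketch below is only an illustration and does not validate the numerals; `spell_out_numerals` is a hypothetical helper that replaces each run of Roman numeral symbols (U+2160 to U+217F) with digits.

```python
import re
import unicodedata

# Runs of the uppercase and lowercase Roman numeral symbols.
_RUN = re.compile("[\u2160-\u217F]+")

def _run_value(run: str) -> int:
    """Add up the symbols' values, subtracting one that precedes a larger one."""
    values = [int(unicodedata.numeric(ch)) for ch in run]
    total = 0
    for i, value in enumerate(values):
        nxt = values[i + 1] if i + 1 < len(values) else 0
        total += -value if nxt > value else value
    return total

def spell_out_numerals(text: str) -> str:
    """Replace each run of Roman numeral symbols with its decimal value."""
    return _RUN.sub(lambda m: str(_run_value(m.group())), text)

sample = "In Hamlet, Act Ⅰ, Scene Ⅺ the year is ⅯⅭⅯⅩⅭⅥ and they are watching Rocky Ⅴ."
print(spell_out_numerals(sample))
```

Whether a year should then be spoken as "nineteen ninety six" or "one thousand nine hundred and ninety six" is a separate decision for the speech engine.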

Here's how various services coped:

First, the good news. Amazon's Polly read the Roman numerals perfectly. It even pronounced ⅯⅭⅯⅩⅭⅥ as "nineteen ninety six".

But it gets rather confused with the ambiguous English text.

I tried with Microsoft Edge's Read Aloud TTS <https://pypi.org/project/edge-tts/>.

It makes a bit of a hash of the English and just skips the Roman numerals.

The same was also true with Google's TTS products <https://cloud.google.com/text-to-speech/>.

The venerable Linux utility espeak-ng <https://github.com/espeak-ng/espeak-ng> came out with this.

It gets the "Capital i" incorrect, and reads the Roman numerals as their Unicode code points.

My good friend Léonie Watson <https://tink.uk/about-leonie/>, who writes extensively about accessibility <https://tink.uk/>, was kind enough to record some other samples for me.

Here are Jaws' "Expressive":

And Jaws' "Eloquence":

Léonie also provided a recording of NVDA Microsoft One Core

And here's Narrator making a right mess of it.

If you know of any other screen-readers, or text-to-speech engines which can cope with this, please let me know!

On Linux, I raised a Pull Request to fix espeak-ng <https://github.com/espeak-ng/espeak-ng/pull/1672>.

The rest of the services don't seem to have a way to easily report bugs to them. If you know a way to raise issues with these screen readers - please do so!


  1. I'm sure there's some obscure Scrabble word, but we're talking everyday use here. 



     David Goldfield
Assistive Technology Specialist


Subscribe to the Tech-VI announcement list to receive news and updates regarding the blindness assistive technology space
Email: Tech-VI+subscribe@groups.io


Brian's Mail list account
 

I doubt there will be any resolution on this one: unless the system can encode the Unicode on the fly as you type, it will never get read, even if screen readers are altered.


Speech synths can also alter the whole thing of course.
Abbreviations also screw us up. I mean, the UK postcode KT34 2NY often comes out as 2 New York.
Brian
--
bglists@...
Sent via blueyonder.(Virgin media)
Please address personal E-mail to:-
briang1@..., putting 'Brian Gaff'
in the display name field.

----- Original Message -----
From: "farhan israk" <fahim.net.2014@...>
To: <chat@nvda.groups.io>
Sent: Saturday, March 18, 2023 6:24 PM
Subject: [NVDA Chat] [tech-vi Announce List] Unicode Roman Numerals and Screen Readers – Terence Eden’s Blog


---------- Forwarded message ---------
From: David Goldfield <david.goldfield@...>
Date: Wed, Mar 15, 2023 at 9:42 PM
Subject: [tech-vi Announce List] Unicode Roman Numerals and Screen Readers
– Terence Eden’s Blog
To: List <tech-vi@groups.io>



https://shkspr.mobi/blog/2023/03/unicode-roman-numerals-and-screen-readers/

Unicode Roman Numerals and Screen Readers

- By @edent <https://edent.tel/> on 2023-03-15



David Goldfield
Assistive Technology Specialist


Subscribe to the Tech-VI announcement list to receive news and updates
regarding the blindness assistive technology space
Email: Tech-VI+subscribe@groups.io

WWW.DavidGoldfield.com


 

I really don't quite know why this particular blog post is suddenly getting so much attention and traction, particularly since it involves obscure unicode characters that are very seldom used in actual practice.  There are scads of unicode characters that have no "speakable name" and in actual text almost no one uses unicode for Roman numerals.

Someone posted a link to this blog post on the main group, and my reply is in message  , and largely mirrors your own.
--

Brian Virginia, USA Windows 11 Pro, 64-Bit, Version 22H2, Build 22621; Office 2016, Version 16.0.15726.20188, 32-bit; Android 12 (MIUI 13)

Let me hasten to add that I *do* like cologne.  I just much prefer it as a subtle hint instead of an aromachete.

        ~ Clay Colwell