regular expression and speech dic


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Hi,
I have the following problem with the speech dictionary.
I created a speech dictionary entry with the regular expression rule "lv(^a-z)" and a replacement of "level".

It turns out that for something like lv12, it says "level 2" rather than "level 12".
Could anyone advise me on what I am doing wrong here?
Thanks.
William


 

William,

          Your problem is that the "lv" matches those two characters exactly, but the negated character class (properly written with square brackets, "[^a-z]") matches exactly ONE character that is not lowercase a through lowercase z. That one character, the "1" in "lv12", is consumed along with the "lv" and replaced by "level", which is why only the 2 is left to be spoken.

           If you want something of the form "lv12" pronounced as "level 12" you would use the regular expression:   lv([0-9]+)
and the substitution would be:   level \1

           The regular expression I gave says: match the characters "lv" literally, then match one or more digits, saving them for later use, which is what the parentheses around that part of the expression do.   Since you know the "lv" is to be pronounced "level," there's no need to capture it, but you need the digit sequence, whatever it may be, to be spoken, and the parentheses allow it to be referred to in the substitution as a unit, "\1" [the first, and here the only, capture group; groups are numbered by their opening parentheses, so the unparenthesized "lv" does not count].
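
For anyone who wants to try this outside NVDA, here is a minimal sketch using Python's standard re module. It assumes the speech dictionary behaves like re.sub, which is not confirmed here; apply_rule is a hypothetical helper name used only for illustration. Note that with a single pair of parentheses in the pattern, the captured digits are group 1.

```python
import re

def apply_rule(pattern, replacement, text):
    """Apply one dictionary-style regex rule to text (illustrative helper only)."""
    return re.sub(pattern, replacement, text)

# Rule with a bare negated class: it consumes "lv" plus exactly one
# non-letter, so the "1" disappears together with the "lv".
print(apply_rule(r"lv[^a-z]", "level", "lv12"))        # level2

# Rule with a capture group: the whole digit run is saved as group 1
# and written back after the word "level".
print(apply_rule(r"lv([0-9]+)", r"level \1", "lv12"))  # level 12
```

The first result is what a synthesizer would read as "level 2"; the second keeps the full number.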

--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  

    A little kindness from person to person is better than a vast love for all humankind.

           ~ Richard Dehmel

 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Brian, you are so smart for helping me, many thanks.

In fact, the reason I created such a regular expression is that my first speech dictionary entry didn't work.
I first created a whole-word speech dictionary entry for "lv" with the replacement "level",
and found that most of the time lv is spoken as "level", except when lv is immediately followed by numbers.
Therefore I want to create a regular expression that covers every "lv" that is not followed by an English word.
 


 
 


 

 



 

William,

           You are quite welcome.  There was a time, now a bit over 20 years past, when I became a "regular expression guru" of necessity and was using them for complex search matches on a routine basis.   I probably couldn't read and understand some of what I actually created back then, but for the "simpler stuff" I can still generally churn out a regular expression that matches what's desired and nothing else.
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


JM Casey <crystallogic@...>
 

I’m really interested in learning and mastering this myself. It’s one of my many computer-related ongoing projects. *grins*

 

 


 

 


 

On Thu, Aug 23, 2018 at 01:52 PM, JM Casey wrote:
I’m really interested in learning and mastering this [regular expressions] myself.
Good luck with that!!    And I say that both sincerely and with a huge dose of snark, because almost everyone who delves into regular expression syntax comes to believe, and very quickly, that mastery in any conventional sense of the word is unattainable!    You can become incredibly proficient and know more than anyone else you know, and there's still more you don't know.

I would love to know what perverse genius first thought up the concept of regular expressions, which come as close as any formal pattern matching syntax I've seen to being able to be "almost human" in the way alternatives in the original text can all be recognized, but those regular expressions are often nightmarish to look at and understand.
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

But I don't understand why my regular expression rule didn't work:
lv[^a-z]
The "lv" part is no problem.
"[]" groups the ^a-z into a character class, where ^a-z matches anything except a-z.
So the rule is intended to match any "lv" that is not followed by a letter.
However, when the rule's replacement is "level",
lv12 should read as "level 1 2", but it reads "level 2" and drops the 1.
For lv2, it reads only "level" and drops the 2.

I don't understand what is wrong with my rule.
Thanks.
William


 

William,

           You need to read up more on regular expression syntax.  As I explained earlier, the "lv" matches the string "lv" and [^a-z] matches one, and only one, character that is not in the range lowercase a to z.  So your expression, when the string being matched is "lv12", matches "lv" and then the "1", and nothing more.  You must put a quantifier after a character class if you want it to match zero or more characters, which would be '*', zero or one, which would be '?', or one or more, which would be '+'.

           Your regular expression matches the lv and eats it, then matches a SINGLE character that follows "lv" that is not a character between lowercase a and z, and that's all.  It "eats" that character too as part of the match, so "lv1" is replaced by "level" and you are left with only the 2.

           You also need to understand that regular expressions "eat" the things they match, and if you want to use those things later you MUST enclose the matching sequence in parentheses to refer to it in the replacement.  In the case of 'lv' you can safely toss that away because you know you want "level" for that regardless of what follows it.  You can't say the same of the digit sequence that follows the lv, which you most likely want to have read out as the number it represents.

Again, I offer what I said last night:
--------------------------------------------------------------------------
           If you want something of the form "lv12" pronounced as "level 12" you would use the regular expression:   lv([0-9]+)
and the substitution would be:   level \1

           The parentheses save the digit sequence so that the substitution can refer to it as "\1", the first capture group.  The "lv" needs no parentheses because you already know it is to be pronounced "level."
--------------------------------------------------------------------------
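
To see the effect of the quantifiers side by side, here is a small sketch in Python's re module (assumed, not confirmed, to match what NVDA uses). The helper name matched_text is hypothetical; it simply reports how much of the input a pattern consumes from the start of the string.

```python
import re

def matched_text(pattern, text):
    """Return the portion of text consumed by pattern at the start, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# Same character class, four different quantifiers.
for pattern in (r"lv[^a-z]", r"lv[^a-z]?", r"lv[^a-z]*", r"lv[^a-z]+"):
    print(pattern, repr(matched_text(pattern, "lv12")), repr(matched_text(pattern, "lv")))
```

With no quantifier or with '?', only "lv1" is consumed from "lv12"; with '*' or '+', the whole "lv12" is consumed. On a bare "lv", the forms with '?' and '*' still match, while the other two do not.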
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Haha, oh, so that is the problem.
I refined your suggestion to
(lv)([^\u4e00-\u9fa5a-z0-9]+)
where \u4e00-\u9fa5, which I found on the web, is the code range for Chinese characters, and the replacement is level\2.
It turns out that
the lv in 你的lv是? is read as level
but the lv in lv121 is not.

I wish to adopt your suggestion, but I have to include the option of no Chinese, no english and no numbers follow by the "lv"

 
lv 121






 

 



 

William,

          First, try the regular expression for the match:     (lv\s*)([0-9]+)
with the same substitution:  level \2

          If that does not work, then please provide at least 5 examples of strings that should be matched, including ones with Chinese characters if those are to be matched, and I'll have to figure out the refinements.
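
A quick way to check this rule before putting it into the dictionary is sketched below, assuming Python's re module behaves the same way as NVDA's dictionary engine. RULE and expand are hypothetical names for illustration. Because this pattern has two pairs of parentheses, the digits are group 2.

```python
import re

# Group 1: "lv" plus any whitespace after it. Group 2: the digit run.
RULE = re.compile(r"(lv\s*)([0-9]+)")

def expand(text):
    """Replace lv<digits> or lv <digits> with 'level <digits>'."""
    return RULE.sub(r"level \2", text)

print(expand("lv12"))    # level 12
print(expand("lv 121"))  # level 121
print(expand("你的lv?"))  # unchanged: no digits follow the lv
```

The rule only fires when digits follow the "lv", so an "lv" followed by punctuation or Chinese characters is left as it is.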
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


 

On Fri, Aug 24, 2018 at 12:18 AM, Mr. Wong Chi Wai, William wrote:
I wish to adopt your suggestion, but I have to include the option of no Chinese, no english and no numbers follow by the "lv"
Here you are saying that you need to match something followed by "lv" rather than prefixed with "lv", which is best done with a second regular expression.  Please provide a number of examples, including Chinese characters, of things you'd like to have matched that would be followed by lv.   If there can be intervening whitespace between the prefixing characters and the following "lv" then definitely include examples of such.

The ability to do regular expression matching and substitution is one of NVDA's most powerful features "under the hood," and very few have any idea that it exists or how it's used, so this is a worthwhile discussion to have.  I can't recall one about regular expressions occurring on the group before.
 
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Thanks.
I adopted your rule, but it does not match the following:

你的lv?
我的lv是12
我maxlv了



 

 



 

William,

I'm not asking for a phonetic transcription, but a "how would you like those broken apart" description for how the three examples you provided, which follow, should be read (with my guesses afterward):

你的lv?                 你的 level  (and is that question mark the end of a sentence)
我的lv是12           我的 level  是 12
我maxlv了            我 max level 了

This gets very tricky if the first example is the end of a sentence while the others are not.  Also, do you happen to know if the Unicode code points for the entire Chinese character set are sequential, like those for, say, a-z or 0-9 are?

In that last example, would it only be "max" and "min" in the three character positions before lv when used?

The lack of whitespace in these forms makes them trickier to accurately snag and parse apart for substitution for the voice synthesizer.  They look like something that would be unpronounceable without being broken into several units.
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


 

William,

           If my presumptions in the last message are correct, give the following a try for the regular expression:

                                  ([\u4e00-\u9fa5]+)(max|min)*(lv)([\u4e00-\u9fa5]*)

           and the following for the substitution:
                                  
                                  \1 \2 level \4


Note 1:   The above is only for the style you showed that includes Chinese characters with an optional "max" or "min" and "lv".    It will not capture a simple lv followed by numbers.  You should keep the earlier regular expression and substitution for those instances.

Note 2:  The above regular expression presumes Python regex syntax.  Since I know that NVDA add-ons are coded in Python I'm presuming this is the regex syntax NVDA would use.  If not, then I would need to know which version of regex syntax is being used.
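
As a rough check of the expression and substitution above against the three examples, here is a sketch under the same assumption as Note 2 (Python regex syntax), run with a current Python 3 interpreter. CJK_RULE and expand_chinese are hypothetical names. One behavior worth knowing: since Python 3.5, a group that did not take part in the match (here the optional "max"/"min" group) is substituted as an empty string, which leaves a doubled space in the output; older versions raise an error instead.

```python
import re

# Groups: 1 = leading Chinese run, 2 = optional max/min,
#         3 = the literal "lv",   4 = trailing Chinese run (may be empty).
CJK_RULE = re.compile(r"([\u4e00-\u9fa5]+)(max|min)*(lv)([\u4e00-\u9fa5]*)")

def expand_chinese(text):
    """Insert spaces around the pieces and speak lv as 'level'."""
    return CJK_RULE.sub(r"\1 \2 level \4", text)

for sample in ("你的lv?", "我的lv是12", "我maxlv了"):
    print(repr(expand_chinese(sample)))
```

The third example comes out as "我 max level 了". The first two come out with two spaces before "level", because group 2 is empty there; a synthesizer would normally not voice the extra space.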
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Brian
Thanks.
Well, this is how Chinese works. Chinese normally has no spaces between words. Unlike English, which has lots of spaces, Chinese has almost none.

For the examples I gave, I mainly want to capture every "lv" that stands apart from other English, i.e. not preceded or followed by any other English word, and at the same time capture any "lv" that sits between Chinese words or immediately before numbers; that is what the three examples were meant to show. Sorry for the lack of explanation in the previous email.


 

 

 



 

William,

        Everything's good, your command of English is far, far, far, far [. . .] better than my command of Chinese.

        If the last regular expression I offered does not catch what you need it to catch and/or the substitution does not break it into the appropriate units, what would be best next is to have examples of what it did not capture and how those examples should be split apart into words to be fed to the synthesizer.  I will always presume that "lv" is meant to be replaced with the English word "level."
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  

    The psychology of adultery has been falsified by conventional morals, which assume, in monogamous countries, that attraction to one person cannot co-exist with a serious affection for another.  Everybody knows that this is untrue. . .

           ~ Bertrand Russell

 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

I finally got it working like this:
(lv)([0-9\u4e00-\u9fa5]+|[^a-zA-Z])
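
The substitution that goes with this expression is not stated, so the sketch below only inspects what the second group captures for a few inputs, assuming Python 3's re module. FINAL_RULE and following are hypothetical names used for illustration.

```python
import re

# Group 2 is either a run of digits/Chinese characters,
# or any single character that is not an ASCII letter.
FINAL_RULE = re.compile(r"(lv)([0-9\u4e00-\u9fa5]+|[^a-zA-Z])")

def following(text):
    """Return what group 2 captures after the first matching 'lv', or None."""
    m = FINAL_RULE.search(text)
    return m.group(2) if m else None

for sample in ("lv12", "我的lv是12", "你的lv?", "我maxlv了", "lvl", "lv"):
    print(repr(sample), "->", repr(following(sample)))
```

An "lv" followed by an ASCII letter (as in "lvl") is skipped, which is the intent. An "lv" at the very end of the text is also skipped, because the second group requires at least one character after it.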


 

 



 

Glad to hear you've got it solved.

That second capture group is interesting, that's for sure.

--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  


 

 


Mr. Wong Chi Wai, William <cwwong.pro@...>
 

Thank you very much for your advice.
As for what you have taught me, e.g. the use of (), I couldn't find such information on the internet.




 

 



 

You're quite welcome.

I love regular expressions, and finding "just the right one" to fit a variety of patterns is a really fun puzzle to me!
--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  
