The Inside Story of NVDA: parts that make up the NVDA screen reader #NVDA_Internals


 

Hi all,

Note: parts of this story include actual NVDA source code. For most of you, this might be the first time you are actually seeing the source code of a screen reader (written in the Python programming language). I think it is important to let you see some source code so you can better understand what’s going on and so I can use it as a base for explaining several concepts and NVDA mechanics. Also, parts of this story come from the technical design overview document included in the NVDA source code repository (github.com/nvaccess/nvda).

This is the first story where we get to see NVDA in action: the components that make up a screen reader, specifically NVDA. After going deep into reminders and definitions in the last Inside Story, I think a story about NVDA components can help us transition to more practical matters and lay the foundation for talking about feature internals. But to recap what we have seen so far:

  • A screen reader is not a productivity tool (we saw why last time).
  • In essence, a screen reader (or I should say, today’s screen reader) is a sophisticated information processor that gathers, interprets, and presents the screen content the user is focused on. The last part (focused on) will be important when we get into events, more specifically selective UIA event registration and how it works and why it helps. I would like to add “represents” to the steps performed, as we will see in this story.
  • Developing a screen reader is an acknowledgement of the social and cultural possibilities and constraints of disability and assistive technology. As I explained in the second story (what a screen reader is and is not), accessibility advocacy still matters because software is designed to convince humans, not machines (this is why people talk about videos, icons, buttons, and even assistive tech such as screen readers as digital artifacts).

 

I will come back to what I wrote above throughout the Inside Story series. With that out of the way…

 

NVDA is an information processor. As the name implies, a processor is software or hardware that processes something. As an immediate example, on a computer or a smartphone there is a chip, made up of millions to billions of silicon transistors, that calculates all sorts of things in a small fraction of a second. Without this calculating chip, computers would be just a collection of metal and wires. This is why we call this chip the “central processing unit,” or CPU for short (there is an ongoing discussion on another forum comparing Intel and AMD processors, but in essence, processors from both brands operate on the same principle: silicon transistors used to calculate many things at once); the actual internals of CPUs are beyond the scope of this forum, but suffice to say that CPUs consist of billions of silicon switches that are turned on or off based on specific rules the chip understands.

 

Or let’s take a software example: media players. Media player programs such as VLC are experts in encoding and decoding multimedia file formats. These programs read an audio or video file, process it (determine the file format, use rules to turn it into something that can be played on a screen and/or through speakers), then play it. Just like a CPU, software such as media players consists of many components designed to work together to accomplish something.

 

In the same way, NVDA comes with many components designed to accomplish its goal of information processing. One component is responsible for representing screen content in a way that users and the screen reader can understand. Another set of components represents how users enter input into their computers, still others respond to events happening on screen, others help NVDA interpret screen content based on built-in rules and customizations by users, some are responsible for dealing with special situations such as web browsers, and another set of modules is responsible for output such as speech synthesis and braille display output. NVDA also comes with its own graphical widgets powered by a GUI toolkit, needs a way to talk to Windows and other apps using operating system facilities, and provides mechanisms to install, run, enable or disable, and remove add-ons (and, one day, update add-ons).

 

The specific components included in NVDA are:

  • Representing GUI controls: a collection of modules named “NVDAObjects” defines NVDA’s representation of a screen control. At the bottommost layer is an idealized screen control named “NVDA object” (NVDAObjects.NVDAObject), defining properties such as name and role and event handling basics that are common to all screen controls. This object is also responsible for recording developer info upon request (NVDA+F1), with other layers appending more specific information. Not all NVDA object mechanics are defined in the base NVDA object because NVDA needs to handle accessibility APIs (see below). Another basic NVDA object is the window object (NVDAObjects.window.Window), the base representation of an actual screen control (buttons, checkboxes, document fields, and even app windows are all windows), which provides window facilities (such as recording the window handle for the screen control) that accessibility APIs and other components rely on.
  • Accessibility API support: NVDA can handle accessibility APIs such as Microsoft Active Accessibility (MSAA/IAccessible), IAccessible2, Java Access Bridge (JAB), and UI Automation (UIA). These API support modules are scattered throughout NVDA source code, divided into API and handler definitions and NVDA objects representing controls powered by these APIs. For example, to support MSAA, a package named “IAccessibleHandler” comes with event definitions and workarounds for both MSAA and IAccessible2, along with objects representing MSAA controls housed inside the NVDAObjects.IAccessible package. Inside the NVDAObjects.IAccessible package is the idealized MSAA object (NVDAObjects.IAccessible.IAccessible) that extends the idealized and window NVDA objects and adds procedures to obtain information from the MSAA API. Similarly, UIA support comes from the handler module named “UIAHandler” and UIA objects stored in NVDAObjects.UIA. NVDA objects that are part of NVDA and serve as representations of accessibility API controls are collectively termed “API classes”.
  • Custom controls from applications and add-ons: if NVDA were limited to recognizing only controls powered by accessibility APIs, the screen reader would be severely limited. Therefore, NVDA allows app modules and global plugins to extend built-in objects or create new ones. These are called “overlay classes” and are specific to the task at hand (a minimal app module sketch appears after this list). A special type of object, called a “tree interceptor”, is used to let one NVDA object handle events and changes coming from controls contained within it (treating a tree of NVDA objects as a single NVDA object); the tree interceptor is the main mechanism behind browse mode and other complex document interactions.
  • Input mechanisms: in theory, NVDA can work with all sorts of input tools such as keyboards, mice, touchscreens, joysticks, you name it. But since there are a limited number of “universal” input tools (keyboards, for example), NVDA handles common input sources, namely keyboard, mouse, touchscreen, and braille display hardware (there was an attempt to let NVDA respond to voice input). Every piece of input is termed a “gesture”, with a related term being “script” (a script, in the NVDA world, is a command that runs when a specific gesture, such as a key press, is performed; now you know why there is a dialog named “Input Gestures” in NVDA, and a script sketch appears after this list). Just like NVDA objects, an idealized input source named “input gesture” (inputCore.InputGesture) represents an input gesture, with tools such as the keyboard and touchscreen deriving from it. For braille displays, separate braille and braille input modules are responsible for input from display hardware.
  • Output mechanics: NVDA can send output to speech synthesizers and braille displays (tones can also be generated and wave files played). There is no single idealized output representation (actually, there are two: one sitting between data interpretation and output, the other being a driver base class, explained below). A speech synthesizer is represented by a “synth driver” (synthDriverHandler.SynthDriver), and a braille display by a “braille display driver” (braille.BrailleDisplayDriver), both of which derive from a base “driver” (driverHandler.Driver). Each output mechanism is then further divided into a driver settings collection and a specific output driver. For example, a speech synthesizer driver can include settings such as rate, pitch, and other instructions (if required), and each synthesizer (built-in or from add-ons) is housed inside the “synthDrivers” collection. Similarly, braille display drivers located in the brailleDisplayDrivers package (built-in and from add-ons) provide ways to let NVDA communicate with the hardware (or software, in the case of NVDA’s braille viewer). For example, the Windows OneCore synthesizer is housed in the synthDrivers.oneCore module, providing methods to obtain a list of voices installed on a Windows 10 or 11 system, a rate boost setting, and speech processing.
  • Talking to Windows and other apps: as a Windows screen reader, NVDA comes with modules to communicate with Windows and other apps through the Windows API, Component Object Model (COM), and in some cases by injecting parts of itself into other programs. Windows API support is provided by modules whose names start with “win”: “winUser” handles user32.dll and “winKernel” handles kernel32.dll, among others. COM support is provided by an external module named “comtypes”, with workarounds and specific COM interfaces provided by NVDA (notably, COM is used to communicate with UI Automation). Injecting parts of NVDA into other programs is handled by two modules: NVDA helper, for talking to certain programs, and the NVDA remote loader (not to be confused with the NVDA Remote add-on), a 64-bit program designed to let NVDA, a 32-bit program, talk to 64-bit programs (because NVDA runs on top of a 32-bit Python runtime).
  • Event handling: in order to gather and interpret data, NVDA needs to handle events coming from Windows and apps. An appropriately named module, “eventHandler,” defines how events should be processed (perhaps this should be covered in detail in a future Inside Story post). What makes event handling effective is the fact that various parts of NVDA, notably NVDA objects and app modules, can respond to events and take appropriate action (data processing/interpretation); see the event handling sketch after this list.
  • Data interpretation: if events are a big part of the data gathering phase, many modules work together to provide interpretation services, and sometimes users can offer suggestions through NVDA settings. First, NVDA notes where the gathered data came from, then asks relevant components (app modules, NVDA objects, accessibility API handlers, to name a few) to create a representation of the gathered data. In most cases this involves creating an NVDA object corresponding to the screen control the data came from. The newly created object in turn uses whatever it knows about the gathered data (an event, for example) to process and interpret it, with the interpretation phase depending on the accessibility API, overlay classes, user settings (such as announcing tooltips and playing the spelling error sound), consulting add-ons, and other factors. If told to output whatever it has, the interpreted data gets fed into presentation preparation routines such as speech and braille data preparation (a very complex topic which includes, among other things, character and speech dictionary processing, object caching for speech output, braille table lookup, and so on). A key piece of code used to transition from data processing and interpretation to output is ui.message, responsible for sending whatever text it gets to speech and braille.
  • Configuration facility: this consists of an external module named “configobj” to read and write config files, configuration profiles (including profile triggers), and the NVDA settings interface (a small sketch of reading settings appears after this list).
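
To make the NVDA object and overlay class ideas concrete, here is a minimal app module sketch. This is not actual NVDA source code; the application and window class names are made up for illustration. It derives an overlay class from the window NVDA object and injects it via chooseNVDAObjectOverlayClasses so NVDA reports a friendlier name for a specific control:

# appModules/someapp.py - hypothetical app module (the app name and window class are made up)
import appModuleHandler
from NVDAObjects.window import Window

class FriendlyEdit(Window):
    # Overlay class: override the "name" property NVDA reports for this control.
    def _get_name(self):
        return "Document text"

class AppModule(appModuleHandler.AppModule):
    def chooseNVDAObjectOverlayClasses(self, obj, clsList):
        # Put our overlay class at the front of the class list for matching controls.
        if obj.windowClassName == "SomeAppEditControl":
            clsList.insert(0, FriendlyEdit)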
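
Here is what a script bound to an input gesture looks like in a global plugin (again a hedged sketch; the key assignment and message are arbitrary). The @script decorator from scriptHandler ties the gesture to the command and supplies the description shown in the Input Gestures dialog:

# globalPlugins/hello.py - hypothetical global plugin
import globalPluginHandler
import ui
from scriptHandler import script

class GlobalPlugin(globalPluginHandler.GlobalPlugin):
    @script(
        description="Reports a hello message",
        gesture="kb:NVDA+shift+h",  # arbitrary example key combination
    )
    def script_sayHello(self, gesture):
        ui.message("Hello from a global plugin")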
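
And this is how an app module can respond to an event and hand the result to the output routines via ui.message (a sketch, not taken from NVDA source). The nextHandler call lets NVDA's default processing continue after our custom handling:

# appModules/someapp.py (continued) - respond to focus events in this app
import appModuleHandler
import ui

class AppModule(appModuleHandler.AppModule):
    def event_gainFocus(self, obj, nextHandler):
        # React to the focus change, then let NVDA's normal handling continue.
        ui.message("Focused: %s" % (obj.name or "unnamed control"))
        nextHandler()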
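
Finally, the user settings consulted during interpretation are exposed through the config module. The exact key names below are assumptions based on NVDA's configuration specification, so treat this as illustrative only:

# Reading NVDA settings from inside NVDA or an add-on:
import config

# Name of the active synthesizer driver (stored under the "speech" section).
synthName = config.conf["speech"]["synth"]
# A user-adjustable interpretation setting (report tooltips).
reportTooltips = config.conf["presentation"]["reportTooltips"]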

 

There are other components that are crucial to NVDA’s operation, including the GUI toolkit (wxWidgets/wxPython), update checking, add-on management, browse mode and virtual buffers, global commands such as OCR and screen curtain, and many others. These are not covered in this story for the sake of length.

 

At the heart of all the components (and others not mentioned) is a main event loop, sometimes shortened to event loop or main loop. Think of an event loop as a taxi driver waiting to pick up passengers (I think this is the closest analogy to how event loops work behind the scenes). The job of a taxi driver, among other things, is to await a message from the taxi company or a dispatch operator, informing the driver to pick up a passenger. The passenger, after (or before) paying, tells the driver the destination, and after a few minutes (or hours), the passenger is dropped off (try having a conversation with the driver and see how time flies). After dropping off the passenger, the taxi driver returns to the starting location or the closest garage run by the company or the dispatch operator, awaiting new dispatch messages. If you have a ride share account, think of how drivers from companies such as Uber perform their duties.

 

On the technical side of things, an event loop consists of, you guessed it, a loop that awaits messages (dispatches) from someone (or something) until told to exit. Dispatch (event) messages are sent mostly by the operating system (Windows in our case) on behalf of programs, whether or not the user is interacting with them, including messages raised by Windows on behalf of the screen reader itself (this is how NVDA can announce its own GUI controls using accessibility APIs). Once received, a dispatch message is interpreted by the event loop (in Windows, by calling the TranslateMessage function), and then, based on that interpretation, the loop informs app components (technically, application windows) to process the dispatch message (this is called event handling).
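
To make this concrete, here is a minimal Windows message loop written in Python with ctypes. This is a generic illustration, not NVDA code: it waits for a dispatch message, interprets it, and hands it to the window that should process it.

# A bare-bones Windows event loop (illustrative, not from NVDA):
import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32
msg = wintypes.MSG()

# GetMessageW blocks until a message arrives for this thread's windows;
# it returns 0 once a quit message tells the loop to exit.
while user32.GetMessageW(ctypes.byref(msg), None, 0, 0) > 0:
    user32.TranslateMessage(ctypes.byref(msg))   # interpret the message (e.g. keyboard input)
    user32.DispatchMessageW(ctypes.byref(msg))   # hand it to the window procedure (event handling)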

 

As a graphical application (with its own GUI), NVDA does include an event loop. This comes from wxWidgets (better known in this community as wxPython) via the wx.App.MainLoop() function. The loop is entered in the core.main function (summarized below, with a simplified sketch after the summary).

 

To summarize, the components listed above (and others not listed) are managed by the main loop as follows:

  1. NVDA starts and prepares modules for its operation.
  2. The first module, named “NVDA launcher” (nvda.pyw), checks startup information and calls core.main (defined in the core.py module).
  3. Inside the core module, the main function initializes components such as synthesizer support and accessibility API handlers, then obtains the wxWidgets app representation of NVDA and calls the main loop (app.MainLoop()).
  4. As the main loop runs, it “pumps” (dispatches) event messages for accessibility APIs and other events coming from Windows and apps, which in turn causes the components listed above to perform their tasks.
  5. The main loop exits, NVDA terminates various components, and finally leaves core.main function (exiting NVDA).
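
Putting those steps into code shape, here is a heavily simplified sketch of what core.main does. The helper functions below are hypothetical stand-ins, not the actual contents of core.py; in real NVDA each component exposes its own initialize and terminate routines.

import wx

def initializeComponents():
    """Hypothetical stand-in for NVDA's per-component initialize() calls
    (speech, braille, accessibility API handlers, GUI, and so on)."""

def terminateComponents():
    """Hypothetical stand-in for the matching terminate() calls."""

def main():
    initializeComponents()
    app = wx.App()         # the wxPython application object representing NVDA
    app.MainLoop()         # blocks here, pumping events until NVDA is asked to exit
    terminateComponents()  # shut components down, then return (NVDA exits)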

 

As a bonus, to clarify a point raised a few weeks ago on math expression announcement: a screen reader is not the same as a speech synthesizer. It is possible to let NVDA users customize speech dictionaries and thus affect data interpretation and output, but issues with speech synthesizers should be raised with synthesizer vendors, not with NV Access. In the case of that discussion, even if NVDA has gathered and interpreted data so it can announce math expressions correctly, it is then up to the speech synthesizer to take whatever data it gets from NVDA and speak it using whatever rules IT ships with. In other words, unless the screen reader vendor is intimately familiar with the synthesizer being used at that moment (and can perhaps influence some speech output rules at the time of data interpretation and processing), NVDA cannot influence everything related to speech output, as that job is left to speech synthesizer driver maintainers; even after users customize speech dictionary entries and influence character processing steps somewhat, what synthesizers get in the end is text and speech commands (see the sketch below). The question on math expression announcement is one of the reasons for writing an Inside Story on NVDA components: both to demonstrate what gets called and why in the data gathering/representing/interpreting/presenting steps, and to showcase the separation between screen readers and TTS engines.
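
As a small illustration of “text and speech commands,” this is roughly the kind of sequence a synthesizer driver receives from NVDA. It is a hedged sketch meant to run inside NVDA; the exact command classes and values shown are assumptions, not a recipe.

import speech
from speech.commands import PitchCommand, BreakCommand

speech.speak([
    "The square root of ",
    PitchCommand(offset=30),  # raise the pitch for the next chunk
    "x",
    PitchCommand(),           # back to the default pitch
    BreakCommand(150),        # a short pause, in milliseconds
    "equals y",
])

How each synthesizer renders these commands (or whether it supports them at all) is up to the synthesizer driver, which is exactly the separation described above.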

 

Hope this helps in clarifying many things. Up next: is there a feature whose internals you’d like me to talk about?

 

Cheers,

Joseph