I get what you're saying, but just imagine what it would be like to have the dialog, the background noise that's part of the scene, and synthesized subtitles all being churned out at the same time.
I understand what you're trying to solve, but I don't think that adding "a third layer" that's also presented auditorily will actually do that. I guess it can't hurt to try, if it's possible, but I suspect a "making it worse, not better" outcome.
Brian - Windows 10 Pro, 64-Bit, Version 1909, Build 18363
Most of the change we think we see in life is due to truths being in and out of favor.
~ Robert Frost, The Black Cottage (1914)