It shouldn’t be hard to do.
Well, and I say this not to be nasty or contrary, nothing could be further from the truth.

This is a particular instance where, no matter what the accessibility workaround, you are taking a medium meant to be accessed by sight, and really only by sight, that is itself a transcription of sound. It is particularly and peculiarly unsuited to any methods currently known. The only thing I can think of would be getting a synthesizer to read the IPA graphemes as phonemes, one by one, and even that doesn't come close to connected speech.

I have said, on many occasions, that sometimes there is no substitute for sight, and that all accessibility is a workaround that imposes specific limitations not present when media is "consumed" in the sensory modality for which it was explicitly designed.

