A Frame-Synchronous Prosodic Decoder for Text-Independent Dialog Act Recognition
Kornel Laskowski, Carnegie Mellon University
Dialog act (DA) recognition is an important intermediate task in speech understanding systems. Although past research has demonstrated that prosody can improve the performance of recognizers relying primarily on words, how prosody fares on its own is not well understood. The current work continues an ongoing investigation into settings in which both words and word boundaries are unavailable, whether for reasons of privacy, security, speed, or availability of technology. A system that operates on long acoustic frames is presented; the longer frames render the modeling of prosodic context tractable. The system is then extended by concatenating features computed for temporally proximate frames, from both the target speaker and non-target interlocutors. Experiments indicate that the increased frame size and target-speaker prosodic context improve recognition performance, in particular for floor holders, accepts, and DA termination types. Non-target-speaker prosodic context is shown to have a large positive impact on the detection of DA interruption. These results suggest that the improved framework holds promise for the general decoding of prosodic phenomena in spontaneous speech, independently of speech recognition.
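The context extension described above, concatenating features from temporally proximate frames of the target speaker and of non-target interlocutors, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `stack_context` and `build_features`, the context widths, the edge-padding rule, and the toy feature dimensions are all assumptions made for illustration, and the actual prosodic features are not specified here.

```python
import numpy as np


def stack_context(frames, left, right):
    """Concatenate each frame with its `left` preceding and `right` following frames.

    frames: array of shape (T, D), one row of features per frame.
    Edges are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (T, (left + right + 1) * D).
    """
    frames = np.asarray(frames, dtype=float)
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    # Slice k holds, for every frame t, the frame at offset (k - left) from t.
    shifted = [padded[k:k + num_frames] for k in range(left + right + 1)]
    return np.concatenate(shifted, axis=1)


def build_features(target, others, left=2, right=2):
    """Stack context for the target speaker, then append each interlocutor's
    stacked context. All streams are assumed to be frame-aligned (same T)."""
    parts = [stack_context(target, left, right)]
    parts += [stack_context(other, left, right) for other in others]
    return np.concatenate(parts, axis=1)


# Toy example: 4 frames, 2 hypothetical features per frame, one interlocutor.
target = np.arange(8).reshape(4, 2)
other = np.zeros((4, 2))
stacked = stack_context(target, left=1, right=1)
combined = build_features(target, [other], left=1, right=1)
```

With one frame of context on each side, every frame of the target speaker grows from 2 to 6 features, and appending one interlocutor's context doubles that to 12. Longer frames keep this growth manageable, since fewer frames are needed to span the same amount of time.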