Sequence Modeling with CTC

Introduction

Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.

The Problem

Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. Another alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well — we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time-consuming.
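To make the weakness of a fixed rule concrete, here is a minimal sketch of the "one character corresponds to ten inputs" idea. The function name `fixed_rate_alignment` and its parameters are hypothetical, chosen only for illustration.

```python
def fixed_rate_alignment(transcript, num_frames, frames_per_char=10):
    """Label each input frame, assuming every character spans a fixed number of frames."""
    labels = []
    for t in range(num_frames):
        idx = t // frames_per_char
        # Frames past the end of the transcript have no character to map to.
        labels.append(transcript[idx] if idx < len(transcript) else None)
    return labels


# The rule only fits a clip that happens to be exactly 10 frames per character.
exact = fixed_rate_alignment("cat", 30)
# A faster speaker (20 frames): the final "t" never receives a frame.
fast = fixed_rate_alignment("cat", 20)
# A slower speaker (45 frames): the last 15 frames are left unlabeled.
slow = fixed_rate_alignment("cat", 45)
```

Any clip whose length is not exactly `frames_per_char * len(transcript)` ends up with dropped characters or unlabeled frames, and even a clip of the right length is mislabeled if the speaker lingers on one sound.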

Beyond Speech Recognition

This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labeling in videos is another.

Connectionist Temporal Classification (CTC)

Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well-suited to applications like speech and handwriting recognition.

CTC Score

A common question when using a beam search decoder is the size of the beam to use. There is a trade-off between accuracy and runtime. We can check if the beam size is in a good range. To do this, first compute the CTC score for the inferred output, c_i, and then compute the CTC score for the ground truth output, c_g. If the two outputs are not the same, we should have c_g < c_i: the decoder is supposed to return the most probable output, so the ground truth should not score higher than what it found. If instead c_i << c_g, then the ground truth output actually has a much higher probability under the model and the beam search failed to find it. In this case, a large increase to the beam size may be warranted.
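The check can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: it assumes the model's output is available as per-time-step log-probabilities, that the blank symbol has index 0, and that the CTC score is the log-probability of an output summed over all of its alignments (the standard CTC forward recursion). The names `ctc_log_score` and `beam_search_missed`, and the `margin` threshold standing in for "<<", are hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a handful of values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_score(log_probs, labels, blank=0):
    """Log-probability of `labels`, summed over all CTC alignments.

    log_probs: T rows, each a list of per-symbol log-probabilities.
    labels: symbol indices of the output, without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping over a blank is only allowed between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[0] if S == 1 else logsumexp(alpha[S - 1], alpha[S - 2])


def beam_search_missed(log_probs, inferred, truth, margin=math.log(10.0)):
    """True if c_g exceeds c_i by more than `margin` (an assumed threshold)."""
    c_i = ctc_log_score(log_probs, inferred)
    c_g = ctc_log_score(log_probs, truth)
    return c_g - c_i > margin
```

Because the scores are log-probabilities, the default `margin` of `log(10)` flags cases where the ground truth is more than ten times as probable as the decoder's output. The right threshold depends on the task, so treat this value as a placeholder.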

Conclusion

In this article, we discussed the problem of not knowing the alignment between the input and the output in speech recognition and other areas, and introduced Connectionist Temporal Classification (CTC) as a way to get around it. We also saw how comparing the CTC scores of the inferred and ground truth outputs can show whether a beam search decoder's beam size is large enough. CTC is well-suited to applications like speech and handwriting recognition.

FAQs

  • Q: What is the problem with not knowing the alignment between the input and output?
    A: Not knowing the alignment makes it difficult to train a speech recognizer or other models that require a direct mapping between the input and output.
  • Q: How does CTC address this problem?
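    A: CTC allows us to train a model without knowing the alignment by summing the probability of an output over all of its possible alignments to the input. A beam search decoder is used afterwards, at inference time, to find a likely output.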
  • Q: What are the trade-offs of using a beam search decoder?
    A: The trade-offs are between accuracy and runtime. A larger beam size can improve accuracy but increase runtime.