Introduction

Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.

The Problem

Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. Another alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well — we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time-consuming.

Hand Alignment

This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labeling in videos is another.

Connectionist Temporal Classification (CTC)

Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well-suited to applications like speech and handwriting recognition.
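The core trick behind CTC is the standard CTC collapsing rule. The model emits one symbol per input time step, including a special "blank" token. That per-step alignment is then mapped to an output by merging runs of repeated symbols and removing the blanks. Many different alignments therefore map to the same transcript, so no single alignment has to be known in advance. The minimal sketch below shows only this mapping. The blank character `_` and the name `collapse` are hypothetical choices for readability; real systems reserve an index for the blank.

```python
BLANK = "_"  # hypothetical blank token, chosen for readability


def collapse(alignment: str, blank: str = BLANK) -> str:
    """Collapse a per-time-step CTC alignment into its output string:
    merge runs of repeated symbols, then drop the blank tokens."""
    out = []
    prev = None
    for sym in alignment:
        # keep a symbol only if it starts a new run and is not a blank
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


# Several alignments map to the same output
print(collapse("hh_e_ll_llo"))  # hello
print(collapse("heel_lo"))      # hello
print(collapse("hello"))        # helo: without a blank, the double "l" merges
```

The blank is what lets a genuinely repeated character, such as the "ll" in "hello", survive the merge step.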

CTC Score

A common question when using a beam search decoder is what beam size to use. There is a trade-off between accuracy and runtime, but we can check whether the beam size is in a good range. First compute the CTC score c_i of the inferred output, then the CTC score c_g of the ground-truth output. Here a score is the model's (log-)probability of that output, so higher is better. If the two outputs differ, we should have c_g < c_i, since the search is meant to return the most probable output it can find. If instead c_i << c_g, the ground-truth output actually has a much higher probability under the model and the beam search failed to find it. In this case, a large increase to the beam size may be warranted.
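The check needs the CTC score of a given output: log p(Y | X), summed over every alignment that collapses to Y. The minimal pure-Python sketch below computes it with the standard CTC forward (alpha) recursion and then applies the comparison of c_i and c_g. It assumes the model's outputs are available as a T×V table of per-step log-probabilities with the blank at index 0. The function names, the returned labels, and the `margin` threshold (in nats) used to decide what counts as c_i << c_g are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_prob(log_probs: Sequence[Sequence[float]],
                 labels: Sequence[int], blank: int = 0) -> float:
    """CTC score log p(labels | X), summed over every valid alignment.

    log_probs[t][k] is the model's log-probability of symbol k at step t
    (assumed layout). Uses the CTC forward (alpha) recursion.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., yN, blank]
    ext: List[int] = [blank]
    for y in labels:
        ext += [y, blank]
    S = len(ext)

    # alpha[s]: log-prob of all alignment prefixes ending at ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])    # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])    # skip a blank between distinct labels
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # A valid alignment ends on the last label or on the trailing blank.
    return alpha[0] if S == 1 else logsumexp(alpha[S - 1], alpha[S - 2])


def check_beam(log_probs, inferred: Sequence[int], ground_truth: Sequence[int],
               blank: int = 0, margin: float = 1.0) -> str:
    """Compare c_i and c_g; `margin` is a hypothetical threshold for c_i << c_g."""
    if list(inferred) == list(ground_truth):
        return "correct"
    c_i = ctc_log_prob(log_probs, inferred, blank)
    c_g = ctc_log_prob(log_probs, ground_truth, blank)
    if c_g <= c_i:
        return "model_error"         # the model itself prefers the wrong output
    if c_g - c_i > margin:
        return "search_error_large"  # beam missed a far more probable output
    return "search_error"            # beam missed a slightly more probable output


# T=3 steps, V=3 symbols (blank=0, "a"=1, "b"=2), uniform outputs:
# 5 of the 27 equally likely alignments collapse to "ab".
uniform = [[math.log(1 / 3)] * 3 for _ in range(3)]
print(math.exp(ctc_log_prob(uniform, [1, 2])))  # ~0.1852
```

Working in log space with `logsumexp` avoids the underflow that multiplying many small per-step probabilities would cause on realistic input lengths. A `search_error_large` result corresponds to the case where a larger beam may help. A `model_error` result suggests that widening the beam alone will not fix the output.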

Conclusion

In this article, we have discussed the problem of not knowing the alignment between the input and output in speech recognition and other areas. We have also introduced Connectionist Temporal Classification (CTC) as a way to get around this problem. CTC is a well-suited approach for applications like speech and handwriting recognition.

FAQs

Q: What is the problem with not knowing the alignment between the input and output?
A: Without the alignment we cannot say which output character each input time step should predict. This makes it difficult to train a speech recognizer or any other model that needs a direct mapping between input and output.
Q: How does CTC address this problem?
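A: CTC does not need a known alignment. During training, it sums the probability of the output over all possible alignments between input and output, which can be computed efficiently with dynamic programming. A beam search decoder is used afterwards, at inference time, to find a likely output.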
Q: What are the trade-offs of using a beam search decoder?
A: The trade-off is between accuracy and runtime. A larger beam can improve accuracy but increases runtime.
