Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue¹, Lisa Anne Hendricks¹, Marcus Rohrbach¹, Subhashini Venugopalan², Sergio Guadarrama¹, Kate Saenko³, Trevor Darrell¹

Affiliations

¹ Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA.
² Department of Computer Science, University of Texas at Austin, Austin, TX.
³ Department of Computer Science, University of Massachusetts Lowell, Lowell, MA.

PMID: 27608449
DOI: 10.1109/TPAMI.2016.2599174

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue et al. IEEE Trans Pattern Anal Mach Intell. 2017 Apr.

. 2017 Apr;39(4):677-691.

doi: 10.1109/TPAMI.2016.2599174. Epub 2016 Sep 1.

Authors

Jeff Donahue¹, Lisa Anne Hendricks¹, Marcus Rohrbach¹, Subhashini Venugopalan², Sergio Guadarrama¹, Kate Saenko³, Trevor Darrell¹

Affiliations

¹ Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA.
² Department of Computer Science, University of Texas at Austin, Austin, TX.
³ Department of Computer Science, University of Massachusetts Lowell, Lowell, MA.

PMID: 27608449
DOI: 10.1109/TPAMI.2016.2599174

Abstract

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

PubMed Disclaimer

Publication types

Actions
Actions

LinkOut - more resources

Full Text Sources
- IEEE Computer Society
- IEEE Engineering in Medicine and Biology Society
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Affiliations

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Authors

Affiliations

Abstract

Publication types

LinkOut - more resources

Full Text Sources

Other Literature Sources