Proc IEEE Int Conf Comput Vis. 2019 Oct-Nov;2019:2580-2590.
doi: 10.1109/ICCV.2019.00267. Epub 2020 Feb 27.

Scene Graph Prediction with Limited Labels


Vincent S Chen et al. Proc IEEE Int Conf Comput Vis. 2019 Oct-Nov.

Abstract

Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited label setting, we define a complexity metric for relationships that serves as an indicator (R² = 0.778) for conditions under which our method succeeds over transfer learning, the de facto approach for training with limited labels.
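As a rough illustration of what an image-agnostic heuristic might look like, the Python sketch below uses categorical (subject/object class) and spatial (bounding-box) cues to vote on candidate pairs for a single relationship. All function names, class lists, and thresholds are assumptions for exposition, not the paper's implementation.

# Illustrative heuristics over image-agnostic features for one relationship ("ride").
# Names, class lists, and thresholds are assumptions for exposition only.
from dataclasses import dataclass

@dataclass
class Pair:
    subj_cls: str            # e.g., "person"
    obj_cls: str             # e.g., "horse"
    subj_box: tuple          # (x1, y1, x2, y2)
    obj_box: tuple

ABSTAIN, NEG, POS = -1, 0, 1

def lf_categorical_ride(p):
    # Categorical cue: subject/object classes seen with "ride" in the labeled set.
    riders = {"person", "man", "woman", "child"}
    ridden = {"horse", "bike", "motorcycle", "elephant", "skateboard"}
    return POS if p.subj_cls in riders and p.obj_cls in ridden else ABSTAIN

def lf_spatial_ride(p):
    # Spatial cue: subject centered above the object with horizontal overlap.
    sx1, sy1, sx2, sy2 = p.subj_box
    ox1, oy1, ox2, oy2 = p.obj_box
    subject_above = (sy1 + sy2) / 2 < (oy1 + oy2) / 2
    overlaps = min(sx2, ox2) > max(sx1, ox1)
    return POS if subject_above and overlaps else NEG

HEURISTICS = [lf_categorical_ride, lf_spatial_ride]

def votes(p):
    # Collect the (possibly conflicting, possibly abstaining) heuristic outputs.
    return [lf(p) for lf in HEURISTICS]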


Figures

Figure 1.
Our semi-supervised method automatically generates probabilistic relationship labels to train any scene graph model.
Figure 2.
Visual relationships have a long tail (left) of infrequent relationships. Current models [49,54] only focus on the top 50 relationships (middle) in the Visual Genome dataset, which all have thousands of labeled instances. This ignores more than 98% of the relationships with few labeled instances (right, top/table).
Figure 3.
Relationships such as fly, eat, and sit can be characterized effectively by their categorical (s and o refer to subject and object, respectively) or spatial features. Some relationships, like fly, rely heavily on only a few features: kites are often seen high up in the sky.
Figure 4.
We define the number of subtypes of a relationship as a measure of its complexity. Subtypes can be categorical — one subtype of ride can be expressed as while another is . Subtypes can also be spatial—carry has a subtype with a small object carried to the side and another with a large object carried overhead.
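To make the subtype count in Figure 4 concrete, here is a minimal sketch (Python; the triple format is an assumption) that measures categorical complexity as the number of distinct (subject class, object class) pairs a predicate appears with in an annotated set.

from collections import defaultdict

def categorical_subtypes(triples):
    # triples: iterable of (subject_class, predicate, object_class) tuples.
    # The number of distinct (subject, object) pairs per predicate serves here
    # as a simple proxy for the relationship's categorical complexity.
    pairs = defaultdict(set)
    for subj, pred, obj in triples:
        pairs[pred].add((subj, obj))
    return {pred: len(s) for pred, s in pairs.items()}

# Example: "ride" has two categorical subtypes below, "fly" has one.
triples = [
    ("person", "ride", "horse"),
    ("dog", "ride", "surfboard"),
    ("person", "ride", "horse"),
    ("kite", "fly", "sky"),
]
print(categorical_subtypes(triples))   # {'ride': 2, 'fly': 1}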
Figure 5.
A subset of visual relationships with different levels of complexity as defined by spatial and categorical subtypes. In Section 5.3, we show how this measure is a good indicator of our semi-supervised method’s effectiveness compared to baselines like transfer learning.
Figure 6.
For a relationship (e.g., carry), we use image-agnostic features to automatically create heuristics and then use a generative model to assign probabilistic labels to a large unlabeled set of images. These labels can then be used to train any scene graph prediction model.
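The paper aggregates the heuristics' noisy outputs with a factor graph-based generative model; the sketch below is a much simpler stand-in (a naive accuracy-weighted vote, with hypothetical accuracy estimates) showing how conflicting and abstaining votes can still yield a single probabilistic label.

import math

def probabilistic_label(votes, accuracies):
    # votes: -1 (abstain), 0 (vote negative), or 1 (vote positive) per heuristic.
    # accuracies: each heuristic's estimated accuracy on the small labeled set.
    # Naive log-odds combination; a simplified stand-in for the paper's
    # factor graph-based generative model, not its actual implementation.
    log_odds = 0.0
    for v, acc in zip(votes, accuracies):
        if v == -1:                       # abstentions contribute nothing
            continue
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if v == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two heuristics vote positive, one abstains: a confident but soft label.
print(probabilistic_label([1, 1, -1], [0.8, 0.7, 0.6]))   # ~0.90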
Figure 7.
(a) Heuristics based on spatial features help predict . (b) Our model learns that look is highly correlated with phone. (c) We overfit to the importance of chair as a categorical feature for sit, and fail to identify hang as the correct relationship. (d) We overfit to the spatial positioning associated with ride, where objects are typically longer and directly underneath the subject. (e) Given our image-agnostic features, we produce a reasonable label for . However, our model is incorrect, as two typically different predicates (sit and cover) share a semantic meaning in the context of .
Figure 8.
A scene graph model [54] trained using our labels outperforms both using TRANSFER LEARNING labels and using only the BASELINE labeled examples consistently across scene graph classification and predicate classification for different amounts of available labeled relationship instances. We also compare to ORACLE, which is trained with 108× more labeled data.
Figure 9.
Our method’s improvement over transfer learning (in terms of R@100 for predicate classification) is correlated with the number of subtypes in the train set (left), the number of subtypes in the unlabeled set (middle), and the proportion of subtypes in the labeled set (right).

References

    1. Alfonseca Enrique, Filippova Katja, Delort Jean-Yves, and Garrido Guillermo. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 54–59. Association for Computational Linguistics, 2012.
    2. Anderson Carolyn J, Wasserman Stanley, and Faust Katherine. Building stochastic blockmodels. Social Networks, 14(1–2):137–161, 1992.
    3. Anderson Peter, Fernando Basura, Johnson Mark, and Gould Stephen. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
    4. Auer Sören, Bizer Christian, Kobilarov Georgi, Lehmann Jens, Cyganiak Richard, and Ives Zachary. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
    5. Bollacker Kurt, Evans Colin, Paritosh Praveen, Sturge Tim, and Taylor Jamie. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
