Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Answering a visual question requires high-order reasoning about an image, and the ability to follow complex directives is a fundamental capability needed by machine systems. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While early modular networks were designed with a degree of model transparency, their performance was lacking on complex visual reasoning benchmarks.
Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. This paper closes the performance gap between interpretable models and state-of-the-art visual reasoning methods. It proposes a set of visual-reasoning primitives which, when composed, yield a model capable of performing complex reasoning tasks in an explicitly interpretable manner.
The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, the authors show that these primitives are highly performant, achieving a state-of-the-art accuracy of 99.1% on the CLEVR dataset. They also show that the model, when provided a small amount of data containing novel object attributes, learns generalized representations very effectively. On the CoGenT generalization task, the model improves upon the current state of the art by more than 20 percentage points.
By designing a module network built explicitly around a visual attention mechanism, the work presented here closes the gap between performant and interpretable models. This approach is referred to as Transparency by Design (TbD).
As Lipton notes, transparency and interpretability are often invoked but rarely defined. Here, transparency refers to the ability to examine the intermediate outputs of each module and understand their behaviour at a high level: module outputs are interpretable if they visually highlight the correct regions of the input image, which ensures the reasoning process can be followed. The paper defines this notion concretely and provides a quantitative analysis.
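To make the idea of quantitatively evaluating interpretability concrete, one simple score is the fraction of total attention mass that falls inside the ground-truth object region. This is an illustrative sketch under that assumption, not necessarily the paper's exact metric; the function name and toy data are hypothetical.

```python
import numpy as np

def attention_inside_ratio(mask, object_region):
    """Hypothetical interpretability score: the fraction of attention
    mass that falls inside the ground-truth object region. Closer to
    1.0 means the mask highlights the correct area of the image."""
    total = mask.sum()
    if total == 0:
        return 0.0
    return float((mask * object_region).sum() / total)

# Toy 4x4 example: most attention lands on an object in the top-left.
mask = np.zeros((4, 4)); mask[0, 0] = 0.8; mask[3, 3] = 0.2
region = np.zeros((4, 4)); region[:2, :2] = 1.0  # object occupies top-left quadrant
print(attention_inside_ratio(mask, region))  # 0.8
```

Averaging such a score over many questions would give one quantitative handle on whether a model's attention masks are interpretable in the sense defined above.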
This paper makes the following contributions:
Proposes a set of composable visual reasoning primitives that incorporate an attention mechanism, enabling model transparency.
Demonstrates state-of-the-art performance on the CLEVR dataset.
Shows that compositional visual attention provides powerful insight into model behaviour.
Proposes a method to quantitatively evaluate the interpretability of visual attention mechanisms.
Improves upon the current state-of-the-art performance on the CoGenT generalization task by 20 percentage points.
The paper is structured as follows:
Section 2 discusses related work in visual question answering and visual reasoning, motivating the incorporation of an explicit attention mechanism in the model.
Section 3 presents the Transparency by Design networks.
Section 4 presents the VQA experiments and results.
Section 5 discusses the contributions.
Transparency by Design networks compose visual primitives that leverage an explicit attention mechanism to perform reasoning operations. Unlike their predecessors, the resulting neural module networks are both highly performant and readily interpretable.
A key advantage of TbD models is the ability to directly evaluate the model's learning process via the produced attention masks, which is a powerful diagnostic tool. One can leverage this capability to inspect the semantics of a visual operation and redesign modules to address apparent aberrations in reasoning. Using these attentions to improve performance, the model achieves state-of-the-art accuracy on the challenging CLEVR dataset and the CoGenT generalization task. Such insight into a neural network's operation may also help build user trust in visual reasoning systems.
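To illustrate how attention primitives compose, the framework-agnostic sketch below treats each primitive as a function that takes image features and an incoming spatial attention mask and produces a new one-channel mask; chaining primitives chains their masks, and each intermediate mask can be visualized. The shapes, weights, and module names here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_module(weights, feats, prev_attn):
    """Illustrative attention primitive: gate the image features by
    the incoming attention mask, then project to a single channel
    (a 1x1-convolution-like step) and squash with a sigmoid."""
    attended = feats * prev_attn  # (C, H, W) gated by (1, H, W)
    logits = np.tensordot(weights, attended, axes=([0], [0]))  # -> (H, W)
    return 1.0 / (1.0 + np.exp(-logits))[None]  # (1, H, W) mask in (0, 1)

feats = rng.standard_normal((64, 14, 14))  # stand-in for CNN image features
ones = np.ones((1, 14, 14))                # initial "attend everywhere" mask
w_red, w_sphere = rng.standard_normal(64), rng.standard_normal(64)

# Compose two primitives, e.g. attend[red] -> attend[sphere]:
mask = attention_module(w_sphere, feats, attention_module(w_red, feats, ones))
print(mask.shape)  # (1, 14, 14)
```

Inspecting each intermediate mask as a heat map over the image is exactly the diagnostic capability described above: if a mask highlights the wrong regions, the corresponding module can be examined and redesigned.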
Link To The PDF: Click Here
Example GitHub Code (PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"): GitHub