Fine-grained image labels are desirable for many computer vision applications, such as visual search or mobile AI assistants. These applications rely on image classification models that can produce hundreds of thousands (for example, 100K) of diverse fine-grained labels for input images.
However, training a network at this vocabulary scale is challenging: the model can become intolerably large and slow to train, which in turn leads to unsatisfactory classification performance. A straightforward solution is to train separate expert networks (specialists), each of which focuses on learning one specific vertical, such as cars or birds.
Deploying dozens of expert networks in a practical system, however, would significantly increase system complexity and inference latency, and would consume large amounts of computational resources.
To address these challenges, a Knowledge Concentration method was proposed. It transfers knowledge from dozens of specialists (multiple teacher networks) into a single model (one student network) that classifies 100K object categories. The intuition behind the work mainly comes from daily experience.
The three contributions, which also serve as the salient features of the presented model, are:
The design of a novel multi-teacher, single-student knowledge distillation method that transfers knowledge from the specialists to the generalist, together with a self-paced learning mechanism that allows the student to learn at different paces from different teachers.
The design and exploration of different types of structurally connected layers that expand network capacity with a limited number of extra parameters.
The evaluation of the proposed methods on the EFT and OpenImage datasets, showing significant performance improvements.
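To make the first contribution more concrete, here is a minimal sketch of what a multi-teacher distillation loss for a single image could look like. It is not the method's actual formulation. It assumes the generic soft-target distillation recipe: each teacher covers only its own subset of classes, and the student is asked to match that teacher's softened distribution on the same subset. The function names, the dictionary layout, the temperature value and the per-teacher "weight" (a stand-in for whatever the self-paced mechanism would produce) are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def multi_teacher_distillation_loss(student_logits, teachers, temperature=2.0):
    """Weighted sum of per-teacher cross-entropies for one image.

    student_logits: logits of the generalist over ALL classes.
    teachers: list of dicts with hypothetical keys
        "classes": indices of the classes this specialist covers,
        "logits":  the specialist's logits over those classes,
        "weight":  assumed per-teacher pace weight (illustrative only).
    """
    loss = 0.0
    for teacher in teachers:
        # Soft targets from the specialist, restricted to its own vertical.
        soft_targets = [math.exp(v) for v in log_softmax(teacher["logits"], temperature)]
        # Student distribution renormalized over the same class subset.
        student_slice = [student_logits[i] for i in teacher["classes"]]
        student_log_probs = log_softmax(student_slice, temperature)
        cross_entropy = -sum(p * lq for p, lq in zip(soft_targets, student_log_probs))
        loss += teacher["weight"] * cross_entropy
    return loss


# Toy example: 6 classes split across two specialists (say, cars and birds).
student_logits = [2.0, 0.5, -1.0, 0.1, 0.3, 1.2]
teachers = [
    {"classes": [0, 1, 2], "logits": [3.0, 1.0, -2.0], "weight": 1.0},
    {"classes": [3, 4, 5], "logits": [0.0, 0.0, 2.5], "weight": 0.5},
]
loss = multi_teacher_distillation_loss(student_logits, teachers)
print(loss)
```

In this sketch, lowering a teacher's weight makes the student pay less attention to that specialist, which is one simple way to picture learning "at different paces"; how the weights are actually scheduled is left open here.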
The related work in this context includes the following:
In addition, the presented method is validated on OpenImage and on a newly collected dataset called Entity-Foto-Tree (EFT), which has 100K categories. The results show that the proposed model performs significantly better than the baseline generalist model.