
Multimedia Systems - Call for Papers: Special Issue on Multi-modal Transformers

With the development of the Internet, social media, mobile apps, and other digital communication technologies, the world has entered a multimedia big data era. Millions of multimedia items, including images, text, audio, and video, are uploaded to social platforms every day. For artificial intelligence to better understand the world around us, it is essential to teach machines to understand multimodal messages. Multimodal machine learning, which aims to build models that can process and relate information from different modalities, is a vibrant field of increasing importance and extraordinary potential.

In this young and promising area, extensive efforts have been dedicated to seamlessly unifying computer vision and natural language processing, spanning multimedia content recognition (e.g., multimodal affect recognition), matching (e.g., cross-modal retrieval), description (e.g., image captioning), indexing (e.g., multimedia event detection), summarization (e.g., video summarization), and reasoning (e.g., visual question answering). Although fruitful progress has been made with deep learning-based methods, performance on these tasks still falls short of users’ expectations, owing to several well-known challenges posed by heterogeneous data:

  • how to represent and summarize multimodal data;
  • how to identify and construct the connections and interactions between data from different modalities;
  • how to learn and infer adequate knowledge from multimodal data; 
  • how to translate data or knowledge from one modality to another; and 
  • how to understand and evaluate the heterogeneity in multimodal datasets.

Owing to its superior capability for modeling long-range relations and learning efficient representations, the Transformer has become popular in both natural language processing and computer vision. A range of Transformer-based approaches have achieved state-of-the-art performance on topics such as image classification, object detection, segmentation, video understanding, text summarization, and question answering. Beyond its impressive performance on various language and vision tasks, the Transformer also provides an effective mechanism for multi-modal understanding, but it demands adaptations and task-specific network designs. Despite attracting a surge of research interest, core issues for multi-modal Transformers, such as how to design efficient Transformer models and alleviate their computational burden, remain open problems.

This special issue focuses on two main aspects: (1) exploration of multi-modal Transformers, especially the design of efficient Transformer networks; and (2) investigation of how to balance the model capacity and complexity of multi-modal Transformers. We solicit high-quality and original contributions on theories, algorithms, model architecture designs, and novel applications of Transformer networks for multimedia data understanding.

The special issue will provide a timely collection of recent advances for researchers and practitioners in the broad multimedia research community. Topics of interest include (but are not limited to):

  • Novel Transformer-based models for multimedia understanding, such as multimedia content recognition, matching, description, indexing, summarization, and reasoning
  • Efficient Transformer architectures through model compression, distillation, or other novel mechanisms
  • Novel pre-trained multi-modal Transformers
  • Novel network designs that combine the strengths of Transformers with other networks (e.g., CNNs, RNNs, generative models, and graph-based models)
  • Unsupervised, weakly supervised, and semi-supervised learning for multimedia data with Transformer models
  • Novel network designs for transferring pre-trained multi-modal Transformers to downstream tasks
  • Theoretical insights into multi-modal Transformer-based networks
  • Open-set problems in multimedia understanding with Transformers
  • Multimodal continual learning with Transformers
  • New datasets for multimedia understanding

Guest Editors: 

Feifei Zhang: Tianjin University of Technology, Tianjin, China

An-An Liu: Tianjin University, Tianjin, China 

Xiaoshan Yang: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Min Xu: University of Technology Sydney (UTS), Sydney, Australia

Submission Guidelines:

Authors should prepare their manuscripts according to the Instructions for Authors available from the Multimedia Systems website and submit through the online submission site, selecting “SI Special Issue on Multi-modal Transformers” at the “Article Type” step of the submission process. Submitted papers should present original, unpublished work relevant to the topics of the special issue. All submissions will be evaluated on relevance, significance of contribution, technical quality, scholarship, and quality of presentation by at least three independent reviewers. It is the policy of the journal that no submission, or substantially overlapping submission, be published or under review at another journal or conference at any time during the review process. Final decisions on all papers are made by the Editor-in-Chief.

