International Journal of Computer Vision - Call for Papers: Multimodal Visual Generative Models

Guest editors

  • Yang Long, Durham University, UK
  • Ling Shao, UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China
  • Juergen Schmidhuber, King Abdullah University of Science and Technology, Saudi Arabia


The field of Artificial Intelligence has been revolutionized by the advent of generative AI. This shift has been fueled by recent advances in large generative models such as OpenAI's ChatGPT, Stable Diffusion, and other text-to-image generation models. These transformative technologies are reshaping the world, influencing our daily lives, and creating new frontiers of research and application. As such, we believe the time is ripe to expand these technologies to incorporate further modalities, such as speech, sensor readings, thermal imagery, and point clouds, among others.

It is important to recognize the impact of large generative models on our society. These models have democratized content creation, enabling individuals and businesses to generate text, images, and more with unprecedented ease and flexibility. They are employed in a wide range of applications, from automating customer-service responses to aiding in the creation of artwork and literature. For instance, OpenAI's ChatGPT, a language model trained on diverse internet text, can draft emails, write code, answer questions, tutor in various subjects, translate languages, and even simulate characters for video games.

Recent innovations such as Stable Diffusion and other text-to-image generative models have further pushed the boundaries of what is possible with AI. Diffusion models have made it feasible to create high-quality samples in a wide variety of domains, from images to speech, while text-to-image models can generate coherent and contextually relevant images from textual descriptions, opening up new possibilities in fields such as graphic design, advertising, and entertainment.

Despite these advances, the current state of the art in generative AI focuses primarily on a single modality: text, image, or audio. Real-world data, however, is inherently multimodal. It comes from various sources and in different forms: text from reports, images from cameras, speech from microphones, readings from sensors, thermal signals, point clouds from LiDAR, and more. The ability to process and generate such multimodal data can provide a more holistic understanding of the world, leading to more robust and versatile AI systems.

We therefore believe that the next frontier in generative AI is the development and exploration of multimodal large generative models. Developing such models, however, presents unique challenges: they require novel architectures, learning algorithms, and evaluation metrics, and the high-dimensional, heterogeneous nature of multimodal data further compounds these difficulties. Hence, this special issue on Multimodal Large Generative Models seeks to highlight the importance and potential of this rapidly evolving field. We believe that expanding the scope of large generative models to incorporate multiple modalities is a critical step towards building AI systems that better understand and interact with the world. We invite researchers worldwide to share their latest findings, breakthroughs, and perspectives, and we look forward to a wealth of novel insights and developments that will shape the future of multimodal AI and generative content.

Topics of interest: 

  • Multimodal Data Fusion and Representation Learning for Large Generative Models: Novel architectures, learning algorithms, and techniques for integrating and learning from diverse data types such as text, images, audio, sensor data, and more.
  • Healthcare and Biomedical Applications: Utilizing multimodal generative models for synthetic data generation, disease progression prediction, medical imaging segmentation, personalized treatment planning, and more, leveraging medical imaging, genomic data, electronic health records, and other multimodal data.
  • Autonomous Systems: Enhancing perception and decision-making in self-driving vehicles, drones, and robotics through the synthesis and fusion of visual data, LiDAR, radar, GPS, and other sensor data.
  • Environmental Modelling and Climate Science: Leveraging multimodal data from various sensors (e.g., satellite imagery, atmospheric sensors, oceanographic data) to simulate complex environmental phenomena.
  • Advanced Human-Computer Interaction: Using multimodal generative models to improve interfaces and interactions between humans and computers by integrating speech, gestures, facial expressions, and other modalities; this includes creating interactive media, video games, virtual reality experiences, and other forms of entertainment or embodied AI, as well as investigating the ethical implications of these technologies.
  • Generative Models in 3D Vision: Applying generative models to fuse and generate point clouds, meshes, NeRFs, and more for applications such as 3D reconstruction, object detection, and scene understanding.
  • Evaluation Metrics, Benchmarks, and Fairness in Multimodal Large Generative Models: Establishing robust evaluation metrics and benchmarks for multimodal generative models, considering their complex and heterogeneous nature, and investigating fairness and bias issues.
  • Ethical, Societal, and Legal Implications of Multimodal Generative Models: Analyzing the impact of these powerful technologies on society, discussing responsible use strategies, and exploring potential legal frameworks.
  • Other directions related to multimodal large generative models, such as Thermal Imaging Applications, Education and Personalized Learning, and Surveillance.

Important Dates

  • Submission deadline: 01 May 2024
  • Preliminary notification: 01 August 2024
  • Revisions due: 10 November 2024
  • Final notification: December 2024


Submission Guidelines
Please submit via the IJCV Editorial Manager: www.editorialmanager.com/visi

Choose "SI: Multimodal Visual Generative Models" from the Article Type dropdown.

Submitted papers should present original, unpublished work relevant to one of the topics of the Special Issue. All submissions will be evaluated by at least two independent reviewers on the basis of relevance, significance of contribution, technical quality, scholarship, and quality of presentation. It is the policy of the journal that no submission, or substantially overlapping submission, be published or under review at another journal or conference at any time during the review process. Manuscripts will undergo peer review and must conform to the author guidelines available on the IJCV website at https://www.springer.com/11263.

Author Resources
Authors are encouraged to submit high-quality, original work that has neither appeared in, nor is under consideration by, other journals. Springer provides a host of information about publishing in a Springer journal on the Journal Author Resources page, including FAQs and tutorials, along with Help and Support.
