Tentative Program
Collectionless AI and Nature-Inspired Learning

Marco Gori
He is a full professor of computer science at the University of Siena, where he leads the Siena Artificial Intelligence Lab. His work focuses on machine learning and neural computation, with a significant impact on the field of graph neural networks. Gori co-authored the pioneering paper "A New Model for Learning in Graph Domains" (IJCNN 2005), which introduced the term "Graph Neural Network" and laid the groundwork for this influential field. His later paper, "Graph Neural Networks" (IEEE-TNN 2009), expanded on these ideas and has garnered over 10,000 citations. He has also collaborated with the 3IA Côte d'Azur since 2019 and previously chaired the Italian Chapter of the IEEE Computational Intelligence Society. Gori is a Fellow of IEEE, EurAI, IAPR, and ELLIS.
Abstract: AI is revolutionizing not only the entire field of Computer Science but nearly all fields of Science. However, while application contexts multiply and LLMs display uncanny cognitive qualities, the research field seems headed toward saturation of the fundamental ideas behind today’s spectacular results from large companies. Is the infamous “AI winter” perhaps creeping into research? In this talk I argue that the time is ripe for a fundamental rethinking of AI methodologies, aimed at migrating intelligence from the cloud to the growing global population of devices with on-board CPUs. To support learning schemes inspired by mechanisms found in nature, I propose developing intelligent systems within NARNIAN, a platform that enables social mechanisms and fosters learning processes over time, without the need for data storage.
The Foundation of Foundation Models for Adaptive Multimedia Agents

Amos Storkey
He is Professor of Machine Learning and Artificial Intelligence in the School of Informatics, University of Edinburgh. He completed his undergraduate studies in Mathematics at Trinity College, Cambridge, moving into neural networks for his PhD (1995-98). He has continued research in AI methodology and applications ever since, and is known for work in deep learning, continual learning, meta-learning, representation learning, domain shift, reinforcement learning, and efficient learning systems for edge devices. Storkey founded and leads the Bayesian and Neural Systems Group in Edinburgh. He is a Fellow of the ELLIS Society and has served as director of the EPSRC Centre for Doctoral Training in Data Science. He is currently director of the EPSRC Centre for Doctoral Training in Machine Learning Systems.
Abstract: In this talk we will introduce four different types of foundation generative models, in the context of multimedia signal processing and control: autoregressive models; the class of diffusion, gradient flow, and flow matching models; state space models; and basis-function approaches. We explain how and why each approach works, and what is necessary to train them. We then look at foundational representation learning models, why they work, and discuss how they can be used for multi-modal fusion. These pieces will be assembled into a complete toolkit for foundational multimedia signal processing, along with demonstrations of multimedia foundation models in action. We will look at what is needed to combine signals, language, images, video, and audio. In the second part we will examine the fundamental issues in using foundation models: domain shift and fine-tuning, non-stationarity and adaptability, and finally computational cost and on-device implementation. We will close with a brief review of state-of-the-art developments and new understanding in the field.
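To give a concrete flavor of what training one of these model families involves, below is a minimal, illustrative sketch of a single flow-matching training step in PyTorch. It assumes a velocity-prediction network with the placeholder signature model(xt, t); all names, shapes, and hyperparameters are illustrative assumptions, not material from the talk.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, optimizer):
    """One training step; x1 is a batch of real data of shape (B, D).

    Assumes `model(xt, t)` predicts a velocity field (placeholder API).
    """
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    v_target = x1 - x0                               # constant velocity of that path
    loss = F.mse_loss(model(xt, t), v_target)        # regress predicted velocity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The straight-line (rectified-flow) path used here is the simplest formulation; the diffusion and gradient-flow variants covered in the talk differ mainly in the choice of path and target.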
From ChatGPT to a Semantic Transformer

Scott T. Acton
He is the Lawrence R. Quarles Professor and Chair of Electrical & Computer Engineering at the University of Virginia as well as the AI Advisor to the Provost. He also holds an appointment in Biomedical Engineering. He was Program Director (and then acting Deputy Division Director) in the Computer and Information Science and Engineering directorate of the National Science Foundation, where he received the Director’s Award for Superior Accomplishment. He received the M.S. and Ph.D. degrees from the University of Texas at Austin and the B.S. degree from Virginia Tech. Professor Acton is a Fellow of the IEEE “for contributions to biomedical image analysis.” He is also a Fellow of the American Institute for Medical and Biological Engineering, a Fellow of the Asia-Pacific Artificial Intelligence Association, and a Fellow of the Artificial Intelligence Industry Alliance. Professor Acton’s laboratory at UVA, VIVA (Virginia Image and Video Analysis), specializes in biological and biomedical image analysis problems. Professor Acton has over 400 publications in the image analysis area, including the books Biomedical Image Analysis: Tracking and Biomedical Image Analysis: Segmentation. He was the 2018 Co-Chair of the IEEE International Symposium on Biomedical Imaging and was Editor-in-Chief of the IEEE Transactions on Image Processing (2014-2018).
Abstract: This talk will set the stage with a discussion of recent activity in artificial intelligence research. First, the basic transformer model will be explained. Next, the contrast between natural language processing and video processing will be highlighted. The talk will culminate in a discussion of a multimodal video transformer that utilizes semantically meaningful query and key features to learn context for action recognition.
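As background for the transformer discussion, here is a minimal sketch of standard scaled dot-product attention, the core operation of the basic transformer model. It is a generic illustration only and does not reproduce the semantically meaningful query and key features of the talk's model.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: tensors of shape (batch, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise query-key similarity
    weights = torch.softmax(scores, dim=-1)        # attention distribution per query
    return weights @ V                             # weighted sum of value vectors
```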
Multimodal AI for Advancing Instruction

Scott T. Acton
See the biography above.
Abstract: In this talk, the application of computer vision and AI to video analysis of the elementary classroom will be detailed. The talk will cover data collection, video scoring, the presentation of results on a custom dashboard, and a novel video coach for teachers. The use of multimodal inputs including video, audio, and text will be explored. Studies showing the efficacy of the AI methodology will be summarized. The talk will conclude with a discussion of open directions in AI research as well as lessons learned in the broad area of AI.
Machine Learning Security: Lessons Learned and Future Challenges

Battista Biggio
He (MSc 2006, PhD 2010) is a Full Professor of Machine Learning at the University of Cagliari, Italy, and research co-director of AI Security at the sAIfer lab (www.saiferlab.ai). He has made pioneering contributions to machine-learning security, for which he received the 2022 ICML Test of Time Award and the 2021 Best Paper Award and Pattern Recognition Medal from Elsevier Pattern Recognition. He has managed more than 10 research projects, including national and EU-funded projects, totaling more than €1.5M over the last 5 years. He regularly serves as Area Chair for top-tier conferences in machine learning and computer security such as NeurIPS and the IEEE Symposium on Security and Privacy. He is an Associate Editor-in-Chief of Pattern Recognition and chaired IAPR TC1 (2016-2020). He is a Fellow of IEEE, a Senior Member of ACM, and a member of IAPR and ELLIS.
Abstract: In this talk, I will briefly review some recent advancements in machine learning security, with a critical focus on the main factors that are hindering progress in this field. These include the lack of an underlying, systematic, and scalable framework to properly evaluate machine-learning models under adversarial and out-of-distribution scenarios, along with suitable tools for easing their debugging. The latter may be helpful in unveiling flaws in the evaluation process, as well as the presence of potential dataset biases and spurious features learned during training. Finally, I will report concrete examples of our laboratory's recent work toward overcoming these limitations, in the context of malware detection and web security.
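As a concrete illustration of the kind of adversarial scenario such evaluations probe, the sketch below implements the classic fast gradient sign method (FGSM). This is a generic textbook attack, not the laboratory's tooling; it assumes image inputs scaled to [0, 1], and epsilon is an illustrative perturbation budget.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Return an adversarially perturbed copy of input batch x (labels y)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that maximally increases the loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # assumes inputs scaled to [0, 1]
```

A robustness evaluation would compare model accuracy on x against accuracy on x_adv; stronger, iterative attacks are normally required for a reliable assessment.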
Media Security From Watermarking to Forensics: The Story Keeps Repeating

Edward J. Delp
He was born in Cincinnati, Ohio. He received the B.S.E.E. (cum laude) and M.S. degrees from the University of Cincinnati, and the Ph.D. degree from Purdue University. In 2008 he was named a Distinguished Professor and is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image analysis, computer vision, machine learning, image and video compression, multimedia security, medical imaging, multimedia systems, communication, and information theory. Dr. Delp is a Fellow of the IEEE, a Fellow of the ACM, a Fellow of the SPIE, a Fellow of the Society for Imaging Science and Technology (IS&T), and a Fellow of the American Institute of Medical and Biological Engineering.
Abstract: In this talk I will provide an overview of the history of multimedia security and how it evolved from photographers wanting to protect images they posted on the Web to the explosion in generated media. I will describe early approaches to media content security, including watermarking and steganography. Media security then shifted from content protection to media forensics, where the goal is to determine whether a media element has been altered or manipulated. In recent years this has been extended to the analysis of generated media (e.g., “deep fakes”). I will discuss modern approaches to the forensic analysis of images, video, and speech. Finally, I will talk about how the forensics community has circled back to watermarking as a way to potentially label generated content, and the issues involved in deploying this approach. I will also describe several open research problems.
Toward Continual Learning in the Foundation Model Era

Joost van de Weijer
He leads the Learning and Machine Perception (LAMP) team at the Computer Vision Center, Universitat Autònoma de Barcelona. He received his PhD from the University of Amsterdam in 2005, focusing on physics-based computer vision for improved color understanding in images. He was a Marie Curie Intra-European Fellow at INRIA Rhône-Alpes, France, and later held a Ramón y Cajal fellowship at the Universitat Autònoma de Barcelona. His current research centers on lifelong learning in deep neural networks, encompassing areas such as continual learning, transfer learning, and domain adaptation. In addition, he works on generative models for visual data, with recent efforts aimed at improving efficiency and text-based control in diffusion models. He is a Fellow of the ELLIS Society.
Abstract: Whereas biological systems continually learn throughout their lifetime, most artificial intelligence methods still operate in a rigid two-phase paradigm: a training phase followed by a deployment phase. When new data becomes available, these systems are typically retrained from scratch, discarding previous knowledge and requiring substantial computational resources. Continual learning proposes theory and methods that enable artificial intelligence to efficiently adapt to non-i.i.d. data streams without forgetting previously acquired knowledge. In the first part of this talk, I will introduce the foundations of continual learning theory and discuss several methods to address catastrophic forgetting. In the second part, I will present recent work on how continual learning can be applied in the foundation model era. I will conclude by discussing several open research questions.
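As one concrete illustration of a method against catastrophic forgetting, the sketch below shows an EWC-style quadratic penalty that discourages drift from parameters important to earlier tasks. This is a generic example rather than the talk's specific method; the Fisher estimates and the weight lam are placeholders.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty on drift from parameters learned on earlier tasks.

    old_params / fisher: dicts from parameter name to tensors saved after
    training on the previous task (Fisher values estimate importance).
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2 * penalty

# During training on a new task, the total loss would be:
#   loss = task_loss + ewc_penalty(model, old_params, fisher)
```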
Modality Alignment and Misalignment in Large Multimodal Foundation Models

Andrew D. Bagdanov
He is currently Associate Professor at the University of Florence, Italy. He received his PhD in computer science in 2004 from the University of Amsterdam, after which he held a postdoctoral position at the University of Florence and a senior development position at the FAO of the United Nations. In 2013 he became Senior Researcher and Ramón y Cajal Fellow at the Computer Vision Center, Barcelona, and in 2015 began his tenure as Associate Professor at the University of Florence. He is the coordinator of the Master's Program in Artificial Intelligence at the University of Florence, and he leads a small group whose research spans a broad spectrum of computer vision, image processing, and deep learning.
Abstract: Large multimodal models map inputs from two or more modalities into an aligned representation space that facilitates cross-modal comparisons (e.g., text-to-image or image-to-text). A notable example of such multimodal models is OpenAI's CLIP, which was pre-trained on a web-scale collection of images and natural language. Through the contrastive alignment loss used during pre-training, CLIP has acquired a wide range of visual (textual) concepts that can be easily adapted to different tasks through textual (visual) prompts. In the first part of this talk I will introduce the fundamental concepts behind contrastive, multimodal pre-training and the wide variety of applications it enables -- applications made possible by the inter-modal alignment of multiple modalities. In the second part of this talk I will discuss the phenomenon known as the Modality Gap, which arises because different input modalities map to different manifolds in the representation space. I will discuss how this misalignment manifests itself, demonstrate how its effects depend on how the modality encoders are used, and point to ongoing work aiming to mitigate this misalignment inherent to contrastively-trained multimodal models.
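For concreteness, the sketch below shows the symmetric contrastive (InfoNCE-style) alignment loss used in CLIP-style pre-training, together with one simple way the Modality Gap is often quantified. The temperature, shapes, and centroid-distance measure are illustrative assumptions, not a reproduction of the talk's material.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of B matching image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    labels = torch.arange(len(logits), device=logits.device)  # matches on diagonal
    # Symmetric cross-entropy: each image matches its text, and vice versa.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def modality_gap(img_emb, txt_emb):
    """One common gap measure: distance between the two modality centroids."""
    return (F.normalize(img_emb, dim=-1).mean(0) -
            F.normalize(txt_emb, dim=-1).mean(0)).norm()
```

A nonzero centroid distance reflects the image and text embeddings occupying separate regions of the shared space, which is the misalignment the second part of the talk addresses.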