Tentative Program
Collectionless AI and Nature-Inspired Learning

Marco Gori
He is a full professor of computer science at the University of Siena, where he leads the Siena Artificial Intelligence Lab. His work focuses on machine learning and neural computation, with a significant impact on the field of graph neural networks. Gori co-authored the pioneering paper "A New Model for Learning in Graph Domains" (IJCNN 2005), which introduced the term "Graph Neural Network" and laid the groundwork for this influential field. His later paper, "Graph Neural Networks" (IEEE-TNN 2009), expanded on these ideas and has garnered over 10,000 citations. He has also collaborated with the 3IA Côte d'Azur since 2019 and previously chaired the Italian Chapter of the IEEE Computational Intelligence Society. Gori is a Fellow of IEEE, EurAI, IAPR, and ELLIS.
Abstract: AI is revolutionizing not only the entire field of Computer Science but nearly all fields of Science. However, while application contexts multiply and LLMs display uncanny cognitive qualities, the research field seems headed toward saturation of the fundamental ideas behind today’s spectacular results from large companies. Is the infamous “AI winter” perhaps creeping into research? In this talk I argue that the time is ripe for a fundamental rethinking of AI methodologies, aimed at migrating intelligence from the cloud to the growing global population of devices with on-board CPUs. To support learning schemes inspired by mechanisms found in nature, I propose developing intelligent systems within NARNIAN, a platform that enables social mechanisms and fosters learning processes over time, without the need for data storage.
The Foundation of Foundation Models for Adaptive Multimedia Agents

Amos Storkey
He is Professor of Machine Learning and Artificial Intelligence in the School of Informatics, University of Edinburgh. He completed his undergraduate studies in Mathematics at Trinity College, Cambridge, moving into neural networks for his PhD (1995-98). He has continued research in AI methodology and applications ever since, and is known for work in deep learning, continual learning, meta-learning, representation learning, domain shift, reinforcement learning, and efficient learning systems for edge devices. Storkey founded and leads the Bayesian and Neural Systems Group in Edinburgh. He is a Fellow of the ELLIS Society and has served as director of the EPSRC Centre for Doctoral Training in Data Science. He is currently director of the EPSRC Centre for Doctoral Training in Machine Learning Systems.
Abstract: In this talk we will introduce four different types of foundation generative models, in the context of multimedia signal processing and control: autoregressive models; the class of diffusion, gradient flow, and flow matching models; state space models; and basis-function approaches. We explain how and why each approach works, and what is necessary to train them. We then look at foundational representation learning models, why they work, and discuss how they can be used for multi-modal fusion. These pieces will be assembled into a complete toolkit for foundational multimedia signal processing, along with demonstrations of multimedia foundation models in action. We will look at what is needed to combine signals, language, images, video, and audio. In the second part we will examine the fundamental issues in using foundation models: domain shift and fine-tuning, non-stationarity and adaptability, and finally computational cost and on-device implementation. We will close with a brief review of state-of-the-art developments and new understanding in the field.
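To give a concrete flavor of what training one of these model families involves, below is a minimal, illustrative sketch of a single flow-matching training step in PyTorch. It assumes a velocity-prediction network with the placeholder signature model(xt, t); all names, shapes, and hyperparameters are illustrative assumptions, not material from the talk.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, optimizer):
    """One training step; x1 is a batch of real data of shape (B, D).

    Assumes `model(xt, t)` predicts a velocity field (placeholder API).
    """
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    v_target = x1 - x0                               # constant velocity of that path
    loss = F.mse_loss(model(xt, t), v_target)        # regress predicted velocity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The straight-line (rectified-flow) path used here is the simplest formulation; the diffusion and gradient-flow variants covered in the talk differ mainly in the choice of path and target.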
From ChatGPT to a Semantic Transformer

Scott T. Acton
He is the Lawrence R. Quarles Professor and Chair of Electrical & Computer Engineering at the University of Virginia as well as the AI Advisor to the Provost. He also holds an appointment in Biomedical Engineering. He was Program Director (and then acting Deputy Division Director) in the Computer and Information Science and Engineering directorate of the National Science Foundation, where he received the Director’s Award for Superior Accomplishment. He received the M.S. and Ph.D. degrees from the University of Texas at Austin and the B.S. degree from Virginia Tech. Professor Acton is a Fellow of the IEEE “for contributions to biomedical image analysis.” He is also a Fellow of the American Institute for Medical and Biological Engineering, a Fellow of the Asia-Pacific Artificial Intelligence Association, and a Fellow of the Artificial Intelligence Industry Alliance. Professor Acton’s laboratory at UVA, VIVA (Virginia Image and Video Analysis), specializes in biological and biomedical image analysis problems. Professor Acton has over 400 publications in the image analysis area, including the books Biomedical Image Analysis: Tracking and Biomedical Image Analysis: Segmentation. He was the 2018 Co-Chair of the IEEE International Symposium on Biomedical Imaging and was Editor-in-Chief of the IEEE Transactions on Image Processing (2014-2018).
Abstract: This talk will set the stage with a discussion of recent activity in artificial intelligence research. First, the basic transformer model will be explained. Next, the contrast between natural language processing and video processing will be highlighted. The talk will culminate in a discussion of a multimodal video transformer that utilizes semantically meaningful query and key features to learn context for action recognition.
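As background for the transformer discussion, here is a minimal sketch of standard scaled dot-product attention, the core operation of the basic transformer model. It is a generic illustration only and does not reproduce the semantically meaningful query and key features of the talk's model.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: tensors of shape (batch, seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise query-key similarity
    weights = torch.softmax(scores, dim=-1)        # attention distribution per query
    return weights @ V                             # weighted sum of value vectors
```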
Multimodal AI for Advancing Instruction

Scott T. Acton
See the biography above.
Abstract: In this talk, the application of computer vision and AI to video analysis of the elementary classroom will be detailed. The talk will cover data collection, video scoring, the presentation of results on a custom dashboard, and a novel video coach for teachers. The use of multimodal inputs including video, audio, and text will be explored. Studies showing the efficacy of the AI methodology will be summarized. The talk will conclude with a discussion of open directions in AI research as well as lessons learned in the broad area of AI.
Machine Learning Security: Lessons Learned and Future Challenges

Battista Biggio
He (MSc 2006, PhD 2010) is a Full Professor of Machine Learning at the University of Cagliari, Italy, and research co-director of AI Security at the sAIfer lab (www.saiferlab.ai). He has made pioneering contributions to machine-learning security, for which he received the 2022 ICML Test of Time Award and the 2021 Best Paper Award and Pattern Recognition Medal from Elsevier Pattern Recognition. He has managed more than 10 research projects, including national and EU-funded projects, totaling more than €1.5M over the last 5 years. He regularly serves as Area Chair for top-tier conferences in machine learning and computer security such as NeurIPS and the IEEE Symposium on Security and Privacy. He is an Associate Editor-in-Chief of Pattern Recognition and chaired IAPR TC1 (2016-2020). He is a Fellow of IEEE, a Senior Member of ACM, and a member of IAPR and ELLIS.
Abstract: In this talk, I will briefly review some recent advancements in machine learning security, with a critical focus on the main factors that are hindering progress in this field. These include the lack of an underlying, systematic, and scalable framework to properly evaluate machine-learning models under adversarial and out-of-distribution scenarios, along with suitable tools for easing their debugging. The latter may be helpful in unveiling flaws in the evaluation process, as well as the presence of potential dataset biases and spurious features learned during training. Finally, I will report concrete examples of our laboratory's recent work toward overcoming these limitations, in the context of malware detection and web security.
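As a concrete illustration of the kind of adversarial scenario such evaluations probe, the sketch below implements the classic fast gradient sign method (FGSM). This is a generic textbook attack, not the laboratory's tooling; it assumes image inputs scaled to [0, 1], and epsilon is an illustrative perturbation budget.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Return an adversarially perturbed copy of input batch x (labels y)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that maximally increases the loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # assumes inputs scaled to [0, 1]
```

A robustness evaluation would compare model accuracy on x against accuracy on x_adv; stronger, iterative attacks are normally required for a reliable assessment.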
Media Security From Watermarking to Forensics: The Story Keeps Repeating

Edward J. Delp
He was born in Cincinnati, Ohio. He received the B.S.E.E. (cum laude) and M.S. degrees from the University of Cincinnati, and the Ph.D. degree from Purdue University. In 2008 he was named a Distinguished Professor and is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image analysis, computer vision, machine learning, image and video compression, multimedia security, medical imaging, multimedia systems, communication, and information theory. Dr. Delp is a Fellow of the IEEE, a Fellow of the ACM, a Fellow of the SPIE, a Fellow of the Society for Imaging Science and Technology (IS&T), and a Fellow of the American Institute of Medical and Biological Engineering.
Abstract: In this talk I will provide an overview of the history of multimedia security and how it evolved from photographers wanting to protect images they posted on the Web to the explosion in generated media. I will describe early approaches to media content security, including watermarking and steganography. Media security then shifted from content protection to media forensics, where the goal is to determine whether a media element has been altered or manipulated. In recent years this has been extended to the analysis of generated media (e.g., “deep fakes”). I will discuss modern approaches to the forensic analysis of images, video, and speech. Finally, I will talk about how the forensics community has circled back to watermarking as a way to potentially label generated content, and the issues involved in deploying this approach. I will also describe several open research problems.
Toward Continual Learning in the Foundation Model Era

Joost van de Weijer
He leads the Learning and Machine Perception (LAMP) team at the Computer Vision Center, Universitat Autònoma de Barcelona. He received his PhD from the University of Amsterdam in 2005, focusing on physics-based computer vision for improved color understanding in images. He was a Marie Curie Intra-European Fellow at INRIA Rhône-Alpes, France, and later held a Ramón y Cajal fellowship at the Universitat Autònoma de Barcelona. His current research centers on lifelong learning in deep neural networks, encompassing areas such as continual learning, transfer learning, and domain adaptation. In addition, he works on generative models for visual data, with recent efforts aimed at improving efficiency and text-based control in diffusion models. He is a Fellow of the ELLIS Society.
Abstract: Whereas biological systems continually learn throughout their lifetime, most artificial intelligence methods still operate in a rigid two-phase paradigm: a training phase followed by a deployment phase. When new data becomes available, these systems are typically retrained from scratch, discarding previous knowledge and requiring substantial computational resources. Continual learning proposes theory and methods that enable artificial intelligence to efficiently adapt to non-i.i.d. data streams without forgetting previously acquired knowledge. In the first part of this talk, I will introduce the foundations of continual learning theory and discuss several methods to address catastrophic forgetting. In the second part, I will present recent work on how continual learning can be applied in the foundation model era. I will conclude by discussing several open research questions.
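As one concrete illustration of a method against catastrophic forgetting, the sketch below shows an EWC-style quadratic penalty that discourages drift from parameters important to earlier tasks. This is a generic example rather than the talk's specific method; the Fisher estimates and the weight lam are placeholders.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty on drift from parameters learned on earlier tasks.

    old_params / fisher: dicts from parameter name to tensors saved after
    training on the previous task (Fisher values estimate importance).
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2 * penalty

# During training on a new task, the total loss would be:
#   loss = task_loss + ewc_penalty(model, old_params, fisher)
```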
Modality Alignment and Misalignment in Large Multimodal Foundation Models

Andrew D. Bagdanov
He is currently Associate Professor at the University of Florence, Italy. He received his PhD in computer science in 2004 from the University of Amsterdam, after which he held a postdoctoral position at the University of Florence and a senior development position at the FAO of the United Nations. In 2013 he became Senior Researcher and Ramón y Cajal Fellow at the Computer Vision Center, Barcelona, and in 2015 began his tenure as Associate Professor at the University of Florence. He is the coordinator of the Master's Program in Artificial Intelligence at the University of Florence, and he leads a small group whose research spans a broad spectrum of computer vision, image processing, and deep learning.
Abstract: Large multimodal models map inputs from two or more modalities into an aligned representation space that facilitates cross-modal comparisons (e.g., text-to-image or image-to-text). A notable example of such multimodal models is OpenAI's CLIP, which was pre-trained on a web-scale collection of images and natural language. Through the contrastive alignment loss used during pre-training, CLIP has acquired a wide range of visual (textual) concepts that can be easily adapted to different tasks through textual (visual) prompts. In the first part of this talk I will introduce the fundamental concepts behind contrastive, multimodal pre-training and the wide variety of applications it enables -- applications made possible by the inter-modal alignment of multiple modalities. In the second part of this talk I will discuss the phenomenon known as the Modality Gap, which arises because different input modalities map to different manifolds in the representation space. I will discuss how this misalignment manifests itself, demonstrate how its effects depend on how the modality encoders are used, and point to ongoing work aiming to mitigate this misalignment inherent to contrastively-trained multimodal models.
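For concreteness, the sketch below shows the symmetric contrastive (InfoNCE-style) alignment loss used in CLIP-style pre-training, together with one simple way the Modality Gap is often quantified. The temperature, shapes, and centroid-distance measure are illustrative assumptions, not a reproduction of the talk's material.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of B matching image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    labels = torch.arange(len(logits), device=logits.device)  # matches on diagonal
    # Symmetric cross-entropy: each image matches its text, and vice versa.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def modality_gap(img_emb, txt_emb):
    """One common gap measure: distance between the two modality centroids."""
    return (F.normalize(img_emb, dim=-1).mean(0) -
            F.normalize(txt_emb, dim=-1).mean(0)).norm()
```

A nonzero centroid distance reflects the image and text embeddings occupying separate regions of the shared space, which is the misalignment the second part of the talk addresses.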