Navigating the Shift to Generative AI and Multimodal LLMs

Ten years ago, Yann LeCun gave an astonishing keynote at the Embedded Vision Summit, in which he demonstrated the power and practicality of deep neural networks (DNNs) for computer vision. As I left his talk, I recall thinking to myself, “If what he says about deep neural networks is true, that changes everything about computer vision.” Of course, DNNs have indeed revolutionized both how we do computer vision and what we can do with computer vision.

I never imagined that in the span of my career, there would be another discontinuity that would upend our established methods and open up vast new possibilities. But that is exactly what has happened with the emergence of practical transformer networks, large language models (LLMs), vision language models and large multimodal models. As we learn how to efficiently implement this new generation of models at the edge, they are opening up amazing possibilities—opportunities to create products and systems that are more capable, more adaptable, safer and easier to use—in applications that will impact virtually every industry.

The 2024 Embedded Vision Summit, the premier event for innovators adding computer vision and edge AI to products, is set to explore the transformative potential of multimodal language models at the edge. I thought it would be worthwhile to preview some of the presentations focused on this topic.

One of the highlights of the Summit will be the keynote address by Yong Jae Lee, associate professor at the University of Wisconsin-Madison. Lee will present groundbreaking research on creating intelligent systems that can learn to understand our multimodal world with minimal human supervision. He will focus on systems that can comprehend both images and text, while also touching upon those that utilize video, audio and LiDAR. Attendees will gain insights into how these emerging techniques can address neural network training bottlenecks, facilitate new types of multimodal machine perception and enable countless new applications.

The Summit will also feature a thought-provoking general session talk by Jilei Hou, VP of Engineering and head of AI Research at Qualcomm Technologies. Hou will share Qualcomm’s vision of the compelling opportunities enabled by efficient generative AI at the edge. He will identify the key hurdles that the industry must overcome to realize the massive potential of these technologies and highlight Qualcomm’s research and product development work in this area. This includes techniques for efficient on-device execution of LLMs, large vision models, and large multimodal models, as well as methods for orchestration of large models at the edge and approaches for adaptation and personalization.

A related and highly anticipated session will be a panel discussion: “Multimodal LLMs at the Edge: Are We There Yet?” The panel will bring together experts from Meta Reality Labs, EE Times, Qualcomm Technologies, Useful Sensors and academia to explore the rapidly evolving role of multimodal LLMs in machine perception applications at the edge. Panelists will discuss the extent to which multimodal LLMs will change how we approach computer vision and other types of machine perception, the challenges in running them at the edge, and whether today’s edge hardware is up to the task. Attendees can expect a lively and insightful discussion that will shed light on the future of multimodal LLMs in real-world applications.

The Summit will also feature a range of talks that showcase the practical applications of generative AI and LLMs. István Fehérvári, chief scientist at Ingram Technologies, will deliver a talk titled “Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI.” Fehérvári will explain the fundamentals of LLMs, explore how they have evolved to integrate visual understanding and examine the current landscape of multimodal LLMs. He will also delve into the applications that will be enabled by deploying these large models at the edge and identify the key barriers that must be overcome to make this a reality.

Mehrsan Javan, CTO at Sportlogiq, will present a case study on using vision systems, generative models and reinforcement learning for sports analytics. Javan will share the obstacles his team encountered in adapting advanced analytics, originally developed for professional leagues, to create a new product for the youth sports market. Attendees will learn how Sportlogiq combines vision systems, generative models and reinforcement learning techniques to develop compelling products for youth sports, along with the valuable lessons learned in the process.

As the 2024 Embedded Vision Summit approaches, it is clear that generative AI and multimodal language models will be at the forefront of the discussions. With a lineup of expert speakers and thought-provoking sessions, the Summit promises to provide attendees with a comprehensive understanding of the latest advancements, challenges and opportunities in this rapidly evolving field. Innovators, product creators, and engineers alike will have the chance to delve into cutting-edge technologies and gain insights that will shape the future of embedded vision and AI. I hope to see you in Santa Clara, Calif., in May!
