A Spring 2026 course connecting the historical foundations of vision with modern deep learning,
transformers, and hands-on project work.
Abstract
Machine learning has profoundly changed computer vision, but the field's current methods build on a long
history of image formation, geometry, perception, recognition, and representation learning. This course
takes a holistic view of the task of vision.
Lectures and tutorials are accompanied by bi-weekly quizzes and project work; assessment combines
the quizzes, the project, and the final exam.
Format: Lectures, tutorials, bi-weekly quizzes, project work, and final exam.
Bibliography: Foundations of Computer Vision, Antonio Torralba, Phillip Isola, William T. Freeman, MIT Press, 2024.
Programs: MSc Artificial Intelligence, MSc Computational Science, MSc Informatics, and Faculty of Informatics PhD students.
Student Project Ideas
Project proposals live here so classmates can quickly scan possible directions, compare ideas, and submit
additions by pull request.
Example project idea
Scene understanding, geometry, foundation models
Semantic Change Maps from Everyday Walks
Can a short phone video reveal how a campus route changes over time? This project combines monocular
depth, semantic segmentation, and feature matching to align walks recorded on different days, then
highlights moved objects, blocked paths, or new scene elements. The result would be a small visual demo
that connects 3D reasoning, human attention, and practical scene understanding for urban navigation.
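As a rough illustration of the cross-day alignment step, here is a minimal OpenCV sketch that matches ORB features between two frames of the same spot, warps one onto the other with a RANSAC homography, and thresholds the difference as a crude change map. The file names are placeholders, and the depth and segmentation stages are not shown.

```python
# Minimal sketch: align two frames of the same spot recorded on different days
# with ORB features + RANSAC homography, then diff them to flag changed regions.
# File names are placeholders; depth and segmentation stages are omitted.
import cv2
import numpy as np

day1 = cv2.imread("frame_day1.jpg", cv2.IMREAD_GRAYSCALE)
day2 = cv2.imread("frame_day2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(day1, None)
kp2, des2 = orb.detectAndCompute(day2, None)

# Match descriptors and keep the best correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp day 1 into day 2's frame and threshold the absolute difference
# as a crude "something changed here" map.
warped = cv2.warpPerspective(day1, H, (day2.shape[1], day2.shape[0]))
change_map = cv2.absdiff(warped, day2)
_, change_mask = cv2.threshold(change_map, 40, 255, cv2.THRESH_BINARY)
cv2.imwrite("change_mask.png", change_mask)
```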
Group J
Self-supervised video representations, foundation models, anomaly detection
Probing V-JEPA 2: What Does a Video Model Actually See?
V-JEPA 2 is Meta’s self-supervised video encoder, trained without labels to predict masked
spatio-temporal regions. We want to open up its latent space and understand how it reacts to the
visual world — and, more interestingly, to things that don’t belong in it. Starting from a frozen
pretrained encoder, we build an interactive demo that embeds short clips and surfaces structure,
similarity, and drift over time. On top of this, we explore anomaly detection as a concrete
application: can the embedding space tell a banana from a pair of scissors in a rock-paper-scissors
game, a boat on a highway, or an abnormal beat in an ECG recording?
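A minimal sketch of the anomaly-scoring idea, assuming a frozen clip encoder: embeddings of "normal" clips define a reference center, and test clips are scored by their distance from it. load_frozen_video_encoder and load_clip_batch are hypothetical stand-ins for the actual V-JEPA 2 loading and preprocessing code.

```python
# Sketch of embedding-space anomaly scoring with a frozen video encoder.
# `load_frozen_video_encoder` and `load_clip_batch` are hypothetical stand-ins
# for the real V-JEPA 2 loading and preprocessing code.
import torch

encoder = load_frozen_video_encoder()          # frozen, no gradients
normal_clips = load_clip_batch("normal/")      # (N, T, C, H, W) tensor
test_clips = load_clip_batch("test/")

@torch.no_grad()
def embed(clips):
    z = encoder(clips)                         # (N, D) pooled clip embeddings
    return torch.nn.functional.normalize(z, dim=-1)

# Fit a simple "normal" reference: the mean embedding of in-distribution clips.
z_normal = embed(normal_clips)
center = z_normal.mean(dim=0, keepdim=True)

# Anomaly score = cosine distance to the normal center; higher means the
# encoder places the clip far from what it treats as "ordinary".
z_test = embed(test_clips)
scores = 1.0 - (z_test @ center.T).squeeze(1)
print(scores.tolist())
```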
One Photo, Many Aisles: Visual Search Across a Heterogeneous Marketplace
A single foundation model rarely knows a sofa, a smartphone and a floral dress equally well — yet a real
marketplace catalog mixes all of them on the same shelf. This project studies where general-purpose visual
encoders like CLIP and SigLIP 2 stop being enough for e-commerce retrieval, and what has to be rebuilt
around them when the catalog is not one domain but twenty. Each query is first stripped of its context —
mannequins, human models, living-room scenes, studio gradients — then routed to a category-specific
expert whose embeddings are reranked with the fine-grained color and texture cues that generic models
quietly discard. The whole pipeline is then compressed into an on-device Android demo, raising a second
question the paper versions of Google Lens rarely address in the open: how much of a foundation-model
retrieval system actually survives when it has to run on a phone?
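One way the route-then-rerank logic could look, sketched with hypothetical helpers (classify_category, the per-category expert encoders, and the catalog index) and a plain HSV color histogram standing in for the fine-grained cues:

```python
# Sketch of route-then-rerank retrieval over a mixed catalog.
# `classify_category`, `category_experts`, and the catalog index are hypothetical
# stand-ins; the rerank uses a plain HSV histogram as the fine-grained color cue.
import cv2
import numpy as np

def color_hist(img_bgr):
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, None).flatten()

def search(query_img, catalog, top_k=10, alpha=0.7):
    # 1) Route the query to a category-specific expert encoder.
    category = classify_category(query_img)          # e.g. "furniture", "fashion"
    expert = category_experts[category]
    q_emb = expert.embed(query_img)                  # (D,) unit-normalized

    # 2) Coarse retrieval by embedding similarity within that category.
    items = catalog[category]                        # list of (item_id, emb, image)
    sims = np.array([q_emb @ emb for _, emb, _ in items])
    order = np.argsort(-sims)[: top_k * 5]

    # 3) Rerank candidates with a color-histogram similarity the generic
    #    embedding tends to wash out.
    q_hist = color_hist(query_img)
    rescored = []
    for idx in order:
        item_id, emb, image = items[idx]
        hist_sim = cv2.compareHist(q_hist, color_hist(image), cv2.HISTCMP_CORREL)
        rescored.append((alpha * sims[idx] + (1 - alpha) * hist_sim, item_id))
    return sorted(rescored, reverse=True)[:top_k]
```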
Group R
Real-time segmentation, predictive architectures, AR visualization
AI-Powered "Jarvis" AR Assistant for Motorcyclists
Can we transform a standard motorcycle commute into a safer, futuristic experience? This project leverages
first-person view (FPV) footage to build an intelligent AR dashboard. By integrating SAM 3 for precise
real-time object "locking" and V-JEPA 2 to predict potential road hazards,
the system simulates a smart helmet interface showcasing 3D spatial awareness.
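A high-level sketch of the per-frame overlay loop such a demo would need; lock_objects and hazard_score are hypothetical wrappers for the segmentation-tracking and hazard-prediction models, not real APIs, and the overlay drawing is plain OpenCV on a placeholder FPV recording.

```python
# High-level sketch of the AR dashboard loop. `lock_objects` (mask tracking) and
# `hazard_score` (clip-level hazard prediction) are hypothetical wrappers;
# the overlay drawing is plain OpenCV.
import cv2

cap = cv2.VideoCapture("ride_fpv.mp4")                # placeholder FPV recording
recent_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    recent_frames = (recent_frames + [frame])[-16:]   # short rolling clip

    for box, label in lock_objects(frame):            # tracked road objects
        x1, y1, x2, y2 = box
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    if hazard_score(recent_frames) > 0.8:             # predicted hazard warning
        cv2.putText(frame, "HAZARD AHEAD", (30, 60),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 3)

    cv2.imshow("AR dashboard", frame)
    if cv2.waitKey(1) == 27:                          # Esc to quit
        break
cap.release()
```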
Group C
Video Search, Segmentation, Foundation Models
SemanticSpot: "Ctrl+F" for Videos
Ever wanted to find a specific object or action in a long video without scrubbing through it manually? SemanticSpot is an interactive web app that lets you search videos using natural language. You can simply type a query like "the person wearing a blue hat" or "the red coffee mug". Behind the scenes, the app uses CLIP to instantly locate the exact timestamp where your search appears, and SAM (Segment Anything) to dynamically track and highlight the object on the screen. It’s a smart, zero-shot visual search engine combined with automatic segmentation!
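A minimal sketch of the timestamp-search step, assuming Hugging Face CLIP and a placeholder video path; sampled frames are scored against the text query and the best-matching time is returned. The SAM tracking stage is not shown.

```python
# Minimal sketch of timestamp search: score sampled frames against a text query
# with CLIP and return the best-matching time. SAM tracking is omitted;
# "video.mp4" is a placeholder path.
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def find_timestamp(video_path, query, every_n_frames=30):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames, times, i = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            times.append(i / fps)
        i += 1
    cap.release()

    with torch.no_grad():
        inputs = processor(text=[query], images=frames,
                           return_tensors="pt", padding=True)
        out = model(**inputs)
        # logits_per_image: similarity of each sampled frame to the query.
        scores = out.logits_per_image.squeeze(1)
    best = scores.argmax().item()
    return times[best], scores[best].item()

print(find_timestamp("video.mp4", "the person wearing a blue hat"))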
Interactive Real-Time Semantic Querying of Driving Scenes
What if you could search inside a video as it plays? This project transforms a driving scene into an interactive interface where users can type natural language queries such as "car", "truck", or "pedestrian" and instantly see matching objects highlighted and tracked in real time.
The demo emphasizes fluid interaction: queries can be added or removed on the fly without interrupting the video, multiple concepts can be explored simultaneously, and each query is visualized with distinct colors for clarity. This allows users to dynamically “interrogate” the scene and observe how the system adapts immediately to new inputs.
Rather than focusing purely on detection, the project showcases a new way of interacting with visual data, turning passive video into an active, query-driven exploration tool.
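A small sketch of the on-the-fly query management: each active query keeps a precomputed text embedding and a fixed display color, so queries can be added or removed without stalling the video loop. The region embeddings handed to match() are assumed to come from a detector or segmenter, and the threshold is illustrative.

```python
# Sketch of on-the-fly query management: each active query keeps a precomputed
# text embedding and a display color. Region embeddings are assumed to come
# from a detector/segmenter; the similarity threshold is illustrative.
import itertools
import numpy as np

COLORS = itertools.cycle([(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)])

class QueryManager:
    def __init__(self, embed_text, threshold=0.25):
        self.embed_text = embed_text      # text -> unit-norm vector
        self.threshold = threshold
        self.queries = {}                 # text -> (embedding, color)

    def add(self, text):
        # Embed once so adding a query never interrupts the video.
        self.queries[text] = (self.embed_text(text), next(COLORS))

    def remove(self, text):
        self.queries.pop(text, None)

    def match(self, boxes, region_embs):
        """Return (box, query, color) for every region similar enough to a query."""
        hits = []
        for text, (q_emb, color) in self.queries.items():
            sims = region_embs @ q_emb    # cosine similarity, embeddings unit-norm
            for box, sim in zip(boxes, sims):
                if sim > self.threshold:
                    hits.append((box, text, color))
        return hits
```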
Group M
Emotion recognition, deep learning, CNN-RNN
Facial Expression Recognition with Hybrid Models
Can a model reliably read human emotions from a single image? This project explores facial expression recognition by classifying faces into categories such as happiness, sadness, anger, and surprise. We build a small interactive demo that visualizes predicted emotions on input images.
To achieve this, we combine CNNs for spatial feature extraction with RNNs to capture temporal patterns, comparing pretrained models (MobileNetV2, InceptionV3) with a custom CNN-RNN trained from scratch on FER2013 and CK+ datasets.
The project focuses on evaluating performance, robustness, and how transfer learning influences emotion recognition across different data conditions.
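A minimal PyTorch sketch of the hybrid idea: a pretrained MobileNetV2 backbone extracts per-frame features, a GRU aggregates them over time, and a linear head predicts the expression class. The input shapes and the seven-class head are assumptions, not the group's final design.

```python
# Minimal PyTorch sketch of the CNN-RNN hybrid: a pretrained MobileNetV2 backbone
# extracts per-frame features, a GRU aggregates them over time, and a linear head
# predicts one of 7 expression classes. Shapes and class count are assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

class CnnRnnFER(nn.Module):
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        backbone = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
        self.features = backbone.features          # (B, 1280, h, w) feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.rnn = nn.GRU(1280, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, 224, 224)
        b, t, c, h, w = clips.shape
        x = self.features(clips.view(b * t, c, h, w))
        x = self.pool(x).flatten(1).view(b, t, -1) # (B, T, 1280)
        _, last = self.rnn(x)                      # last hidden state: (1, B, hidden)
        return self.head(last.squeeze(0))          # (B, num_classes)

model = CnnRnnFER()
logits = model(torch.randn(2, 8, 3, 224, 224))     # 2 clips of 8 frames
print(logits.shape)                                # torch.Size([2, 7])
```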
Group O
Segmentation, Object detection, tracking, foundation models
Smart Event Detection for Highlight Clips
Have you ever missed a highlight during a match? This system can capture highlights based on a user prompt or directly from a video.
It uses state-of-the-art approaches, such as Meta's SAM 3, to track objects, detect events, and generate short highlight clips.
The goal is to combine modern segmentation models (such as SAM) with classical computer vision techniques. Segmentation serves as a strong perception layer, while event detection is driven by motion-based features such as trajectories, velocity, and frequency analysis, along with lightweight reasoning.
The system follows a modular design, consisting of a general perception and feature extraction pipeline combined with task-specific event detection modules.
The system is primarily designed for human action detection (e.g., waving, raising a hand, standing up). As an extension, it can also handle simple sports scenarios, such as tracking a ball moving toward or crossing a goal, demonstrating its ability to generalize to multi-object interactions.
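A small numpy sketch of the motion-feature layer: given per-frame hand centroids from the tracking stage (a made-up trajectory here), it computes velocity and the dominant oscillation frequency and flags a wave when that frequency lands in a plausible band. The thresholds are illustrative assumptions.

```python
# Sketch of the motion-feature layer: given per-frame hand centroids from the
# segmentation/tracking stage (here a made-up trajectory), compute velocity and
# the dominant oscillation frequency, and flag "waving" when it falls in a
# plausible band. Thresholds are illustrative assumptions.
import numpy as np

def detect_wave(centroids_x, fps=30, min_hz=1.0, max_hz=5.0, min_amp=10.0):
    """centroids_x: 1-D array of the hand's x-coordinate per frame (pixels)."""
    x = np.asarray(centroids_x, dtype=float)
    x = x - x.mean()                              # remove the static position
    velocity = np.diff(x) * fps                   # pixels per second

    # Frequency analysis: dominant side-to-side oscillation of the hand.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    dominant = freqs[spectrum[1:].argmax() + 1]   # skip the DC bin

    is_wave = (min_hz <= dominant <= max_hz) and (np.abs(x).max() > min_amp)
    return is_wave, dominant, np.abs(velocity).mean()

# Synthetic example: a hand oscillating at ~2 Hz for 2 seconds at 30 fps.
t = np.arange(60) / 30.0
print(detect_wave(40 * np.sin(2 * np.pi * 2 * t)))
```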
Got a clue? ImageDetective can find the picture. Describe what you’re looking for, upload an image, or search by visual details. ImageDetective connects text and images through foundation models, combining global semantic search with patch-level matching for smarter and more explainable retrieval.
We propose a vision-language retrieval system that implements bidirectional search between images and natural language. With CLIP, we map both modalities into a shared embedding space, so an image can be retrieved with FAISS indexing. To improve accuracy further, a possible extension is to use SAM 3 for text-guided semantic segmentation and enable patch-level matching.
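A minimal sketch of the global semantic-search stage, assuming Hugging Face CLIP and FAISS; the catalog paths are placeholders and the patch-level SAM stage is not shown.

```python
# Minimal sketch of the global retrieval stage: CLIP image embeddings indexed
# with FAISS and queried with a CLIP text embedding. Image paths are placeholders;
# the patch-level SAM stage is not shown.
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]   # placeholder catalog

@torch.no_grad()
def embed_images(image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

@torch.no_grad()
def embed_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# Build an inner-product index; with unit-norm vectors this is cosine similarity.
embs = embed_images(paths)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)

scores, ids = index.search(embed_text("a red coffee mug"), k=3)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(rank, paths[i], float(s))
```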