Computer Vision

A Spring 2026 course connecting the historical foundations of vision with modern deep learning, transformers, and hands-on project work.

Abstract

Machine learning has profoundly changed computer vision, but the field's current methods build on a long history of image formation, geometry, perception, recognition, and representation learning. This course takes a holistic view of the task of vision.

Lectures and tutorials are accompanied by bi-weekly quizzes and project work; assessment combines the quizzes, the project, and the final exam.

Foundations of vision · Deep learning · Transformers
  • Instructor Francis Engelmann
  • Assistant Nicolai Hermann
  • Format Lectures, tutorials, bi-weekly quizzes, project work, and final exam.
  • Bibliography Foundations of Computer Vision, Antonio Torralba, Phillip Isola, William T. Freeman, MIT Press, 2024.
  • Programs MSc Artificial Intelligence, MSc Computational Science, MSc Informatics, and Faculty of Informatics PhD students.

Student Project Ideas

Project proposals live here so classmates can quickly scan possible directions, compare ideas, and submit additions by pull request.

Scene understanding, geometry, foundation models

Semantic Change Maps from Everyday Walks

Can a short phone video reveal how a campus route changes over time? This project combines monocular depth, semantic segmentation, and feature matching to align walks recorded on different days, then highlights moved objects, blocked paths, or new scene elements. The result would be a small visual demo that connects 3D reasoning, human attention, and practical scene understanding for urban navigation.
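The last step of the pipeline can be sketched in a few lines: once frames from two walks are aligned (for example via feature matching) and segmented, changed regions fall out of a cell-by-cell label comparison. The grids below are hypothetical segmentation outputs, not real model predictions.

```python
def changed_cells(labels_day1, labels_day2):
    """Return grid cells whose semantic label differs between two aligned walks."""
    changes = []
    for r, (row1, row2) in enumerate(zip(labels_day1, labels_day2)):
        for c, (a, b) in enumerate(zip(row1, row2)):
            if a != b:
                changes.append((r, c, a, b))  # (row, col, old label, new label)
    return changes

# Toy label grids for the same route on two different days.
day1 = [["road", "bike"], ["tree", "road"]]
day2 = [["road", "road"], ["tree", "barrier"]]
print(changed_cells(day1, day2))  # [(0, 1, 'bike', 'road'), (1, 1, 'road', 'barrier')]
```

In the real project the hard part is the alignment itself; this sketch only shows the change-highlighting step that would run on top of it.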

Self-supervised video representations, foundation models, anomaly detection

Probing V-JEPA 2: What Does a Video Model Actually See?

V-JEPA 2 is Meta’s self-supervised video encoder, trained without labels to predict masked spatio-temporal regions. We want to open up its latent space and understand how it reacts to the visual world — and, more interestingly, to things that don’t belong in it. Starting from a frozen pretrained encoder, we build an interactive demo that embeds short clips and surfaces structure, similarity, and drift over time. On top of this, we explore anomaly detection as a concrete application: can the embedding space tell a banana from a pair of scissors in a rock-paper-scissors game, a boat on a highway, or an abnormal beat in an ECG recording?
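The anomaly-detection idea can be sketched without the real model: score a query clip by its distance to the nearest "normal" clip in the frozen encoder's embedding space. The vectors below are toys standing in for V-JEPA 2 features.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_score(query, normal_bank):
    """Distance to the nearest 'normal' embedding: high score = doesn't belong."""
    return 1.0 - max(cosine(query, n) for n in normal_bank)

# Toy embeddings: clips of hands playing rock-paper-scissors vs. a banana.
normal_bank = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
banana = [0.0, 0.2, 1.0]
print(anomaly_score(normal_bank[0], normal_bank))  # ~0, a normal clip
print(anomaly_score(banana, normal_bank))          # clearly higher
```

The same scoring rule would apply unchanged to the boat-on-a-highway and ECG examples; only the embedding source changes.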

Vision-language embeddings, segmentation, on-device retrieval

One Photo, Many Aisles: Visual Search Across a Heterogeneous Marketplace

A single foundation model rarely knows a sofa, a smartphone and a floral dress equally well — yet a real marketplace catalog mixes all of them on the same shelf. This project studies where general-purpose visual encoders like CLIP and SigLIP 2 stop being enough for e-commerce retrieval, and what has to be rebuilt around them when the catalog is not one domain but twenty. Each query is first stripped of its context — mannequins, human models, living-room scenes, studio gradients — then routed to a category-specific expert whose embeddings are reranked with the fine-grained color and texture cues that generic models quietly discard. The whole pipeline is then compressed into an on-device Android demo, raising a second question the paper versions of Google Lens rarely address in the open: how much of a foundation-model retrieval system actually survives when it has to run on a phone?
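The route-then-rerank structure described above can be made concrete with a toy sketch: a hypothetical category router picks a per-domain expert, and candidates are reranked with a fine-grained cue (here a tiny color histogram) that a generic embedding may discard. All names and numbers are illustrative, not part of any real system.

```python
# Hypothetical mapping from catalog category to a domain-specific encoder.
EXPERTS = {"furniture": "sofa-encoder", "fashion": "dress-encoder"}

def route(query_category):
    return EXPERTS.get(query_category, "generic-encoder")

def rerank(candidates, query_hist):
    """Sort candidates by color-histogram overlap with the query (larger = closer)."""
    def overlap(hist):
        return sum(min(a, b) for a, b in zip(hist, query_hist))
    return sorted(candidates, key=lambda c: overlap(c["hist"]), reverse=True)

catalog = [
    {"id": "red-sofa",  "hist": [0.8, 0.1, 0.1]},
    {"id": "blue-sofa", "hist": [0.1, 0.1, 0.8]},
]
print(route("furniture"))                          # sofa-encoder
print(rerank(catalog, [0.7, 0.2, 0.1])[0]["id"])   # red-sofa
```

The on-device question then becomes which of these stages (router, experts, reranker) survive quantization and a phone-sized memory budget.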

Real-time segmentation, predictive architectures, AR visualization

AI-Powered "Jarvis" AR Assistant for Motorcyclists

Can we transform a standard motorcycle commute into a safer, futuristic experience? This project leverages first-person view (FPV) footage to build an intelligent AR dashboard. By integrating SAM 3 for precise real-time object "locking" and V-JEPA 2 to predict potential road hazards, the system simulates a smart helmet interface showcasing 3D spatial awareness.

Video search, segmentation, foundation models

SemanticSpot: "Ctrl+F" for Videos

Ever wanted to find a specific object or action in a long video without scrubbing through it manually? SemanticSpot is an interactive web app that lets you search videos using natural language. You can simply type a query like "the person wearing a blue hat" or "the red coffee mug". Behind the scenes, the app uses CLIP to instantly locate the exact timestamp where your search appears, and SAM (Segment Anything) to dynamically track and highlight the object on the screen. It’s a smart, zero-shot visual search engine combined with automatic segmentation!
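The "Ctrl+F" step reduces to a simple idea: CLIP-style text and frame embeddings live in one space, so the best timestamp is an argmax over cosine similarity between the query embedding and one embedding per frame. The vectors below are toys standing in for real CLIP features.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_timestamp(text_emb, frame_embs, fps=1.0):
    """Return (timestamp, score) of the frame most similar to the text query."""
    scores = [cosine(text_emb, f) for f in frame_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best / fps, scores[best]

frames = [[1, 0, 0], [0.2, 0.9, 0], [0, 0, 1]]  # one toy embedding per second
query = [0.1, 1.0, 0.1]                          # "the person wearing a blue hat"
t, score = best_timestamp(query, frames)
print(t)  # 1.0 — the matching second
```

In the app, SAM would then take over at the returned timestamp to segment and track the matched object.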

Object detection, tracking, vision-language models, real-time interaction

Interactive Real-Time Semantic Querying of Driving Scenes

What if you could search inside a video as it plays? This project transforms a driving scene into an interactive interface where users can type natural language queries such as "car", "truck", or "pedestrian" and instantly see matching objects highlighted and tracked in real time.

The demo emphasizes fluid interaction: queries can be added or removed on the fly without interrupting the video, multiple concepts can be explored simultaneously, and each query is visualized with distinct colors for clarity. This allows users to dynamically “interrogate” the scene and observe how the system adapts immediately to new inputs.

Rather than focusing purely on detection, the project showcases a new way of interacting with visual data, turning passive video into an active, query-driven exploration tool.

Emotion recognition, deep learning, CNN-RNN

Facial Expression Recognition with Hybrid Models

Can a model reliably read human emotions from a single image? This project explores facial expression recognition by classifying faces into categories such as happiness, sadness, anger, and surprise. We build a small interactive demo that visualizes predicted emotions on input images.

To achieve this, we combine CNNs for spatial feature extraction with RNNs to capture temporal patterns across frames, comparing pretrained models (MobileNetV2, InceptionV3) with a custom CNN-RNN trained from scratch on the FER2013 and CK+ datasets.

The project focuses on evaluating performance, robustness, and how transfer learning influences emotion recognition across different data conditions.
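The hybrid architecture can be sketched with stand-ins: per-frame CNN features (plain lists here) are summarized over time by a minimal recurrent update, and the final state is scored against emotion classes. The weights and features below are illustrative, not learned.

```python
def rnn_summarize(frame_features, decay=0.5):
    """Exponential-moving-average state: a toy stand-in for a learned RNN cell."""
    state = [0.0] * len(frame_features[0])
    for feat in frame_features:
        state = [decay * s + (1 - decay) * f for s, f in zip(state, feat)]
    return state

def classify(state, class_weights):
    """Score the summary state against each emotion and return the best."""
    scores = {name: sum(w * s for w, s in zip(weights, state))
              for name, weights in class_weights.items()}
    return max(scores, key=scores.get)

frames = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.95]]   # mock CNN features per frame
weights = {"happiness": [0.0, 1.0], "anger": [1.0, 0.0]}
print(classify(rnn_summarize(frames), weights))  # happiness
```

In the actual project the recurrent cell and classifier would be trained end to end; the sketch only shows how the spatial and temporal stages fit together.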

Segmentation, Object detection, tracking, foundation models

Smart Event Detection for Highlight Clips

Have you ever missed a highlight during a match? This system can capture highlights based on a user prompt or directly from a video. It uses advanced, state-of-the-art approaches, such as Meta's SAM 3, to track objects, detect events, and generate short highlight clips.

The goal is to combine modern segmentation models (such as SAM) with classical computer vision techniques. Segmentation serves as a strong perception layer, while event detection is driven by motion-based features such as trajectories, velocity, and frequency analysis, along with lightweight reasoning. The system follows a modular design, consisting of a general perception and feature extraction pipeline combined with task-specific event detection modules.

The system is primarily designed for human action detection (e.g., waving, raising a hand, standing up). As an extension, it can also handle simple sports scenarios, such as tracking a ball moving toward or crossing a goal, demonstrating its ability to generalize to multi-object interactions.
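The motion-based event layer described above can be sketched for the goal example: given a tracked ball trajectory (x positions per frame, which segmentation and tracking would provide), a "goal" event fires when the track crosses a line with sufficient velocity. All numbers are illustrative.

```python
def detect_goal(xs, goal_line, min_velocity, fps=30.0):
    """Return the frame index where the ball crosses goal_line fast enough, else None."""
    for i in range(1, len(xs)):
        velocity = (xs[i] - xs[i - 1]) * fps        # pixels per second
        if xs[i - 1] < goal_line <= xs[i] and velocity >= min_velocity:
            return i
    return None

track = [10, 40, 80, 130, 190]                      # ball moving right, in pixels
print(detect_goal(track, goal_line=100, min_velocity=500))  # 3
```

Human-action events (waving, standing up) would follow the same pattern: a task-specific rule or lightweight classifier over trajectory and frequency features extracted by the shared perception pipeline.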

CLIP, FAISS, patch-level matching, image retrieval

Image Retrieval with CLIP

Got a clue? ImageDetective can find the picture. Describe what you’re looking for, upload an image, or search by visual details. ImageDetective connects text and images through foundation models, combining global semantic search with patch-level matching for smarter and more explainable retrieval.

We propose a vision-language retrieval system that supports bidirectional search between images and natural language. With CLIP, we map both modalities into a shared embedding space, so an image can be retrieved via FAISS indexing. To improve accuracy, a further step is to use SAM 3 for text-guided semantic segmentation, enabling patch-level matching.
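The global retrieval step can be sketched without the heavy dependencies: with CLIP-style embeddings L2-normalized, an inner-product index (what FAISS's flat IP index computes) reduces to ranking by dot product. Pure Python stands in for FAISS below, and the vectors are toys in place of real CLIP features.

```python
import math

def normalize(v):
    """L2-normalize so that inner product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def search(index, query, k=2):
    """Return ids of the k nearest images by inner product over normalized vectors."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, normalize(v))), img_id)
              for img_id, v in index]
    return [img_id for _, img_id in sorted(scored, reverse=True)[:k]]

index = [("cat.jpg", [1, 0, 0]), ("dog.jpg", [0, 1, 0]), ("car.jpg", [0, 0, 1])]
print(search(index, [0.9, 0.4, 0.0], k=2))  # ['cat.jpg', 'dog.jpg']
```

The planned patch-level stage would rerank these global hits by matching SAM 3 segments against the query, rather than replacing the index.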