Computer Vision

A Spring 2026 course connecting the historical foundations of vision with modern deep learning, transformers, and hands-on project work.

Abstract

Machine learning has profoundly changed computer vision, but the field's current methods build on a long history of image formation, geometry, perception, recognition, and representation learning. This course takes a holistic view of the task of vision.

Lectures and tutorials are accompanied by bi-weekly quizzes and project work. Assessment combines bi-weekly quizzes, the project, and the final exam.

Foundations of vision, deep learning, transformers
  • Instructor: Francis Engelmann
  • Assistant: Nicolai Hermann
  • Format: Lectures, tutorials, bi-weekly quizzes, project work, and final exam.
  • Bibliography: Foundations of Computer Vision, Antonio Torralba, Phillip Isola, William T. Freeman, MIT Press, 2024.
  • Programs: MSc Artificial Intelligence, MSc Computational Science, MSc Informatics, and Faculty of Informatics PhD students.

Student Project Ideas

Project proposals live here so classmates can quickly scan possible directions, compare ideas, and submit additions by pull request.

Scene understanding, geometry, foundation models

Semantic Change Maps from Everyday Walks

Can a short phone video reveal how a campus route changes over time? This project combines monocular depth, semantic segmentation, and feature matching to align walks recorded on different days, then highlights moved objects, blocked paths, or new scene elements. The result would be a small visual demo that connects 3D reasoning, human attention, and practical scene understanding for urban navigation.

Self-supervised video representations, foundation models, anomaly detection

Can V-JEPA read ECG?

Can V-JEPA learn meaningful representations for ECG-based arrhythmia detection? In this project, we investigate whether a V-JEPA encoder, originally designed for video understanding, can capture clinically relevant patterns from ECG signals. We transform ECG recordings into video-like inputs and use the pretrained V-JEPA encoder to generate latent representations. These representations are then evaluated by training a range of downstream predictors for arrhythmia classification. By comparing performance across predictors, we assess the quality and transferability of V-JEPA features for cardiac signal analysis.

Visual search, marketplace retrieval, segmentation, Android demo

Visual Search for Marketplace

This project explores visual product search for marketplace applications, where users can search for visually similar items using a single photo. I build a retrieval pipeline based on vision-language embeddings, object segmentation, category-aware routing, and color-aware reranking to improve search quality for fashion and marketplace-style product images. The final system is demonstrated in an Android app that performs on-device image-based retrieval over a local product catalog.
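
As a rough illustration of color-aware reranking, the sketch below blends an embedding similarity with a color-histogram intersection. The function names, the 4-bin RGB histogram, the toy similarity values, and the `color_weight` of 0.3 are all assumptions made for this example, not details of the project's pipeline.

```python
import numpy as np

def color_histogram(image_rgb, bins=4):
    """Normalised joint RGB histogram of an (H, W, 3) uint8 image."""
    pixels = image_rgb.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Overlap of two normalised histograms: 1.0 if identical, 0.0 if disjoint."""
    return float(np.minimum(h1, h2).sum())

def rerank(candidates, query_hist, color_weight=0.3):
    """Re-sort candidates by a weighted mix of embedding and color similarity.

    candidates: list of (item_id, embedding_similarity, color_hist) tuples.
    color_weight is a hypothetical tuning parameter.
    """
    scored = [
        (item_id,
         (1 - color_weight) * sim
         + color_weight * histogram_intersection(query_hist, hist))
        for item_id, sim, hist in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: solid-color images stand in for product photos.
red = np.zeros((8, 8, 3), dtype=np.uint8)
red[..., 0] = 255
blue = np.zeros((8, 8, 3), dtype=np.uint8)
blue[..., 2] = 255

# The embedding similarities below are made-up numbers for illustration.
candidates = [
    ("blue_shirt", 0.82, color_histogram(blue)),
    ("red_shirt", 0.80, color_histogram(red)),
]
ranked = rerank(candidates, color_histogram(red))
```

With a red query image, the red item moves ahead of the blue one even though its embedding similarity is slightly lower.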

GitHub / Code

Real-time segmentation, AR visualization

Aegis Rider - AR Assistant for Motorcyclists

Aegis Rider is a motorcycle AR demo that simulates a smart-helmet riding experience. It takes a first-person riding video, detects surrounding road users, estimates potential collision risks, and overlays a HUD directly onto the video. The system also includes radar-style awareness visualization and basic navigation features, providing riders with real-time environmental and directional information in an intuitive AR interface.

Video search, segmentation, foundation models

SemanticSpot: "Ctrl+F" for Videos

Ever wanted to find a specific object or action in a long video without scrubbing through it manually? SemanticSpot is an interactive web app that lets you search videos using natural language. You can simply type a query like "the person wearing a blue hat" or "the red coffee mug". Behind the scenes, the app uses CLIP to instantly locate the exact timestamp where your search appears, and SAM (Segment Anything) to dynamically track and highlight the object on the screen. It’s a smart, zero-shot visual search engine combined with automatic segmentation!
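
The timestamp lookup described above can be sketched as a nearest-neighbour search over frame embeddings. This is a minimal sketch under stated assumptions: the toy vectors stand in for CLIP image and text embeddings, and `best_timestamp` and `sampled_fps` are hypothetical names chosen for the example.

```python
import numpy as np

def best_timestamp(text_emb, frame_embs, sampled_fps):
    """Return (seconds, score) for the sampled frame closest to the query.

    text_emb:    (d,) query embedding
    frame_embs:  (n_frames, d) embeddings of frames sampled at sampled_fps
    """
    # Normalise so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = f @ t
    idx = int(np.argmax(scores))
    return idx / sampled_fps, float(scores[idx])

# Toy stand-ins: four frames sampled at 2 frames per second.
frame_embs = np.eye(4)
text_emb = np.array([0.1, 0.9, 0.0, 0.1])  # closest to frame 1
seconds, score = best_timestamp(text_emb, frame_embs, sampled_fps=2.0)
```

In a full app the returned timestamp would then seed the segmentation step on the matching frame.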

Object segmentation, tracking, vision-language models, open-vocabulary queries

Semantic Querying of Driving Scenes

This project transforms a driving scene taken from a dashcam into an interface where users can type natural language queries such as "a classic new york taxi", "a work truck" and "a pedestrian" and see matching objects highlighted and tracked in real time.

The demo emphasizes open-vocabulary capabilities: queries can be written in natural language, multiple concepts can be explored simultaneously, and each query is visualized with a distinct color for clarity.

Beyond plain detection, the project also flags pedestrians who are actively crossing the road and highlights them in red. In addition, a small detector runs in parallel to find traffic signs, also in response to natural language queries given by the user, such as "no parking sign" or "one-way sign".

Emotion recognition, deep learning, CNN-RNN

Facial Expression Recognition with Hybrid Models

Can a model reliably read human emotions from a single image? This project explores facial expression recognition by classifying faces into categories such as happiness, sadness, anger, and surprise. We build a small interactive demo that visualizes predicted emotions on input images.

To achieve this, we combine CNNs for spatial feature extraction with RNNs to capture temporal patterns, comparing pretrained models (MobileNetV2, InceptionV3) with a custom CNN-RNN trained from scratch on the FER2013 and CK+ datasets.

The project focuses on evaluating performance, robustness, and how transfer learning influences emotion recognition across different data conditions.

Prompted segmentation, motion tracking, event detection, video highlights

Behaviour Lens

What if a video highlight came with the evidence behind it? Behaviour Lens takes a video and a text prompt, uses SAM3 to segment the requested object in sampled frames, and turns those detections into trajectories, velocities, timestamps, masks, overlays, and auditable CSV outputs.

The goal is to combine modern segmentation models (such as SAM) with classical computer vision techniques. Segmentation serves as a strong perception layer, while event detection is driven by motion-based features such as trajectories, velocity, and frequency analysis, along with lightweight reasoning. The system follows a modular design, consisting of a general perception and feature extraction pipeline combined with task-specific event detection modules.
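
The motion-based features named above (trajectories, velocity, frequency analysis) can be sketched with plain NumPy. This is an illustrative sketch, not the project's code: the function names are hypothetical, and the toy signal assumes a hand oscillating at 2 Hz sampled at 10 frames per second.

```python
import numpy as np

def mask_centroid(mask):
    """Centroid (x, y) of a binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def velocities(points, timestamps):
    """Finite-difference velocity between consecutive trajectory points."""
    points = np.asarray(points, dtype=float)
    dt = np.diff(np.asarray(timestamps, dtype=float))
    return np.diff(points, axis=0) / dt[:, None]

def dominant_frequency(signal, sample_rate):
    """Frequency in Hz of the strongest non-DC component of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    # Skip bin 0 (the DC component) when looking for the peak.
    return float(freqs[np.argmax(spectrum[1:]) + 1])

# A 4x4 mask with a 2x2 blob in rows 1-2, columns 2-3.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 2:4] = True
centroid = mask_centroid(mask)

# Two trajectory points half a second apart.
vel = velocities([(0.0, 0.0), (3.0, 4.0)], [0.0, 0.5])

# Horizontal position of a waving hand: 2 Hz, 10 fps, 2 seconds.
t = np.arange(20) / 10.0
x = 50.0 + 10.0 * np.sin(2 * np.pi * 2.0 * t)
wave_hz = dominant_frequency(x, sample_rate=10.0)
```

An event module could then, for example, label a track as "waving" when the dominant frequency of its horizontal position falls in a chosen band; that thresholding rule is an assumption and is left out here.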

The system is primarily designed for human action detection (e.g., waving, raising a hand, standing up). As an extension, it can also handle simple sports scenarios, such as tracking a ball moving toward or crossing a goal, demonstrating its ability to generalize to multi-object interactions.

Object detection, segmentation, tracking, vision-language models

Open-Vocabulary Object Tracking with Grounding DINO, SAM 2 and CLIP

We present an open-vocabulary object tracking system that enables users to search, segment, and track arbitrary objects in images and videos using natural language queries.

Our pipeline combines Grounding DINO for text-conditioned object detection, CLIP for semantic verification, and SAM 2 for segmentation and temporal tracking.

The system supports interactive querying through a Gradio web interface and demonstrates how modern vision foundation models can be integrated into a unified visual understanding pipeline.

Monocular depth estimation, hand tracking, augmented reality, human-computer interaction

Air Instrument: Depth-Aware Virtual Music Placement

Air Instrument explores how a normal webcam can turn a room into an interactive musical stage. The system first estimates scene depth using Depth Anything V2, detects candidate floor or surface regions, and lets users place virtual instruments into available 3D space through hand gestures. Once instruments are placed, a playing mode uses MediaPipe hand tracking to control expressive parameters such as pitch and volume without touching any physical device.

The project combines monocular depth estimation, spatial reasoning, gesture recognition, and augmented reality rendering into a live demo. Our goal is to study how depth-aware scene understanding can support natural interaction: where can an object be placed, how large should it appear, and how can the user control it through movement?
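
The "how large should it appear" question has a simple answer under a pinhole camera model, sketched below. The sketch assumes metric depth and a known focal length in pixels, neither of which is stated in the project description, and `apparent_size_px` is a hypothetical helper.

```python
def apparent_size_px(real_size_m, depth_m, focal_length_px):
    """On-screen size in pixels of an object under a pinhole camera model.

    size_px = focal_length_px * real_size_m / depth_m
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * real_size_m / depth_m

# A 0.5 m wide virtual drum seen through a camera with an 800 px focal length.
near = apparent_size_px(0.5, depth_m=1.0, focal_length_px=800.0)
far = apparent_size_px(0.5, depth_m=2.0, focal_length_px=800.0)
```

Doubling the distance halves the rendered size, which is why a depth estimate is needed before a virtual instrument can be drawn at a plausible scale.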

CLIP, FAISS, patch-level matching, image retrieval

Image retrieval with CLIP

Got a clue? ImageDetective can find the picture. Describe what you’re looking for, upload an image, or search by visual details. ImageDetective connects text and images through foundation models, combining global semantic search with patch-level matching for smarter and more explainable retrieval.

We propose a vision-language retrieval system that supports bidirectional search between images and natural language. CLIP maps both modalities into a shared embedding space, and FAISS indexing retrieves the nearest images. If accuracy needs to improve, a further step is to use SAM3 for text-guided semantic segmentation and to enable patch-level matching.
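
One possible reading of patch-level matching is to score each image by its best-matching patch, which also tells the user where the match was found. The sketch below uses toy 2-D vectors in place of CLIP embeddings; the max-over-patches rule and all function names are assumptions for illustration, not the project's method.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def patch_level_scores(query_emb, patch_embs):
    """Score each image by its best-matching patch.

    query_emb:  (d,) query embedding
    patch_embs: (n_images, n_patches, d) patch embeddings
    Returns (scores, best_patch), each of shape (n_images,).
    """
    q = l2_normalize(query_emb)
    p = l2_normalize(patch_embs)
    sims = p @ q  # (n_images, n_patches) cosine similarities
    return sims.max(axis=1), sims.argmax(axis=1)

# Toy data: two images with two patches each.
patch_embs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # image 0
    [[-1.0, 0.0], [0.6, 0.8]],  # image 1
])
query_emb = np.array([0.6, 0.8])

scores, best_patch = patch_level_scores(query_emb, patch_embs)
ranking = np.argsort(-scores)  # best image first
```

In a two-stage setup, a global index search would first shortlist candidates and this patch-level score would rerank only the shortlist.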

Structure from Motion, 3D reconstruction, mobile robots

One video, a 3D reconstruction of civil infrastructure

Structure from Motion (SfM) is one of the most widely used techniques for reconstructing objects and scenes from images or video. This project aims to test the hand-crafted Structure-from-Motion algorithms available in COLMAP for reconstructing poorly illuminated 3D environments, using frames captured with limited, linear camera movement. The objective is to evaluate the limitations and accuracy of state-of-the-art methods in challenging, hard-to-reconstruct environments.

Gesture detection, tracking

Real-time hand gesture detection: from rock paper scissors to sign interpretation

This project focuses on building a computer vision system capable of recognizing a variety of hand signs and gestures captured by a static camera. The core goal is to develop a robust gesture recognition pipeline that can distinguish between different hand configurations in real time. As a proof-of-concept, the system will be integrated into the game of rock-paper-scissors, where it detects each player's gesture and determines the outcome of each round. The final result will be an interactive demo showcasing accurate and responsive hand gesture recognition in a playful, real-world scenario.
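
The game logic that sits on top of the recognizer can be sketched in a few lines. The majority vote over per-frame predictions is an assumption added here to suppress flicker; the project description does not say how predictions are stabilised, and the function names are hypothetical.

```python
from collections import Counter

# Each key beats the gesture it maps to.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

def stable_gesture(frame_predictions):
    """Majority vote over a short window of per-frame predictions."""
    if not frame_predictions:
        raise ValueError("need at least one prediction")
    return Counter(frame_predictions).most_common(1)[0][0]

def round_outcome(gesture_a, gesture_b):
    """Return 'player_a', 'player_b', or 'draw' for one round."""
    for gesture in (gesture_a, gesture_b):
        if gesture not in BEATS:
            raise ValueError(f"unknown gesture: {gesture!r}")
    if gesture_a == gesture_b:
        return "draw"
    return "player_a" if BEATS[gesture_a] == gesture_b else "player_b"

# Toy per-frame classifier outputs for the two players.
a = stable_gesture(["rock", "rock", "paper", "rock"])
b = stable_gesture(["scissors", "scissors", "scissors"])
result = round_outcome(a, b)
```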

Action recognition, video understanding, self-supervised embeddings

From Raw Footage to Recipe: Extracting Cooking Steps from Egocentric Video

This project builds a system that watches egocentric cooking videos and automatically extracts the sequence of cooking actions performed, with the goal of reconstructing a recipe from raw footage alone. Because most frames in a cooking video are irrelevant, the pipeline first applies a relevance classifier to filter out background activity, then routes the remaining clips through an RNN-based action classifier that identifies steps such as cutting, peeling, and boiling. Video representations are produced by V-JEPA 2, which encodes each video as a sequence of 64-frame block embeddings without requiring labeled pretraining data. The result is an end-to-end pipeline that turns an unstructured kitchen video into a structured, step-by-step recipe.
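
The filter-then-classify flow described above can be sketched with stand-in models. Everything here is illustrative: the blocks are assumed to be non-overlapping with any trailing partial block dropped, and `is_relevant` and `classify_action` are hypothetical callables in place of the relevance classifier and the RNN-based action classifier.

```python
def split_into_blocks(num_frames, block_size=64):
    """Return (start, end) frame ranges for consecutive full blocks."""
    return [
        (start, start + block_size)
        for start in range(0, num_frames - block_size + 1, block_size)
    ]

def extract_actions(blocks, is_relevant, classify_action):
    """Keep relevant blocks and label each with an action."""
    return [
        (block, classify_action(block))
        for block in blocks
        if is_relevant(block)
    ]

# Toy stand-ins: a lookup table plays the role of both classifiers.
toy_labels = {(0, 64): None, (64, 128): "cutting", (128, 192): "boiling"}

blocks = split_into_blocks(200)  # the last 8 frames do not fill a block
actions = extract_actions(
    blocks,
    is_relevant=lambda block: toy_labels[block] is not None,
    classify_action=lambda block: toy_labels[block],
)
```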

GitHub / Code

Computer vision, OCR, segmentation, vision-language models, object tracking

Real-Time Whiteboard Transcription with Temporal Ledger

When a professor is at the board, you have two choices: pay attention or copy. You can't really do both at the same time.

We wanted to eliminate that trade-off. Our system transcribes in real time what the professor writes, so the student is free to just listen and understand.

The pipeline captures the full evolution of whiteboard content across a lecture, every correction and erasure included, and synthesises it into structured Markdown output.

Video understanding, vision-language models, action captioning

Action/Event-Focused Captioning: A Three-Model Comparison

This project explores how pretrained image-captioning models can be adapted to produce short action-focused captions for video activity timelines. Instead of generating long descriptive captions, we fine-tune BLIP, ViT-GPT2, and Microsoft GIT on COCO action captions so that the models output compact labels such as “person walking” or “coffee being poured.”

For video inference, frames are sampled over time, captioned by the fine-tuned models, and de-duplicated into a simple activity timeline. The project compares original and fine-tuned models using BLEU-1, BLEU-2, METEOR, and ROUGE-L, and analyzes whether architecture choice still matters after all models are adapted to the same action-caption task.
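
The de-duplication step can be sketched as merging runs of identical captions into time segments. This is one plausible interpretation, assuming exact matching after lowercasing; `build_timeline` is a hypothetical name and the captions are made-up examples.

```python
def build_timeline(frame_captions):
    """Merge consecutive identical captions into (start, end, caption) segments.

    frame_captions: list of (timestamp_seconds, caption), sorted by time.
    """
    timeline = []
    for timestamp, caption in frame_captions:
        label = caption.strip().lower()
        if timeline and timeline[-1][2] == label:
            # Same activity as the previous frame: extend the segment.
            start, _, _ = timeline[-1]
            timeline[-1] = (start, timestamp, label)
        else:
            timeline.append((timestamp, timestamp, label))
    return timeline

frame_captions = [
    (0.0, "person walking"),
    (1.0, "Person walking"),
    (2.0, "coffee being poured"),
    (3.0, "person walking"),
]
timeline = build_timeline(frame_captions)
```

Only adjacent repeats are merged, so an activity that resumes later appears as a separate segment.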