Lecture of the Future

Description

Stanford University has a long tradition of transmitting many of its lectures over a specialized network, the Stanford Instructional Television Network, to which many nearby Silicon Valley companies subscribe. A typical transmitted lecture is a single video stream with audio, switched between the lecturer, the blackboard, and slides or other material. This style of presentation has several problems:

- The viewer can only see what the camera operator chooses to transmit; it is impossible to examine any material in more detail.
- The resolution of the blackboard image is often inadequate, rendering some text unreadable. A higher-resolution blackboard image would be of great help, and since its contents change infrequently, even high-resolution images would require little bandwidth.
- For long periods of the lecture, the program shows only the talking lecturer. Much bandwidth is wasted, since the background usually does not change.

In our project we address these problems from several directions, applying the enhanced capabilities of DTV together with image-based techniques. First, we have chosen to process and transmit the images of the lecturer, the room, the blackboard, and any additional material separately. Each of these sources has very distinct video characteristics that we plan to exploit. In the following we give a short overview of the whole project, concentrating on the extraction of a high-resolution blackboard image.

At the head end we need to capture multiple video streams of the room, the lecturer, and the blackboard. We then need to segment and broadcast the different objects (much in the spirit of MPEG-4's video objects). On the receiver side we need to recompose the different video streams into a single presentation, but we can now give the user the ability to customize it according to his preferences. This allows him, for instance, to concentrate longer on a blackboard image or to review some earlier slides.
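As a hypothetical illustration of this receiver-side recomposition, the sketch below uses a flat 2D picture-in-picture composition as a stand-in for the actual 3D presentation; the array layout and the preference handling are our own assumptions, not a defined format.

```python
import numpy as np

def compose_frame(background, lecturer, lecturer_mask, overlay):
    """Recompose separately transmitted objects into one presentation:
    paste the segmented lecturer over the (rarely updated) background,
    then show the board or a user-pinned slide picture-in-picture.
    background, lecturer, overlay -- HxWx3 uint8 images
    lecturer_mask                 -- HxW boolean lecturer segmentation
    """
    frame = background.copy()
    frame[lecturer_mask] = lecturer[lecturer_mask]  # lecturer "billboard"
    h, w = overlay.shape[:2]
    frame[:h, -w:] = overlay  # user-selected board image or earlier slide
    return frame
```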

We start by creating a geometric model of the lecture hall, which is augmented with projective textures extracted occasionally from a video stream. As a result we can save considerable bandwidth by transmitting this model and its textures only infrequently, instead of sending an image of the background with every video frame. This approach also allows a viewer to move freely within the room and view the classroom from whatever location he prefers, not just from the angle chosen by the camera operator.
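For reference, projective texturing amounts to the standard camera projection; the sketch below, assuming a known, calibrated 3x4 camera matrix P, computes the texture coordinate of a model vertex. It illustrates the general technique, not the project's actual code.

```python
import numpy as np

def projective_tex_coord(P, X_world, width, height):
    """Project a model vertex into the camera image that supplied the
    texture and return normalized texture coordinates in [0, 1].
    P       -- 3x4 camera projection matrix (assumed calibrated)
    X_world -- 3-vector vertex position in world coordinates
    """
    x = P @ np.append(X_world, 1.0)  # homogeneous projection
    u, v = x[0] / x[2], x[1] / x[2]  # perspective divide
    return u / width, v / height     # image pixels -> texture space
```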

In order to display the lecturer within this model, we need to segment him from the background. We currently use a simple segmentation algorithm that uses the known colors of the background to distinguish it from the lecturer. Using the known camera position and the geometry of the room, we roughly estimate the position of the lecturer in front of the blackboard and place his video image as a 3D billboard into the scene. Although this is a simple technique, it already provides a surprisingly realistic view of the lecture while using only a fraction of the bandwidth that a full video transmission would require (see Figure 1).
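A minimal sketch of this kind of background-color segmentation, assuming a per-pixel background color model sampled from lecturer-free frames (the threshold value is illustrative, not the project's actual choice):

```python
import numpy as np

def segment_lecturer(frame, bg_mean, bg_std, k=3.0):
    """Label as lecturer every pixel that deviates strongly from the
    known background color distribution.
    frame   -- HxWx3 float array, current video frame
    bg_mean -- HxWx3 per-pixel background color means
    bg_std  -- HxWx3 per-pixel background color standard deviations
    k       -- threshold in standard deviations (illustrative value)
    """
    deviation = np.abs(frame - bg_mean)
    # foreground if any color channel deviates strongly from the model
    return (deviation > k * bg_std).any(axis=2)
```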


Figure 1: In the background, the textured model of the lecture room; multiple textures are used after the speaker has been segmented out. The textured model is then used to better segment the speaker, who is transmitted and displayed as a billboard. To enhance the realism, the foreground is augmented with chairs.

In order to display the blackboard with high enough resolution to be readable, it is necessary to use several cameras to build a running image of the blackboard that is updated in real time. Since a single camera cannot capture the entire board with sufficient resolution, we use cameras that pan and zoom to areas of interest and integrate their data into the running high-resolution image of the board. A single fixed camera obtains a low-resolution reference image of the entire board to aid in integrating the image streams from the higher-resolution cameras. It also ensures that something can be said about every part of the board, even if an area has not yet been scanned by one of the high-resolution cameras.

The lecturer is segmented out of the low-resolution camera's input to obtain the running reference image of the board without the lecturer obscuring the view. The segmentation problem here is simpler than the previous one: since we are eliminating the lecturer rather than extracting him, a conservative algorithm can be used that may also remove a small border around him. The remainder of the frame is then copied over the running low-resolution image. To detect the lecturer, we threshold the intensity difference between the running image and the next video frame, relying on the fact that the blackboard stays nearly the same. This simple technique can fail and is therefore augmented with a more robust but slower algorithm that analyzes the color distribution of those areas that have not been updated for a while (because we might have wrongly identified a piece of blackboard as the lecturer).
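A minimal sketch of this update step on grayscale frames (the threshold is an illustrative value, and the slower color-distribution fallback is omitted):

```python
import numpy as np

def update_running_image(running, frame, thresh=25):
    """Copy every pixel that still looks like blackboard over the running
    reference image; pixels that changed strongly are assumed to be the
    lecturer and are left untouched.
    running -- HxW uint8 running board image (updated in place)
    frame   -- HxW uint8 next low-resolution video frame
    thresh  -- intensity-difference threshold (illustrative value)
    """
    diff = np.abs(frame.astype(np.int16) - running.astype(np.int16))
    lecturer = diff > thresh
    # a conservative variant would also dilate `lecturer` to remove a
    # small border around him, as described above
    board = ~lecturer
    running[board] = frame[board]
    return lecturer
```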

To decide where to point the pan-and-zoom cameras, we maintain a "curiosity" bitmap. A bit is marked when we see a large enough difference at the corresponding pixel of the low-resolution control image (which is updated in real time, regardless of the positions of the mobile cameras). A moving camera then sweeps out that area of the image, takes high-resolution images, and clears the appropriate areas in the curiosity bitmap.
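The following sketch shows the bitmap bookkeeping; the threshold and the simple bounding-box target selection are illustrative placeholders for whatever scheduling the cameras actually use.

```python
import numpy as np

def mark_curiosity(curiosity, control_prev, control_new, thresh=15):
    """Mark pixels where the low-resolution control image changed
    noticeably (e.g. new writing appeared on the board)."""
    diff = np.abs(control_new.astype(np.int16) - control_prev.astype(np.int16))
    curiosity |= diff > thresh
    return curiosity

def next_target(curiosity):
    """Pick the next region for a pan-and-zoom camera; here simply the
    bounding box of all marked pixels."""
    ys, xs = np.nonzero(curiosity)
    if xs.size == 0:
        return None  # nothing needs re-scanning
    return xs.min(), ys.min(), xs.max(), ys.max()

def clear_scanned(curiosity, x0, y0, x1, y1):
    """Clear the curiosity bits once the region has been captured."""
    curiosity[y0:y1 + 1, x0:x1 + 1] = False
```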

When integrating an image from one of the high-resolution cameras into the output stream, we compute the mapping between the camera's image space and board space by identifying markers on the board. After reprojecting the high-resolution image into the blackboard image, it is masked with the lecturer and copied into the output image. Having a reference stream covering the entire board is crucial for integrating high-resolution images taken from arbitrary positions.
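A sketch of this integration step using OpenCV's standard homography routines (marker detection is assumed to have happened elsewhere, and the validity test on warped pixels is a simplification):

```python
import cv2
import numpy as np

def integrate_highres(output, highres, marker_img_pts, marker_board_pts,
                      lecturer_mask):
    """Warp a high-resolution camera image into board space and copy it
    into the output image, skipping pixels occluded by the lecturer.
    marker_img_pts   -- Nx2 marker positions detected in the camera image
    marker_board_pts -- Nx2 known marker positions in board coordinates
    lecturer_mask    -- HxW boolean mask of the lecturer in board space
    """
    H, _ = cv2.findHomography(np.float32(marker_img_pts),
                              np.float32(marker_board_pts))
    h, w = output.shape[:2]
    warped = cv2.warpPerspective(highres, H, (w, h))
    # copy only pixels the warp actually covered and the lecturer does not
    valid = (warped.sum(axis=2) > 0) & ~lecturer_mask
    output[valid] = warped[valid]
    return output
```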

We currently use a modified MPEG encoder to compress the high-resolution blackboard image at a significantly lower frame rate. Ideally we would like to transmit only those areas that have changed, but MPEG already incurs fairly little overhead for coding unchanged regions.

Last Updated: Feb. 25, 1999, slusallek@graphics.stanford.edu