
Media Annotation in the Video Production Workflow

ABSTRACT

Content provider companies and television networks have large video archives but usually do not take full advantage of them. To assist the exploration of video archives, we developed ViewProcFlow, a tool that automatically extracts metadata from video. This metadata is obtained by applying several video processing methods: cut detection, face detection, image descriptors for object identification, and semantic concepts to catalog the video content. The goal is to supply the system with more information, giving a better understanding of the content and enabling better browsing and search functionalities. The tool integrates a set of algorithms and user interfaces that are introduced into the workflow of a video production company. This paper presents an overview of that workflow and describes how the automatic metadata framework is integrated into it.

Categories and Subject Descriptors

H.5.1 [Multimedia Information Systems]: Video; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Information filtering.

General Terms

Algorithms, Design, Human Factors.

Keywords

Video Production; Metadata; Video Segmentation; Object Identification

1.  INTRODUCTION

With global access to high-speed Internet, the available content, which was previously mainly textual, now has a very strong multimedia component. This makes television networks and other companies that produce content rethink the way they work, with the goal of providing more and better content in a fast and convenient way. One possible contribution to this goal is to reuse old material. By increasing the amount of reused material and reducing the time spent capturing footage that is already available, the workflow processes of those companies can be sped up (see Figure 1).

Figure 1 - Content Creation Workflow.

The overall process of obtaining media, from the initial production concepts until the archiving phase, can be time consuming. The capturing and editing stages, in particular, are the tasks with the largest impact on the workflow duration. A more efficient workflow allows better management of the available manpower and reduces overall costs. For this purpose, tools are needed to automate the different tasks that compose the workflow, in particular the most time-consuming ones. Currently, most of the information extracted from videos is obtained manually by adding annotations. This is a hard and tedious job, and it also introduces the problem of user subjectivity. Therefore, there is a need for tools that create relevant semantic metadata and thereby provide better ways to navigate and search the video archives. These processes should be automatic whenever possible and should require a minimum of human supervision to improve the extraction performance.

This paper describes our proposals for content annotation to be included in the workflow (Edition and Capture blocks of Figure 1) of a multimedia content provider company. Our proposal analyzes the audiovisual information to extract metadata such as scenes, faces, and concepts that give a better understanding of the content. This paper focuses on the visual part, but we also use the audio information: speech-to-text methods for subtitle generation, automatic selection of the best audio source from the footage, and sound environment detection. These audio features are combined with the visual information to better identify concepts.

The paper is structured as follows. The next section presents an overview of the VideoFlow project, where these questions are addressed. Section 3 introduces the tools for metadata extraction from image content. Section 4 presents the user interfaces to access video content enriched with semantic metadata. Finally, we present conclusions and directions for further development.

2.  VIDEOFLOW PROJECT

The work described in this paper is being developed in the scope of the VideoFlow project. This project aims at extending the Duvideo workflow with tools for the extraction and visualization of metadata. Duvideo, a multimedia content provider company, is a partner of the VideoFlow project. The project also includes the development of several search and browse interfaces that reuse the extracted metadata in order to improve the production processes. This metadata is extracted from the visual (scenes, faces, concepts) and audio (claps, instrument sounds, animal sounds, etc.) parts of a video.

2.1  VideoFlow Workflow

Figure 2 presents the workflow proposed to Duvideo, including our system (ViewProcFlow) for media annotation and visualization. It starts with the inclusion of videos into the archive, which is composed of two different types of videos: HD (high-quality source) and Proxy (low-quality source).

Figure 2 - VideoFlow Architecture.

When a video is on a tape format, a conversion to digital is needed so that it can be added to the archive. This step occurs for old video material, which still constitutes a large segment of the archive.

The new footage is already recorded in a digital format, and MXF (Material eXchange Format) [1] is the format currently used by the cameras. MXF is a standard format in production environments, alongside AAF (Advanced Authoring Format), and it wraps the essence (video, audio) with metadata. Both formats incorporate metadata to describe the essence, but AAF is more suited for postproduction and MXF for the final stages of production. The MXF format is structured in KLV (Key-Length-Value) segments (see Figure 3).

Figure 3 – MXF, File Organization.
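As a concrete illustration of this layout, the sketch below walks the top-level KLV segments of an MXF file in Python. It is not part of the project code, and it assumes the file starts directly at the first KLV key (no run-in), that keys are 16-byte SMPTE Universal Labels, and that lengths are BER-encoded; the file name is also an assumption.

```python
def read_klv(path):
    with open(path, "rb") as f:
        while True:
            key = f.read(16)                 # K: 16-byte Universal Label
            if len(key) < 16:
                break
            first = f.read(1)
            if not first:
                break
            if first[0] < 0x80:              # L: short-form BER length
                length = first[0]
            else:                            # L: long-form BER length
                n = first[0] & 0x7F
                length = int.from_bytes(f.read(n), "big")
            value = f.read(length)           # V: the wrapped essence/metadata
            yield key, value

for key, value in read_klv("clip.mxf"):     # "clip.mxf" is an assumed file name
    print(key.hex(), len(value))
```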

The description of the essence and its synchronization are controlled by three packages (see Figure 4).

Figure 4 - MXF, Connections Between Packages.

The Material Package represents the final timeline of the file after editing. The File Package incorporates the essence without any editing, while the Source Package includes all the EDLs (Edit Decision Lists) created.

Users (journalists, screenwriters, producers, and directors) work with two main applications that handle these MXF files: PDZ-1 [2] and ScheduALL [3]. The PDZ-1 application is used to create the content for a program by defining stories with the edited footage; it will be complemented or replaced to include the new search and browse features. The ScheduALL application manages the annotations used to describe the video content and will be kept in the proposed workflow. These programs work on the Proxy versions of the archive, generating data that is used to create the final product with the HD content in the Unity component.

As mentioned, ViewProcFlow will replace PDZ-1 and will synchronize content annotations with ScheduALL.

We took the same approach, using the Proxy videos for all the processing and creating metadata to be used with the HD videos. Our approach was to first concentrate on the various technologies for metadata extraction. We left the generation of the final format (MXF) for a later development stage, because all the technologies work at the pixel level, which can be accessed in the different video formats.

2.2  ViewProcFlow Architecture

The proposed system is split into a Server side and a Web Client side. The Server side performs the video processing and handles the requests from the Web Client side. The Client side is used to view the video archive and all the metadata associated with it (see Figure 5). It will also provide mechanisms to validate the extracted content. With the Web Client, users can access the system wherever they are, without being restricted to a local area or to specific software, as they only need an Internet connection and a browser.

The Server side was implemented in C++ with openFrameworks [4], a framework that provides easy API bindings to access video content and also creates an abstraction layer that makes the application independent of the operating system. The Client side was developed with Flex 3.0. This technology provides an easy way to prototype the interface for the final user without jeopardizing the expected functionality.

The next two sections explain the video processing tasks performed on the server (Section 3) and the client side, with the Web user interface used to search and browse media content (Section 4).

Figure 5 - Client and Server Side (ViewProcFlow).

3.  VIDEO PROCESSING

As mentioned in Section 2, a great portion of the video archive is still on tape, which means that all the clips from one tape, when converted to digital, are joined into one final clip. Segmentation is therefore essential to extract the individual scenes from the clip and obtain a better representation of it. All the metadata gathered is stored in XML (eXtensible Markup Language) files, because XML is a standard format for sharing data and also eases the integration with the Web Client.
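The snippet below sketches what such a per-clip XML file could look like, generated here with Python's ElementTree; the clip id, file names, and element/attribute names are hypothetical placeholders, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# All names below (clip id, file names, element/attribute names) are
# illustrative placeholders, not the project's real schema.
clip = ET.Element("clip", id="tape042_clip01", proxy="tape042_clip01.mp4")
scene = ET.SubElement(clip, "scene", start="00:00:00:00", end="00:00:12:10",
                      keyframe="keyframe_0042.jpg")
ET.SubElement(scene, "face", x="120", y="80", width="64", height="64")
ET.SubElement(scene, "concept", name="Indoor", probability="0.87")

ET.ElementTree(clip).write("tape042_clip01.xml",
                           encoding="utf-8", xml_declaration=True)
```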

3.1  Video Segmentation

To segment the video, we used a simple difference of histograms [5] to detect the scenes of the video. Once the scenes are detected, one frame is chosen to identify each of them. The middle frame of the shot is selected to represent the whole scene. This criterion was used to avoid noise (regarding semantic content) that occurs at the beginning and at the end of a scene (see Figure 6).

Figure 6 - Frames With Noise.

Normally this noise occurs when there are effects between scenes, e.g., fades or dissolves. The frames obtained in this way are the input to the following techniques.
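To make the procedure concrete, the following Python/OpenCV sketch detects cuts by comparing histograms of consecutive frames and then picks the middle frame of each shot as its keyframe. The proxy file name and the threshold value are assumptions, and the actual ViewProcFlow implementation is the C++/openFrameworks server described in Section 2.2.

```python
import cv2

cap = cv2.VideoCapture("proxy.mp4")        # assumed proxy file name
prev_hist = None
cuts = []
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
    cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1)
    if prev_hist is not None:
        # A large distance between consecutive histograms suggests a cut.
        dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
        if dist > 0.4:                     # threshold is an assumption, tune per archive
            cuts.append(frame_idx)
    prev_hist = hist
    frame_idx += 1
cap.release()

# Keyframe = middle frame of each detected shot, as described above.
bounds = [0] + cuts + [frame_idx]
keyframes = [(a + b) // 2 for a, b in zip(bounds[:-1], bounds[1:])]
print("cuts:", cuts)
print("keyframes:", keyframes)
```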

3.2  Face Detection

Faces are pervasive in video content and provide a preliminary form of indexing. We integrated the Viola and Jones [6] algorithm to detect faces that appear in images. It works with Integral Images, which allows the algorithm to compute rectangular (Haar-like) filter responses over image areas very quickly.

The Viola and Jones algorithm is based on a set of previously trained cascades of classifiers that are applied to image regions. The algorithm has some limitations: for instance, it does not detect partial faces or faces in profile view, and it also produces some false positives. To overcome these problems, the user is included in the process in order to eliminate them.
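A minimal sketch of this step, using the OpenCV implementation of the Viola and Jones detector, is shown below; the keyframe file name and the detector parameters are assumptions, not the project's actual configuration.

```python
import cv2

# Frontal-face cascade shipped with OpenCV (Viola-Jones style detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("keyframe_0042.jpg")     # assumed keyframe file name
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# scaleFactor/minNeighbors trade off recall against false positives,
# which the user later validates in the Web Client.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(40, 40))
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("keyframe_0042_faces.jpg", frame)
```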

3.3  Image Descriptors

For video access, some queries require the comparison of images or image regions. Our proposal uses the information extracted with the Scale-Invariant Feature Transform (SIFT) [7] and with the Speeded Up Robust Features (SURF) [8] to compare images.

These algorithms find keypoints in images that are invariant to scale and rotation and extract a descriptor that represents the area around each keypoint. This descriptor is used for matching between images. In Figure 7, the red dots identify the detected keypoints and the blue lines are drawn between those that match.

Figure 7 - Example of Matching Images With Extracted Keypoints.
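One possible way to reproduce this kind of matching with OpenCV's SIFT implementation is sketched below (requires OpenCV 4.4 or later); the image file names and the 0.75 ratio-test threshold are assumptions, and SURF, which needs the separate opencv-contrib build, is omitted here.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)             # assumed file names
img2 = cv2.imread("archive_keyframe.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only unambiguous matches (Lowe's ratio test).
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)
good = [p[0] for p in pairs
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

# Draw matched keypoints, similar in spirit to Figure 7.
vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.jpg", vis)
print(len(good), "matches")
```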

3.3.1  SIFT

The keypoints are detected using a difference-of-Gaussians filter applied to the image. The next step computes the gradient at each pixel in a region of 16x16 pixels around each keypoint. For each 4x4 pixel block of the region, a histogram is calculated considering 8 directions of the gradient. The descriptor is created by concatenating these histograms (a vector with 4x4x8 = 128 values).
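The toy sketch below only illustrates how such a 128-value descriptor is assembled from orientation histograms; the patch content is random, and the Gaussian weighting and normalization of the reference SIFT implementation are omitted.

```python
import numpy as np

patch = np.random.rand(16, 16)              # stand-in for a 16x16 image patch
gy, gx = np.gradient(patch)                 # per-pixel gradients
mag = np.hypot(gx, gy)
ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)

descriptor = []
for by in range(4):                         # 4x4 grid of 4x4-pixel blocks
    for bx in range(4):
        sl = (slice(4 * by, 4 * by + 4), slice(4 * bx, 4 * bx + 4))
        hist, _ = np.histogram(ang[sl], bins=8, range=(0, 2 * np.pi),
                               weights=mag[sl])   # 8 orientation bins
        descriptor.extend(hist)

descriptor = np.asarray(descriptor)
print(descriptor.shape)                     # (128,) = 4 * 4 * 8
```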

3.3.2  SURF

This method uses smaller regions than SIFT to create the feature descriptor. It also uses the Integral Images technique [6] to increase the performance of the algorithm. The region around the keypoint is divided into 4x4 sub-regions, to each of which Haar wavelets are applied. The responses of each sub-region are summed to create a descriptor composed of 64 values.

3.4  Semantic Concepts

With the work described in [9], it is possible to browse a personal library of photos based on different concepts (see Table 1). Photos were classified and could be accessed in several client applications.

Table 1 - Trained Concepts

Indoor / Snow
Beach / Nature
Face / Beach
Party / People

Our proposal is based on the Regularized Least Squares (RLS) [9] classifier, which performs binary classification on the database (e.g., Indoor versus Outdoor or People versus No People). It also uses a sigmoid function to convert the output of the classifier into a pseudo-probability. Each concept is trained using a training set composed of manually labeled images with and without the concept. After estimating the parameters of the classifier (that is, after training), the classifier is able to label new images.

Each image is represented by visual features, which are automatically extracted. We used the Marginal HSV Color Moments [9] and the features obtained by applying a bank of Gabor filters [9] as image representations. Using these classifiers, the tool was capable of executing interesting queries like "Beach with People" or "Indoor without People".
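The sketch below shows one possible formulation of such a binary RLS classifier with a sigmoid output, operating on pre-extracted feature vectors; the regularization weight, the sigmoid slope, and the toy data standing in for real image features are assumptions.

```python
import numpy as np

def train_rls(X, y, lam=1.0):
    # Closed-form RLS solution: w = (X^T X + lam * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def concept_probability(X, w, slope=1.0):
    # Sigmoid turns the raw RLS score into a pseudo-probability of the concept.
    return 1.0 / (1.0 + np.exp(-slope * (X @ w)))

# Toy usage: random vectors stand in for real image features (e.g., HSV
# moments or Gabor responses); labels are +1 (concept) / -1 (no concept).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
w = train_rls(X_train, y_train, lam=10.0)
print(concept_probability(X_train[:5], w))
```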

Duvideo usually uses a set of categories to access the archives. Table 2 presents a subset of the Thesaurus used by Duvideo together with a set of related concepts obtained from the list of concepts used in the ImageCLEF [10] "Visual Concept Detection and Annotation Task". Currently, we have trained a subset of the concepts presented in Table 1, and we plan to train most of the ones in Table 2.

For several thesaurus categories it is difficult to identify visual features because of the abstraction level of the subject, e.g., "Rights and Freedoms" or "Legal Form of Organizations". We intend to overcome these difficulties by using ontologies to connect such categories with several individual concepts.

Table 2 - Some Examples of Concepts Matched With Thesaurus Categories

Concepts / Thesaurus Category
Car, Bicycle, Vehicle, Train / 4816 – Land Transport
Airplane / 4826 – Air and Space Transport
Nature, Plants, Flowers / 5211 – Natural Environment
Trees / 5636 – Forestry
Partylife / 2821 – Social Affairs
Church / 2831 – Culture and Religion
Food / 60 – Agri-Foodstuffs
Fish / 5641 – Fisheries
Baby, Adult, Child, Teenager, Old Person / 2806 – Family, 2816 – Demography and Population
Mountains, River / 72 – Geography

We also plan to use human intervention for the abstract categories in a semi-automatic process, which requires appropriate user interfaces as described next.