Television Listings and XMLTV

Introduction

For several years I've wanted to assemble my own PC. Every time I decided to replace my computer, I would say that maybe this time I would get around to building my own. This resolution lasted about as long as it took me to go to Dell, price its latest model, and compare that to the amount of free time I had, which was very little.

The emergence of Linux-based packages for building personal video recorders (PVRs) like TiVo -- something I would probably never be able to justify buying just for its own sake -- offered me the chance I was waiting for. A mini PC with a TV capture card, a WiFi card, a monster hard drive (you can get up to a quarter terabyte nowadays), and a Linux package like MythTV can not only do almost everything a TiVo can do, but can also serve up MP3 files, act as a Windows file server with Samba, run a web server, and more.

One critical element of a DIY TiVo is TV listings. Without these, all the fancy hardware in the world won't do much good. But there's an open source, Perl- and XML-based solution by Edward Avis called XMLTV that many of the TV-on-your-PC packages like Freevo and MythTV support. With support for screen-scraping data for many countries' cable systems, XMLTV can take various sources and create a consistent stream of XML.

Here's a snippet to give you an idea of the kind of information you can get:

<tv>
  <programme channel="C54amc.zap2it.com"
             start="20031230002000 -0500" stop="20031230022000 -0500">
    <title>Mystic Pizza</title>
    <desc>Three teenage girls come of age one summer working in a
      pizza parlor in Mystic, Conn.</desc>
    <date>1988</date>
    <category>Comedy</category>
    <rating system="VCHIP">
      <value>14</value>
    </rating>
    <rating system="MPAA">
      <value>R</value>
    </rating>
    <star-rating>
      <value>2.5/4</value>
    </star-rating>
  </programme>
</tv>

If you have an iCal-compliant viewer (like Mozilla), you can even convert this to a calendar using Irving Probst's XSLT stylesheet.

Getting started

As a first step I grabbed the latest Windows version of XMLTV from the SourceForge project. (For OS X, RPM-based Linux systems, and Debian package-based systems you also get packages; see the home page for details.) This gives you a binary "xmltv.exe" at the top level of the directory where you unpack the ZIP file. Like any good tool with a UNIX heritage, XMLTV is meant to act as a filter chained together with other programs. Once you set it up (in my case to point to the North American listings), you can run the program and get a stream of XML suitable for your homegrown electronic program guide:

C:\writing\xmltv-0.5.24-win32>xmltv tv_grab_na --configure
Timezone is -0500
Welcome to XMLTV 0.5.24 (tv_grab_na V3.20031101) for Canada and US tv listings
Please report any problems, bugs or suggestions to:
email
For more information consult
checking XMLTV release information..
Warning: failed to get current release information from:
If this problem persists, look for a new XMLTV release.
starting manual configuration process..
how many times do you want to retry on www site failures ? (default=2)
how many seconds do you want to between retries ? (default=30)
what is your postal/zip code ? 11375
getting list of providers for postal/zip code 11375, be patient..
Choose a service provider:
0: DIRECTV New York - New York (128766)
1: DISH New York - New York (128719)
2: RCN Cable (Microwave) - New York - Digital Rebuild (70946)
3: RCN Cable (Microwave) - New York - Rebuild (70945)
4: RCN Cable (Microwave) - New York (70944)
5: Time Warner Cable - Brooklyn - Cable Ready (71328)
6: Time Warner Cable - Brooklyn - Digital (71329)
7: Time Warner Cable - Brooklyn (71327)
8: Time Warner Forest Hills - Forest Hills - Cable Ready (71440)
9: Time Warner Forest Hills - Forest Hills (71439)
10: C-Band - USA (87341)
11: DIRECTV - USA (62044)
12: DISH Network - USA (62046)
13: VOOM - USA (179304)
14: Local Broadcast Listings (137303)
Select one: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 (default=0)] 6
you chose 71329 # Time Warner Cable - Brooklyn - Digital
getting channel list, be patient..

After a few moments you get:

got channel list
add channel 1 NY1 ? [yes,no,all,none (default=yes)] A
.
.
.
add channel 1000 MOONDEM ? yes
add channel 1020 ADONDEM ? yes
add channel 1031 HBODEM ? yes
add channel 1032 CINDEM ? yes
add channel 1033 SHOWDEM ? yes
add channel 1034 TMCDEM ? yes
updating C:\/.xmltv/tv_grab_na.conf..
configuration step complete, let the games begin !

My first impression? Digital cable gives you way too many choices for good health.

Looking at the Format

Preamble

The top-level element, <tv>, contains no big surprises:

<tv date="20031230133339 -0500" generator-info-name="tv_grab_na V3.20031101"

generator-info-url="

source-info-name="Zap2It" source-info-url="

...

</tv>

In date, the timestamp (including the GMT offset for the time zone) tells you when the original source generated the listing data. The attributes source-info-url and source-info-name provide a glimpse into how the xmltv program works: for the U.S. it screen-scrapes HTML from a website providing channel listings by ZIP code. We'll be reading right past this information for our example program below.
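The same timestamp layout appears in the start and stop attributes of each programme, so it is worth being able to read. The following is a minimal sketch using java.time from the JDK; the class name XmltvTime and its two helper methods are hypothetical, and it assumes every timestamp carries all fourteen digits plus an offset, which the format does not guarantee.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class XmltvTime {
    // Matches timestamps such as "20031230133339 -0500".
    // Assumption: the offset is always present and the time is never truncated.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss Z");

    // Turn an XMLTV timestamp into a date-time that remembers its offset.
    static OffsetDateTime parse(String timestamp) {
        return OffsetDateTime.parse(timestamp.trim(), FORMAT);
    }

    // Running time of a programme, given its start and stop attributes.
    static Duration between(String start, String stop) {
        return Duration.between(parse(start), parse(stop));
    }

    public static void main(String[] args) {
        // prints 2003-12-30T13:33:39-05:00
        System.out.println(parse("20031230133339 -0500"));
        // prints 30
        System.out.println(
                between("20031230043000 -0500", "20031230050000 -0500").toMinutes());
    }
}
```

Keeping the offset rather than converting to local time right away makes it easier to compare listings from feeds generated in different time zones.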

This brings up an important question: what's the legal status of XMLTV? The Zap2IT license seems to be broad enough to allow for it.

While you may interact with or download a single copy of any portion of the Content for your own personal, non-commercial entertainment, information or use, you may not and may not authorize others to reproduce, sell, publish, distribute, modify, display, repost or otherwise use any portion of the Content in any other way or for any other purpose without the prior written consent of TMS. Requests regarding use of the Content for any purpose other than personal, non-commercial use should be directed to Feedback at Zap2it.com.

Other services in other countries have shut out XMLTV. And it's possible that they'd make a bigger issue of it if there were more Linux PVRs out there pulling down their data. Even if there were no legal concerns about XMLTV sourcing, there is also the technical risk: every time the HTML layout on Zap2IT changes, XMLTV will break. There seems to be a small market of people who might pay an annual fee for a reliable XML-formatted EPG (electronic programming guide), but one debate in the XMLTV forum on the DigiGuide pay service pointed out that North American TV listings are a duopoly, and Bill Gates paid $6 million for his listings for WebTV. It would be hard to make a profit off homegrown DIY users wanting commercial-grade TV listings, especially given the risk of providing the data in a format that is so easy to redistribute. The whole issue brings to mind the MP3 debate: do people use software like XMLTV because there's no good pay alternative, or because they wouldn't use it unless it were free?

No matter what happens with the listing sources, XMLTV itself is still useful to understand and handle, and it's a good example of XML's strengths in syndication and bridging diverse applications.

Channel information

Next up in the format we have multiple <channel> tags describing all the available channels in your area. XMLTV maps this information to the program listings by an ID which we'll see again later; the ID should follow RFC 2838: Uniform Resource Identifiers for Television Broadcasts but the DTD obviously can't enforce this. Channels can include an optional icon and an optional URL.

<channel id="C2wcbs.zap2it.com">
  <display-name>2 WCBS</display-name>
  <display-name>2</display-name>
  <icon src="
</channel>

XMLTV supports basic localization by a "lang" attribute, e.g. fr_FR. (In a perfect world the DTD would have used xml:lang instead.) It thus allows for multiple display names. Thankfully one variant offered for at least my feed is the channel number itself, which will be needed for PVR software.
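Picking the channel number out of several display names can be sketched as below. This is only an illustration: the class name ChannelNumber is hypothetical, and it assumes (as in my feed) that one display-name consists of digits only, which other grabbers may not provide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChannelNumber {
    // Returns the first display-name made up only of digits, or null if there is none.
    static String numberOf(Element channel) {
        NodeList names = channel.getElementsByTagName("display-name");
        for (int i = 0; i < names.getLength(); i++) {
            String text = names.item(i).getTextContent().trim();
            if (text.matches("\\d+")) {
                return text;
            }
        }
        return null;
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<channel id=\"C2wcbs.zap2it.com\">"
                + "<display-name>2 WCBS</display-name>"
                + "<display-name>2</display-name>"
                + "</channel>";
        System.out.println(numberOf(parseRoot(xml))); // prints 2
    }
}
```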

Program information

The mother lode of information in XMLTV is in the program listings: what programs play on what channel ID, starting and stopping at what times. Here's an example:

<programme channel="C2wcbs.zap2it.com" start="20031230043000 -0500"
           stop="20031230050000 -0500">
  <title>CBS Morning News</title>
  <desc>News reports on current events.</desc>
  <category>News</category>
  <audio>
    <stereo>stereo</stereo>
  </audio>
  <subtitles type="teletext"/>
</programme>

The DTD allows for a lot of optional information, including icon, URL, language, year, country, credits (director, actor, writer, etc.), star ratings, audio metadata, video aspect ratio, whether it has subtitles, etc. We're going to stick with title for the example; for a serious application you might need a commercial feed (should one ever become available) with more reliable and detailed information.

Episodes

Episodic programs get special treatment in the XMLTV format. Here's an example from the feed I pulled:

<programme channel="C2wcbs.zap2it.com" start="20031230030700 -0500"
           stop="20031230033700 -0500">
  <title>Becker</title>
  <sub-title>Small Wonder</sub-title>
  <desc>Reggie and the gang dispute Becker's crazy theory
    that little people are bad luck.</desc>
  <episode-num system="xmltv_ns"> . . 0/3</episode-num>
  <audio>
    <stereo>stereo</stereo>
  </audio>
  <subtitles type="teletext"/>
  <rating system="VCHIP">
    <value>PG</value>
  </rating>
</programme>

The "system" attribute in <episode-num> has two allowed values: "xmltv_ns", which is used here, and "onscreen". The latter provides the human-displayable version; the former has more structured data. It's supposed to be three numbers (with "." as a separator): the season number, the episode number within that season, and finally the part number. Slashes indicate out of how many, and numbers begin at zero; so "0/3" means the first of three. The DTD provides a good set of examples:

The first episode of the second series is '1.0.0/1'. If it were a two-part episode, then the first half would be '1.0.0/2' and the second half '1.0.1/2'. If you know that an episode is from the first season, but you don't know which episode it is or whether it is part of a multiparter, you could give the episode-num as '0..'. Here the second and third numbers have been omitted. If you know that this is the first part of a three-part episode, which is the last episode of the first series of thirteen, its number would be '0 . 12/13 . 0/3'. The series number is just '0' because you don't know how many series there are in total - perhaps the show is still being made!

Easy, right? But look at the actual data. As you can probably guess, this "Becker" episode is not a three-parter, and the first two fields are missing entirely. We're looking at dirty data: no season number, no episode number, and an unreliable last segment. You couldn't run a real electronic program guide off of XMLTV, which is probably good for the developer's legal exposure.
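To make the three-field layout concrete, here is a minimal parsing sketch. The class EpisodeNum and its nested Field type are hypothetical names of my own; the sketch treats any omitted number as null and does not guard against non-numeric junk, which a real feed may well contain.

```java
class EpisodeNum {
    // One dotted field: a zero-based index and an optional total; null means "not given".
    static final class Field {
        final Integer index;
        final Integer total;

        Field(Integer index, Integer total) {
            this.index = index;
            this.total = total;
        }
    }

    final Field season;
    final Field episode;
    final Field part;

    private EpisodeNum(Field season, Field episode, Field part) {
        this.season = season;
        this.episode = episode;
        this.part = part;
    }

    // Parses an xmltv_ns value such as "0 . 12/13 . 0/3" or " . . 0/3".
    static EpisodeNum parse(String text) {
        String[] raw = text.split("\\.", -1); // -1 keeps empty trailing fields
        Field[] fields = new Field[3];
        for (int i = 0; i < 3; i++) {
            fields[i] = parseField(i < raw.length ? raw[i].trim() : "");
        }
        return new EpisodeNum(fields[0], fields[1], fields[2]);
    }

    // Parses "12/13" into index 12, total 13; "0" into index 0 with no total.
    private static Field parseField(String s) {
        if (s.isEmpty()) {
            return new Field(null, null);
        }
        String[] halves = s.split("/");
        Integer index = Integer.valueOf(halves[0].trim());
        Integer total = halves.length > 1 ? Integer.valueOf(halves[1].trim()) : null;
        return new Field(index, total);
    }

    public static void main(String[] args) {
        EpisodeNum becker = parse(" . . 0/3");
        System.out.println(becker.season.index);                         // prints null
        System.out.println(becker.part.index + " of " + becker.part.total); // prints 0 of 3
    }
}
```

Run against the Becker value, the parser reports no season and no episode, which is exactly the gap described above; the parser can expose missing data but cannot repair it.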

Playing Around

grep is a good way to scan through XML for fragments of interest, but if you want to process XMLTV programmatically you'll want heftier tools. One of my favorite tools for processing XML with minimal programming effort is XPath. The Jaxen project provides a good implementation in Java, my language of choice, but the open source community has provided a wealth of options in your pick of languages. If your only goal is to produce HTML, you could also consider using XSLT.

XPath packs a lot of information into a very small space, so mixing it with your procedural and OO code can make for compact, expressive code. It's also very easy to store XPath fragments in XML, databases, and property files, so you can make your program more configurable. Here's the path to find all programs:

//programme

and then all programs on CBS, using the channel ID for our area:

//programme[@channel='C2wcbs.zap2it.com']

and all programs with a rating of PG or G:

//programme[rating/value='PG' or rating/value='G']
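Paths like these are easy to try out. The sketch below uses the javax.xml.xpath API bundled with the JDK purely so that it runs without extra libraries (the larger example later in this article uses Jaxen). The class name XPathDemo and the two-programme sample document are hypothetical; the sample is trimmed from the snippets shown earlier and leaves out the start and stop attributes a real feed carries.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // Trimmed, hypothetical sample: one PG programme on CBS, one R-rated film on AMC.
    static final String SAMPLE = "<tv>"
            + "<programme channel='C2wcbs.zap2it.com'><title>Becker</title>"
            + "<rating system='VCHIP'><value>PG</value></rating></programme>"
            + "<programme channel='C54amc.zap2it.com'><title>Mystic Pizza</title>"
            + "<rating system='VCHIP'><value>14</value></rating>"
            + "<rating system='MPAA'><value>R</value></rating></programme>"
            + "</tv>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Counts the nodes an XPath expression selects in the document.
    static int count(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(count(doc, "//programme"));                               // prints 2
        System.out.println(count(doc, "//programme[@channel='C2wcbs.zap2it.com']")); // prints 1
        // Predicate on programme: selects the programmes that carry a PG or G rating.
        System.out.println(count(doc, "//programme[rating/value='PG' or rating/value='G']")); // prints 1
    }
}
```

One detail worth noticing: a path that starts at //rating returns the <rating> elements themselves, while moving the test into a predicate on programme, as in the last line of main, returns the programmes that carry such a rating.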

Let's say you want to develop a "coming up" program schedule for a fan homepage for Becker. You might even be thinking of turning the fragment into a portlet to collect all those Becker fan pages. (I promise the code will be more realistic than the premise.) We can find all the Becker episode titles with a single line of XPath code:

//programme[@channel='C2wcbs.zap2it.com' and title='Becker']/sub-title/text()

Next we need source data. After configuring your feed source, you can run the following, perhaps from a nightly cron job, to get a full 14 days' worth of listings:

xmltv tv_grab_na --days 14 > feed.xml

Next we need to process it. The following sample Java code loads the file into a DOM Document and uses Jaxen to select the matching <programme> nodes and print the episode titles under them. (Note this example omits the error handling, reasonable argument processing, and modular design you'd expect from production code.)

import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.jaxen.Navigator;
import org.jaxen.XPath;
import org.jaxen.dom.DocumentNavigator;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class XMLTV {

    public static void main(String[] args) throws Exception {
        // set up Java XML processing
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = dbf.newDocumentBuilder();

        // parse the feed
        File srcFile = new File(args[0]);
        Document doc = docBuilder.parse(srcFile);

        // get Jaxen's DOM handler, held through the Navigator interface
        Navigator navigator = DocumentNavigator.getInstance();

        // pre-compile the XPath expressions
        XPath channelXpath = navigator.parseXPath("/tv/channel");
        XPath beckerXpath = navigator.parseXPath("//programme[title='Becker']");

        // create a mapping from channel ID to its first display name
        Map<String, String> channelMap = new HashMap<String, String>();
        List<?> channelNodes = channelXpath.selectNodes(doc);
        for (int ii = 0; ii < channelNodes.size(); ii++) {
            Element channelElem = (Element) channelNodes.get(ii);
            channelMap.put(channelElem.getAttribute("id"),
                           firstChildText(channelElem, "display-name"));
        }

        // find the episode nodes!
        List<?> nodeList = beckerXpath.selectNodes(doc);
        System.out.println(nodeList.size() + " matches found");
        for (int ii = 0; ii < nodeList.size(); ii++) {
            Element programElem = (Element) nodeList.get(ii);
            System.out.print("Episode title = '"
                             + firstChildText(programElem, "sub-title"));
            System.out.println("'; channel = "
                               + channelMap.get(programElem.getAttribute("channel")));
        }
    }

    // text of the first element with the given tag name under parent, or ""
    // if it is missing or empty (not every programme carries a sub-title)
    private static String firstChildText(Element parent, String tagName) {
        Node elem = parent.getElementsByTagName(tagName).item(0);
        Node text = (elem == null) ? null : elem.getFirstChild();
        return (text == null) ? "" : text.getNodeValue();
    }
}

The example does a little more than get the episode title. It first maps channel ID to channel name, then finds all the matching <programme> elements. This is something that you can do very quickly in Perl or Java but that might take a little more work in XSLT. Of course, emitting HTML based on the output would be much easier in XSLT, arguing for a combination of the two -- creating a pipeline with an XMLTV producer, a Java processor, and then a stylesheet using Cocoon might be one way to do it.

For a real tool you might consider SAX2 despite the greater complexity, and either implement page caching using a package like OSCache or produce the HTML in a nightly batch as well. XMLTV creates a lot of data, and a web app that transforms even a static file of that size on every request has the potential to be very slow.
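As a rough idea of what the SAX route looks like, the sketch below streams through the listings and collects episode titles for one show without building a DOM. The class name EpisodeScanner is hypothetical, it uses the SAX parser bundled with the JDK, and it assumes at most one title and one sub-title per programme (with localized feeds the last one seen wins).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EpisodeScanner extends DefaultHandler {
    private final String wantedTitle;
    private final List<String> subTitles = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private String title;
    private String subTitle;

    EpisodeScanner(String wantedTitle) {
        this.wantedTitle = wantedTitle;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if (qName.equals("programme")) {
            title = null;
            subTitle = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            title = text.toString().trim();
        } else if (qName.equals("sub-title")) {
            subTitle = text.toString().trim();
        } else if (qName.equals("programme")
                && wantedTitle.equals(title) && subTitle != null) {
            subTitles.add(subTitle);
        }
    }

    // Streams the source once and returns the sub-titles of every matching programme.
    static List<String> scan(InputSource source, String wantedTitle) throws Exception {
        EpisodeScanner handler = new EpisodeScanner(wantedTitle);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.subTitles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tv><programme channel='C2wcbs.zap2it.com'>"
                + "<title>Becker</title><sub-title>Small Wonder</sub-title></programme>"
                + "<programme channel='C2wcbs.zap2it.com'>"
                + "<title>CBS Morning News</title></programme></tv>";
        System.out.println(scan(new InputSource(new StringReader(xml)), "Becker"));
        // prints [Small Wonder]
    }
}
```

The handler holds only the current programme's title and sub-title, so memory use stays flat no matter how many days of listings the file contains.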

A Wish List

XMLTV is an evolving format; the version covered in this article is 0.5. A revised but convertible 0.6 format is on the way. For the future, I have a short wish list, all XML technical issues. (The content aspect already seems quite complete.)

  • It would be nice to have a standard namespace so one could consider weaving XMLTV content together with other XML vocabularies.
  • An XML schema would be useful here to allow stricter validation; a DTD can't cover the typed data XMLTV carries around. It would also provide a structured way to make visible the great documentation hidden away in comments in the DTD now.
  • The application itself emits a DOCTYPE with a relative location for the DTD; an HTTP URL might be more appropriate, especially since the application already requires access to the Web.
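The relative DTD location has a practical consequence: a parser asked to read the feed from some other directory may fail simply because it cannot find the DTD. One common workaround, sketched below, is to hand the parser an empty DTD through an EntityResolver. The class name DtdFreeParser is hypothetical, and the DOCTYPE line in the sample is my assumption about what the feed declares. Note the trade-off: with the DTD blanked out, nothing is validated and any attribute defaults the DTD declares are not applied.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DtdFreeParser {
    // Builds a DOM parser that never goes looking for an external DTD.
    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request (including the DTD) with empty content.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder;
    }

    public static void main(String[] args) throws Exception {
        // Assumed DOCTYPE with a relative system identifier.
        String xml = "<!DOCTYPE tv SYSTEM \"xmltv.dtd\">"
                + "<tv><channel id=\"C2wcbs.zap2it.com\"/></tv>";
        Document doc = newBuilder().parse(new InputSource(new StringReader(xml)));
        System.out.println(doc.getDocumentElement().getTagName()); // prints tv
    }
}
```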