Capture v. Derive

Posted on March 5, 2006
Filed Under metadata, tagging, virage, web 2.0, yahoo

Universal Law:  It is easier, cheaper and more accurate to capture metadata upstream than to reverse engineer it downstream.

Back at Virage, we worked on the problem of indexing rich media - deriving metadata from video.  We would apply all kinds of fancy (and fuzzy) technology like speech recognition, automatic scene change detection, face recognition, etc. to commercial broadcast video so that you could later perform a query like, “Find me archival footage where George Bush utters the terms ‘Iraq’ and ‘weapons of mass destruction.’”

What was fascinating (and frustrating) about this endeavor is that we were applying a lot of computationally expensive and error-prone techniques to reverse engineer metadata that by all rights shoulda and coulda been easily married to the media further upstream.  Partly this was because the analog television signal in the US is based on a standard that is more than 50 years old.  There’s no convenient place to put interesting metadata (although we did some very interesting projects stuffing metadata and even entire websites into the vertical blanking interval of the signal).  Even as the industry migrates to digital formats (MPEG2), the data in the stream is generally what is minimally needed to reconstitute the signal and nothing more.  MPEG4 and MPEG7 at least pay homage to metadata by having representations built into the standards.

Applying speech recognition to derive a searchable transcript seems bass-ackwards since for much video of interest the protagonists are reading material that is already in digital form (whether from a teleprompter or a script.)  So much metadata is needlessly thrown away in the production process.

In particular, cameras should populate the stream with all of the easy stuff, including:

Heart rate and galvanic skin response of the camera operator?  Ok, maybe not… I’m making a point.  That point is that it is relatively easy and cheap to use sensors to capture these kinds of things in the moment… but difficult (and in the case of barometric pressure, impossible) to derive them post facto.  Why would you want to know this stuff?  I’ll be the first to confess that I don’t know…  but that’s not the point IMHO.  It’s so easy and cheap to capture these, and so expensive and error-prone to derive them, that we should simply do the former when practical.

An admittedly slightly off-point example…  When the Monica Lewinsky story broke, the archival shot of her and Clinton hugging suddenly became newsworthy.  Until that moment she was just one of tens of thousands of bystanders amongst thousands of hours of archival footage.  Point being - you don’t always know what’s important at time of capture.

So segueing to today…  Marc, Ellen, Mor and the rest of the team at Yahoo Research Berkeley have recently released ZoneTag.  One of the things that ZoneTag does is take advantage of context.  I carry around a Treo 650 with Good software installed for email, calendar, contact sync’ing.  When I snap a photo the device knows a lot of context automagically, such as:  who I am, time (via the clock), where I am supposed to be (via the calendar), where I actually am (via the nearest cell phone tower’s ID), who I am supposed to be with (via calendar), what people / devices might be around me (via bluetooth co-presence), etc.  Generally most of this valuable context is lost when I upload an image to Flickr via the email gateway.  I end up with a raw JPG (in the case of the Treo even the EXIF fields are empty.)

ZoneTag lays the foundation for fixing this and leveraging this information.
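
To make the idea concrete, here is a minimal sketch of bundling the context a device already knows at capture time into a record that travels with the image. It is not ZoneTag's actual format: the `CaptureContext` fields, the `build_sidecar` helper, and all sample values are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CaptureContext:
    """Context the device already knows when the shutter fires.

    All field names are hypothetical, not taken from any real tool.
    """
    user: str
    taken_at: str                          # ISO 8601 timestamp from the device clock
    cell_tower_id: Optional[str] = None    # coarse "where I actually am"
    calendar_event: Optional[str] = None   # "where I am supposed to be"
    nearby_devices: List[str] = field(default_factory=list)  # Bluetooth co-presence


def build_sidecar(image_name: str, context: CaptureContext) -> str:
    """Serialize the context as JSON so it can be stored or sent with the image."""
    record = {"image": image_name, "context": asdict(context)}
    return json.dumps(record, sort_keys=True)


# Sample values are made up for the example.
ctx = CaptureContext(
    user="bradley",
    taken_at=datetime(2006, 3, 5, 14, 30, tzinfo=timezone.utc).isoformat(),
    cell_tower_id="tower-0001",
    calendar_event="Team offsite",
    nearby_devices=["phone-a", "laptop-b"],
)
sidecar = build_sidecar("IMG_0001.jpg", ctx)
print(sidecar)
```

The point of the sketch is only that none of these fields needs to be derived later: each one is a cheap read of something the device knows at the moment of capture.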

It also dabbles in the next level of transformation from signal to knowledge.  Knowing the ID of the closest cell phone tower gives us coarse location, but it’s not in a form that’s particularly useful.  Something like a ZIP code, a city name, or a lat/long would be a much more conventional and useful representation.  So in order to make that transformation, ZoneTag relies on people to build up the necessary look-up tables.
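
A people-built look-up table of this kind can be sketched in a few lines. This is an illustrative guess at the general technique (majority vote over user-supplied labels), not a description of how ZoneTag actually works; the `TowerLookup` class, its methods, the `min_votes` threshold, and the tower IDs are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class TowerLookup:
    """Crowd-built table mapping a cell tower ID to a human-readable place."""

    def __init__(self) -> None:
        # tower ID -> tally of place names that users have reported for it
        self._votes: Dict[str, Counter] = defaultdict(Counter)

    def report(self, tower_id: str, place: str) -> None:
        """Record one user's label for a tower (normalized for case and spacing)."""
        self._votes[tower_id][place.strip().lower()] += 1

    def resolve(self, tower_id: str, min_votes: int = 2) -> Optional[str]:
        """Return the most-reported place, or None if too few people agree."""
        votes = self._votes.get(tower_id)
        if not votes:
            return None
        place, count = votes.most_common(1)[0]
        return place if count >= min_votes else None


lookup = TowerLookup()
lookup.report("tower-0001", "Berkeley")
lookup.report("tower-0001", "berkeley ")
lookup.report("tower-0001", "Oakland")
lookup.report("tower-0002", "Sunnyvale")

print(lookup.resolve("tower-0001"))  # two matching reports, so a place is returned
print(lookup.resolve("tower-0002"))  # a single report is below the threshold
```

Requiring a minimum number of agreeing reports is one simple way to keep a single mistaken label from polluting the table; the more people contribute, the more towers clear the bar.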

This is subtle, but cool.  Whereas I’ve been talking about capturing raw signal from sensors, once we add people (and especially many people) to the mix we can do more interesting things.  To foreshadow the kinds of things coming… 

All of the above examples lead to extrapolations that are “fuzzy.”  Just as my clustering example might have problems with people “eating turkey in Turkey”, it’s one thing to have the knowledge - it’s another to know how to use it in ways that provide value back to users.  This is an area where we need to tread lightly, and is worthy of another post (and probably in fact a tome to be written by someone much more cleverer than me.)

Even as I remain optimistic that we’ll eventually solve the generalized computer vision problem (“Computer - what’s in this picture?”), I wonder how much value it will ultimately deliver.  In addition to what’s in the picture, I want to know if it’s funny, ironic, or interesting.  Much of the metadata people most care about is not likely to be algorithmically derived against the signal in isolation.  Acoustic analysis of music (beats per minute, etc.) tends to be a poor predictor of taste, while collaborative filtering (“People who liked that, also liked this…”) tends to work better.
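
As a toy illustration of the “People who liked that, also liked this…” idea, the sketch below ranks items purely by how often they are liked by the same people, without looking at the signal itself. It is a deliberately simple co-occurrence count, not a production recommender; the `also_liked` function, the user names, and the song IDs are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def also_liked(likes: Dict[str, Set[str]], item: str, top_n: int = 3) -> List[str]:
    """Rank other items by how many fans of `item` also liked them."""
    counts: Counter = Counter()
    for liked in likes.values():
        if item in liked:
            # Every other item this person liked gets one co-occurrence vote.
            counts.update(liked - {item})
    # Sort by vote count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


likes = {
    "ann": {"song_a", "song_b", "song_c"},
    "bob": {"song_a", "song_b"},
    "cat": {"song_a", "song_d"},
    "dan": {"song_c", "song_d"},
}

print(also_liked(likes, "song_a"))
```

Nothing here analyzes tempo or timbre; the only input is what people did, which is the sense in which the useful metadata comes from people rather than from the signal.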

Again - all of this resonates nicely with the “people plus machines” philosophy captured in the “Better Search through People” mantra.  Smart sensors, cutting-edge technology, algorithms, etc.  are interspersed throughout these systems, not just at one end or the other.  There are plenty of worthwhile problems to spend our computrons on, without burdening the poor machines with the task of reinventing the metadata we left by the side of the road…

Comments

8 Responses to “Capture v. Derive”

  1. Long ago Y! on March 5th, 2006 6:36 pm

    First of all:

    “impossible to derive them ipso facto.”

    post facto, not ipso facto…  [corrected, thanks.]

    Also:

    There’s a lot to be learned in observing how information is manipulated by both its creator and subsequent consumers. Who saves it, who forwards it to whom, who deletes it, how long they view it, how frequently they view it and so on.

  2. joe lazarus on March 6th, 2006 1:55 am

    cool concept. can’t wait to give zonetag a spin.

    nice job on the new blog. the posts are great so far. not sure if you realized, but there seems to be some problem with your rss feed in that old posts keep appearing as new content in my blog reader, bloglines.

    thanks again for sharing your thoughts!

  3. LULOP.org [opensource] » Capture vs Derive on March 6th, 2006 11:37 am

    […] Metadata should be added to the video by the capture device rather than derived from the video itself with computer vision analysis. Because it’s enormously easier for the camera to capture things relating to the video with appropriate sensors than for any algorithm to derive them from the video, if not impossible. read this beautiful post on Capture versus Derive. Oh, and the weblog comes from one Bradley Horowitz of Yahoo, formerly of Virage… […]

  4. LULOP.org [opensource] » Capture vs Derive on March 6th, 2006 11:37 am

    […] Metadata should be added to the video by the capture device rather than derived from the video itself with computer vision analysis. Because it’s enormously easier for the camera to capture things relating to the video with appropriate sensors than for any algorithm to derive them from the video, if not impossible. read this beautiful post on Capture versus Derive. Oh, and the weblog comes from one Bradley Horowitz of Yahoo, formerly of Virage… […]

  5. Abu Hurayrah on March 6th, 2006 3:14 pm

    How much are we willing to share about ourselves, though, when it comes to the data we’re contributing? Granted, I realize that I am leaving my browser type, version, OS, time of access, IP address, and so on wherever I go on the Internet (though I don’t have to, with certain Firefox extensions able to change the User Agent string and whatnot), but the idea that uploading an image to, say, Flickr, is going to record more than I may realize I’m sharing is somewhat disconcerting.

    However, I also realize that I am giving up more information now than I was 5 years ago, and even more than before that. Still, it would seem this rate of private information exchange is accelerating (that is, a 2nd derivative greater than zero), and we need to simultaneously develop ways to guard what we want guarded.

    I find the concept of metadata very interesting, because it helps us put more data into a form easier-to-manage, search, and index for our computers, but is hidden automation of all of these components really the best and/or only way? Can we do it as sort of an opt-in method, where those with more technical skills can manually tag and index their own content?

    I am referring to the near future, and not so much to the now.

  6. Pete Cashmore on March 7th, 2006 12:41 am

    This is an excellent post - I’m really enjoying your blog.

    In response to Abu’s comment, did anyone see the Slashdot post where the commenters figured out the location of a botnet creator based on the metadata of an image in the Washington Post? More here:

    http://www.techdirt.com/articles/20060221/0318222_F.shtml

  7. Neela Jacques on March 22nd, 2006 11:13 pm

    Your blog started me thinking about Meta data. It does seem clear that as we capture and attach more meta data the underlying content can become exponentially more useful (the REAL long tail is generated).

    It may be useful however to think about classifying types of meta data. It appears to me that Meta data (for media) might be broken up into 3 separate categories:
    1. Fact (It’s George Bush, taken at 2pm)
    2. Preponderance (most people, but not all, would agree on the tag, i.e. it’s a joke)
    3. Subjective/context dependent (VERY funny, well written article, conservative)

    Computers (using regressions etc.) can do a very good job of figuring out the first if given a few data points. The second requires more data points (and requires statistics) but again is solvable.
    This last category is IMHO the most interesting. It is a significant challenge. Amazon first, then so many others, added a ton of value by providing places to find others’ opinions of things we might be interested in (using Metadata in the broadest sense of the word here). The problem is that (using the Amazon example) I don’t read every book people recommend to me, but there are some people I know think similarly to me whose book suggestions have a much stronger impact on me (e.g. buying Crossing the Chasm on Bradley’s suggestion in 1998). What we care about is not whether someone likes it, but whether someone LIKE ME likes it. Expanding that to Metadata, some of the most useful meta data would be data that tells me what people most similar to ME think it is… This is much harder for a computer to do… but again with enough data, for places where it really matters, definitely worth it. The obvious examples would be places like NetFlix, Tivo and Amazon, but also blog engines, joke pages etc. Those that are able to add this third layer will have a HUGE advantage over their competitors as they will be able to match people with the content they are looking for, interested in, and most likely to be surprised by.

    (In Heinlein’s The Moon is a Harsh Mistress a computer seeking to understand humor has the main character rate joke after joke so that it can grasp what is funny…but does he really understand what is funny? Most humans don’t agree on this themselves….)

  8. thaddaeus brophy on March 2nd, 2007 12:44 am

    Most metadata really only passes for meta-blahblah in my opinion. I think Sir Arthur Conan Doyle said something about the probability of meaningfully engineering metadata from primary sources without human intervention: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” How Virage worked is a great clue of how it can be done however–better hire some talented semioticians–complex grammars aren’t really very easily comprehensible within symbolic languages or the people who have tortured their minds ( ;7 ), to work in them ‘prima facie’, IMHO.