Ever since the early days of Archie and excite.com, it has been clear that computers can search through huge amounts of data quickly. If you know what the needle looks like, the size of the haystack is not so important.
The key to semantic text search in the context of audio or video media is metadata. If content has good logging at the point of creation, and top level titling, referencing, and indexing, then textual search can be extremely effective. The most awkward aspect of metadata is that it is often not actually “part” of the encoded data that comprises the video and the audio. Consequently, while the metadata can be bundled with the audio and video in, for example, an MPEG transport stream or an MP4 container, there is rarely any consistency in how encoders and transcoders carry the metadata layer found in one format across when creating, for example, a Flash video file or even a different transport stream. While the audio and video are central, the metadata has most certainly been treated as a bastard child, and its importance has emerged only as the sheer volume of content becomes otherwise unsearchable.
The advertising industry, with its keen appetite for monetization, realized very early the value of metadata consistency in creating the “currency” of its campaigns and in understanding the usage of its video. As such, long before any gold standard has emerged for indexing and logging general video in the online world, VAST (Video Ad Serving Template) has become widely adopted, driven, of course, by the commercial interests of the advertisers.
Sadly, the interests of “generic” content providers are never quite so aligned. Indeed in most production houses, still to this day, scripts and shot logging are added manually as an early stage of post-production, and yet all the data that is entered is discarded as versions of the content are mastered into various digital formats. Unless a specific effort is made, most metadata beyond a basic sheet of title and perhaps creator disappears, and with it any chance of searching for content within the video by text, short of regenerating the metadata from scratch.
When working on the Parliament Live website in its initial incarnation in 2003 or thereabouts, one of the team, Lee Atkinson at Westminster Digital, whom I was contracted to for the project, merged a number of interesting technologies to make a semantic search option for the videos. This worked because, as part of the tradition in the UK Parliament, an organization called Hansard take formal dictation records of everything that is said in the Houses. The original video feed we were handling was made for the BBC Parliament TV channel, and this meant that they had an automatic subtitle system, which used voice to text conversion to bring up a roughly accurate subtitle on the live video feed.
This subtitling data was stored as part of the original video source, but in a format that Lee could extract to a separate data channel in the Windows Media based workflow, and so he initially carried this on to enable subtitles on the webcasts we were looking after.
However, it dawned on him that he could search the subtitle data and retrieve time codes where there was a match. By combining the partly inaccurate data from the voice to text system with the highly accurate data from the Hansard scripts, it became possible to offer a usably accurate lookup on our video content management system that enabled the public to explore video relating to whatever subjects they wanted to search for. So it was possible to search for “Weapons of Mass Destruction,” and every mention in the House of Commons was brought up in a search result set, offering a direct link to the point in the relevant video where the comment was made, and referencing the Diaries and other supporting documents alongside.
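To make the idea concrete, here is a minimal sketch in Python of the lookup step only: scanning timed subtitle text for a phrase and returning the time codes where it begins. It is not a reconstruction of Lee's system, whose internals I am not describing here. The names Cue, find_mentions, and to_timecode are hypothetical, and the sketch assumes the subtitles have already been extracted into a list of (start time, text) pairs. It does not attempt the harder part, which is reconciling rough voice to text output against an accurate transcript such as Hansard.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle entry (hypothetical structure for this sketch)."""
    start_seconds: float  # offset into the video where the cue appears
    text: str             # subtitle text shown from that offset


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def find_mentions(cues: list[Cue], phrase: str, window: int = 2) -> list[float]:
    """Return the start times of cues in which the phrase begins.

    A spoken phrase often straddles two subtitle cues, so each cue is
    joined with the following `window - 1` cues before matching.
    """
    wanted = normalise(phrase)
    if not wanted:
        return []
    hits = []
    for i, cue in enumerate(cues):
        first = normalise(cue.text)
        joined = normalise(" ".join(c.text for c in cues[i:i + window]))
        position = joined.find(wanted)
        # Count the hit only if the phrase starts inside this cue's own
        # text; otherwise the next cue will report it, avoiding duplicates.
        if position != -1 and position < len(first):
            hits.append(cue.start_seconds)
    return hits


def to_timecode(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS for use in a deep link."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Invented sample data, purely for illustration.
    cues = [
        Cue(12.0, "The question of weapons of"),
        Cue(15.5, "mass destruction was raised"),
        Cue(20.0, "Order, order"),
        Cue(3725.0, "Weapons of Mass Destruction again"),
    ]
    for start in find_mentions(cues, "Weapons of Mass Destruction"):
        print(to_timecode(start))
```

Each returned offset can then be attached to the video URL as a start position, which is what makes a search result a direct link to the moment of the comment rather than to the top of the recording.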
Although some years later I saw a smoke-and-mirrors version of something similar from the now discredited Autonomy, Lee's system was the finest working semantic video search I have yet seen, and that was nearly 15 years ago. Windows Media made this kind of workflow particularly easy to set up, but the inherent capability that gave it the edge has since been lost, and sadly, even if that system were still live, very few people would have a Windows Media Player set up suitably to benefit from it.
Still to this day my iTunes collection is chaos. After using a couple of “magic apps” to tidy my metadata, I would conjecture that 30% of my music collection is now named incorrectly. So, even if I do manage to thumb my way through the search window on my Apple TV music search, the chances are that the file I eventually select may not be the one I want to listen to anyway.
So while metadata is obviously a central part of both the search and the tracking of the use of digital media assets, there seems to be some way to go before content is commonly deeply searchable based on metadata baked into it.
Naturally a large amount of content is legally published through well-ordered content management systems (CMS) - often a service provided by an online video publisher (OVP). Within the confines of a single CMS there will be a metadata structure. Typically content workflows are designed to ensure that both the asset and the correct metadata for that CMS are carried forward. This makes the CMS searchable, and because the setup is proprietary, it also provides OVPs with lock-in, since migrating the assets and the metadata to a competitor is going to add complexity to the workflow, until and unless a metadata standard is established.
-  http://www.iab.com/guidelines/digital-video-ad-serving-template-vast-3-0/