How Microsoft Research uses creative data mining

Because MSR (Microsoft Research) almost exclusively focuses on software, the group’s passions favor computing applications, rather than mechanisms or infrastructure. Stick a tape recorder in front of Rashid, for example, and he will muse at length about the implications of a world with virtually unlimited data storage.

In just a few years, he says, a terabyte of storage will be cheap enough to be within the reach of most consumers. Rashid is confident of an explosion in computing resources and capabilities. “What I do worry about is, are we going to be able to do anything with them?” Such is his charge to MSR: Figure out how people can take advantage of unlimited storage, gobs of bandwidth, and scads of processing power.

Examples of creative data mining and management abound. Anoop Gupta, a former Stanford computer science professor, has cooked up software that automatically produces highlight reels, or summaries, of video content.

Feed it a ballgame and the software will listen for cues such as the roar of the crowd, the crack of a bat, or the screech of an excited sportscaster. The software will add those moments to the highlight reel and leave all that tobacco spitting between pitches on the hard drive’s cutting room floor.

Gupta’s colleague, research sociologist Marc Smith, has developed Netscan, a way to mine data from discussion groups. By identifying the longest threads of messages in a group devoted to Microsoft’s Visual Basic programming language, for example, the product’s manager could find a feature that needs better documentation in the next version. By looking at which people post messages most often and how often people respond to them, a company might be able to identify the leaders in its customer community, Smith says.
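The two measures Smith describes, thread length and who posts and gets answered, amount to simple counting once the messages are in hand. The sketch below is an illustration only and is not Netscan's actual code. The message schema (`id`, `author`, `thread`, `reply_to`) and the function name `summarize_group` are assumptions made for the example.

```python
from collections import Counter


def summarize_group(messages):
    """Count thread sizes, posts per author, and replies received.

    messages: list of dicts with the hypothetical keys
              'id', 'author', 'thread', and 'reply_to' (None for a
              message that starts a thread).
    Returns (threads, posts, replies), where threads and posts are
    lists sorted from largest count to smallest, and replies is a
    Counter keyed by the author who was replied to.
    """
    thread_sizes = Counter(m["thread"] for m in messages)
    posts = Counter(m["author"] for m in messages)

    author_of = {m["id"]: m["author"] for m in messages}
    replies = Counter()
    for m in messages:
        parent_author = author_of.get(m["reply_to"])
        # Ignore thread starters and people answering themselves.
        if parent_author is not None and parent_author != m["author"]:
            replies[parent_author] += 1

    return thread_sizes.most_common(), posts.most_common(), replies


# A tiny made-up discussion group.
sample = [
    {"id": 1, "author": "ann", "thread": "printing", "reply_to": None},
    {"id": 2, "author": "bob", "thread": "printing", "reply_to": 1},
    {"id": 3, "author": "ann", "thread": "printing", "reply_to": 2},
    {"id": 4, "author": "cat", "thread": "forms", "reply_to": None},
    {"id": 5, "author": "bob", "thread": "forms", "reply_to": 4},
    {"id": 6, "author": "dan", "thread": "printing", "reply_to": 1},
]
threads, posts, replies = summarize_group(sample)
```

Here the longest thread points to the topic drawing the most discussion, and the authors with high post and reply counts are the candidates for community leaders.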

Build it or buy it?

The long-term focus is paramount, but when research at MSR yields a potentially useful product, it’s not ignored. “We bear responsibility, because we are part of Microsoft, to take technologies we develop and, where they seem mature, to move them into our products,” Rashid says. Smith’s Netscan, for example, might debut next year as a product for corporate marketers or others interested in tracking communities on the Web.

Some technology transfers have spawned whole new products, albeit with mixed success. Microsoft’s Digital Media Division, which started in MSR, produces the Windows Media Player and associated software for digital rights management as well as music and video compression. Researcher Henrique Malvar’s signal processing research team supports the group by churning out improved compression, watermarking technology, and noise-filtering algorithms. Meanwhile, MSR’s work on text retrieval and indexing will play a vital role in the upcoming release of the SharePoint portal server, which will help companies find documents and publish them on intranets.

But these efforts all raise the same question: Couldn’t Microsoft have just licensed these technologies? None of these products or features seems unique. There is no shortage of media compression algorithms, or text-indexing portal servers, or – another MSR product contribution – e-commerce databases that predict customer preferences. Is there any apparent reason – at least from a short-term product perspective – that such seemingly common innovations actually have to come from in-house?

Of course there is, replies Craig Mundie, Microsoft’s senior vice president of advanced strategies. Integration with an in-house lab is always greater than with an outside licensor. “We frequently have researchers who leave MSR for our product groups to expedite the transfer of technology, and then return to continue developing advanced software,” he says.