Data set free

10.16.13

Kenton McHenry, one of the leaders of NCSA’s Image and Spatial Data Analysis Division, tells Access’ Barbara Jewett about a project benefitting everyone.

What is Brown Dog?

Brown Dog is an effort to deal with large, uncurated collections of data. Uncurated in that there’s no metadata associated with it. Sometimes there’s not much structure at all associated with it. A good example is a large collection of images. It’s easy for anybody to take pictures these days. And what do you do with them? Most people are like me and dump them into a folder on the computer and that’s it. Going back and finding something after years of having collections of pictures—you don’t. Not easily, anyway. It’s not easy for a computer to search through and find something in them unless you put metadata on it—you tag the image with a keyword saying there is so and so in this image and so forth. But nobody does that because it’s a long boring task and it hinders our ability to make more data. That’s not a good thing, especially when it comes to data used in science.

This project is motivated by the reality that these days science is reliant on software and large amounts of digital data, and the fact that you can’t always access that data in the future. The software, in addition, may not be good for science. Science is all about the scientific method, reproducing a documented procedure and getting the same result every time. If the software that was involved in that data no longer exists, you most likely cannot get that exact result anymore.

What instigated this project?

Dealing with large collections of uncurated data is important. Not just images and other unstructured types of data; this also involves old data, legacy data. With Brown Dog we’re trying to do two things: deal with file formats and address the lack of metadata in these collections of uncurated and/or unstructured data. This project stems from our team’s work with the National Archives.

The National Archives were trying to find an optimal file format to store 3D data they were getting from various government agencies (e.g., with regard to ship parts for the Navy). Vendors used different software, and each had its own file format. We wanted to find the ideal file format empirically, to say why this file format is the best file format for 3D data in terms of long-term preservation. We decided to base it on information loss, because when you do conversions from one file format to another file format, which is going to be necessary in order to get it to that optimal file format, you tend to lose some information. Our idea was to find a file format where you lost the least amount of information in converting. But in order to do that we needed a conversion system that was able to do, in essence, all possible conversions. And that did not exist.
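The selection idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the format names are placeholders, the loss values are invented, and averaging the loss over all source formats is one possible scoring rule, not necessarily the one the team used.

```python
from statistics import mean

# Hypothetical measured loss (0 = lossless, 1 = everything lost) when converting
# from a source format (outer key) into a candidate target format (inner key).
# Format names and values are made up purely for illustration.
loss = {
    "fmt_a": {"fmt_b": 0.40, "fmt_c": 0.10, "fmt_d": 0.05},
    "fmt_b": {"fmt_a": 0.05, "fmt_c": 0.05, "fmt_d": 0.10},
    "fmt_c": {"fmt_a": 0.15, "fmt_b": 0.45, "fmt_d": 0.10},
    "fmt_d": {"fmt_a": 0.20, "fmt_b": 0.50, "fmt_c": 0.15},
}

def average_loss_into(target, loss):
    """Mean loss over every other source format that can be converted into target."""
    return mean(
        row[target]
        for source, row in loss.items()
        if source != target and target in row
    )

def best_target(loss):
    """Candidate format with the lowest average loss (ties broken by name)."""
    candidates = {target for row in loss.values() for target in row}
    return min(sorted(candidates), key=lambda t: average_loss_into(t, loss))
```

With these invented numbers, `best_target(loss)` selects `fmt_d`, the format that the other three convert into with the least average loss.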

So what did you do?

Building a universal converter that can convert anything to anything was totally impractical. So we created a system called Polyglot based on a service called a software server. What this did was allow us to use whatever software already existed to do the conversion. Most software supports importing and exporting to a handful of other formats for some degree of portability, so we used that to get in and out of formats, including formats where no specification was available. What Polyglot will do is use not just the conversions that any one application has, but chain together multiple applications in order to hop across different file formats. So if there is no direct conversion in one application it will try to find intermediary applications and formats to get to your destination somehow.
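The hopping across formats described above amounts to a path search over a graph whose nodes are formats and whose edges are conversions offered by individual applications. The following is a minimal sketch of that idea, not Polyglot's actual implementation: the application names, format names, and the choice of breadth-first search (fewest hops) are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical capabilities: for each application, the formats it can read
# and, for each of those, the formats it can export. All names are placeholders.
APPLICATIONS = {
    "app_one": {"fmt_a": ["fmt_b"]},
    "app_two": {"fmt_b": ["fmt_c"], "fmt_c": ["fmt_b"]},
    "app_three": {"fmt_c": ["fmt_d"]},
}

def find_conversion_chain(source, target, applications):
    """Breadth-first search for the shortest chain of (app, from, to) hops.

    Returns [] if no conversion is needed and None if the target is unreachable.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        for app, conversions in applications.items():
            for out_fmt in conversions.get(fmt, []):
                if out_fmt in seen:
                    continue
                step = chain + [(app, fmt, out_fmt)]
                if out_fmt == target:
                    return step
                seen.add(out_fmt)
                queue.append((out_fmt, step))
    return None  # no chain of applications reaches the target
```

Here `find_conversion_chain("fmt_a", "fmt_d", APPLICATIONS)` returns a three-hop chain through all three applications. Breadth-first search minimizes the number of hops only; choosing chains by information loss would need a weighted search instead.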

Brown Dog will take Polyglot and the software servers and harden them to make a service we’re going to call the data access proxy (DAP), which will ideally make file formats a non-issue for everybody. That’s the goal.

How does Brown Dog work?

The idea here is kind of grand. An analogy would be the Domain Name System (DNS), which tells your Internet browser how to translate the domain names in URLs into IP addresses, because the numbers are what actually matter. DNS is a relatively simple thing but it’s essential; without it you wouldn’t be able to type google.com and get to google.com. But you don’t see it and you take it for granted. I’ve been calling our work a DNS for data. It fills an essentially similar role but with the idea of making data more accessible. The data access proxy part of this will make it so that file formats are no longer an issue.

What happens if you try to download a file but don’t have an application that can open the file? You can save the file, but you can’t view or edit it. Once you set up your machine with a data access proxy, files like that will work, automatically. Based on the source format and the target format that you specify, the proxy will send it to a software server somewhere through a chain of applications that will convert it to whatever you need it to be. The idea of the data access proxy is to make the Internet agnostic to file formats.

You mentioned dealing with uncurated data.

The other half of Brown Dog is the DTS, which stands for data tilling service. The DTS will serve as a framework for storing and running analysis tools which will be used to examine the contents of a file for the purpose of automatically creating metadata or numerical signatures capturing some aspect of the file’s contents. Think of keywords or tags on an image. The purpose of this tool would be to automatically generate some of that. This metadata can then be used to index or search through the data, or possibly some other form of analysis over the data. Essentially, getting this metadata on a file is a first step towards further data analysis or user curation. The DTS prepares the data, as a tiller prepares the soil, for these next steps.
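One way to picture such a framework is a registry of small analysis tools, each declaring which file types it handles, with a single entry point that runs whichever ones apply and merges their output. The sketch below is hypothetical throughout: the `extractor` decorator, the `till` function, and the three toy extractors are invented for illustration and say nothing about the real DTS interface.

```python
import os

# Registry mapping a file extension to the extractor functions that handle it.
EXTRACTORS = {}

def extractor(*extensions):
    """Decorator registering a function as an extractor for the given extensions."""
    def register(func):
        for ext in extensions:
            EXTRACTORS.setdefault(ext, []).append(func)
        return func
    return register

@extractor(".txt", ".csv")
def line_count(name, data):
    return {"lines": len(data.splitlines())}

@extractor(".txt")
def word_count(name, data):
    return {"words": len(data.split())}

@extractor(".csv")
def column_count(name, data):
    # Count comma-separated fields in the first line only (a crude heuristic).
    first = data.splitlines()[0] if data else b""
    return {"columns": len(first.split(b","))}

def till(name, data):
    """Run every extractor registered for this file type and merge the metadata."""
    ext = os.path.splitext(name)[1].lower()
    metadata = {}
    for func in EXTRACTORS.get(ext, []):
        metadata.update(func(name, data))
    return metadata
```

Calling `till("table.csv", b"a,b,c\n1,2,3\n")` runs the two extractors registered for `.csv` and returns their combined metadata; a file type with no registered extractor simply yields an empty dictionary.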

And how will the DTS be used?

Again, the idea is that this is for every Internet user, not just scientists. With a browser you go to a URL, and a content-based search would be available to you. As an example let’s consider pictures again. What the DTS will allow you to do is drag a picture from your desktop to your browser and drop it there. The browser would then automatically call the DTS to index all images at the given URL, extracting features, using a measure to compare the features of the various files, and returning a rank-ordered list of the data most similar, in terms of contents, to your query. The topmost would be the most similar looking, the second one the second most similar looking, and so on. There’s a lot happening in the background, but it’s not visible to the user.
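The extract, compare, and rank steps can be illustrated with a toy example. Everything here is an assumption chosen for brevity: images are stood in for by short lists of grey values, the feature is a four-bin intensity histogram, and the measure is Euclidean distance. Real content-based retrieval uses far richer features.

```python
import math

def intensity_histogram(pixels, bins=4):
    """Normalized histogram of 0-255 grey values; a stand-in for a real image feature."""
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [count / total for count in counts]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query_pixels, collection):
    """Return names from collection ordered from most to least similar to the query."""
    query = intensity_histogram(query_pixels)
    scored = [
        (euclidean(query, intensity_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _, name in sorted(scored)]

# Hypothetical "images" at a URL, reduced to a handful of grey values each.
collection = {
    "dark": [10, 20, 30, 40],
    "bright": [200, 220, 240, 250],
    "mixed": [10, 100, 150, 250],
}
ranking = rank_by_similarity([5, 15, 25, 130], collection)
```

For this mostly dark query, `ranking` lists `dark` first, then `mixed`, then `bright`.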

There are many different ways to do these kinds of comparisons. Content-based image retrieval is an open research topic in the field of computer vision. Ideally the DTS keeps a library of all the ways of doing comparisons and you would be able to configure which one it uses, the idea being that certain ones are better than others for certain situations in terms of quality and computational cost.
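A configurable library of comparison measures might look like a lookup table of interchangeable distance functions. This is a sketch under assumptions: the `MEASURES` table, the `rank` function, and the three measures chosen are illustrative and are not taken from the DTS.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical library of measures, selectable by name.
MEASURES = {
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
    "cosine": cosine_distance,
}

def rank(query, features, measure="euclidean"):
    """Order the keys of features by distance to query under the chosen measure."""
    try:
        distance = MEASURES[measure]
    except KeyError:
        raise ValueError(f"unknown measure: {measure!r}") from None
    return sorted(features, key=lambda name: distance(query, features[name]))

# Two made-up feature vectors: "a" points the same way as the query but is far
# away; "b" is nearby but points in a different direction.
features = {"a": [10.0, 0.0], "b": [1.0, 1.0]}
```

With the query `[1.0, 0.0]`, the Euclidean measure ranks `b` first while the cosine measure ranks `a` first, which is the kind of situation-dependent difference that makes the choice of measure worth configuring.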

If this is going to become a standard part of how we use the Internet, I’m thinking about privacy issues for personal images, and proprietary issues for scientific images.

The actual data on the website is on the website’s server. It would have to be downloaded in order to be processed; we’d have to go through each image and do a feature extraction. The same goes for the query image: it would probably have to be on our server because we would have to do the feature extraction, but we’d throw it away. We only need it for the feature extraction and then we’d get rid of it. And once the results are returned to you we don’t need anything; we could trash it all. The browser example that I’m using here is just an example application of the underlying system, both with the DAP and DTS. The idea would be that potentially you would create other applications, maybe in the operating system’s file manager itself, that would call all these services directly. So in your file manager, your window with all your folders in it, you could maybe do something like this in there once it’s implemented. The framework is what we’re building behind it.

We’ve extracted features. What about extracting tags?

Those are the signatures I mentioned, which are one part of this. The other is the metadata part. The idea, at least in the browser case, is that the files under a URL would be searchable through a modified find box that would take into account keyword metadata automatically extracted from the files. So you have keywords like people, car, Panicum virgatum, etc. You would press enter and then, if you are looking for images, you would get all the files with images of people and cars and switchgrass, with some accuracy of course. Behind the scenes it would be running these metadata extractors on all the files at that URL, potentially on a diverse set of distributed resources. But you don’t see that. You just see that you can access data more easily.
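Once keywords have been extracted, a find box like the one described could be backed by a simple inverted index from keyword to files. The sketch below assumes the tags already exist; the file names, the tags, and the `build_index` and `search` helpers are all hypothetical.

```python
from collections import defaultdict

def build_index(tags_by_file):
    """Map each lower-cased keyword to the set of files tagged with it."""
    index = defaultdict(set)
    for name, tags in tags_by_file.items():
        for tag in tags:
            index[tag.lower()].add(name)
    return index

def search(index, keywords, match_all=False):
    """Files matching any keyword (default) or every keyword, sorted by name."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    if not hits:
        return []
    combined = set.intersection(*hits) if match_all else set.union(*hits)
    return sorted(combined)

# Made-up output of automatic metadata extraction for three files at a URL.
tags_by_file = {
    "field.jpg": ["Panicum virgatum", "person"],
    "road.jpg": ["car", "person"],
    "lab.jpg": ["person"],
}
index = build_index(tags_by_file)
```

Searching for `["car", "Panicum virgatum"]` returns both `field.jpg` and `road.jpg`, while passing `match_all=True` narrows the result to files carrying every keyword. In practice automatically extracted tags are imperfect, so results would hold only "with some accuracy", as noted above.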

How did you come up with the name Brown Dog?

The brown dog is the proverbial super mutt, a mixture of many dogs. Our Brown Dog is also a mixture. It’s a mixture of the software and the analysis tools that go into the DAP and DTS. It’s a framework that brings them all together in a common way so they’re accessible through a common and programmable interface, and through that it’s able to address these problems of using uncurated collections in a practical and realistic way. So that’s where the name comes from. It’s a mutt. Of software.