Wednesday, October 04, 2006

Negative Databases

Protecting Data Privacy through Hard-to-Reverse Negative Databases is an entertaining paper. Here's the problem it tries to solve: You want to store personal details but you don't want people to be able to mine the database. For example you might want to store brief personal details with a social security number so you can verify that someone is a member of some organisation, but for security reasons you don't want a hacker to be able to get a list of members even if they steal the entire database. That sounds like an impossible set of requirements. But the trick is this: instead of storing the personal details you store a list of all possible bitstrings representing personal details that members don't have. Insane right?


If the personal details are just 256 bits long, say, then to store details for 100 people, say, the database needs 2256-100 entries. That's vast. But the database of non-members is highly compressible. Let me lift an example directly from the papers. Suppose the personal information is three bits long and we wish to store the entries {000,100,101}. Then the non-entries are {001,010,011,110,111}. The non-entries can now be compressed using a wildcard scheme {001,*1*}. The *1* represents all of the entries fitting that pattern from 010 to 111. This turns out to be quite a compact representation. It's beginning to look plausible.


And here's the neat bit: SAT can be translated into the problem of finding valid database entries from the wildcard representation. So extracting entries is provably NP-complete, despite the fact that we can check entries in a reasonable amount of time (for suitable definitions of 'reasonable').


Anyway, I still have a whole slew of objections, and the actual compression scheme proposed has holes in it that can result in false positives. But what initially sounded implausible is beginning to look workable.


PS You can also read about this at the Economist but I've tried to spare you the irrelevant fluff :-)

5 comments:

  1. There are SAT solvers that work very well in practise, even though SAT is NP-complete.
    So knowing that a problem is hard in theory is no guarantee that it's hard in practise. I wonder what the case is here.

    ReplyDelete
  2. And even without a SAT solver, extracting an entry from a real database might be easier than solving SAT because real databases might lie in a nicely characterisable subspace of the set of all possible databases. (Eg. there are multiple choices for the wildcards in the -ve database. The actual wildcards generated might depend on the execution history of the algorithm as entries are added - and this may give additional clues.)

    But to me, what's interesting here isn't the actual specific algorithm, but that such a crazy sounding idea could turn out to have any kind of plausibility at all. Someone might eventually be able to make something of this.

    ReplyDelete
  3. Your blog normally makes my brain hurt (actually, it makes it wither into smoking, bubbling gobbets of confusion), but this one I can follow. The important thing with such a database isn't that it couldn't be attacked, but that once it's gotten to work that it'll be sufficiently infeasible for non-governments to attack it that there's no easy money in it.

    ReplyDelete
  4. Our immune systems work in a similar way. Computer immune systems exist that use regular expressions to match everything that is not usually found in the system.
    Such systems can never find everything that is not supposed to be in a system. Finding everything that is false in a system is a related problem.

    ReplyDelete
  5. with apologies for commenting on an old thread...

    @david: the economist article claims that the immune system inspired the researcher behind this concept

    @sigfpe: your economist link currently points to a "premium content" teaser. the full article appears to be available here: http://www.ebusinessforum.com/index.asp?layout=rich_story&doc_id=9286&categoryid=&channelid=&search=puts

    ReplyDelete