You're watching the makings of a great game when the worst fate crashes down on you: your party is out of snacks. As the ever-gracious host, you agree to get more as soon as halftime begins. You head to the grocery, lickety-split, and pick up the requisite items. You begin rushing back, hoping to outpace the clock in a desperate attempt to see the whole game.
Tragedy strikes. You hear yelling and clapping. The second half has begun. In fact, they're ten minutes in and there's no clear sign whether your team is up or down. You'd like to hear whether your team is winning from your friends, but only if your team is winning. If they're losing, you'd prefer to slowly uncover your eyes, taking the slowest possible peek at the scoreboard to allow the news time to sink in.
There's a problem here. If your team is winning, all is well and your friends can tell you the score immediately. But if they're losing, your friends can't indicate this to you in the way you'd prefer: hearing it from the TV and not from your friends. If they so much as point at the screen with a slightly saddened expression, you'll know what's up. Here's a table to be perfectly clear:
| Game status | Information |
|-------------|-------------|
| Winning | Winning! |
| Losing | non-winning, a.k.a. losing |
How do we resolve this seeming paradox? Techniques from data anonymization give us some help here. In this particular case, the answer could be that your friend flips a coin (hidden from your view, to avoid information leakage). If the coin comes up heads, and your team is winning, your friends tell you and all is well. If the coin comes up heads, and your team is losing, your friends don't say anything. If the coin is tails, then regardless of the status of the game your friends don't say anything.
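As a minimal sketch, the friends' coin protocol might look like this (the function name and return values are my own, for illustration):

```python
import random

def friends_report(team_winning):
    """The friends' protocol: flip a coin hidden from view.
    Announce the score only on heads AND a winning team;
    stay silent in every other case."""
    heads = random.random() < 0.5  # the hidden coin flip
    if heads and team_winning:
        return "Winning!"
    return None  # silence
```

Crucially, silence occurs whenever the team is losing, but also on half of the winning cases, so silence by itself can't simply be decoded as "losing."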
We've just shadowed the information with the help of randomization: the coin flip. A table makes it clear: we're only told information if our team is winning, but a lack of information doesn't guarantee either losing or winning.

| Coin | Game status | Information |
|------|-------------|-------------|
| Heads | Winning | Winning! |
| Heads | Losing | unknown |
| Tails | Winning | unknown |
| Tails | Losing | unknown |
Is this information perfectly hidden? Or have we discovered something, even when the entry in the "Information" column is "unknown"?
Let's try to build up some intuition first. Instead of flipping a coin, say your friend now uses a random number generator, with choices from 1-1000. They're a nice friend, so if your team is winning, they'll inform you if the generator gives any number 1-999. Otherwise, they say nothing. Our table looks exactly the same as it did above, just with different probabilities in one column.
| Number | Game status | Information |
|--------|-------------|-------------|
| 1 to 999 | Winning | Winning! |
| 1 to 999 | Losing | unknown |
| 1000 | Winning | unknown |
| 1000 | Losing | unknown |
Intuitively, this feels different than flipping a coin. There are 1000 ways for your friend to say nothing when your team is losing, and only one way for them to do so when your team is winning. Let's check the math though, to be sure. We assume as a prior that your team is equally likely to be winning or losing at this point.
Let's substitute our variables into Bayes' Theorem for the case of the number generator. We want the probability that our team is losing, given that our friends stay silent:

$$P(\text{losing} \mid \text{silence}) = \frac{P(\text{silence} \mid \text{losing})\,P(\text{losing})}{P(\text{silence} \mid \text{losing})\,P(\text{losing}) + P(\text{silence} \mid \text{winning})\,P(\text{winning})}$$

Silence is certain when losing, and happens with probability $\frac{1}{1000}$ when winning, so:

$$P(\text{losing} \mid \text{silence}) = \frac{1 \cdot \frac{1}{2}}{1 \cdot \frac{1}{2} + \frac{1}{1000} \cdot \frac{1}{2}} = \frac{1000}{1001} \approx 0.999$$
In the case of the coin we actually find an imbalance as well. Here silence is still certain when losing, but happens with probability $\frac{1}{2}$ when winning:

$$P(\text{losing} \mid \text{silence}) = \frac{1 \cdot \frac{1}{2}}{1 \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2}} = \frac{2}{3}$$
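As a sanity check on the math, we can estimate the same posteriors by Monte Carlo simulation. This sketch assumes the 50/50 prior on winning vs. losing and treats both protocols as special cases of "tell with probability p when winning":

```python
import random

def posterior_losing_given_silence(p_tell_when_winning, trials=200_000, seed=1):
    """Monte Carlo estimate of P(losing | silence), assuming a 50/50
    prior on winning vs. losing. The friends speak only when the team
    is winning AND their randomizer says to."""
    rng = random.Random(seed)
    silent = silent_and_losing = 0
    for _ in range(trials):
        losing = rng.random() < 0.5
        speaks = (not losing) and rng.random() < p_tell_when_winning
        if not speaks:
            silent += 1
            silent_and_losing += losing  # bool counts as 0 or 1
    return silent_and_losing / silent

print(posterior_losing_given_silence(0.5))         # coin flip: about 2/3
print(posterior_losing_given_silence(999 / 1000))  # generator: about 0.999
```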
This is an information leak! In the case of the random number generator, hearing nothing is very likely to be bad news, as expected. However, even a fair coin doesn't hide information properly: hearing nothing from your friends still means two-to-one odds of bad news. What have we learned here? Well, it's tough to hide some information while revealing the rest.
Information is often thought of as binary—either you know something or you don't. However, the probability that something is a fact does give us additional information, even though it may not feel that way. There are a host of cognitive biases which affect how humans treat things they're not sure about, and these intuitions can be messed with. It's the old "either I'll win the lottery, or I won't, so the odds must be 50-50". Dealing in probabilities helps us clarify why you don't really have even odds of winning the lottery, as well as why information is leaked in this scenario.
How does information hiding apply to real-world datasets? We definitely want to keep what should be anonymous out of view. However, data-level security hasn't gotten the level of focus it deserves. It's easier to assign roles (e.g. members can see everything, non-members can't see anything) than to allow access to a truly anonymized dataset. That past ignorance can be forgiven, but we should at least try to have security at every level going forward. There are plenty of statistical methods that can be applied to hide information, but they're used rarely compared to simply keeping data on lockdown. Information wants to be free!
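One classic statistical method in this family is randomized response: a survey can measure the population rate of a sensitive yes/no attribute while giving every individual answer plausible deniability, much like the coin flip above. A minimal sketch (the 30% rate and sample size are made up for illustration):

```python
import random

def randomized_response(truth, rng):
    """With probability 1/2, answer truthfully; otherwise answer
    uniformly at random. No single answer can be pinned on the
    respondent, but the aggregate rate is still recoverable."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(answers):
    # Observed 'yes' rate = 0.5 * true_rate + 0.25, so invert:
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

rng = random.Random(42)
true_rate = 0.30  # hypothetical rate of the sensitive attribute
answers = [randomized_response(rng.random() < true_rate, rng)
           for _ in range(100_000)]
print(estimate_true_rate(answers))  # close to 0.30
```

The trade-off is the same one we saw with the coin: each individual answer leaks only a shifted probability, not a certainty, while the statistic we actually care about survives.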
There's an area of interest for us that this relates to: machine learning models. Imagine a model trained on an unanonymized dataset. Could that model contain traces of the insecure information from the dataset? Well, yes. Of course. The model was trained on the dataset, and the only way for it to learn is from the data presented to it. In fact, in some cases it's possible to infer backwards from model to datapoints, and find the boundaries in the model where classifications switch over from class A to class B. Assuming that even a deep-learning model with many complicated layers will hide what should be hidden is a form of practicing security through obscurity. What's worse, sometimes these models are exposed directly to the internet, when the databases backing them never would be.
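As a toy illustration of inferring backwards from a model, here's a binary search that recovers the decision boundary of a black-box classifier exposed only as a query interface. The model and its "secret" threshold are entirely hypothetical, and real attacks are far more sophisticated, but the principle is the same:

```python
def extract_boundary(model, lo, hi, tol=1e-6):
    """Binary-search a black-box binary classifier for the input where
    its decision flips. 'model' is any callable returning 0 or 1,
    assumed monotone on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Hypothetical deployed model whose learned cutoff encodes something
# sensitive from the training data (e.g. a salary threshold).
SECRET_THRESHOLD = 41337.0
model = lambda x: 1 if x > SECRET_THRESHOLD else 0

recovered = extract_boundary(model, 0.0, 1_000_000.0)
print(recovered)  # close to 41337.0
```

About fifty queries suffice here; an attacker with unrestricted access to a model's predictions can map out its boundaries without ever touching the underlying database.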
In many ways, security through obscurity is a bad idea, and getting worse as attacks grow more sophisticated. Attackers have access to machine learning methods too, of course, and there are ways to automatically find weaknesses in many models. Hidden patterns in your model can betray something either you or your customers would prefer to keep safe. So, take steps to protect that information. You can keep it as hidden as you like, but one of these days somebody is going to find it.