Last week I was lucky enough to attend a small-room presentation by Joshua S. Bloom, who delivered a tour de force on data science. He had to figure out a lot of this not because of the trendy big data problems and approaches that we see from just about every new vendor that pops up in the security space, but because he had so much data flowing in from telescopes and other instruments that doing good data science was his only defense. He has translated that learning into Wise.io, a company focused on harnessing machine learning for business. He also suggested two excellent, free follow-up reading assignments:
- “On Being a Data Skeptic” by Cathy O’Neil
- “Machine Learning: The High Interest Credit Card of Technical Debt,” a paper published by Google Research with many co-authors
I also highly recommend both. Mark Twain said that education is “that which reveals to the wise, and conceals from the stupid, the vast limits of their knowledge.” I thought I knew a bit about big data, data science, machine learning, and other advanced analytics topics going into this lecture. After the lecture and reading these pieces, I understand just how much there is yet to be learned.
I’ve already noted just how many companies at this year’s RSA Conference were pushing an analytics theme in their message. We were, too, in a way. When we talk about analytics, it’s with a small “a.” We take our own data and use specific analysis to determine whether a certain type of horizontal (lateral) movement attack may be in progress, or whether a certain user is the probable owner of a file share. This is analysis for sure, but it’s not the kind of data science used to find new supernovae in the night sky.
Many vendors claim they use that style of analysis. Maybe some do. But it seems to me they have a big data problem. Not #bigdata as in the hashtag; they have a data problem that’s pretty big. Many of these security-focused analytics vendors use SIEM and other aggregation systems as their data sources, and others go right to logs and systems of record. But anyone who has dealt with those systems knows they all have a serious data hygiene issue. We know because we’re in the business of feeding better, more contextualized data to the SIEM, IAM, and other systems these analytics players tap into, so they can increase their scope and accuracy. I may have a lot left to learn about data science, but it seems like having good data from the start is a requirement.