Count-Min Sketches, and Other Digressions!
I have so far been posting almost exclusively about my personal side projects or inspiring talks from Elon Musk. Even though my day job and my nights + weekends projects happen to fall in the same skill domain. So, I thought I would take a break from posting about my iOS projects, and post a little something about the stuff I come across in my professional life.
I am not going to be talking about any actual projects that we’re working on at Shutterfly, as I’m sure that would mean violating some clause or the other in my NDA. Instead, I will talk about general themes around Scaling, Performance, Availability, Analytics, and Security that are common to any web application.
To start things off, I just want to introduce whoever is interested in data analytics to Bloom Filters, Count-Min Sketches, and Spark!
That was my TLDR for this post ;)
Now for a little bit of background:
At Shutterfly, we regularly conduct Tech Meetups and invite industry “experts” to come and give talks about some cutting edge technology they are currently involved in. Most of the time, the talks end up being just another product pitch for some minute uninteresting piece of technology with a bunch of industry buzz words thrown in for goof measure. Big Data, NoSQL, Hadoop, Whoop diDoop… Take your pick!
But every once in a while, someone comes along with a Talk that really challenges our assumptions about the status quo, and piques our curiosity and makes us dig deeper into the science behind it all. And the last meetup was one such affair, with a talk given by Tim Tully, Yahoo’s Chief Data Architect.
The talk was generically titled “Big Data”, and the last one we had with a similar title was just another product pitch with a Q&A session afterwards. However, Tully talked about Yahoo’s Analytics Architecture, and how their data collection pipeline is split up into two to be processed for both real-time statistics using Storm, Kafka, etc., as well as for conducting Business Intelligence (BI) Queries across a huge historical data set in post using MapReduce, Hadoop, Hive, etc.
The talk only lightly touched Hadoop and Map Reduce, which are currently the status quo whenever someone wants to conduct BI Queries across large data sets, and immediately delved into Tully & Team’s involvement with Spark and Shark, two open-source projects that aim to make querying huge data sets up to 100x faster, at the expense of a little bit of accuracy.
How does one achieve that kind of speed gain? And, what does it mean to lose a little bit of accuracy?
Enter the world of Non-Deterministic a.k.a Probabilistic a.k.a Sketching Data-Structures. This might be old school stuff for those that have been in the Big Data space for a while, but I didn’t go to Grad School, nor have I been deeply involved in Big Data Analytics at work. The only times I used Hadoop and Map Reduce, it was to run analytics jobs on a big HBase table to collect billing stats (which have to be 100% accurate), or it was for parallelizing my task of crawling the web and using Machine Learning algorithms to understand where the meat of the content resides on different websites. For instance, ignoring the site headers, footers, navigation bars and sidebars, and figuring out what constitutes as design elements found across all pages of a site, and what constitutes as unique content… but I digress! I’ll do another post about that project, which I worked on almost 10 years ago.
Back to Tully’s talk. My mind went blank when he threw the terms KMV Sketch on a Slide. That’s K-Minimum Value Sketch for us amateurs, and he talked about how they evolved out of Bloom Filters and Count-Min Sketches. Obviously, there was a lot still left for me to learn, and I went scouring the web to learn more, without having to deal with the crazy math-heavy explanations given in Wikipedia Articles. And my mind was blown by how simple these things were, yet how useful they were! So, here’s my gift to you:
Sketching Data Structures
That link does a better job of explaining things than I could ever do. So, click it, learn something new, and go expand your minds!
Also, Elon Musk isn’t the only guy with crazy ambitions. If you guys haven’t seen this yet, you should watch Larry Page’s interview at TED.