“Big Data” …. A Rant


So this is a repost of a piece that I wrote in anger about 20 or so months ago. Sadly little has changed and the world is still blindly marching towards the cliff of big data. I hope you enjoy reading about my hatred of shitty tech-marketing as much as I enjoyed drinking the alcohol that fueled this invective.

 ---8<------8<------8<------8<--- 

The next person to freely bandy about the phrase "big data" is going to get a kick in the dick from a dwarf. And if that's too sexist and dwarfist for you, then I'll extend the above threat to include, at your option, whatever non-gender-biased part of your reproductive tract is both marginally accessible and maximally painful (I'm fairly certain ((95% CI)) that I can find some portion of the anatomy that fits all the selection criteria). I'll even throw in an individual of average height to do the kicking - assuming a normally distributed population of at least 30 individuals.... just so you know, that's a stats joke.

So what has me so pissed off about this? Firstly, I'm on a plane, and between the large corporations that couldn't give a shit about my flying experience and the dedicated TSA agents who were vigilantly keeping their Facebook accounts terror free on their mobile phones.... well, as Jack Burton says, "Son of a bitch must pay". (If you don't know who Jack Burton is - go to school and get ye some learnin'). So today I'm unleashing my pent-up rage on the "Big Data" crew; devotees and neophytes alike.

So let me start out these 95 theses with thesis number zero:

You do not have a big data problem. You have a functional ignorance problem.

Go back and read that a few times if necessary. Or to put it another way:

"Before you turned to big data, did you first try 'small data'(tm)"

Or to put it yet a third and more direct way, a way that all those who are falling in love with "Big Data" can simply understand:

"What's your fucking question?"

That's right - you heard me. "What's your fucking question?" Most people who are "turning to big data" in their time of need don't even know the question that they are questing for. As a result, many of the current "big data" set (pun intended) are collecting exabytes of data to hide their collective ignorance. They amass huge amounts of "data" (not information, mind you) and then wave the magic buzzword-worthy technical concept of the day to make it seem like they have the proverbial clue. The best part is that until you know what the question is, it's difficult if not impossible to know what data might be helpful. It is highly unlikely that my purchase history from Amazon is going to help you locate the nest of the closest time vulture or the next planetoid.

Even worse than the clueless but well-intentioned are the data unicorns that superglue wings onto pigs with a little math and then declare wholeheartedly that pigs should fly because there is a strong correlation between wings and flight. Oftentimes, data unicorns don't like the answer that they have in hand, so they collect more data in the hopes that reality will somehow bend to their will.

XKCD - Time Vultures

I angrily put forth the following: the very people who should be the champions of using powerful data analytics to answer interesting questions and make new business models - e.g. startups - are cheapening the term by using it to prop up an endless series of questionable business models and generally bad ideas.

Here's what I imagine happens during the cool investment pitch of the week:

MRS INVESTOR: Bob, thanks for coming. We're really interested in how WeasleDirect.com is going to use the money should we choose to invest.

BOB BIGDATA: Don't worry Mrs. Investor, we're going to collect every shred of information we can and the answers will magically appear to give you your $2.5M back.

MRS INVESTOR: That's really interesting, Bob, but I'm curious as to the specifics of how you're going to use that money. Are you going to hire a sales team? Are you going to use this to enter a new market? I'm a little concerned that the current market for home weasel delivery is too small to return on our $2.5M investment through online direct sales.

BOB BIGDATA: Mrs. Investor, given the current increase in powerful data analysis tools, we expect that data collecting density will asymptotically approach what we like to call the Maginot Line. If we assume that people in the home weasel delivery market are distributed as OUI^2, it's fairly trivial to then optimize the Google funnel using a fairly standard Yahoo matrix. We're not sure what that looks like yet, but I feel confident that we can collect enough data to fully optimize the Yahoo matrix. It's a pretty standard "Big Data" problem.

At this point, what the investor hears is: "blah blah blah 'big data' blah blah blah...." and sends over a term sheet. Because, "big data" is the strategic spot for the early stage investor guy or gal, right?

Twelve months later, we return to a different scene:

MRS INVESTOR: Hi Bob, I'm a little concerned, I got a call from the bank saying that you've gone through the line of credit and that you're almost out of money?

BOB BIGDATA: We're confident that we're close to figuring this out. We recently started asking customers to share daily bowel movement information with us. We think that it could be the key that unblocks this whole thing. It's pretty standard in 'Big Data' to run into these sorts of {$TECHNO_BABLE} issues.

MRS INVESTOR: Umm... okay. I'm a little worried about the fundamentals of your business model, Bob - can you walk me through it again?

BOB BIGDATA: Sure thing. If you look into these goat entrails that we've spread out on the whiteboard, it's fairly intuitive and simple to see that our model has us owning 107% of the home weasel delivery market.

By month 24, Bob Bigdata and WeasleDirect.com have pivoted six times and still don't have anything to show for it. What's more frustrating to Bob is that each time, the pivot was "Data Driven" and was supported by "the numbers" from his

If the current trend continues, "Big data" will become the snake oil of a new set of startups.... just like some other completely diluted and useless terms *cough* cloud *cough*....

I have an active interest in what was called descriptive statistics in the '60s and '70s; AI in the '80s; was reduced to the low point of Data Warehousing, Business Intelligence[^1] and Crystal Reports in the '90s; and actually had its first win starting in the '00s when a company we all know and love/hate set out to not be evil and organize the world's information into easily advertisable knowledge nuggets. And my takeaways from attempting to use the tactical nukes of the knowledge world are as follows:

  1. This shit is hard to get right.
  2. If you think you got it right, see rule \#1

Models often have hidden flaws that only get exposed in the real world. Randall's 12th law of Edge Cases: nothing generates more edge cases than the real world[^2]. Your model can be fine with test data, but break on production data. The sad part is that you won't know that it's broken until your auto-suggestion algorithm starts recommending home euthanasia kits to people searching for elder care books.

[^1]: Ever noticed that almost anything with "Intelligent" as part of its name or description merits close scrutiny? e.g.: Central Intelligence Agency, Intelligent Design, Business Intelligence.

[^2]: I number my laws like old BASIC code just in case I need a law with higher numerical precedence[^3]....

[^3]: and because I make them up on the spot so inconsistent numbering makes it seem like there's some method to this madness.

This post isn't to vilify every company that mentions data analysis as being core to their product. On the contrary, when companies get it right - data analysis becomes a secret sauce that is difficult to compete against. Just ask Bing, Blockbuster and Borders, who all had their respective fates sealed by Google, Netflix, and Amazon. This post is an attempt to throw the wet blanket of reality onto the bonfire of investment that seems to be throwing perfectly good VC cash down the drain in the hopes that analyzing the data from your last game-mechanic-social-coupon-buying will finally have them making money instead of spending it. With so many companies hoping that a move into the "big data" space will save them, more than a few VC firms are going to start resembling OTB parlors where the patrons habitually double down on the lame horse because "he's due".

Now that I have you whipped into a frenzy and ready to storm the Bastille of "big data" with your pitchfork in hand, I must offer one small salve to the wound I've, if not created, at least opened wide and poured alcohol in. Your reward for sifting through my vitriol is three places to look before you approach the "big data" event horizon. (This is the constructive part where I attempt to redeem this 10k character bitch fest.)

Three places to look before you hit big data:

Classical descriptive statistics.

Every time you map reduce without drawing a box plot, God murders a marmoset. If you don't know what a box plot is, please run over to Wikipedia and check out their article on box plots. For most data sets, starting with box plots and histograms does no harm and provides valuable insight on how to proceed. Far too often, you have one tiny nail to drive and you reach for a gigantic sledgehammer.
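To show how small the hammer can be, here is a minimal sketch of the numbers a box plot draws, in plain Python with nothing but the standard library. The function names and the `order_values` data are made up for illustration, and the 1.5 x IQR whisker rule is the common textbook convention, not the only one.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the five numbers a box plot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def outliers(values):
    """Flag points more than 1.5 * IQR outside the box (the usual whisker rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

# Hypothetical order values in dollars; one of them should look suspicious.
order_values = [12, 15, 14, 13, 16, 15, 14, 250, 13, 15]

print(five_number_summary(order_values))
print(outliers(order_values))
```

Ten lines of arithmetic will tell you that one of those orders deserves a phone call, and no cluster was harmed in the process.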

Simulation:

The forgotten tool of the computer age. Computers are awesome at simulating things. If you don't believe me, go watch the Battlefield 3 Thunder Run Trailer. Doom, Quake, Planetfall, Sim City - these are all simulations of highly varying fidelity. But all of them model a system based on the "real world" to a greater or lesser degree. Instead of recording every possible piece of information - try recording a little bit of information and simulating the rest. This very technique lies at the heart of one of the more powerful techniques in the statistical toolbox: Monte Carlo Simulation. Think of it like the AK-47, when you absolutely positively have to kill every... - well - you get the point.
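"Record a little, simulate the rest" might look something like the sketch below. Everything in it is an assumption for illustration: the visitor count, the conversion rate, and the function names are invented, and a real model would want more thought than one coin flip per visitor. The idea is that two measured numbers plus a random number generator can answer "how likely are we to hit the sales target?" without hoarding a single exabyte.

```python
import random

def simulate_month(rng, visitors_per_day=100, conversion_rate=0.02, days=30):
    """Simulate one month of orders: each visitor buys with a fixed probability."""
    orders = 0
    for _ in range(days * visitors_per_day):
        if rng.random() < conversion_rate:
            orders += 1
    return orders

def chance_of_hitting_target(target_orders, trials=500, seed=42):
    """Monte Carlo estimate of P(monthly orders >= target_orders)."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    hits = sum(simulate_month(rng) >= target_orders for _ in range(trials))
    return hits / trials

# Hypothetical question: how often do we sell at least 70 weasels in a month?
print(chance_of_hitting_target(70))
```

Change the two assumed inputs, rerun, and you have a sensitivity analysis. That is usually a more honest answer than another year of data collection.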

Ignore it and it Might go Away/Be unimportant:

That's right - my favorite technique is to go focus on something else. Like your business model, or your golf swing. You might spend 2 years proving conclusively that those people who buy Maalox also buy Depends. Good on you, mate, because you could have arrived at that same place by surveying your checkout girls or just asking AARP. Some questions just aren't that important or intriguing. Let me revise that - MOST questions just aren't that important. Are you asking questions that are central to improving your product or service, or are you Yak Masturbating with CSV files?
