Posted by Vincent Granville on January 19, 2014 (C) BigDataNews

Denying that big data is a new paradigm (post year 2000) is like saying that the human population has been huge for a long time: if we can handle 10 million human beings as we did a few thousand years ago, we can handle 10 billion today the same way, even one trillion. It’s the same as saying that data flowing at 10 million rows per day can be processed and analyzed the same way as 10 billion or one trillion per day, which (billions per day) is common in transaction data (credit cards), mobile, web traffic, sensor data, retail data, health data, NSA, NASA, stock trading and many more.

Each time a credit card is swiped or processed online, an analytic algorithm is used to detect whether the transaction is fraudulent (and the answer must come in less than 3 seconds most of the time, with a low false negative rate). Each time you do a Google search, an analytic engine determines which search results to show you, and which ads to display. Each time someone posts something on Facebook, an analytic algorithm is run to determine whether it must be rejected (promotion, spam, porn etc.). Each Tweet posted is analyzed by analytic algorithms (designed by a number of different companies) to detect new viral trends (for journalists), disease spread, intelligence leaks or many other things. Each time you browse Amazon, the customized content delivered to you is analytically “calculated” to optimize Amazon’s revenue. Each time an email is sent, an analytic algorithm decides whether or not to put it in your spam box (that is intensive computation for Gmail). This is analytics at billions of rows per day. Evidently there is a gigantic amount of pre-computation and look-up tables being used to make this happen, but it is still “big data analytics”. The analytic engineer knows that his Ad matching algorithm must use the right metrics and the right look-up tables (that he should help design, if not automatically populate) to do the best possible computation given the finite memory resources and the speed at which the results must be delivered, typically measured in milliseconds. You just can’t separate the two processes: data flow, and analytics or data science. Indeed, the term “data science” conveys the idea that data and analytics are bedfellows.
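The split between pre-computation and millisecond-scale scoring mentioned above can be illustrated with a minimal sketch. It is not how any of the companies named actually work: the key structure (merchant category, card origin), the function names `build_lookup_table` and `score`, and the toy history are all hypothetical, chosen only to show an offline aggregation step feeding a constant-time online lookup.

```python
from collections import defaultdict

def build_lookup_table(history):
    """Offline step: aggregate labeled history into a fraud rate per key."""
    counts = defaultdict(lambda: [0, 0])  # key -> [fraud_count, total_count]
    for key, is_fraud in history:
        counts[key][0] += int(is_fraud)
        counts[key][1] += 1
    return {key: fraud / total for key, (fraud, total) in counts.items()}

def score(table, key, default_rate):
    """Online step: one dictionary lookup, with a fallback for unseen keys."""
    return table.get(key, default_rate)

# Hypothetical labeled history: ((merchant category, card origin), was it fraud?)
history = [
    (("electronics", "foreign"), True),
    (("electronics", "foreign"), False),
    (("grocery", "domestic"), False),
    (("grocery", "domestic"), False),
]

table = build_lookup_table(history)
global_rate = sum(is_fraud for _, is_fraud in history) / len(history)

print(score(table, ("electronics", "foreign"), global_rate))
print(score(table, ("travel", "foreign"), global_rate))  # unseen key, uses the fallback
```

The expensive part (scanning the history) runs ahead of time, while the per-transaction work is a single lookup. Deciding which keys and metrics go into the table is the analytics question, which is why the two roles are hard to separate.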

Also, big data practitioners working for start-ups usually wear multiple hats: data engineer, business analyst and machine learning / statistics / analytics engineer. The term “data scientist” suits them really well.

Finally, even with transactional data, if you want to split the data scientist role (in large companies) into silos – data versus analytics or business engineers – there is still an important issue: sampling. Analytics engineers can work on samples, but how small, how big or how good? Who determines what makes a good sample? Again, you need to be a data scientist to answer these questions, and the answer is: samples must be far bigger than you think (100 million rows in the contexts described above) and also much better selected. I have worked with an Ad network company managing truly big data. They sent me a sample with about 3 million clicks. But it did not have a rich enough set of affiliate data (that is, many affiliates with enough data for each of them), so I could not clearly identify instances of affiliate collusion (a scheme leveraging Botnets to share hijacked IP addresses among affiliates, for click fraud). I needed 50 million rows (clicks) to clearly identify this type of massive (but low frequency) fraud. This raises three questions:

  • If you are provided with a 3-million-row sample for your statistical analyses, it might be too small for you to notice some patterns. You will miss many important signals buried deep in the full data, and won’t know what you are missing.
  • If (in my case) using 50 million rows (rather than 3 million) helps me detect lots of new interesting, valuable stuff, what if my sample had 500 million rows instead? I might discover even more, who knows?
  • At some point, increasing the sample size even further brings diminishing returns. A one-billion-row sample might not provide much more value than a 100-million-row sample (except maybe if it is data sampled over a 12-month period rather than two weeks). Interestingly, in this case, obtaining advertiser data (with conversions) rather than Ad network data is a great alternative (combining both advertiser and Ad network data is even better), even if it means creating dummy (honeypot) advertiser accounts to monitor fraud. It then becomes an experimental design project, and a 100,000-row data set might be enough. It is the data scientist’s responsibility to think about and propose an implementation of dummy advertiser accounts to solve the problem, leveraging his/her statistical, big data, and domain expertise.
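The effect of sample size on spotting shared IP addresses can be made concrete with a small simulation. This is a minimal sketch, not the analysis actually performed on the Ad network data. It assumes that each hijacked IP produces exactly one click for each of two colluding affiliates, that clicks are sampled independently, and that a sampling rate of 0.06 (3 million out of 50 million) stands in for the smaller sample. All names (`aff_A`, `shared_ips`, `make_clicks`, `sample_clicks`) and all volumes are hypothetical.

```python
import random
from collections import defaultdict

def shared_ips(clicks):
    """Return the set of IPs that appear with two or more distinct affiliates."""
    affiliates_by_ip = defaultdict(set)
    for affiliate, ip in clicks:
        affiliates_by_ip[ip].add(affiliate)
    return {ip for ip, affiliates in affiliates_by_ip.items() if len(affiliates) >= 2}

def make_clicks(n_hijacked_ips, n_background_clicks, rng):
    """Build synthetic (affiliate, ip) clicks under the assumptions stated above."""
    clicks = []
    # Colluding traffic: each hijacked IP clicks once for each of two affiliates.
    for i in range(n_hijacked_ips):
        clicks.append(("aff_A", f"bot-{i}"))
        clicks.append(("aff_B", f"bot-{i}"))
    # Background traffic: each IP clicks once, for a randomly chosen affiliate.
    for i in range(n_background_clicks):
        clicks.append((f"aff_{rng.randrange(1000)}", f"ip-{i}"))
    return clicks

def sample_clicks(clicks, rate, rng):
    """Keep each click independently with probability `rate`."""
    return [click for click in clicks if rng.random() < rate]

rng = random.Random(0)
full = make_clicks(n_hijacked_ips=2000, n_background_clicks=100_000, rng=rng)
small = sample_clicks(full, rate=0.06, rng=rng)

found_full = shared_ips(full)
found_small = shared_ips(small)
print(len(found_full), len(found_small))
```

Under these assumptions a shared IP remains visible only if both of its clicks survive sampling, which happens with probability 0.06 × 0.06 = 0.0036. A signal that is obvious in the larger data set therefore shrinks to a handful of cases in the smaller one, and the analyst has no way of knowing what was lost.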

To see the entire article go to:
