
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


August 30, 2016

Teradata ANZ

The role of data in data storytelling

Storytelling is a natural human trait. We are hardwired to create and follow narratives; this is how we understand, memorise and recall complex concepts. Telling stories is so natural that a five-year-old can do it; in fact they often do.
Yet an alarmingly large number of comments and opinions describe at great length how people in technical professions are unable to explain, or tell the story of, their experiments and findings. Have we regressed so far that something as natural as storytelling has disappeared from our skillset? Not really.

While people’s communication abilities vary, when someone says “my technical staff is unable to explain what they do” it actually means “I don’t understand what they do and it’s their fault for not telling me properly”. The issue is context: most scientists and inventors are excellent communicators to an audience of peers. Non-technical people, on the other hand, expect a narrative that they can follow with their own knowledge.

Should a writer adapt to their audience, or should the audience adjust to the writer? The challenge is that there are a limited number of options for people with different backgrounds to communicate effectively:
– Non-technical people could learn the technical background. While this is unlikely to happen, understanding uncertainty, causality, or Bayesian probabilities should be a prerequisite for anyone aiming to take data-driven decisions.
– The writer could eschew the complexities and focus the story on… the story, picking only some data to play a supporting role while pushing the analytics into the background. This is the preferred communication method of business, politics, news, and snake oil merchants alike.

This latter approach has a problem: we humans are hardwired to understand and be compelled by stories and narratives, but we are pretty rubbish at understanding risk and uncertainty. As a consequence, the narrative is given priority over the data, and facts are not allowed to get in the way of a good story. This can be a recipe for disaster; see the illustration below.

Clement Fredembach - Storytelling 1

Above: The same data points support a number of different models and “stories” depending on methodology, intent, and audience. Because reality can differ significantly from those models, proper data-centric storytelling must convey not only the story but also underlying assumptions, hypotheses, and context.

“Big data” and analytics allow us to build more resilient and accurate models; they enable the identification of new features and behaviours. But models are, at best, an imperfect representation of reality that can support a number of dissonant narratives.

Being able to dissociate data from story and to understand the implicit assumptions made by models requires enough technical skill to make an informed decision[1]. Without these skills, decision makers are asking to be lied to, because constructing a compelling narrative is easy enough that my 5-year-old niece can do it, but that doesn’t mean its contents are accurate or indeed truthful.

If the goal is for complex analytics to deliver accurate insights and tangible, long-term value, then data (and data processing techniques) have to be at the centre of the conversation. Storytelling, like data visualisation, is an essential communication tool. But it is not a substitute for understanding the underlying data or science.



[1] This is similar to informed consent in medicine: procedures can be described, but hardly ever truly understood by non-medical practitioners.

The post The role of data in data storytelling appeared first on International Blog.


August 29, 2016

Revolution Analytics

Video series: Introduction to Microsoft R Server

Microsoft R Server extends the base R language and Microsoft R Open with big-data capabilities. Specifically, it adds the RevoScaleR package, which creates an out-of-memory "XDF" data structure (so you can process data larger than available RAM), and functions that let you perform computations on such data using parallel and distributed algorithms. (A limited version of the RevoScaleR package is also included in the free Microsoft R Client.)

If you'd like to get up to speed on the capabilities of Microsoft R Server, my colleague Matt Parker has created a 4-part video series that introduces Microsoft R Server and delves into data ingestion, data management, and predictive modeling. The videos are embedded below, and you can also download the videos for offline viewing from Channel 9.

Part 1: Introduction to Microsoft R Server


Part 2: Data Ingestion

This video covers the eXternal Data Frame (XDF) format, and functions for importing data from files and databases. 


Part 3: Data Management

This video covers key functions for the standard data management tasks: creating and modifying variables, sorting, subsetting, deduplication, and merging datasets.


Part 4: Predictive Modeling

This video focuses on functions for exploring and summarizing data, building predictive models, and making predictions. 


If you'd like to follow along with the videos on your own PC, you can download Microsoft R Client for free and use all of the functions described in the videos.

Channel 9: Microsoft R Server 4-part series


[INFOGRAPHIC] Creating a Data-Driven Customer Loyalty Strategy

Here's a great visual summary of what you need to start an optimised customer loyalty programme: the questions to ask before getting started and an overview of all the possible data sources to consider.

Curt Monash

Are analytic RDBMS and data warehouse appliances obsolete?

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that...


August 28, 2016

Simplified Analytics

The Good, The Bad & The Ugly of Internet of Things

The greatest advantage we have today is our ability to communicate with one another. The Internet of Things, also known as IoT, allows machines, computers, mobile or other smart devices to...


August 26, 2016

Revolution Analytics

Because it's Friday: The font of Stranger Things

If you haven't caught up on Netflix's Stranger Things yet, you're missing out. It's awesome: a perfect homage to 80's-era sci-fi and fantasy movies, with a great cast and stellar performances (particularly from Winona Ryder and newcomer Millie Bobby Brown). This spoiler-laden supercut shows scenes from Stranger Things alongside corresponding scenes from Close Encounters, The Goonies, Alien, and many other films of the era, and the inspirations are clear. If you liked those movies, you'll love Stranger Things.

One thing I love about the series is the opening titles, which are simply a close-up of the letters in the two-word title slowly moving to assemble the logo, Stranger Things. It's done in the font ITC Benguiat, which brings back memories of its use on the covers of Choose Your Own Adventure books and Stephen King novels. The title sequence looks like it was filmed, not animated, and in a way it was: filmed backlit transparencies were used as the reference for the animation process. The short film above (via Vox) is a lovely insight into the process behind the creation of the title sequence.

That's all from us for this week. Have a good weekend (you have enough time to catch up on Stranger Things!), and we'll be back here on Monday. See you then.

Forrester Blogs

On-Premise Hadoop Just Got Easier With These 8 Hadoop-Optimized Systems

Enterprises agree that speedy deployment of big data Hadoop platforms has been critical to their success, especially as use cases expand and proliferate. However, deploying Hadoop systems is often...


Revolution Analytics

Microsoft R Open 3.3.1 now available for Windows, Mac and Linux

Microsoft R Open 3.3.1, our enhanced distribution of open source R, is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to version 3.3.1, streamlines the installation process, and bundles some additional packages for parallel programming.

R version 3.3.1 fixes a few rarely-encountered bugs, for example when generating Gamma random numbers with zero or infinite rate parameters, and when matching text that differs only in its encoding. (See here for a complete list of fixes.) There are no user-visible changes in the language itself, which means that scripts and packages should work without changes from MRO 3.3.0.

For reproducibility, installing packages with MRO 3.3.1 will by default get you package versions as of July 1, 2016. Many packages have been updated or released since MRO 3.3.0, including packages for multivariate covariance generalized linear models, simulating data sets, and creating HTML tables. (See here for some highlights of new packages.) As always, if you want to use newer versions of CRAN packages (or access older versions, for reproducibility), just use the checkpoint function to access the CRAN Time Machine.

MRO 3.3.1 now pre-installs some additional packages, including the parallel-programming packages doParallel, foreach and iterators and (on Windows) the RODBC package for interacting with databases. (See here for a list of packages bundled with MRO.) MRO 3.3.1 is also now easier to install: it's now just a single download, with the option to install the Math Kernel Libraries (and speed up computations) during the install process. (Some benchmarks from an older version of MRO are shown below; results are about the same for MRO 3.3.1.)


We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. To download Microsoft R Open (it's free!), simply follow the link below.

MRAN: Download Microsoft R Open

Data Digest

How to strategically position the CDO organisation for success?

In the previous article, A case for the CDO, I discussed the approach for determining if your company needs a CDO and, if so, the best way to address alignment or positioning of the CDO organisation at a high level.

Getting the right set-up with a defined purpose and supportive positioning is critical for the success of the CDO and it warrants a detailed organisational design approach, just as any other function would require, to deliver the best intended outcomes.

‘Organizational Positioning’ refers to where the CDO org fits within the entire organisation. The decision should be based on the following four important considerations.

A. CDO Span of Control and Scope of Activities
B. Business function prominence in business value chain
C. C-level leader profile
D. Outreach effectiveness for success

Where and how the CDO organisation is created in any company should be a result of evaluation of the points above. Sometimes, changes may be needed as situations evolve. But if careful analysis is performed upfront, your CDO initiative would start strong with the early nurturing, and any subsequent changes would be an effect of general business/company evolution.

Even though CDOs are company executives with a ‘C’ in their title, they typically report to one of the first-line C-level executives. And I think that is the way it should be as, except in a few rare cases, the CDO role is not something that can effectively hold itself as a direct report to the CEO.

The CDO is an important, game-changing function for the company. However, there is a degree of detail and mid-level influencing and consensus building that needs to happen in the CDO role, which will not be possible if the role is far removed from the layers where the decisions for data and analytics are made. Therefore, an effective CDO should be able to directly influence at least 3 organisational levels, up and down, including their own.

A. CDO Span of Control and Scope of Activities

Depending on how the company is organised - a single entity, multiple entities across business lines rolling into a single corporation, a single company with global presence across continents/countries or multiple levels of aggregation creating more complex structures – CDO span of control and scope of activities will differ.

The higher the CDO org is in the company, the farther removed or limited it will be in execution for actual on-the-ground work for data engineering, governance, architecture, data quality and issue remediation.

In such cases, the span of control would be the actual responsibility for rolling out standardised policies, framework and procedures and maybe even support common tools. The scope of activities may be limited and the local CDO organisations would be executing much of the work.

There is no one approach that will work here, and the key is to do an informed analysis on company structure, interactions, degree of influence and autonomy of the corporate vs sub-entities while defining a firm-wide CDO set-up and roll-out.

Remember, the decisions should be based on what will work for effective management of data across the entire firm, including all sub-entities and it is ‘OK’ to differ in your approach between entities with regard to alignment, structure and coverage.

B. Business function prominence in business value chain: Who carries the weight around?

It is very rare that all functions are equally important or equally loaded in any company. I agree that every function is required. But depending on the industry, complexity, market and nature of the business, some functions within the company are more critical in the value chain than others. We will call such a function a driving function. Because of this prominence, a CDO aligned with a driving function can achieve more leverage from data and analytics initiatives than a CDO aligned somewhere else.

For example, in the financial and banking sector, the primary driving functions are Risk Management, Finance, Marketing (Customer 360) and Technology. Data is at the core of everything that fuels the activities and output for these areas. Clean, reliable, timely and relevant data is non-negotiable, as are high standards for managing this data with adequate controls and governance. As a result, we usually find that CDOs report to the CRO, CFO, CMO or CIO.

Whereas in manufacturing, higher education, software, insurance or pharmaceutical sectors, it may make good business sense to evaluate alignment of the CDO function into C’s of alternate verticals (in addition to finance or technology) such as Plant Management, Academic Affairs, Product, Strategy & Innovation, Engineering, R&D, Actuary etc.

CDO is an evolving field and, based on the experience I have across 4 of the 6 sectors mentioned above, there is always more than one possible driving function in each of these sectors. The final decision on where the CDO should be positioned will be based on the company’s unique story along with the talent and capabilities available across the C-level to support the function.

Note: CDO is a Business – Technology role and while the position/org stays within Business or Technology, the organisation must be staffed with people who can effectively bridge across the two sides. Also, if aligned under technology, the CDO should be a direct report to the CIO, as any other alignment would dilute the effectiveness of the organization or make it more of a technology-only initiative, which it isn’t.

C. C-level leader profile

In the previous article, A case for the CDO, I mentioned that the easiest and quickest way to ensure that the CDO initiative runs into the ground is to align it with a C-level leader who does not understand the function (even at a high level), does not value it, cannot see its strategic benefits, or has conflicting interests.

These may be extreme words, but the intention is to ensure that the reader understands how critical it is to avoid incorrect alignment when evaluating the C-level leader to whom the CDO will report. This is as important as choosing the right candidate for being the CDO.

As I discussed in point B of this article – Business function prominence – there may be several candidate functions that are equally dependent on data being managed as an asset, and wanting to drive the CDO. However, not all leaders of those functions are inherently qualified to own the CDO. So what qualities do you look for and what should you avoid?

What to Embrace:

  • Strategic thinker/true leader who fundamentally believes in the value of well-managed data and analytics.
  • Visionary leader who can clearly provide support to CDO, to help garner visibility and acceptance of the CDO function, by effectively helping sell its value among their peers and with the CEO/Board.
  • Leads by example, by willingly standing strong to correct the people, process, technology and control issues within their own function, by working with the CDO, before taking it to the rest of the company.
  • A well-rounded executive who understands the inner workings of the company, broader challenges and opportunities, the various business functions and their inter-relationships.

What to Avoid:

  • Weak leaders. Not everything is black & white and can be represented in numbers only. Most challenges faced by the CDO function are qualitative in nature, such as issues related to cultural/political aspects, job insecurities, conflicting priorities, resource contention and just the change involved in pushing something new that is company-wide. The C-level leader needs to be someone who has the humility, fortitude and capacity to understand the full picture when it is conveyed to them, and support the CDO in all ways to remove obstacles. 
  • Selfish leaders. CDO is a function that must operate across the company. As a result, whether it is under CIO, CFO or CRO, the function should operate across organisational boundaries. That implies the C-level leader is expected to think beyond their own function and is always willing to have difficult conversations, if needed, to understand different perspectives.
  • Unpopular leaders. Even at the top, there are a few leaders who are not very popular among their peers. If the function is lined up under such a leader, it will be extremely difficult for the CDO and the team to execute on the strategy as the cooperation from other areas can be expected to be sub-par.
  • A leader with a different agenda. Though rare, this could be an issue, not just for CDO, but for any function under such a leader. Sometimes a C-level leader has a completely different agenda, either to better their own situation or because they were brought in for a different purpose by the management. Whatever the reason, it is difficult to establish a new function from the ground up under such a person, who in most cases can be likened to a “bull in a china shop”. It is best to avoid positioning the CDO function under such leadership, as it will be a long time before any benefits are seen from the initiative.

D. Outreach effectiveness for success

If there is more than one area/leader that would be a good fit for the CDO organisation after evaluating the CDO positioning based on all the three points covered above, then that is a good problem. However, one function/leader combination would have a better outreach effectiveness and execution capabilities than the others would. Make that choice and there will be no regrets.

Disclaimer: All thoughts, ideas and opinions expressed in my articles are my own and do not reflect the views of my current / past employers or clients. No references or details will be provided in these articles that would expose any trade secrets or inner operations of any company whatsoever.

Other articles in this series:

0. The CDO Journey: A practitioner’s perspective      
1. A case for the CDO 

Prakash Bhaskaran is a Business-Technology leader with a passion for solving complex business problems and challenges, using a combination of business process, technology, data, analytics and organizational transformation. Through his varied experience across manufacturing / supply chain, higher education, software development, banking and financial sectors, he helps companies excel at managing data as an asset. Contact Prakash on LinkedIn; Twitter; Email

August 25, 2016

Silicon Valley Data Science

Learning from Imbalanced Classes

If you’re fresh from a machine learning course, chances are most of the datasets you used were fairly easy. Among other things, when you built classifiers, the example classes were balanced, meaning there were approximately the same number of examples of each class. Instructors usually employ cleaned up datasets so as to concentrate on teaching specific algorithms or techniques without getting distracted by other issues. Usually you’re shown examples like the figure below in two dimensions, with points representing examples and different colors (or shapes) of the points representing the class:

The goal of a classification algorithm is to attempt to learn a separator (classifier) that can distinguish the two. There are many ways of doing this, based on various mathematical, statistical, or geometric assumptions:


But when you start looking at real, uncleaned data, one of the first things you notice is that it’s a lot noisier and more imbalanced. Scatterplots of real data often look more like this:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

  1. About 2% of credit card accounts are defrauded per year[1]. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately 1% per year.
  4. The conversion rates of online ads have been estimated to lie between 10^-3 and 10^-6.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

When you encounter such problems, you’re bound to have difficulties solving them with standard algorithms. Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration[2]. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

This might seem like pathological behavior but it really isn’t. Indeed, if your goal is to maximize simple accuracy (or, equivalently, minimize error rate), this is a perfectly acceptable solution. But if we assume that the rare class examples are much more important to classify, then we have to be more careful and more sophisticated about attacking the problem.
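To make the point concrete, here is a minimal sketch in plain Python. The dataset is made up for illustration (1,000 examples with a 1% positive rate), and the "classifier" is simply a constant prediction of the majority class, not any particular learning algorithm.

```python
# Hypothetical dataset: 1,000 examples, 1% positive (1), 99% negative (0).
labels = [1] * 10 + [0] * 990

# A trivial classifier that always predicts the majority class.
predictions = [0] * len(labels)

# Simple accuracy: fraction of examples predicted correctly.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall on the minority class: fraction of true positives that were found.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
# accuracy = 0.99, minority recall = 0.00
```

The trivial classifier looks excellent by accuracy while finding none of the cases that matter.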

If you deal with such problems and want practical advice on how to address them, read on.

Note: The point of this blog post is to give insight and concrete advice on how to tackle such problems. However, this is not a coding tutorial that takes you line by line through code. I have Jupyter Notebooks (also linked at the end of the post) useful for experimenting with these ideas, but this blog post will explain some of the fundamental ideas and principles.

Handling imbalanced data

Learning from imbalanced data has been studied actively for about two decades in machine learning. It’s been the subject of many papers, workshops, special sessions, and dissertations (a recent survey has about 220 references). A vast number of techniques have been tried, with varying results and few clear answers. Data scientists facing this problem for the first time often ask “What should I do when my data is imbalanced?” This has no definite answer for the same reason that the general question “Which learning algorithm is best?” has no definite answer: it depends on the data.

That said, here is a rough outline of useful approaches. These are listed approximately in order of effort:

  • Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution and sometimes it works without need for modification.
  • Balance the training set in some way:
    • Oversample the minority class.
    • Undersample the majority class.
    • Synthesize new minority examples.
  • Throw away minority examples and switch to an anomaly detection framework.
  • At the algorithm level, or after it:
    • Adjust the class weight (misclassification costs).
    • Adjust the decision threshold.
    • Modify an existing algorithm to be more sensitive to rare classes.
  • Construct an entirely new algorithm to perform well on imbalanced data.

Digression: evaluation dos and don’ts

First, a quick detour. Before talking about how to train a classifier well with imbalanced data, we have to discuss how to evaluate one properly. This cannot be overemphasized. You can only make progress if you’re measuring the right thing.

  1. Don’t use accuracy (or error rate) to evaluate your classifier! There are two significant problems with it. First, accuracy applies a naive 0.50 threshold to decide between classes, and this is usually wrong when the classes are imbalanced. Second, classification accuracy is based on a simple count of the errors, and you should know more than this. You should know which classes are being confused and where (top end of scores, bottom end, throughout?). If you don’t understand these points, it might be helpful to read The Basics of Classifier Evaluation, Part 2. You should be visualizing classifier performance using a ROC curve, a precision-recall curve, a lift curve, or a profit (gain) curve.

    ROC curve


    Precision-recall curve

  2. Don’t get hard classifications (labels) from your classifier (via score[3] or predict). Instead, get probability estimates via proba or predict_proba.
  3. When you get probability estimates, don’t blindly use a 0.50 decision threshold to separate classes. Look at performance curves and decide for yourself what threshold to use (see next section for more on this). Many errors were made in early papers because researchers naively used 0.5 as a cut-off.
  4. No matter what you do for training, always test on the natural (stratified) distribution your classifier is going to operate upon. See sklearn.cross_validation.StratifiedKFold (sklearn.model_selection.StratifiedKFold in scikit-learn 0.18 and later).
  5. You can get by without probability estimates, but if you need them, use calibration (see sklearn.calibration.CalibratedClassifierCV).
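As a rough illustration of points 2 and 3, the sketch below sweeps a few decision thresholds over probability estimates and reports precision and recall at each. The scores and labels are invented for the example, and precision_recall_at is a hypothetical helper written here, not a library function.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up probability estimates for 10 examples; only 3 are positive.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

for threshold in (0.5, 0.4, 0.3):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.5  precision=1.00  recall=0.33
# threshold=0.4  precision=0.67  recall=0.67
# threshold=0.3  precision=0.60  recall=1.00
```

With these invented scores, the default 0.5 cut-off finds only one of the three positives; lowering it trades precision for recall, which is the decision the performance curves are meant to inform.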

The two-dimensional graphs in the first bullet above are always more informative than a single number, but if you need a single-number metric, one of these is preferable to accuracy:

  1. The Area Under the ROC curve (AUC) is a good general statistic. It is equal to the probability that a random positive example will be ranked above a random negative example.
  2. The F1 Score is the harmonic mean of precision and recall. It is commonly used in text processing when an aggregate measure is sought.
  3. Cohen’s Kappa is an evaluation statistic that takes into account how much agreement would be expected by chance.
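For readers who want to see what these three statistics compute, here is a small plain-Python sketch using the standard textbook definitions. The function names and the toy data are assumptions made for this example; in practice a library implementation would normally be used.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cohen_kappa(labels, predictions):
    """Agreement beyond chance; undefined if chance agreement is exactly 1."""
    n = len(labels)
    observed = sum(1 for y, p in zip(labels, predictions) if y == p) / n
    expected = sum(
        (labels.count(c) / n) * (predictions.count(c) / n)
        for c in set(labels) | set(predictions)
    )
    return (observed - expected) / (1 - expected)

# Made-up scores for 10 examples (3 positive), thresholded at 0.4 for F1 and kappa.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
predictions = [1 if s >= 0.4 else 0 for s in scores]

print(round(auc(scores, labels), 3))
print(round(f1_score(labels, predictions), 3))
print(round(cohen_kappa(labels, predictions), 3))
```

Note that AUC is computed from the scores directly, while F1 and kappa depend on the threshold chosen to turn scores into labels.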

Oversampling and undersampling

The easiest approaches require little change to the processing steps, and simply involve adjusting the example sets until they are balanced. Oversampling randomly replicates minority instances to increase their population. Undersampling randomly downsamples the majority class. Some data scientists (naively) think that oversampling is superior because it results in more data, whereas undersampling throws away data. But keep in mind that replicating data is not without consequence: since it results in duplicate data, it makes variables appear to have lower variance than they do. The positive consequence is that it multiplies the number of errors: if a classifier makes a false negative error on the original minority data set, and that data set is replicated five times, the classifier will make six errors on the new set. Conversely, undersampling can make the independent variables look like they have a higher variance than they do.

Because of all this, the machine learning literature shows mixed results with oversampling, undersampling, and using the natural distributions.
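Both adjustments are simple enough to sketch with the standard library alone. The helper names below are hypothetical, the data is a placeholder, and the sketch assumes the majority class is the larger of the two.

```python
import random

def oversample_minority(majority, minority, rng):
    """Randomly replicate minority examples until both classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def undersample_majority(majority, minority, rng):
    """Randomly keep only as many majority examples as there are minority examples."""
    return rng.sample(majority, len(minority)), minority

rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("neg", i) for i in range(50)]
minority = [("pos", i) for i in range(10)]

over_maj, over_min = oversample_minority(majority, minority, rng)
under_maj, under_min = undersample_majority(majority, minority, rng)
print(len(over_maj), len(over_min))    # 50 50
print(len(under_maj), len(under_min))  # 10 10
```

Oversampling adds no new information (every extra example is a copy), while undersampling discards 40 of the 50 majority examples here.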


Most machine learning packages can perform simple sampling adjustment. The R package unbalanced implements a number of sampling techniques specific to imbalanced datasets, and scikit-learn.cross_validation has basic sampling algorithms.

Bayesian argument of Wallace et al.

Possibly the best theoretical argument about, and practical advice for, class imbalance was put forth in the paper Class Imbalance, Redux, by Wallace, Small, Brodley and Trikalinos[4]. They argue for undersampling the majority class. Their argument is mathematical and thorough, but here I’ll only present an example they use to make their point.

They argue that two classes must be distinguishable in the tail of some distribution of some explanatory variable. Assume you have two classes with a single explanatory variable, x. Each class is generated by a Gaussian with a standard deviation of 1. The mean of class 1 is 1 and the mean of class 2 is 2. We shall arbitrarily call class 2 the majority class. They look like this:


Given an x value, what threshold would you use to determine which class it came from? It should be clear that the best separation line between the two is at their midpoint, x=1.5, shown as the vertical line: if a new example x falls under 1.5 it is probably Class 1, else it is Class 2. When learning from examples, we would hope that a discrimination cutoff at 1.5 is what we would get, and if the classes are evenly balanced this is approximately what we should get. The dots on the x axis show the samples generated from each distribution.

But we’ve said that Class 1 is the minority class, so assume that we have 10 samples from it and 50 samples from Class 2. It is likely we will learn a shifted separation line, like this:
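The size of such a shift can be illustrated with a small calculation. This is not the derivation from the paper, only a sketch under two stated assumptions: the learner minimizes plain error rate, and it treats the 10:50 sample ratio as the class priors. For two Gaussians with equal standard deviation, the error-minimizing cutoff is the point where the prior-weighted densities are equal; bayes_threshold is a hypothetical helper that evaluates the closed form.

```python
import math

def bayes_threshold(mean_1, mean_2, sigma, prior_1, prior_2):
    """x at which prior_1 * N(x; mean_1, sigma) equals prior_2 * N(x; mean_2, sigma)."""
    midpoint = (mean_1 + mean_2) / 2
    # Unequal priors move the cutoff away from the midpoint, toward the rarer class.
    shift = sigma ** 2 * math.log(prior_1 / prior_2) / (mean_2 - mean_1)
    return midpoint + shift

balanced = bayes_threshold(1.0, 2.0, 1.0, 0.5, 0.5)
imbalanced = bayes_threshold(1.0, 2.0, 1.0, 10 / 60, 50 / 60)
print(round(balanced, 3), round(imbalanced, 3))
```

Under these assumptions the balanced cutoff is 1.5, while the imbalanced one falls below the mean of Class 1, so most Class 1 examples would be labeled Class 2.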


We can do better by down-sampling the majority class to match the size of the minority class. The problem is that the separating lines we learn will have high variability (because the samples are smaller), as shown here (ten samples are shown, resulting in ten vertical lines):


So a final step is to use bagging to combine these classifiers. The entire process looks like this:


This technique has not been implemented in Scikit-learn, though a separate balanced bagging implementation is available that provides a BlaggingClassifier, which balances bootstrapped samples prior to aggregation.
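The idea can be sketched on the one-dimensional example above using only the standard library. This is an illustrative toy, not the BlaggingClassifier: the base learner (a cutoff halfway between the two class means) and all function names are assumptions made for the sketch.

```python
import random
from statistics import mean

def learn_threshold(minority_xs, majority_xs):
    """Toy 1-D learner: cutoff halfway between the two class means."""
    return (mean(minority_xs) + mean(majority_xs)) / 2

def balanced_bagging(minority_xs, majority_xs, n_models, rng):
    """Train n_models toy learners, each on a balanced bootstrap sample."""
    k = len(minority_xs)
    thresholds = []
    for _ in range(n_models):
        boot_min = [rng.choice(minority_xs) for _ in range(k)]  # bootstrap the minority
        boot_maj = [rng.choice(majority_xs) for _ in range(k)]  # same-size majority draw
        thresholds.append(learn_threshold(boot_min, boot_maj))
    return thresholds

def predict(x, thresholds):
    """Majority vote: class 1 if most cutoffs lie above x, else class 2."""
    votes_for_1 = sum(1 for t in thresholds if x < t)
    return 1 if votes_for_1 > len(thresholds) / 2 else 2

rng = random.Random(42)
class_1 = [rng.gauss(1.0, 1.0) for _ in range(10)]  # minority
class_2 = [rng.gauss(2.0, 1.0) for _ in range(50)]  # majority
thresholds = balanced_bagging(class_1, class_2, n_models=10, rng=rng)
print(sorted(round(t, 2) for t in thresholds))
print(predict(0.0, thresholds), predict(3.0, thresholds))
```

Each individual cutoff varies from sample to sample; the vote across all ten is what smooths out that variability.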

Neighbor-based approaches

Over- and undersampling select examples randomly to adjust their proportions. Other approaches examine the instance space carefully and decide what to do based on their neighborhoods.

For example, Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. In other words, they are pairs of opposing instances that are very close together.

Tomek links

Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. The diagram above shows a simple example of Tomek link removal. The R package unbalanced implements Tomek link removal, as well as a number of other sampling techniques specific to imbalanced datasets. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink).
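A brute-force sketch of Tomek link removal follows, assuming Euclidean distance and Python 3.8+ (for math.dist). The function names and the six sample points are made up for illustration; this is not the code of any of the packages mentioned.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, of opposite-class points that are mutual nearest neighbors."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

def remove_tomek_majority(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (3.0, 3.0), (3.2, 3.0), (5.0, 5.0)]
labels = ["maj", "maj", "maj", "maj", "min", "min"]

print(tomek_links(points, labels))  # [(3, 4)]
cleaned_points, cleaned_labels = remove_tomek_majority(points, labels, "maj")
print(cleaned_labels)  # ['maj', 'maj', 'maj', 'min', 'min']
```

Only the majority point at (3.0, 3.0), which sits right next to a minority point, is removed; the mutual-neighbor pair at indices 0 and 1 is ignored because both members belong to the same class.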

Synthesizing new examples: SMOTE and descendants

Another direction of research has involved not resampling of examples, but synthesis of new ones. The best known example of this approach is Chawla’s SMOTE (Synthetic Minority Oversampling TEchnique) system. The idea is to create new minority examples by interpolating between existing ones. The process is basically as follows. Assume we have a set of majority and minority examples, as before:

SMOTE approach

SMOTE was generally successful and led to many variants, extensions, and adaptations to different concept learning algorithms. SMOTE and variants are available in R in the unbalanced package and in Python in the UnbalancedDataset package.

It is important to note a substantial limitation of SMOTE. Because it operates by interpolating between rare examples, it can only generate examples within the body of available examples—never outside. Formally, SMOTE can only fill in the convex hull of existing minority examples, but not create new exterior regions of minority examples.
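The interpolation idea, and the limitation just described, can be seen in a short sketch. This is a simplified illustration rather than Chawla's reference implementation or any package's API; the function name, the parameter k, and the four sample points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the chosen point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority examples at the corners of a square.
minority_points = [(0, 0), (2, 0), (0, 2), (2, 2)]
new_points = smote(minority_points, n_new=5)
```

Because every synthetic point lies on a line segment between two existing minority points, none of them can fall outside the square spanned by the originals.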

Adjusting class weights

Many machine learning toolkits have ways to adjust the “importance” of classes. Scikit-learn, for example, has many classifiers that take an optional class_weight parameter that can be set higher than one. Here is an example, taken straight from the scikit-learn documentation, showing the effect of increasing the minority class’s weight by a factor of ten. The solid black line shows the separating border when using the default settings (both classes weighted equally), and the dashed line shows the border after the class_weight parameter for the minority (red) class is changed to ten.


As you can see, the minority class gains in importance (its errors are considered more costly than those of the other class) and the separating hyperplane is adjusted to reduce the loss.

It should be noted that adjusting class importance usually only has an effect on the cost of class errors (False Negatives, if the minority class is positive). It will adjust a separating surface to decrease these accordingly. Of course, if the classifier makes no errors on the training set then no adjustment may occur, so altering class weights may have no effect.
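The effect of a class weight on a decision boundary can be illustrated with a toy one-dimensional example. This is not how scikit-learn implements class_weight; it is a hand-rolled sketch in which the data values and function names are invented, and the "classifier" is simply the threshold with the lowest weighted error count.

```python
def weighted_error(threshold, minority, majority, minority_weight):
    """Weighted training errors when x < threshold is predicted as minority."""
    false_negatives = sum(1 for x in minority if x >= threshold)
    false_positives = sum(1 for x in majority if x < threshold)
    return minority_weight * false_negatives + false_positives


def best_threshold(minority, majority, minority_weight=1.0):
    """Try each data value (plus one above the maximum) as the cutoff."""
    candidates = sorted(set(minority + majority))
    candidates.append(candidates[-1] + 1.0)
    return min(candidates,
               key=lambda t: weighted_error(t, minority, majority, minority_weight))


# Hypothetical overlapping classes: the minority point 1.6 sits among majority points.
minority = [0.5, 1.0, 1.6]
majority = [1.2, 1.4, 1.8, 2.0, 2.2, 2.5]

equal_weight_cutoff = best_threshold(minority, majority, minority_weight=1.0)
heavy_weight_cutoff = best_threshold(minority, majority, minority_weight=10.0)
```

With equal weights the cheapest cutoff is 1.2, which sacrifices the minority point at 1.6. With a weight of ten, missing that point costs more than misclassifying two majority points, so the cutoff moves to 1.8.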

And beyond

This post has concentrated on relatively simple, accessible ways to learn classifiers from imbalanced data. Most of them involve adjusting data before or after applying standard learning algorithms. It’s worth briefly mentioning some other approaches.

New algorithms

Learning from imbalanced classes continues to be an ongoing area of research in machine learning with new algorithms introduced every year. Before concluding I’ll mention a few recent algorithmic advances that are promising.

In 2014 Goh and Rudin published a paper, Box Drawings for Learning with Imbalanced Data [5], which introduced two algorithms for learning from data with skewed examples. These algorithms attempt to construct “boxes” (actually axis-parallel hyper-rectangles) around clusters of minority class examples:


Their goal is to develop a concise, intelligible representation of the minority class. Their equations penalize the number of boxes and the penalties serve as a form of regularization.

They introduce two algorithms, one of which (Exact Boxes) uses mixed-integer programming to provide an exact but fairly expensive solution; the other (Fast Boxes) uses a faster clustering method to generate the initial boxes, which are subsequently refined. Experimental results show that both algorithms perform very well among a large set of test datasets.

Earlier I mentioned that one approach to solving the imbalance problem is to discard the minority examples and treat it as a single-class (or anomaly detection) problem. One recent anomaly detection technique has worked surprisingly well for just that purpose. Liu, Ting and Zhou introduced a technique called Isolation Forests [6] that attempts to identify anomalies in data by learning random forests and then measuring the average number of decision splits required to isolate each particular data point. The resulting number can be used to calculate each data point’s anomaly score, which can also be interpreted as the likelihood that the example belongs to the minority class. Indeed, the authors tested their system using highly imbalanced data and reported very good results. A follow-up paper by Bandaragoda, Ting, Albrecht, Liu and Wells [7] introduced Nearest Neighbor Ensembles as a similar idea that was able to overcome several shortcomings of Isolation Forests.
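The core intuition, that anomalies need fewer random splits to isolate, can be shown with a heavily simplified one-dimensional sketch. This is not the authors' algorithm: real Isolation Forests pick random features, subsample the data, and normalize path lengths into a score. The function names and data here are illustrative assumptions.

```python
import random


def path_length(x, data, rng):
    """Number of random splits needed to isolate x from the rest of data."""
    current = list(data)
    depth = 0
    while len(current) > 1 and min(current) < max(current):
        split = rng.uniform(min(current), max(current))
        # Keep only the points that fall on the same side of the split as x.
        current = [v for v in current if (v < split) == (x < split)]
        depth += 1
    return depth


def average_path_length(x, data, n_trees=200, seed=0):
    """Average the isolation depth of x over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: ten clustered values and one far-away value.
values = [float(v) for v in range(10)] + [100.0]
outlier_depth = average_path_length(100.0, values)
inlier_depth = average_path_length(5.0, values)
```

The far-away value is usually cut off by the very first split, while a value in the middle of the cluster always needs at least two, so a short average path is a sign of an anomaly.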

Buying or creating more data

As a final note, this blog post has focused on situations of imbalanced classes under the tacit assumption that you’ve been given imbalanced data and you just have to tackle the imbalance. In some cases, as in a Kaggle competition, you’re given a fixed set of data and you can’t ask for more.

But you may face a related, harder problem: you simply don’t have enough examples of the rare class. None of the techniques above are likely to work. What do you do?

In some real world domains you may be able to buy or construct examples of the rare class. This is an area of ongoing research in machine learning. If rare data simply needs to be labeled reliably by people, a common approach is to crowdsource it via a service like Mechanical Turk. Reliability of human labels may be an issue, but work has been done in machine learning to combine human labels to optimize reliability. Finally, Claudia Perlich in her Strata talk All The Data and Still Not Enough gives examples of how problems with rare or non-existent data can be finessed by using surrogate variables or problems, essentially using proxies and latent variables to make seemingly impossible problems possible. Related to this is the strategy of using transfer learning to learn one problem and transfer the results to another problem with rare examples, as described here.

Comments or questions?

Here, I have attempted to distill most of my practical knowledge into a single post. I know it was a lot, and I would value your feedback. Did I miss anything important? Any comments or questions on this blog post are welcome.

Resources and further reading

  1. Several Jupyter notebooks are available illustrating aspects of imbalanced learning.
    • A notebook illustrating sampled Gaussians, above, is at Gaussians.ipynb.
    • A simple implementation of Wallace’s method is available. It is a simple fork of the existing bagging implementation of sklearn, specifically ./sklearn/ensemble/
    • A notebook using this method is available at ImbalancedClasses.ipynb. It loads up several domains and compares blagging with other methods under different distributions.
  2. Source code for Box Drawings in MATLAB is available from:
  3. Source code for Isolation Forests in R is available at:

Thanks to Chloe Mawer for her Jupyter Notebook design work.

1. Natalie Hockham makes this point in her talk Machine learning with imbalanced data sets, which focuses on imbalance in the context of credit card fraud detection.
2. By definition there are fewer instances of the rare class, but the problem comes about because the cost of missing them (a false negative) is much higher.
3. The details in courier are specific to Python’s Scikit-learn.
4. “Class Imbalance, Redux”. Wallace, Small, Brodley and Trikalinos. IEEE Conf on Data Mining. 2011.
5. “Box Drawings for Learning with Imbalanced Data.” Siong Thye Goh and Cynthia Rudin. KDD-2014, August 24–27, 2014, New York, NY, USA.
6. “Isolation-Based Anomaly Detection”. Liu, Ting and Zhou. ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 1. 2012.
7. “Efficient Anomaly Detection by Isolation Using Nearest Neighbour Ensemble.” Bandaragoda, Ting, Albrecht, Liu and Wells. ICDM-2014

The post Learning from Imbalanced Classes appeared first on Silicon Valley Data Science.

VLDB Solutions

Using Joins Across Multiple Tables


When dealing with large data sets, it’s important to ensure that the data can be accessed correctly. Failure to address this issue early in database development can lead to problems when later attempting to extract information from the data. This was highlighted recently during the count for London’s Mayoral election on 5 May, when staff at the Electoral Commission had to ‘manually query a bug-stricken database’, which delayed the result by ‘several hours’. Problems like these do not help the case for computerising such tasks, and they reflect poorly on data companies in particular. Electronic voting machines are not currently used by UK voters; however, counting software was used for the London Mayoral and Assembly elections. In the article, electoral law expert Prof. Bob Watt of Buckingham University expressed concerns about such tasks being undertaken digitally; he said: ‘The bigger issue is of course just how stable these machines are and that’s something that I have a great deal of worry about and have had for some time’. As you can see, it’s important that companies dealing with data get it right. The BBC article on this story doesn’t go into any specific details on why the data was not accessible, so it’s difficult to offer any kind of assessment. However, when dealing with large sets of data involving tens of thousands of rows, it’s important that the right people have access to the information when they need it.


When storing data in a relational database management system (RDBMS), data is contained in multiple tables which are related to each other through one or more common factors. Using the example above, a table named ‘voter’ may contain information relating to the voter, such as id, name and email_address. This may then be linked to another table containing contact information for the voter, such as their address, telephone number and other related information. We may then have a table relating to the candidates of the election, with the information about them that The Electoral Commission needs to track the vote efficiently. In order to access the information across two tables, we would use a join.


A join is used to examine information, not just across two tables, but across multiple tables (depending on system capabilities) and is an integral part of querying data. Each table in a database should contain a primary key which ensures that each row is unique and prevents duplication of redundant data. In addition, a table can contain a foreign key which references another table. For example, by connecting a table containing voter information using voter_id and candidate_id, we could find out which individuals voted for a specific candidate. In this instance, the candidate_id would be the foreign key in our voters table, as it would reference a different table containing information on the candidate. To perform a join in SQL, we need to create a statement which references which information we require, followed by the tables we want to join and how we want to join them together to provide the results we need.
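As a rough illustration of primary and foreign keys, the sketch below builds the two tables with Python's built-in sqlite3 module. The column called name and all of the row values are hypothetical, and SQLite is only a stand-in for whichever RDBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE candidate (
    candidate_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE voter (
    voter_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    candidate_id INTEGER REFERENCES candidate(candidate_id)
);
""")
conn.executemany("INSERT INTO candidate VALUES (?, ?)",
                 [(1, "Candidate A"), (2, "Candidate B")])
conn.executemany("INSERT INTO voter VALUES (?, ?, ?)",
                 [(1, "Ann", 1), (2, "Bob", 2), (3, "Liam", None)])


def try_insert(params):
    """Attempt to add a voter row and report whether the keys allowed it."""
    try:
        conn.execute("INSERT INTO voter VALUES (?, ?, ?)", params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate_key = try_insert((1, "Duplicate", 1))   # voter_id 1 already exists
unknown_candidate = try_insert((4, "Eve", 99))    # no candidate with id 99
```

The primary key stops a second row with voter_id 1, and the foreign key stops a vote for a candidate that does not exist; a NULL candidate_id is still allowed for a voter who chose nobody.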


To fully understand how joins work, we need to create two tables and some data within them. But first, we will briefly go over the different types of joins. The default join is an inner join; this statement is used to return results where the data in each table matches. The next type of join we will look at is a left join. This statement is used to return all the data from the left table and only the matching data in the right table. The next type is a right join; this is the opposite of a left join. This statement is used to return all of the data from the right table and only the matching data in the left table. The final join is a full outer join; this statement is used to return all the data in both tables. Let’s have a look at each join in more detail. Firstly, we will examine the data contained in our tables.

Table examples:


Join Table VLDB Solutions


Join Table VLDB


Inner Join

The first join we spoke of was the inner join. The inner join is used to return matching data in each table. Using the statement below on the tables we have created, we can display all of the people that voted for a candidate.


SELECT voter.voter_id
FROM voter
INNER JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id


This table shows the results from the above statement:

Join Table VLDB

The results from the inner join show us that only five rows from the voter table have matches in the candidate table.


The above Venn diagram shows us how the tables are connected when we join two tables on an inner join. In this instance we only get the data which matches in both tables.


Left Join

The next join statement is the left join. This will provide us with all the data from the left table and only the connecting data in the right table.


SELECT voter.voter_id
FROM voter
LEFT JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id



Results table from the above statement:

Join Table VLDB

The left join displays all of the data in the left table, which in this statement is voter, and displays data of the connected table. If there is no data available for a row, then null is added to that row; this is what has happened in row 6 as Liam didn’t choose a candidate.


This Venn diagram shows us how a left join is used. All of the data contained in voter is returned, as is the matching data in candidate.


Right Join


As you may have guessed, the next statement we will look at is the right join. This statement will result in the right table providing us with all the data it holds and then only displaying the data that is connected to it from the left table.


SELECT voter.voter_id
FROM voter
RIGHT JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY candidate.candidate_id



Results from the above statement:

Join Table VLDB

The SQL statement has been tweaked slightly. The right join statement has the order by changed to candidate.candidate_id; this is to make the results more readable. The right join statement has displayed all of the data available in the right table, which was candidate, and now only shows the connecting data in the left table (voter). Again, if there isn’t any data in the left table that connects to the right table, then null is added to that row.



This right join Venn diagram shows us the opposite of a left join, and displays all of the data in candidate and the connecting data in the voter table.



Full Outer Join

The last statement that will be explained now is a full outer join. This statement brings in all the data from both tables and connects the results together where it can.

SELECT voter.voter_id
FROM voter
FULL OUTER JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id



Results from the above statement:

Join Table VLDB

With a full outer join, the results are from both tables, and results will be matched together. Also, if there is no matching data, the field will again contain a null value.


And this is the last Venn diagram. This shows us a full outer join and displays all the data in both tables; it attempts to match the data together where it can.


As you can see from the examples above, we have managed to join the data from two individual tables and link them together in a variety of different ways to provide the results we required. The ability to join tables is a fundamental aspect of SQL and a common feature of the role of a database developer.
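For readers who want to experiment, here is a minimal sketch of the four joins using Python's built-in sqlite3 module. The table contents are hypothetical stand-ins for the tables pictured above (six voters, one of whom, Liam, chose no candidate, plus an assumed third candidate with no votes). Because older SQLite versions do not support RIGHT JOIN or FULL OUTER JOIN, the sketch emulates them with a LEFT JOIN on swapped tables and a UNION; a database that supports them directly can use the statements shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidate (candidate_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE voter (voter_id INTEGER PRIMARY KEY, name TEXT, candidate_id INTEGER);
INSERT INTO candidate VALUES (1, 'Candidate A'), (2, 'Candidate B'), (3, 'Candidate C');
INSERT INTO voter VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cat', 1),
                         (4, 'Dan', 2), (5, 'Eve', 1), (6, 'Liam', NULL);
""")

COLUMNS = "voter.voter_id, voter.name, candidate.name"
ON = "voter.candidate_id = candidate.candidate_id"

inner_sql = f"SELECT {COLUMNS} FROM voter INNER JOIN candidate ON {ON}"
left_sql = f"SELECT {COLUMNS} FROM voter LEFT JOIN candidate ON {ON}"
# Right join emulated by swapping the tables in a left join.
right_sql = f"SELECT {COLUMNS} FROM candidate LEFT JOIN voter ON {ON}"
# Full outer join emulated as the union of both left joins.
full_sql = f"{left_sql} UNION {right_sql}"

inner_rows = conn.execute(inner_sql + " ORDER BY voter.voter_id").fetchall()
left_rows = conn.execute(left_sql + " ORDER BY voter.voter_id").fetchall()
right_rows = conn.execute(right_sql + " ORDER BY candidate.candidate_id").fetchall()
full_rows = conn.execute(full_sql).fetchall()
```

With this data the inner join returns five rows, the left join adds Liam with a null candidate, the right join adds the candidate nobody voted for, and the full outer join contains both of those extra rows.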





The post Using Joins Across Multiple Tables appeared first on VLDB Blog.


August 24, 2016

Revolution Analytics

Deep Learning Part 2: Transfer Learning and Fine-tuning Deep Convolutional Neural Networks

by Anusua Trivedi, Microsoft Data Scientist

This is a blog series in several parts — where I describe my experiences and go deep into the reasons behind my choices. In Part 1, I discussed the pros and cons of different symbolic frameworks, and my reasons for choosing Theano (with Lasagne) as my platform of choice.

Part 2 of this blog series is based on my upcoming talk at The Data Science Conference, 2016. Here in Part 2, I describe Deep Convolutional Neural Networks (DCNNs) and how transfer learning and fine-tuning help improve the training process for domain-specific images.

Please feel free to email me if you have questions.


The eye disease Diabetic Retinopathy (DR) is a common cause of vision loss. Screening diabetic patients using fluorescein angiography images can potentially reduce the risk of blindness. Current trends in the research have demonstrated that DCNNs are very effective in automatically analyzing large collections of images and identifying features that can categorize images with minimum error. DCNNs are rarely trained from scratch, as it is relatively uncommon to have a domain-specific dataset of sufficient size. Since modern DCNNs take 2-3 weeks to train across GPUs, the Berkeley Vision and Learning Center (BVLC) has released some final DCNN checkpoints. In this blog, we use such a pre-trained network: GoogLeNet. This GoogLeNet network is pre-trained on a large collection of natural ImageNet images. We transfer the learned ImageNet weights as initial weights for the network, and fine-tune this pre-trained generic network to recognize fluorescein angiography images of eyes and improve DR prediction.

Using explicit feature extraction to predict Diabetic Retinopathy

Much work has been done in developing algorithms and morphological image processing techniques that explicitly extract features prevalent in patients with DR. The generic workflow used in a standard image classification technique is as follows:

  • Image preprocessing techniques for noise removal and contrast enhancement
  • Feature extraction technique
  • Classification
  • Prediction

Faust et al. provide a very comprehensive analysis of models that use explicit feature extraction for DR screening. Vujosevic et al. build a binary classifier on a dataset of 55 patients by explicitly forming single lesion features. These authors use morphological image processing techniques to extract blood vessel, and hemorrhage features and then train an SVM on a data set of 331 images. These authors report accuracy of 90% and sensitivity of 90% on binary classification task with a dataset of 140 images.

However, all these processes are very time and effort consuming. Further improvements in prediction accuracy require large quantities of labeled data. Image processing and feature extraction of image datasets is very complex and time-consuming. Thus, we choose to automate the image processing and feature extraction step by using DCNNs.

Deep convolutional neural network (DCNN)

Image data requires subject-matter expertise to extract key features. DCNNs extract features automatically from domain-specific images, without any feature engineering techniques. This process makes DCNNs suitable for image analysis:

  • DCNNs train networks with many layers
  • Multiple layers work to build an improved feature space
  • Initial layers learn 1st order features (e.g. color, edges etc.)
  • Later layers learn higher order features (specific to input dataset)
  • Lastly, final layer features are fed into classification layer(s)


Figure 1
C layers are convolutions, S layers are pool/sample



Convolution: Convolution layers consist of a rectangular grid of neurons. The weights for this are the same for each neuron in the convolution layer. The convolution layer weights specify the convolution filter.





Pooling: The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output from it.
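The two layer types can be sketched in a few lines of plain Python. This is an illustration only, not the code used in this post: the function names, the 5x5 image and the 2x2 vertical-edge filter are made up. As in most deep learning libraries, the "convolution" below slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    """Slide one shared filter over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size block with its maximum."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_filter = [[-1, 1],
               [-1, 1]]

features = convolve2d(image, edge_filter)   # 4x4 feature map
pooled = max_pool(features)                 # 2x2 map after pooling
```

The same four filter weights are reused at every position, which is the weight sharing described above, and pooling keeps the strong edge response while shrinking the 4x4 map to 2x2.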





In this post, we are using GoogLeNet DCNN, which was developed at Google. GoogLeNet won the ImageNet challenge in 2014, setting the record for the best contemporaneous results. Motivations for this model were a simultaneously deeper as well as computationally inexpensive architecture.





Transfer Learning and Fine-tuning DCNNs

In practice, we don’t usually train an entire DCNN from scratch with random initialization. This is because it is relatively rare to have a dataset of sufficient size that is required for the depth of network required. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or a fixed feature extractor for the task of interest.

Fine-Tuning: Transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset, and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios:

  1. New dataset is smaller in size and similar in content compared to original dataset: If the data is small, it is not a good idea to fine-tune the DCNN due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the DCNN to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN-features.

  2. New dataset is relatively large in size and similar in content compared to the original dataset: Since we have more data, we can have more confidence that we would not over fit if we were to try to fine-tune through the full network.

  3. New dataset is smaller in size but very different in content compared to the original dataset: Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train a classifier from activations somewhere earlier in the network.

  4. New dataset is relatively large in size and very different in content compared to the original dataset: Since the dataset is very large, we may expect that we can afford to train a DCNN from scratch. However, in practice it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
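The four scenarios above amount to a small decision table, which can be written down as a function. The function name, its boolean arguments and the wording of the returned strings are illustrative assumptions; in practice "large" and "similar" are judgment calls rather than yes/no values.

```python
def transfer_strategy(dataset_is_large, similar_to_original):
    """Map the two factors above onto the four fine-tuning scenarios."""
    if not dataset_is_large and similar_to_original:
        # Scenario 1: small and similar.
        return "train a linear classifier on the top-layer features"
    if dataset_is_large and similar_to_original:
        # Scenario 2: large and similar.
        return "fine-tune through the full network"
    if not dataset_is_large:
        # Scenario 3: small and very different.
        return "train a linear classifier on activations from an earlier layer"
    # Scenario 4: large and very different.
    return "initialize with pre-trained weights and fine-tune the entire network"


scenario_4 = transfer_strategy(dataset_is_large=True, similar_to_original=False)
```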

Fine-tuning DCNNs: For this DR prediction problem, we fall under scenario 4. We fine-tune the weights of the pre-trained DCNN by continuing the backpropagation. It is possible to fine-tune all the layers of the DCNN, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier layers of a DCNN contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, while later layers become progressively more specific to the details of the classes contained in the original dataset.

Transfer learning constraints: As we use a pre-trained network, we are slightly constrained in terms of the model architecture. For example, we can’t arbitrarily take out convolutional layers from the pre-trained network. However, due to parameter sharing, we can easily run a pre-trained network on images of different spatial size. This is clearly evident in the case of Convolutional and Pool layers because their forward function is independent of the input volume spatial size. In case of Fully Connected (FC) layers, this still holds true because FC layers can be converted to a Convolutional Layer.

Learning rates: We use a smaller learning rate for DCNN weights that are being fine-tuned under the assumption that the pre-trained DCNN weights are relatively good. We don’t wish to distort them too quickly or too much, so we keep both our learning rate and learning rate decay really small.

Data Augmentation: One of the drawbacks of non-regularized neural networks is that they are extremely flexible: they learn both features and noise equally well, increasing the potential for overfitting. In our model, we apply L2 regularization to avoid overfitting. But even after that, we observed a large gap in model performance on the training and validation DR images, indicating that the fine tuning process is overfitting to the training set. To combat this overfitting, we leverage data augmentation for the DR image dataset.

There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. As the color information in these images is very important, we only rotate the images at different angles: 0, 90, 180, and 270 degrees.
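A rotation-only augmentation of this kind can be sketched without any imaging library by treating an image as a 2-D list of pixel values. The function names are hypothetical and the 2x2 "image" is only there to make the result easy to check by eye; rotating by multiples of 90 degrees leaves every pixel value, and therefore the color information, untouched.

```python
def rotate90(image):
    """Rotate a 2-D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees."""
    result = [image]
    for _ in range(3):
        result.append(rotate90(result[-1]))
    return result


augmented = rotations([[1, 2],
                       [3, 4]])
```

Each training image yields four samples this way, quadrupling the size of the training set.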

Figure 5
Replacing the input layer of the pre-trained GoogLeNet network with DR images. We fine-tune all layers except for the top 2 pre-trained layers, which contain more generic data-independent weights.

Fine-tuning GoogLeNet: The GoogLeNet network we use here for DR screening was initially trained on ImageNet. The ImageNet dataset contains about 1 million natural images and 1000 labels/categories. In contrast, our labeled DR dataset has only about 30,000 domain-specific images and 4 labels/categories. Thus, the DR dataset is insufficient to train a network as complex as GoogLeNet, and so we use weights from the ImageNet-trained GoogLeNet network. We fine-tune all layers except for the top 2 pre-trained layers, which contain more generic data-independent weights. The original classification layer "loss3/classifier" outputs predictions for 1000 classes. We replace it with a new binary layer.

Figure 6
Fine-tuning GoogLeNet


Fine-tuning allows us to bring the power of state-of-the-art DCNN models to new domains where insufficient data and time/cost constraints might otherwise prevent their use. This approach achieves a significant improvement of average accuracy and improves the state-of-the-art of image-based medical classification.

In my Part 3 of this blog series (coming soon), I will explain re-usability of these trained DCNN models.

The Data Lab

Where to Study Data Science in Scotland?

Scotland Data Science course finder


What is it?

Our web application allows you to search for a specific degree by the modules and programming languages that are taught. It allows you to highlight the courses that you think look particularly interesting. Information about the entry requirements and course details such as duration and part time study are also given. Furthermore, there is a tab for scholarships for the MSc courses that provides information about the awards that are available per each degree and their eligibility criteria.

We outline the courses currently part of The Data Lab MSc, a fully funded challenge-based learning programme, now going into its second year. This aims to tackle the skills gap in data science and provide students with an education geared towards meeting the needs of industry so they learn the skills that organisations actually require. There are 90 places across 9 programmes in 7 Scottish universities available for 2016/7. To find out more about this course, visit our MSc page, download the MSc brochure, or contact us.


Data science tools

There is a multitude of tools used in data science. From SQL to Hadoop to R, each has its own advantages and disadvantages. Determining an industry standard is challenging. In a recent CrowdFlower study, where approximately 3500 data scientist and analyst job postings on LinkedIn were analysed, the most requested skill found in the job adverts was SQL, followed by Hadoop, Python, Java and R. With over 19 MSc courses offering a mixture of these tools, this is bound to stand a student in good stead for a career in data science.

Findings from the 2015 version of the annual O’Reilly Data Science Salary Survey echo these results. When data scientists were asked what languages they used most frequently, the top 5 were SQL, Excel, Python, R and MySQL. Interestingly, the use of Spark and Scala grew by 17% and 10% respectively in 1 year, with those who use these languages averaging a higher median salary compared to their peers. Degrees which teach these tools can be searched for individually or in conjunction on the web app.


The skills industry needs?

According to a recent CrowdFlower survey, 83% of respondents stated that there are not enough data scientists to go around.[4] With the growth of data-specific MSc courses in Scotland, it is clear that universities are trying to fill that gap. However, we must always question whether these courses teach students the skills that Scottish employers need.


The Data Science Course Finder

This web application was developed this summer by two of our very talented interns, Perry Gibson and Amy Ramsay. You can find the code for this application on GitHub.

If you have any thoughts, questions or suggestions about this, please contact us.



Teradata ANZ

Is A Picture Worth A Thousand Words? The Truth About Big Data And Visualisation

Data visualisation has always been a vital weapon in the arsenal of an effective analyst, enabling complex data sets to be represented efficiently and complex ideas to be communicated with clarity and brevity. And as data volumes and analytic complexity continue to increase in the era of big data and data science, visualisation has come to be regarded as an even more vital technique – with a vast and growing array of new visualisation technologies and products coming to market.

Whilst preparing for an upcoming presentation on the Art of Analytics recently, I had reason to re-visit Charles Minard’s visualisation depicting Napoleon’s disastrous Russian campaign of 1812. In case you aren’t familiar with this seminal work, it is shown below.

This visualisation has been described as “the best statistical graphic ever drawn”. And by no less an authority than Edward Tufte, author of “The Visual Display of Quantitative Information”, the standard reference on the subject for statisticians, analysts and graphic designers.

There are many reasons why Minard’s work is so revered. One reason is that he manages to represent six types of data – geography, time, temperature (more on this in a moment), the course and direction of the movement of the Grande Armée and the number of troops in the field – in only two dimensions.

A second is the clarity and economy that enables the visualisation to speak for itself with almost no additional annotation or elaboration. We can see clearly and at a glance that the Grande Armée set off from Poland with 422,000 men, but returned with only 10,000 – and this only after the “main force” was re-joined by 6,000 men who had feinted northwards, instead of joining the advance on Moscow.

And yet a third reason is that the visualisation was ground-breaking; though flow diagrams like these are named for Irish Engineer Matthew Sankey, he actually used this approach for the first time very nearly 30 years after the Minard visualisation was published. Today, Sankey diagrams are used to understand a wide variety of business phenomena where sequence is important. For example, we can use them to map how customers interact with websites so that we can learn the “golden path” most likely to lead to a high-value purchase – and equally to understand which customer journeys are likely to lead to the abandonment of purchases before checkout.

But even Minard’s model visualisation is arguably partial. Minard shows us the temperature that the Grande Armée endured during the winter retreat from Moscow – inviting us to conclude that this was a significant reason for the terrible losses incurred as the army fell back, as indeed it was.

However, the Russians themselves regarded the winter of 1812 / 1813 as unexceptional – and the conditions certainly did not stop the Cossack cavalry from harrying Napoleon’s retreating forces at every turn. Napoleon’s army was equipped only for a summer campaign – because Napoleon had believed that he could force the war to a successful conclusion before the winter began. As the explorer Sir Ranulph Fiennes has said, “There is no such thing as bad weather, only inappropriate clothing.”

Exceptional weather also affected the campaign’s advance, with a combination of torrential rain followed by extremely hot conditions killing many men from dysentery and heatstroke. But Minard either cannot find a way to represent this information, or chooses not to. In fact, he gives us few clues as to why the main body of Napoleon’s attacking force was reduced by a third during the first eight weeks of the invasion and before the major battle of the campaign – even though, numerically at least, this loss was greater than that suffered during the retreat the following winter.

Terrible casualties also arose from many other sources – with starvation as a result of the Russian scorched earth policy and inadequate supplies playing key roles. The state of the Lithuanian roads is regarded by historians as a key factor in this latter issue, impassable as they were to Napoleon’s heavy wagon trains both after the summer rains and during the winter. But again, Minard either cannot find a way to represent the critical issue (the tonnage of supplies reaching the front line) or its principal cause (the state of the roads) – or chooses not to.

Minard produced this work 50 years after the events it describes, at a time when many in France yearned for former Imperial glories and certainties. His purpose – at least if the author of his obituary is to be believed – seems to have been to highlight the waste of war and the futility of overweening Imperial ambition. It arguably would not have suited his narrative to articulate that Napoleon’s chances of success might have been greater had the Russia of 1812 been a more modern nation with a more modern transport infrastructure – or had Napoleon’s strategy made due allowance for the fact that it was not.

With the benefit of 20th century hindsight, today we might still conclude that the vastness of the Russian interior and the obduracy of Russian resistance would anyway have doomed a better planned and executed campaign; but that hindsight was not available in 1869, either to Minard – or to the contemporaries he sought to influence.

Did Minard’s politics affect his choice of which data to include? Or were the other data simply not available to him? Or beyond his ability to represent in a single figure? From our vantage point 150 years after the fact, it is difficult to answer these questions with certainty.

But when you are looking at a data visualisation, you certainly should attempt to understand the author’s agenda, preconceptions and bias. What is it that the author wants you to see in the data? Which data have been included? Which omitted? And why? Precisely because good data visualisations are so powerful, you should make sure that you can answer these questions before you make a decision based on a data visualisation. Because whilst a good data visualisation is worth a thousand words, it does not automatically follow that it tells the whole truth.

This post first appeared on Forbes TeradataVoice on 31/03/2016.

The post Is A Picture Worth A Thousand Words? The Truth About Big Data And Visualisation appeared first on International Blog.


The New Science behind Customer Loyalty

The power of Customer Experience and growing competition are driving companies to take a more scientific approach to building customer loyalty.


August 23, 2016

Big Data University

This Week in Data Science (August 23, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

  • Constant Contact: Using IBM BigInsights to Create Business Insight – Join this session on August 25th to learn how Constant Contact, a leader in email marketing, is using IBM BigInsights to create useful insights for their clients in a way that scales.
  • IBM i2 Summit – Join the IBM i2 Summit on August 30-31 to hear directly from experts who are using all forms of data, including “dark data,” to outthink threats.
  • Combining IBM SPSS Statistics and R for competitive advantage – This Data Science Central Webinar event on September 1st, will show you how SPSS Statistics can help you keep up with the influx of new data and make faster, better business decisions without coding.
  • Big Data and Health presented by IBM Canada – Join We Are Wearables Toronto and IBM Canada on September 16th for a look at how wearables and sensors are changing healthcare.
  • How Data Can Help in the Fight Against Sexual Assault – Join the Center for Data Innovation and Rise, a civil rights nonprofit, on October 6th in Washington D.C., for a panel discussion on how policymakers and law enforcement can develop data-driven policies and practices to help in the fight against sexual assault and improve the lives of survivors.

The post This Week in Data Science (August 23, 2016) appeared first on Big Data University.

Revolution Analytics

Edward Tufte Keynote Presenter at Data Science Summit, Sep 26-27

I'm excited to share that one of my data science heroes will be a presenter at the Microsoft Data Science Summit in Atlanta, September 26-27. Edward Tufte, the data visualization pioneer, will deliver a keynote address on the future of data analysis and how to draw more credible conclusions from data.

If you're not familiar with Tufte, a great place to start is to read his seminal book Visual Display of Quantitative Information. First published in 1983 (well before the advent of mainstream data visualization software), this is the book that introduced and/or popularized many familiar concepts in data visualization today, such as small multiples, sparklines, and the data-ink ratio. Check out this 2011 Washington Monthly profile for more background on Tufte's career and influence. Tufte's work also influenced R: you can easily recreate many of Tufte's graphics in the R graphics system, including this famous weather chart.


The program for the Data Science Summit looks fantastic, and will also include keynote presentations from Microsoft CEO Satya Nadella and Data Group CVP Joseph Sirosh. Also there's a fantastic crop of Microsoft data scientists (plus yours truly) giving a wealth of practical presentations on how to use Microsoft tools and open-source software for data science. Here's just a sample:

  • Jennifer Marsman will speak about building intelligent applications with the Cognitive Services APIs
  • Danielle Dean will describe deploying real-world predictive maintenance solutions based on sensor data
  • Brandon Rohrer will give a live presentation of his Data Science for Absolutely Everybody series
  • Frank Seide will introduce CNTK, Microsoft's open source deep learning toolkit
  • Maxim Likuyanov will share some best practices for interactive data analysis and scalable machine learning with Apache Spark
  • Rafal Lukawiecki will explain how to apply data science in a business context 
  • Debraj GuhaThakurta and Max Kaznady will demonstrate statistical modeling on huge data sets with Microsoft R Server and Spark
  • David Smith (that's me!) will give some examples of how data science at Microsoft (and R!) is being used to improve the lives of disabled people
  • ... and many many more!

Check out the agenda for the breakout sessions on the Data Science Summit page for more. I hope to see you there: it will be a great opportunity to meet with Microsoft's data science team and see some great talks as well. To register, follow the link below.

Microsoft Data Science Summit, September 26-27, Atlanta GA: Register Now

Teradata ANZ

Internet of Things – Lessons from an IoT prototype project

The Internet of Things, commonly known as IoT, is spreading at a much faster rate than I thought it would about a year ago. I have been curious to learn what IoT is all about and encountered a large community of contributors and followers in the two dominant open source forums, Arduino and RaspberryPi. Both are working hard to transform the lives of people, organisations and cities around the world with the power of digital.

Agile Analytics of Things with prototyping

I experimented with Arduino projects by prototyping a few sensors to understand implications for large scale data collection, data management and analytics. Here are some of the lessons that I learned.
Things want to speak! Are you listening?

There is an incredibly large number of organisations such as ThingSpeak that are providing open platforms in the cloud that leverage micro-controllers and sensors from the likes of Arduino and Raspberry Pi to collect and share sensor data and analytics from around the world, which is a good resource for experimenting and learning about IoT.

What is Internet of Things (IoT) anyway?

IoT is a major development that promises to extend the digitally connected world in myriad ways, connecting things and people to offer new ways of monitoring events and situations, and of learning and responding in real time using analytics.

I use the Arduino Mega 2560 micro-controller to write programs, called sketches, that monitor and collect sensor data and manage them over embedded communication networks such as WiFi, Ethernet, GSM SIM card, Bluetooth etc.

Sundara Raman - IoT 1

With a passive infrared motion sensor I am able to track foot traffic in strategic locations that is useful in marketing, retail and town planning. When combined with the photo-electric sensor data, I can determine how the overcast weather affects traffic movement in the same location. In addition, the RFID reader allows me to verify and track proximity of things that have embedded tags.

My favourite is the accelerometer which, when placed in my car with a GPS sensor, allows me to track the precise locations of bumps and potholes on the road, which can help town planners and transport authorities with traffic planning and commuter safety. These sensors also continuously measure acceleration and lane changes to determine driving behaviours, which insurance companies can use to reward or penalise policy holders on insurance premiums.
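As a rough illustration of the pothole idea, here is a minimal Python sketch that flags GPS positions where the vertical acceleration strays far from the 1 g you would expect on a smooth road. The `Sample` type, the `find_bumps` function, the 0.5 g threshold and the sample readings are all assumptions made up for this example; they are not taken from the prototype described above.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    """One combined GPS + accelerometer reading (hypothetical layout)."""
    latitude: float
    longitude: float
    vertical_g: float  # vertical acceleration in g; about 1.0 on a smooth road


def find_bumps(samples: List[Sample], threshold_g: float = 0.5) -> List[Tuple[float, float]]:
    """Return the (latitude, longitude) of every sample whose vertical
    acceleration deviates from 1 g by more than threshold_g."""
    return [
        (s.latitude, s.longitude)
        for s in samples
        if abs(s.vertical_g - 1.0) > threshold_g
    ]


# Invented readings, for illustration only.
drive = [
    Sample(-33.8688, 151.2093, 1.02),
    Sample(-33.8690, 151.2101, 1.85),  # sharp jolt upwards
    Sample(-33.8693, 151.2110, 0.97),
    Sample(-33.8697, 151.2118, 0.31),  # wheel drops into a hole
]

print(find_bumps(drive))
```

A real deployment would need to smooth the signal and calibrate the threshold per vehicle; a fixed cut-off is only the simplest possible starting point.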

IoT comes to life with actuator

Sensors are one thing but actuators make IoT actionable. Weather sensors can monitor rain water levels which can be analysed to predict trigger time for remote activation of irrigation pumps to meet the needs of crops.

Alphabet soup of IoT

As I experiment with IoT, I am learning that “Internet of Things” is a misnomer. It is not a single, unified network of connected devices but rather a set of related technologies and systems, including sensors, RFID chips and communication networks, that work in coordination. For instance, moisture sensors in remote agricultural fields and acoustic sensors in distant rainforests are likely to rely on solar power, low-maintenance sensors with long-life batteries, and a secure communication network with multi-kilometre reach to continually collect and transmit data.

In fact, industrial IoT is more likely to use whatever communication networks (e.g. Infrared, ZigBee, RFID, Bluetooth, LoRa, WiMax, NB-IoT, Coax, Fibre, 4G/LTE) are suitable for the application depending on the remote communication distance range and reach, bandwidth need, cost and power source / battery life of the sensors. “Communication of Things” or “Network of Things” may be more meaningful but people are using whatever acronyms suit their area of focus – “Analytics of Things” and “Internet of Everything” are some of the variations that are taking root.

Irrespective of the differences in acronyms used and communication technologies deployed, the internet protocol and web services are likely to dominate in the minds of IoT application development communities across various industry sectors. In fact, my Arduino prototype is a web server that communicates sensor readings from the micro-controller which can be viewed in a standard web browser and/or be integrated with Apache NiFi.

Business versus Engineering focus of IoT

Sundara Raman - IoT 2

While sensor metrics such as temperature, relative humidity, motion etc. are well understood in the business community, when it comes to industrial IoT where sensors are embedded in machines and equipment, most metrics will require scientific and engineering knowledge and industry expertise to interpret the meaning of the sensor data to develop actionable insights in optimising the performance of equipment and machines.

A brief look at the word-cloud below will provide a glimpse of the partial range of sensors and measures made possible by the open source IoT community. This list does not include a wide range of biometric sensors that are used in the healthcare industry which expands the range of possible sensors.

Sundara Raman - IoT 3

From sketch to scale with Big Data

Each type of sensor produces one type of metric, which is often not that helpful on its own in making decisions about possible remedial action. Data context and semantic consistency with data collected from other sensors operating in the same environment are required in order to develop an integrated view of the situation impacting the environment.

As sensors continually monitor their environment they generate data continuously, often with the same value repeated over extended periods of time until a change occurs in the environment. This results in huge volumes of highly redundant data that are too expensive to transport over the communication network. Elimination of redundancy requires local processing of the data closer to the point of data collection and therefore requires consideration of a distributed architecture (e.g. hub and spoke architecture) for efficient data management.

While open source IoT platforms such as Arduino and RaspberryPi allow prototype development of sketches for sensor data collection and remote actioning of the actuators, they would benefit greatly from integrating with technologies from the Big Data open source communities such as Apache Hadoop. As I started to integrate my Arduino prototype sensor data with Apache NiFi for central data transformation, processing and correlation, I learnt that the RaspberryPi community is well on its way to embedding Apache NiFi in its microcontroller board. This is a big step in enabling distributed and local processing by leveraging the two open source technologies.
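One simple form of the local processing described above is to forward a reading only when it differs from the last value sent. The Python sketch below illustrates that idea; the function name `report_on_change`, the `tolerance` parameter and the sample readings are assumptions for illustration, not part of the Arduino prototype.

```python
from typing import Iterable, Iterator, Optional, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, sensor value)


def report_on_change(readings: Iterable[Reading], tolerance: float = 0.0) -> Iterator[Reading]:
    """Yield only readings whose value differs from the last value sent
    by more than `tolerance`; the first reading is always sent."""
    last_sent: Optional[float] = None
    for timestamp, value in readings:
        if last_sent is None or abs(value - last_sent) > tolerance:
            last_sent = value
            yield timestamp, value


# Invented temperature readings: mostly repeats, with two changes.
raw = [(0, 21.5), (1, 21.5), (2, 21.5), (3, 21.6), (4, 21.6), (5, 23.0)]

exact = list(report_on_change(raw))                   # drop exact repeats only
coarse = list(report_on_change(raw, tolerance=0.5))   # also ignore small jitter

print(f"{len(raw)} raw readings -> {len(exact)} sent (exact), {len(coarse)} sent (tolerance 0.5)")
```

The trade-off is that the receiving side must treat a missing reading as "unchanged", so a periodic heartbeat is usually needed as well to tell a silent sensor from a dead one.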

Machine learning and predictive analytics

Sensor data collected from my Arduino prototype offers exciting opportunities for machine learning and predictive analytics with Aster Analytics in the Teradata Unified Data Architecture (UDA). The prediction scores and insights gained from Aster Analytics can be combined with the customer and product data in the Integrated Data Warehouse in the UDA, and can then be used to trigger remote activation of the actuators.

If you want to know more about ‘Analytics of Things’ then start here.

The post Internet of Things – Lessons from an IoT prototype project appeared first on International Blog.

Data Digest

What every Chief Customer Officer should be worried about

It starts with “s” and ends with “s” and, at one time or another, sends shivers up every Chief Customer Officer’s spine. Of course I’m talking about silos – organisational silos, channels silos, product silos, all wreaking havoc on the customer experience.

In their favour, silos promote internal accountability, focus and expertise. But in today’s business world, where customer experience has displaced product and price as the key competitive differentiator, companies need to re-evaluate the raison d’être of an inward-looking siloed approach.

Just think of the times the contact centre is not informed of a campaign by marketing; or the credit card company calls a business owner about a business card without knowing they’re speaking to one of their most loyal Platinum card members. There are a million scenarios of disjointed, dissatisfying service experiences because the left hand doesn’t know what the right hand’s doing.

While silos are difficult to break down completely, it’s still the Chief Customer Officer’s (CCO) role to build and communicate an overarching set of customer outcomes across the organisation. Secondly, the CCO is tasked with building a culture of collaboration to achieve the desired customer outcomes.  This will almost certainly involve cultural change management, new team-based KPIs and reward programs, and cross-functional working groups.

Office politics and individual egos are perhaps the biggest barrier to implementation. It’s imperative that the CCO have the necessary power and authority to overcome these, in order to successfully instigate reform.

Given the employee sensitivities which invariably arise from “silo shake-ups”, it’s important to show fairness and transparency. Employees are themselves all customers at the end of the day, and if shown in a logical way (without finger-pointing) how their actions or processes are causing customers to be unhappy, they are more likely to co-operate with proposed changes. Whether it’s through dynamic dashboards on every employee’s desktop or wallboards on every floor of the office, employees should be able to track results of the CCO’s efforts to make the customer experience easier and happier.

Be part of the premiere event for Customer Experience Executives - Chief Customer Officer Forum, Sydney happening on 28-29 November 2016. For more information, visit   

By Sharon Melamed:  

Sharon Melamed is a digital entrepreneur with 25 years’ experience in contact centres and customer experience.  In 2012, she launched Matchboard, a free website where companies can enter their needs and find “right-fit” vendors of solutions across the customer lifecycle. In 2014, she launched FindaConsultant, an online portal of business consultants.  She holds a double honours degree and University Medal from the University of Sydney, and speaks five languages. Contact Sharon on LinkedIn; Twitter; Email

August 22, 2016

Revolution Analytics

Five great charts in 5 lines of R code each

Sharon Machlis is a journalist with Computerworld, and to show other journalists how great R is for data visualization she shows them these five data visualizations, each of which can be created in 5 lines of R code or less.


I've reproduced Sharon's code and charts below. I did make a couple of tweaks to the code, though. I added a call to checkpoint("2016-08-22") which, if you've saved the code to a file, will install all the necessary packages for you. (I also verified that the code runs with package versions as of today's date, and if you're trying out this code at a later time it will continue to do so, thanks to checkpoint.) I also modified the data download code to make it work more easily on Windows. Here are the charts and code:






For more on the charts, read Sharon's Computerworld article by following the link below.

Computerworld: 5 data visualizations in 5 minutes: each in 5 lines or less of R


Revolution Analytics

Five problems (and one solution) with dual-axis time series plots

If you need to present two time series spanning the same period, but in wildly different scales, it's tempting to use a time series chart with two separate vertical axes, one for each series, like this one from the Reserve Bank of New Zealand:


Charts like this typically have one or more crossover points, and each crossing suggests to the viewer that one series is now "ahead" of the other. One problem is that crossover points in dual-axis time series charts are entirely arbitrary. Changing either the left-hand or right-hand scale (and replotting the data accordingly) will change where the crossover points appear. And if (as is often the case) the scales are automatically chosen to allow each series to use the full vertical space available, just changing the time range of the data plotted will also change the location of the crossover points.
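To see why the crossover is an artefact of the axis choice, here is a small Python sketch that converts two series to vertical plot positions and reports where they cross. The helper names (`to_plot_height`, `crossover_indices`) and the numbers are invented for illustration; only the right-hand axis range changes between the two runs.

```python
from typing import List, Sequence


def to_plot_height(values: Sequence[float], axis_min: float, axis_max: float) -> List[float]:
    """Map data values to a 0-1 vertical position for a given axis range."""
    span = axis_max - axis_min
    return [(v - axis_min) / span for v in values]


def crossover_indices(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Indices i where the sign of (a - b) flips between i-1 and i."""
    diffs = [x - y for x, y in zip(a, b)]
    return [i for i in range(1, len(diffs)) if diffs[i - 1] * diffs[i] < 0]


left = [10, 12, 14, 16, 18]        # series drawn against the left axis
right = [500, 480, 460, 440, 420]  # series drawn against the right axis

left_pos = to_plot_height(left, 10, 18)

# Same data, two different right-hand axis ranges.
tight_axis = crossover_indices(left_pos, to_plot_height(right, 400, 500))
zero_based_axis = crossover_indices(left_pos, to_plot_height(right, 0, 500))

print("crossover with right axis 400-500:", tight_axis)
print("crossover with right axis 0-500:  ", zero_based_axis)
```

Nothing about the data changes between the two calls, yet the apparent crossover moves by one time step.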

In an excellent blog post, statistician Peter Ellis points out five problems with dual-axis time series charts:

  1. The designer has to make choices about scales and this can have a big impact on the viewer
  2. In particular, “cross-over points” where one series cross another are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) will not appreciate this and think there is more significance in cross over than is actually the case
  3. They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
  4. Because of the issues above, in malicious hands they make it possible to deliberately mislead
  5. They often look cluttered and aesthetically unpleasing

A simple alternative is to rescale both time series, for example to define both series to have a nominal value at a specific time, say both start at 100 on January 1, 2016. This is a useful way to compare the growth in two series since the beginning of the year, and means that both can be represented using the same single scale. (If you're using the ggplot2 package in R to plot time series, you can use the stat_index function from Peter's ggseas package to scale time series in this way.) The problem, though, is that you lose some of the interpretability of the chart, having now lost the true scales for both time series.

All that being said, Peter suggests that there are times when a dual-axis chart can be appropriate, for example when the two axes are conceptually similar (as above, when both are linear monetary scales), and you use a consistent process to set the scales of the vertical axes. Other considerations include color-coding the axes for interpretability, and choosing colors that don't favor one series over the other. Implementing these best practices, Peter has created the dualplot() function for R, which chooses the axes according to a crossover point you specify. This is equivalent to rescaling the series to have the same value at that specified point, but keeps the real-value axes for interpretability. Here's the above chart, rendered with dualplot() with a crossover point at January 2014:
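The indexing idea is easy to sketch outside R too. The Python function below is an illustrative stand-in, not the ggseas stat_index API: `index_to_base` and the two made-up series are assumptions for this example. Each series is divided by its value at the base period and multiplied by 100, so both start at 100 and can share one axis.

```python
from typing import List, Sequence


def index_to_base(values: Sequence[float], base_position: int = 0, base_value: float = 100.0) -> List[float]:
    """Rescale a series so that the observation at base_position equals base_value."""
    base = values[base_position]
    if base == 0:
        raise ValueError("cannot index a series whose base observation is zero")
    return [base_value * v / base for v in values]


# Invented figures on wildly different scales.
exchange_rate = [0.80, 0.82, 0.78, 0.84]
house_prices = [400_000, 410_000, 430_000, 450_000]

for name, series in [("exchange rate", exchange_rate), ("house prices", house_prices)]:
    indexed = [round(x, 1) for x in index_to_base(series)]
    print(name, indexed)
```

After indexing, the values read as "percent of the starting level", which is what makes the two series comparable and also why the original units are no longer visible.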

NZ dollar

For more great discussion of the pros and cons of dual-axis time series charts, and the R code for the dualplot() function, follow the link to Peter's blog post below.

Peter's stats stuff: Dual axes time series plots may be ok sometimes after all (via Harlan Harris)

The Data Lab

And the winners are...

We asked participants to tell us why they would like to take part in the course, and how they would use the data science Boot Camp training in their current roles. After going through all the participants’ videos, and carefully scoring each criterion, we have a tie! The first place joint winners are Colin Spence, Project Lead at Material DNA and Rocio Martinez, Analytical Consultant at Aquila Insight. And the second prize goes to Beata Mielcarek, IT Program Manager and Associate Data Scientist at CISCO. Have a look at their video entries here:






Rocio and Colin will be attending the three-week full programme, and Beata will take part in the Machine Learning module, thanks to the sponsorship of MBN Solutions. Participants will use the skills they learn to have a positive impact on their projects and organisations, and through it, contribute towards bringing benefits to Scotland’s economy.

Colin Spence mentions in his video: “When I heard about The Data Lab Boot Camp, I was really excited. It appears to be an accelerator programme that I could handle, that will give me the necessary understanding I need. Our goals at Material DNA is to create a globally scalable data platform that creates jobs and revenue for Scotland. Participation in the Boot Camp will be a significant contributor towards meeting this goal.”

Rocio Martinez said: "Ever since I started working, I noticed the need of learning new skills, specially now, in an evolving world that is demanding us to be the best we can be. I understand the need and power of analytics. I think the course aligns perfectly with me and who I am.”

After announcing the winners, Michael Young, CEO of MBN Solutions said: “We were pleasantly surprised with the content and passion in videos from all participants. It shows the appetite there is for this kind of short-term training opportunities, and for people in the industry to upskill and grow in the area of data science. This is the time for Scotland to invest in talent, and we at MBN Solutions are thrilled to be a part of it.”

There are only a few places left for the Boot Camp, developed by The Data Incubator, which begins on 12th September in Edinburgh. The course will focus on developing practical application skills such as advanced python, machine learning and data visualisation in a collaborative environment. You can sign up for the whole 3-week programme, or for individual modules.

To sign up and secure your space in the Boot Camp, please fill the application form and send it to For more information visit our Boot camp page, download our brochure or email


Data Digest

The 80/20 Rule of an Effective Chief Analytics Officer

We all know the problem – data is exploding, the number of analysts is decreasing and expectations on big data and analytics are ever increasing. In such a scenario, how can one best manage this situation? To become an effective Chief Analytics Officer, is there a ‘rule’ to follow?

To shed light on these issues, we spoke to Cameron J. Davies, SVP, Corporate Management Sciences at NBC Universal. Cameron is responsible for both the corporate management sciences and NBCU news group insights teams, including the development and execution of advanced analytics, data and research strategies driving NBCU priorities such as leveraging big data, personalisation of content, monetisation strategies, alternative measurements, and customer insights.

How has your role evolved over the past 12-18 months? 

CD: Most of these roles tend to follow similar paths when a company like NBCU is just getting started in this arena. Consequently, my last 12-18 months have been in those initial stages. The first 6 months were largely about building relationships, listening and getting integrated into the business. It really doesn’t matter how experienced you are either at “analytics” or the industry, your first step should always be about listening to the business, understanding their struggles, challenges, and opportunities. You can pick up some quick wins along the way but it is largely just learning. Months 6-12 are about establishing a strategic vision for how you want to prioritise and drive value. We called it setting a “North Star” of where we would like to see the organisation in 5 years. We know the path will evolve and shift, but it is important to set it and then make decisions incrementally as you evolve with the business. You also spend a lot of time in months 6-18 just doing the heavy infrastructure lifting of establishing your Data Strategy (finding it, storing it, curating it, then syndicating it); most of the first stage value comes from driving efficiency and effectiveness through these processes. The next 18-36 months is where the job gets really fun and exciting, because with the foundations in place we can begin to really deliver value added integrated tools and processes that help the company make better decisions more often.

What advice would you give someone wanting to become a Chief Analytics Officer, and what are the core skills one needs to have to thrive in the role?

CD: The role is only 20% data and math and 80% human behaviour and organisational change. There are a lot of very smart technically and mathematically talented people who fail at these roles because they don’t understand that. Data and analytical skill sets are really just the table stakes anymore. The people that will excel at these roles in the future (especially where the real growth will be outside of the digital natives) are going to be those that can influence an organisation. I would strongly advise anyone thinking of attempting one of these roles to spend as much time reading, studying their Organisational Behaviour and Psychology “textbooks” as they do trying to dig through the math of the latest machine learning algorithm.

Data is exploding, the number of analysts is flattening and expectations and demand are growing – how does one best manage in this scenario? Should the focus be on processes or business problems?

CD: Yes. There is no “OR”; that is like asking whether I should focus on breathing or pumping blood. If you stop doing either, you die. Business processes are by nature evolved to deal with a business problem. Any advanced analytics “tool” you want to roll out or put in place has to absolutely be designed to enhance a specific set of decisions the business makes on an ongoing basis AND do so in a way that can be acted upon appropriately (i.e., fit with the business processes). For example, a LOT of vendors want to sell media companies “real-time tools”. The idea of being able to see who is tuning in or out of my program in “real-time” sounds exciting BUT… by the time I put a procedural drama on the air (fully produced), there are very few decisions I can make “in the moment” that will impact the content or airing of that show. Consequently, what value does that information generate for me in that moment?


What is the biggest challenge you face within your role today and how are you looking to tackle it?

CD: We are no different than 99% of the folks out there trying to do this.  Our #1 challenge is non-existent, incomplete, or bad data and/or the inability to quickly process all of the data we have stored (new tools like Hive and Spark are helping with this but still an issue).  We are tackling it via an integrated Data Strategy that aligns with our overall Advanced Analytics strategy.  It starts with a consistent and persistent Master Data and Metadata strategy and moves through to Curation and Syndication in ways that create “single” consistent sources for other use cases like enhanced automated reporting, forecasting, etc.

What is the biggest challenge faced by the analytics/big data industry currently and in what ways does this affect your business?

CD: Too many vendors and too few qualified candidates, especially highly qualified data architects.  In particular, those people that are able and willing to dig down into the bowels of the data to create useful repositories.  Everyone wants to be the “rock star” who does cool math but that can only happen if all of the hard work of data procurement and curation has been done.  Consequently, people want to throw that work back to a vendor who really doesn't understand your data or business and/or a traditional BI group within IT, who by habit and nature tend to think in rows, columns and traditional EDW structures.  There are a lot of start-ups out there right now trying to tackle this particular problem but it is sort of like the promise of a magic diet pill.  Maybe it will exist one day and work well without all of the nasty side effects but for now, it just takes hard work and you have to be willing to roll up your sleeves and get it done.  It isn’t sexy and it isn’t fun but the results can be amazing if you have the tenacity and fortitude to stick with it and get it done right.

Where do you see will be the biggest area of investment in analytics within your industry over the next 12 months?

CD: Distributed computing.  Distributed storage made it possible to gather in and keep all of this “data” but in some ways we are still drinking from the proverbial data “ocean” with a very thin straw.  It is getting better and better and I think this is where you will see the biggest gains over the next year or so.

Hear more from Cameron J. Davies and other leading CAOs at the coming Chief Analytics Officer Forum Fall happening on 5-7 October 2016 in New York. For more information, visit   

August 21, 2016

Curt Monash

More about Databricks and Spark

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes: What I called...


Simplified Analytics

How Industrial Internet of Things are impacting our lives

General Electric coined the term Industrial Internet in late 2012. It is effectively synonymous with the Industrial Internet of Things, and abbreviated as Industrial IoT or IIoT. While many of us are...


August 19, 2016

Revolution Analytics

Because it's Friday: Gravity wave detecting lasers ... in space!

Earlier this week I was lucky enough to meet and see a fascinating talk by Shane L. Larson, astronomer at Chicago's Adler Planetarium. The topic was Black Holes and Gravitational Waves, and the first part of the talk focused on the LIGO project, a pair of massive detectors in Washington and Louisiana that can detect the tiny ripples in space caused by massive gravitational events far in the universe, like stars going supernova or black holes colliding. And on September 14 last year, not long after the detectors were first turned on, that's exactly what they detected: two massive black holes, merging in space.

Here's what that event would look like, if you could position yourself near the merging black holes. (This computer simulation was generated using the same software used for the movie Interstellar.)


However, something I learned about LIGO was that while telescopes are more like eyes on the universe, these gravity-wave detectors are more like ears: they can 'hear' the 'sound' from these cataclysmic events as the gravity waves pass through earth, but don't form an image of them directly. We can get a rough idea of the direction of the event because we have two sensors (in much the same way our two ears give us a rough idea of the direction of a loud sound). You can actually convert the sensor readings into audible sound, and this is what the above event sounded like: 


The actual sound is just 1 second long; it's played 4 times here, first in its actual frequency, then upsampled to a higher register where it's easier to hear (and then the cycle repeats). The little "chirp" you hear is the final phase of the black hole merger, where they spin super-rapidly before colliding at half the speed of light.

We're in the early days of gravitational wave observation, which literally gives us a new view (or listen?) on the universe to observe events that don't generate light or are obscured by the dust of the galaxy. More sensors are being constructed around the globe, which will give us the ability to detect more events and get a better sense of where they occurred. And there's even a project to build an observatory in space: LISA, the Laser Interferometer Space Antenna. (The original NASA project was sadly not funded, but the European Space Agency has taken up the mantle.) Like the ground-based detectors, LISA uses lasers to detect the minute (smaller than the diameter of a proton!) change in lengths of the detection arms as the gravity waves roll through. A precursor mission, LISA Pathfinder, was launched in December as a proof of concept to make sure accurate space-based detection of distances was possible. The experiment was to float two small masses in space, and use a laser reflecting off the sides to measure the distance. Dr Larson gave me a replica of one of the masses to hold; even though it was made of tungsten and not gold/platinum like the real ones, it was still surprisingly dense!

LISA mass

Thankfully, the LISA Pathfinder mission was a success, and the space-based gravity wave observational platform could be launched as early as 2039.

That's all from us on the blog for this week. See you on Monday, and have a great weekend!


August 18, 2016

Silicon Valley Data Science

SVDS Strengthens Executive and Advisory Team

I’m excited to announce two new members of our team: Antony Falco (VP, Product & Innovation) and Nayla Rizk (Advisor).

At Silicon Valley Data Science, we aspire to bring the best of agile, iterative data product development used by innovative companies in the Silicon Valley to our clients. Tony Falco has joined our team as VP, Product & Innovation, to help us continue to bring a product development focus to our work. He brings extensive experience as a serial entrepreneur in building products, platforms, and SaaS offerings. In addition to a client focus, Tony will help assess product opportunity for those of our teammates that have entrepreneurial ambitions.

Before joining SVDS, Tony was the founder of, a SaaS API service that eliminates the need to deploy and scale NoSQL databases. was acquired by CenturyLink in April 2015. In 2008, Tony co-founded Basho—a pioneer in distributed databases and cloud storage, and an early entrant in the NoSQL space. Prior to that, Tony helped Akamai grow its distributed caching product from pre-revenue to post-IPO and beyond.

In explaining the move, Tony said: “I was excited to join SVDS because of the people and the model. Working with a diverse team of smart, motivated people to figure out big questions about a massive, emerging industry is by far the best part of the job. SVDS’ model provides me a platform to test new ideas, ready them for market, and if one piques my interest, to lead it as a spinout.”

As we have grown our diverse team and variety of client engagements, I also wanted to add expertise in business strategy, organizational effectiveness, and people development. I’m pleased to welcome Nayla Rizk to our Advisory Board to assist our leadership team and Board of Directors with both strategic and tactical guidance on enabling our people to achieve their best. Because of Nayla’s amazing background in consulting, operating roles, and executive placement, she will advise the company on growth, culture, and recruiting. Nayla has always contributed back to the community as a leader and member of a number of non-profit organizations, reflecting one of our most important values as a firm.

Nayla currently consults with start-ups in Silicon Valley and New York City. Previously she was a senior director at Spencer Stuart, the leading global executive search firm, where she was a core member of the firm’s North American Technology, Communications & Media and Board Practices, based in Silicon Valley. Earlier, Nayla was director of strategic planning and reengineering at Network Equipment Technologies. She served as an engagement manager with McKinsey & Company in New York and, before that, began her career as an operations research analyst for Chevron U.S.A. in San Francisco.

The post SVDS Strengthens Executive and Advisory Team appeared first on Silicon Valley Data Science.

Revolution Analytics

Sentiment analysis of Trump's tweets with R

Data Scientist David Robinson caused a bit of a stir in the media when he analyzed Donald Trump's tweets and revealed that those sent from an Android device were likely sent by the candidate himself, while those sent from an iPhone were likely sent by campaign staffers. The difference? As seen in the chart below, Android-based tweets used angrier, negative words while iPhone-based tweets tended to be straightforward campaign announcements and hashtag promotions. The news was reported in Scientific American, the LA Times and PC Magazine, and David even gave an interview with Time magazine.


David used the R language and several contributed packages to analyze Trump's tweets. (The R code behind the analysis is available on Github as an R Markdown document, which also makes an excellent example of literate programming with R.) He used the twitteR package and the userTimeline function to download tweets and metadata from the @realDonaldTrump account, which formed the raw data for the analysis. The tidytext package function unnest_tokens extracted and standardized individual words from the tweets, from which simple tabulation generated the chart above of words more frequently used by iPhone and Android tweets. The tidytext package was also used to measure the sentiment of the words used, and again there was a clear difference between iPhone and Android tweets in use of words related to sadness, fear, anger and disgust; and surprise, anticipation, trust and joy.
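The word-comparison step can also be sketched outside R. The following is a minimal Python illustration, not David's code: the sample tweets are invented, `tokenize` and `log_ratio_by_source` are hypothetical helper names, and the add-one smoothing is an assumption made here to keep the ratio finite for words seen on only one device. It splits each tweet into words, counts them per device, and ranks words by the log2 ratio of their Android to iPhone frequency.

```python
import math
import re
from collections import Counter

# Invented sample data for illustration only; these are not real tweets.
SAMPLE_TWEETS = [
    ("Android", "The crooked media is so dishonest and weak"),
    ("Android", "Such a bad and dishonest debate"),
    ("iPhone", "Join us tomorrow in Ohio #MakeAmericaGreatAgain"),
    ("iPhone", "Thank you Ohio! Join the movement tomorrow"),
]

# A tiny stop-word list, just large enough for the sample above.
STOP_WORDS = {"the", "is", "so", "and", "a", "in", "us", "you"}


def tokenize(text):
    """Lowercase a tweet, split it into words and hashtags, drop stop words."""
    words = re.findall(r"#?[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def log_ratio_by_source(tweets, source_a="Android", source_b="iPhone"):
    """Return {word: log2 ratio} of add-one-smoothed frequencies (a over b)."""
    counts = {source_a: Counter(), source_b: Counter()}
    for source, text in tweets:
        if source in counts:
            counts[source].update(tokenize(text))
    vocab = set(counts[source_a]) | set(counts[source_b])
    total_a = sum(counts[source_a].values()) + len(vocab)
    total_b = sum(counts[source_b].values()) + len(vocab)
    return {
        w: math.log2(((counts[source_a][w] + 1) / total_a)
                     / ((counts[source_b][w] + 1) / total_b))
        for w in vocab
    }


if __name__ == "__main__":
    ratios = log_ratio_by_source(SAMPLE_TWEETS)
    # Positive values lean Android, negative values lean iPhone.
    for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{word:25s} {ratio:+.2f}")
```

A sentiment comparison would follow the same pattern, with each word first mapped to an emotion category through a lexicon before counting.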

Trump sentiments

For more details on David's analysis of the Trump tweets, and a fascinating hypothesis that an iPhone-based staffer is attempting to emulate the style of the real Trump's tweets, check out his blog post linked below.

Variance Explained: Text analysis of Trump's tweets confirms he writes only the (angrier) Android half

Silicon Valley Data Science

5 Ways to Facilitate Failure

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

At the Strata + Hadoop World conference in London, Ellen Friedman gave a wonderful talk called, “Big Data Stories: Decisions that drive successful projects.” One of her points, about the need for room to fail in order to innovate, got us thinking—again—about the critical role of failure and experimentation when it comes to innovation.

Ellen Friedman tweet

The idea of “freedom to fail” as a condition necessary for achieving transformative change is now widely accepted in our industry. Indeed, many smart people have been beating this drum for years now. But not many companies have yet found a way to create this freedom while simultaneously maintaining smooth enterprise-scale operations. Failure is appealing as a stepping stone along the path to innovation, but it’s very scary in practice—especially when you can’t yet see where the path is leading.

We’d like to suggest the following five guidelines as a place to start when you design for investigation and innovation—and the inevitable failures you’ll have along the way. We call this the Experimental Enterprise.

1. Build a Bridge to the Business

Innovation must not interfere with the important production processes of a business, but it can’t be too far removed from it, either. It’s not enough to create segregated areas for each: there needs to be a tie, otherwise you’ll never get innovation that’s relevant to the business, or you’ll never be able to actually turn it into product. Design a sandbox in which innovation can play alongside production.

2. Make Failure Affordable

Some industries are more risk-averse than others because of the amount of overhead involved. If you’ve ever complained about the slew of crappy movie sequels and comic remakes, then you’ve noticed Hollywood’s increasing reticence to make anything that doesn’t yet have a proven audience. The toy industry is similar: endless Elmos and Doras crowd the shelves because toymakers are unwilling to take a risk on anything that might not sell enough units to cover their overhead costs. Shrink overhead where you can to make failure affordable.

Remember: failure needs to be cheap both economically and politically. It’s dangerous to pin reputations and kudos to pilot projects. Every organization has the two-year-old pilot project that refuses to die, because those involved have too much at stake. The worst fate for a pilot is to neither succeed nor fail, but instead to wander on, zombie-like, in limbo.

3. Crunch the Numbers

“Freedom to Fail” doesn’t exempt you from projecting an ROI. If a project is untested or truly innovative, it becomes even more imperative to get a good grasp of what you stand to gain, and what you’ve laid on the table. Understanding the stakes will allow you to gauge the trade-offs and make others in your organization more comfortable with the possibility of failure.

4. Invest in Learning

Playing the lottery is a stupid get-rich-quick scheme, but it’s affordable entertainment: what you’re really buying when you buy a mega-millions ticket is not any real possibility of becoming wealthy, but the license to dream for a few days about what would happen if you did. Viewed in that light, it’s cheaper (by the hour) than many other forms of entertainment.

If you can similarly measure your corporate experiments by what you stand to gain when you lose, then the value proposition shifts. Structure your projects so that even if you fail, you fail in a way that allows you to learn something valuable. If the experiment doesn’t succeed, you’ve still gotten something back on your investment; and if the experiment does succeed: jackpot!

5. Fail Fast

The best way to learn, of course, is with very short cycles. Failing quickly and repeatedly is a good discovery mechanism for the contours of the problem space: it gives you a “fingertip feel” for the area. The faster you fail, the tighter your feedback loop will be, allowing a more effective response to changing market conditions and political whims.

It’s probably apparent that an organization designed for scale and efficiency is completely at odds with the organizational design required for experimentation. Your company needs to recognize this at the highest levels and make room. The young “rockstars” will say your organization is failing because it can’t innovate, and the old guard will say the new kids don’t know how to run anything in production. Both are right, and both are wrong.

It’s the organizational know-how to run and to innovate in parallel—and to bridge the two—that’s the secret sauce. At a technical level, this is where platforms and APIs count for a lot. But there’s an organizational analog, too: certain roles allow for people to act as political bridges. You want to have both in place.

For more on how to build an adaptive environment, check out our free white paper on Building the Experimental Enterprise. For further reading, we recommend Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age (HarperBusiness) by Michael A. Hiltzik and The Connected Company (O’Reilly Media) by Dave Gray with Thomas Vander Wal.

The post 5 Ways to Facilitate Failure appeared first on Silicon Valley Data Science.

Jean Francois Puget

An ode to the analytics grease monkeys (analytics deployment = ROI)

Here is a guest post by my colleague Erick Brethenoux, Director, IBM Analytics Strategy, Decision Management & Initiatives.  He provides a new and interesting angle on a very important topic that I have discussed several times here: analytics and data science provide business value only when business actions are taken.  I like the way Erick discusses it, and I hope you'll agree with me.

Analytics has value only when it is actionable

Analytics provides a significant business (i.e., monetary) impact for organizations when analytical assets are deployed within their business processes (as exemplified by the use of analytics for smart buildings and elevators, explained by IBM's Harriet Green and KONE's Larry Wash at InterConnect). That is, the impact comes when analytical assets are consumed by people or processes: a call center operator making a recommendation in real time to a customer on the line (a recommendation promoted by an analytical propensity model), or a machine accepting or declining millions of credit-card transactions every minute (acceptance calculated in microseconds by an analytical fraud-detection model).

Embedding those models inside the appropriate business processes delivers that value; making sure that those models (that have supported concrete decisions) can be legally traced, monitored, and secured requires discipline and management capabilities. Deploying analytical assets within operational processes in a repeatable, manageable, secure, and traceable manner requires more than a set of APIs and a cloud service; a model that has been scored (executed) has not necessarily been managed.

Developing & deploying analytical assets: complementary cycles.

In the mid-90s we already went through this phase of wild and creative programmable development.  Today, aside from a salutary democratization of analytics in general and predictive analytics in particular, and an unprecedented development of analytical talents, the analytical open-source movement is also producing a large number of analytical assets that will eventually have to be consumed. But current open-source real-time deployment techniques provide “dissemination means”, not “managed deployment means”.

To make a timely analogy for this Summer:  you do not win the Tour de France by building bikes, you win it by riding those bikes within organized teams.

Over the last decade, as the “industrial” deployment of analytics has started to boost the competitiveness of early adopters, we have learned, often the hard way, that releasing a model to production is not enough; that “release moment” is followed by many steps that make the deployment cycle as important as the development cycle (see Figure 1).


Figure 1.  Complementary and interlinked analytical development & deployment cycles


Capabilities for the deployment of analytical assets

The development cycle has been battle-tested for more than two decades. It originates from the open methodology CRISP-DM (Cross-Industry Standard Process for Data Mining), which IBM refined last year through an extended version called Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM).  Many development efforts at IBM enable the development cycle through technologies including our Data Science Experience (DSX) and our upcoming, expansive ML-as-a-Service machine learning workflow capability.

The deployment cycle, on the other hand, has not been as formalized; but successful organizations have adopted various technologies to provide the foundation to manage analytical assets in a reliable, repeatable, scalable and secure way.  That foundation should provide:

  1. A “centralized” way to store and secure analytical assets, to make it easier for analysts to collaborate and to allow them to re-use models or other assets as needed (it could be a secured community or a collaboration space)
  2. The possibility to house the repository in an organization's existing database management system (regardless of its provisioning) and establish security credentials through integration with existing security providers, while accommodating models from any analytical tool
  3. Protocols to ensure adherence to all internal and external procedures and regulations, not just for compliance reasons but, as an increasing amount of data gets aggregated, to address potential privacy issues
  4. Automated versioning, fine-grained traceable model scoring and change management capabilities (including champion/challenger features) to closely monitor and audit analytical asset lifecycles
  5. Bridges to eventually link both the development and deployment cycles to external information life cycle management systems (to optimize a wider and contextual re-use of assets) and to enhanced collaboration capabilities (to share experiences and information assets beyond the four walls of the organization).
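To make the fourth capability more concrete, here is a minimal Python sketch of versioning, traceable scoring and champion/challenger switching. It is an illustration only and does not describe any IBM product: `ModelRegistry`, `ModelVersion` and their methods are hypothetical names, the storage is an in-memory dictionary where a real system would use a secured database, and the fraud rules are toy placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelVersion:
    name: str
    version: int
    scorer: object       # any callable that takes a record and returns a score
    registered_at: str   # ISO timestamp, kept for traceability


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; not a real product API."""
    versions: dict = field(default_factory=dict)    # name -> list of ModelVersion
    champions: dict = field(default_factory=dict)   # name -> champion version number
    audit_log: list = field(default_factory=list)   # one entry per scoring call

    def register(self, name, scorer):
        """Store a new version; the first version of a model becomes champion."""
        history = self.versions.setdefault(name, [])
        mv = ModelVersion(name, len(history) + 1, scorer,
                          datetime.now(timezone.utc).isoformat())
        history.append(mv)
        self.champions.setdefault(name, mv.version)
        return mv.version

    def promote(self, name, version):
        """Change management: make an existing version the champion."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for model {name!r}")
        self.champions[name] = version

    def score(self, name, record, version=None):
        """Score with the champion (or a named challenger) and log the call."""
        version = version or self.champions[name]
        mv = self.versions[name][version - 1]
        result = mv.scorer(record)
        self.audit_log.append({"model": name, "version": version,
                               "input": record, "score": result})
        return result


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("fraud", lambda r: r["amount"] > 1000)   # v1, champion
    registry.register("fraud", lambda r: r["amount"] > 500)    # v2, challenger
    print(registry.score("fraud", {"amount": 700}))             # champion decides
    print(registry.score("fraud", {"amount": 700}, version=2))  # challenger decides
    registry.promote("fraud", 2)
    print(registry.audit_log)
```

The point of the audit log is that every decision can later be tied back to the exact model version that produced it, which is what separates a managed deployment from a model that has merely been executed.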

The development cycle has always been the glamorous part of the full analytical process, and developers, especially coders (now fully enabled through robust sets of development tools), have thrived on that first cycle.  However, neglecting the deployment cycle has often prevented organizations from fully realizing the business value of their analytical efforts while generating distrustful attitudes towards analytics within those organizations.  In the deployment cycle, the “industrialization” of analytics is less glamorous, but it is where the rubber meets the road.

Deployment benefits

Beyond fostering collaboration, delivering more effective analytical results, controlling and deploying those results to people and systems to support decision-making in a broad range of operational areas, automating analytical processes for greater consistency and auditing the development, refinement and usage of models and other assets throughout their lifecycle, the formalization of the deployment cycle provides a bigger-picture advantage: more consistent organizational decisions.

Next analytic market leader

The current hype around the development cycle is overshadowing the upcoming need for organized analytics consumption, and that need, through the advent of external data and the tsunami of Internet of Things data and processes, will be overwhelming. IBM has the benefit of having foundational capabilities through its Collaboration & Deployment Services technology, but the next wave will require some significant updates around that foundation, including the extension (i.e., bridges) toward information life cycle management capabilities and wider collaboration capabilities (i.e., beyond the organization’s analytics community). We need a stronger focus and a formalized approach (with its accompanying technology) for the “deployment cycle”; let’s bring together the conversations on the analytical deployment theme. I predict that the next analytic market leader will dominate (and benefit strongly, in financial terms, from) the managed deployment of analytical assets; who will take that hill first?


The Data Lab

What The Data Lab MSc project means for Waracle's clients

Waracle MSc project

You probably hear about ‘Big Data’ in the news every day and perhaps wonder what it’s all about. Waracle Optimisation team are pretty much obsessed with data and swimming in it! Whether it’s data for client apps, websites, digital marketing campaigns (or even their own Nest or FitBit stats!) our team are keen to remain on the cutting edge of data science techniques – hence their interest in this project.

It certainly didn’t take long for Waracle to propose a data visualisation project via e-Placement Scotland and The Data Lab, one which could really benefit our mobile app optimisation clients in the long run. Before we knew it, summer arrived along with Jean Vallee, a student of the University of Dundee, who quickly became part of the Waracle Optimisation team, participating in daily stand-up meetings (he even got to see the Queen when she visited Dundee, but that’s another story!).


Jean’s mission at Waracle!

Waracle Mobile App Developers have been in mobile apps from the very beginning, building mobile apps on all platforms, across all sectors and working with a large number of clients. The mobile app optimisation team in Dundee monitor all apps and create regular data reports with analysis and recommendations to help optimise apps and mobile marketing campaigns.

Ian Treleaven, Head of Optimisation says, “It’s become increasingly important for us to present app data in a way that stands out and helps our clients to improve their apps. We’ve always known that one of the best ways to get our message across to clients is to use a visualisation. By presenting data visually it’s possible to uncover surprising trends and observations that wouldn’t be obvious from looking only at stats. Our goal of this project was to apply data science techniques to create automated reporting including visualisations. One of our goals was to create a stacked doughnut visualisation to show a month’s data compared to the previous month, that month last year and an annual average for various channel grouping acquisition sources and other dimensions.”


About the technical environment of the prototype

Jean’s visualisation app consists of a web server that generates scalable charts for the monthly reports sent to clients. These customised charts have been designed to be more meaningful to the client than built-in charts. The application uses Javascript packages for data analytics extraction, loading and visualisation. It has been designed for a fast generation of images and for their easy and simple inclusion in the reports.


Working with the Optimisation Team

Jean Vallee, Data Lab MSc student says, “This industry placement at Waracle has been a very enriching experience. It’s a human-size and dynamic company open to new ideas. I’ve learned how an optimisation team works: how its members Ian, Caroline, Frank and Raul all collaborate with their different skills in Management, Marketing, Data Analysis and Development. My contribution to the team was mainly to develop visualisation applications using their state-of-the-art software tools and hardware.”

Ian Treleaven, Head of Optimisation said, “We were delighted with the interest shown in the Waracle project and received a number of student applications. It’s such a great opportunity for data scientists, like Jean, to further develop their data visualisation skills along with teamwork and project management during the placement. Having Jean as part of our team has given us some awesome data visualisation tools to optimise (excuse the pun) our Optimisation reports for clients.”

If all this has whet your appetite for mobile app optimisation, why not contact the Waracle Optimisation team today to talk about your data and reporting requirements? We’d love to hear from you!


Working with Waracle

Waracle provides a fast-paced work environment where we nurture our marketing, optimisation, graphics, sales and software engineering talent with weekly brown bag sessions, monthly get togethers and all the fruit you can eat! Waracle puts its people at the heart of everything we do, making us a great environment in which to do great work, have fun and go home happy!

If you fancy working with Waracle why not get in touch with our Head of Talent to chat about all we have on offer at Waracle Mobile App Developers.


Google Plus

Optimized Hadoop Linux Kernel Script

For ages, I've been looking for someone to produce a nice general purpose script for Linux hosts that sets the optimal kernel settings for a Hadoop node maintained to the current Hadoop release.  While there are a number of scripts out there they tend to be pretty inconsistent or very specific to a particular implementation.
Additionally, we needed something that was non-destructive that could be safely run multiple times against the same host without fear of any issues.  Clearly the echo >> /etc/sysctl.conf approach wasn't going to cut it.
What I've put together is a script that does the above based upon the current (as of August 2016) Hadoop norms.  It is optimised for 10GbE networking but should still function fine for 1GbE.
Hopefully, others can also correct, modify and update as we go along.
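The non-destructive requirement is worth a short illustration. The sketch below is written in Python and is not the script described above: `apply_settings` is a hypothetical helper, and the keys and values in the example are placeholders, not recommended Hadoop settings. It shows one way to make the update idempotent: rewrite an existing `key = value` line in place, drop later duplicates of that key, and append only when the key is absent, so that a second run leaves the text unchanged.

```python
import re


def apply_settings(conf_text, settings):
    """Return sysctl.conf-style text in which each given key is set exactly once."""
    remaining = dict(settings)   # keys that have not been written yet
    result = []
    for line in conf_text.splitlines():
        # Match "key = value" lines; commented-out lines start with "#" and do not match.
        match = re.match(r"\s*([A-Za-z0-9_.\-]+)\s*=", line)
        key = match.group(1) if match else None
        if key in settings:
            if key in remaining:
                # First occurrence: rewrite it in place with the desired value.
                result.append(f"{key} = {remaining.pop(key)}")
            # Later occurrences of the same key are dropped.
        else:
            result.append(line)
    # Keys that never appeared in the file are appended once at the end.
    for key, value in remaining.items():
        result.append(f"{key} = {value}")
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only; not tuning advice.
    wanted = {"vm.swappiness": 1, "net.core.somaxconn": 1024}
    original = "# tuning\nvm.swappiness = 60\nfs.file-max=100\n"
    once = apply_settings(original, wanted)
    twice = apply_settings(once, wanted)
    print(once)
    print("idempotent:", once == twice)
```

Working on the text rather than the file keeps the logic easy to test; a real script would still need to read and write the file, keep a backup, and load the new values.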

August 17, 2016

Revolution Analytics

Extract tables from messy spreadsheets with jailbreakr

R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the...

Ronald van Loon

Machine Learning Becomes Mainstream: How to Increase Your Competitive Advantage


First there was big data – extremely large data sets that made it possible to use data analytics to reveal patterns and trends, allowing businesses to improve customer relations and production efficiency. Then came fast data analytics – the application of big data analytics in real-time to help solve issues with customer relations, security, and other challenges before they became problems. Now, with machine learning, the concepts of big data and fast data analytics can be used in combination with artificial intelligence (AI) to avoid these problems and challenges in the first place.

So what is machine learning, and how can it help your business? Machine learning is a subset of AI that lets computers “learn” without explicitly being programmed. Through machine learning, computers can develop the ability to learn through experience and search through data sets to detect patterns and trends. Instead of extracting that information for human comprehension and application, it will use it to adjust its own program actions.

What does that mean for your business? Machine learning can be used across industries, including but not limited to healthcare, automotive, financial services, cloud service providers, and more. With machine learning, professionals and businesses in these industries can get improved performance in a number of areas, including:

  • Image classification and detection
  • Fraud detection
  • Facial detection/recognition
  • Image recognition/tagging
  • Big data pattern detection
  • Network intrusion detection
  • Targeted ads
  • Gaming
  • Check processing
  • Computer server monitoring

In their raw data, large and small data sets hide numerous patterns and insights. Machine learning gives businesses, organizations, and institutions the ability to discover trends and patterns faster than ever before. Practical applications include:

  • Genome mapping
  • Enhanced automobile safety
  • Oil reserves exploration

Intel has worked relentlessly to develop libraries and reference architectures that not only enable machine learning but allow it to truly take flight and give businesses and organizations the competitive edge they need to succeed.

In fact, according to a recent study by Bain[1], companies that use machine learning and analytics are:

  • Twice as likely to make data-driven decisions.
  • Five times as likely to make decisions faster than competitors.
  • Three times as likely to have faster execution on those decisions.
  • Twice as likely to have top-quartile financial results.

Machine learning is giving businesses competitive advantages.

In other words, predictive data analytics and machine learning are becoming necessities for businesses that wish to succeed in today’s market. The right machine learning strategy can put your business ahead of the competition, reduce your TCO, and give you the edge your business needs to succeed.


Background on Predictive Analytics and Machine Learning

You already know that machine learning is essentially a form of data analytics, but where did it come from and how has it evolved to become what it is today? In the past couple of decades, we have seen a rapid expansion and evolution of information technology. In 1995, data storage cost around $1000/GB; by 2014 that cost had plummeted to $0.03/GB[2]. With access to larger and larger data sets, data scientists have made major advances in neural networks, which have led to better accuracy in modeling and analytics.

As we mentioned earlier, the combination of data and analytics opens up unique opportunities for businesses. Now that machine learning is entering the mainstream, the next step along the path is predictive analytics, which goes above and beyond previous analytics capabilities.


The Path to Predictive Analytics

With predictive analytics, companies can see more than just “what happened” or “what will happen in the future.”

Machine learning is a part of predictive analytics, and it is made up of deep learning and statistical/other machine learning. In deep learning, algorithms use multiple layers to learn more and more complex representations of data. In statistical/other machine learning, statistical algorithms and algorithms based on other techniques help machines estimate functions from learned examples.
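As a small illustration of estimating a function from learned examples, the Python sketch below fits a straight line to a handful of made-up points by ordinary least squares, then applies the fitted line to a new input. The data and the `fit_line` and `predict` names are hypothetical, and a straight line is only the simplest possible stand-in for the statistical algorithms mentioned above.

```python
def fit_line(xs, ys):
    """Training step: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) examples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Prediction step: apply the fitted line to an input it has not seen."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Made-up examples that happen to lie on y = 2x + 1.
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print("slope, intercept:", model)
    print("prediction for x = 10:", predict(model, 10))
```

The two functions are deliberately separate: the model is built once from historical examples and can then be applied to any number of new records.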

Essentially, machine learning lets computers train by building a mathematical model from one or more data sets. The trained model is then used for scoring, that is, for making predictions based on the available data. So when should you apply machine learning?

There are a number of times when applying machine learning can give you a competitive advantage. Some prominent examples include:

  • When there is no available human expertise on a subject. Recent navigation to Pluto relied on machine learning, as there was no human expertise on this course.
  • When humans cannot explain their abilities or expertise. How do you recognize someone’s voice? Speech recognition is a deep-seated skill, but there are so many factors in play that you cannot say why or how you recognize someone’s voice.
  • When solutions change over time. Early in a rush-hour commute, the drive is clear. An hour later, there’s a wreck, the freeway comes to a standstill, and side streets become more congested as well. The best route to getting to work on time changes by the minute.
  • When solutions vary from one case to another. Every medical case is different. Patients have allergies to medications, multiple symptoms, family histories of certain diseases, etc. Solutions must be found on an individual basis.

These are just a few of the uses that you’ll find across industries and institutions for machine learning. Not only is the demand for machine learning growing, but there is now an evolving ecosystem of software dedicated to furthering machine learning and giving businesses and organizations the benefits of instantaneous, predictive analytics.



An evolving ecosystem of machine learning software*


In this ecosystem, Intel is the most widely deployed platform for the purposes of machine learning. Intel® Xeon® and Intel® Xeon Phi™ CPUs provide the most competitive and cost efficient performance for most machine learning frameworks.


Challenges to Adoption of Machine Learning

There are a few barriers to adoption of machine learning that businesses need to overcome to take advantage of predictive analytics. These include:

  • Understanding how much data is necessary
  • Adapting and using current data sets
  • Hiring data scientists to create the best machine learning strategy for your business
  • Understanding potential needs for new infrastructure vs. using your existing infrastructure.

With the right machine learning strategy, the barriers to adoption are actually fairly low. And, when you consider the reduced TCO and increased efficiency throughout your business, you can see how the transition can pay for itself in very little time. As well, Intel is dedicated to establishing a developer and data science community to exchange thought leadership ideas across disciplines of advanced analytics. Through these articles and information exchanges, we hope to further help businesses and organizations understand the power of predictive analytics and machine learning.


What is your opinion and how do you apply data analytics and machine learning?
Let us know what you think.









Nidhi Chappell is the director of machine learning strategy at Intel Corporation. Connect with Nidhi on LinkedIn and Twitter to find out more about how machine learning can give your business a competitive edge.

Ronald van Loon is director at Adversitement. If you would like to read more from Ronald van Loon on the possibilities of Big Data, please click ‘Follow’ and connect on LinkedIn and Twitter.

Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at Adversitement, a leader in Big Data & data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Machine Learning Becomes Mainstream: How to Increase Your Competitive Advantage appeared first on Ronald van Loons.

Teradata ANZ

How Come NPS (Net Promoter Score) Data Doesn’t Rate Ben Affleck Movies?

Of all the customer-satisfaction metrics, Net Promoter Score (NPS) is the real McCoy. Probably because NPS influences a company’s bottom line. That’s the theory anyway.

After all, asking a customer ‘How likely are you to recommend ‘XYZ’?’ right after they’ve bought the product, may not be rocket science. But it can be effective.

As simple as NPS

But, the simplicity of NPS is also its downside. Success doesn’t necessarily guarantee consistency, as in the case of Ben Affleck. Going by his Oscar wins (‘Good Will Hunting’ and ‘Argo’), he should rate highly. However, using NPS surveys to chart his movie career, ratings fluctuate wildly (‘Shakespeare in Love’, ‘Chasing Amy’, ‘Pearl Harbor’, ‘Phantoms’, ‘Reindeer Games’, ‘Runner, Runner’…).

Road movies not roadside snapshots

Understanding customer experience at a single point-in-time using NPS surveys can give you a false perspective. So companies need to look at customer interaction as a continuous journey rather than a scattergun plot of customer-survey responses.

NPS suffers from other shortcomings, too. It doesn’t drive action. For instance, the time delay between contact event and survey could mean losing the chance to rectify poor customer perception at the touch point.

2 + 3 = 23

NPS scores can also be problematic. Often, NPS operates at an organisational level, whereas customer interaction is chiefly about a product and/or service. So, because the same score can be computed in different ways, outcomes don’t necessarily describe customer loyalty. Moreover, measuring events post-interaction doesn’t reflect the customer’s experience of using a service, such as network quality, unless the customer complains about cell-phone signal, a dropped call, or poor download speed, of course.
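For readers who have not computed NPS themselves, here is a minimal sketch using the standard definition: respondents answer on a 0-10 scale, those scoring 9-10 count as promoters, those scoring 0-6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. The two survey samples are invented, and they illustrate one reading of the point above: very different customer bases can produce the same headline number.

```python
def net_promoter_score(ratings):
    """Return NPS in the range -100..100 for a list of 0-10 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Two hypothetical customer bases with very different profiles.
polarised = [10] * 50 + [0] * 30 + [7] * 20   # many fans, many critics
lukewarm = [9] * 20 + [8] * 80                # few fans, nobody unhappy

score_polarised = net_promoter_score(polarised)
score_lukewarm = net_promoter_score(lukewarm)
```

Both samples score 20, even though the first has 30% of customers at risk of leaving and the second has none, which is why a single NPS figure says little on its own.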

Be proactive

It makes sense to be proactive and manage the customer experience rather than to wait for after-the-event NPS surveys, so:

  • Analyse customer opinion about new products / issues with a current product or service via social media and Big Data analytics, to reduce call-centre traffic.
  • Use customer-contact notes and logs to perform text analytics, acting on the insights to extend customer lifetimes.
  • Big Data Discovery Analytics will uncover new insights about your, and your competitors’, products and services.
  • Correlate customer experience with NPS and predict perceptions and intentions well-ahead of customer contact at touch points, or subsequent NPS surveys.

Taken in isolation, NPS scores provide little value to an organisation. However, when combined with Big Data & enterprise data, and run through multichannel analysis, NPS can help enrich customer satisfaction and lifetime value.

Remember that next time you go to the movies.

Adapted from a blog that appeared, previously, on

This post first appeared on Forbes TeradataVoice on 24/03/2016.

The post How Come NPS (Net Promoter Score) Data Doesn’t Rate Ben Affleck Movies? appeared first on International Blog.


August 16, 2016


SoftBase Announces General Availability for Batch Healthcare Release 3.3.1

SoftBase Announces General Availability for Batch Healthcare Release 3.3.1 Latest release of DB2 optimization utility enhances user productivity; adds additional features to further improve...


Revolution Analytics

The inexorable growth of student debt, charted with R

Len Kiefer, Deputy Chief Economist at Freddie Mac, recently published the following chart to his personal blog showing household debt in the United States (excluding mortgage debt). As you can see,...

Big Data University

This Week in Data Science (August 16, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (August 16, 2016) appeared first on Big Data University.


Open Source Data Provider: Web Data Technology Partnerships to Fuel Data Innovation

Tyson Johnson here. I’m the Vice President of Business Development here at BrightPlanet. We recently held an event that really made it clear to me how a true partnership between an open source data provider and a data technology company benefits both parties and fuels continued innovation. Find out why in this post. Building Partnerships […] The post Open Source Data Provider: Web Data Technology Partnerships to Fuel Data Innovation appeared first on BrightPlanet.

Jean Francois Puget

A Practical Guide to Machine Learning: Understand, Differentiate, and Apply

Co-authored by Rob Thomas (@robdthomas)

Machine Learning represents the new frontier in analytics, and is the answer to how many companies can capitalize on the data opportunity. Machine Learning was first defined by Arthur Samuel in 1959 as a “Field of study that gives computers the ability to learn without being explicitly programmed.” Said another way, this is the automation of analytics, so that it can be applied at scale. What is highly manual today (think about an analyst combing thousand-line spreadsheets) becomes automatic tomorrow (an easy button) through technology. If Machine Learning was first defined in 1959, why is this now the time to seize the opportunity? It’s the economics.

A relative graphic to explain:



Since the time that Machine Learning was defined and through the last decade, the application of Machine Learning was limited by the cost of compute and data acquisition/preparation. In fact, compute and data consumed the entirety of any budget for analytics which left zero investment for the real value driver: algorithms to drive actionable insights. In the last couple years, with cost of compute and data plummeting, machine learning is now available to anyone, for rapid application and exploitation.


It is well known that businesses must constantly adapt to changing conditions: competitors introduce new offerings, consumer habits evolve, and the economic and political environment changes. This is not new, but the velocity at which business conditions change is accelerating. This constantly accelerating pace of change places a new burden on technology solutions developed for a business.

Over the years, application developers moved from V-shaped projects, with multi-year turnaround, to agile development methodologies (turnaround in months, weeks, and often days). This has enabled businesses to adapt their applications and services much more rapidly. For example:

a) A sales forecasting system for a retailer: The forecast must take into account today's market trends, not just those from last month. And, for real-time personalization, it must account for what happened as recently as 1 hour ago.

b) A product recommendation system for a stock broker: It must leverage current interests, trends, and movements, not just last month's.

c) A personalized healthcare system: Offerings must be tailored to an individual and their unique circumstance. Healthcare devices, connected via The Internet of Things (IoT), can be used to collect data on human and machine behavior and interaction.

These scenarios, and others like them, create a unique opportunity for machine learning. Indeed, machine learning was designed to address the fluid nature of these problems.

Firstly, it moves application development from programming to training: instead of writing new code, the application developer trains the same application with new data. This is a fundamental shift in application development, because new, updated applications can be obtained automatically on a weekly, if not daily basis. This shift is at the core of the cognitive era in IT.

Secondly, machine learning enables the automated production of actionable insights where the data is (i.e. where business value is greatest). It is possible to build machine learning systems that learn from each user interaction, or from new data collected by an IoT device. These systems then produce output that takes into account the latest available data. This would not be possible with traditional IT development, even if agile methodologies were used.


While most companies get to the point of understanding machine learning, too few are turning this into action. They are either slowed down by concerns over their data assets or they attempt it one-time and then curtail efforts, claiming that the results were not interesting. These are common concerns and considerations, but they should be recognized as items that are easily surmounted, with the right approach.

First, let’s take data. A common trap is to believe that data is all that is needed for a successful machine learning project. Data is essential, but machine learning requires more than data. Machine learning projects that start with a large amount of data, but lack a clear business goal or outcome, are likely to fail. Projects that start with little or no data, yet have a clear and measurable business goal, are more likely to succeed. The business goal should dictate the collection of relevant data and also guide the development of machine learning models. This approach provides a mechanism for assessing the effectiveness of machine learning models.

The second trap in machine learning projects is to view it as a one-time event. Machine learning, by definition, is a continuous process and projects must be operated with that consideration.

Machine learning projects are often run as follows:

1) They start with data and a new business goal.

2) Data is prepared, because it wasn’t collected with the new business goal in mind.

3) Once prepared, machine learning algorithms are run on the data in order to produce a model.

4) The model is then evaluated on new, previously unseen data to see whether it captured something sensible from the data. If it did, it is deployed in a production environment where it is used to make predictions on new data.
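The four steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' tooling: the "model" is just a single cut-off on one numeric feature, the data is invented, and the 0.9 acceptance bar for deployment is an arbitrary choice.

```python
def prepare(raw_rows):
    """Step 2: keep only rows that have both a feature value and a label."""
    return [(float(x), bool(y)) for x, y in raw_rows
            if x is not None and y is not None]


def train(rows):
    """Step 3: pick the cut-off on the feature that best separates the labels."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in rows}):
        correct = sum((x >= threshold) == y for x, y in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def evaluate(threshold, rows):
    """Step 4: accuracy on data the model has not seen during training."""
    return sum((x >= threshold) == y for x, y in rows) / len(rows)


# Step 1: hypothetical raw data as (feature, label); None marks a missing value.
raw_training = [(1, False), (2, False), (None, True), (3, False),
                (6, True), (7, True), (8, None), (9, True)]
raw_holdout = [(2.5, False), (6.5, True), (4, False), (10, True)]

training = prepare(raw_training)
holdout = prepare(raw_holdout)
model = train(training)
holdout_accuracy = evaluate(model, holdout)
deploy = holdout_accuracy >= 0.9  # assumed acceptance bar for going to production
```

Note that nothing in this flow ever revisits the model after deployment, which is the limitation the next paragraphs discuss.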

While this typical approach is valuable, it is limited by the fact that the models learn only once. While you may have developed a great model, changing business conditions may make it irrelevant. For instance, assume machine learning is used to detect anomalies in credit card transactions. The model is created using years of past transactions, and anomalies are fraudulent transactions. With a good data science team and the right algorithms, it is possible to obtain a fairly accurate model. This model can then be deployed in a payment system where it flags anomalies when it detects them. Transactions with anomalies are then rejected. This is effective in the short term, but clever criminals will soon recognize that their scam is detected. They will adapt, and they will find new ways to use stolen credit card information. The model will not detect these new ways because they were not present in the data that was used to produce it. As a result, the model's effectiveness will drop.

The cure for this performance degradation is to monitor the effectiveness of model predictions by comparing them with actuals. For instance, after some delay, a bank will know which transactions were fraudulent. Then it is possible to compare the actual fraudulent transactions with the anomalies detected by the machine learning model. From this comparison one can compute the accuracy of the predictions. One can then monitor this accuracy over time and watch for drops. When a drop happens, it is time to refresh the machine learning model with more up-to-date data. This is what we call a feedback loop. See here:


With a feedback loop, the system learns continuously by monitoring the effectiveness of predictions and retraining when needed. Monitoring and using the resulting feedback are at the core of machine learning. This is no different than how humans perform a new task. We learn from our mistakes, adjust, and act. Machine learning is no different.
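A feedback loop of this kind can be sketched as follows. The weekly batches, the baseline accuracy of 0.9 and the tolerance of 0.15 are all invented for illustration; a real system would also retrain the model at the point marked in the comment rather than just record the week.

```python
def accuracy(predictions, actuals):
    """Share of predictions that matched what actually happened."""
    return sum(p == a for p, a in zip(predictions, actuals)) / len(actuals)


def needs_retraining(accuracy_history, baseline, tolerance):
    """Flag a refresh when the latest accuracy falls more than `tolerance` below baseline."""
    return bool(accuracy_history) and accuracy_history[-1] < baseline - tolerance


# Hypothetical weekly batches: (model's fraud flags, outcomes confirmed later by the bank).
weekly_batches = [
    ([True, False, False, True, False], [True, False, False, True, False]),   # week 1
    ([True, False, False, False, False], [True, False, False, True, False]),  # week 2
    ([False, False, True, False, False], [True, False, False, True, True]),   # week 3
]

baseline = 0.9    # assumed accuracy measured when the model was deployed
tolerance = 0.15  # assumed drop that is still considered acceptable

history = []
retrain_weeks = []
for week, (flags, confirmed) in enumerate(weekly_batches, start=1):
    history.append(accuracy(flags, confirmed))
    if needs_retraining(history, baseline, tolerance):
        retrain_weeks.append(week)  # a real system would retrain on recent data here
```

In this made-up run the accuracy slides from 1.0 to 0.8 to 0.2, and only the third week crosses the threshold that triggers a refresh. For fraud, where positives are rare, a metric such as precision or recall may be a better signal than plain accuracy, but the loop has the same shape.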


Companies that are convinced that machine learning should be a core component of their analytics journey need a tested and repeatable model: a methodology. Our experience working with countless clients has led us to devise a methodology that we call DataFirst. It is a step-by-step approach for machine learning success.


Phase 1: The Data Assessment
The objective is to understand your data assets and verify that all the data needed to meet the business goal for machine learning is available. If not, you can take action at that point, to bring in new sources of data (internal or external), to align with the stated goal.

Phase 2: The Workshop
The purpose of the workshop is to ensure alignment on the definition and scope of the machine learning project. We usually cover these topics:
- Level set on what machine learning can do and cannot do
- Agree on which data to use
- Agree on the metric to be used for results evaluation
- Explore how the machine learning workflow, especially deployment and feedback loop, would integrate with other IT systems and applications

Phase 3: The Prototype
The prototype aims at showing machine learning value with actual data. It will also be used to assess the performance and resources needed to run and operate a production-ready machine learning system. When completed, the prototype is often key to securing a decision to develop a production-ready system.


Leaders in the Data era will leverage their assets to develop superior machine learning and insight, driven from a dynamic corpus of data. A differentiated approach requires a methodical process and a focus on differentiation with a feedback loop. In the modern business environment, data is no longer an aspect of competitive advantage; it is the basis of competitive advantage.

Data Digest

The Growing Generational Gap in the CDO Community

If you Google 'chief data officer', you'll come across a large number of links to articles, white papers etc. entitled "The Emergence of the CDO" or something along those lines.

How long is the CDO going to be emerging for? There are many Chief Data Officers/Offices that have been in situ for a number of years - who've fully emerged.

At Corinium, we pride ourselves on the work we've done supporting the emerging C-suites in the data & analytics community. Just over 3 years ago, when the CDO and CAO titles were mostly new, the networks we created were invaluable to those just getting to grips with the role.

But now, many of those new CDOs are not-so-new CDOs anymore. Their roles, challenges and objectives have evolved. And yet, new CDOs come into existence all the time as companies realise the importance of their data and the need to have someone take leadership of it.

Before I get into this further, I want to clarify that by CDO, I refer not only to the actual Chief Data Officer but also to the Chief Data Office. In some instances, the Chief Data Officer in a company may have been a CDO for 2-3 years yet the company they currently work for has just entered the world of CDO. The generational gap applies to people and companies.

First Generation CDOs

I define a First Generation CDO as someone who has recently taken on their first CDO role and/or a company that has recently employed their first CDO. Person and company have the same challenges at this point.

Our research indicates that First Generation CDOs are typically responsible for the following:

  • Assessing the state of the company's data which, more often than not, is in some state of disrepair
  • Developing a data warehouse that acts as a central repository for all of the company's data
  • Defining the data governance, management and ownership frameworks that will ensure data quality is created and maintained
  • Working with business to understand data requirements and then how to provide them with this data - whilst maintaining control (see point above)
  • Assisting in developing reporting tools through BI, visualisation tools, etc.
  • Ensuring compliance with data or sector-specific regulations - this is mostly true in the financial services sector where most companies employ a CDO to make sure that they're compliant with regulations

In addition to these key responsibilities, the First Generation CDO is concerned with communicating his/her value to the organisation, evangelising the importance of data and building business processes to ensure efficient workflows.

These are all the critical, foundation-building baby steps that need to be taken before moving into the next phase or generation of CDO-ship. The CDO is producing step-changes within the organisation's data architecture and usage. Change is big but perhaps does not deliver significant business value.

Second Generation CDOs

Our research has uncovered that, in the second phase, there is almost a shift away from being a CDO toward being a Chief Analytics Officer. The ultimate end goal for data in an organisation is to have the ability to use it to make good decisions in real time.

It's based on this that I assess the Second Generation CDO as being increasingly responsible for, and involved with, advanced business analytics. The first phase was all about getting data in shape for it to be utilised accurately across the business.

It's at this point where the CDO is involved in the optimisation and transformation of the business.

Innovation through data!

The Second Generation CDO becomes an integral part of eking out the hard-fought percentages of business improvement and profitability. Innovation across the organisation has come to rely on data. Product innovation takes a look at how the product is performing, who buys it, how it is manufactured etc. to evolve it and keep its competitive edge. None of this could be done without access to good quality data in real time - or at least in a short space of time.

I've spoken to a number of CDOs who are at the forefront of their company's digital transformation since data (and analytics) plays such a critical role in digitising a business. It's perhaps for this reason that a number of Chief Data Officers have moved into Chief Digital Officer roles.

What does the Future Look Like?

It's not easy to predict what the next generation will look like and how they'll act. I don't think Facebook anticipated that the current youngest generation would not be big users of the service.

I am intrigued by the development of the CDO - and for that matter the CAO and the Chief Digital Officer. Could they all blend into one? Or will they be obsolete in a few years and replaced with a new emerging C-suite?

For now, we'll keep focused on understanding the different needs of the First Generation and the Second Generation of CDOs to ensure we can provide relevant information and networks. It is likely that the next generation will only emerge in 4-5 years' time.

By Craig Steward: 

Craig Steward is the Managing Director for Corinium’s EMEA business. His research is uncovering the challenges and opportunities that exist for CDOs and CAOs and the Forums will bring the market together to map the way forward for these important roles. For more information contact Craig on
Roaring Elephant Podcast

Episode 22 - Big Data in Small Business

The main subject of this episode is our answer to a listener question we received a couple of months ago: How can big data help small businesses? In what ways can small businesses use big data? At the moment all the talk is about big data helping enterprise firms. Apart from that, we introduce a new section which we hope you will enjoy!
Teradata ANZ

Can we defeat DDoS using analytics?

Distributed Denial of Service (DDoS) attacks have been in the news recently, with one particularly prominent incident garnering national attention in the past week. Whilst the jury is still out on the nature and cause of that alleged attack, it should be remembered that DDoS attacks have been occurring for many years. In fact, you could say that students calling the White House en masse in the 60’s to protest against President Johnson’s involvement in the Vietnam War, in an attempt to flood the White House switchboard and prevent telephone communications, were mounting an early DDoS attack.

Yet we now live in a connected era where there are billions of devices connected to the internet and these can be commandeered to participate in a DoS attack. Attacks can be coordinated by foreign countries against another country’s infrastructure, by organised criminal groups or even by a kid down the road in his bedroom on his laptop. Of course the sophistication of these attacks varies widely, and state-sponsored attacks are generally well funded and executed by highly skilled teams of individuals.

So are we ever going to see an end to these types of attacks? Most probably not. Instead, expect to see more and more of these attacks as they mutate and find new ways to flood foreign networks. Major events held online are going to be obvious targets for DDoS attacks because of the kudos the attackers can claim within their communities. However, you should assume any site or service connected to the net could be a target.

It is very hard to defend against these attacks because of the many different ways in which hackers may strike. Distinguishing between legitimate and malicious traffic is a complex task. Setting up filtering by hand is often impossible due to the large number of hosts involved in the attack.

Each organisation has multiple front-end points connected to the internet, including email, web and name servers. But there’s also a range of back-end servers, such as databases, that are at risk simply because hitting the front-end functions imposes a high load on the back-end sources. So our first problem area is to identify each of the potential attack points in our organisation. Secondly, attackers may use new methods or modify existing attacks to circumvent established defence mechanisms. Static defences do not work if a yet-unknown attack is used. Instead our systems need to adapt to new types of attack.

Also keep in mind that there still is a proportion of bona fide service requests to use the service. This makes it harder to inspect the traffic and to work out a classification scheme for traffic filtering. Since not all incoming requests can be assumed to be part of the attack it is more complex to derive appropriate filtering rules. If the filters chosen are too specific they do not block the attack, and if they are made too general they may block legitimate traffic.

However, as defenders of good, we seek to solve these problems through the application of analytical techniques to detect DDoS attacks. A widely diverse range of statistical methods and machine learning techniques could be used to detect abnormal changes in the resource usage that are indicative of a DDoS attack. However, both approaches have their limitations. I’ll outline a few below:

Decision Trees
In decision tree learning, a decision tree is used as a predictive model in which observations about an item are mapped to conclusions about the item’s target value. The algorithm learns a model from a labelled data set, so whenever a new data item is given for classification, it is classified according to what was learned from the previous data. A decision tree can be trained on network packets that are labelled with a known outcome.
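To make the "observations mapped to conclusions" idea concrete, here is a small sketch of how a decision tree classifies an item once it exists. The tree below is written by hand purely for illustration; in practice it would be learned from labelled traffic. The feature names `packets_per_second` and `syn_fraction` and their thresholds are hypothetical.

```python
# A tree node is either a label (str) or a tuple:
# (feature_name, threshold, subtree_if_below, subtree_if_at_or_above)
TREE = (
    "packets_per_second", 1000,
    "normal",
    ("syn_fraction", 0.8,
        "normal",
        "attack"),
)


def classify(tree, observation):
    """Walk from the root to a leaf, testing one feature per node."""
    while not isinstance(tree, str):
        feature, threshold, below, at_or_above = tree
        tree = below if observation[feature] < threshold else at_or_above
    return tree


quiet = classify(TREE, {"packets_per_second": 200, "syn_fraction": 0.9})
busy = classify(TREE, {"packets_per_second": 5000, "syn_fraction": 0.1})
flood = classify(TREE, {"packets_per_second": 5000, "syn_fraction": 0.95})
```

Each path from root to leaf reads as a plain rule, which is one reason decision trees are popular when the result has to be explained to an operator.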

K-means clustering
K-means clustering is a method commonly used to automatically partition a data set into k groups. K-means and Naïve Bayes classification techniques can be used to classify whether network packets are normal or are a DoS attack.
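As a rough sketch of the partitioning step, the code below runs the standard k-means iteration (Lloyd's algorithm) on one-dimensional data with k = 2. The request rates and the starting centroids are invented, and a fixed starting point is used instead of random initialisation so the result is reproducible.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical requests per second observed from eight client addresses.
requests_per_second = [12, 15, 11, 14, 13, 480, 510, 495]
centroids, clusters = k_means_1d(requests_per_second, centroids=[0.0, 100.0])
```

K-means only groups the data; it does not say which group is the attack. Deciding that the high-rate cluster is suspicious is a separate judgement made by an analyst or a downstream rule.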

Support Vector Machines (SVM)
SVMs are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM algorithm builds a model that predicts whether a new example falls into one of the two categories.

Naïve Bayes
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
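The following sketch shows a categorical Naive Bayes classifier with add-one (Laplace) smoothing. The training examples, with a protocol and a coarse packet-rate bucket as the two features, are made up for illustration, and the helper names are assumptions rather than any particular library's API.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label). Returns the counts needed to predict."""
    label_counts = Counter(label for _, label in examples)
    # feature_counts[label][position][value] = number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for features, label in examples:
        for position, value in enumerate(features):
            feature_counts[label][position][value] += 1
            values_seen[position].add(value)
    return label_counts, feature_counts, values_seen


def predict(model, features):
    """Return the label with the highest log posterior under the independence assumption."""
    label_counts, feature_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for position, value in enumerate(features):
            # Add-one smoothing so an unseen value does not zero out the product.
            numerator = feature_counts[label][position][value] + 1
            denominator = count + len(values_seen[position])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled traffic: (protocol, packet-rate bucket) -> class label.
training = [
    (("tcp", "high"), "attack"),
    (("tcp", "high"), "attack"),
    (("udp", "high"), "attack"),
    (("tcp", "low"), "normal"),
    (("udp", "low"), "normal"),
    (("tcp", "low"), "normal"),
]
model = train_naive_bayes(training)
high_rate = predict(model, ("tcp", "high"))
low_rate = predict(model, ("udp", "low"))
```

The "naive" part is the assumption that features are independent given the class, which is why the per-feature probabilities can simply be multiplied (added, in log space).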

There are a host of other approaches available, but the ones listed above are by far the most popular/common.

A key feature in detecting DDoS attacks in real-time is to access the data as soon as it is collected. This is where an architecture that leverages streaming technologies and a platform for large scale processing is needed. Products like Apache NiFi are built for the Internet of Things (IoT) and to stream data from source to processing platform. Processing platforms such as Hadoop or Aster are perfect platforms for analysing the data at scale. Scale is important because of the sheer volume of data that needs to be sorted and sifted through for analysis. Detecting events 1-2 hours after the event is simply not good enough for an Enterprise grade platform. If you are serious about stopping DDoS against your organisation before you end up in the media, then you need to consume and process at scale.

The post Can we defeat DDoS using analytics? appeared first on International Blog.


August 15, 2016

M. Kinde

Discovering New Opportunities for Urban Design in American Cities

“What a city has to say must find expression in its architecture.” Walter Wallmann, Lord Mayor of Frankfurt/Main Here in the U.S., we tend to think of our built environment –- our cities and towns...


David Corrigan

Customer IQ – You Know it When you See It

Sometimes it is just obvious that an organization has a high Customer IQ.  I once experienced it from an electronics manufacturer.  My LCD TV was starting to break down – there were horizontal lines...


August 14, 2016

Simplified Analytics

9 traits to become successful Digital Transformation leader

Digital Transformation is inevitable. Across all industries, from consumer goods to health care, manufacturing to financial services, companies are going digital. Digital technologies from social...


August 12, 2016

Revolution Analytics

Mapping medical cannabis dispensaries: from PDF table to Google Map with R

by Sheri Gilley, Microsoft Senior Software Engineer In 2014, Illinois passed into law the creation of a medical cannabis pilot program. As my son has cancer and marijuana could greatly help with some...


Revolution Analytics

Because it's Friday: The squirrel's POV

A squirrel thought a GoPro Camera was a nut, and grabbed it and took it up a tree, providing a squirrel's eye view of the branch-based highways of squirrel-land (via Gizmodo): Keep an eye out for...


Revolution Analytics

Tuning Apache Spark for faster analysis with Microsoft R Server

My colleagues Max Kaznady, Jason Zhang, Arijit Tarafdar and Miguel Fierro recently posted a really useful guide with lots of tips to speed up prototyping models with Microsoft R Server on Apache...


August 11, 2016

Revolution Analytics

R Packages for Data Access

by Joseph Rickert Data Science is all about getting access to interesting data, and it is really nice when some kind soul not only points out an interesting data set but also makes it easy for you to...