
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


April 25, 2017

Revolution Analytics

Using checkpoint with knitr and RStudio

The knitr package by Yihui Xie is a wonderful tool for reproducible data science. I especially like using it with R Markdown documents, where with some simple markup in an easy-to-read document I can...

Silicon Valley Data Science

Creating Simple Data Visualizations as an Act of Kindness

Editor’s note: Julie recently spoke on this subject at TDWI Accelerate Boston. To sign up for her slides (and our other Accelerate presentations), visit our event page

A lot of people who are new to data visualization feel that they have to design something novel and amazing in order for it to have impact. The field of data visualization is still quite young and evolving rapidly, and tools like the web and VR are continuing to expand the possibilities. So there is a lot of room for exploring new possibilities and creating new formats, as well as many examples of novel and amazing visualizations. And this should indeed inspire us as designers. But novelty is the icing, not the cake. It is not the core of what makes a data visualization successful (or not).

Irene Ros, a colleague and friend and someone whom I admire very much, recently wrote a post about her own evolution as a designer, titled In Defense of Simplicity, A Data Visualization Journey. In it, she describes how she came to a design philosophy anchored around usefulness:

Sure, it’s really wonderful to have recognition from my peers in the industry, but it’s actually even more wonderful to build a really simple tool for small clinic practitioners to track their patient experience data in a digital way for the first time; to show and explain to them a box plot and suddenly see them make use of it. A box plot is never going to win awards, but a well crafted tool that is simple to use is going to make someone’s life better, or at least a little easier.

I cannot second this statement any more strongly. A successful data visualization is one that communicates simply and well, so that the reader can make better decisions and take more useful actions. In order for this to happen, design needs to get out of the way.

Other designers have also touted the importance of keeping design spare in the service of usefulness. Edward Tufte, whom many consider the forefather of modern data visualization, famously coined the term chartjunk to refer to any line, color, or other element that does not directly contribute to the reader’s fundamental understanding of a graph. But designer Sha Hwang put it the most plainly: “Simplicity is clarity is kindness.”

In the quest for unique design, many new designers avoid visualization formats they find to be cliche or boring—bar charts and line graphs especially. But these formats are classics for a reason: they are highly effective! And because they are familiar, a lot of people can read them quickly.

Most of the time, making a reader decipher your new format or learn a new visual shorthand just slows them down, or even frustrates them to the point of walking away. It obfuscates the insight. Better to use a known format clearly and simply. It’s a bit like public speaking: if you stick to common vocabulary, your chance of being understood is much greater. You can stretch the audience by using a broad vocabulary if necessary or for interest, but you shouldn’t use a complicated word where a simpler one will do.

Sometimes, however, there really isn’t a “word” that means what you’re trying to convey, and you need to coin a new one. In these situations, a custom design may allow the data to become intuitive in a way that it couldn’t otherwise. My favorite classic example of this is the Periodic Table of the Elements: the physical shape of the table and the placement of each element, along with secondary signals such as color, reveal that certain characteristics of each element recur in regular intervals—periodically. But students often need to be taught how to read the table; it’s not as familiar or immediately accessible as a bar chart or line graph.

This is the trade-off of custom formats. What you gain in eventual insight and impact, you lose in the time it takes a reader to get to that insight, and possibly in attention and focus. Make that trade-off very carefully.

Your true goal should be to communicate with your reader and to give them that “aha!” moment of insight, not to impress them with your design skills. Good design is nearly unnoticeable, because it does not distract or call attention to itself.

And make no mistake: there is still plenty of room for good design within familiar formats. A great deal of skill and editing are required to make the supporting elements of a chart fade away and feel invisible, while allowing the data to shine. Think of it like the difference between a duck—appearing to float serenely on the water while paddling like mad underneath—and a dog, overeager and splashing everywhere.

The temptation to be the splashing dog is ubiquitous, even for experienced designers. In The Practical Guide to Information Design, Ronnie Lipton writes, “The quest for style, creativity, and peer awards often drives designers away from clarity, even when they know how to achieve it.” But resisting this temptation, and putting away your design ego in favor of serving the reader, is the right thing to do.

When the design gets out of the way of the data, serving and supporting it rather than decorating it, insights can happen for the reader without their having to jump a lot of hurdles or perform extra mental work. Decisions can be made, actions taken. Instead of something amazing, you will have designed something useful. Clear. And kind.

Do you have thoughts on how to simplify data visualization? Share them in the comments.

The post Creating Simple Data Visualizations as an Act of Kindness appeared first on Silicon Valley Data Science.


The Rise of Troy: How Great Leaders Impact Our Deep Web Technologies

Troy Mentzer has worked for big companies and state government, but neither really challenged him enough. Troy wanted something different. He wanted a career where his daily work would directly impact the business. As our Data Acquisition Department Manager, Troy accomplishes this and so much more. Like a team captain, Troy works closely with our engineer team […] The post The Rise of Troy: How Great Leaders Impact Our Deep Web Technologies appeared first on BrightPlanet.

Big Data University

This Week in Data Science (April 25, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News


Featured Courses From BDU


Upcoming Data Science Events


Cool Data Science Videos

The post This Week in Data Science (April 25, 2017) appeared first on BDU.


April 24, 2017

Revolution Analytics

SQL Server 2017 to add Python support

One of the major announcements from yesterday's Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you...

Cloud Avenue Hadoop Tips

Creating a Linux AMI from an EC2 instance

In one of the earlier blogs, we created a Linux EC2 instance with a static page. We installed and started the httpd server and created a simple index.html in the /var/www/html folder. The application on that instance is very basic, but the same approach works for more complex applications.

Let's say we need 10 such instances. There is no need to install the software and repeat the configuration on each one. Instead, once the software has been installed and the necessary configuration changes have been made, we can create an AMI (Amazon Machine Image). The AMI contains the OS and the required software with the appropriate configuration, and the same AMI can be used when launching new EC2 instances. More about the AMI here.

So, here are the steps.

1. Create a Linux EC2 instance with the httpd server and index.html as shown here.

2. Once the required software has been installed and configured, the AMI can be created as shown below.

3. Give the image name and description and click on `Create Image`.

4. It takes a couple of minutes to create the AMI. Its status can be seen by clicking on the AMI link in the left side pane. Initially the AMI will be in a pending state; after a few minutes it will change to the available state.

5. When launching a new EC2 instance, we can select the AMI created in the steps above from the `My AMIs` tab. The AMI already contains all the software and configuration, so there is no need to repeat the installation.

6. As we are simply learning and trying things out, not running anything in production, make sure to a) terminate all the EC2 instances, b) deregister the AMI, and c) delete the snapshot.
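The console steps above can also be scripted. The following is a minimal sketch in Python, assuming `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`). The function names, the polling interval, and the way snapshots are looked up are illustrative choices, not something taken from the original walkthrough.

```python
import time


def create_ami_and_wait(ec2, instance_id, name, description,
                        poll_seconds=15, max_polls=40):
    """Create an AMI from an instance and wait until it is 'available'.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    image_id = ec2.create_image(
        InstanceId=instance_id, Name=name, Description=description
    )["ImageId"]
    for _ in range(max_polls):
        # The AMI starts in the 'pending' state and later becomes 'available'.
        state = ec2.describe_images(ImageIds=[image_id])["Images"][0]["State"]
        if state == "available":
            return image_id
        if state == "failed":
            raise RuntimeError(f"AMI {image_id} failed to build")
        time.sleep(poll_seconds)
    raise TimeoutError(f"AMI {image_id} not available after {max_polls} polls")


def clean_up(ec2, instance_ids, image_id):
    """Step 6: terminate instances, deregister the AMI, delete its snapshots."""
    # Look up the snapshot ids first; they are listed in the image's
    # block device mappings while the image is still registered.
    image = ec2.describe_images(ImageIds=[image_id])["Images"][0]
    snapshot_ids = [
        mapping["Ebs"]["SnapshotId"]
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})
    ]
    ec2.terminate_instances(InstanceIds=instance_ids)
    ec2.deregister_image(ImageId=image_id)
    for snapshot_id in snapshot_ids:
        ec2.delete_snapshot(SnapshotId=snapshot_id)
    return snapshot_ids
```

Passing the client in as an argument keeps the sketch free of credentials and region settings, which depend on your own AWS setup.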

Now that we know how to create an AMI with all the required software, configuration, and data, we will look at AutoScaling and ELB (Elastic Load Balancers) in the upcoming blogs.

Revolution Analytics

R 3.4.0 now available

R 3.4.0, the latest release of the R programming language (codename: "You Stupid Darkness"), is now available. This is the annual major update to the R language engine, and provides improved...

Cloud Avenue Hadoop Tips

Creating an alarm in CloudWatch for a Linux EC2 instance

AWS CloudWatch does a couple of things, as mentioned in the AWS documentation here. One of the interesting ones is that it can gather metrics (CPU, network, disk, etc.) for the different Amazon resources.

In this blog, we will look at the CPU metrics of a Linux EC2 instance and trigger an alarm with a corresponding action. To keep it simple, if the CPU utilization of the Linux instance goes beyond 50%, we will be notified by email. CloudWatch can watch many different metrics; we will use CPU utilization for now. CloudWatch also allows us to create custom metrics, such as application response time.

I will go with the assumption that the Linux instance has already been created as shown in this blog. So, here are the steps.

1. In the Linux instance make sure that the detailed monitoring is enabled. If not then enable it from the EC2 management console as shown below.

2. Go to the CloudWatch management console and select Metrics on the left pane. Select EC2 in the `All Metrics` tab.

3. Again select the `Per-Instance Metrics` from the `All Metrics` tab.

4. Go back to the EC2 management console and get the `Instance Id` as shown below.

5. Go back to the CloudWatch management console. Search for the InstanceId and select CPUUtilization. The line graph will now be populated for CPUUtilization as shown below. We can also select multiple metrics at the same time.

6. The graph is shown with a 5 minute granularity. Changing it to 1 minute granularity gives a much smoother graph. Click on the `Graphed Metrics` tab and change the Period to `1 minute`.

7. Now, let's try to create an alarm. Click on Alarm in the left pane and then on the `Create Alarm` button.

8. Click on the `Per-Instance Metrics` under EC2. Search for the EC2 instance id and select CPUUtilization as shown below.

9. Now that the metric has been selected, we have to define the alarm. Click on `Next`. Specify the Name, Description and CPU % as shown below.

10. When the alarm threshold is breached (CPU > 50%), a notification has to be sent. On the same screen, scroll down a bit and specify the notification by clicking on `New list` as shown below.

11. Click on `Create Alarm` in the same screen. To avoid spamming, AWS will send an email with the confirmation link which has to be clicked. Go ahead and check your email for confirmation.

12. Once the link has been clicked in the email, we will get a confirmation as below.

13. Now we are all set up to receive an email notification when the CPU > 50%. Log in to the Linux instance as mentioned in this blog and run the command below. It doesn't do anything useful; it simply drives up the CPU usage on the instance. There are other ways to increase the CPU usage, more here.
dd if=/dev/urandom | bzip2 -9 >> /dev/null

14. If we go back to the Metrics then we will notice that the CPU has spiked on the Linux instance because of the above command.

15. And the Alarm also moved from the OK status to the ALARM status as shown below.

16. Check your email; there will be a notification from AWS about the breach of the alarm threshold.

17. Make sure to delete the Alarm and terminate the Linux instance.
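For readers who prefer a script to the console, the alarm defined in steps 7 to 11 can be approximated as below. This is a sketch, assuming `cloudwatch` is a boto3 CloudWatch client (for example `boto3.client("cloudwatch")`) and `topic_arn` is the ARN of an SNS topic with a confirmed email subscription. The function names and the alarm name pattern are hypothetical.

```python
def cpu_alarm_params(instance_id, topic_arn, threshold=50.0, period_seconds=60):
    """Build the keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": f"CPU above {threshold:g}% on {instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,        # 60 s needs detailed monitoring (step 1)
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],     # SNS topic that sends the email
    }


def create_cpu_alarm(cloudwatch, instance_id, topic_arn):
    """Create the alarm; `cloudwatch` is assumed to be a boto3 CloudWatch client."""
    cloudwatch.put_metric_alarm(**cpu_alarm_params(instance_id, topic_arn))
```

Keeping the parameters in a plain dictionary makes it easy to inspect or adjust the threshold and period before anything is sent to AWS.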

The procedure is a bit long and it took me a good amount of time to write this blog, but once you do it a couple of times and understand what's being done, it will be a piece of cake.

April 22, 2017

Simplified Analytics

A to Z of Analytics

Analytics has taken the world by storm and it is the powerhouse for all the digital transformation happening in every industry. Today everybody is generating tons of data – we as consumers leaving...


Simplified Analytics

Beyond SMAC – Digital twister of disruption!!

Have you seen the 1996 movie Twister, based on tornadoes disrupting the neighborhoods? A group of people were shown trying to perfect the devices called Dorothy which has hundreds of sensors to be...


April 21, 2017

Revolution Analytics

Because it's Friday: Secrets of the London Underground

I guess these have been around for a few years now, but I recently stumbled across these short videos dedicated to each of the 11 lines of the London Underground. Presented by Geoff Marshall (who...


Revolution Analytics

Reproducible Data Science with R

Yesterday, I had the honour of presenting at The Data Science Conference in Chicago. My topic was Reproducible Data Science with R, and while the specific practices in the talk are aimed at R users,...


April 20, 2017

Silicon Valley Data Science

How to Choose a Data Format

Editor’s Note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. We’ll also be at DataEngConf on April 28th, talking about data formats; learn more and sign up for our slides here

It’s easy to become overwhelmed when it comes time to choose a data format. Picture it: you have just built and configured your new Hadoop Cluster. But now you must figure out how to load your data. Should you save your data as text, or should you try to use Avro or Parquet? Honestly, the right answer often depends on your data. However, in this post I’ll give you a framework for approaching this choice, and provide some example use cases.

There are different data formats available for use in the Hadoop Distributed File System (HDFS), and your choice can greatly impact your project in terms of performance and space requirements. The findings provided here are based on my team’s past experiences, as well as tests that we ran comparing read and write execution times with files saved in text, Apache Hadoop’s SequenceFile, Apache Avro, Apache Parquet, and Optimized Row Columnar (ORC) formats.

More details on each of these data formats, including characteristics and an overview on their structures, can be found here.

Where to start?

There are several considerations that need to be taken into account when trying to determine which data format you should use in your project; here, we discuss the most important ones you will encounter: system specifications, data characteristics, and use case scenarios.

System specifications

Start by looking at the technologies you’ve chosen to use, and their characteristics; this includes tools used for ETL (Extract, Transform, and Load) processes as well as tools used to query and analyze the data. This information will help you figure out which format you’re able to use.

Not all tools support all of the data formats, and writing additional data parsers and converters will add unwanted complexity to the project. For example, as of the writing of this post, Impala does not offer support for ORC format; therefore, if you are planning on running the majority of your queries in Impala then ORC would not be a good candidate. You can, instead, use the similar RCFile format, or Parquet.

You should also consider the reality of your system. Are you constrained on storage or memory? Some data formats can compress more than others. For example, datasets stored as Parquet and ORC with snappy compression can reduce their size to a quarter of the size of their uncompressed text format counterpart, and Avro with deflate compression can achieve similar results. However, writing into any of these formats will be more memory intensive, and you might have to tune the memory settings in your system to allocate more memory. There are many options that can be tweaked to adapt your system, and it is often desirable to run some tests in your system before fully committing to using a format. We will talk about some of the tests that you can run in a later section of this post.

Characteristics and size of the data

The next consideration is around the data you want to process and store in your system. Let’s look at some of the aspects that can impact performance in a data format.

How is your raw data structured?

Maybe you have regular text format or csv files and you are considering storing them as such. While text files are readable by humans, easy to troubleshoot, and easy to process, they can impact the performance of your system because they have to be parsed every time. Text files also have an implicit format (each column is a certain value) and if you are not careful documenting this, it can cause problems down the line.

If your data is in XML or JSON format, then you might run into some issues with file splittability in HDFS. Splittability is the ability to process parts of a file independently, which in turn enables parallel processing in Hadoop. If your data is not splittable, you lose the parallelism that allows fast queries. More advanced data formats (Sequence, Avro, Parquet, ORC) offer splittability regardless of the compression codec.

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

How many columns are being stored and how many columns are used for the analysis?

Columnar data formats like Parquet and ORC offer an advantage in querying speed when you have many columns but only need a few of them for your analysis, because a query can read just the columns it needs instead of every full row. However, that advantage can be lost if you still need all the columns during search, in which case you could experiment within your system to find the fastest alternative. Another advantage of columnar files is in the way they compress the data, which saves both space and time.
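A toy illustration of why the column layout helps when only a few columns are needed. It uses plain Python lists rather than Parquet or ORC, and the dataset and column names are made up for the example; it only counts how many values each layout has to touch to sum a single column.

```python
# Made-up dataset: 1000 records with four columns each.
rows = [
    {"user_id": i, "country": "US" if i % 2 else "DE",
     "clicks": i * 3, "revenue": i * 1.5}
    for i in range(1000)
]

# Row-oriented layout: complete records stored one after another.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: one list of values per column.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def total_clicks_row(store):
    """Sum the clicks column; every field of every record is touched."""
    values_read = 0
    total = 0
    for record in store:
        values_read += len(record)
        total += record[2]  # clicks is the third field
    return total, values_read


def total_clicks_columnar(store):
    """Sum the clicks column; only that column's values are touched."""
    column = store["clicks"]
    return sum(column), len(column)
```

Both functions return the same total, but the row layout touches four values per record while the column layout touches one. If the query needed all four columns, the two counts would be equal, which mirrors the caveat above.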

Does your data change over time? If it does, how often does it happen and how does it change?

Knowing whether your data changes often is important because then we have to consider how a data format handles schema evolution. Schema evolution is the term used when the structure of a file changes after it has previously been stored with a different structure. Such changes can include a change of data type for a column, the addition of columns, and the removal of columns. Text files do not explicitly store the schema, so when a new person joins the project it is up to them to figure out what columns and column values the data has. If your data changes suddenly (addition of columns, deletion of columns, changes to the data types) then you need to figure out how to reconcile older data and new data with the format.

Certain file formats handle schema evolution more elegantly than others. For example, at the moment Parquet only allows the addition of new columns at the end of the existing columns and it doesn’t handle deletion of columns, whereas Avro allows for addition, deletion, and renaming of multiple columns. If you know your data is bound to change often (maybe developers add new metrics every few months to help track usage of an app) then Avro will be a good option. If your data doesn’t change often or won’t change, schema evolution is not needed.
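To make the idea of reconciling old data with a new schema concrete, here is a small sketch in plain Python. It is loosely modelled on how Avro resolves a newer reader schema against older records (defaults for added fields, aliases for renamed ones, dropped fields ignored), but it does not use the Avro library; the `resolve` function and the dictionary-based schema shape are invented for illustration.

```python
def resolve(record, reader_schema):
    """Project an older stored record onto a newer reader schema.

    reader_schema maps each field name to an optional spec:
    {"default": value} for added fields, {"aliases": [old_names]} for renames.
    Fields present in the record but absent from the schema are dropped.
    """
    resolved = {}
    for name, spec in reader_schema.items():
        for candidate in [name] + spec.get("aliases", []):
            if candidate in record:
                resolved[name] = record[candidate]
                break
        else:
            # Field is missing from the old record: fall back to its default.
            if "default" not in spec:
                raise KeyError(f"no value or default for field {name!r}")
            resolved[name] = spec["default"]
    return resolved


# A record written with the old structure.
old_record = {"user": "ann", "clicks": 3, "legacy_flag": True}

# The newer structure: one rename, one new column, one deleted column.
new_schema = {
    "user_name": {"aliases": ["user"]},
    "clicks": {},
    "session_seconds": {"default": 0},
}

upgraded = resolve(old_record, new_schema)
```

Here `upgraded` carries the renamed `user_name`, keeps `clicks`, fills `session_seconds` with its default, and no longer contains `legacy_flag`.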

Additional things to keep in mind with schema evolution are the trade-offs of keeping track of the newer schemas. If the schema for a data format like Avro or Parquet needs to be specified (rather than extracted from the data) then we will require more effort storing and creating the schema files.

Use case scenarios

Each of the data formats has its own strengths, weaknesses, and trade-offs, so the decision on which format to use should be based on your specific use cases and systems.

If your main focus is to be able to write data as fast as possible and you have no concerns about space, then it might be acceptable to just store your data in text format with the understanding that query times for large data sets will be longer.

If your main concern is being able to handle evolving data in your system, then you can rely on Avro to save schemas. Keep in mind, though, that when writing files to the system Avro requires a pre-populated schema, which might involve some additional processing at the beginning.

Finally, if your main use case is analysis of the data and you would like to optimize the performance of the queries, then you might want to take a look at a columnar format such as Parquet or ORC because they offer the best performance in queries, particularly for partial searches where you are only reading specific columns. However, the speed advantage might decrease if you are reading all the columns.

There is a pattern in the use cases mentioned above: if a file takes longer to write, it is because it has been optimized to increase speed during reads.
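The three scenarios above can be summarized as a small lookup. This is only a restatement of the guidance in this post as code, not a rule that holds for every system; the function name and the priority labels are invented for the sketch, and the Impala special case reflects the support situation described earlier in the post at the time of writing.

```python
def suggest_format(priority, query_engine=None):
    """Return a starting-point format for a given main priority.

    priority: 'write_speed', 'schema_evolution', or 'query_speed'.
    query_engine: optional, e.g. 'impala' or 'hive'.
    """
    if priority == "write_speed":
        # Fast writes, no space concerns; queries on large data will be slower.
        return "text"
    if priority == "schema_evolution":
        # Avro stores schemas and handles added, deleted, and renamed columns.
        return "avro"
    if priority == "query_speed":
        # Columnar formats; the post notes Impala did not support ORC.
        return "parquet" if query_engine == "impala" else "parquet or orc"
    raise ValueError(f"unknown priority: {priority!r}")
```

Whatever the function suggests, it is still worth running the kind of read and write tests described in the next section on your own data before committing.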


We have already discussed how choosing the right data format for your system depends on several factors; to provide a more comprehensive explanation, we set out to empirically compare the different data formats in terms of performance for writing and reading files.

We created some quantitative tests comparing the following five data formats available in the Hadoop ecosystem:

– Text
– Sequence
– Avro
– Parquet
– ORC

Our tests measure execution times for a couple of different exploratory queries, using the following technologies:

– Hive
– Impala

We tested against three different datasets:

  • Narrow dataset – 10,000,000 rows of 10 columns resembling an Apache log file
  • Wide dataset – 4,000,000 rows of 1000 columns, where the first few columns hold personal identification data and the rest are filled with random numbers and booleans.
  • Large wide dataset – 1 TB of the wide dataset, 302,924,000 rows.

Full results of the testing, along with the appropriate code that you can try on your own system, can be found here.

The post How to Choose a Data Format appeared first on Silicon Valley Data Science.

Teradata ANZ

We love SAP (data) !

We love SAP! Who wouldn’t, and who can argue with this sentiment given their success in utilities and many other industries? For all the bad press they occasionally get about failed projects they were central to, such as the well-publicised situation at RWE Npower[1], many utilities operations run relatively smoothly with SAP at their core. And just to prove I am being genuine in my praise, I have worked on very successful legacy to SAP migration projects myself in a past life!

Back to today, a little more about exactly why I really do love Germany’s best export since the Volkswagen Golf GTI Mk1[2]:
• SAP looks after all that day to day operational stuff which means people like us can focus on innovation and pushing the boundaries.
• In both the customer and asset worlds SAP often holds a static data view of customers and assets which we can use alongside lots of other data at the heart of the new digital utility.

Building on this second point, you hear a lot about the Internet of Things (IoT) and the ability of these things to create and send back data to be used for cool new stuff, often as part of a digital transformation programme (as an aside, and on the subject of digitalisation, check out this summary of the presentation by Enedis at our EMEA conference last week[3]). To simplify the IoT for the purposes of this blog, these things are usually assets (large or small) or people, about which basic information, or if you like specifications, is often held as data in SAP.


The digital utility is built on data and relies on integrating this data from SAP[4] with a wide range of other data such as IoT sensor data, maintenance logs, corporate and financial data, HR data, marketing and communications history, external data about the weather, or cities, or…. the list is genuinely infinite. Again coming back to my second bullet on digital innovation, this is how we roll at Teradata. To clear up another myth, SAP HANA is not an option here. It will help in that day to day operational space to provide extra horsepower and performance for baseline BI, but this integration of SAP data with other data at scale remains an issue.

The other important point to note is how hard it is to get data out of SAP. Looking back at that migration project to get data into SAP, there was a combined team of 50+ people working on that alone for well over two years! Flip this around and you can see why getting data out of SAP in order to use it alongside other data becomes such an issue. Luckily we have some sharp guys and girls at Teradata that have made it their life’s work to get at this data quickly and automatically so you do not need to worry about how to do it, or the cost of employing a small army!

To conclude, getting data out of SAP requires some unique tools and capabilities. But it’s well worth the investment. Used alongside other data it can fix today’s problems and lower costs in your asset businesses, keep customers and boost margins in retail and do much more besides. And this is all a key part of your digital future. SAP is a key component of any business, but especially from a data perspective it is not the be all and end all.



[1] I am not expressing a view on how culpable SAP may or may not be! I am sure they dispute these claims!

[2] No, the Mk2 is not better than the original… and yes I will own a Mk1 one day!

[3] This presentation had nothing to do with SAP specifically, but is an excellent example of why digitalisation is so important and how important data is for digitalisation.

[4] Other customer and asset repositories, and wider ERP systems are available clearly! SAP is very dominant however.

The post We love SAP (data) ! appeared first on International Blog.


April 19, 2017

Revolution Analytics

Microsoft R Server 9.1 now available

During today's Data Amp online event, Joseph Sirosh announced the new Microsoft R Server 9.1, which is available for customers now. In addition the updated Microsoft R Client, which has the same...

Silicon Valley Data Science

Realize the Business Power of Your Data

If you are on the path to being a data-driven company, you have to be on the path to being a development-enabled company. While people often spend a lot of effort choosing software components for their enterprise data infrastructure, they overlook the tools and processes that will make those software choices valuable.

How can you avoid this pitfall? Use DevOps principles to develop truly strong data systems. We detail the steps to do this in our report “How to Establish Software Capabilities”. The report walks you through:

  • why you should care about software development best practices
  • understanding the impact of data and distributed systems on development and operations
  • the capabilities and practices that will help your efforts succeed.

At the end of the report, we’ve included a DevOps checklist—use it to ensure you’re approaching your data systems correctly. The image below gives a taste of our checklist; download the PDF for the whole thing.

Portion of our DevOps checklist

Are you hitting any walls while setting up your data systems? Share your story in the comments.

The post Realize the Business Power of Your Data appeared first on Silicon Valley Data Science.


WEBINAR – Tyson Johnson: Building an Online Anti-Fraud Open Source Monitoring Program

If you want to give a business professional a headache by uttering a single word, one word that will surely suffice is “fraud”. Dealing with fraud is a problem that is as time-consuming as it is frustrating. Unfortunately, many financial institutions suffer major losses with no way to counteract after falling victim to this unpredictable crime. […] The post WEBINAR – Tyson Johnson: Building an Online Anti-Fraud Open Source Monitoring Program appeared first on...


April 18, 2017

Revolution Analytics

Warren Buffett Shareholder Letters: Sentiment Analysis in R

Warren Buffett — known as the "Oracle of Omaha" — is one of the most successful investors of all time. Wherever the winds of the market may blow, he always seems to find a way to deliver impressive...

Silicon Valley Data Science

Data Pipelines in Hadoop

As data technology continues to improve, many companies are realizing that Hadoop offers them the ability to create insights that lead to better business decisions. Moving to Hadoop is not without its challenges—there are so many options, from tools to approaches, that can have a significant impact on the future success of a business’ strategy. Data management and data pipelining can be particularly difficult. Yet, at the same time, there is a need to move quickly so that the business can benefit as soon as possible.

Many companies have vast experience in traditional data management, using standard tools and designing tables in third normal form. They may be tempted, when using the Hadoop ecosystem, to fall back to what they are familiar with, losing so much of the value of the new technologies. Others embrace the concept of the new environment, but are unsure of the data management process and end up just dumping data into the ecosystem. Data availability and cleanliness are frustrating for many users who are going through the growing pains of switching to Hadoop.

If this sounds familiar, you are not alone. While there is no magic solution, in this post we’ll look at some real world examples to give you a place to start.

Hitting walls with Hadoop

During one of our projects, the client was dealing with the exact issues outlined above, particularly data availability and cleanliness. They were a reporting and analytics business team, and they had recently embraced the importance of switching to a Hadoop environment. However, they did not know how to perform the functions they were used to doing in their old Oracle and SAS environments. The engineering team supporting them Sqooped data into Hadoop, but it was in raw form and difficult for them to query. The engineering team could not keep up with the constant requests for new tables, additional fields, new calculations, additional filters, etc. Frustrations grew on both sides. The reporting and analytics team began to see Hadoop as an inflexible architecture that hurt their ability to deliver value to the business, exactly the opposite of what it is designed to do.

In this case, there was a need for structured data in Hadoop. There was also a need to speed up the process of making data available for the business while keeping the data pipeline controlled. We looked at and considered some off-the-shelf ETL tools that were available, but determined that they were not yet ready for implementation in this particular environment. The final solution ended up being a custom ETL tool.

Our custom ETL tool

The core functionality of the custom tool was to provide the right data in the right format to teams so that they could produce their analysis and reporting as needed.

The key requirements identified were:

  • usability by people unfamiliar with Hadoop
  • workflow descriptions that allowed scheduling and productionalizing
  • extensibility for implementing minor changes readily
  • flexibility to pull data from alternate sources into varying locations

We built a framework that allowed the user to write a workflow to extract data, transform the data, load the data, and perform other functions as needed. The most common set of steps involved saving complex SQL queries in a designated folder. The user could then modify a config file to run that query as a step in the workflow in a specific way. The solution was simple, elegant, and very user friendly.

As we developed this framework, we observed some best practices that we felt were critical to the success of the project and hope may help you.

Standard development environment

In order for the process to be repeatable, simple and easy to maintain, we readily saw the importance of standardizing the development of ETL pipelines in the Hadoop/Hive/Impala environment.

This included 4 key components: a common framework, common tools, an edge node, and JupyterHub.

Common framework. ETL engineers used a common development framework, so ETL jobs could be written with common modules that let them enumerate the steps to be executed rather than re-implement similar logic for the same type of step.

As an example, in order to execute a Hive query, an ETL engineer would only need to provide the SQL query, rather than writing a shell script containing Hive credentials and Hive commands in addition to the SQL query that has to be executed.

With this approach:

  • There was no need for the same problem to be solved multiple times by different people, each coming up with similar yet different solutions.
  • Common credential storage systems were used (property files), instead of having the very same credentials placed in multiple different places.
  • Files were organized in a standardized way, so that any engineer or data scientist could easily navigate the directory hierarchy in order to find code.
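A minimal sketch of this idea in Python, assuming Hive is reached through the beeline client and that connection settings live in one shared properties file. The property names (`hive.jdbc.url`, `hive.user`), the host, and the helper functions are hypothetical and not taken from the tool described here; password handling is deliberately left out.

```python
import shlex

def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def build_hive_command(sql_file, props):
    """Return the beeline argument list for running one SQL file."""
    return [
        "beeline",
        "-u", props["hive.jdbc.url"],  # JDBC connection string
        "-n", props["hive.user"],      # user name
        "-f", sql_file,                # script file to execute
    ]

# Hypothetical shared credential file, kept in one place for every job.
SHARED_PROPERTIES = """
# connection settings
hive.jdbc.url = jdbc:hive2://edge-node.example.com:10000/default
hive.user = etl_service
"""

props = load_properties(SHARED_PROPERTIES)
command = build_hive_command("queries/daily_sales.sql", props)
print(shlex.join(command))
```

With a helper like this, the engineer supplies only the path to the SQL file; the connection details come from the shared properties.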

Common tools. ETL engineers used standard software engineering processes and tools. These included a source code control system (such as git), unit and regression testing, and automated deployment processes and procedures.

This helped significantly with improved collaboration, especially in comparison to what was happening at the time—an ETL engineer would write a script (shell/Python/Ruby) and deploy it by herself to the target machine.

Edge node. All ETL processes and data science applications were run on an edge node in a Hadoop cluster.

This allowed engineers and data scientists to access Hadoop data. That might sound trivial, but we have seen a lot of cases in which data scientists simply were not able to access the data, preventing them from doing their jobs.

By using an edge node, they were able to standardize libraries and tools, instead of having each of them developing in their own isolated island, using their own libraries and tools.

The edge node also enabled centralized monitoring and control. It prevented issues with people running a job on their own computers and then leaving the company, with nobody left who knew how to use or maintain that job.

JupyterHub. JupyterHub was set up on one of the Hadoop cluster’s edge nodes. For this company, the days of being able to analyze data using only SQL query tools were long gone. Beyond just running SQL queries, they needed more complex data analytics and data science processing. The Jupyter Notebook is a tool widely accepted by the data science community, and it provided those capabilities.

JupyterHub (as a centralized version of Jupyter Notebook) allowed the data scientists to access data stored on the Hadoop cluster with a tool they were familiar with. JupyterHub/Jupyter was also used initially to investigate and analyze the data. As mentioned above, applications were set up to run on an edge node without the need for human intervention. The Notebook is really an excellent tool, and having it available enabled engineers and analysts to make a big difference for their company.

Architecture of the ETL tool

The framework we developed has four major components, detailed below.

ETL workflow definition

The ETL workflow we developed consists of two types of files.

ETL Workflow configuration file—ETL workflow configuration files contain workflows defined by a list of steps that should be executed in order to run an ETL process. The files are plain text, and in this case we used INI format.

An ETL workflow configuration file contains two types of objects: workflows and workflow steps. An ETL workflow contains a list of steps that are supposed to be executed in order to execute an ETL process. Steps can be executed in sequential order, but branching functionality as well as “waiting for a condition to happen in order to continue” functionality are supported. We can also run steps in parallel (fork and join).
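As a rough illustration, a workflow file of this kind could look like the INI text below, read here with Python's standard `configparser`. The section naming (`workflow:…`, `step:…`) and the keys (`steps`, `type`, `artifact`) are assumptions made for this sketch, not the actual format of the tool, and only sequential execution order is covered (no branching, waiting, or fork and join).

```python
import configparser

# Hypothetical workflow definition: one workflow and the steps it lists.
WORKFLOW_INI = """
[workflow:daily_sales]
steps = create_table, load_raw, aggregate

[step:create_table]
type = hive
artifact = sql/create_sales_table.sql

[step:load_raw]
type = hive
artifact = sql/load_raw_sales.sql

[step:aggregate]
type = python
artifact = scripts/aggregate_sales.py
"""

def load_workflow(ini_text, name):
    """Return the workflow's steps, in execution order, as dictionaries."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    step_names = [s.strip() for s in parser[f"workflow:{name}"]["steps"].split(",")]
    return [
        {"name": step, **dict(parser[f"step:{step}"])}
        for step in step_names
    ]

steps = load_workflow(WORKFLOW_INI, "daily_sales")
for step in steps:
    print(step["name"], step["type"], step["artifact"])
```

Keeping the step list separate from the step definitions lets several workflows reuse the same step.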

ETL step artifacts—ETL step artifacts are files containing SQL statements, one liner shell/Python/sed scripts, or sometimes custom written executables.

Some SQL statements are used to create tables, others are supposed to load tables from either files from a HDFS/regular filesystem, or from other SQL tables. Other SQL statements select data out of existing tables.

Custom written executables are used to provide specialized processing. For example, we enabled data scientists to write an R or Python script to do data science from within the same framework.

The ETL framework is aware of some commonly used variables. Some examples of those variables include: ${TODAY}, ${YESTERDAY}, ${THIS_MONTH}, and ${PREVIOUS_MONTH}. Many of the ETL processes run once a day, and many depend on notions of TODAY, YESTERDAY, etc. An ETL developer can use those predefined ETL framework variables, relying on the ETL framework to calculate the values at runtime. In addition to variables, there are user-defined functions (UDFs). These variable-like entities are backed by code instead of a simple value.
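One way such variables could be resolved is sketched below with Python's `string.Template`, which happens to use the same `${NAME}` syntax. The date formats (`YYYY-MM-DD` for days, `YYYY-MM` for months), the function name, and the sample query are assumptions for illustration; UDF-style entities are not covered.

```python
from datetime import date, timedelta
from string import Template

def framework_variables(run_date):
    """Compute the predefined date variables for a given run date."""
    first_of_month = run_date.replace(day=1)
    last_of_previous_month = first_of_month - timedelta(days=1)
    return {
        "TODAY": run_date.isoformat(),
        "YESTERDAY": (run_date - timedelta(days=1)).isoformat(),
        "THIS_MONTH": run_date.strftime("%Y-%m"),
        "PREVIOUS_MONTH": last_of_previous_month.strftime("%Y-%m"),
    }

# Hypothetical step artifact that relies on framework variables.
query = "SELECT * FROM sales WHERE sale_date = '${YESTERDAY}' AND month = '${THIS_MONTH}'"

rendered = Template(query).substitute(framework_variables(date(2017, 3, 1)))
print(rendered)
```

Computing the previous month from the first day of the current month avoids errors around month boundaries and year ends.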

All ETL workflow definition files are written by ETL engineers and data scientists. There can be as many ETL workflow definitions as the business needs require.

Runtime environment configuration file

This file contains variables whose values are different for different runtime environments: development, staging, or production. It also includes controlled variables such as usernames and passwords that need to be secured separately from the rest of the framework. In production, this file is maintained by a sys admin.

Execution framework file

This is an executable whose task is to read an ETL workflow configuration file. It then executes the ETL workflow defined in that file, one step at a time, using the runtime environment configuration file variables as well as ETL runtime variables.

The framework supports data backfill. This means that it is possible to run an ETL workflow for a day (or a period) in the past. In that case, ETL variables ${TODAY}, ${YESTERDAY} refer to dates relative to the date for which the ETL workflow is being run.
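A backfill can be pictured as running the same workflow once per past date, with the date variables computed from that date rather than from the system clock. The sketch below assumes one run per day; `run_workflow` is a hypothetical stand-in that only returns the variables a real run would use.

```python
from datetime import date, timedelta

def run_dates(start, end):
    """Yield each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

def run_workflow(run_date):
    """Stand-in for a real workflow run; returns the variables it would use."""
    return {
        "TODAY": run_date.isoformat(),
        "YESTERDAY": (run_date - timedelta(days=1)).isoformat(),
    }

# Backfill three past days; each run sees dates relative to its own run date.
backfill = [run_workflow(d) for d in run_dates(date(2017, 4, 1), date(2017, 4, 3))]
for variables in backfill:
    print(variables)
```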

The execution framework was written by our team and, code-wise, is the most complicated component of what we delivered.

Automated deployment

ETL scripts will need to be deployed to staging and production, which requires some kind of automated deployment process. In this case, we wrote custom shell scripts to check code out from a git repository, and then compile, package, deploy, and extract code at the target machine.

Automated deployment must support easy and fast rollbacks. Eventually, you will release code that, due to some bug, you will have to roll back. In those moments of terror, you will appreciate the ability to roll back bad code quickly and safely to the previous release.
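One common way to make rollbacks fast, shown here only as an illustration and not as what the shell scripts above did, is to keep each release in its own directory and point a `current` symlink at the active one. The directory layout and function names below are assumptions, and the sketch relies on POSIX symlink and rename behavior.

```python
import os
import tempfile

def activate(releases_dir, release, current_link):
    """Point current_link at the given release by replacing the symlink."""
    target = os.path.join(releases_dir, release)
    temp_link = current_link + ".tmp"
    if os.path.lexists(temp_link):
        os.remove(temp_link)
    os.symlink(target, temp_link)
    os.replace(temp_link, current_link)  # atomic rename on POSIX

def rollback(releases_dir, current_link):
    """Switch current_link to the release just before the active one."""
    releases = sorted(os.listdir(releases_dir))
    active = os.path.basename(os.readlink(current_link))
    index = releases.index(active)
    if index == 0:
        raise RuntimeError("no earlier release to roll back to")
    activate(releases_dir, releases[index - 1], current_link)
    return releases[index - 1]

# Demonstration in a temporary directory with three hypothetical releases.
base = tempfile.mkdtemp()
releases_dir = os.path.join(base, "releases")
current_link = os.path.join(base, "current")
for name in ("2017-04-01", "2017-04-10", "2017-04-20"):
    os.makedirs(os.path.join(releases_dir, name))

activate(releases_dir, "2017-04-20", current_link)
restored = rollback(releases_dir, current_link)
print(restored)
```

Because old releases stay on disk, rolling back is a single rename rather than a rebuild and redeploy.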

Job monitoring and support

There is also the management portal, which allows a user (with authentication) to view job logs and run statistics. You can also modify job configurations through the portal. Data scientists have more control this way, as they don’t require support from IT.

Business impact of empowered pipelines

This framework allowed the company to take control of their business needs. Their teams were able to build their own data pipelines, quickly gaining access to important business data. As modifications were needed, they were able to quickly modify the code, test, and deploy it.

The data scientists who were previously new to Hadoop were now able to use it with languages more familiar to them through JupyterHub. Any models they built were able to be automated fairly quickly, empowering them to focus on data science and solving business problems.

Interested in learning more? Download our report on data systems.


The post Data Pipelines in Hadoop appeared first on Silicon Valley Data Science.

Big Data University

This Week in Data Science (April 18, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News


Featured Courses From BDU


Upcoming Data Science Events


Cool Data Science Videos

The post This Week in Data Science (April 18, 2017) appeared first on BDU.


April 17, 2017

Rob D Thomas

iPad Pro: Going All-in

Here is my tweet from a few weeks back: I have given it a go, going all-in with the iPad Pro. In short, I believe I have discovered the future of personal computing. That being said, in order to do...


Revolution Analytics

Free AI Workshop, May 9 in Seattle

There will be a free AI workshop in Seattle on May 9, presented by members of the Microsoft Data Science team. The AI Immersion Workshop includes five specializations to choose from (in parallel...


Forrester Blogs

Is Business Intelligence (BI) Market Finally Maturing? Forrester Three Big BI Market Predictions

No. The buy side market is nowhere near maturity and will continue to be a greenfield opportunity to many BI vendors. Our research still shows that homegrown shadow IT BI applications based on...


Curt Monash


Interana has an interesting story, in technology and business model alike. For starters: Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics...


April 14, 2017

Cloud Avenue Hadoop Tips

Got through `AWS Certified Developer - Associate`

Today, I got through the AWS Certified Developer - Associate. This is my second AWS certification (first certification here). It was a bit tougher than I expected, but it was fun. Anyway, here is the certificate.

There were 55 questions which had to be solved in 80 minutes. It was a bit tough and I spent until the last minute solving and reviewing the questions. As usual, nothing beats practice for getting through the certification.

The next certification I am planning is `AWS Certified SysOps Administrator - Associate` and then the `AWS Certified Big Data - Specialty` certification. The Big Data certification is in beta and I am not sure when Amazon will make it available to the public. This is the certification in which I am really interested, as it is at the intersection of Big Data and the cloud.

April 13, 2017

Silicon Valley Data Science

Minding Your Data Gaps

In an earlier post, we explained that to understand data gaps, you must start with your strategic business objectives (what you want to do with the data), understand the data being used, analyze the dimensions of data that are reflective of your needs, and look at how your current data fulfills these needs. The next step is being able to represent this visually, including some of the multi-dimensional information necessary to portray your business data needs. This is a powerful way to engage senior leadership and get the resources to enable “minding the gap.”

The hardest part usually isn’t plugging the gap; it is knowing which gaps to plug when you can’t possibly do it all at once. Let’s look at an illustrative example: fraud detection.

Preventing fraud by plugging gaps

According to the 2012 “Faces of Fraud” Survey conducted by Information Security Media Group, in 82% of cases involving identity fraud, the consumer uncovered the theft before the company. Not surprisingly, 26% of organizations surveyed reported losing consumers to competitors following a fraud incident. Our strategic business objective here, therefore, is to prevent fraud.

As mentioned in Is Your Data Holding You Back?, preventing account takeovers is a common way of addressing fraud risks. In the case of account takeover, a fraudster obtains the online account credentials of a legitimate customer with the intent of misusing the account (e.g., wiring funds out of the account into their own). The victim feels understandably violated and may blame the account holder. We can identify preventing account takeovers as a functional requirement, then, of our objective to prevent fraud. There will likely be other functional requirements tied to this business objective as well.

Let’s dig a little deeper and consider how account takeover can be prevented. There are several ways our theoretical organization could approach this, and those ways become our technical use cases. One such use case involves device recognition. This is often employed at login, to compare a laptop, desktop, or mobile phone’s signature to that of the legitimate owner, and/or to a database of devices previously associated with fraud. So, device recognition requires both customer data and device data.

Can the organization employ device recognition, or do they have data gaps holding them back? Since this is an assessment of data adequacy from the perspective of application architects (as opposed to, say, a view you’re providing to the CDO), we’ll focus on application-relevant dimensions like breadth, depth, latency, and frequency. The customer and device data needs to be sufficiently broad to enable the necessary attributes required for associating the correct information with the login event, have enough depth or history to have a decent number of known fraudulent devices to compare to, have a relatively high refresh frequency to ensure the latest information is being used, and be low latency to be able to trigger the appropriate action if fraud is suspected.

Therefore, when considering what is needed to build the capability of device recognition, we would need to assess whether there are currently gaps in these specific dimensions—breadth, depth, frequency, and latency—for both customer and device data. We can extend this exercise by performing the same assessment for the other functional requirements and related use cases for preventing account takeover. This begins to help shape a larger understanding of what efforts are needed to start reducing fraud.

Communicating data gaps

Imagine that you’ve gone through this process for all of your important business objectives and you’ve identified several areas with data gaps. Now what? Closing data gaps is a complex, multi-dimensional problem. Finding a way to manage that complexity is key. It isn’t easy to alter people’s mindset from “we must clean the data until it is pristine” to “we need data good enough to solve this important problem.”

Communicating what you’ve found visually is a great way to advise and persuade your stakeholders. Here is an example visualization we often use that shows whether gaps exist along the key identified dimensions (such as breadth, depth, frequency, and latency identified for our fraud example) that would prevent a business objective from being met.

Visualization of data gaps

You would typically have a more granular view of the data categories, but this example is illustrative of the larger information categories of customer, device, etc., and how these support the capabilities, shown in prioritized order. Questions to consider when deciding how to categorize your data for assessment are:

  • What is creating the data (e.g. a person, a sensor, software)?
  • Who is collecting the data (e.g. a division within the organization, a third party)?
  • What type of data is it (e.g. unstructured, structured, geospatial, image)?
  • What is the data describing (e.g. your customer, your vendors, manufacturing processes, lab experiments)?

You also need to customize the dimensions shown in the four-box key (latency, frequency, breadth, and depth, for example) to focus on the ones that are meaningful for what your business is trying to accomplish. The assessment of four dimensions not only allows for the lovely figure above, but we have also found that the exercise of reducing the dimensions in focus to four helps an organization identify the key blockers that are holding it back from meeting its objectives.

This view can be read in many different ways to spot trends – box-by-box, row-by-row, column-by-column.

  • Box by box: This illustrates the gaps in one data requirement for achieving one capability. For example, the box in the first row and column shows gaps in breadth and depth of customer data for the capability “Login Score Tuning.” These gaps exist because not enough information has been collected from the customer over enough time to measure loyalty.
  • Column by column: This illustrates the impact gaps in data requirements have on one business objective. For example, no category of information meets the “latency” requirement for meeting the business objective “Mobile Specific Monitoring.” Monitoring is typically a real-time activity that requires near immediate data collection. If mobile-specific monitoring were a high-priority capability, this column would indicate that one of the highest-priority next steps would be to build the infrastructure necessary to process data in real time.
  • Row by row: This illustrates the opportunity that closing the gaps in data requirements for one data source offers across all the business objectives. Device data is collected on an inconsistent basis; each device sends information at a different frequency. Collecting device information in a more consistent and frequent manner will close the gaps in frequency across almost all business objectives. A row-by-row view of this figure can help to prioritize next steps that would have the most impact across the business objectives.
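The three ways of reading the figure can also be sketched in code. The Python snippet below stores, for each data category and capability, the set of dimensions that have a gap. The entries are hypothetical values loosely echoing the examples above, not the actual assessment, and the function names are made up for this sketch.

```python
DIMENSIONS = ("breadth", "depth", "frequency", "latency")

# Hypothetical assessment: gaps[data_category][capability] = dimensions with a gap.
gaps = {
    "customer": {
        "Login Score Tuning": {"breadth", "depth"},
        "Mobile Specific Monitoring": {"latency"},
    },
    "device": {
        "Login Score Tuning": {"frequency"},
        "Mobile Specific Monitoring": {"frequency", "latency"},
    },
}

def box(category, capability):
    """Box by box: the gaps in one data category for one capability."""
    return sorted(gaps[category][capability])

def column(capability):
    """Column by column: dimensions that are a gap in every data category."""
    sets = [capabilities[capability] for capabilities in gaps.values()]
    return sorted(set.intersection(*sets))

def row(category):
    """Row by row: how many capabilities each dimension blocks for one category."""
    counts = {dimension: 0 for dimension in DIMENSIONS}
    for dims in gaps[category].values():
        for dimension in dims:
            counts[dimension] += 1
    return counts

print(box("customer", "Login Score Tuning"))
print(column("Mobile Specific Monitoring"))
print(row("device"))
```

In this toy data, the row view for device data shows frequency blocking the most capabilities, which is the kind of signal that helps prioritize next steps.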


Protecting your customers from fraud shows your customers that you value them, builds trust, and helps protect the bottom line. Evaluating data gaps and making informed choices on where to invest in plugging data gaps is a smart way to proceed. You can’t possibly do it all at once, so focus on where you can make the highest impact.

Interested in building out your data strategy? Start by taking our data maturity assessment to see where you stand.

The post Minding Your Data Gaps appeared first on Silicon Valley Data Science.

Revolution Analytics

Seeing Theory: Learn Statistics through simulation

There's an ongoing debate in the academic community about whether Calculus is a necessary pre-requisite for teaching Statistics. But in an age of ubiquitous computing resources (not to mention open...


Curt Monash

Analyzing the right data

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset...

Cloud Avenue Hadoop Tips

Which EBS type to use to meet our requirements?

In the previous blog, I pointed to an approach for picking the best EC2 instance to meet our requirements. Here is a video on picking the best EBS type. EBS (Elastic Block Store) is like a hard disk drive that can be attached to an EC2 instance.

Just as AWS provides different EC2 instance types, there are different EBS types, which are basically categorized into SSD and Magnetic. Depending on whether we are looking for high IOPS or high throughput, the appropriate EBS type can be picked. Here is a flow chart in the same video to figure out the type of EBS to pick.

One interesting thing to note about EBS is that the EC2 instance and EBS need not be on the same physical machine. That is, the processing and the disk need not be on the same machine. There is a low-latency, high-throughput connection between the EC2 and EBS machines.

Many of the AWS videos are really good, and some of them are so-so. I will be promoting the really interesting videos in this blog for the readers.

Cloud Avenue Hadoop Tips

Which EC2 instance to use to meet our requirements?

In the previous blogs, we looked at creating a Linux and a Windows instance. There are lots of parameters which can be configured during instance creation. One of the important parameters is the instance type and size. AWS provides many instance types as mentioned here. While some of them are optimized for CPU, others are optimized for memory, etc. It's important to pick the right instance type to meet our requirements. If we pick too small an instance, we won't be able to meet the requirements. On the other hand, if we pick too large an instance, we pay more than we need to.

So, which one do we pick? Here is a nice video from the nice AWS folks on picking the right EC2 instance to match the requirements. In the same video here, there is also a flow chart to help you pick the right instance type.

Without the Cloud, we need to do the sizing and the capacity planning carefully as per our requirements before ordering the hardware, as we are stuck with it. With the Cloud, we can try an instance, identify the bottlenecks and move to another instance type without any commitment. Once we are sure of the proper EC2 type, we can also go with Reserved Instances. That's the beauty of the cloud.

Amazon has been introducing new EC2 instance types regularly, so picking the right EC2 instance is a continuous exercise to leverage the benefits of the new EC2 instance types.

Teradata ANZ

Data Science, Art of Analytics, Burning Leaf, Business Value Framework, RACE and other buzz words

Optimise the End-to-End Customer Experience with Business Analytics Solutions

By now, almost all companies understand the importance of using data analytics to better understand the customer, process or product performance. However, many are still struggling with getting full value from the vast amounts of available data. Every company is at a different stage of analytical maturity, but in my experience most analytics are done in silos and are usually one-off and domain or problem specific. Departments within an organisation do not share their Data Scientists/Analysts, and in many cases would not share their data or know what the other department is doing. The missing part is the holistic approach to analytics: a clear view of capabilities, data and a production road map across the whole business. Many companies are also limited by their existing technologies and skill sets, which prevents the business from moving forward through analytics. I hear many variations of: “we build this really great customers cohort for targeted marketing and then we go and target everyone anyway because (don’t know how to put it in production, don’t have the skills, etc.)”.

There are many contributing factors to this disconnect, from finance to politics. However, one major contributor to analytical success is a team that is independent, technology agnostic and provides a full range of services and analytic solutions, including analytics business consulting and data science. I personally agree with the authors of [1] that having a decentralised organisational model, where analysts are scattered across the organisation in different functions and business units with little coordination, would do little to propel a company towards analytical maturity.

Structured Engagement Accelerates Time to Value
How can a structured, well-coordinated and multidisciplinary team help a company progress to the next stage of maturity and leverage the full benefits that analytics and data can offer? One way is to use a Business Value Framework that prioritizes and aligns business analytic use cases with the company’s strategic objectives. One example that we use at Teradata is a technology-agnostic methodology called the Rapid Analytic Consulting Engagement (RACE) [2]. I have seen many similar variations of it: scrum, experiment, etc. A RACE-like methodology offers several unique advantages. It is a highly structured, strategic way to identify the business processes with the highest ROI, the data sources, and the analytics needed to solve a problem. The entire engagement happens rapidly, unlike consultancies that require several months or more to review processes, make recommendations, and implement solutions. RACE also gives flexibility. It allows for a pivot when a more urgent problem comes up. In fact, in many cases we start with one problem and then the client requests another use case or asks us to expand the scope of the current one. The success of such short engagements also lies in working closely with the stakeholders who understand their business; combined with expertise in technology and analytics, this delivers high-value outcomes for the business.

Effective Customer Journey
What are some examples of RACE-like engagements? Most customer-centric companies put great emphasis on building and analysing “Customer Journeys”. One-off RACE engagements will never deliver the full benefits of a journey, but having a clear purpose for creating it and a methodical plan for building and analysing it will.
Great customer journey maps are rooted in data-driven research, and they represent the different life phases of customers, their experiences and touch points. The full value of the journey is achieved only if it is based on a variety of dimensions such as sentiment, goals, touch points, and more.

These dimensions come from a variety of data sources:

  • Digital: The most obvious is website analytics, which provides a lot of information on where users have come from and what they are trying to achieve. It will also help you to identify points in the process where they have given up.
  • Front-line staff: Gathering and recording feedback from front-line staff who interact with customers daily, such as those in support and sales, is another useful way to understand customer needs.
  • Transactional data: Spending habits, budgets, incoming and outgoing payments, etc.
  • Social media: Sentiment expressed when talking about your product/institution, the connections that a customer has, and their influence.
  • Customer history: Number of accounts a customer holds, tenure, relative income, NPS scores, etc.

Each dimension will require different techniques, analytical approaches and technology. Ultimately every customer journey is built upon life events and customer touch-points. The importance of a particular event or a touch-point varies between industries and from company to company.

There are many ways that we can look at the customer. For example, one way of identifying life events is to look at a customer’s behavioural patterns. ‘Burning Leaf Of Spending’ is one such example and looks at significant variations in customer-weekly spending patterns.

This Art of Analytics piece was born out of the RACE engagement with a global bank.

The Analytics
The ‘Burning Leaf’ was built across different technologies. Teradata Aster was used to integrate and process rich transactional accounts and credit card spending data. The Change Point Detection algorithm (CPD) was used to detect the change points in a time series and R was used to produce the visualisation.

Tatiana Bokareva - The Burning Leaf of Spending v1
Watch the ‘Art of Analytics: Burning Leaf of Spending’ video here.

The graph is read from left to right. Each line on the graph represents the spending habits of an individual customer; you can think of it as a customer-spending time series. Each rise on the graph represents a significant deviation from the customer’s average weekly spending. Such a rise represents a potential life event and the point in time when this event occurred: school fees; a new baby; a significant purchase (car, house deposit, expensive holiday, etc.). Once such a jump or fall is identified, it triggers the event classification procedure.

Most customers have fewer than 8 major changes in a year and few have 10, hence the “tail” to the right of the graph, which reduces the search space for events. The system does not have to trace the entire yearly transaction history of a customer; it only needs to react when a change is detected.
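To make the idea of a significant deviation from average weekly spending concrete, here is a deliberately naive Python sketch that flags weeks far from a customer's mean. It is a simple threshold on standard deviations, not the Change Point Detection algorithm used for the Burning Leaf; the threshold value and the sample figures are made up for illustration.

```python
from statistics import mean, pstdev

def flag_spending_changes(weekly_spend, threshold=2.0):
    """Return indexes of weeks whose spend deviates from the mean by more
    than `threshold` population standard deviations."""
    average = mean(weekly_spend)
    spread = pstdev(weekly_spend)
    if spread == 0:
        return []  # perfectly flat spending has no deviations to flag
    return [
        week
        for week, amount in enumerate(weekly_spend)
        if abs(amount - average) / spread > threshold
    ]

# Hypothetical customer: steady spending with one large purchase in week 5.
weekly_spend = [200, 210, 190, 205, 195, 900, 200, 198, 207, 202]
flagged = flag_spending_changes(weekly_spend)
print(flagged)
```

Only the flagged weeks would need to be passed on to an event classification step, which is what keeps the search space small.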

Beyond Eye Candy
Art of Analytics goes beyond being a striking visual; the picture tells a story. Art can bring people together across traditional barriers such as age, income, education, race and religion. It acts as a powerful marketing tool and is instrumental in helping to grow and attract business. It breaks the ice and starts conversations with organizations about data and how it relates to their business problems. Reactions to an Art of Analytics picture range from “I never thought data could look like this” and “I never saw my business from this angle” to “I want this for my business.”

Unlock Sales Opportunities
Understanding life events gives a more complete picture of who the customers are, where they are in their life journey, and what their financial needs are. Any major life event can trigger a sales opportunity. A marriage, kids enrolling in a private school or going to college, buying a house, having a child, or traveling overseas are all points along the customer lifecycle that are ripe for a targeted sale.

To identify events for up-sale or cross-sale offers, companies must have visibility across the entire customer lifecycle, from the first touch point until the present day. Moreover, the events library, along with the ability to expand, analyse and act upon it, needs to be accessible to the business across the entire organisation.

An ability to act on these insights develops stronger business relationships with customers, encourages and rewards loyalty, improves NPS scores and pinpoints which particular financial products are most appropriate at any given point in time. In the end, all of these will lead to larger customer lifetime value and increased profits.

Repeatable Methodology Across Business and Industries
A RACE approach is repeatable, i.e., the same steps can be used to analyse the entire end-to-end customer lifecycle to benefit any customer-centric business. It can be applied to telecom, retail, airline and insurance companies that need a much better customer view.

For some businesses, like telecom, that journey can be very short and extending that journey and increasing customer value is of paramount importance. Another great opportunity for new sales is in the insurance industry. If an insurance company knows a customer is expecting a child or recently gave birth, then they know that is the best time to offer life insurance.

For others, such as banks and airlines, the journey can span decades. Because customers’ needs change (their buying habits at 20 years of age are not the same as when they’re 50), companies must offer what a customer wants at a specific point in their life, based on their particular situation. Businesses with such a capability will retain customers and continue to be profitable.

“Do we really know what our customers want, and how are we meeting those needs?”



[1] “Analytics at Work: Smarter Decisions, Better Results,” by Thomas H. Davenport, Jeanne G. Harris and Robert Morison

The post Data Science, Art of Analytics, Burning Leaf, Business Value Framework, RACE and other buzz words appeared first on International Blog.


April 12, 2017

Revolution Analytics

Data Amp: a major on-line Microsoft event, April 19

This coming Wednesday, April 19 at 8AM Pacific Time (click for your local time), Microsoft will be hosting a major on-line event of interest to anyone working with big data, analytics, and artificial...

Silicon Valley Data Science

Understanding Your Data Maturity

Businesses today are experiencing rapid change, inside and outside of Silicon Valley. Even coffee isn’t immune—just last year, Starbucks crossed the threshold of more than 20% of transactions being done in their mobile application. Data is the currency of digital transformation, and data capabilities are the battlefield influencing market share and profitability.

No two situations are the same, but at SVDS we have found one truism: making a data transformation successful requires much more than simply getting the technology right. Across a variety of industries and operations, we consistently find the influence of systems and people to be deciding factors. This is one of the reasons why we developed our Data Maturity Model—a tool for understanding how well your data capabilities create value for your business, across people, process, and systems.

Assess Your Data Maturity

Data maturity in practice

Data maturity is a useful tool for measuring the progress being made against your transformation. Recently, we were working with a multi-billion dollar industrial device company that was just beginning their Internet of Things (IoT) transformation: integrating software services with their physical devices. The vision was set and small experiments were being run throughout the organization—but the real work of building had not yet begun.

The company’s overarching strategy made a lot of sense—building a higher-margin services business from the key position their devices play in their customers’ workflows. We were asked to help them develop a data strategy and accompanying architecture to make that possible. Fundamentally, they needed a plan to use analytics and device data to create new services their clients would value.

While this company certainly had major technology investments in its future, some of the most urgent things necessary for this transformation to be successful had little to do with technology:

  • New incentives to change required: Like many established product companies, the organization was structured around distinct, mature product lines. With very different P&Ls and incentives, we were told: “Careers are made within the business units, not the company.” There was an inherent skepticism for investing in unproven growth, especially if the effort was performed by a centralized function.
  • Lack of experience with data rights: At this time, a very small percentage of overall revenue was produced through software. To sidestep questions of license management, the default behavior was to “make stuff free so that we do not have to create licenses.”
  • New customer relationships (and approach to product development): Getting the most value from device data meant integrating it with some of the customers’ own data sources to provide greater context to end users. This meant expanding or revisiting customer relationships to embrace data sharing, requiring updates to processes in sales management (i.e., around channel or partnership agreements) and overarching application strategies (e.g., creating standards and cultivating ISV ecosystems).

In working together, we helped the industrial product company identify solutions for incentivizing technical investment. We identified policies required around data ownership, custody, and consent as a precursor to becoming an integrator and reseller in the supply chain of data services. We helped them update their architecture to support these services. It is not uncommon for an SVDS maturity assessment to reveal a broad set of opportunities.

Ultimately, our client was equipped with a broader perspective on what they needed to launch their software business and a roadmap to build all aspects of their data capabilities, ensuring that their transformation would be more successful. A clear roadmap helped them fill in the gaps between here and there, illuminating the concrete steps they needed to achieve their vision. Although transformation does not have a fixed end point, milestones are important—change should be incremental enough that the organization can see progress.

What’s next?

If your company is at the beginning of its data transformation, it is easy to get caught up in all the upcoming technological change. Resist that urge, as honestly evaluating the maturity of your people and processes will help you identify where you need other investments to ensure a successful transformation. Start by taking our assessment today to learn more about where you stand.

The post Understanding Your Data Maturity appeared first on Silicon Valley Data Science.

Ronald van Loon

How to Become an Omni-Channel Data-Driven Retailer

In today’s digital age, where customers are as likely to buy a product from an eCommerce website as from a brick and mortar store, delivering a seamless and value-adding shopping experience has become more important than ever before. The multiple shopping channels available to customers and the competition posed by other retailers have made it an absolute necessity for a retail business to integrate data inputs from different channels and use it to define an omni-channel shopping experience.

As a Big Data and BI influencer, I have worked with a large number of businesses within and beyond the retail industry. Using this knowledge and expertise, I have developed a 5-step approach that businesses operating in the retail sector can adopt to meet the expectations of their customers and become an omni-channel, data-driven retailer.

Step 1: Collect the Right Type of Customer Data

In their journey to becoming a data-driven organization, businesses are required to collect the right type of data — data that can help them improve the customer experience and maximize the profit they gain from their online and offline channels.

For example, a business that operates brick and mortar stores would probably like to try different layouts and floor plans to determine which maximizes the chances of a sale. To do this, they will require a sales breakdown for each store. On the other hand, since in an online context you can personalize customer journeys, a business with an eCommerce website will need to collect data about its customers’ purchasing history, browsing behaviors, and more.

Apart from customer data, competitor data is another type of data that you need to collect. Collect data regarding price, customer reviews and ratings, and sizing, and use the intelligence gained to optimize your product catalogue and customer journey.

Gathering data without analyzing it is not only a waste of time, it will also raise doubt and suspicion among your customers. Therefore, it is also important not to collect data that you are not going to use.

Step 2: Integrate Your Data Sources

Once you have determined the type of data you want to collect, the next step is to identify the channels that can provide you the data. Retail businesses collect massive amounts of information from a wide range of channels. In fact, each successful transaction presents a data collection responsibility to retail businesses. However, the sheer amount of data and the disparate sources from which they are collected may make it quite an impossible task for businesses to organize the data and generate valuable insights from it.

Since a large amount of data collected by a retailer is processed through its Enterprise Resource Planning (ERP) system, the platform can be used for the purpose of data organization. Alternatively, there are a number of other BI platforms available that businesses can use to improve their data collection, organization, visualization, and analytical capabilities. These platforms can collect data from a large number of sources, including:

  • Point-of-sale data
  • Customer feedback
  • Web Analytics data
  • Customer Relationship Management data
  • Supply chain data

In our next webinar Ian Macdonald, Principal Technologist at Pyramid Analytics will show how a BI platform can help you amalgamate data and get a unified and accurate view of all your customer, competitor, and corporate data.

Step 3: Create a Data-Driven Culture

Succeeding with data is not just a matter of investing in a BI solution or hiring a data analyst or scientist. Instead, it requires you to develop a data-driven culture that involves people from all the departments of the organization. This is particularly important for retail businesses because every single department of a retail business can add something significant to the data value chain.

For example, the finance department can help you determine if the prices are consistent across all channels or if there is a particular store or platform that’s not delivering the desired performance. Similarly, the people responsible for supply chain management can provide you with information regarding supply chain issues and how they impact your order fulfillment capabilities. To summarize, from sales to marketing to HR and supplier relationship management, every single department of your organization can provide you with an answer to an important question. Therefore, establishing a data-driven culture and involving all the functions is imperative to becoming a data-driven omni-channel retailer.

Step 4: Do Not Be Confined By Reporting Cycles

While automated pre-built reports offered by a BI and analytics solution can help you analyze data in real time and more efficiently, the routine reporting capabilities may also limit your ability to extract maximum value from your BI investment.

Ad-hoc analysis is a viable solution to this problem. It will allow your employees to adopt an innovative approach towards data analysis and get a deeper, more precise, and comprehensive view of your existing customers. Also, since Ad-hoc analysis solutions are built specifically for users, they ensure a high adoption rate. This, in turn, makes business intelligence accessible to every single person in your organization, allowing them to contribute to the data value chain in their own unique way.

Step 5: Look beyond Your Business

In order to become a true data-driven omni-channel enterprise, businesses should look beyond the data sources present within the organization. Consider the data sources located externally. The most common, yet important, external data source is social media. Customers’ feedback about your business on various social media platforms can offer you a wealth of information, which you can use to optimize your customer journey.

Other unconventional external data sources that your business could benefit from include:

  • Search result data
  • Demographics information collected by different surveys
  • Online rankings of web pages and TV ads

Once you have collected information from all these channels, integrate it with your existing data and use a BI tool to identify trends and patterns and to predict customer behavior.


From the evolution of ecommerce to the increasing use of social media and smartphones, the retail environment has gone through a multitude of changes over the past few years. This has led retailers to adopt an omni-channel approach that can enable customers to interact with and buy a product from a retailer at any time via any platform. However, in order to leverage the profit maximization potential of omni-channel strategy, businesses must utilize the data available to them in the right manner.

While the 5-step strategy mentioned above can help a business make progress towards their ultimate objective of becoming a data-driven, omni-channel enterprise, each and every business has its own unique needs, and therefore, a thorough evaluation of these needs and a customized approach to fulfill them is necessary.

If you are interested in learning more about multi-channel retail analytics, you can register for my upcoming webinar. You can also follow me on Twitter and LinkedIn to stay updated with the latest in BI and journey science.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a data consultancy company, a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post How to Become an Omni-Channel Data-Driven Retailer appeared first on Ronald van Loons.

Revolution Analytics

New features in the checkpoint package, version 0.4.0

by Andrie de Vries In 2014 we introduced the checkpoint package for reproducible research. This package makes it easy to use R package versions that existed on CRAN at a given date in the past, and...


How Customer Segmentation improves customer retention and conversion rates

Effective communication helps us better understand and connect with those around us. It allows us to build trust and respect, and to foster good, long-lasting relationships. Imagine having this ability to connect with every customer (or potential customer) you interact with through communication that addresses their motivators and desires. In this blog post, I take a brief look at ‘customer segmentation’ and how it can foster the type of communication that leads to greater customer retention and conversion rates.


April 11, 2017

Revolution Analytics

R is for Archaeology: A report on the 2017 Society of American Archaeology meeting

by Ben Marwick, Associate Professor of Archaeology, University of Washington and Senior Research Scientist, University of Wollongong The Society of American...


Revolution Analytics

Prepare real-world data for analysis with the vtreat package

As anyone who's tried to analyze real-world data knows, there are any number of problems that may be lurking in the data that can prevent you from being able to fit a useful predictive model:...


Rob D Thomas

Gather the Fruit and Burn the Tree

There is an old story about a gentleman walking through the countryside and he comes upon a plum orchard. As he is walking through the orchard, he notices a plum tree with fruit that is ripe on the...


Rob D Thomas

Decentralized Analytics for a Complex World

In 2015, General Stan McChrystal published Team of Teams, New Rules of Engagement For a Complex World. It was the culmination of his experience in adapting to a world that had changed faster than the...


Rob D Thomas

Pattern Recognition

Elements of Success Rhyme The science of pattern recognition has been explored for hundreds of years, with the primary goal of optimally extracting patterns from data or situations, and effectively...


Rob D Thomas

12 Attributes of a Great Leader

"A managers output is the output of the organization under her supervision or influence." - Andy Grove I believe that most managers want to be great managers. In fact, many aspire to transcend...


Rob D Thomas

Machine Learning and The Big Data Revolution

I had the opportunity to speak at TDWI in Chicago today. It was a tremendous venue and a well organized event. Thanks to the TDWI team. I spoke on the topic of machine learning and the big data...


Rob D Thomas

The Fortress Cloud

In 1066, William of Normandy assembled an army of over 7,000 men and a fleet of over 700 ships to defeat England's King Harold Godwinson and secure the English throne. King William, recognizing his...

Silicon Valley Data Science

Building Connections

In previous posts, I shared conversations with Travis Oliphant, co-founder of Continuum Analytics, where we talked about new developments in tech and building communities. In this final installment, I talk with Travis about his current project to extend the concept of interoperability between multiple libraries in Python into other programming languages, and the pain points this will address.

We’ve talked about a number of projects that you have done, or are working on now. Do you have anything in mind next?

Yes, I do. To me it is the culmination of my SciPy work. So I’m super excited about it. I feel like with all of the experience gained by going through different aspects of bringing SciPy, NumPy, Numba, and Conda to life, and the community aspect and organization—it is breaking down barriers.

Are you familiar with the buffer protocol? Not many people are in Python, but you know the concept of fixing it twice? That is a general concept of, hey, if you find the problem, don’t just fix the proximate reason, fix the ultimate reason. Fix it twice. So NumPy became a necessity because there was a split with Numarray and Numeric and all these libraries were growing up with these different silos inside of Python itself—that is what drove me to stop trying to get tenure and jump off a cliff into writing NumPy. But I was driven mostly by a sense of duty and love for this community of people I had been interacting with for about eight years at the time.

So I wrote NumPy, but didn’t just write NumPy. I actually also worked with the Python world to create a buffer protocol. This is a way for different kinds of structures in Python to talk about data in the same way. NumPy is a common array structure, but we also created a common underlying interface so even if there are new array objects sometime in the future, they can talk to each other easily without copying data. So we created an array interface: a buffer protocol to cover Python (and also an array interface that is kind of outside of Python core). All this was done with PEP 3118. I intentionally worked with Guido and folks in the Python community, and got Python core dev rights to do it. I don’t think they took them away yet, but that’s why I was a Python core committer for a while: because I wrote this stuff to make it into the Python core. This was intentionally a “fix-it twice” thing, that got it so that the Python imaging library could talk to NumPy arrays. It took another six years from that creation of the interface for people to start writing the code that takes advantage of it. Now when people write a plug-in for any data-centric library, they can use memoryviews and the buffer protocol so that NumPy and other array-like structures can just see it immediately and not have to copy data around. This work broke down barriers of how people interpret data together.
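To make the "no copying" point concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not code from the interview: the function name `double_in_place` and the sample values are hypothetical. It shows a `memoryview` giving a second piece of code direct access to the memory owned by an `array.array`, so changes made through the view are visible in the original object.

```python
from array import array


def double_in_place(buffer) -> None:
    """Double every element of any object that exposes a writable buffer.

    Hypothetical helper for illustration: it never copies the data, it only
    looks at the caller's memory through a memoryview.
    """
    view = memoryview(buffer)
    for i in range(len(view)):
        view[i] = view[i] * 2


# An array of C doubles owned by the standard-library array module.
samples = array("d", [1.0, 2.5, 4.0])

# Slicing a memoryview creates a new view onto the same memory, not a copy.
window = memoryview(samples)[1:]
double_in_place(window)

# The original array sees the change made through the view.
print(samples.tolist())
```

Any object that implements the buffer protocol (for example `bytearray`, or a NumPy array if NumPy is installed) could be passed to the same helper.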

That buffer protocol was the thing I’m most proud of. You know, I love what happened with NumPy. I’m so happy that it became a useful thing that could benefit a lot of people, but I was also proud of the Python work. Even though there are so many things happening that I didn’t feel like I finished it. A few other people actually helped finish it—Antoine Pitrou and Stefan Krah. They actually work with me now because I was so impressed with their work of finishing this initial effort I started that was a foundational effort.

The next thing is actually to extend that concept of interoperability between multiple libraries in Python to every language. We want to allow Scala, R, Python, Ruby, Lua, everybody to talk about data in the same way. You can have a pipeline of interoperating function calls, where you don’t have to copy data to different language run-times. If you want to use a thing, right now you have to copy data out of whatever you are doing and put it into whatever they are doing. Because we don’t have common ways to talk about data in a shared memory segment. Languages have abstracted that away in different ways.

So it is undoing an oversight. Or an issue that wasn’t an issue then but is now. It was a good idea at the time, but subsequent events have left that idea to be a problem. The oversight of encapsulating type and what something is and what is a memory and embedding that implicitly in the language. So somebody wants to say I have a dictionary or I have an array, or I have a hash table, but there is no uniform notion of that. There is a semantic concept of that, but not how it is actually implemented in memory. Even though we may be on the same hardware (you know, I have the same MacBook and I have got applications running and they both have a hash table). But those applications, one in Scala, one in Python, one in R, have no clue what it means to be a hash table together. What it means is embedded implicitly in the specification of each language. That is fundamentally what creates the siloization. The fix is to create a common data description language for type and shape (a data-shape).

One of the pieces of buffer protocol was the type system and we used the struct module’s crude definitions of type (it wasn’t complete). One of the things I am working on next is figuring out, OK, how should array-oriented computing work? It is actually three things, separated into a type or a data description language—a way to talk about how things are in memory or bytes, you know, hash table, an array, a data frame. What does it mean to be one of those, all the way down to the bits?
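As a rough illustration of what "the struct module’s definitions of type" look like in practice, the sketch below uses standard-library calls only. The record layout `"<id"` and the sample values are assumptions made up for this example. It shows a struct format string describing bytes in memory, and a `memoryview` reporting the same kind of format code for an array’s elements.

```python
import struct
from array import array

# Hypothetical record layout: little-endian, one 4-byte int then one 8-byte float.
RECORD_FORMAT = "<id"

# Pack Python values into raw bytes according to the format, then read them back.
packed = struct.pack(RECORD_FORMAT, 7, 2.5)
record_id, value = struct.unpack(RECORD_FORMAT, packed)
print(len(packed), record_id, value)

# A memoryview describes its elements with the same struct-style codes:
# format is the element type, itemsize its width in bytes, shape the dimensions.
readings = array("i", [1, 2, 3])
view = memoryview(readings)
print(view.format, view.itemsize, view.shape)
```

The format string says how wide each field is and in what byte order, but nothing about what a hash table or a data frame is, which is the kind of gap the interview describes.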

You basically need a concrete way, shared across multiple language silos, of describing how things are laid out and represented in memory. Then you have a function infrastructure: a generalized (multiple-dispatch) pipeline infrastructure that is simply a way to stitch together function calls whose arguments are defined by this type system that is shared across languages. And then you have a backing store where all the data lives in a shared memory accessible by many run-times. This concept internally goes under the name of a memory fabric or a data fabric.

So type, function, and data objects—those are computer science things that can be re-used in many, many ways, and then the array container, which is already in Python (the buffer protocol) and then NumPy itself is just a library with computation on top. But that library of computation on top could also be another client. It could be in Scala, it could be in R, it could be in whatever, but you are using the same fundamental notion of what a data object is, and therefore you can have a system where your Scala code loads this thing from that silo into a shared memory segment and then I call out to this R library, which just points to that same shared memory segment and does the transformation, and I pull another Python library that does the same thing, and they are all talking to each other without having to copy data back and forth unnecessarily.
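The shared memory segment idea can be sketched with Python’s standard `multiprocessing.shared_memory` module (Python 3.8 or later). This is only an analogy under stated assumptions: both handles live in one Python process here, whereas the interview describes different language run-times attaching to the same segment, and the function name `share_without_copying` is hypothetical.

```python
from multiprocessing import shared_memory


def share_without_copying(payload: bytes) -> bytes:
    """Write through one handle to a named segment and read through another."""
    # "Producer" side: create a named shared memory segment.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # "Consumer" side: attach to the same segment by name.
        # Attaching maps the same memory; no bytes are copied.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            producer.buf[:len(payload)] = payload
            # Copied out here only so the result outlives the segment.
            seen = bytes(consumer.buf[:len(payload)])
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()  # release the segment once nobody needs it
    return seen


print(share_without_copying(b"shared bytes"))
```

What this sketch leaves out is exactly what the interview calls the hard part: a description of type and shape that every run-time attaching to the segment agrees on.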

Copying data back and forth doesn’t matter if you are talking about kilobytes and even megabytes, but when you start to get gigabytes, terabytes, and petabytes, the speed of light is the limit. You are not going to copy bits faster than a certain amount. And as the data sizes grow, that becomes an impediment so the premature typed encapsulation ends up becoming the siloization root cause. And it is broken down by essentially doing what we did with the buffer protocol in Python—the fix it twice notion in NumPy and the fix it twice in Python, and taking that general principle and applying it across the board. So that is what I’m excited about next.

Sounds like you’ll be solving some headaches.

Yeah. And the value to people would be you don’t have to redo everything. The client you use to interact with this system, then becomes a personal choice. You love Scala and you want to stay with Scala, OK, that’s fine. You can still use a library of functionality from another system. It is easy to take libraries in R or Python or wherever they come up. You are not paying an integration penalty, which you do today. Now people can make individual choices. Someone can say, “I like R better, I’m used to it.” OK, great, stay with R. But you can still benefit from all the innovation happening in whatever language it is happening in.

I suspect there will be lots of opinions around it all too.

Oh, yeah, as you can imagine. I see how it works and I have a history to know how it can work, but I also understand that communicating it and how people understand it to the point where they interpret it correctly—because everyone comes to things with their own lens. When I have explained this to some people, they don’t get it because I’m undoing some confirmation bias they already have. It takes me a while to figure out they have that bias. They end up assigning incorrect meaning. Meaning that I’m not talking about. Because they haven’t had the same lived experience.

It is a foundational idea but it requires a cathedral—some real effort up front that is fundamentally hard to pay for, frankly. Nobody wants to pay for the foundation, but they will love the end result. Once you get it, then it will snowball. And it becomes universal pretty quickly once you hit a tipping point.

That will take five years. To get to that point where it is like you look back and say, oh, isn’t that obvious? NumPy was written in 2005, but it wasn’t until 2009 when everybody thought, oh, it’s obvious. So if I can pull this off by the end of 2017 that would be great, but it will take until 2020 before everyone sees, oh, of course, why didn’t we do this earlier. If you look at what we are doing with data-shape, blaze, numba, odo and pluribus. All of the efforts we have done here are dancing around this fundamental core thing. Data fabric is a recent one with a simple prototype out there. But, the real story is just beginning.

Editor’s note: The above has been edited for length and clarity.

sign up for our newsletter to stay in touch

The post Building Connections appeared first on Silicon Valley Data Science.


New Rosoka Release: Improving Data Enrichment and Entity Extraction

No great operation is ever done alone. Lucky enough for BrightPlanet, we are incredibly fortunate to be able to work alongside many technical companies that care deeply about their craft. Without their contributions, we wouldn’t have gained such valuable insight into expanding our own capabilities. One of BrightPlanet’s long-time partners is Rosoka. We love working with […] The post New Rosoka Release: Improving Data Enrichment and Entity Extraction appeared first on BrightPlanet.

Big Data University

This Week in Data Science (April 11, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Featured Courses From BDU

Cool Data Science Videos

The post This Week in Data Science (April 11, 2017) appeared first on BDU.

InData Labs

InData Labs founders on specifics of building a Data Science company.

In just three years, data science company InData Labs became a significant player in the data research and artificial intelligence market, both in Belarus and overseas. In their big interview, this start-up from Minsk shares its experience of successful projects and talks about its technology stack and the intricacies of working with data. “Back in 2014, when the idea...

The post InData Labs founders on specifics of building a Data Science company. appeared first on InData Labs.

Ronald van Loon

How Machine Learning is Revolutionizing Digital Enterprises

According to the prediction of IDC Futurescapes, two-thirds of Global 2000 enterprise CEOs will center their corporate strategy on digital transformation. A major part of that strategy should include machine-learning (ML) solutions. The implementation of these solutions could change how these enterprises view customer value and their internal operating models today.

If you want to stay ahead of the game, then you cannot afford to wait for that to happen. Your digital business needs to move towards automation now while ML technology is developing rapidly. Machine learning algorithms learn from huge amounts of structured and unstructured data, e.g. text, images, video, voice, body language, and facial expressions. This opens a new dimension for machines, with limitless applications from healthcare systems to video games and self-driving cars.

In short, ML will intelligently connect people, businesses and things. It will enable completely new interaction scenarios between customers and companies and eventually allow a truly intelligent enterprise. To fully realize the applications that ML makes possible, we need to build a modern business environment. However, this will only be achieved if businesses can understand the distinction between Artificial Intelligence (AI) and Machine Learning (ML).

Understanding the Distinction Between ML and AI

Machines that could fully replicate or even surpass all human cognitive functions are still the stuff of science fiction stories, but Machine Learning is the reality behind AI, and it is available today. ML mimics how the human cognitive system functions and solves problems based on that functioning. It can analyze data that is beyond human capabilities. ML data analysis is based on the patterns it can identify in Big Data. It can make UX immersive and efficient while also being able to respond with human-like emotions. By learning from data instead of being programmed explicitly, computers can now deal with challenges previously reserved for humans. They now beat us at games like chess, go and poker; they can recognize images more accurately, transcribe spoken words more precisely, and are capable of translating over a hundred languages.

ML Technology and Applications for Life and Business

In order for us to comprehend the range of applications that will be possible due to ML technology, let us look at some examples available currently:

  • Amazon Echo, Google Home
  • Digital assistants: Apple’s Siri, SAP’s upcoming Copilot

Both types of devices provide an interactive experience for users through Natural Language Processing technology. With ML in the picture, this experience might be taken to new heights, for example through chatbots. Initially, chatbots will be part of the apps mentioned above, but it is predicted that they could make text and GUI interfaces obsolete!

ML technology does not force users to learn how to operate it; instead, it adapts itself to the user. It will do much more than give birth to a new interface; it will lead to the formation of enterprise AI.

The many ways in which ML can be applied include the provision of completely customized healthcare. It will be able to anticipate customers' needs based on their shopping history. It can make it possible for HR to recruit the right candidate for each job without bias, and it can automate payments in the finance sector.

Unprecedented Business Benefits via ML

Business processes will become automated and evolve with the increasing use of ML due to the benefits associated with it. Customers can use the technology to pick the best results and thus reach decisions faster. As the business environment changes, so will the advanced machines, as they constantly update and adapt themselves. ML will also help businesses arrive at innovations and keep growing by providing the right kind of products and services and by basing their decisions on the business model with the best outcome.

ML technology is able to develop insights that are beyond human capabilities based on the patterns it derives from Big Data. As a result, businesses will be able to act at the right time and take advantage of sales opportunities, converting them into closed deals. With the whole operation optimized and automated, the rate at which a business grows will accelerate. Moreover, business processes will achieve more at a lower cost. ML will lead businesses into environments with minimal human error and stronger cybersecurity.

ML Use Cases

The following three examples show how ML can be applied to an enterprise model that utilizes Natural Language Processing:

  • Support Ticket Classification

Consider the case where tickets from different media channels (email, social websites, etc.) need to be forwarded to the right specialist for the topic. The immense volume of support tickets makes the task lengthy and time-consuming. Applying ML here could help by classifying the tickets into different categories.

Through API and microservice integration, a ticket could be categorized automatically. If the share of correctly categorized tickets is high enough, an ML algorithm can route the ticket directly to the next service agent without the need for a support agent.
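The routing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach of any particular product: simple keyword counts stand in for a trained classifier, a per-ticket confidence score stands in for the overall accuracy mentioned above, and the category names, `CONFIDENCE_THRESHOLD`, `classify`, and `route_ticket` are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists standing in for a trained text classifier.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "refund", "payment", "charge"},
    "technical": {"error", "crash", "login", "bug"},
    "shipping": {"delivery", "package", "tracking", "late"},
}

# Assumed cut-off: below this confidence a human agent triages the ticket.
CONFIDENCE_THRESHOLD = 0.6


def classify(text):
    """Return (category, confidence) based on keyword hits.

    Confidence is the share of all keyword hits that belong to the
    winning category; (None, 0.0) means no keyword matched at all.
    """
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits[category] = sum(1 for w in words if w in keywords)
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    category, count = hits.most_common(1)[0]
    return category, count / total


def route_ticket(text):
    """Send confident predictions to a specialist queue, the rest to a human."""
    category, confidence = classify(text)
    if category is not None and confidence >= CONFIDENCE_THRESHOLD:
        return f"specialist:{category}"
    return "manual-triage"
```

A ticket that mentions only billing terms goes straight to the billing specialist, while a ticket that mixes billing and technical terms falls below the threshold and is left for a support agent to triage.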

  •  Recruiting

The job of prioritizing incoming applications for positions with hundreds of applicants can also be slow and time-consuming. If the task is automated via ML, HR can let the machine predict candidate suitability by providing it with a job description and the candidate’s CV. A definite pattern would be visible in the CVs of suitable candidates, such as the right length, experience, absence of typos, etc. Automating the process makes it more likely that the right candidate is found for the job.

  • Marketing 

ML will help businesses with brand recognition and customer retention in the following two ways:

  1. With the use of a brand intelligence app, identifying logos in event sponsorship videos or on TV can feed marketing ROI calculations.
  2. By staying up to date on customers’ transactions, businesses can use that behavior to predict how to maintain customer loyalty and find the best way to retain them.

How Enterprises Can Get Started Implementing Machine Learning

Businesses can step into the new age of ML and begin implementing the technique by letting machines learn from Big Data derived from various sources, e.g. images, documents, and IoT devices. While these machines can automate lengthy and repetitive tasks, they can also be used to predict the outcome for new data. The first step in implementing ML should be for a business to educate itself about the nature of ML and the range of its applications. A free openSAP course can help make that possible.

Another step that can bring a business closer to ML implementation is data preparation in complex landscapes. The era of information silos is over and there is an imperative need for businesses to gather data from various sources, such as customers, partners, and suppliers. The algorithms must then be provided open access to that data so they can learn and evolve. The Chief Data Officer of the company can oversee the ML integration process.

Starting with completely new use cases for Machine Learning is not easy; it requires a good understanding of the subject and the right level of expertise in the company. A better starting point for many companies would be to rely on ML solutions already integrated into standard software. That way, the solution connects seamlessly with the existing business processes and immediately starts to create value.

Lastly, businesses should start gathering the components necessary for building AI products. Among the requirements would be a cloud platform capable of handling high data volume that is derived from multiple sources. The relevant people are as important to this step as are the technology and processes. After all, they would be the ones who will be testing the latest digital and ML technologies.

If you want more information on SAP Machine Learning, then go here to subscribe to the webinar on Enabling the intelligent Enterprise with Machine Learning.

The presenters include Dr. Markus Noga, VP Machine Learning Innovation Center Network, SAP SE. You can follow him on Twitter. Ronald van Loon is the other presenter for the webinar. Mr. van Loon is counted among the Top 10 Big Data experts and is an IoT Influencer.


Markus Noga




Ronald van Loon


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence, and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company, a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience, and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.

The post How Machine Learning is Revolutionizing Digital Enterprises appeared first on Ronald van Loons.


April 10, 2017

Revolution Analytics

In case you missed it: March 2017 roundup

In case you missed them, here are some articles from March of particular interest to R users. A tutorial and comparison of the SparkR, sparklyr, rsparkling, and RevoScaleR packages for using R with...

Cloud Avenue Hadoop Tips

Creating a Billing Alarm in AWS

AWS provides a number of resources free for 1 year from the time of account creation (the free tier). Not all services are free, though, and it's very easy to use AWS resources unknowingly and start incurring charges, for example by creating an EC2 instance of a type other than t2.micro.

To get around this problem, AWS provides Billing Alarms. I have created a Billing Alarm that sends me an email when my monthly AWS expenses cross $10, as shown below. Here is the documentation for the same. Wherever possible, I will provide references to the AWS documentation instead of repeating it here.
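For readers who prefer to script the alarm rather than click through the console, the settings can be sketched as the parameters of CloudWatch's `PutMetricAlarm` API. This is a minimal sketch, not the setup from the post: the alarm name, the helper `billing_alarm_params`, and the SNS topic ARN (a placeholder account and topic) are assumptions, and the email itself would come from an SNS topic subscription that is not shown here.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"billing-over-{threshold_usd:g}-usd",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that delivers the email
    }


# Placeholder ARN for illustration only; substitute your own SNS topic.
params = billing_alarm_params(
    10, "arn:aws:sns:us-east-1:123456789012:billing-alerts"
)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. Billing metrics are stored in the North Virginia Region (us-east-1), so the alarm has to be created there.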

Cloud Avenue Hadoop Tips

AWS Regions and Availability Zones

The EC2 instance which we created was in North Virginia. This is called a Region in the AWS terminology, which is a separate geographic area. The Region can be selected from the top right of the AWS management console as shown below.

The price for the different services changes from Region to Region. For example, the hourly cost of an EC2 instance in the Mumbai Region is different from that in the North Virginia Region. I usually select the North Virginia Region when I try to explore something new in AWS, as resources in the North Virginia Region are the cheapest when compared to the other Regions. You can check the EC2 on-demand pricing for the different Regions here. By changing the Region, the price will change automatically.

For the sake of HA (High Availability), we can create an EC2 instance in the North Virginia Region to act as the primary server and a backup server in the Mumbai Region. If there is a problem in one Region, we still have servers in the other Region.

Each Region is a separate geographic area. Within each Region there are multiple Availability Zones (AZs) as shown below. A Region has at least 2 AZs. As of this writing there are 16 Regions and 42 AZs, and Amazon is expanding them on a regular basis. Each AZ has redundant power, networking and other resources. This way there is no common point of failure across any two AZs. Note that each AZ need not be a single Data Center; it can be more than one Data Center. More details here and here.

The Region can be selected by using the drop down option as mentioned above and the AZ can be selected at the time of the resource creation like an EC2 as shown below.

Let's say we want to create 4 EC2 instances in a particular Region. Instead of creating them in a single AZ, it's a best practice to create them across multiple AZs in a balanced fashion, as it provides better HA. By default, when we create any resource in AWS, we have to go with the assumption that THINGS WILL FAIL and architect for HA.

Let's look at this in a bit more detail. In the Mumbai Region (ap-south-1), there are two AZs (ap-south-1a and ap-south-1b). Within Ohio (us-east-2) there are three AZs (us-east-2a, us-east-2b and us-east-2c). The requirement is to create 4 EC2 instances.

In the setup below, we are creating all the EC2 instances in a single AZ (ap-south-1a). So, if there is any problem in that particular AZ, the entire service will be down.

To avoid the above-mentioned problem, it's recommended to create the instances in different AZs as shown below and also to have backup instances in an entirely different Region.
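The balanced placement described above amounts to a round-robin assignment. Here is a minimal sketch in Python; the instance names and the helper `spread_across_azs` are hypothetical, and the code only computes a placement plan without creating anything in AWS.

```python
from itertools import cycle


def spread_across_azs(instance_names, availability_zones):
    """Assign instances to AZs round-robin so AZ counts differ by at most one."""
    if not availability_zones:
        raise ValueError("at least one Availability Zone is required")
    placement = {az: [] for az in availability_zones}
    # zip stops when the instance names run out, so cycling forever is safe.
    for name, az in zip(instance_names, cycle(availability_zones)):
        placement[az].append(name)
    return placement


instances = ["web-1", "web-2", "web-3", "web-4"]  # hypothetical names

# Mumbai has two AZs, so each AZ gets two instances.
mumbai = spread_across_azs(instances, ["ap-south-1a", "ap-south-1b"])

# Ohio has three AZs, so one AZ gets two instances and the others one each.
ohio = spread_across_azs(instances, ["us-east-2a", "us-east-2b", "us-east-2c"])
```

With this plan, losing any single AZ in Mumbai takes out only half of the instances instead of the whole service.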

Another reason besides HA to have instances in different Regions might be compliance. Industry regulations might require that the servers be in different Regions.

How, then, do we dynamically shift the load from the primary instances to the backup instances in the case of a failure? This can be done using Route53, which we will be looking at in a future blog post.

April 09, 2017

Cloud Avenue Hadoop Tips

Creating a static website on a Linux EC2 instance

Now that we know how to create a Linux EC2 instance in the AWS Cloud and access the same, we will create a simple static web site on the same.

1. When we created a Security Group, port 22 was opened. This enables access to the remote machine. But a WebServer listens on port 80, so that port has to be opened as well.

2. Select the Security Group and click on the Inbound tab.

3. Click on Edit and add the rule for the HTTP inbound traffic as shown below.

Now, the Security Group rules should appear like this.

Note that if the Linux or the Windows instance is running, then any changes to the Security Group will take place immediately. There is no need to restart the EC2 instance.

4. The best practice is to update the packages on the Linux instance with `sudo yum update -y`. The `-y` flag answers yes to any prompts automatically.

5. Install the WebServer and start it. The first command elevates the user to root, the second installs the WebServer, and the last one starts it.

sudo su
yum install -y httpd
service httpd start

The WebServer won't start automatically if the Linux instance is rebooted. To start it automatically on boot/reboot, use the command below.

chkconfig httpd on

6. Go to the /var/www/html folder and create an index.html file using the commands below.

cd /var/www/html
echo "Welcome to" > index.html

7. Get the IP address of the Linux instance from the EC2 management console.

8. Open the same in the browser to get the web page as shown below.

The above sequence of steps simply installs a WebServer on the Linux instance and puts a simple web page in it. Much more interesting things can be done. One thing to note is that all the above commands were executed as root, which is not a good practice, but was done for the sake of simplicity.

April 08, 2017

Simplified Analytics

From Bullock Cart to Hyperloop – Digital Transformation of Travel

Remember when you were a teenager and wanted to go on vacation with your parents, and you were asked to go to the travel agent and get all the printed brochures of exotic locations? Then came the wave...


April 07, 2017

Revolution Analytics

Because it's Friday: Art Collective

Reddit conducted an interesting social experiment last weekend. It provided all of its users with a blank canvas, and the ability to color its pixels according to just three simple rules: You can...


Revolution Analytics

The faces of R, analyzed with R

Maëlle Salmon recently created a collage of profile pictures of people who use the #rstats hashtag in their Twitter bio to indicate their use of R. (I've included a detail below; click to see the...

Cloud Avenue Hadoop Tips

Creating a Windows EC2 and logging into it

In the previous blog, we looked at creating a Linux EC2 instance and logging into it. This time it will be a Windows EC2 instance. The steps are more or less the same, with some minor changes here and there. I will highlight the changes in this blog, so I would recommend going through the blog where we created a Linux instance and then coming back to this one.

1. Log in to the EC2 management console. Create a Key Pair, if you haven't already done so, as shown here. The same Key Pair can be used for Linux and Windows instances. There is no need to create two different Key Pairs. Also, there is no need to convert the pem file into a ppk file for logging into a Windows instance.

2. Create a Security Group as shown here. Instead of opening port 22 for ssh, open port 3389 as shown below.

3. Click on `Instances` in the left pane and click on `Launch Instances`.

4. Select the Windows AMI as shown below.

5. Select the EC2 instance type as shown below.

6. Click on `Next : Configure Instance Details`.

7. Click on `Next : Add Storage`. Note that in the case of a Linux instance the storage defaults to 8GB, but in the case of Windows it's 30GB. Windows eats a lot of space.

8. Click on `Next : Add Tags`.

9. Click on `Next : Configure Security Group`. Click on `Select an existing security group` and select the Security Group which has been created for the Windows instance.

10. Click on `Review and Launch`.

11. Make sure all the settings are proper and click on `Launch`.

12. Select the Key Pair which has been created earlier and click the `I acknowledge .....` check box. Finally, click on `Launch Instances`.

13. Click on `View Instances`. In a couple of minutes, the Instance State should change to running as shown below.

14. Make sure that the instance is selected and click on the `Connect Button` to download the `Remote Desktop File`. Save it somewhere, where you can find it easily.

15. Click on the `Get Password` button. Click on `Browse` and point to the pem file which was generated during the Key Pair creation. Click on `Decrypt Password`.

16. A random password is generated and displayed as shown below. Note down the password.

17. Double-click on the `Remote Desktop File` which was downloaded in Step 14. Click on `Connect` when prompted with a warning.

18. Enter the password which we got in Step 16 and click on OK. Click on `Yes` when prompted with a warning.

19. Now, we are connected to the Windows EC2 instance in the AWS cloud.

20. Make sure to shut down the Windows instance and terminate the EC2 instance from the AWS EC2 console to stop the billing.

Similarly, we would be able to start multiple Linux and Windows instances based on our Non Functional Requirements (NFR). Instead of creating the instances manually, Auto Scaling can be used. In Auto Scaling, we can specify the conditions under which the number of instances should scale up or scale down. For example, if CPU > 80% then add 2 Linux instances, and if CPU < 30% then remove 2 Linux instances. Here we specify CPU, but a lot of other metrics can be used.

AWS is fun. This is just the beginning, and we will look further into AWS in the upcoming blogs.
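The scaling rule above can be expressed as a small decision function. This is a sketch of the logic only, not of the Auto Scaling service itself: the function name and the `minimum` and `maximum` defaults are assumptions, chosen to mirror the minimum and maximum size that an Auto Scaling group is configured with.

```python
def desired_instance_count(current, cpu_percent, minimum=2, maximum=10):
    """Apply the example rule: CPU > 80% adds 2 instances, CPU < 30% removes 2.

    The result is clamped to the assumed group limits [minimum, maximum].
    """
    if cpu_percent > 80:
        target = current + 2
    elif cpu_percent < 30:
        target = current - 2
    else:
        target = current  # within the comfortable range, leave the group alone
    return max(minimum, min(maximum, target))
```

For instance, `desired_instance_count(4, 85)` asks for 6 instances, while `desired_instance_count(2, 20)` stays at 2 because the group is already at its assumed minimum.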

April 06, 2017

Revolution Analytics

Microsoft R Open 3.3.3 now available

Microsoft R Open (MRO), Microsoft's enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the...

Silicon Valley Data Science

Is Your Customer Journey Set Up for Success?

A senior marketer’s ability to find and create valuable experiences for customers has grown dramatically in recent years. Beyond the traditional responsibilities of brand and creative management, senior marketers (such as CMOs, Brand Managers, and Product Marketing Managers) now use analytics to track customer interactions, measure the quality of engagement, and determine the effectiveness of an enormous range of different marketing tactics.

Marketers often map out a “customer journey” in order to manage successful engagements. The customer journey is the complete sum of experiences that your customers go through when interacting with your company and brand—mapping out these interactions gives you a holistic view of how customers engage with your company. While many marketers focus on developing positive interactions, a customer journey is a plan that focuses on how a series of engagements can generate momentum from awareness, to sale, to ongoing loyalty and advocacy.

According to a Salesforce report, nearly all senior level marketers agree that a comprehensive journey map is absolutely critical or very important to their business. At the same time, the report mentions that this map has largely been an aspiration for marketers—in part due to siloed business teams and a disjointed view of customer data, only 29% of enterprise companies would rate themselves as very effective or effective at creating a cohesive journey.

Senior marketers should take responsibility for this challenge head on; to be successful in creating a useful map, you will also need to lead the technical and analytical development of your teams. You should have an intuition for how data can enhance, track, and articulate the customer experience, as this intuition creates new possibilities for the type of relationships companies can have with their customers.

In this post, we’ll walk through some examples of how we have seen data capabilities determine the success of customer journey initiatives for our clients. We’ll also offer guidance on the data-related initiatives that you can start today to begin fostering closer ties with your customers—regardless of where you currently are in your specific state of development.

What can your data do for you?

We’ve seen data play various roles in creating strong customer engagements. Here’s a look at just a few.

Data integration helps marketers optimize analyses for different, more holistic, outcomes (Social Gaming)

For many businesses, legacy and heterogeneous systems are a challenge for creating an integrated customer experience—the data is often structured with a narrow lens on a specific product or domain. We worked with a social gaming company facing this challenge: they wanted to create a cohesive customer experience across all their games by extending the preferential treatment that loyal customers receive for their favorite games to new games on the platform.

On a per-game basis, our client was technically sophisticated—they could build out events, correlate performance with targeted marketing strategies, and articulate the effectiveness of different campaigns. However, this sophistication fell apart at the organization-wide level. After several acquisitions and the use of third party licensing for games, the company found itself with a broken analytical architecture—each game optimized for itself, but there was a lot of opportunity in optimizing across the business. By supporting the integration of different data from different silos, the new architecture enabled the company to:

  • Increase customer satisfaction by rewarding loyalty. Unable to establish a single view of their customers, our client was continually losing opportunities to tailor experiences for them. For example, high-paying customers that had “preferred” status for certain games would return to being “unknown” when they started playing new games. Now that the client can carry customer status across games, players will have greater satisfaction and loyalty.
  • Improve game development by understanding interaction patterns. Some of the most important metrics in gaming—e.g., alliances and teaming—are challenging to measure. Breaking down silos allowed the client to utilize novel sources of data, like gaming chats, to articulate a “web of influence” and its role in engagement and profitability.

By reframing the customer journey as an experience that transcends individual games—and developing a supporting data architecture—our client was able to both develop an engine for growth and improve profitability on an individual basis, by both reducing the acquisition cost (UAC) of customers and increasing their lifetime value (LTV).

Modern architectures help marketers redefine the product suite and their customer relationships (Digital Entertainment)

Take a look at the data sources you’re using in your marketing efforts and you may find some unexpected insights. For example, is there untapped value in your existing customer relationships? A digital entertainment company wanted to develop a modern database architecture that would allow them to understand user consumption at both a customer and population-level—at a microsecond granularity.

Redesigning the client’s architecture to identify and articulate the customer’s consumption patterns ultimately gave the client vastly more usable data about their customers, which led to:

  • New customer offerings. Our client used these new capabilities to develop more effective cross-selling opportunities and to develop products that provide guidance to content providers.
  • Improved strategic decision making. This view of consumption informed large strategic bets for the organization—for example, the decision to give away existing products for free in return for increased engagement.

The client had always had tremendous potential to understand customer engagement and consumption patterns—they were a portal for millions of users. However, their existing platform was limited by its underlying technology and a myopic view of the role of data. In their original product offering, the collected data was not necessarily perceived to have inherent value. We find this type of oversight to be common for marketers beginning to build analytics within their teams.

Note: This concept of “instrumentation”—the process of logging and tracking customer interactions—is important when creating an engagement plan. Instrumentation creates a more nuanced understanding of what customers find valuable about your products and services. For our client, this instrumentation influenced their entire business: from feature development, to sales, to pricing, and even to marketing copy about the efficacy of their existing product suite. Instrumentation is so important, in fact, that it is something we at SVDS specifically assess when considering the data maturity of a business.

Digitization can require updating back-office processes to meet customer expectations (National Retailer)

Work that begins in the marketing department often extends to influence other parts of a business. In one example, we worked with a traditional brick and mortar retail client that was in the process of developing their digital presence.

As their customers began to spend more time purchasing items online, our client realized that they would need new capabilities to support new types of interactions—for example, granularity in tracking inventory. When creating its online presence, our client found itself frequently selling under-stocked items from the website, and then had to follow them up with costly gift card apologies to disappointed customers.

We helped the client gain an understanding of their inventory baseline and establish a real-time view of changes in supply and demand. This allowed the business as a whole to establish new customer relationships and a leaner efficiency:

  • Customer interaction with the brand. E-commerce has developed some great features to incent purchasing behavior—we have all experienced messages like, “There are only 3 items left in stock!” or “Order by 11:59pm Tuesday to get by Christmas.” Our client’s new modern inventory infrastructure allowed them to create similar features, increasing trust and confidence in the brand.
  • Reduced working capital. In the past, a granular level of detail—across stores, channels, and partners (e.g., third party sellers)—was not required from legacy operations. In developing a new system, our client was able to serve items that were stocked-out in their web distribution centers from stores in close proximity to the buyers. This allowed our client to reduce their buffer stock and protect against obsolescence.

For our retail client, marketing’s strategic influence and digital leadership forced growth in other parts of the business that benefited customers and improved the company’s competitive position.

Creating opportunities with data

The examples described above all involve companies we would consider to be leaders in the use of data to drive more useful customer engagements—first and foremost by recognizing their need to embrace change. There is a big shift taking place, and that shift will become the new normal.

You should be trying to learn quickly, and fail fast. Companies that are able to make the best use of their data and infrastructure earlier than their competition are at an advantage, both in increased customer loyalty and in improvements in new product development.

In a November 2015 Harvard Business Review article on the customer journey, the authors stated that, “Best practitioners aim not just to improve the existing journey but to expand it, adding useful steps or features.” As mentioned in the examples earlier in this post, we have seen the same trend: clients who can harness their data to create effective customer experiences often make further investments toward developing their capabilities.

There are data-related initiatives that you can begin pursuing today to develop stronger relationships with your customers. Make an honest assessment of where you stand now, and find yourself in the sections below.

If you are starting from scratch

Possible problems

  • Identifying how to make decisions based on data-driven insights
  • Identifying single points of ownership within the organization

Suggested approach

  • Focus on top-down buy-in. Without recognition from leadership, data initiatives will struggle to get relevance with business users and may be piecemeal efforts, diminishing the value of investment.

If you are performing early project identification

Possible problem

  • Mapping out customer journey initiatives

Suggested approaches

  • Get alignment on the full picture. It is important to be able to articulate the “full view” of the customer experience as it has a large effect on decision making. For example, only tracking successful interactions would lead to very different conclusions than understanding users that “turn away.”
  • Plan for iteration. It often takes time to understand where your map does and does not match your customers’ realities.
  • Start small. Change can be incremental—look for low-hanging fruit.

If you are establishing visibility

Possible problem

  • Data integration

Suggested approaches

  • Prioritize data collection in tandem with journey mapping. In instrumentation, it is important to know what behaviors you can collect directly from customers and what behaviors you have to infer based on their actions. This influences what additional data sources you include to support decisions you make for the business.
  • Seek to reuse and extend data services. Developing known, validated, and consistent data assets for your business increases their utility and dramatically improves trust in the developed insights across the organization.

If you are optimizing for growth

Possible problems

  • Personalization and automation

Suggested approach

  • Enable advanced analytics. Optimize for automation to create feedback loops and self-learning capabilities that make it easy to identify and capitalize on growth opportunities.

Customer preferences and expectations have changed dramatically as businesses learn how to develop new digital experiences, collect feedback, and augment existing offerings. Success requires understanding what information you need to collect about your customers, instrumenting processes to be able to track customer behavior, and developing the analytics that allow your business to optimize customer interactions. As a marketer, you must be willing to grow with the needs of a modern business.

Has your company taken any steps to strengthen their customer engagement strategies through better use of data? Share your story in the comments.

Download our data strategy position paper

The post Is Your Customer Journey Set Up for Success? appeared first on Silicon Valley Data Science.