
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


March 23, 2017

Silicon Valley Data Science

TensorFlow RNN Tutorial

On the deep learning R&D team at SVDS, we have investigated Recurrent Neural Networks (RNN) for exploring time series and developing speech recognition capabilities. Many products today rely on deep neural networks that implement recurrent layers, including products made by companies like Google, Baidu, and Amazon.

However, when developing our own RNN pipelines, we did not find many simple and straightforward examples of using neural networks for sequence learning applications like speech recognition. Many examples were either powerful but quite complex, like the actively developed DeepSpeech project from Mozilla under Mozilla Public License, or were too simple and abstract to be used on real data.

In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and you can find the accompanying GitHub repository here. The software we’re using is a mix of code borrowed from and inspired by existing open source projects. Below is a video example of machine speech recognition on a 1906 Edison Phonograph advertisement. The video includes a running trace of sound amplitude, extracted spectrogram, and predicted text.

Since we have extensive experience with Python, we used a well-documented package that has been advancing by leaps and bounds: TensorFlow. Before you get started, if you are brand new to RNNs, we highly recommend you read Christopher Olah’s excellent overview of RNN Long Short-Term Memory (LSTM) networks here.

Speech recognition: audio and transcriptions

Until the 2010s, the state of the art for speech recognition was phonetic-based approaches that included separate components for pronunciation, acoustic, and language models. Speech recognition, both past and present, relies on decomposing sound waves into frequency and amplitude using Fourier transforms, yielding a spectrogram as shown below.

Training the acoustic model for a traditional speech recognition pipeline that uses Hidden Markov Models (HMM) requires speech+text data, as well as a word-to-phoneme dictionary. HMMs are generative probabilistic models for sequential data, and are typically evaluated using Levenshtein word error distance, a string metric that counts the insertions, deletions, and substitutions needed to turn one sequence into another.
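
As a rough illustration of the metric (a minimal sketch, not code from the accompanying repository), the Levenshtein distance can be computed with dynamic programming and normalized by the reference length to give a word error rate. The function names `edit_distance` and `word_error_rate` are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because `edit_distance` accepts any sequence, passing the raw strings instead of word lists gives a character-level distance with the same code.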

These models can be simplified and made more accurate with speech data that is aligned with phoneme transcriptions, but this is a tedious manual task. Because of this effort, phoneme-level transcriptions are less likely to exist for large sets of speech data than word-level transcriptions. For more information on existing open source speech recognition tools and models, check out our colleague Cindi Thompson’s recent post.

Connectionist Temporal Classification (CTC) loss function

We can discard the concept of phonemes when using neural networks for speech recognition by using an objective function that allows for the prediction of character-level transcriptions: Connectionist Temporal Classification (CTC). Briefly, CTC enables the computation of probabilities of multiple sequences, where the sequences are the set of all possible character-level transcriptions of the speech sample. The network uses the objective function to maximize the probability of the character sequence (i.e., chooses the most likely transcription), and calculates the error for the predicted result compared to the actual transcription to update network weights during training.
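
To make the idea of "multiple sequences" concrete, here is a small sketch of the mapping CTC uses between frame-level label paths and transcriptions: consecutive repeats are merged, then blanks are removed. Many different paths collapse to the same transcription, and the CTC loss sums their probabilities. This sketch only shows the collapse step and a greedy (best-path) decoder, not the loss itself; the `_` blank symbol and the function names are assumptions for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(path):
    """Map a frame-level label path to a transcription:
    merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != BLANK)


def greedy_decode(frame_probs, alphabet):
    """Pick the most likely label in each frame, then collapse the path."""
    path = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(path)


# Two different paths that collapse to the same transcription.
print(ctc_collapse("hh_e_ll_lo"))
print(ctc_collapse("h_ee_l_l_oo"))
```

Note that the blank is what allows a doubled letter to survive: without a blank between the two `l` frames, they would be merged into one.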

It is important to note that the character-level error used by a CTC loss function differs from the Levenshtein word error distance often used in traditional speech recognition models. For character-generating RNNs, the character and word error distance will be similar in phonetic languages such as Esperanto and Croatian, where individual sounds correspond to distinct characters. Conversely, the character versus word error will be quite different for a non-phonetic language like English.

If you want to learn more about CTC, there are many papers and blog posts that explain it in more detail. We will use TensorFlow’s CTC implementation, and there continues to be research and improvements on CTC-related implementations, such as this recent paper from Baidu. In order to utilize algorithms developed for traditional or deep learning speech recognition models, our team structured our speech recognition platform for modularity and fast prototyping:

Importance of data

It should be no surprise that creating a system that transforms speech into its textual representation requires having (1) digital audio files and (2) transcriptions of the words that were spoken. Because the model should generalize to decode any new speech samples, the more examples we can train the system on, the better it will perform. We researched freely available recordings of transcribed English speech; some examples that we have used for training are LibriSpeech (1000 hours), TED-LIUM (118 hours), and VoxForge (130 hours). The chart below includes information on these datasets including total size in hours, sampling rate, and annotation.

In order to easily access data from any data source, we store all data in a flat format. This flat format has a single .wav and a single .txt per datum. For example, you can find example LibriSpeech training datum ‘211-122425-0059’ in our GitHub repo as 211-122425-0059.wav and 211-122425-0059.txt. These data filenames are loaded into the TensorFlow graph using a datasets object class, which helps TensorFlow efficiently load and preprocess the data and move individual batches from CPU to GPU memory. An example of the data fields in the datasets object is shown below:

class DataSet:
    def __init__(self, txt_files, thread_count, batch_size, numcep, numcontext):
        # ...

    def from_directory(self, dirpath, start_idx=0, limit=0, sort=None):
        return txt_filenames(dirpath, start_idx=start_idx, limit=limit, sort=sort)

    def next_batch(self, batch_size=None):
        idx_list = range(_start_idx, end_idx)
        txt_files = [_txt_files[i] for i in idx_list]
        wav_files = [x.replace('.txt', '.wav') for x in txt_files]
        # Load audio and text into memory
        (audio, text) = get_audio_and_transcript(
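
The excerpt above is abbreviated. As a self-contained sketch of the same flat-format idea (not the repository's implementation), the helpers below list transcript files and pair each one with the audio file that shares its basename. The names `list_txt_files` and `paired_batches` are hypothetical.

```python
import os


def list_txt_files(dirpath, start_idx=0, limit=0):
    """Return sorted .txt paths in dirpath; limit=0 means no limit."""
    names = sorted(n for n in os.listdir(dirpath) if n.endswith(".txt"))
    paths = [os.path.join(dirpath, n) for n in names]
    end_idx = start_idx + limit if limit > 0 else len(paths)
    return paths[start_idx:end_idx]


def paired_batches(txt_files, batch_size):
    """Yield (txt_files, wav_files) batches, pairing each transcript
    with the audio file that shares its basename."""
    for start in range(0, len(txt_files), batch_size):
        batch_txt = txt_files[start:start + batch_size]
        batch_wav = [os.path.splitext(t)[0] + ".wav" for t in batch_txt]
        yield batch_txt, batch_wav


for txt, wav in paired_batches(["a/1.txt", "a/2.txt", "a/3.txt"], batch_size=2):
    print(txt, wav)
```

Keeping one .wav and one .txt per datum means the pairing needs no index file; the filename is the key.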

Feature representation

In order for a machine to recognize audio data, the data must first be converted from the time to the frequency domain. There are several methods for creating features for machine learning of audio data, including binning by arbitrary frequencies (e.g., every 100 Hz), or by using binning that matches the frequency bands of the human ear. The typical human-centric transformation for speech data is to compute Mel-frequency cepstral coefficients (MFCC), either 13 or 26 different cepstral features, as input for the model. After this transformation the data is stored as a matrix of frequency coefficients (rows) over time (columns).
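
To illustrate what "binning that matches the frequency bands of the human ear" means, the sketch below spaces band edges evenly on the mel scale, using one commonly cited Hz-to-mel formula (other variants exist). It is only the band-spacing step: a full MFCC computation also applies a filterbank, a logarithm, and a discrete cosine transform. The function names are made up for this example.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Edges of n_bands bands spaced evenly on the mel scale, returned in Hz."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / n_bands
    return [mel_to_hz(low + i * step) for i in range(n_bands + 1)]


# 26 bands up to 8,000 Hz (the Nyquist frequency for 16,000 Hz audio).
edges = mel_band_edges(0.0, 8000.0, 26)
print([round(e) for e in edges[:5]], "...", round(edges[-1]))
```

The bands come out narrow at low frequencies and wide at high frequencies, unlike fixed-width binning.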

Because speech sounds do not occur in isolation and do not have a one-to-one mapping to characters, we can capture the effects of coarticulation (the articulation of one sound influencing the articulation of another) by training the network on overlapping windows (tens of milliseconds) of audio data that capture sound from before and after the current time index. Example code showing how to obtain MFCC features and how to create windows of audio data is shown below:

# Load wav files
fs, audio =

# Get mfcc coefficients
orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)

# For each time slice of the training set, copy in its context; this widens each
# numcep-dimensional vector to numcep + 2 * numcep * numcontext dimensions
train_inputs = np.array([], np.float32)
train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))

for time_slice in range(train_inputs.shape[0]):
    # Pick up to numcontext time slices in the past,
    # And complete with empty mfcc features
    need_empty_past = max(0, ((time_slices[0] + numcontext) - time_slice))
    empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
    data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
    assert(len(empty_source_past) + len(data_source_past) == numcontext)

For our RNN example, we use 9 time slices before and 9 after, for a total of 19 time points per window. With 26 cepstral coefficients, this is 494 data points per 25 ms observation. Depending on the data sampling rate, we recommend 26 cepstral features for 16,000 Hz and 13 cepstral features for 8,000 Hz. Below is an example of data loading windows on 8,000 Hz data:

If you would like to learn more about converting analog to digital sound for RNN speech recognition, check out Adam Geitgey’s machine learning post.
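
The window arithmetic described in this section (9 + 1 + 9 = 19 time points, 19 × 26 = 494 values) can be checked with a small pure-Python sketch. It assumes each window is laid out as past frames, current frame, future frames, with zero padding beyond the ends of the utterance; `add_context` is a hypothetical name, not a function from the repository.

```python
def add_context(frames, numcontext):
    """Concatenate each frame with numcontext frames on either side,
    zero-padding past the start and end of the utterance."""
    numcep = len(frames[0])
    empty = [0.0] * numcep
    windows = []
    for t in range(len(frames)):
        window = []
        for offset in range(-numcontext, numcontext + 1):
            idx = t + offset
            window.extend(frames[idx] if 0 <= idx < len(frames) else empty)
        windows.append(window)
    return windows


# 50 dummy frames of 26 coefficients; frame t is filled with the value t + 1.
frames = [[float(t + 1)] * 26 for t in range(50)]
windows = add_context(frames, numcontext=9)
print(len(windows), len(windows[0]))
```

Each output row has the width `numcep + 2 * numcep * numcontext`, which matches the `resize` call in the excerpt above.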

Modeling the sequential nature of speech

Long Short-Term Memory (LSTM) layers are a type of recurrent neural network (RNN) architecture that is useful for modeling data with long-term sequential dependencies. They are important for time series data because they essentially remember past information at the current time point, which influences their output. This context is useful for speech recognition because of its temporal nature. If you would like to see how LSTM cells are instantiated in TensorFlow, we’ve included example code below from the LSTM layer of our DeepSpeech-inspired Bi-Directional Neural Network (BiRNN).

with tf.name_scope('lstm'):
    # Forward direction cell:
    lstm_fw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)
    # Backward direction cell:
    lstm_bw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)

    # Now we feed `layer_3` into the LSTM BRNN cell and obtain the LSTM BRNN output.
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(
        # Input is the previous Fully Connected Layer before the LSTM

    tf.summary.histogram("activations", outputs)

For more details about this type of network architecture, there are some excellent overviews of how RNNs and LSTM cells work. Additionally, there continues to be research on alternatives to using RNNs for speech recognition, such as with convolutional layers which are more computationally efficient than RNNs.
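
For intuition about what a single LSTM cell computes, here is a scalar (one-unit) sketch of the standard LSTM update, plus a helper that runs it forward and backward over a sequence as a bidirectional layer would. It is a teaching sketch under simplifying assumptions (scalar inputs and weights, hypothetical function names), not TensorFlow's implementation; `forget_bias` is added to the forget gate's pre-activation, mirroring the `forget_bias=1.0` argument in the excerpt above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w, forget_bias=1.0):
    """One step of a single-unit LSTM cell.

    w maps each gate name to a (w_x, w_h, b) tuple of scalar weights.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))                 # how much new information to write
    f = sigmoid(pre("forget") + forget_bias)  # how much old cell state to keep
    o = sigmoid(pre("output"))                # how much of the state to expose
    g = math.tanh(pre("candidate"))           # candidate cell update
    c = f * c_prev + i * g                    # new cell state
    h = o * math.tanh(c)                      # new hidden state (the output)
    return h, c


def run(sequence, w):
    """Run the cell over a sequence, returning the output at every step."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


def bidirectional(sequence, w_fw, w_bw):
    """Pair forward outputs with backward outputs at each time step."""
    fw = run(sequence, w_fw)
    bw = run(sequence[::-1], w_bw)[::-1]
    return list(zip(fw, bw))


weights = {gate: (0.5, 0.1, 0.0) for gate in ("input", "forget", "output", "candidate")}
print(bidirectional([1.0, 2.0, 1.0], weights, weights))
```

The backward pass is what gives each time step access to future context, which is why the bidirectional variant suits whole-utterance transcription.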

Network training and monitoring

Because we trained our network using TensorFlow, we were able to visualize the computational graph as well as monitor the training, validation, and test performance from a web portal with very little extra effort using TensorBoard. Using tips from Dandelion Mane’s great talk at the 2017 TensorFlow Dev Summit, we utilize tf.name_scope to add node and layer names, and write out our summary to file. The result is an automatically generated, understandable computational graph, such as this example of a Bi-Directional Neural Network (BiRNN) below. The data is passed among different operations from bottom left to top right. The different nodes can be labelled and colored with namespaces for clarity. In this example, teal ‘fc’ boxes correspond to fully connected layers, and the green ‘b’ and ‘h’ boxes correspond to biases and weights, respectively.

We used the TensorFlow-provided tf.train.AdamOptimizer to control the learning rate. The AdamOptimizer improves on traditional gradient descent by using momentum (moving averages of the gradients and of their squares), which lets it adapt the effective step size for each parameter during training. We can track the loss and error rate by creating summary scalars of the label error rate:

  # Create a placeholder for the summary statistics
  with tf.name_scope("accuracy"):
      # Compute the edit (Levenshtein) distance of the top path
      distance = tf.edit_distance(tf.cast(self.decoded[0], tf.int32), self.targets)

      # Compute the label error rate (accuracy)
      self.ler = tf.reduce_mean(distance, name='label_error_rate')
      self.ler_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
      self.train_ler_op = tf.summary.scalar("train_label_error_rate", self.ler_placeholder)
      self.dev_ler_op = tf.summary.scalar("validation_label_error_rate", self.ler_placeholder)
      self.test_ler_op = tf.summary.scalar("test_label_error_rate", self.ler_placeholder)
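
Returning to the optimizer mentioned above, the following scalar sketch shows the Adam update rule as it is usually published: moving averages of the gradient and the squared gradient, bias correction, and a step scaled by their ratio. It is an illustration on a toy objective, not TensorFlow's internal implementation, and `adam_minimize` is a hypothetical name. The default values mirror the commonly documented Adam defaults.

```python
import math


def adam_minimize(grad, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient function."""
    m = 0.0  # moving average of the gradient (first moment)
    v = 0.0  # moving average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
result = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=2000, lr=0.01)
print(result)
```

Dividing by the square root of the second moment means the early steps have roughly the size of `lr` regardless of how large the raw gradient is.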

How to improve an RNN

Now that we have built a simple LSTM RNN network, how do we improve our error rate? Luckily for the open source community, many large companies have published the math that underlies their best performing speech recognition models. In September 2016, Microsoft released a paper on arXiv describing how they achieved a 6.9% error rate on the NIST 2000 Switchboard data. They utilized several different acoustic and language models on top of their convolutional+recurrent neural network. Several key improvements that have been made by the Microsoft team and other researchers in the past 4 years include:

  • using language models on top of character based RNNs
  • using convolutional neural nets (CNNs) for extracting features from the audio
  • ensemble models that utilize multiple RNNs

It is important to note that the language models that were pioneered in traditional speech recognition models of the past few decades are again proving valuable in the deep learning speech recognition models.

Modified from: A Historical Perspective of Speech Recognition, Xuedong Huang, James Baker, Raj Reddy, Communications of the ACM, Vol. 57 No. 1, Pages 94-103, 2014

Training your first RNN

We have provided a GitHub repository with a script that provides a working and straightforward implementation of the steps required to train an end-to-end speech recognition system using RNNs and the CTC loss function in TensorFlow. We have included example data from the LibriVox corpus in the repository. The data is separated into folders:

  • Train: train-clean-100-wav (5 examples)
  • Test: test-clean-wav (2 examples)
  • Dev: dev-clean-wav (2 examples)

When training on this handful of examples, you will quickly notice that the network overfits the training data to ~0% word error rate (WER), while the Test and Dev sets will be at ~85% WER. The reason the test error rate is not 100% is that, out of the 29 possible character choices (a-z, apostrophe, space, blank), the network will quickly learn that:

  • certain characters (e, a, space, r, s, t) are more common
  • consonant-vowel-consonant is a pattern in English
  • increased signal amplitude of the MFCC input sound features corresponds to characters a-z
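
The 29 output classes mentioned above can be sketched as a simple character-to-index mapping. The index layout here (a-z first, then apostrophe, space, and the blank last) is an assumption for illustration and may differ from the repository's; placing the blank at the final index follows the convention documented for TensorFlow's CTC loss.

```python
import string

# Assumed index layout: 0-25 for a-z, 26 apostrophe, 27 space, 28 CTC blank.
ALPHABET = list(string.ascii_lowercase) + ["'", " "]
BLANK_INDEX = len(ALPHABET)      # 28, reserved for the CTC blank
NUM_CLASSES = len(ALPHABET) + 1  # 29

CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def text_to_labels(text):
    """Lowercase a transcript and map it to integer labels,
    dropping characters outside the alphabet."""
    return [CHAR_TO_INDEX[ch] for ch in text.lower() if ch in CHAR_TO_INDEX]


def labels_to_text(labels):
    """Map integer labels back to text, skipping the blank."""
    return "".join(ALPHABET[i] for i in labels if i != BLANK_INDEX)


labels = text_to_labels("It's ok")
print(labels, repr(labels_to_text(labels)))
```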

The results of a training run using the default configurations in the GitHub repository are shown below:

If you would like to train a performant model, you can add additional .wav and .txt files to these folders, or create a new folder and update `configs/neural_network.ini` with the folder locations. Note that it can take quite a lot of computational power to process and train on just a few hundred hours of audio, even with a powerful GPU.

We hope that our provided repo is a useful resource for getting started—please share your experiences with adopting RNNs in the comments. To stay in touch, sign up for our newsletter or contact us.

The post TensorFlow RNN Tutorial appeared first on Silicon Valley Data Science.

Revolution Analytics

Announcing R Tools 1.0 for Visual Studio 2015

by Shahrokh Mortazavi, Partner PM, Visual Studio Cloud Platform Tools at Microsoft

I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will...


The 2 critical things to keep in mind when moving from Predictive Modelling to Machine Learning

Everyone wants to learn more about how machine learning can be used in their business. What’s interesting though is that many companies may already be using machine learning to some extent without really realising it. The lines between predictive analytics and machine learning are actually quite blurred. Many companies will have built up some machine learning capabilities using predictive analytics in some area of their business. So if you use static predictive models in your business, then you are already using machine learning, albeit of the static variety.

Cloud Avenue Hadoop Tips

Got through `AWS Certified Solutions Architect - Associate`

Today I got through the AWS Certified Solutions Architect - Associate. Recently I got an opportunity to work on the AWS Services and so decided to take the certification. It took me close to 60 hours for the preparation. It was fun. So, here it is

I took the extended version of the exam. It had 20 additional questions, 30 more minutes with a nice 50% discount on the price. The additional questions were mixed in the exam and were a bit tough. They were not included in the pass grade. One can't clearly say which were the additional questions. I think Amazon was doing some sort of A/B testing on the certifications and so the discount.

There are a total of 9 certifications (1, 2) including the 3 beta certifications which were introduced this year. For some reason the links to the AWS beta certifications (security, big data and networking) are no longer working. I am planning to complete as many as I can, especially the AWS Big Data Specialty Certificate.

Here are a few tips

- Read the user guide and the FAQ for the different services
- Watch the videos in the AWS YouTube channel
- Practice a lot

Next, I am planning to take the `AWS Certified Developer - Associate` certification. I will update the blog on that later.

March 22, 2017

Silicon Valley Data Science

Spark Summit: Ignition in the Enterprise

Apache Spark has been causing a lot of excitement since its release in 2014, and is becoming established as the big data analytics engine of choice. It has been quickly embraced by business, thanks to its performance, wide range of applications, and friendliness to developers and data scientists.

One reason Spark has gained rapid adoption is its ability to integrate with diverse sources of data, presenting a common platform for analytics. As enterprises start working more with Spark, it’s important to ensure that this philosophy of integration is also reflected in the community and resources.

In this spirit, I’m very excited to announce that for Spark Summit 2017 in San Francisco, I will be joining Reynold Xin as co-chair of the Spark Summit program. As Spark grows in the enterprise, I’ll be helping ensure that the conference grows too, bringing in experiences and best practices from real enterprise users.

I hope you’ll join us in San Francisco from June 5-7 in Moscone West. The conference schedule includes training, along with a broad range of tracks, including Streaming, Developer, Machine Learning, Spark Ecosystem, Spark Experience, Enterprise Applications and Research.

Register here for Spark Summit 2017: for 15% off the regular registration fee, use the code EDD2017.

The post Spark Summit: Ignition in the Enterprise appeared first on Silicon Valley Data Science.


Citi Features BrightPlanet in Newest White Paper

The evolution of Big Data has become a ubiquitous and crucial component for a multitude of sectors in the economy. One industry in particular that is finding success with Big Data and Deep Web searches is the investment industry. The question is: what does finding success with Big Data mean? In Searching for Alpha: Big Data, Citi […] The post Citi Features BrightPlanet in Newest White Paper appeared first on BrightPlanet.


Revolution Analytics

Running your R code on Azure with mrsdeploy

by John-Mark Agosta, data scientist manager at Microsoft

Let’s say you’ve built a model in R that is larger than you can conveniently run locally, and you want to take advantage of Azure’s resources...

Silicon Valley Data Science

Four Data Capabilities for Telecommunications

One of the big themes at Mobile World Congress this year was how the telecommunications (telco) industry can benefit from data. Telcos are increasingly looking to develop new applications on top of their data in an ongoing race to escape commoditization of their core connectivity business. While many have invested heavily in analytic technologies, they still struggle with translating the insights into actions.

This post looks at four business analysis capabilities that connect the dots between promising applications of data assets:

  • integrated customer view
  • optimization of network capex and opex
  • improved contact center productivity
  • real-time location intelligence

Integrated customer view

A data lake is an architectural approach that consolidates many disparate data sources across the enterprise into a single repository, or lake. This removes friction from an organization’s data value chain, making it easier and faster to surface insight across operational silos. Most telcos have established one or more data lakes across their organization. The anchor use case that prompted these investments is often the desire to obtain an integrated customer view.

Specifically, a data lake can be used to integrate the entire transaction and usage history of an individual, household, or corporate account across a broader range of touchpoints. Usually it is the Marketing or the Base Retention team that is tasked with building a “golden record” of each customer that joins transactional and usage data. The golden record provides a relationship and customer lifetime value view of a household. For markets with a high proportion of pre-paid customers, for instance, this integrated view provides far more insight than the prevailing SIM-centric view where a customer is literally “just a number.”

Telcos use these insights to drive so-called next-best-actions (NBAs) in their marketing campaigns to this segment. A common NBA for the prepaid segment is to accelerate “top ups,” enticing consumers to prepay for usage sooner or in larger increments. That action is very profitable because a fair share of prepaid service credits expire unused.

The insights this generates are not limited to the commercial teams. For the first time, it is possible to analyze and report network events at a “true” customer, service, network element, and device level, starting with mobile network data.

Over time, more data sources can be added to provide an exhaustive end-to-end view of all organizational touch points for a customer. Usage data from all products joins the data lake, including broadband, pay TV, and over-the-top services. All channels of interaction come into view, including customer operations, self-care, point-of-sale, and digital. In that way, the golden record that might start as a 90- or 180-degree view of a customer relationship is ultimately extended to a true 360-degree view including all behavioral data, interactions, and observations.

Optimization of network capex and opex

Telcos, even in medium-sized countries, spend hundreds of millions of dollars per year on network rollout, upgrades, and maintenance. What if you could use the insights from better analytics to defer or avoid even a small percentage of that spend? You could, for example, look at how improvements in LTE in specific neighborhoods correlate to likely business outcomes such as revenue growth, customer satisfaction, and churn.

Admittedly, that can be a daunting project. An easier starting point is to look at the operational data generated by your existing network. This is based on an analysis of typical cost categories such as site maintenance and rental, energy consumption of mast sites and base stations, personnel expenses, and equipment replacement. A basic objective of this analysis is to explore how the network operations team can “do more with less.” Since the various physical resources and associated costs often sit in separate silos and disparate systems, an analysis that is based on joined-up data sources often yields unprecedented insights.

A more advanced objective of this analysis is to relate network opex to better customer and business outcomes. For instance, data science can be used to correlate network maintenance with congestion (when, where, from what types of traffic?), perceived call quality, commercially relevant gaps in indoor and small cell coverage, and applications/users/devices exhibiting behavior anomalies.

Improved contact center productivity

For many carriers, an increasing share of support cases are related to mobile data usage and associated charges. Traditionally, contact center agents do not have granular insights into a particular customer’s data usage, and hence are unable to provide effective call resolution. Moreover, customers who are not digitally savvy may consume contact center agent attention because, for example, they are unaware of the battery drain of certain settings and mobile applications. This generates costly caseload for the carrier’s call center, preventing agents from focusing on more valuable customer interactions such as up-selling or actual network issue resolution.

Insights from data science can be used to off-load these cases to cheaper channels of engagement, such as online self-service and interactive voice response (IVR). Specifically, the insights from data science models running in production are accessed and consumed by any authorized users of the data lake. These users include business analysts, who might leverage the insights for periodic changes to the menu of options or to the script that consumers experience when calling in.

However, making these insights actionable at the front-line in near real time requires many process and systems changes. While machine learning can generate timely and effective recommendations at each step of the customer journey, most carriers lack holistic and flexible customer journey optimization systems that would allow these insights to be fed into a front-line system easily (e.g., a recommendation pops up on the user’s mobile device or on the call center agent’s screen). Hence, while data can create material business value in both network operations and customer operations longer term, the network operations domain may be more suited to agile experimentation, particularly since customer operations are often outsourced.

Real-time location intelligence

Over the last few years, many carriers have created a business unit dedicated to new ventures such as advertising, online media, classifieds, mobile payments, etc. A starting point for this capability is to visualize on a map the movement of mobile phone subscribers in a segmented way. These insights can also be used by the carrier itself to optimize its media and outdoor advertising spend. For instance, a visualization showing a retailer’s catchment area would be based on insights drawn from the following data sources: where the mobile subscriber is at home, where the mobile subscriber works [1], what segment they are part of (e.g., blue collar, high income, out-of-state visitor), what their typical commuting routes and stops are, etc. Naturally, this is served up in an aggregated way that respects the privacy laws of each jurisdiction. Retailers, advertisers, public transit agencies, and others with a keen interest in all things local will subscribe to that real-time intelligence.


In our experience, the capabilities covered here generate substantial value for a carrier’s bottom line. In a future post, we will look at the data sources needed to enable these capabilities. In the meantime, learn more about how we can help you build your own data strategy with our Data Strategy Position Paper, or by getting in touch.

1. Mobile subscriber “at home” BTS (base transceiver station) and mobile subscriber “at work” BTS, respectively.

The post Four Data Capabilities for Telecommunications appeared first on Silicon Valley Data Science.


March 21, 2017

Silicon Valley Data Science

What Are You Doing with Your Data?

The great data rush is well and truly under way. Across virtually every industry, companies large and small are committing serious money to standing up their data infrastructure, beefing up capabilities, and hunting for the value hidden in their data—but often without a clear plan. No wonder that many of the business leaders we speak with suspect their initiatives are underperforming. The complaint we hear most frequently, regardless of industry, is that technology investments aren’t generating the expected returns.

These leaders naturally want to know what their companies are doing right and where they’re missing the mark. For many, the next step is to assess the capabilities already in place. Typically, these assessments focus first on various data management and governance concerns, with most of the questions keying into how data is gathered, cleaned, validated, controlled, and protected. Important topics all, to be sure, but by concentrating on tactical and operational data tasks, organizations risk overlooking important strategic considerations.

To put it another way, benchmarking what your company is doing to data is necessary but insufficient. It’s even more important to know what your company is doing with its data to deliver returns on substantial investments in data capabilities.

Like everything else we do at Silicon Valley Data Science, our approach to data maturity focuses on generating value from data. It explicitly recognizes the central role that data and analytics play in today’s enterprises. It emphasizes business outcomes. And by zeroing in on what matters to a company, it helps lay the foundation for the changes needed to reach business objectives.

The questions that underpin our approach are designed to measure how effectively an organization has developed the people, processes, and systems that enable it to be truly data-driven. We don’t just benchmark operational capabilities, we measure how far an organization has progressed toward generating value.

Having helped clients take these soundings as part of many of our engagements, we realized the need to formalize the approach and organize it into a full-blown maturity model—one that not only helps you know where you stand but also keeps your eye on the ball. Now we’re releasing it into the wild, hoping to change the conversation on maturity. If you want to know how your company’s data capabilities stack up, we’ve posted an abbreviated version of our model for folks to try out. Stop by, answer a few questions, and see your results.

Although it won’t give you a full picture of your organization, it will help you gain a better sense of your company’s data maturity and spark productive exchanges with your colleagues. Is your organization using data just for generating reports or focusing on use cases in mission-critical areas? Or is it putting data first in every business activity? Knowing where you stand will increase your understanding of your organization’s potential, provide a baseline for measuring progress, and give you a framework for thinking about your data operations and how they might compare with the competition’s and your industry’s as a whole.

After a self assessment, you might be interested in a more extensive conversation with us. We’re happy to talk. Whatever form the conversation takes, you’ll have an opportunity to start determining whether your organization has the right capabilities to be competitive, what industry leaders are doing, and where and how to start the data transformation that positions your organization for success. It can be a big step toward a data strategy that delivers real returns on your investments in data.

The post What Are You Doing with Your Data? appeared first on Silicon Valley Data Science.

Revolution Analytics

Alteryx integrates with Microsoft R

You can now use Alteryx Designer, a self-service analytics workflow tool from Alteryx, as a drag-and-drop interface for many of the big-data statistical modeling tools included with Microsoft R....


Revolution Analytics

Give a talk about an application of R at EARL

The EARL (Enterprise Applications of R) conference is one of my favourite events to go to. As the name of the conference suggests, the focus of the conference is where the rubber of the R language...

Big Data University

This Week in Data Science (March 21, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I
    – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases
    – Learn how to unleash the power of R when working with relational databases in our newest free course.
  • Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems.

Cool Data Science Videos

The post This Week in Data Science (March 21, 2017) appeared first on BDU.


March 20, 2017

InData Labs

3 Major Problems of Artificial Intelligence Implementation into Commercial Projects

Artificial intelligence (AI) is a broad term that incorporates everything from image recognition software to robotics. The maturity level of each of these technologies strongly varies. Nevertheless, the number of innovations and breakthroughs that have brought the power and efficiency of AI into various fields including medicine, shopping, finance, news, and advertising is only growing....

The post 3 Major Problems of Artificial Intelligence Implementation into Commercial Projects first appeared on InData Labs.

Big Data University

Learn TensorFlow and Deep Learning Together and Now!

I get a lot of questions about how to learn TensorFlow and Deep Learning. I’ll often hear, “How do I start learning TensorFlow?” or “How do I start learning Deep Learning?”. My answer is, “Learn Deep Learning and TensorFlow at the same time!”. See, it’s not easy to learn one without the other. Of course, you can use other libraries like Keras or Theano, but TensorFlow is a clear favorite when it comes to libraries for deep learning. And now is the best time to start. If you haven’t noticed, there’s a huge wave of new startups or big companies adopting deep learning. Deep Learning is the hottest skill to have right now.

So let’s start from the basics. What actually is “Deep Learning” and why is it so hot in data science right now? What’s the difference between Deep Learning and traditional machine learning? Why TensorFlow? And where can you start learning?

What is Deep Learning?

Inspired by the brain, deep learning is a type of machine learning that uses neural networks to model high-level abstractions in data. The major difference between deep learning models and traditional (shallow) neural networks is that deep learning models have multiple hidden layers, which allows them to extract complex patterns from data.

How is Deep Learning different from traditional machine learning algorithms, such as Neural Networks?

Under the umbrella of Artificial Intelligence (AI), machine learning is a sub-field of algorithms that can learn on their own, including Decision Trees, Linear Regression, K-means clustering, Neural Networks, and so on. Deep Neural Networks, in particular, are super-powered Neural Networks that contain several hidden layers. With the right configuration/hyper-parameters, deep learning can achieve impressively accurate results compared to shallow Neural Networks with the same computational power.

Why is Deep Learning such a hot topic in the Data Science community?

Simply put, across many domains, such as image classification, object recognition, sequence modeling, and speech recognition, deep learning can attain much faster and more accurate results than ever before. It all started recently, too, around 2015. Three key catalysts came together to drive the popularity of deep learning:

  1. Big Data: the presence of extremely large and complex datasets;
  2. GPUs: the low cost and wide availability of GPUs made the parallel processing faster and cheaper than ever;
  3. Advances in deep learning algorithms, especially for complex pattern recognition.

These three factors resulted in the deep learning boom that we see today: self-driving cars and drones, chat bots, translation, and AI playing games. You can now see a tremendous surge in the demand for data scientists and cognitive developers. Big companies are recognizing this evolution in data-driven insights, which is why you now see IBM, Google, Apple, Tesla, and Microsoft investing a lot of money in deep learning.

What are the applications of Deep Learning?

Historically, the goal of machine learning was to move humanity towards the singularity of “General Artificial Intelligence”. But not surprisingly, this goal has been tremendously difficult to attain. So instead of trying to develop generalized AI, scientists started to develop a series of models and algorithms that excelled in specific tasks.

So, to realize the main applications of Deep Learning, it is better to briefly take a look at each of the different types of Deep Neural Networks, their main applications, and how they work.

What are the different types of Deep Neural Networks?

Convolutional Neural Networks (CNNs)

Assume that you have a dataset of images of cats and dogs, and you want to build the model that can recognize and differentiate them. Traditionally, your first step would be “feature selection”. That is, to choose the best features from your images, and then use those features in a classification algorithm (e.g., Logistic Regression or Decision Tree), resulting in a model that could predict “cat” or “dog” given an image. These chosen features could simply be the color, object edges, pixel location, or countless other features that could be extracted from the images.

Of course, the better and more effective the feature sets you find, the more accurate and efficient the image classification you can obtain. In fact, in the last two decades, there has been a lot of scientific research in image processing just about how one can find the best feature sets from images for the purposes of classification. However, as you can imagine, the process of selecting and using the best features is a tremendously time-consuming task and is often ineffective. Further, extending the features to other types of images becomes an even greater problem: the features you used to discriminate cats and dogs cannot be generalized, for example, to recognizing hand-written digits. Therefore, the importance of feature selection can’t be overstated.

Enter convolutional neural networks (CNNs). Suddenly, without you having to find or select features, CNNs find the best features for you automatically and effectively. So instead of you choosing which image features to use to classify dogs vs. cats, CNNs can automatically find those features and classify the images for you.

Convolutional Neural Network (Wikipedia)

What are the CNN applications?

CNNs have gained a lot of attention in the machine learning community over the last few years. This is due to the wide range of applications where CNNs excel, especially machine vision projects: image recognition/classification, object detection/recognition in images, digit recognition, coloring black and white images, translation of text in images, and creating art images.

Let’s look more closely at a simple problem to see how CNNs work. Consider the digit recognition problem. We would like to classify images of handwritten numbers, where the target is the digit (0,1,2,3,4,5,6,7,8,9) and the observations are the intensity and relative position of pixels. After some training, it’s possible to generate a “function” that maps inputs (the digit image) to desired outputs (the type of digit). The only question is how well this mapping works. While trying to generate this “function”, the training process continues until the model achieves a desired level of accuracy on the training data. You can learn more about this problem and the solution for it through our convolution network with hands-on notebooks.

How does it work?

A convolutional neural network (CNN) is a type of feed-forward neural network consisting of multiple layers of neurons that have learnable weights and biases. Each neuron in a layer receives some input, processes it, and optionally follows it with a non-linearity. The network has multiple kinds of layers, such as convolution, max pooling, dropout, and fully connected layers. In each layer, small collections of neurons process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, to obtain a higher-resolution representation of the original image; this is repeated for every such layer. The important point here is that CNNs are able to break complex patterns down into a series of simpler patterns through multiple layers.
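
To make the layer types above a little more concrete, here is a minimal sketch in plain Python (no TensorFlow) of a convolution, a non-linearity, and a max pool applied to a tiny image. The function names, the 5x5 image, and the hand-written edge filter are all illustrative assumptions; in a real CNN the filter weights are learned during training rather than written by hand.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """The non-linearity: negative responses are clipped to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written filter that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1],
                 [-1, 1]]

features = relu(conv2d(image, vertical_edge))
pooled = max_pool(features)
```

Each small window of the image is processed by the same filter, and the pooled result keeps only the strongest response per region, which is one way a simple pattern (an edge) becomes an input for the next layer.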

Recurrent Neural Network (RNN)

A Recurrent Neural Network tries to solve the problem of modeling temporal data. You feed the network sequential data, and it maintains the context of the data and learns the patterns in it.

What are the applications of RNN?

Yes, you can use it to model time-series data such as weather data or stocks, or sequential data such as genes. But you can also use it for other projects, for example, text processing tasks like sentiment analysis and parsing, and more generally for any language model that operates at the word or character level. Here are some interesting projects done with RNNs: speech recognition, adding sounds to silent movies, translation of text, chat bots, handwriting generation, language modeling (automatic text generation), and image captioning.

How does it work?

The Recurrent Neural Network is a specialized type of Neural Network that solves the issue of maintaining context for sequential data. RNNs are models with a simple structure and a feedback mechanism built-in. The output of a layer is added to the next input and fed back to the same layer. At each iterative step, the processing unit takes in an input and the current state of the network and produces an output and a new state that is re-fed into the network.
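
The feedback loop described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single hidden unit, not TensorFlow code; the function names and the weight values (`w_in`, `w_rec`, `bias`) are assumptions chosen only to show how the state is re-fed into the network.

```python
import math


def rnn_step(x, state, w_in, w_rec, bias):
    """One step: combine the current input with the previous state."""
    new_state = math.tanh(w_in * x + w_rec * state + bias)
    # In this simple sketch the output is the new state itself.
    return new_state, new_state


def run_rnn(sequence, w_in=0.5, w_rec=0.9, bias=0.0):
    """Feed a sequence through the unit, carrying the state between steps."""
    state = 0.0
    outputs = []
    for x in sequence:
        output, state = rnn_step(x, state, w_in, w_rec, bias)
        outputs.append(output)
    return outputs


# Only the first input is non-zero, yet later outputs are non-zero too,
# because the state carries context forward from step to step.
outputs = run_rnn([1.0, 0.0, 0.0])
```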

However, this model has some problems. It’s very computationally expensive to maintain the state for a large number of units, even more so over a long period of time. Additionally, recurrent networks are very sensitive to changes in their parameters. To solve these problems, a way was found to keep information over long periods of time and to reduce the oversensitivity to parameter changes, i.e., to make backpropagation through recurrent networks more viable. What is it? Long Short-Term Memory (LSTM).

LSTM is an abstraction of how computer memory works: you have a linear unit, which is the information cell itself, surrounded by three logistic gates responsible for maintaining the data. One gate is for inputting data into the information cell, one is for outputting data from the information cell, and the last one is to keep or forget data depending on the needs of the network.
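
As a rough sketch of the three gates, here is one step of a single-unit LSTM in plain Python. It follows the commonly used LSTM formulation (sigmoid gates plus a tanh candidate value), which is an assumption beyond what the text above spells out; the parameter dictionary `p` and all weight values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM.

    `p` maps a gate name to a (w_x, w_h, bias) tuple.
    Returns the new output h and the new cell content c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))    # how much new information to write
    f = sigmoid(pre("forget"))   # how much of the old cell content to keep
    o = sigmoid(pre("output"))   # how much of the cell to expose
    candidate = math.tanh(pre("candidate"))

    c = f * c_prev + i * candidate
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: input gate nearly closed, forget gate nearly
# open, so the cell should hold on to its previous content.
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.8, p=params)
```

With these settings the cell content stays close to its previous value of 0.8 even though the input is large, which is the "keep information over long periods" behavior the gates are meant to provide.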

If you want to practice the basics of RNN/LSTM with TensorFlow or language modeling, you can practice it here.

Restricted Boltzmann Machine (RBM)

RBMs are used to find patterns in data in an unsupervised fashion. They are shallow neural nets that learn to reconstruct data by themselves. They are very important models because they can automatically extract meaningful features from a given input, without the need to label them. RBMs might not be outstanding if you look at them as independent networks, but they are significant as building blocks of other networks, such as Deep Belief Networks.

What are the applications of RBM?

RBM is useful for unsupervised tasks such as feature extraction/learning, dimensionality reduction, pattern recognition, recommender systems (Collaborative Filtering), classification, regression, and topic modeling.

To understand the theory of RBM and application of RBM in Recommender Systems you can run these notebooks.

How does it work?

An RBM possesses only two layers: a visible input layer and a hidden layer where the features are learned. Simply put, an RBM takes the inputs and translates them into a set of numbers that represents them. Then, these numbers can be translated back to reconstruct the inputs. The RBM is trained through several forward and backward passes. The trained RBM can then reveal two things: first, the interrelationships among the input features; second, which features are the most important ones when detecting patterns.

Deep Belief Networks (DBN)

A Deep Belief Network is an advanced Multi-Layer Perceptron (MLP). It was invented to solve an old problem in traditional artificial neural networks. Which problem? Backpropagation in traditional neural networks can often get stuck in “local minima” or suffer from “vanishing gradients”. A local minimum arises when your “error surface” contains multiple grooves and you fall into a groove that is not the lowest possible groove as you perform gradient descent. Vanishing gradients arise when the error signal shrinks as it is propagated back through many layers, so the earliest layers learn very slowly.
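
The “groove” picture can be illustrated with a toy, one-dimensional error surface. The sketch below is plain Python and is not a neural network; the function `error`, the learning rate, and the starting points are assumptions chosen only because this curve has two grooves of different depth.

```python
def error(x):
    """Toy error surface with a shallow groove and a deeper groove."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Derivative of `error` with respect to x."""
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent: repeatedly step downhill from x."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


right = descend(2.0)    # settles in the shallow groove
left = descend(-2.0)    # settles in the deeper groove
```

Both runs stop where the gradient is zero, but the run that starts on the right ends at a higher error than the one that starts on the left. The starting point decides which groove you end up in, which is why a good starting point for backpropagation matters.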

What are the applications of DBN?

DBN is generally used for classification (same as traditional MLPs). One of the most important applications of DBN is image recognition. The important part here is that DBN is a very accurate discriminative classifier and we don’t need a big set of labeled data to train it; a small set works fine because feature extraction is done in an unsupervised way by a stack of RBMs.

How does it work?

DBN is similar to MLP in terms of architecture, but different in its training approach. DBNs can be divided into two major parts. The first is a stack of RBMs used to pre-train the network. The second is a feed-forward backpropagation network that further refines the results from the RBM stack. In the training process, each RBM learns the entire input. Then, the stacked RBMs can detect inherent patterns in the inputs.

DBN solves the “vanishing problem” by using this extra step, so-called pre-training. Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there.


Autoencoders

An autoencoder is an artificial neural network employed to recreate a given input. It takes a set of unlabeled inputs, encodes them, and then tries to extract the most valuable information from them. Autoencoders are used for feature extraction, learning generative models of data, and dimensionality reduction, and can be used for compression. They are very similar to RBMs but can have more than 2 layers.

What are the applications of Autoencoders?

Autoencoders are employed in some of the largest deep learning applications, especially for unsupervised tasks such as feature extraction, pattern recognition, and dimensionality reduction. As another example, say that you want to extract the emotion that a person in a photograph is feeling; Nikhil Buduma explains the utility of this type of neural network for that task very well.

How does it work?

An RBM is an example of an autoencoder, but with fewer layers. An autoencoder can be divided into two parts: the encoder and the decoder.

Let’s say that we want to classify some facial images and each image is very high dimensional (e.g., 50×40 pixels). The encoder needs to compress the representation of the input. In this case we are going to compress the face of our person, which consists of 2,000-dimensional data (50 × 40 = 2,000), to only 30 dimensions, taking some intermediate steps along the way. The decoder is a reflection of the encoder network. It works to recreate the input as closely as possible. It plays an important role during training: it forces the autoencoder to keep the most important features in the compressed representation. After training, you can apply your algorithms to the 30-dimensional representation.

Why TensorFlow? How does it work?

TensorFlow is also just a library, but an excellent one. I believe that TensorFlow’s capability to execute the code on different devices, such as CPUs and GPUs, is its superpower. This is a consequence of its specific structure. TensorFlow defines computations as graphs, and these are made of operations (also known as “ops”). So, when we work with TensorFlow, it is the same as defining a series of operations in a Graph.

To execute these operations as computations, we must launch the Graph into a Session. The session translates and passes the operations represented in the graphs to the device you want to execute them on, be it a GPU or CPU.

For example, the image below represents a graph in TensorFlow. W, x, and b are tensors over the edges of this graph. MatMul is an operation over the tensors W and x; after that, Add is called and adds the result of the previous operator to b. The resultant tensors of each operation flow into the next one until the end, where it’s possible to get the wanted result.
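
The define-then-run idea can be sketched without TensorFlow at all. The plain-Python toy below is not TensorFlow’s API: the `Node` class and the `constant`, `matmul`, `add`, and `run` helpers are hypothetical names used only to show how a graph of MatMul and Add operations is first described and then executed, with `run` standing in for what a Session does.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value


def constant(value):
    return Node("Const", value=value)


def matmul(a, b):
    return Node("MatMul", (a, b))


def add(a, b):
    return Node("Add", (a, b))


def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    if node.op == "Const":
        return node.value
    left, right = (run(n) for n in node.inputs)
    if node.op == "MatMul":
        return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
                for row in left]
    if node.op == "Add":
        return [[l + r for l, r in zip(lrow, rrow)]
                for lrow, rrow in zip(left, right)]
    raise ValueError(f"unknown op {node.op}")


# Hypothetical values for the tensors W, x, and b.
W = constant([[1, 2], [3, 4]])
x = constant([[5], [6]])
b = constant([[1], [1]])

# Building the graph computes nothing; y only describes the work.
y = add(matmul(W, x), b)

# Only when the graph is run are the operations actually executed.
result = run(y)
```

Separating the description of the computation from its execution is what lets a library decide later where each operation should run, for example on a CPU or a GPU.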

TensorFlow is really an extremely versatile library that was originally created for tasks that require heavy numerical computations. For this reason, TensorFlow is a great library for the problem of machine learning and deep neural networks.

Where should I start learning?

Again, as I mentioned first, it does not matter where to start, but I strongly suggest that you learn TensorFlow and Deep Learning together. Deep Learning with TensorFlow is a course that we created to put them together. Check it out and please let us know what you think of it.

Good luck on your journey into one of the most exciting technologies to surface in our field over the past few years.

The post Learn TensorFlow and Deep Learning Together and Now! appeared first on BDU.

Curt Monash

Cloudera’s Data Science Workbench

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is: One way to do data science is to repeatedly jump through the hoops of working...


March 18, 2017

Simplified Analytics

Numerous reasons why Digital Transformation fails

Many organizations today have realized that digital transformation is essential to their success. But many of them forget that focus of a digital transformation is not digitization or even...


March 17, 2017

Revolution Analytics

Because it's Friday: Why Memes Survive

Ever wonder why some memes (like jokes, funny pet videos, political rants) successfully propagate on the internet (or in extreme cases, "go viral"), while others just fizzle out? Richard Dawkins, who...


Revolution Analytics

Data Science at StitchFix

If you want to see a great example of how data science can inform every stage of a business process, from product concept to operations, look no further than Stitch Fix's Algorithms Tour. Scroll down...



Aric Boschee: Maximizing Deep Web Technologies as a Senior Software Engineer

Growing up in Woonsocket, SD, Aric Boschee was very involved. He excelled at math and computer science, while playing a variety of sports, including playing collegiate baseball and football. His current hobbies include golfing, bowling, fishing, and pheasant hunting. After graduating from Dakota State University in 2004, Aric was on the job hunt for developer positions. […] The post Aric Boschee: Maximizing Deep Web Technologies as a Senior Software Engineer appeared first on...

Revolution Analytics

Book Review: Testing R Code

When it comes to getting things right in data science, most of the focus goes to the data and the statistical methodology used. But when a misplaced parenthesis can throw off your results entirely,...


March 16, 2017

Silicon Valley Data Science

The Data Platform Puzzle

Editor’s Note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. We’re just finishing up at Strata + Hadoop World in San Jose this week, where we discussed data platforms and more.

Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. Different people may answer each question a different way, including the dreaded response of “well, it depends.” Truthfully, it does depend. But that doesn’t mean you have to guess and use your gut. At SVDS, we believe that you can safely navigate the waters of building a data platform by knowing beforehand what your priorities are, and what specific outcomes you desire.

Examining Your Data

You must understand your data before you can build anything, but many organizations underestimate the importance of this. Failing to understand your data can mean the difference between choosing to embark on a project that you can accomplish, and taking on a project that is in no way tractable. Take these steps to make sure you’re fully informed.

1. Ask yourself if you have the right data.
If you find that you don’t, then start identifying potential sources of the data you need. You may already have them within your enterprise. For example, what if you want to find out how long customers are spending on different regions of various pages of your website, but all you have are HTTP access logs? Access logs contain only some of the information you need, not all of it. You would probably have to add client-side instrumentation to collect the rest of the data.

2. Contemplate how your data will contribute to solving your business problems. For example, will you use it to generate monthly reports that help you decide where to direct quarterly resources? Or, will you use it to construct a dashboard to help direct your responses to social media? The differentiation here is between real-time and batch-oriented data platforms. Because the outcomes are different, each is built differently.

3. Think about how important your data is to the strategic interests of your company. Do you want to use it to drive decision making, or are you happy to let it confirm decisions that have already been made? Could you use it to construct a feedback loop that would let you observe and decide? If data is core to your strategic vision, you probably want to think about ways of making it secure. This is difficult in the currently evolving landscape of big data technologies, as many open source technologies are not secure out of the box.

Understanding your data is a key component to having a comprehensive data strategy. This strategy will drive what you do with your data. Further, you must recognize that time and resource constraints will limit the number of things you can do with your data in a reasonable amount of time. Focus on those projects that are most important or will bring you the most value.

Design With Specific Outcomes in Mind

It would be easy to construct a list of features you would want in an ideal data platform, and then set out to build or buy it. Many enterprises are guilty of this practice. Unfortunately, this “checklist of features” mentality gives way to bloated applications with features that are used infrequently. Additionally, complicated platforms are expensive to maintain in terms of training and support.

Focus instead on assembling a platform that will deliver the outcomes you desire. It does not matter if you are building or buying: well-defined business needs should drive your technical decisions. I have outlined a few of these needs below. Note that this is not an exhaustive list.

Search Performance

Big data newcomers sometimes have difficulty understanding the distinction between a traditional data store (NoSQL or traditional RDBMS) and a searchable index, thinking they are interchangeable. This mistake becomes obvious when you try conducting full-text searches on a key-value store or relational database. It doesn’t work because every database record must be scanned. Although you can largely get away with the opposite — using a full-text index as a non-relational database — you want to make sure to use the right tool for the job.

If search performance is something you desire, be sure to include a full text index in your platform. Obvious choices these days are Apache Solr and Elasticsearch, both of which can be distributed across servers.
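
The distinction between scanning a data store and querying a searchable index can be sketched in plain Python. This is a toy inverted index, not how Apache Solr or Elasticsearch are implemented; the documents and function names are hypothetical and only illustrate why one approach reads every record and the other does not.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def scan_search(docs, word):
    """What a plain data store must do: read every record."""
    return {doc_id for doc_id, text in docs.items()
            if word in text.lower().split()}


def index_search(index, word):
    """What a full-text index does: a single lookup by word."""
    return index.get(word, set())


# Hypothetical records.
docs = {
    1: "Solr is a search server",
    2: "HDFS stores files",
    3: "Elasticsearch is a search engine",
}

index = build_index(docs)
scanned = scan_search(docs, "search")
looked_up = index_search(index, "search")
```

Both functions return the same matches, but the scan touches every record on every query, while the index pays that cost once when it is built and answers later queries with a lookup.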

Data Lifecycle

Data can take on a life of its own. During different phases it is written, read, transformed, aggregated, and maybe eventually deleted. Data that is rarely read can be stored cheaply using “cold storage” systems. The trade-off here usually plays out in terms of very slow read performance. Because of this, you should make sure that the data is archived in a format that requires only simple transforms or no transforms at all.

Some data is always “hot,” meaning it needs to stay in memory for fast access. A few storage systems excel at this, but they will obviously require significantly more memory on the hardware they run on. On the other hand, if you plan to do a lot of data transformation (e.g., computing percentiles and other statistics), you will need to make sure that the machines running the transformations have enough CPU power.

Many organizations fail to see far enough into the future to plan for the eventual demise of their data. Fearful of throwing any of it away, they keep every last byte, which contributes to clogging up data pipelines. It may be wise to consider moving older data into cold storage so that the processing pipeline (which operates on more current data) runs unencumbered and with good performance.

Integration and Exposing Data

Oftentimes, big data platforms exist to feed data into other systems (e.g., reporting, EDW). You want to make sure that your data platform can be easily integrated into these external systems. Data can be exposed through the file system layer (networked drive or HDFS), or via an API. If your data is exposed through an API you want to make sure ahead of time that the client software you will be using to access it is high quality and well supported for your particular language or development environment.

Since APIs are exposed over network interfaces, security must be taken into consideration. Please make sure your APIs are sufficiently locked down.

Final Considerations

With whatever platform you end up with, it is important that you retain the ability to iterate and experiment as you construct data pipelines and applications. Any modern platform will be expected to stay online for the next 5+ years, but with a scale and scope more expansive than the previous generation of data systems. Your platform should support both investigative work and your production workloads as the needs of your enterprise change. We’re interested in how the community is tackling this problem — please share your experiences in the comments.

The post The Data Platform Puzzle appeared first on Silicon Valley Data Science.


How Marketers are using Machine Learning to cross-sell and up-sell

McDonalds mastered the upsell with one simple question at the time of purchase: “You want fries with that?” A simple and relevant question at the right time that has likely generated millions of extra dollars in revenue through the years for the company. Ever since then, companies have tried to emulate their success by identifying complementary products in their offering and training sales staff to ask customers the right question at the right time.

Teradata ANZ

Here’s some data. Now amaze me, data scientist!

— Or why discovering insight is often inevitable.

Sure, give me your data, and there is a good chance I can “wow you”. Why am I so confident? Because I am an astronomer by trade, and I believe that the path to discovery in astronomy, and discovery in data science, share some fundamental underlying principles.

Major discoveries in astronomy (and many other branches of science) often occur when a previously unexplored area of observational parameter space is opened up by new instruments, or new ways of analysing data.

What do I mean by “observational parameter space”?

Let me give you an example: When Galileo first turned a telescope to the sky, he was seeing the universe in a way it had never been seen before, and in doing so, he made arguably some of the most amazing and important discoveries in the history of humankind.


This is what it means to open new areas of observational parameter space – to move beyond the current limitations of data quality, precision, or type of information that is available. That is, to gain visibility to things that were previously invisible. The power of bringing new data, improved data or new analysis to bear for the purpose of discovery is manifest in the history of astronomy.

There are countless stories, too numerous to list here, of major unexpected discoveries that came about simply through recording new types of data (for example the discovery of Gamma ray bursts), analysing data in new ways (for example the discovery of pulsars), or by combining different types of data for the first time (for example the discovery of quasars).

In the world of data science, the situation is no different. When an organisation or government department records new types of data, or enables the combination of different types of data, or significantly improves the accuracy and reliability of existing data, there is a very high probability that new insights will be uncovered. This is simply the result of gaining visibility to things that were previously invisible. Astronomers are so confident in this path to discovery, that it often drives the design and construction of new telescopes.

However, data is a necessary but not sufficient condition for insight discovery: there are other key ingredients.

Discovery happens when data meets the prepared mind: there is no magic algorithm that will sift through the data and provide all the useful insights on a plate. Ultimately, there is no substitute for a deep knowledge of the business problems, and deep knowledge of the data.

An excellent example of precisely this point in the annals of astronomy is the Nobel Prize winning discovery of the cosmic microwave background – the fossil light left over from the big bang. The two researchers who discovered this fossil light thought it was just an annoying source of noise that was hindering their research, and they tried their hardest to avoid seeing it. It was a nearby research group that was able to interpret the “annoying noise in the data” as the smoking gun of the big bang, which ultimately led to the Nobel Prize winning discovery.

The moral of the story: developing intuition, understanding the business, and understanding the data, are of utmost importance. Without it, you may miss that “Nobel Prize winning” insight, no matter how ground-breaking your data!

The post Here’s some data. Now amaze me, data scientist! appeared first on International Blog.


March 15, 2017

Revolution Analytics

Neural Networks: How they work, and how to train them in R

With the current focus on deep learning, neural networks are all the rage again. (Neural networks have been described for more than 60 years, but it wasn't until the power of modern computing...


Revolution Analytics

Benchmarking rxNeuralNet for OCR

The MicrosoftML package introduced with Microsoft R Server 9.0 added several new functions for high-performance machine learning, including rxNeuralNet. Tomaz Kastrun recently applied rxNeuralNet to...

Silicon Valley Data Science

Models: From the Lab to the Factory

Editor’s note: Mauricio (as well as other members of SVDS) will be speaking at TDWI Accelerate in Boston. Find more information, and sign up to receive our slides here.

In our industry, much focus is placed on developing analytical models to answer key business questions and predict customer behavior. However, what happens when data scientists are done developing their model and need to deploy it so that it can be used by the larger organization?

Deploying a model without a rigorous process in place has consequences—take a look at the following example in financial services.

With its high-frequency trading algorithms Knight was the largest trader in U.S. equities, with a market share of 17.3% on NYSE and 16.9% on NASDAQ. Due to a computer trading “glitch” in 2012, it took a $440M loss in less than an hour. The company was acquired by the end of the year. This illustrates the perils of deploying models to production that are not properly tested and the impact of the bugs that could sneak through.

In this post, we’ll go over techniques to avoid these scenarios through the process of model management and deployment. Here are some of the questions we have to tackle when we want to deploy models to production:

  • How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
  • Can the model run automatically without issues and how does it recover from failure?
  • What happens if the model becomes stale because it was trained on historical data that is no longer relevant?
  • How do you deploy and manage new versions of that model without breaking downstream consumers?

It helps to see data science development and deployment as two distinct processes that are part of a larger model life cycle workflow. The example diagram below illustrates what this process looks like.

Data science development and deployment image

  1. We have end-users interacting with an application, creating data that gets stored in the app’s online production data repository.
  2. This data is later fed to an offline historical data repository (like Hadoop or S3) so that it can be analyzed by data scientists to understand how users are interacting with the app. It can also be used, for example, to build a model to cluster the users into segments based on their behavior in the application so we can market to them with this information.
  3. Once a model has been developed, we’ll want to register it in a model registry to allow a governance process to take place where a model is reviewed and approved for production use, and requirements can be assessed for deployment.
  4. Once the model has been approved for production use, we need to deploy it. To do this, we need to understand how the model is consumed in the organization and make changes to support this, ensure the model can run end to end automatically within specified performance constraints, and that there are tests in place to ensure that the model deployed is the same as the model developed. Once these steps are done, the model is reviewed and approved again prior to going live.
  5. Finally, once the model is deployed, the predictions from this model are served to the application where metrics on the predictions can be collected based on user interaction. This information can serve to improve the model or to ask a new business question which brings us back to (2).

In order to make the life cycle successful, it is important to understand that data science development and deployment have different requirements that need to be satisfied. This is why you need a lab and a factory.

The lab

The data lab is an environment for exploration for data scientists, divorced from the application’s production concerns. We may have an eventual end goal of being able to use data to drive decision making within the organization, but before we can get there, we need to understand which hypotheses make sense for our organization and prove out their value. Thus we are mainly focused on creating an environment (“the lab”) where data scientists can ask questions, build models, and experiment with data.

This process is largely iterative, as shown in the diagram below based on the CRISP-DM model.

Data mining process

We will not go into too much detail in this post, but we do have a tutorial that goes in depth on this topic. If you’d like to download the slides for that tutorial, you can do so on our Enterprise Data World 2017 page.

What concerns us here is that we need a lab to enable exploration and development of models, but we also need a factory when we need to deploy that model to apply it to live data automatically, deliver results to the appropriate consumers within defined constraints, and monitor the process in case of failure or anomalies.

The factory

In the factory we want to optimize value creation and reduce cost, valuing stability and structure over flexibility to ensure results are delivered to the right consumers within defined constraints and failures are monitored and managed. We need to provide a structure to the model so that we can have expectations on its behavior in production.

To understand the factory, we’ll look at how models can be managed via the model registry and what to consider when undergoing deployment.

Model registry

To provide a structure to the model, we define it based on its components—data dependencies, scripts, configurations, and documentation. Additionally, we capture metadata on the model and its versions to provide additional business context and model-specific information. By providing a structure to the model, we can then keep inventory of our models in the model registry, including different model versions and associated results which are fed by the execution process. The diagram below illustrates this concept.

Model structure diagram

From the registry, we can:

  • Understand and control which version of a model is in production.
  • Review the release notes that explain what has changed in a particular version.
  • Review the assets and documentation associated with the model, useful when needing to create a new version of an existing model or perform maintenance.
  • Monitor performance metrics on model execution and what processes are consuming it. This information is provided by the model execution process, which sends metrics back to the registry.

You can also choose to include a Jupyter Notebook with a model version. This allows a reviewer or developer to walk through the thought process and assumptions made by the original developer of the model version. This helps support model maintenance and discovery for an organization.

Here is a matrix decomposing the different elements of a model from our work in the field:

Versioning Dimensions Table

The registry needs to capture the associations between data, scripts, trained model objects, and documentation for particular versions of a model, as illustrated in the figure.
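To make the idea of a registry entry concrete, here is a minimal in-memory sketch. The class names, field names, and status values are illustrative assumptions, not the architecture described in this post; a real registry would persist these records and tie status changes to the governance workflow.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    """One version of a model plus the assets needed to reproduce and run it."""
    version: str
    data_dependencies: List[str]
    scripts: List[str]
    config: Dict[str, str]
    documentation: str
    release_notes: str = ""
    # Hypothetical states: registered, approved, production, retired, deprecated
    status: str = "registered"


class ModelRegistry:
    """In-memory stand-in; a real registry would store these records durably."""

    def __init__(self) -> None:
        self._models: Dict[str, Dict[str, ModelVersion]] = {}

    def register(self, name: str, model_version: ModelVersion) -> None:
        versions = self._models.setdefault(name, {})
        if model_version.version in versions:
            raise ValueError(f"{name} {model_version.version} is already registered")
        versions[model_version.version] = model_version

    def get(self, name: str, version: str) -> ModelVersion:
        return self._models[name][version]

    def approve(self, name: str, version: str) -> None:
        self.get(name, version).status = "approved"

    def promote(self, name: str, version: str) -> None:
        """Move an approved version to production and retire the previous one."""
        target = self.get(name, version)
        if target.status != "approved":
            raise ValueError("only approved versions can go to production")
        for other in self._models[name].values():
            if other.status == "production":
                other.status = "retired"
        target.status = "production"

    def deprecate(self, name: str, version: str) -> None:
        """Mark a version as one that should no longer be used."""
        self.get(name, version).status = "deprecated"

    def production_version(self, name: str) -> Optional[str]:
        for model_version in self._models.get(name, {}).values():
            if model_version.status == "production":
                return model_version.version
        return None
```

Keeping the status on the version record is one simple way to answer "which version is in production?" with a single lookup; execution metrics could be attached to the same record.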

What does this give us in practice?

  • By collecting the required assets and metadata on how to run a model, we can drive an execution workflow, which will apply a particular version of that model to live data to generate predictions to the end user. If it’s part of a batch process, we can use this information to create transient execution environments for the model to pull the data, pull the scripts, run the model, store the results in object storage, and spin down the environment when the process is complete, maximizing resources and minimizing cost.
  • From a governance perspective, we can support business workflows that decide when models get pushed to production and allow for ongoing model monitoring to, for example, make decisions as to whether the model should be retrained if we’ve identified that the predictions are no longer in line with the actuals. If you have auditing requirements, you may need to explain how you produced a particular result for a customer. To do this, you would need to be able to track the specific version of the model that was run at a given time and what data was used for it to be able to reproduce the result.
  • If we detect a bug in an existing model, we can mark that version as “should not be used” and publish a new version of the model with the fix. By notifying all of the consumers of the buggy model, they can transition to using the fixed version of the model.

In the absence of these steps, we run several risks: model maintenance becomes a challenging process of trying to understand the intentions of the original developer, models deployed to production no longer match those in development and produce incorrect results, and downstream consumers are disrupted when an existing model is updated.

Model deployment

Once a model has been approved for deployment, we need to go through steps to ensure the model can be deployed. There should be tests in place to verify correctness. The pipeline of extracting raw data, feature generation, and model scoring should be analyzed to make sure that model execution can run automatically, exposes results in the way needed by consumers, and meets the performance requirements defined by the business. We should also ensure that model execution is monitored in case of errors or in case a model has gone stale and is no longer producing accurate results.

Prior to deployment, we want to ensure that we test the following:

  • Whether the model deployed matches the expectations of the original model developer. By using a test input set identified during development which produces validated results, we can verify that the code being deployed is matching what was expected during development. We illustrated the need for this in a previous blog post.
  • Whether the model being deployed is robust to a variety of different inputs, testing extremes of those inputs and potentially missing values due to data quality issues. The model should have controls in place to prevent these inputs from crashing the model and, in effect, affecting its downstream consumers.
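The two checks above can be sketched as small pre-deployment tests. Everything here is hypothetical: `score` is a toy stand-in for the deployed model, and the reference cases are made-up values standing in for a validated test set recorded during development.

```python
import math
from typing import Mapping, Optional


def score(features: Mapping[str, Optional[float]]) -> float:
    """Toy stand-in for a deployed model: a linear score clipped to [0, 1]."""
    weights = {"age": 0.01, "visits": 0.05}
    total = 0.0
    for name, weight in weights.items():
        value = features.get(name)
        # Control for bad inputs: treat missing or non-finite values as 0.
        if value is None or not math.isfinite(value):
            value = 0.0
        total += weight * value
    return min(1.0, max(0.0, total))


# Inputs with validated outputs captured during development (made-up values).
REFERENCE_CASES = [
    ({"age": 30.0, "visits": 4.0}, 0.5),
    ({"age": 0.0, "visits": 0.0}, 0.0),
]

# Extreme and missing inputs the deployed model must survive.
EDGE_CASES = [
    {},
    {"age": None, "visits": 2.0},
    {"age": float("nan"), "visits": float("inf")},
    {"age": 1e12, "visits": -1e12},
]


def matches_reference(tolerance: float = 1e-9) -> bool:
    """True when the deployed code reproduces the development-time results."""
    return all(abs(score(x) - expected) <= tolerance for x, expected in REFERENCE_CASES)


def robust_to_edge_cases() -> bool:
    """True when every edge case yields a valid score instead of crashing."""
    return all(0.0 <= score(x) <= 1.0 for x in EDGE_CASES)
```

In practice these would run in the deployment pipeline, with a failure of either check blocking the release.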

We minimize the risk that the model deployed does not match the model developed by 1) running tests prior to deploying the model to production, and 2) capturing the environment specifics such as specific language versions and library dependencies for the model (e.g. a Python requirements.txt file).
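As one possible way to capture environment specifics, the sketch below records the interpreter version and the installed packages in requirements.txt style using only the standard library. The function name and the returned dictionary layout are assumptions made for illustration; running `pip freeze` is a common alternative.

```python
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Return the Python version and pinned versions of installed packages."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with incomplete metadata
            pins.add(f"{name}=={dist.version}")
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "requirements": sorted(pins),
    }
```

Storing this snapshot alongside a model version makes it possible to compare the development environment with the production environment before deployment.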

Once deployed in production, we want to expose the predictions of the model to consumers. How many users will be consuming this model prediction? How quickly must the feature data be available when scoring the model? For example, in the case of fraud detection, if features are generated every 24 hours, there may be that much lag between when the event happened and when the fraud detection model detected the event. These are some of the scalability and performance questions that need to be answered.

In the case of an application, ideally, we want to expose the results of the model via a web service, either via real-time scoring of the model or by exposing scores that were produced offline via a batch process. Alternatively, the model may need to support a business process and we need to place the results of the model in a location where a report can be created for decision makers to act on these results. In either case, without a model registry, it can be challenging to understand where to find and consume the results of a current model running in production.

Another use case is wanting to understand how the model is performing against live data in order to see whether the model has gone stale or whether a newly developed model outperforms the old one. An easy example of this is a regression model where we can compare the predicted vs actual values. If we do not monitor the results of a model over time, we may be making decisions based on historical data that is no longer applicable to the current situation.
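For the regression example, a staleness check can be as simple as comparing the error on recent live data with the error measured during development. This is a sketch under assumptions: the function names, the use of mean absolute error, and the 1.5x tolerance are illustrative choices, not a recommended threshold.

```python
from statistics import mean
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def is_stale(
    predicted: Sequence[float],
    actual: Sequence[float],
    baseline_mae: float,
    tolerance: float = 1.5,
) -> bool:
    """Flag the model when live error exceeds tolerance times the baseline error."""
    return mean_absolute_error(predicted, actual) > tolerance * baseline_mae
```

The same comparison can be run for a candidate model on the same live data to judge whether it outperforms the version in production.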


In this post, we walked through the model life cycle, and discussed the needs of the lab and the factory, with the intent to reduce the risk of deploying “bad” models that could impact business decisions (and potentially incur a large cost). In addition, the registry provides transparency and discoverability of models in the organization. This facilitates new model development by exposing existing techniques used in the organization to solve similar problems, and facilitates existing model maintenance or enhancements by making clear which model version is currently in production, what its associated assets are, and how to publish a new model version for consumers.

Here at SVDS, we are developing an architecture that supports onboarding of models and their versions into a registry, and manages how those model versions are deployed to an execution engine. If you’d like to hear more about this work, please reach out.


The post Models: From the Lab to the Factory appeared first on Silicon Valley Data Science.

Ronald van Loon

The GDPR: 5 Questions Data-Driven Companies Should Ask to Manage Risks and Reputation

Data is rapidly becoming the lifeblood of the global economy. In the world of Big Data and artificial intelligence, data represents a new type of economic asset that can offer companies a decisive competitive advantage, as well as damage the reputation and bottom-line of those that remain unsuccessful at ensuring the security and confidentiality of critical corporate and customer data.

Despite the severe repercussions of compromised data security, until recently, the fines for breach of data protection regulations were limited and enforcement actions infrequent. However, the introduction of a potentially revolutionary European General Data Protection Regulation (GDPR) is likely to transform the way data-driven companies handle customer data by exposing them to the risk of hefty fines and severe penalties in the event of non-compliance or a data breach.

In this article, I have tried to summarise the implications of GDPR implementation for data-driven companies, as well as the measures businesses can take to ensure the security and privacy of clients’ data and avoid the penalties associated with non-compliance.

How Does GDPR Impact Data-Driven Organisations? 

The General Data Protection Regulation (GDPR) stands out from all existing regulations because of its breadth of client data protection. From conditions on cross-border data transfer to the need to implement, review, and update adequate technical and organisational measures to protect customer data, the GDPR introduces several new legislative requirements that will significantly impact the way businesses collect, manage, protect, and share both structured and unstructured data. I have described a few of the most important ones below.

  • Valid and Verifiable Consents — It can be argued that the GDPR is all about consent: it protects European citizens by giving them the means to object to, or give permission for, the processing of their personal data. The GDPR sets out stringent new requirements for obtaining consent from customers for the processing of personal data. According to the new legislation, companies should make withdrawing consent as easy as giving it. Furthermore, the consent should be explicit and well informed, with full transparency on the intended purpose and use.
  • Data Protection by Design and Default — Up until now, businesses were required to take technical and organisational measures to protect personal data. But implementation of the GDPR will require companies to demonstrate that the data protection measures are continuously reviewed and updated.
  • Data Protection Impact Assessment (DPIA) — DPIAs are used by organisations to identify, understand, and mitigate any risks that might arise when developing new solutions or undertaking new activities that involve the processing of customer data, such as data analytics and all data-driven applications, including BI, data warehouses, data lakes, and marketing applications. The GDPR makes it mandatory to conduct a DPIA where processing is likely to result in a high risk to individuals, and to consult the data protection supervisory authority if the assessment shows a high risk that is not mitigated.

What are the Possible Consequences of Non-Compliance?

The GDPR subjects data controllers and processors that fail to comply with its requirements to severe consequences. These consequences, contrary to what most people believe, are not just limited to monetary penalties. Instead, they can potentially damage a business’s reputation and bottom-line. There are three factors that together make the GDPR the most stringent regulation in the European data protection regime.

  • Reputational Risk — The reputational risks of any data breach are always severe. However, the GDPR’s obligation to notify authorities of data breaches is likely to result in increased enforcement activity. This will consequently bring data protection breaches to light, compromising a company’s market position and reputation.
  • Geographic Risk — All organisations offering goods or services to EU markets or monitoring the behaviour of EU citizens are subject to the GDPR. This includes all data analytics companies as well.
  • Huge Fines — Failure to comply with the new regulations will lead to significant fines of up to 20 million EUR or 4 percent of the company’s global turnover, whichever is higher.

To avoid the huge fines and severe penalties, businesses need to have complete and mature data governance in place. From revising the existing contracts in place to getting a buy in from the key people in organisations, businesses will be required to review their entire data process management approach in order to become compliant and mitigate reputational and financial risks.

5 Questions to Address and Mitigate the Risk of Non-Compliance

1. How can I minimise risks and protect my business’s reputation?

Taking the following measures can help you ensure your compliance to the new data protection legislation.

Define Personal Client Data — Document what types of personal data your company processes, where it came from, and who you share it with. For example, if you hold inaccurate personal data and have shared it with another organisation, you won’t be able to identify the inaccuracy and report it to your business partner unless you know what personal data you hold. Therefore, begin with a thorough review of your existing database.

Manage Data Streams and Processes — Develop a roadmap to determine your sources for data input, data processing tools, techniques, and methodologies that you use, and how the data you hold is shared with other businesses. Once you have listed all the inputs and outputs, evaluate their compliance to the new regulations, and take adequate measures to ensure good data governance.

Designate a Data Protection Officer — Designate a Data Protection Officer who has the knowledge, support, and authority to assess and mitigate non-compliance risks.

Ensure Swift Response to Withdrawal Requests — Respond to customers’ consent withdrawal requests efficiently, and update the system to flag that the user has withdrawn consent so that no further direct marketing takes place.

2. How can my business protect personal data? 

The new data protection regulations apply to data that allow direct or indirect identification of an individual by anyone. As a result, cookie IDs, online identifiers, device identifiers, and IP addresses are categorised as personal data under the GDPR. To ensure the security and confidentiality of the newly defined categories of personal data, businesses can use the following measures:

Adopt a Protection by Design Approach — There are certain ‘protection by design’ techniques that businesses can use to protect the personal data of their customers. These include:

  • Pseudonymisation — Pseudonymisation (such as encryption, tokenisation, hashing) is a technique that involves categorisation of the personal data of customers into two types in such a manner that one type can no longer be attributed to an individual unless accompanied by the second type of information which is kept separately and is subject to various data protection measures.
  • Data Minimisation — As the name implies, data minimisation is about ensuring that only the data that’s necessary for a specific purpose is processed, used, or stored.
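As an illustration of the two techniques, the sketch below replaces a direct identifier with a keyed hash (one of several possible pseudonymisation approaches) and keeps only the fields needed for a stated purpose. The function names, field names, and key handling are hypothetical; in practice the key must be stored separately from the data and protected, and whether a given scheme is adequate is a legal and security question this sketch does not settle.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    Without the separately stored key, the token cannot be recomputed
    from a known identifier, which keeps the two kinds of data apart.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record: Dict[str, str], allowed_fields: Iterable[str]) -> Dict[str, str]:
    """Keep only the fields required for the specific purpose."""
    allowed = set(allowed_fields)
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical usage: an analytics copy of a customer record.
key = b"example-key-stored-separately"  # placeholder, not a real secret
customer = {
    "email": "jane@example.com",
    "ip_address": "203.0.113.7",
    "country": "NL",
    "plan": "premium",
}
analytics_record = minimise(customer, allowed_fields=["country", "plan"])
analytics_record["customer_token"] = pseudonymise(customer["email"], key)
```

The resulting record can still be joined on `customer_token` across datasets, but it no longer carries the email or IP address.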

3. How can my company implement technical infrastructure that will ensure optimal governance of client data? 

GDPR not only requires businesses to implement a well-built and foolproof infrastructure to collect, store, and process data, but also directs them to continuously review and update the infrastructure. Here are a few ways businesses can ensure their compliance to these new legislations.

Align Data & Analytics Strategy with Policies — Businesses should focus on developing a data and analytics infrastructure that’s CONTROLLED, PORTABLE, and COMPLIANT. To ensure this, data collection should be purpose driven, i.e. only data that is required to fulfill a specific requirement or purpose should be collected and processed. Data collection should be compliant. Customers should be given the right to object to data collection and processing for direct marketing purposes. Data collected with the consent of clients should be kept in self-controlled storage and processed according to all applicable data protection regulations.


Manage Data Lineage — Certain data governance solutions offered by leading tech companies can help businesses streamline their data handling processes, exercise greater control, and get improved visibility throughout the data lifecycle. They help businesses adopt a standardised approach to discovering their IT assets and define a common business language to ensure optimal policy and metadata management, create a searchable catalogue of information assets, and develop a point of access and control for data stewardship tasks.

4. How can my business uphold these new regulations and define client data collection and storage?

To enhance the compliance of their client data collection and storage processes, businesses should seek assurance from a data protection officer who can inform and advise the business about its obligations pursuant to the regulation, monitor the implementation and application of adequate data protection policies, and ensure optimal training of staff involved in data collection and processing operations. In addition to this, designating a data protection officer can also help businesses monitor their incoming data streams and how they should be treated.

5. How can my business handle different types of data streams?

To ensure their compliance to the GDPR and avoid the severe consequences of non-compliance, businesses are not only required to ensure optimal control and privacy of static batch data, but also develop means to collect, categorise, and process data provided by high-speed data streams. Data stream management software is a viable solution to this challenge. A data stream manager allows businesses to:

  • Collect and distribute data in a private and compliant way
  • Reduce costs and complexity in data life cycle management
  • Have real-time access to all structured and unstructured data via the cloud or on premise
  • Centralise all data sources for improved visibility and control
  • Develop a controlled environment for data-driven operations

With a data stream manager, Data Protection Officers can define privacy levels, manage user rights, get an insight into how their info is being collected or used, and more.

Manage Data Streams by Data Protection Officers 

Many of the GDPR’s principles are much the same as the current data protection regulations. Therefore, if your business is operating in compliance with the current law, you can use your current approach to data protection as a starting point to build a new, more robust and secure GDPR-compliant data protection infrastructure.

To learn more about GDPR compliance, subscribe to the educational webinar, hosted by BrightTalk and presented by Ronald van Loon.



Janus de Visser  

Janus is Data Privacy Officer and Data Governance Consultant at Adversitement. Feel free to connect with Janus on LinkedIn to learn more about GDPR, Data Governance, Risk & Reputation.



Ronald van Loon

If you would like to read Ronald van Loon’s future posts, please click ‘Follow’, and feel free to also connect on LinkedIn and Twitter to learn more about the possibilities of IoT and Big Data.



Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post The GDPR: 5 Questions Data-Driven Companies Should Ask to Manage Risks and Reputation appeared first on Ronald van Loons.

Revolution Analytics

AUC Meets the Wilcoxon-Mann-Whitney U-Statistic

by Bob Horton, Senior Data Scientist, Microsoft The area under an ROC curve (AUC) is commonly used in machine learning to summarize the performance of a predictive model with a single value. But you...


March 14, 2017

Big Data University

This Week in Data Science (March 14, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I
    – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases
    – Learn how to unleash the power of R when working with relational databases in our newest free course.
  • Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems.

Cool Data Science Videos

The post This Week in Data Science (March 14, 2017) appeared first on BDU.

Silicon Valley Data Science

Building Tech Communities

In a previous post, I shared the first part of my interview with Travis Oliphant, co-founder of Continuum Analytics, where I asked him what excites him about the future of tech and we discussed how much of that future comes down to people and culture. In this installment, Travis talks about how to balance enterprise and open source, as well as what it takes to build a community.

Can you talk about any challenges you’ve faced while balancing the enterprise and open source parts of your business?

They work beautifully together as long as you understand their roles. Yes, it can be difficult— mostly education-wise—there is misinformation and misunderstanding in multiple places. There are some people who are kind of very steeped in the wool of “all commercial enterprise is bad and nobody should pay anything for anything.” There are very extreme views like that. And so they take any effort to sell anything as an affront. Fortunately, it is a very small percentage of people who have that extreme view. In the Python community it is even smaller. The Python and R communities have largely built some antibodies against that kind of extreme reaction. These communities are much more enterprise-friendly as long as there is a lot of giving back to the community.

There is a fundamental gap of understanding in the minds of some. People who are just in their niche don’t realize that there are a bunch of people in the world with problems that they need solved. The market is this amazing concept where people can voluntarily pay for solutions to their problems. Do you want people to solve real problems and provide jobs for others? Then you want a market to exist. There can be a real misunderstanding of markets and how important the markets are to progress and peace.

A big reason for this blind-spot in the software world is the anemic nature of markets where you have deep proprietary software stacks where network effects have created significant lock-in. This creates the anti-pattern of siloization and feeling trapped by your vendor, which leads to a reactionary response that goes too far the other way: some believe they need to write everything themselves and shouldn’t pay a vendor for anything. Some believe that either it is open-source, you write it yourself, or you do nothing. Doing nothing means you don’t get your pain solved, which helps no one. Writing it yourself typically means you spend too much and don’t take advantage of the cost-savings available by sharing costs across a number of participants. Having everything be open-source may not work either because while many things will be produced as open-source, there will always be needs that are not interesting to the volunteer labor that is behind open-source. It is not economical for you, as a person building tractors, making toothpaste, or making loans, to build the entire software stack needed on top of open source.

Now, what open source has done is said, well, actually there is a fundamental layer that will be open and shared across everybody, and it will become public domain so to speak—something that anybody can use. Great. That’s fantastic. That has made the world a better place because it decreases the initiation cost. However, some people believe that is all that is necessary and then everybody builds from there. But the problem there is you still have that same shared-cost problem. What about the shared stuff in between—that open source hasn’t built yet, may never build—who pays for that? Does every company have to repay and pay the same thing over and over for that too? No. That’s not economical either.

This is the modern open source spender, basically. And it is an emergent concept that many people are getting and starting to show up. For us, Anaconda Enterprise is that thin layer on top of the Anaconda open source. The enterprise customer makes a lot of sense. As a company, open source is life blood. So you have just actually connected a meaningful commercial transaction with supporting open source. And whenever you can do that, you are guaranteeing a continuation of open source. So it is actually in the best interest of fixing the well-documented sustainability problem of open-source.

You hear people talk about it all the time—the sustainability of open source. It has been popular for people to complain about the fact that projects don’t get enough funding and what not. I know that. I live that. And I am a poster child of that. It’s one of the reasons that we started NumFOCUS and started Continuum. At the heart of both organizations is sustainability of open source, because it has been on my mind since I was a poor grad student with three kids and a mortgage writing what would become SciPy. Continuum makes things sustainable by having commercial transactions that are supported by open source with margins from that business going directly to support open source. People’s pain points and problems are solved with commercial software using open source, and therefore they pay money, and therefore you have a sustainable story that supports open source.

We have actually come up with a pretty good model of how that works and how we support open source. I’m pretty excited about that because it is definitely working. That general principle translates to practice in a bunch of different ways—from training, to custom solutions around open source, to grants, and then product. And they all contribute to helping promote open source.

The other part of what open source helps inform us on is that you have to stay abreast of it. It’s a changing dynamic and space. So a lot of times you are basically having to look out and see—and we don’t always do this perfectly but you look out and see where things are headed. To us, open source is not just about code being open, it is about building and maintaining community. Because if code is just open and free, then, okay, that might be useful but maybe it is not progressing. You are going to have to do work to get it where you need it to be, but if you can see activity and its trajectory is there, then you are kind of making educated guesses about where it is going to be and therefore where your contributions shouldn’t be trying to compete. That dance of prediction between open source communities and open source players is very interesting. And it is a subtle thing that is going on all the time that is actually a part of the dynamic of what technology gets created.

It is underappreciated, I think. When somebody does something and promotes something, an idea, it can cause other people not to do something. Just like vendors might do that, open source communities might do that as well. So it is why it is particularly important in my mind when a big innovation like Spark comes out and people start thinking this is everything. It has got some good innovations, but it is not everything. And that seems true of many other things. You want to make sure it doesn’t take the oxygen out of the room and people start to think oh we are done, when actually you are just starting to ask the right questions.

We keep coming back to this idea of community. Clearly, you have been involved in many communities, both inside and outside of Continuum. Do you have any advice to offer—either in terms of leading a community, best ways to access it or how to get a pulse?

That’s a good question. Usually those questions have a different subtopic. In terms of leading a community, you have got to listen. Not everybody can do it, not everybody should do it. Because leading community—you have got to have enough interest and passion about what you are doing to connect with people. And recognize that the people who are a part of who you bring in—it is not just “hey I’m there and every user of this thing I’m building is equally weighted.” You actually start having to segment and look for ways to create a structured hierarchy of individuals.

Hierarchy may be the wrong word because it has implications—but it is what it is. You have the inner people who you are trying to get to contribute to the thing and build the thing and extend the knowledge. Rather than 1 to 100, you are trying to create 1 to 10, and then each of those 10—you are building a tree of relationships. Because that’s actually how real, sustainable community works. It becomes this highly interrelated connective tissue. It is not a single superstar with thousands of followers. That might work in rockstar world, but that comes and goes. It doesn’t sustain. What sustains communities are relationship networks. And so you have to recognize that and realize you are building those relationship networks with every transaction and interaction you have.

I know we all see the community tracks at conferences—leadership, how to make a community strong, how to communicate—it really does seem to be a central issue.

It is. And I think it will be for a long time because it is a central human issue. And what is interesting to me, and I don’t talk about this greatly because some people have an allergic reaction to religion, but it is exactly the same problem religions have been addressing for generations. You can go back and—even if you don’t believe in anything they have done—look at the practice of how these communities have organized and recognize it is exactly the same thing here, because it is about human interaction and how do you support each other and how do you get the right folks involved. You can study and you can have things that work and things that didn’t, and understand why.

What you have to do for community to work is get people who are passionate about the vision or the mission or the thing—the unifying elements. And to stay connected those people have to mix their activity and effort with it. So if you are trying to build a community, then you have got to be able to first have a compelling thing that matters to enough people. Second, start to engage with the people who will be your missionaries, your evangelists, your people who actually care enough to tell somebody else. The Tipping Point calls them Mavens—you have to focus your attention on educating those folks. For example, making sure you have enough developer documentation instead of just user documentation. That is an important practical aspect that is often not sufficiently appreciated.

Some of this has been accidental for me. Like just watching NumPy and SciPy explode when the documentation came out that detailed how things worked. Because you have user documentation and you also have documentation of how things work, right? And the documentation of how things work is critical to attract new developers, the new people who become part of the inner circles. You need both users and people to sustain those users by encouraging everybody working on things. It is just recognizing they are both there, and then you make progress. But it has got to start with concern and interest, and passion about the thing. There is nothing that has been created that didn’t start with somebody who really cared about it. It is not quite as simple as turn on the internet and this code magically flows. The cathedral and bazaar concept by Eric Raymond I think is still absolutely real but I have a different take on the two modes.

My re-telling of his story is that any open source software starts with a cathedral: the creation of the thing. And then the bazaar is that everybody comes and builds their stuff around it. Successful open source always has to have a cathedral phase where there is a thing that gets built that is the right thing. And then it also has to enter the bazaar phase where tons of stuff gets added to it that’s easy to do. That’s how the new cathedral gets built, and this is just the cycle that continues. Cycles happen, but good software has a long life. Be aware that, with community building, you will pay for the mistakes of the community you built, early, for a long, long time.

That whole “strong foundation” aspect.

That’s right. But, how do you do that? One of the things about community is recognizing the importance of an organization to help you. I think that is one thing that is under appreciated sometimes, though it is starting to become more appreciated with the existence of organizations like the Apache Foundation, NumFOCUS, and the Software Freedom Conservancy. There are organizations now that can help with the tactical details and then can help you be part of something bigger.

I have one more thing that just occurred to me. It is important to understand what your rallying point is. So if you will forgive another religious connection, what’s your church meeting? What brings you together? Community to sustain has to have a rallying thing. Is it a conference? Is it a webinar? Is it just a very tight knit mailing group? What is the thing that rallies people and brings them together to share the good word and reignite the fire that drives the innovation?

That passion.

Yeah, and helping to build off of each other and learn from each other and recreate that passion. Conferences are usually filling that role, and they are pretty effective at helping people feel connected to a community to the point where participants feel like, hey, I’m part of this, and so I can give, and my contributions will be received. That is why it becomes very important for those conferences to be welcoming, receptive, encouraging. You can’t force all of that, but it doesn’t just magically happen, and people have to work hard to create the safe atmosphere where people can share.

PyData has been really fun to watch because we knew this was important so we did a lot of work up front to make sure that the conference started and happened the first four times. But then all of a sudden other people get involved and you welcome that, and then they become owners, and then they become part of it. Then the natural geography and boundaries give opportunities for people to have their local deep involvement and they become part of something critically important.

Not every community is going to be big enough to have that level of organization but you can certainly affiliate, and you can have sub chapters and groups and special birds of a feather inside a larger event. There is very large opportunity with that.

Local people put on the PyData conferences, right?

Local people put it on. I founded NumFOCUS months before I founded Continuum. And the first year of Continuum we spent some of our seed capital hiring an executive director for NumFOCUS, Leah Silen. Then we supported NumFOCUS as it got off the ground, to really rally a community around a set of disparate projects that each had their power and their little sub communities, but they needed a center voice. And that center voice needed to have community governance. I didn’t want to create a center voice that people perceived as Continuum. Maybe in some parallel universe that would work, right, where people weren’t so afraid of markets.

The observation for me was there already was community and it was disparate and it was spread and you can’t go and just take over that community. It doesn’t make any sense, and people will perceive any kind of attempt at that. But it needed to be rallied and connected. So to do that, it required pushing forward NumFOCUS and PyData, and sponsoring them and making it happen and paying for it until the local community recognizes it as theirs. Every PyData now is local—the organizing committee is local. There is still a center—NumFOCUS sponsors them and provides guidance, a template, provides approach, has people who will come. There are fulltime people at NumFOCUS that help make PyData happen. But they don’t happen unless there is a local organizing committee and local support.

I know that we had people go to PyData San Francisco and they were really impressed with the local aspect.

Yeah, it’s because the community is there. I knew there was this community that was nascent energy out there, but what surprised me was how much latent energy was there. It just needed a little bit of a spark—step into it folks, go do it. So that was exciting for me to see just how powerful and vibrant that community was and is.

Editor’s note: The above has been edited for length and clarity. In the next installment of this interview, we’ll talk about Travis’ next big project, and what he sees in the future of software development.

The post Building Tech Communities appeared first on Silicon Valley Data Science.


March 12, 2017

Curt Monash

Introduction to SequoiaDB and SequoiaCM

For starters, let me say: SequoiaDB, the company, is my client. SequoiaDB, the product, is the main product of SequoiaDB, the company. SequoiaDB, the company, has another product line SequoiaCM,...


March 10, 2017

Revolution Analytics

Because it's Friday: Time/Life

From simple building blocks, complex systems can emerge. As with life, so too with the Game of Life, Conway's simple set of rules for evolving pixels on a grid: For a space that is 'populated': Each...


Simplified Analytics

Better know your customers for survival in this Digital age

With Digital Transformation, we are living in a direct-to-customer world. Consumers don’t want to talk to middlemen or brokers when they need something. They also don’t want to be bombarded with...


Revolution Analytics

Updates to the Data Science Virtual Machine for Linux

The Data Science Virtual Machine (DSVM) is a virtual machine image on the Azure Marketplace assembled for data scientists. The goal of the DSVM is to provide a broad array of popular data-oriented tools...


March 09, 2017

Silicon Valley Data Science

The Value of Exploratory Data Analysis

Editor’s note: Chloe (as well as other members of SVDS) will be speaking at TDWI Accelerate in Boston. Find more information, and sign up to receive our slides here.

From the outside, data science is often thought to consist wholly of advanced statistical and machine learning techniques. However, there is another key component to any data science endeavor that is often undervalued or forgotten: exploratory data analysis (EDA). At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results.

With the rise of tools that enable easy implementation of powerful machine learning algorithms, it can become tempting to skip EDA. While it’s understandable why people take advantage of these algorithms, it’s not always a good idea to simply feed data into a black box—we have observed over and over again the critical value EDA provides to all types of data science problems.

EDA is valuable to the data scientist to make certain that the results they produce are valid, correctly interpreted, and applicable to the desired business contexts. Outside of ensuring the delivery of technically sound results, EDA also benefits business stakeholders by confirming they are asking the right questions and not biasing the investigation with their assumptions, as well as by providing the context around the problem to make sure the potential value of the data scientist’s output can be maximized. As a bonus, EDA often leads to insights that the business stakeholder or data scientist wouldn’t even think to investigate but that can be hugely informative about the business.

In this post, we will give a high level overview of what EDA typically entails and then describe three of the major ways EDA is critical to successfully modeling the data and interpreting the results. Whether you are a data scientist or the consumer of data science, we hope after reading this post that you will know why EDA should be a key part of the way data science operates in your organization.

What is EDA?

While aspects of EDA have existed as long as data has been around to analyze, John W. Tukey, who wrote the book Exploratory Data Analysis in 1977, was said to have coined the phrase and developed the field. At a high level, EDA is used to understand and summarize the contents of a dataset, usually to investigate a specific question or to prepare for more advanced modeling. EDA typically relies heavily on visualizing the data to assess patterns and identify data characteristics that the analyst would not otherwise know to look for. It also takes advantage of a number of quantitative methods to describe the data.

EDA usually involves a combination of the following methods:

  • Univariate visualization of and summary statistics for each field in the raw dataset (see figure 1)

    Distribution of variable 4

    Figure 1

  • Bivariate visualization and summary statistics for assessing the relationship between each variable in the dataset and the target variable of interest (e.g. time until churn, spend) (see figure 2)

    Distribution of target variable for categories of variable 3

    Figure 2

  • Multivariate visualizations to understand interactions between different fields in the data (see figure 3).

    Assessing interactions between two variables in the data

    Figure 3

  • Dimensionality reduction to understand the fields in the data that account for the most variance between observations and allow for the processing of a reduced volume of data
  • Clustering of similar observations in the dataset into differentiated groupings; by collapsing the data into a few small data points, clustering makes patterns of behavior easier to identify (see figure 4)

    Clustered data

    Figure 4

Through these methods, the data scientist validates assumptions and identifies patterns that will inform the understanding of the problem and model selection, builds an intuition for the data to ensure high quality analysis, and validates that the data has been generated in the way it was expected to.
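As a rough illustration of the first two methods in the list above, the sketch below computes univariate summary statistics for one field and then the same statistics for a target variable within each category of another field. The records and the field names (`plan`, `spend`) are hypothetical and chosen only for illustration; a real analysis would pair these numbers with plots.

```python
import statistics
from collections import defaultdict

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"plan": "basic", "spend": 12.0},
    {"plan": "basic", "spend": 15.0},
    {"plan": "basic", "spend": 9.0},
    {"plan": "premium", "spend": 40.0},
    {"plan": "premium", "spend": 55.0},
    {"plan": "premium", "spend": 46.0},
]


def summarize(values):
    """Univariate summary statistics for one numeric field."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def summarize_by(rows, category_field, target_field):
    """Bivariate summary: the target's statistics within each category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category_field]].append(row[target_field])
    return {category: summarize(values) for category, values in groups.items()}


overall = summarize([row["spend"] for row in records])
by_plan = summarize_by(records, "plan", "spend")
```

Comparing `overall` with `by_plan` is the numeric counterpart of figures 1 and 2: a large gap between group means suggests the categorical field carries information about the target.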

Validating assumptions and identifying patterns

One of the main purposes of EDA is to look at the data before assuming anything about it. This is important, first, so that the data scientist can validate any assumptions that might have been made in framing the problem or that are necessary for using certain algorithms. Second, assumption-free exploration of the data can aid in the recognition of patterns and potential causes for observed behavior that could help answer the question at hand or inform modeling choices.

Often there are two types of assumptions that can affect the validity of analysis: technical and business. The proper use of certain analytical models and algorithms relies on specific technical assumptions being correct, such as no collinearity between variables, variance in the data being independent of the data’s value, and data not being missing or corrupted in some way. During EDA, various technical assumptions are assessed to help select the best model for the data and task at hand. Without such an assessment, a model could be used for which assumptions are violated, making the model no longer applicable to the data in question and potentially resulting in poor predictions and incorrect conclusions that could have negative effects for an organization. Furthermore, EDA helps during the feature engineering stage by suggesting relationships that might be more efficiently encoded when incorporated into a model.
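Two of the technical checks mentioned above, collinearity and missing data, can be sketched in a few lines. This is a minimal example under stated assumptions: the column names and values are made up, the 0.9 correlation threshold is an arbitrary illustrative cutoff, and pairwise Pearson correlation is only one of several ways to look for collinearity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs whose |r| exceeds the threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


def missing_counts(rows, fields):
    """Count rows where each field is absent or None."""
    return {f: sum(1 for row in rows if row.get(f) is None) for f in fields}


# Hypothetical columns: the two height fields measure the same thing.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "visits": [3.0, 9.0, 1.0, 7.0, 4.0],
}
flagged = collinear_pairs(columns)
gaps = missing_counts([{"a": 1, "b": None}, {"a": None}], ["a", "b"])
```

Here `flagged` contains only the two height columns, which is the kind of redundancy that would make coefficients in a linear model hard to interpret.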

The second type of assumption, the business assumption, is a bit more elusive. With proper knowledge of a model, the data scientist knows each type of assumption that must be valid for its use and can go about systematically checking them. Business assumptions, on the other hand, can be completely unrecognized and deeply entangled with the problem and how it is framed. Once, we were working with a client who was trying to understand how users interacted with their app and what interactions signaled likely churn. Deeply embedded in their framing of the problem was their assumption that their user base was composed of, say, experienced chefs looking to take their cooking to the next level with complex recipes. In fact, the user base was composed mostly of inexperienced users trying to find recipes for quick, easy-to-make meals. When we showed the client the assumption that they had been building their app upon was misinformed, they had to pivot and embark on understanding a whole new set of questions to inform future app development.

While validating these technical and business assumptions, the data scientist will be systematically assessing the contents of each data field and its interactions with other variables, especially the key metric representing behavior that the business wants to understand or predict (e.g. user lifetime, spend). Humans are natural pattern recognizers. By exhaustively visualizing the data in different ways and positioning those visualizations strategically together, data scientists can take advantage of their pattern recognition skills to identify potential causes for behavior, identify potentially problematic or spurious data points, and develop hypotheses to test that will inform their analysis and model development strategy.

Building an intuition for the data

There is also a less concrete reason for why EDA is a necessary step to take before more advanced modeling: data scientists need to become acquainted with the data first hand and develop an intuition for what is within it. This intuition is especially important for being able to quickly identify when things go wrong. If, during EDA, I plot user lifetime versus age and see that younger users tend to stay with a product longer, then, I would expect whatever model I build to have a term that would result in increased lifetime when age is decreased. If I train a model that shows different behavior, I would quickly realize that I should investigate what is happening and make sure I didn’t make any mistakes. Without EDA, glaring problems with the data or mistakes in the implementation of a model can go unnoticed for too long and can potentially result in decisions being made on wrong information.
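The age versus lifetime example above can be turned into a small sanity check. The sketch below is one possible way to do it, assuming hypothetical data and a hypothetical fitted coefficient: it estimates the direction of the trend seen during EDA with a least-squares slope and compares its sign with the sign of the model's coefficient for the same variable.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def agrees_with_eda(eda_slope, model_coefficient):
    """True when the model's coefficient has the same sign as the EDA trend."""
    return (eda_slope > 0) == (model_coefficient > 0)


# Hypothetical observations: younger users stay longer.
ages = [18, 25, 32, 40, 55, 63]
lifetime_months = [30, 28, 22, 18, 12, 9]

eda_slope = least_squares_slope(ages, lifetime_months)

# Hypothetical coefficients for "age" from two different trained models.
consistent = agrees_with_eda(eda_slope, -0.4)
suspicious = agrees_with_eda(eda_slope, 0.2)
```

A mismatch such as `suspicious` being false does not prove the model is wrong, since other variables can change the sign of a coefficient, but it is a prompt to investigate before trusting the results.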

Validating that the data is what you think it is

In the days of Tukey-style EDA, the analyst was typically well aware of how the data they were analyzing was generated. However, now as organizations generate vast numbers of datasets internally as well as acquire third-party data, the analyst is typically far removed from the data generation process. If the data is not what you think it is, then your results could be adversely affected, or worse, misinterpreted and acted on.

One example of a way data generation can be misinterpreted and cause problems is when data is provided at the user level but is actually generated at a higher level of aggregation (such as for the company, location, or age group the observation is a part of). This situation results in data being the same for otherwise disparate users within a group.

Let’s look at Company A’s situation. Company A is trying to predict which of its users would subscribe to a new product offering in order to target them. They are having a hard time developing a model: every attempt results in poor predictions. Then someone thinks to perform extensive EDA, a task that they at first thought was unnecessary for achieving their desired results. The results show them that the subscribers being predicted for were part of larger corporate accounts that controlled what products their employees subscribed to. This control meant that users could look exactly the same in the data in every way but have different target outcomes, meaning that the individual-level data had little ability to inform predictions. Not only did EDA in this case expose technical problems with the approach taken so far but it also showed that the wrong question was being asked. If the subscribers’ behavior was controlled by their organization, there was no business use to targeting subscribers. The company needed to target and thus predict new product subscriptions for corporate accounts.
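The symptom described above, users who look identical in the data but have different outcomes, can be searched for directly. The following is a minimal sketch with hypothetical rows and field names; it groups rows by their feature values and reports the groups whose target values disagree.

```python
from collections import defaultdict


def conflicting_groups(rows, feature_fields, target_field):
    """Map each feature combination to its targets, keeping only conflicts."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in feature_fields)
        groups[key].add(row[target_field])
    return {key: targets for key, targets in groups.items() if len(targets) > 1}


def conflict_rate(rows, feature_fields, target_field):
    """Fraction of rows that share features with a differently labelled row."""
    conflicts = conflicting_groups(rows, feature_fields, target_field)
    in_conflict = sum(
        1 for row in rows
        if tuple(row[f] for f in feature_fields) in conflicts
    )
    return in_conflict / len(rows)


# Hypothetical user-level rows; "subscribed" is the target.
rows = [
    {"usage": "high", "tenure": "long", "subscribed": True},
    {"usage": "high", "tenure": "long", "subscribed": False},
    {"usage": "low", "tenure": "short", "subscribed": False},
    {"usage": "low", "tenure": "short", "subscribed": False},
]
features = ["usage", "tenure"]

conflicts = conflicting_groups(rows, features, "subscribed")
rate = conflict_rate(rows, features, "subscribed")
```

A high conflict rate hints that the outcome is being decided by something not present in the user-level features, which in Company A's case was the corporate account.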

Other examples that we have seen where the data generation process has been wrongly assumed:

  • Data is generated the same across versions of a product or across platforms.
  • Data is timestamped according to X time zone or the same across time zones.
  • Data is recorded for all activity but is only recorded when a user is signed in.
  • Identifiers for users remain constant over time or identifiers are unique.
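Some of the assumptions in the list above can be tested with very little code. The sketch below covers two of them, under the assumption of a hypothetical user table keyed by `user_id` and timestamps stored as ISO 8601 strings: it reports identifiers that appear more than once and collects the distinct UTC offsets found in the timestamps.

```python
from collections import Counter
from datetime import datetime


def duplicate_ids(rows, id_field):
    """Identifiers that occur more than once in a table expected to be unique."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def utc_offsets(timestamps):
    """Distinct UTC offsets in ISO 8601 strings (None means no offset given)."""
    return {datetime.fromisoformat(text).utcoffset() for text in timestamps}


# Hypothetical user table with one repeated identifier.
users = [
    {"user_id": "u1", "country": "US"},
    {"user_id": "u2", "country": "DE"},
    {"user_id": "u1", "country": "CA"},
]

# Hypothetical timestamps: UTC, Pacific time, and one with no zone at all.
stamps = [
    "2017-03-09T10:00:00+00:00",
    "2017-03-09T02:00:00-08:00",
    "2017-03-09T10:00:00",
]

repeated = duplicate_ids(users, "user_id")
offsets = utc_offsets(stamps)
```

More than one offset, or a mix of zoned and unzoned values, means the timestamps cannot be compared directly until they are normalized.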

So how do you go about getting all this value?

Now that you know why EDA is valuable, you might want to know how to do it. One way is to attend our talk at TDWI Accelerate on best practices for EDA on April 3. We will also be publishing a number of blog posts in the future on various EDA methods. Until then, here are a few of our blog posts and internal projects that have highlighted the insights gained by EDA:


The post The Value of Exploratory Data Analysis appeared first on Silicon Valley Data Science.

Revolution Analytics

The Rise of Civilization, Visualized with R

This animation by geographer James Cheshire shows something at once simple and profound: the founding and growth of the cities of the world since the dawn of civilization. Dr Cheshire created the...

InData Labs

Meet InData Labs at Big Data Innovation Summit in London

The Big Data Innovation Summit London will bring together executives from the data community for two days of keynotes, panel sessions, discussions & networking. The event focuses on all areas of Big Data including Data Strategy, Data Science, Hadoop, Data Mining, Cultural Transformation and much more. We look forward to seeing you there!

The post Meet InData Labs at Big Data Innovation Summit in London first appeared on InData Labs.


[WHITE PAPER] Leading the Way for Data-as-a-Service

A new year presents new opportunities to get excited about. Expanding our Data-as-a-Service is the cornerstone of our company vision for 2017. In our most recent white paper, we explain how this gives you greater access to the Big Data you need from the Deep Web. In order to accomplish this, our leadership team has developed […] The post [WHITE PAPER] Leading the Way for Data-as-a-Service appeared first on BrightPlanet.


How to build a Data-Driven Collections strategy

In our previous blog post we looked at data analytics in collections and the expected change in performance.

Strategically, data analytics drives operational execution, but the question remains: where do we start? In this blog post, I outline the 3 steps to building your own data-driven collections strategy.

Teradata ANZ

Mastering colours in your data visualisations

I’ll be the first to admit that I am terrible at colours. Be it the selection of paint for a room through to the colours of an Excel chart. I simply choose the ones that I like without much regard for everyone else. It’s natural for me to think “well I know what I’m talking about or looking at”. What I often forget is “what will the end consumer of this think?”. We all know that data visualisation brings data science into a consumable format for end users and dramatically helps us humans to interpret information more easily and quickly.

Take the following table for example:

Ben Davis_Visualisations 1

If we wanted to compare Domestic with International sales and find the peaks and troughs in the data we would need to read each line and work it out in our head. Although not an arduous task, it still takes time to interpret and calculate in our heads.

But if we visualise that same data we can quickly see this information:

Ben Davis_Visualisations 2
Hence visualisation of data generally makes it easier to interpret the results. Now of course not all data can be visualised, as large datasets with many plot points often end up looking crazy.

“Maps were some of the first ways that the human race looked at data in a visual format.”

But an often-overlooked component of data visualisation is the colour aspect. Do it right and the results will speak for themselves and your work will be well accepted by the business. But get it wrong and it can lead to confusion and misinterpretation. Heaven help us that all our hard work in data wrangling, sorting and analysing all comes to nothing just because we chose the wrong colour for a data value.

So why is colouring so hard to get right? The answers are quite simple:

Cultural interpretation of colour – If you see a red light you stop, stop signs are red, warning sirens are generally red. So as a result you generally accept red being a colour of danger. But this isn’t necessarily so in other cultures as the colour red in China means prosperity and luck. So think twice about colouring negative values in red if your end users are Chinese.

Colours are hard to tell apart – How often have you been stuck trying to find a different shade of blue, brown or a pastel colour? Representing different values in similar colours often leads to confusion, resulting in the consumer having to refer to a legend to understand what colour value matches the colour they are looking at. Worse still, those who are colour blind may interpret a result entirely differently, as the colours are so close to each other that telling them apart becomes impossible.

Apart from employing a visual designer to ensure your data visualisations are top notch there are two very simple rules to keep in mind. Of course there are numerous other rules and there are volumes of theses written on this very topic, but the two simple rules listed below are a start if you are like me and suffer from a lack of colour skills in the design stage.

1) Colouring sequential data
Sequential data is data that progresses from low to high or high to low and therefore you should use gradient colours to represent the change in gradient. Once again it is a fine line of colours you use here as you want to ensure the colours are distinct enough to represent the gradient curve, but not so distinct as to represent dramatic changes in values. Stick to colours from the same colour group.

2) Colouring qualitative data
The opposite of sequential data is qualitative data that represents categories that are distinctly different from other categories on the screen. They want or need to be seen as totally different from others. This is where you need to apply contrasting colours to highlight the differences. For example, green against blue.
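The two rules above can be sketched with Python's standard `colorsys` module. This is only an illustration under assumptions: the default hue, lightness range, and saturation values are arbitrary choices, and evenly spaced hues are not guaranteed to be distinguishable for colour-blind viewers. The sequential palette keeps one hue and varies lightness; the qualitative palette spreads hues around the colour wheel.

```python
import colorsys


def to_hex(r, g, b):
    """Convert RGB components in the range 0..1 to a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


def sequential_palette(n, hue=0.6, light_start=0.85, light_end=0.25,
                       saturation=0.6):
    """n shades of a single hue, light to dark, for low-to-high values."""
    colours = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        lightness = light_start + t * (light_end - light_start)
        colours.append(to_hex(*colorsys.hls_to_rgb(hue, lightness, saturation)))
    return colours


def qualitative_palette(n, lightness=0.5, saturation=0.65):
    """n evenly spaced hues for categories with no natural order."""
    return [
        to_hex(*colorsys.hls_to_rgb(i / n, lightness, saturation))
        for i in range(n)
    ]


blues = sequential_palette(5)        # one colour group, graded
categories = qualitative_palette(4)  # contrasting hues
```

The resulting hex strings can be passed to most charting tools that accept colour codes. For production charts, an established and tested colour scheme is usually a safer choice than generated values.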

Always keep the end consumer of your visualization in mind. You may know your data inside/out and therefore understand it, but until someone who doesn’t know your work looks at it, you’ll only be designing from your perspective.

The post Mastering colours in your data visualisations appeared first on International Blog.


March 08, 2017

Forrester Blogs

Insights Services Drive Data Commercialization

The new data economy isn't about data; it is about insights. How can I increase the availability of my locomotive fleet? How can I extend the longevity of my new tires? How can I improve my...


Revolution Analytics

Employee Retention with R Based Data Science Accelerator

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) Employee retention has been and will continue to be one of the biggest challenges of a company. While...


March 07, 2017

Revolution Analytics

In case you missed it: February 2017 roundup

In case you missed them, here are some articles from February of particular interest to R users. Public policy researchers use R to predict neighbourhoods in US cities subject to gentrification. The...

Big Data University

This Week in Data Science (March 7, 2017)

Here’s this week’s news in Data Science and Big Data. Artificial Intelligence

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course.
  • Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems.

The post This Week in Data Science (March 7, 2017) appeared first on BDU.

Forrester Blogs

Insights Services Leaders Deliver True Decision Support

The explosive growth of the data economy is being fueled by the rise of insights services. Companies have been selling and sharing data for years. Acxiom and Experian made their name by providing...


Forrester Blogs

Mobile World Congress 2017: Observations Regarding The Main Enterprise Themes

Recently, the largest annual get together of the mobile industry, Mobile World Congress (MWC) took place in Barcelona. In my opinion, the biggest themes at MWC in 2017 that are relevant for...


Revolution Analytics

Preview: R Tools for Visual Studio 1.0

After more than a year in preview R Tools for Visual Studio, the open-source extension to the Visual Studio IDE for R programming, is nearing its official release. RTVS Release Candidate 2 1 is now...


March 06, 2017

Revolution Analytics

R 3.3.3 now available

The R core group announced today the release of R 3.3.3 (code-name: "Another Canoe"). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are...


March 05, 2017

Simplified Analytics

User Experience UX is at the heart of Digital Transformation

When Pokémon Go was launched, do you remember how many crazy people were randomly walking on the road, searching for Poké Balls and Pokémon in public places, which created a record of 500 million...


March 03, 2017

Revolution Analytics

Because it's Friday: How to read music

I don't play music. I can barely sing. (Don't accept a Rock Band party invitation from me unless you have no further need of your eardrums.) And I certainly can't read music. So I'm sure I'm missing...

Ronald van Loon

The IoT-Connected Car of Today— Cases From Hertz, Nokia, NTT, Mojio & Concur Technologies

Imagine a world where your car not only drives itself, but also says intelligent things like these:

  • A hotel is just around the corner and you have been driving for eight hours. Would you like to reserve a room and rest for a couple of hours?
  • You last serviced the brakes twelve months ago and you have driven your car about 20,000 miles since then. Would you like me to find a dealer and book an appointment?

This would have looked impossible about five years ago, when the world was unaware of a technology called the Internet of Things (IoT). Today, the IoT is already breaking fresh ground for tech companies and car manufacturers, enabling them to realize their idea of a ‘connected car.’

I recently attended Mobile World Congress (#MWC17) in Barcelona, where SAP announced its collaboration with Hertz, Nokia and Concur Technologies. The purpose of this new partnership is to leverage the IoT to offer an intelligent, automated experience to car users. SAP also announced its collaboration with Mojio, the connected vehicle platform and app provider for T-Mobile USA and Deutsche Telekom. The integration of Mojio’s cloud computing capabilities with SAP Vehicles Network will make parking and fueling a breeze for users. From enabling drivers to reserve a parking spot based on calendar events to expense management for business travelers, SAP’s collaboration with these companies is likely to accelerate the development of connected cars.

In this article, I have discussed the cases that caught my interest and that, in my opinion, are likely to progress and evolve into something revolutionary.

Mojio — The IoT Connected Car 

Mojio’s new smart car technology is set to create an automotive ecosystem that will allow the automotive, insurance, and telecom industries to thrive together. The recent news that Mojio plans to connect 500,000 vehicles to its cloud platform in the first phase suggests that the technology is really taking off and that the idea of ‘connected cars’ is likely to become a reality soon.

Mojio’s Data Analytics Capabilities

The open connected car platform introduced by Mojio has advanced data collection and analytical capabilities. The data collected by the sophisticated telematics device can be categorized into three types: contextual, behavioral, and diagnostic. Using mathematical and statistical modeling, Mojio discovers meaningful patterns and draws conclusions from the data to allow companies to better understand the needs, behaviors, and expectations of their customers and drive product and service improvements.

Here’s how it all works.

  • Behavioral Data: Mojio’s telematics device gathers information about speed, steering, and braking inputs to determine the driver’s fatigue level and issue alerts. Long-term driving behavior data can also be used to help the user adopt a more fuel-efficient driving style and to help insurance companies calculate risk.
  • Diagnostic Data: With the ability to access a vehicle’s data remotely, car manufacturers can assess the health of the vehicle and combine this capability with in-car voice communication to notify customers when service is required.
  • Contextual Data: Led by Google and Amazon, contextual targeting of advertisements based on an individual’s search data has become common practice in the digital world. Mojio is using the same principle to offer more personalized advice to car drivers. It enriches the behavioral and contextual data of a customer with geolocation data, posted speed limits, and updated traffic flow conditions to provide valuable recommendations to the driver.

Data Sharing Outside the Connected Car Ecosystem 

Mojio has evolved from being a ‘service provider’ to a ‘system integrator’, and it now works with Google, Amazon, Microsoft, and other companies to offer all the services a user may need in an integrated, unfragmented manner. Built on SAP Vehicles Network, the Connected Car Ecosystem introduces users to a new level of convenience and comfort. Using the capabilities of this open connected car platform, users can now ask Amazon Alexa questions about their newly connected car, such as “Alexa, ask Mojio how much fuel my car has left.”

Future Possibilities: A Value Chain in Flux

Mojio has partnered with a number of companies, including Amazon Alexa, Dooing, IFTTT, FleetLeed, and SpotAngels. The integration of the value chains of these companies will mean improved convenience and better personalized services for customers. While the possibilities are unlimited, I have listed a couple of examples here to help you get an idea of the potential of this technology.

Logistics providers: Using the capabilities of this open connected car platform, you can request Amazon, UPS, DHL, or FedEx to deliver an order directly to the boot of your car. The courier will find your car using the geolocation data, enter a security code to open the luggage compartment, and leave your parcel while you’re in a meeting or having your lunch at a restaurant.

IFTTT: The integration of Mojio and IFTTT means that your calendar will be automatically updated based on your travel habits. You will also be able to set triggers and actions, such as:

  • When my vehicle ignition turns on, mute my Android tone.
  • Track new trips in a Google spreadsheet.
  • Receive a notification when Mojio senses that my car’s battery is low.
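The trigger-and-action idea behind these examples can be sketched in a few lines of Python. This is only an illustration of the pattern, not the IFTTT or Mojio API: the event fields (`type`, `distance_km`, `voltage`), the `Rule` class, the `run_rules` helper, and the 11.8-volt threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]  # hypothetical shape of a vehicle event


@dataclass
class Rule:
    """A trigger/action pair: run the action when the trigger matches an event."""
    name: str
    trigger: Callable[[Event], bool]
    action: Callable[[Event], str]


def run_rules(rules: List[Rule], event: Event) -> List[str]:
    """Return the action results of every rule whose trigger matches the event."""
    return [rule.action(event) for rule in rules if rule.trigger(event)]


# Rules mirroring the three examples in the list above (all values are assumptions).
rules = [
    Rule(
        "mute phone on ignition",
        trigger=lambda e: e.get("type") == "ignition_on",
        action=lambda e: "mute phone",
    ),
    Rule(
        "log trip to spreadsheet",
        trigger=lambda e: e.get("type") == "trip_completed",
        action=lambda e: f"append row: {e.get('distance_km')} km",
    ),
    Rule(
        "low battery alert",
        trigger=lambda e: e.get("type") == "battery" and e.get("voltage", 99.0) < 11.8,
        action=lambda e: "notify: car battery is low",
    ),
]

# Example: a battery reading below the assumed threshold fires only the alert rule.
fired = run_rules(rules, {"type": "battery", "voltage": 11.5})
```

Keeping triggers and actions as separate functions means a new rule can be added without touching the code that dispatches events.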

SpotAngels: Did you know that Mojio could save you money? Mojio’s partnership with SpotAngels will allow you to receive alerts for street cleaning, alternate side parking, or parking meters, helping you avoid parking tickets.

The possibilities are virtually unlimited. For example, if Mojio partners with a call center, then businesses will be able to get voice recordings of calls made by customers for roadside assistance or directions and use this information to ensure quality control or for CRM.

Hertz — The Rent-a-Car Company Ready to Use IoT to Improve Its Customer Experience

Hertz is set to become the first car rental company to use the Internet of Things to offer improved services to its customers. It announced its decision to join SAP Vehicles Network at the conference that I recently attended. Membership of the SAP Vehicles Network, which currently comprises leading names like Nokia, Concur Technologies, and Mojio, will allow Hertz to elevate the car-rental experience of its customers by providing them with personalized advice and services.

Hertz is likely to integrate travel and itinerary planning along with in-car personalization to deliver just what the client needs. In addition, the integration of Concur’s TripLink will be particularly beneficial for business travelers. The app will aggregate all the travel-related expenses, including fuel and parking fees, to allow customers to generate a single expense report for the entire trip. Using Concur’s TripLink, business travelers will be able to submit their trip expense report with a single click immediately after the trip is completed.

Nokia to Offer Robust, Multi-Layered Security to Connected Cars

Nokia has designed a horizontal solution to address the challenges posed by the fragmented and complex IoT ecosystem, which comprises disparate devices and applications. Titled ‘Intelligent Management Platform for All Connected Things’ (IMPACT), the new solution offers connectivity, data collection, analytics, and business application development capabilities across all verticals.

Using IMPACT, service providers will be able to assume a competitive position in the market by offering their customers a number of value-adding options, such as:

  • Monitoring of traffic flow to offer real-time updates to customers.
  • Personalization of driver settings and entertainment systems.
  • Remote monitoring of speed, fuel levels, and other metrics for vehicle diagnostics and predictive maintenance.

Improved Safety with Live Transportation Monitoring

Apart from Nokia, Hertz, and Mojio, SAP is also working with NTT to devise a state-of-the-art solution that can improve the safety of public transport. The solution, which is called Live Transportation Monitoring, has three components: NTT’s IoT analytics platform, SAP’s connected transportation safety portal, and hitoe®, a fabric that will be used to manufacture drivers’ workwear.

This fabric is coated with a conductive polymer, which will help the service provider monitor the driving behavior and key health parameters of drivers remotely and in real time. The data will be presented on SAP’s connected transportation safety portal (as exhibited in the photo below). This way, public transportation companies will be able to ensure the complete safety of their passengers, as well as monitor the health of their employees and vehicles.

Combined, all these technologies have the potential to make the driving experience of customers safer, more convenient, and less costly. Also, since this is a relatively new market, we can expect new players to join hands, gain a foothold, and push the boundaries of what’s possible with IoT.

What do you think of these new developments? Don’t forget to like the article, share your comments and insights.

If you would like to read Ronald van Loon’s future posts, please click ‘Follow’ and feel free to also connect on LinkedIn and Twitter to learn more about the possibilities of IoT and Big Data.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.

The post The IoT-Connected Car of Today— Cases From Hertz, Nokia, NTT, Mojio & Concur Technologies appeared first on Ronald van Loons.


March 02, 2017

Revolution Analytics

Predicting the length of a hospital stay, with R

I haven't been admitted to hospital many times in my life, but every time the only thing I really cared about was: when am I going to get out? It's also a question that weighs heavily on hospital...


Motivated Reasoning: What it is and how to avoid it in Data Analysis

Use versus abuse of statistics can often be characterised by the analytical approach adopted to the problem at hand. In this blog post, which is part of a series on Logical Fallacies to avoid in Data Analysis, I’ll be focusing on defining the motivated reasoning logical fallacy and how to avoid it in data analysis.

Revolution Analytics

Scholarships encourage diversity at useR!2017

While representation of women and minorities at last year's useR! conference was the highest it's ever been, there is always room for more diversity. To encourage more underrepresented individuals to...


March 01, 2017


Scott Vercruysse: From Researching Markets to Data Harvesting

Breathing heavily, he darts down the rubble of the trail ahead of him. Some days the wind blows hard, and some days the sun shines bright. Either way, he keeps going. For Scott Vercruysse, one of our Data Acquisition Engineers at BrightPlanet, both running outdoors and taking on client projects present him with new challenges. He prefers it that […] The post Scott Vercruysse: From Researching Markets to Data Harvesting appeared first on BrightPlanet.

Curt Monash

One bit of news in Trump’s speech

Donald Trump addressed Congress tonight. As may be seen by the transcript, his speech — while uncharacteristically sober — was largely vacuous. That said, while Steve Bannon is firmly...