
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


August 19, 2017

Simplified Analytics

Are you drowning in Data Lake?

Today more than ever, every business is focusing on collecting data and applying analytics to stay competitive. Big Data Analytics has passed the hype stage and has become an essential part of...


Revolution Analytics

20 years of the R Core Group

The first "official" version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday,...

Cloud Avenue Hadoop Tips

Creating a Thumbnail using AWS Lambda (Serverless Architecture)

Introduction to AWS Lambda

In one of the earlier blogs here, we discussed AWS Lambda, a FaaS (Function as a Service) offering, with a simple example. In Lambda the granularity is at the function level, and the pricing is based on the number of times a function is called, so cost is directly proportional to the growth of the business. We don't need to think in terms of servers (serverless architecture); AWS automatically scales the resources as the number of calls to the Lambda function increases. We specify the amount of memory to allocate to the Lambda function, and the function is automatically allocated a proportional share of CPU. Here is the FAQ on Lambda.

Amazon has published a nice article here on how to create a Lambda function that is triggered by an image uploaded to S3 and automatically creates a Thumbnail of it in a different S3 bucket. The same approach can be used for a photo-sharing site like Flickr, Picasa, etc. The article is detailed, but there are a lot of steps involving the CLI (Command Line Interface), which is not a piece of cake for those who are just starting with AWS. In this blog we will look at the sequence of steps for doing the same using the AWS Web Management Console.

Sequence of steps for creating an AWS Lambda function

We go with the assumptions that Eclipse and the Java 8 SDK have already been set up on the system and that the participants already have an account created with AWS. For the purposes of this article, the IAM, S3, and Lambda resources consumed fall under the AWS free tier.

Step 1: Start Eclipse. Go to 'File -> New -> Project ...', choose 'Maven Project' and click Next.

Step 2: Choose to create a simple project and click Next.

creating maven project in eclipse

Step 3: Type the following artifact information and click on Finish.

Group Id: doc-examples
Artifact Id: lambda-java-example
Version: 0.0.1-SNAPSHOT
Packaging: jar
Name: lambda-java-example

creating maven project in eclipse

The project will be created and the Package Explorer will look as below.

maven project structure in eclipse

Step 4: Replace the pom.xml content with the content below. This Maven file has all the dependencies and plugins used in this project.

<project xmlns="" xmlns:xsi="" xsi:schemaLocation="">
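The namespace URLs and the body of the POM were stripped from this page. As a rough sketch of what the file plausibly contains for a Java 8 Lambda project (the dependency and plugin versions are assumptions, not the article's exact values):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>doc-examples</groupId>
  <artifactId>lambda-java-example</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>lambda-java-example</name>

  <dependencies>
    <!-- Core Lambda interfaces (RequestHandler, Context); version is an assumption -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-lambda-java-core</artifactId>
      <version>1.1.0</version>
    </dependency>
    <!-- S3 event types for the trigger; version is an assumption -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-lambda-java-events</artifactId>
      <version>1.3.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- Produces the fat jar built in Step 7 via the shade:shade goal -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
      </plugin>
    </plugins>
  </build>
</project>
```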

Step 5: Create the example package and then add the S3EventProcessorCreateThumbnail Java code from here to a file in it.
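The linked handler code is not reproduced here, but the heart of it, scaling an image down to a bounded size, can be sketched in plain JDK Java. This is an illustrative sketch only; the class name and the 100-pixel bound are ours, not necessarily the article's exact values:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ThumbnailSketch {
    static final int MAX_DIM = 100;  // hypothetical thumbnail bound

    // Scale (w, h) so the longer side becomes maxDim, preserving aspect ratio.
    static int[] scaleDimensions(int w, int h, int maxDim) {
        double scale = Math.min((double) maxDim / w, (double) maxDim / h);
        return new int[] { (int) (w * scale), (int) (h * scale) };
    }

    // Draw the source image into a smaller BufferedImage with smooth interpolation.
    static BufferedImage resize(BufferedImage src, int maxDim) {
        int[] d = scaleDimensions(src.getWidth(), src.getHeight(), maxDim);
        BufferedImage dst = new BufferedImage(d[0], d[1], BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, d[0], d[1], null);
        g.dispose();
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(400, 300, BufferedImage.TYPE_INT_RGB);
        BufferedImage thumb = resize(src, MAX_DIM);
        System.out.println(thumb.getWidth() + "x" + thumb.getHeight()); // prints "100x75"
    }
}
```

In the real handler, the input image would be read from the triggering S3 object and the result written to the target bucket with the AWS SDK.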

maven project structure in eclipse

Step 6: Now, it's time to build and package the project. Right click on the project in the Package explorer view and then go to 'Run As -> Maven build ....'. Enter 'package' in the Goals as shown below and then click Run.

packaging the code using maven

Once the Maven build is complete, BUILD SUCCESS should be shown in the console at the bottom right, and the jar should appear in the target folder after refreshing the project.

jar file in eclipse

Step 7: Build the project again using the above step with the Goal as 'package shade:shade'. Make sure to change the name of the Maven run configuration to something other than the previous name, as shown below.

packaging the code using maven

Once the Maven build is complete, BUILD SUCCESS should be shown in the console at the bottom right, and the jar should appear in the target folder after refreshing the project.

jar file in eclipse

Step 8: An IAM role has to be created and attached to the Lambda function so that it can access the appropriate AWS resources. Go to the IAM AWS Management Console. Click on Roles and then 'Create new role'.

creating an iam role

Step 9: Select the AWS Lambda role type.

creating an iam role

Step 10: Filter for the AWSLambdaExecute policy and select it.

creating an iam role

Step 11: Give the role a name and click 'Create role'.

creating an iam role

The role will be created as shown below. The same role will be attached to the Lambda function later.

creating an iam role

Step 12: Go to the Lambda AWS Management Console, click on 'Create a function' and then select 'Author from scratch'.

creating a lambda function

creating a lambda function

Step 13: A trigger can be added to the S3 bucket later. Click on Next.

creating a lambda function

Step 14: Specify the below for the Lambda details and click on Next.
  • Function name as 'CreateThumbnail'
  • Runtime as 'Java 8'
  • Upload the lambda-java-example-0.0.1-SNAPSHOT.jar file from the target Eclipse folder
  • Handler name as example.S3EventProcessorCreateThumbnail::handleRequest
  • Choose the role which has been created in the IAM

Step 15: Verify the details and click on 'Create function'.

creating a lambda function

Within a few seconds the success screen should be shown as below.

creating a lambda function

Clicking on the Functions link on the left will show the list of all the functions uploaded to Lambda from this account.

creating a lambda function

Now that the Lambda function has been created, it's time to create buckets in S3 and link the source bucket to the Lambda function.

Step 16: Go to the S3 AWS Management Console and create the source and target buckets. The name of the target bucket should be the source bucket name appended with the word resized; this logic has been hard-coded in the Java code. There is no need to create the airline-dataset bucket; it was already present in S3.

creating buckets in s3

Step 17: Click on the S3 source bucket and then on Properties to associate it with the Lambda function created earlier, as shown below.

s3 attaching the lambda function

s3 attaching the lambda function

Step 18: Upload the image to the source bucket.

image in the source bucket

If everything goes well, the Thumbnail image should appear in the target bucket within a few seconds. Note that the size of the resized Thumbnail image is smaller than the original image.

image in the target bucket

Step 19: The log files for the Java Lambda function can be found in the CloudWatch AWS Management Console, as shown below. If for some reason the resized image is not in the target S3 bucket, the cause can be found in the CloudWatch logs.

cloudwatch logs

cloudwatch logs

cloudwatch logs

Step 20: A few metrics, like invocation count and duration, can also be obtained from the AWS Lambda Management Console.

cloudwatch metrics



A few things can be automated using the AWS Toolkit for Eclipse and the framework, but they hide most of the details, so it's better to follow the above sequence of steps to understand what happens behind the scenes in the AWS Lambda service. In future blogs, we will also explore how to do the same with the AWS Toolkit for Eclipse and the framework.

The AWS Lambda function can be fronted by AWS API Gateway, which provides a REST-based interface to create, publish, maintain, monitor, and secure APIs at any scale. This again is a topic for a future blog post.

Serverless Architecture would be the future, as it removes the burden of managing infrastructure from the developer and moves it to a Cloud vendor like AWS. DynamoDB also falls under the same category: while creating a DynamoDB table, we simply specify the Read and Write Capacity Units, and Amazon automatically provisions the appropriate resources, as in the case of Lambda.

This is one of the lengthiest blogs I have written, and I really enjoyed it. I plan to write more such detailed blogs in the future.

August 18, 2017

Revolution Analytics

Because it's Friday: Movie Trailer

Via Gizmodo, this generic template for a AAA movie trailer recalls that generic brand video from a couple of years back. That's all for us for this week. Have a great weekend, we'll be back on Monday!


Revolution Analytics

Obstacles to performance in parallel programming

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to...


What is Machine Learning? And other FAQs we get...

Here's a blog post covering some of the most frequently asked questions we get on Machine Learning and Artificial Intelligence, or Cognitive Computing. We start off with "What is Machine Learning?" and finish off with addressing some of the fears and misconceptions of Artificial Intelligence.

So, what is machine learning? A simple search on Google for the answer will yield many definitions for it that leave most non-analytical people confused and entering more "What is..." statements into Google. So, I asked our Head of Marketing to try his hand at defining Machine Learning in the most simplistic way he can: explain Machine Learning to someone you've just met at a social gathering. Here's his definition - a "Machine Learning for Beginners' " definition if you will. 


August 17, 2017

Silicon Valley Data Science

Space Shuttle Problems: Long-term Planning Amid Changing Technology

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

Many enterprises are hoping to systematically drive competitive advantage and operational efficiency by gaining insight from data. For that reason, they may be contemplating — or already well into — a “data lake” effort. These are large, complex efforts, typically focussed on integrating vast enterprise data assets into a consolidated platform that makes it easier to surface insight across operational silos. Data lake programs are typically multi-year endeavors, and companies must embark on those efforts amidst a rapidly changing technology landscape, while simultaneously navigating highly dynamic business climates.

Such programs are replete with what we call space shuttle problems: your spaceship is going to launch in 10 or 12 years, so whatever processors you have today are going to seem quaint by the time you fly. The challenge is to find a way to take advantage of technical advances that will inevitably occur throughout a long, complex effort, without halting progress.

The attempt to “future proof” designs leads many organizations into analysis paralysis. Agile, iterative problem-solving approaches can help you avoid that paralysis. They are designed to embrace a changing scope and to be flexible enough to take advantage of unanticipated opportunities. There are limits to this approach, however. I often quip that while agile is great for many things, you can’t wait until you are in space to figure out how to breathe. Some aspects of a solution are so critical that there is little point to building anything else before you know you can solve that part of the overall problem.

How can you manage your implementation in a way that allows you to take maximum advantage of technology innovation as you go, rather than having to freeze your view of technology to today’s state and design something that will be outdated when it launches? You must start by deciding which pieces are necessary now, and which can wait.

Planning is Essential

With a space shuttle, there are components you simply don’t have the freedom to wait for. You need to design the control systems. You have to accept that by the time they are implemented and tested sufficiently to send a lucky few astronauts into space (and hopefully not kill them), you will be running on outdated equipment because you are optimizing your approach for reliability, not performance efficiency.

Beyond components with extreme reliability requirements, there are other aspects of a solution you want to make sure you tackle very early in a project. Let’s say the value proposition for the data lake effort included gaining a lot of value from using natural language processing techniques (NLP) on unstructured data. On the one hand, NLP is an area of rapid innovation. You might benefit from that innovation if you wait until you’ve loaded and integrated all of your unstructured data into the lake before choosing your NLP approach. However, if the NLP approach is unsuccessful, the time and expense of loading that data, scaling the infrastructure, learning how to use the NLP tool, etc., are potentially wasted. Because NLP is rapidly evolving, determining its utility for a given use case typically requires experimentation for that specific purpose. This is exactly when you want to conduct prototyping exercises early in the project.

The corollary to front-loading efforts to prove out critical, but uncertain aspects of a solution is to back-load efforts to implement well understood aspects of the solution. Why rush into implementing a user interface? It might be years before the project uses it and who knows what innovations happen in that time. There is little to no risk in waiting, and if the project runs into any major roadblocks there’s no wasted effort.

Space Shuttle Architecture

Beyond careful staging of how you go after your project’s backlog, there are architectural approaches you can use to take advantage of innovation. Abstracting technology choices, so there is no unnecessary tight coupling of those choices, can make it much easier to change your mind or adopt the latest and greatest down the road.

For example, it used to be common to make data requests directly to an underlying database from a user application. This is the simplest and most efficient approach. However, at Silicon Valley Data Science (SVDS) we avoid this approach at all costs. By implementing a service layer between the application and the database, the application architecture is no longer tightly coupled to the database selection. You can change your database from Oracle to Postgres, or from MySQL to Cassandra, and your application is none the wiser. This is how we can isolate technology choices and support very complex, evolving application architectures without them breaking every time someone changes something.
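The decoupling described above can be sketched as a tiny service-layer interface. The names here are illustrative only, not SVDS's actual code; the in-memory store stands in for whichever database sits behind the service:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ServiceLayerSketch {
    // The application codes against this interface only.
    interface UserStore {
        Optional<String> findEmail(long userId);
    }

    // One implementation per backing store; swapping Oracle for Cassandra
    // means writing a new implementation, not touching the application.
    static class InMemoryUserStore implements UserStore {
        private final Map<Long, String> rows = new HashMap<>();
        InMemoryUserStore() { rows.put(1L, "alice@example.com"); }
        public Optional<String> findEmail(long userId) {
            return Optional.ofNullable(rows.get(userId));
        }
    }

    public static void main(String[] args) {
        // Could be a JdbcUserStore or CassandraUserStore; callers cannot tell.
        UserStore store = new InMemoryUserStore();
        System.out.println(store.findEmail(1L).orElse("not found")); // prints "alice@example.com"
    }
}
```

The application depends only on UserStore, so the database choice can change behind the service boundary without rippling through the application architecture.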

Amazon has used this services approach to dramatic effect. They are able to support a dizzying array of products, warehouses, supply chains, partner vendors, and service offerings successfully, with several aspects of the entire system in flux at any given time. It’s probably the only way possible to integrate tens of thousands of partner inventory systems into what seems like a single store. Could you imagine if Amazon had to update its systems every time one of its small vendor partners updated their inventory system?

Next Steps

The idea of the space shuttle problem is to help put into perspective how we execute technology-driven projects in a rapidly changing (and improving) technology landscape. The goal is identifying and reducing our exposure to risk while increasing our understanding of the compromises involved with technology decisions, taking maximal advantage of innovation as we go. In some cases, the value gained by forgoing agility may be worth the consequences.

The point is to identify and front-load the poorly understood pieces of your project. Proper use of abstraction techniques can make changing your mind about those choices least expensive and easiest to implement—and that maximizes your ability to take advantage of technology innovation. Come to think of it, that’s probably the achievable version of “future proof.”

The post Space Shuttle Problems: Long-term Planning Amid Changing Technology appeared first on Silicon Valley Data Science.

Revolution Analytics

How to build an image recognizer in R using just a few images

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer...


Curt Monash

More notes on the transition to the cloud

Last year I posted observations about the transition to the cloud. Here are some further thoughts. 0. In case any doubt remained, the big questions about transitioning to the cloud are...


August 16, 2017


Deep Web Partner News Round-Up

BrightPlanet’s business partners have been active this summer, winning great awards and making impressive announcements. Here’s the latest Deep Web and data extraction news from our friends in BrightPlanet’s Partner News Round-Up.  Newly Released In April, we were excited to learn about Rosoka’s latest release and engine improvements with Rosoka Series 6. This release has […] The post Deep Web Partner News Round-Up appeared first on BrightPlanet.

Read more »

Revolution Analytics

In case you missed it: July 2017 roundup

In case you missed them, here are some articles from July of particular interest to R users. A tutorial on using the rsparkling package to apply H2O's algorithms to data in HDInsight. Several...


Revolution Analytics

Reproducibility: A cautionary tale from data journalism

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to...


Revolution Analytics

Buzzfeed trains an AI to find spy planes

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the...


August 15, 2017

Silicon Valley Data Science

Data Ingestion with Spark and Kafka

An important architectural component of any data platform is the set of pieces that manage data ingestion. In many of today’s “big data” environments, the data involved is at such scale in terms of throughput (think of the Twitter “firehose”) or volume (e.g., the 1000 Genomes project) that approaches and tools must be carefully considered.

In the last few years, Apache Kafka and Apache Spark have become popular tools in a data architect’s tool chest, as they are equipped to handle a wide variety of data ingestion scenarios and have been used successfully in mission-critical environments where demands are high.

In this tutorial, we will walk you through some of the basics of using Kafka and Spark to ingest data. Though the examples do not operate at enterprise scale, the same techniques can be applied in demanding environments.


This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

  • Git—to manage and clone source code
  • Docker—to run some services in containers
  • Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala
  • Maven 3—to compile the code we write
  • Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial
  • Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

Additionally, you will need a Twitter developer account. If you have a normal Twitter account, you can obtain API keys by verifying your account via SMS.

A note about conventions. Throughout this tutorial, you will see some commands that start with a prompt (a dollar sign) and are typed in a monospaced font. These are commands intended to be run in a terminal. To do this, just copy the command, excluding the prompt, paste it into your terminal, and press the return key.

Prerequisite: verify tools


Run the following commands and check your output against what is expected.

$ java -version
$ javac -version
$ git --version
$ mvn --version
$ docker --version

If any of these commands fail with an error, follow the guidelines to install them on your operating system.
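As a convenience, the presence of the required tools can also be checked in one loop. This script is our sketch, not part of the original tutorial:

```shell
# check_tools prints one line per required build tool, noting whether it is on the PATH.
check_tools() {
  for tool in java javac git mvn docker; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
    fi
  done
}
check_tools
```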

Prerequisite: create and configure a Twitter application

To create a Twitter application, navigate to the Twitter application management site and press the button marked “Create New App.” It will be either on the upper right or in the middle of your browser window, depending on whether you have created a Twitter app before.

You’ll be asked to fill out several fields, some of which are required. Even though the form indicates that a website is required, you can use a localhost address.

Once you have created the application, you should be redirected to the application configuration screen. Underneath your application name is a row of menu items. Click on the one that says “Keys and Access Tokens.”

At the bottom of this page is a button marked, “Create my access token.” Press it. There should now be a number of fields in your browser window. You only need to be concerned with four of them:

  1. Consumer Key (in the Application Settings section)
  2. Consumer Secret (in the Application Settings section)
  3. Access Token (in the Your Access Token section)
  4. Access Token Secret (in the Your Access Token section)

You can either copy them into a text file for use later, or leave this browser window open until later in the tutorial when you need the values.

Get the code and compile


The code from this tutorial can be found on GitHub.


$ git clone
$ cd ingest-spark-kafka
$ mvn clean package

There are two files that will be important for the rest of this tutorial. The first can be found at:


It contains stubs that you’ll be filling in later on. The other file to be aware of is:


It contains the final working version of the code that you should end up with if you work all the way through the tutorial.

Validate Twitter settings

The first thing to do is ensure you have a proper environment that can connect to the Twitter API. Copy the four values from your Twitter application settings into their respective places in ingest-spark-kafka/

Next, compile and execute TwitterIngestTutorial. You can run it using your IDE or with maven. To execute it with maven, run the following command (demonstration):

$ mvn package exec:java 

The output should contain the text “All twitter variables are present” just before the line that says “[INFO] BUILD SUCCESS”.

Set up a Kafka container

Now that you know your Twitter setup is correct, let’s get a Kafka container up and running. If you have used Docker before, it’s probably a good idea to shut down all of your Docker containers before proceeding, to avoid contending for resources.

IMPORTANT: The Kafka client is picky about ensuring DNS and IP addresses match when connecting. In order to connect, you must create a hosts file entry that maps a host named “kafka” to the IP address 127.0.0.1 (a.k.a. “localhost”). In Linux/Unix environments, this file is found at /etc/hosts, while on Windows machines it will be at %SystemRoot%\System32\drivers\etc\hosts. Simply add the following line: 127.0.0.1 kafka

We will use a Kafka container created by Spotify, because it thoughtfully comes with Zookeeper built in. That’s one less technology you will need to become familiar with. Pull down and start the container this way (demonstration):

$ docker pull spotify/kafka
$ docker run -p 2181:2181 -p 9092:9092 --hostname kafka --name test_kafka --env ADVERTISED_PORT=9092 --env ADVERTISED_HOST=kafka spotify/kafka

Let’s analyze these commands. The first command is simple: it downloads the Docker image called “spotify/kafka” that has been uploaded to Docker Hub. The second command runs that image locally.

run means that the image will run now. -p 2181:2181 -p 9092:9092 maps two local ports to two ports on the container (local port on the left, container port on the right). Think of this the same way you do an SSH port-forward. --hostname kafka tells the container that its hostname will be kafka; it doesn’t mean anything outside of the container. --name test_kafka gives the container a name. This will be handy when we start and stop the container (as we will do momentarily). --env ADVERTISED_PORT=9092 --env ADVERTISED_HOST=kafka pass environment variables into the container runtime environment. These are the same as if you issued an export FOO=’bar’ command from a terminal inside the container. The final parameter is the name of the image to source the container from.

Run some Kafka commands

Next, we’ll stop the container and restart it in background mode. Press “CTRL+C” to stop the container. It should log something about waiting for ZooKeeper and Kafka (the processes!) to die. Restart the container using this command:

$ docker start test_kafka

It should execute quickly. Now we can connect to the container and get familiar with some Kafka commands. Log into the container this way:

$ docker exec -it test_kafka /bin/bash

This is invoking the Docker client and telling it you wish to connect an interactive TTY to the container called test_kafka and start a bash shell. You will know you are inside the container if the prompt changes to something that looks like this:


The first thing we will do is create a Kafka topic. A topic in Kafka is a way to group data in a single application. Other message systems call this a “queue”; it’s the same thing. From within the container TTY you just started, execute this command (remember to remove the prompt!):

root@kafka:/# /opt/kafka_2.11- --zookeeper kafka:2181 --create --topic test_topic --partitions 3 --replication-factor 1

There’s a lot going on here. Let’s unroll this command. is a script that wraps a java process that acts as a client to a Kafka client endpoint that deals with topics. --zookeeper kafka:2181 tells the client where to find ZooKeeper. Kafka uses ZooKeeper as a directory service to keep track of the status of Kafka cluster members. ZooKeeper also has roles in cluster housekeeping operations (leader election, synchronization, etc.). The client queries ZooKeeper for cluster information, so it can then contact Kafka nodes directly. --create indicates a particular operation that will create a topic. --topic names the topic. --partitions 3 indicates how many partitions to “break” this topic into. Partitions come into play when you want to achieve higher throughput. The best information I’ve seen about how to choose the number of partitions is a blog post from Kafka committer Jun Rao. We choose three here because it’s more than one. --replication-factor 1 describes how many redundant copies of your data will be made. In our case that value is just “1”, so there is no redundancy at all, as you’d expect with a cluster that has only one node.

You can verify that your topic was created by changing the command to --list:

root@kafka:/# /opt/kafka_2.11- --zookeeper kafka:2181 --list

Now that you have a topic, you can push a few messages to it. That involves a different Kafka script, the console producer.

root@kafka:/# /opt/kafka_2.11- --topic test_topic --broker-list kafka:9092

--broker-list kafka:9092 is analogous to specifying the ZooKeeper hosts, but specifies a Kafka cluster member to contact directly instead. (I have no idea why the topics script does not support this.)

You may think this command is hanging, but in reality it is in a loop waiting for you to send some messages to the topic. You do this by typing a message and pressing the return key. Go ahead and send a few messages to the topic. When you are finished, press CTRL-C.

We can now play these messages back using the console consumer. Use this command:

root@kafka:/# /opt/kafka_2.11- --topic test_topic --bootstrap-server kafka:9092 --from-beginning

It takes a few seconds to start up. After that, you should see as many messages as you produced earlier come across in the output. CTRL-C will get you out of this application. If you run it again you should see the same output. This is because --from-beginning tells Kafka that you want to start reading the topic from the beginning. If you leave that argument out, the consumer will only read new messages. Behind the scenes, Kafka will keep track of your consumer’s topic offset in ZooKeeper (if using groups), or you can do it yourself. You can experiment with this on your own by running the console consumer and console producer at the same time in different terminals.

If you stopped your consumer, please start it again. We will use it later on to validate that we are pushing Twitter messages to Kafka.

Create a Kafka client

Let’s go back to editing TwitterIngestTutorial again. It contains a stubbed-in case class called KafkaWriter. This step will complete it so that we can send messages to Kafka.

First, we’ll add a few configuration properties to the config variable. Add the following lines after the comment that says “add configuration settings here.

// add configuration settings here.
put("bootstrap.servers", brokers)
put("topic", topic)
put("key.serializer", classOf[StringSerializer])
put("value.serializer", classOf[StringSerializer])

You’ll recognize bootstrap.servers from the console consumer command you just used. It is the same thing, except in this case the value is supplied from a string in the constructor. topic should be self-explanatory at this point. The last two values, key.serializer and value.serializer, tell the client how to marshal data that gets sent to Kafka. In this case, we have indicated that it should expect strings.

Next, we’ll modify the write() method to actually send data to Kafka. Above the write() method you can see an instance of KafkaProducer is created. The write() method will use this producer to send data to Kafka. First we’ll create a ProducerRecord, then we’ll use the producer to send() it.

val record = new ProducerRecord[String, String](this.topic, key, data)
producer.send(record).get(5, TimeUnit.SECONDS)

As you see, the record instance is type parameterized to match the types expected by the serializers described by the key.serializer and value.serializer settings. Since producer.send() returns a java.util.concurrent.Future instance, we call get() on it and block until it returns.

This is an example of a synchronous client. Synchronous clients are easier to write, but often do not perform well in highly concurrent (multithreaded) settings. This client could be modified to be asynchronous by introducing a queue and executor pool to KafkaWriter. This is left as an exercise to the reader.
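The queue-and-worker idea behind that exercise can be sketched without any Kafka dependency. In this generic Java sketch, the names (AsyncWriter, write, sentCount) are ours, and the worker's "send" is a stand-in for producer.send():

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncWriterSketch {
    // A hypothetical asynchronous writer: callers enqueue records and return
    // immediately; a single worker thread drains the queue and "sends" them.
    static class AsyncWriter implements AutoCloseable {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private final ConcurrentLinkedQueue<String> sent = new ConcurrentLinkedQueue<>();
        private final AtomicInteger pending = new AtomicInteger();

        AsyncWriter() {
            worker.submit(() -> {
                try {
                    while (true) {
                        String record = queue.take();   // block until a record arrives
                        sent.add(record);               // stand-in for producer.send(record)
                        pending.decrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // lets shutdownNow() stop the loop
                }
            });
        }

        // Non-blocking for the caller, unlike the synchronous send().get() above.
        void write(String record) { pending.incrementAndGet(); queue.add(record); }

        @Override public void close() throws InterruptedException {
            while (pending.get() > 0) Thread.sleep(5);  // drain before shutting down
            worker.shutdownNow();
        }

        int sentCount() { return sent.size(); }
    }

    public static void main(String[] args) throws Exception {
        AsyncWriter writer = new AsyncWriter();
        for (int i = 0; i < 100; i++) writer.write("message-" + i);
        writer.close();
        System.out.println("sent " + writer.sentCount()); // prints "sent 100"
    }
}
```

A production version would also need error handling and back-pressure (a bounded queue), which the real Kafka producer provides internally via its buffer settings.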

The last step for the Kafka client is to finish the close() method by having it call producer.close().

Spark initialization

There are two steps to initialize Spark for streaming. First you create a SparkConf instance, then you set up a StreamingContext. Place this code after the Twitter validation check:

val conf = new SparkConf().setMaster("local[4]")
val ssc = new StreamingContext(conf, Seconds(5))

In a production scenario, many of the Spark configuration values come from the environment, rather than being specified in the code. local[4] tells Spark to use four local worker threads for parallelism. The Seconds parameter in the StreamingContext constructor indicates that our “microbatches” will be five seconds wide.

Input stream

Now we’ll create an input stream to process. The TwitterUtils object abstracts away the Twitter API and gives us a nice DStream interface to data. Essentially, it polls the Twitter API for new events and keeps track of which events have already been processed.

If your job were to create a stream interface into a legacy API in your enterprise, the TwitterUtils class would serve as a good example of how to do it. One important thing to keep in mind with this example is that stream ingestion from Twitter happens in a single thread, and could become a bottleneck and a single point of failure in a production scenario. Concurrently consuming an unpartitioned stream is one of those difficult problems in computer science.

The next few lines of code create the input stream, then repartition it three ways and apply a mapping function so that we are dealing with strings and not Twitter API objects. As a result, the stream will be typed as DStream[(Long, String)].

val stream = TwitterUtils.createStream(ssc, twitterAuth = None,
    filters = Seq("#nba", "#nfl", "nba", "nfl"))
  .repartition(3)
  .map(tweet => (tweet.getId, tweet.getText))

The filters in this case limit us to tweets related to a few sports terms. You can substitute other terms here or pass in an empty Seq to receive the whole data stream.

Stream operations

Once we have a reference to the stream, we can perform operations on it. It is important to note the conceptual shift happening in this code: while it appears to all live within a single class (indeed, a single file), you are writing code that can potentially be shipped to and run on many nodes.

Spark does an okay job of keeping you aware of this. If you ever see a runtime error complaining about a class that is not Serializable, that is usually an indication that you either forgot to mark an intended class as Serializable, or (more likely) you’ve mistakenly instantiated something in the wrong closure—try to push it down a little further. StackOverflow has a wealth of information on this topic.
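The failure mode described above comes straight from JVM serialization, so it can be reproduced without Spark. In this sketch (class names are ours), a task captures a non-serializable resource as a field, exactly as a closure would, and Java serialization rejects it:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A stand-in for something expensive that cannot cross the wire.
class Connection  // note: does NOT extend Serializable

// A task that captures the connection as a field, the way a closure would.
class BadTask(conn: Connection) extends (String => String) with Serializable {
  def apply(s: String): String = s + conn.hashCode
}

// Attempt Java serialization; return the error message on failure.
def trySerialize(obj: AnyRef): Option[String] =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    None
  } catch { case e: NotSerializableException => Some(e.getMessage) }

val error = trySerialize(new BadTask(new Connection))
// error is defined: the captured Connection is what serialization rejects
```

The fix Spark users reach for is the one mentioned in the text: create the resource inside the closure that runs on the executor, so it is never captured and shipped.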

In order to perform concurrent operations on our stream, we will decompose it into constituent RDD instances and process each individually in the publishTweets() method.

stream.foreachRDD(publishTweets _)

Finally, we’ll kick things off by starting the StreamingContext and telling it to hang around:

ssc.start()
ssc.awaitTermination()
If you run this code, you should see log messages indicating that Spark is starting up and processing the stream. Most importantly, you should verify that you see the log message from publishTweets() every five seconds or so.

RDD operations

We repartitioned the input stream earlier so that we could process chunks of it in parallel at this point. Add the following code to publishTweets(), then run the code.

tweets.foreachPartition { partition =>
  logger.info(s"PARTITION SIZE=${partition.size}")
}

You’ll want to note two things. First, the PARTITION SIZE=X messages appear almost simultaneously. Second, and what’s more interesting, is that they are all running on different threads, indicated by the thread=XXX preamble to the logging messages. Were you running this on a cluster, those messages would likely be output not just on different threads, but on entirely different machines.
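That multi-threaded behavior can be mimicked locally with a plain thread pool, no Spark required. This is our own illustration of what foreachPartition is doing, with three hard-coded "partitions" each handled by a pool thread:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Three "partitions" of data, as the repartition(3) above would produce.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9))
val log = new ConcurrentLinkedQueue[String]()
val pool = Executors.newFixedThreadPool(3)

// Each partition is processed on its own pool thread, so the log lines
// carry different thread names, just like the Spark output described above.
partitions.foreach { partition =>
  pool.submit(new Runnable {
    def run(): Unit =
      log.add(s"thread=${Thread.currentThread().getName} PARTITION SIZE=${partition.size}")
  })
}
pool.shutdown()
pool.awaitTermination(5, TimeUnit.SECONDS)
```

On a real cluster the unit of parallelism is the same (one task per partition), but the tasks land on different machines rather than just different threads.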

Now replace that code with this:

tweets.foreachPartition { partition =>
  val output = KafkaWriter("kafka:9092", "test_topic")
  partition.foreach { record =>
    output.write(record._1.toString, record._2)
  }
  output.close()
}

Do you see how we instantiate each KafkaWriter instance inside the closure that works on the partition? That is to avoid the class serialization problems mentioned earlier.

If you run this code, you should see a lot of output coming across the Kafka console consumer you left running.

Wrapping up

That’s it! Hopefully at this point, you have become familiar with simple Kafka operations and commands and even learned a little bit about how containers can make development easier. Then you learned some simple techniques for handling streaming data in Spark.

Moving on from here, the next step would be to become familiar with using Spark to ingest and process batch data (say from HDFS) or to continue along with Spark Streaming and learn how to ingest data from Kafka. Watch this space for future related posts!

The post Data Ingestion with Spark and Kafka appeared first on Silicon Valley Data Science.


August 12, 2017

Simplified Analytics

Why does Data Visualization matter now?

Data Visualization is not new; it has been around in various forms for thousands of years. Ancient Egyptians used symbolic paintings, drawn on walls & pottery, to tell timeless...


August 11, 2017

Revolution Analytics

Because it's Friday: The Shepard Tone

I haven't seen the Dunkirk movie yet, but the video below makes me want to see it soon. It turns out it contains an auditory illusion: the "Shepard Tone", which sounds like it's continually rising...

InData Labs

Keys to building robust data infrastructure for a data science project

Once you decide to leverage data science techniques in your company, it is time to make sure the data infrastructure is ready for it. Starting a data science project is a big investment, not just a financial one. It involves a lot of time, effort, and preparatory work. Data science is about leveraging a company’s data...

The post Keys to building robust data infrastructure for a data science project appeared first on InData Labs.


August 10, 2017

Silicon Valley Data Science

Machine Learning vs. Statistics

Throughout its history, Machine Learning (ML) has coexisted with Statistics uneasily, like an ex-boyfriend accidentally seated with the groom’s family at a wedding reception: both uncertain where to lead the conversation, but painfully aware of the potential for awkwardness. This is caused in part by the fact that Machine Learning has adopted many of Statistics’ methods, but was never intended to replace statistics, or even to have a statistical basis originally. Nevertheless, Statisticians and ML practitioners have often ended up working together, or working on similar tasks, and wondering what each was about. The question, “What’s the difference between Machine Learning and Statistics?” has been asked now for decades.

Machine Learning is largely a hybrid field, taking its inspiration and techniques from all manner of sources. It has changed directions throughout its history and often seemed like an enigma to those outside of it.1 Since Statistics is better understood as a field, and ML seems to overlap with it, the question of the relationship between the two arises frequently. Many answers have been given, ranging from the neutral or dismissive:

  • “Machine learning is essentially a form of applied statistics”
  • “Machine learning is glorified statistics”
  • “Machine learning is statistics scaled up to big data”
  • “The short answer is that there is no difference”

to the questionable or disparaging:

  • “In Statistics the loss function is pre-defined and wired to the type of method you are running. In machine learning, you will most likely write a custom program for a unique loss function specific to your problem.”
  • “Machine learning is for Computer Science majors who couldn’t pass a Statistics course.”
  • “Machine learning is Statistics minus any checking of models and assumptions.”
  • “I don’t know what Machine Learning will look like in ten years, but whatever it is I’m sure Statisticians will be whining that they did it earlier and better.”

The question has been asked—and continues to be asked regularly—on Quora, StackExchange, LinkedIn, KDNuggets, and other social sites. Worse, there are questions of which field “owns” which techniques [“Is logistic regression a statistical technique or a machine learning one? What if it’s implemented in Spark?”, “Is Regression Analysis Really Machine Learning?” (Mayo, see References)]. We have seen many answers that we regard as misguided, irrelevant, confusing, or just simply wrong.

We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years, observing each other’s approaches to analysis and problem solving of data-intensive projects. We have spent hours trying to understand the thought processes and discussing the differences. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.

The difference, as we see it, is not one of algorithms or practices but of goals and strategies. Neither field is a subset of the other, and neither lays exclusive claim to a technique. They are like two pairs of old men sitting in a park playing two different board games. Both games use the same type of board and the same set of pieces, but each plays by different rules and has a different goal because the games are fundamentally different. Each pair looks at the other’s board with bemusement and thinks they’re not very good at the game.

The purpose of this blog post is to explain the two games being played.

Statistics

Both Statistics and Machine Learning create models from data, but for different purposes. Statisticians are heavily focused on the use of a special type of metric called a statistic. These statistics provide a form of data reduction where raw data is converted into a smaller number of statistics. Two common examples of such statistics are the mean and standard deviation. Statisticians use these statistics for several different purposes. One common way of dividing the field is into the areas of descriptive and inferential statistics.

Descriptive statistics deals with describing the structure of the raw data, generally through the use of visualizations and statistics. These descriptive statistics provide a much simpler way of understanding what can be very complex data. As an example, there are many companies on the various stock exchanges. It can be very difficult to look at the barrage of numbers and understand what is happening in the market. For this reason, you will see commentators talk about how a specific index is up or down, or what some percentage of the companies gained or lost value in the day.
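As a toy illustration of that data reduction, a column of raw numbers can be collapsed into just two statistics (the numbers here are made up for the example):

```scala
// Reduce a column of raw closing prices to two summary numbers.
val prices = Seq(10.0, 12.0, 11.0, 13.0, 14.0)

val mean = prices.sum / prices.size
val variance = prices.map(p => math.pow(p - mean, 2)).sum / (prices.size - 1)
val stdDev = math.sqrt(variance)
// Five observations summarized as a mean of 12.0 and a sample standard deviation.
```

This is exactly what an index does at scale: thousands of prices reduced to one number a commentator can quote.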

Inferential statistics deals with making statements about data. Though some of the original work dates back to the 18th and 19th centuries, the field really came into its own with the pioneering work of Karl Pearson, R.A. Fisher, and others at the turn of the 20th century. Inferential statistics tries to address questions like:

  • Do people in tornado shelters have a higher survival rate than people who hide under bridges?
  • Given a sample of the whole population, what is the estimated size of the population?
  • In a given year, how many people are likely to need medical treatment in the city of Bentonville?
  • How much money should you have in your bank account to be able to cover your monthly expenses 99 out of 100 times?
  • How many people will show up at the local grocery store tomorrow?

The questions deal with both estimation and prediction. If we had complete perfect information, it might be possible to calculate these values exactly. But in the real world, there is always uncertainty. This means that any claim you make has a chance of being wrong—and for some types of claims, it is almost certain you will be slightly wrong. For example, if you are asked to estimate the exact temperature outside your house, and you estimate the value as 29.921730971, it is pretty unlikely that you are exactly correct. And even if you turn out to get it right on the nose, ten seconds later the temperature is likely to be somewhat different.

Inferential statistics tries to deal with this problem. In the absolute best case, the claims made by a statistician will be wrong at least some portion of the time. And unfortunately, it is impossible to decrease the rate of false positives without increasing the rate of false negatives given the same data. The more evidence you demand before claiming that a change is happening, the more likely it is that changes that are happening fail to meet the standard of evidence you require.

Since decisions still have to be made, statistics provides a framework for making better decisions. To do this, statisticians need to be able to assess the probabilities associated with various outcomes. And to do that, statisticians use models. In statistics, the goal of modeling is to approximate and then understand the data-generating process, in order to answer the question you actually care about.

The models provide the mathematical framework needed to make estimations and predictions. In practice, a statistician has to make trade-offs between using models with strong assumptions or weak assumptions. Using strong assumptions generally means you can reduce the variance of your estimator (a good thing) at the cost of risking more model bias (a bad thing), and vice versa. The problem is that the statistician will have to decide which approach to use without having certainty about which approach is best.

Since statisticians are required to draw formal conclusions, the goal is to prepare every statistical analysis as if you were going to be an expert witness at a trial.

This is an aspirational goal: in practice, statisticians often perform simple analyses that are not intended to stand up in a court of law. But the basic idea is sound. A statistician should perform an analysis with the expectation that it will be challenged, so each choice made in the analysis must be defensible.

It is important to understand the implications of this. The analysis is the final product. Ideally, every step should be documented and supported, including data cleaning steps and human observations leading to a model selection. Each assumption of the model should be listed and checked, and every diagnostic test run and its results reported. The statistician’s analysis, in effect, guarantees that the model is an appropriate fit for the data under a specified set of conditions.

In conclusion, the Statistician is concerned primarily with model validity, accurate estimation of model parameters, and inference from the model. However, prediction of unseen data points, a major concern of Machine Learning, is less of a concern to the statistician. Statisticians have the techniques to do prediction, but these are just special cases of inference in general.

Machine Learning

Machine Learning has had many twists and turns in its history. Originally it was part of AI and was very aligned with it, concerned with all the ways in which human intelligent behavior could be learned. In the last few decades, as with much of AI, it has shifted to an engineering/performance approach, in which the goal is to achieve a fairly specific task with high performance. In Machine Learning, the predominant task is predictive modeling: the creation of models for the purpose of predicting labels of new examples. We put aside other concerns of Machine Learning for the moment, as predictive analytics is the dominant sub-field and the one with which Statistics so often is compared.

We briefly define the process in order to be clear. In predictive analytics, the ML algorithm is given a set of historical labeled examples. Each example has a label, which, depending on the problem type, can be either the name of a class (classification) or a numeric value (regression). The algorithm creates a model whose purpose is prediction. Specifically, the learning algorithm analyzes the data examples and creates a procedure that, given a new unseen example, can accurately predict its label. Some portion of the data is set aside (the holdout set) and used to validate the model. Alternatively, a method like the bootstrap or cross-validation can be employed to reuse the data in a principled way.
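The holdout protocol just described can be sketched with the simplest possible "model": predict whichever class was most common in the training portion. All names and the toy data are ours; the point is the protocol, not the model:

```scala
// A labeled example: some feature values plus a class label.
case class Example(features: Seq[Double], label: String)

// The trivial "learning algorithm": memorize the majority class.
def majorityClass(train: Seq[Example]): String =
  train.groupBy(_.label).maxBy(_._2.size)._1

// Set aside a holdout, "fit" on the rest, score on the holdout.
def holdoutAccuracy(data: Seq[Example], trainFraction: Double): Double = {
  val cut = (data.size * trainFraction).toInt
  val (train, holdout) = data.splitAt(cut)
  val predicted = majorityClass(train)
  holdout.count(_.label == predicted).toDouble / holdout.size
}

val data = Seq(
  Example(Seq(1.0), "spam"), Example(Seq(2.0), "spam"), Example(Seq(3.0), "ham"),
  Example(Seq(4.0), "spam"), Example(Seq(5.0), "spam"), Example(Seq(6.0), "ham")
)
val acc = holdoutAccuracy(data, 0.5)
```

Any real classifier slots into the same protocol: only majorityClass changes, while the train/holdout split and the accuracy measurement stay the same.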

Predictive modeling can have great value this way. A model with good performance characteristics can predict which customers are valuable, which transactions are fraudulent, which customers are good loan risks, when a device is about to fail, whether a patient has cancer, and so on. This all assumes that the future will be similar to the past, and that historical patterns that occurred frequently enough will occur again. This presumes some degree of causality, of course, and such causal assumptions must be validated.

In contrast to Statistics, note that the goal here is to generate the best prediction. The ML practitioner usually does some exploratory data analysis, but only to prepare the data and to guide the choice of features and a model family. The model does not represent a belief about or a commitment to the data generation process. Its purpose is purely functional. No ML practitioner would be prepared to testify to the “validity” of a model; this has no meaning in Machine Learning, since the model is really only instrumental to its performance.2 The motto of Machine Learning may as well be: The proof of the model is in the test set.

This approach has a number of important implications that distance ML from Statistics.

  1. ML practitioners are freed from worrying about model assumptions or diagnostics. Model assumptions are only a problem if they cause bad predictions. Of course, practitioners often perform standard exploratory data analysis (EDA) to guide selection of a model type. But since test set performance is the ultimate arbiter of model quality, the practitioner can usually relegate assumption testing to model evaluation.
  2. Perhaps more importantly, ML practitioners are freed from worrying about difficult cases where assumptions are violated, yet the model may work anyway. Such cases are not uncommon. For example, the theory behind the Naive Bayes classifier assumes attribute independence, but in practice it performs well in many domains containing dependent attributes (DomingosPazzani, see References). Similarly, Logistic Regression assumes non-collinear predictors yet often tolerates collinearity. Techniques that assume Gaussian distributions often work when the distribution is only Gaussian-ish.
  3. Unlike the Statistician, the ML practitioner assumes the samples are drawn independently and identically distributed (IID) from a static population, and are representative of that population. If the population changes such that the sample is no longer representative, all bets are off. In other words, the test set is a random sample from the population of interest. If the population is subject to change (called concept drift in ML), some techniques can be brought into play to test and adjust for this, but by default the ML practitioner is not responsible if the sample becomes unrepresentative.
  4. Very often, the goal of predictive analytics is ultimately to deploy the prediction method so the decision is automated. It becomes part of a pipeline in which it consumes some data and emits decisions. Thus the data scientist has to keep in mind pragmatic computational concerns: how will this be implemented? how fast does it have to be? where does the model get its data and what does it do with the final decision? Such computational concerns are usually foreign to Statisticians.

To a Statistician, Machine Learning may look like an engineering discipline [Note from Drew: You bet it does! But that is not a bad thing], rather than science—and to an extent this is true. Because ML practitioners do not have to justify model choice or test assumptions, they are free to choose from among a much larger set of models. In essence, all ML techniques employ a single diagnostic test: the prediction performance on a holdout set. And because Machine Learning often deals with large data sets, the ML practitioner can choose non-parametric models that typically require a great deal more data than parametric models.

As a typical example, consider random forests and boosted decision trees. The theory of how these work is well known and understood. Both are non-parametric techniques that require a relatively large number of examples to fit. Neither has diagnostic tests nor assumptions about when they can and cannot be used. Both are “black box” models that produce nearly unintelligible classifiers. For these reasons, a Statistician would be reluctant to choose them. Yet they are surprisingly—almost amazingly—successful at prediction problems. They have scored highly on many Kaggle competitions, and are standard go-to models for participants to use.


There are great areas of Statistics and Machine Learning we have said nothing about, such as clustering, association rules, feature selection, evaluation methodologies, etc. The two fields don’t always see eye-to-eye on these, but we are aware of little confusion on their fundamental use. We concentrate here on predictive modeling, which seems to be the main point of friction between the fields.

In summary, both Statistics and Machine Learning contribute to Data Science but they have different goals and make different contributions. Though the methods and reasoning may overlap, the purposes rarely do. Calling Machine Learning “applied Statistics” is misleading, and does a disservice to both fields.

Much has been made of these differences. Machine learning is generally taught as part of the computer science curriculum, and statistics is taught either by a dedicated department or as part of the math department. Computer scientists are taught to design real-world algorithms that will be used as part of software packages, while statisticians are trained to provide the mathematical foundation for scientific research. In many cases, both fields use different terminology when referring to exactly the same thing.3 Putting the two groups together into a common data science team (while often adding individuals trained in other scientific fields) can create a very interesting team dynamic.

However, the two different approaches share very important similarities. Fundamentally, both ML and Statistics work with data to solve problems. In many of the dialogues we have had over the past few years, it is obvious that we are thinking about many of the same basic issues. Machine learning may emphasize prediction, and statistics may focus more on estimation and inference, but both focus on using mathematical techniques to answer questions. Perhaps more importantly, the common dialogue can bring improvements in both fields. For example, topics such as regularization and resampling are of relevance to both types of problems, and both fields have contributed to improvements.


NOTE: We were made aware, after writing this blog post, that some of our points are made in Leo Breiman’s 2001 journal article “Statistical Modeling: The Two Cultures” (Breiman, see References). We don’t claim to present or summarize his point of view; we wanted our posting to be fairly short and a quick read of our own points of view. Since Breiman’s article is more elaborate than this essay, and his work is always worth reading, we refer the reader to it.



  • Breiman: Statistical Modeling: The Two Cultures. Breiman, L. Statistical Science (2001) 16: 3, 199–231.
  • DomingosPazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Domingos, P. & Pazzani, M. Machine Learning (1997) 29: 103. doi:10.1023/A:1007413511361
  • Freitas: Comprehensible Classification Models—a position paper. ACM SIGKDD Explorations Newsletter. 15: 1, June 2013. pp. 1–10.
  • Mayo: Is Regression Analysis Really a Machine Learning Tool. KDnuggets, June 17, 2017. Matthew Mayo. Online here.
  • Shmueli: To Explain or to Predict? Statistical Science. 25: 3, 2010. pp. 289–310. DOI: 10.1214/10-STS330


1. Where machine learning has been and the path it has taken makes for an interesting story, but one which is longer than this blog posting. Suffice it to say that Machine Learning is a lot like a war orphan: it has sketchy lineage, it has been through a lot, and has seen a lot, not all of which it wants to remember.
2. This is not strictly true. We are exaggerating to make a point. Some ML practitioners care about model intelligibility. Both Freitas and Shmueli (see References) have written about the importance of intelligible data models and descriptive data analysis. But these papers simply reinforce the original point: the community must be reminded that intelligibility is desirable because it is so often forgotten.
3. Robert Tibshirani’s glossary [PDF] provides some guidance here.

The post Machine Learning vs. Statistics appeared first on Silicon Valley Data Science.

Curt Monash

Notes on data security

1. In June I wrote about burgeoning interest in data security. I’d now like to add: Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance. In...


Mario Meir-Huber

Locksmith Renton

Have your plans been derailed by a garage lock that suddenly failed? Locked out by a jammed car hood? There is no reason to panic. When a broken lock disrupts your daily routine, call our professionals in Renton, Washington.

Our team is large, and every employee is a professional whose qualifications are beyond doubt. We have earned a strong reputation through professional service, and our affordable prices mean clients with different budgets can turn to us. We never try to profit from a client’s problem: we don’t exaggerate the issue, we simply tell it as it is.
Clients consistently praise our fast response. No lock is too difficult for us; we quickly diagnose the problem and find a solution. We never turn a client away, even one located far from us, and we arrive at the scene quickly. Every job ends with professional advice on how to operate the lock correctly so that it doesn’t fail its owner again.
Each member of our staff is a master of a particular specialty. Our employees fall into three categories:
– residential locksmiths, who handle everyday problems such as broken interior door locks and failed key mechanisms, and can repair garage locks of any complexity.
– commercial locksmiths, who restore faulty office locks of all types and complexity, including safes that have locked automatically after an incorrect code entry.
– car locksmiths, who provide mobile locksmith service anywhere in Renton and can restore any type of automotive lock. Even the most high-tech modern key mechanism holds no fear for them, as they upgrade their skills every year through advanced training courses.
Call us if your own attempts haven’t produced results. We will fix the breakage quickly, and you will be pleased by how fast our masters arrive, by their courtesy, and by their expertise. Our Renton locksmith company keeps an extensive set of equipment on hand for every master, and we can also quickly make a duplicate of any type of key, cheaply and to a high standard. Every service we provide is backed by a guarantee. Get in touch, and we will prove to you that true professionals still exist.

The post Locksmith Renton appeared first on Techblogger.


August 09, 2017

Revolution Analytics

dplyrXdf 0.10.0 beta prerelease

I’m happy to announce that version 0.10.0 beta of the dplyrXdf package is now available. You can get it from Github: install_github("RevolutionAnalytics/dplyrXdf", build_vignettes=FALSE) This is a...



SoftBase Announces General Availability for TestBase Release 6.1

TestBase 6.1 further improves performance and Db2 V12 support. New features in TestBase 6.1 include: pre-compilation with Db2 V11 and Db2 V12; TestBase divided into 7 different components that can...



VIDEO: How to Install an Ubuntu Virtual Machine on Your Desktop

Safety and security are crucial in Deep Web and Dark Web search. Unfortunately, the probability of encountering a virus or something else harmful increases while working in this part of the web. Installing a virtual machine such as Ubuntu on your desktop allows you to work in the Deep Web and Dark Web without the […] The post VIDEO: How to Install an Ubuntu Virtual Machine on Your Desktop appeared first on BrightPlanet.

Read more »

Revolution Analytics

Tutorial: Deep Learning with R on Azure with Keras and CNTK

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) Microsoft's Cognitive Toolkit (better known as CNTK) is a commercial-grade and open-source framework...


August 08, 2017

Revolution Analytics

Tutorial: Publish an R function as a SQL Server stored procedure with the sqlrutils package

In SQL Server 2016 and later, you can publish an R function to the database as a stored procedure. This makes it possible to run your R function on the SQL Server itself, which makes the power of...

Silicon Valley Data Science

The First 100 Days: FAQs

Editor’s Note: In a recent post, Sanjay Mathur and Scott Kurth wrote that CDOs have 100 days to get the ball rolling downhill on digital transformation. CDOs know they need to move beyond the traditional focus of data governance and work to create new data products and value for the business, but there are some very real challenges to doing this efficiently. The following are some of the frequently asked questions we’ve received on this topic.

1. What are some techniques the CDO can use for building strong partnerships and/or achieving consensus efficiently? Just getting all the stakeholders in a room and hashing out the different priorities can take weeks or even months, let alone getting everyone on the same page. And building relationships as a new executive often takes time, too.

There are two aspects to these relationships that are important: building trust with individuals and gathering consensus. As you’re building trust in these 1-to-1 relationships, the most important thing is getting meaningful conversations started. Make them meaningful by offering the other person something of value. Share your thoughts, early plans, and how you believe those plans will generate impact, but more than anything else—listen. Understand their metrics and where they are focusing their own initiatives. What matters to them?

Gaining consensus in organizations is usually harder. If you can, try to gather all your stakeholders in a room together. You’ll get a better response if you can articulate how critical this is to not just your own success, but the success of the company. Ask your CEO to make it a priority for all. It can be difficult to schedule, but there’s no substitute for those face-to-face conversations (or debates) to reach a shared understanding of what matters most. This is a technique that we at SVDS use in nearly every one of our data strategy engagements. Achieving shared understanding, and shared goals, is a great way to help you manage the inevitable trade-offs that will emerge later as you’re executing your strategy.

2. How do I build data literacy in my organization when every department uses a different set of tools (for example, one team prefers Python while another wants to look at data in Excel)? Do I have to centralize all teams on the same toolset?

A lot of CDOs that I talk to are grappling with this question. As you’re thinking about helping teams to work effectively, a single toolset shouldn’t necessarily be the goal, but smart rationalization and supporting those tools from a common platform should be. You’re always going to have users with different needs and different levels of sophistication. Forcing everyone onto a single tool risks alienating certain groups—or worse, sapping their productivity.

Smart rationalization is a worthwhile pursuit though. No one benefits from the wild west approach. Emphasize and communicate the value of sharing data (and analyses) among and across data scientists and teams. Reuse and repeatability benefit everyone, and help you build sticky communities that increase knowledge and capability, too. Done right, the teams will see the benefit in it.

However—data literacy should be about much more than getting all your data scientists to use Python (or any other tool). It’s about ensuring that business leaders understand the data available to them and the ways that you can help them make better decisions. It’s also about ensuring that they ask others for data to support new strategies, new directions, and new choices. And real data literacy means that they’ll want to use data to measure how those decisions pan out, too! Data literacy shouldn’t be limited to only your engineering or analytics teams. It is critical that the mindset span the organization.

3. How can I be an efficient change agent when my colleagues’ incentives are aligned differently? My CIO is measured on efficiency so they’re not interested in spinning up huge new data clusters, and department leaders aren’t currently incentivized to share data with each other.

First, talk to leadership. See whether you can get metrics better aligned. It’s fine to have metrics that are complementary or ones that—at first glance—appear to be tangential. Conflicting metrics, though, need to get resolved. If the CIO is still measured on outdated metrics that treat data as a cost center, you’ll have a tough time swimming upstream.

Second, show them how data impacts their metrics. CIO metrics focused on efficiency aren’t automatically bad; many CDOs have savings targets, too. You can work together to consolidate data platforms, find ways to migrate workloads to lower-cost infrastructure (e.g., data warehouse rightsizing or data lakes side-by-side with a distributed data platform), or sunset legacy systems.

Finally, business unit heads will always be focused on their revenue targets, as they should be. Seek out opportunities where data and analytics unlock new value for their organization. Build the value propositions and start experimenting. If you help them unlock a new $100M value proposition in their business, they’ll do more than just open up to sharing data—they’ll be your biggest advocate.

4. There’s a lot of conversation about not getting stuck playing defense, and going beyond data governance. But I work in a highly-regulated industry and we have real data governance imperatives and challenges. How can I still make progress on being “offensive” and looking for new value streams when so much of my focus has to be on compliance?

So first off, governance is a very real concern, especially in heavily-regulated industries like financial services or healthcare. I don’t want to make light of it. But the reality is that you’re going to have to do both. The CDOs I’ve spoken to have repeatedly told me that their partners in the business don’t see value in policies. “The business doesn’t care.” This is where most governance initiatives fail.

To succeed, you’re going to have to make the business stand up and take notice. So, how are you going to do both? Delegate. You need lieutenants who can take the burden off of you. It requires attention, certainly, but if you let it consume all of your time, you’ll never achieve your primary mission: generating value for the company. At the recent MIT CDOIQ Symposium, Venkat Varadachary, CDO of American Express, estimated that he only spends 5–10% of his time on defensive activities—but his business wishes it were zero. That’s only possible when you have delegates who can multiply your impact.

5. Often with data, there is a delay before you can reap what you have sown, so to speak. Experiments must be run and it can often take time to discover where the real value lies. How can I expedite this process to have impact more quickly?

Your company wasn’t standing still before you took on the CDO role. Your organization has experiments running already. Uncover them, embrace them, and support them—or redirect them to something more fruitful. And don’t be afraid to kill projects. Failed experiments are successful projects, too, but only if you fail fast and learn from the experiment.

Do that in parallel with developing your plan. Your first 100 days shouldn’t be about finding all the answers. Instead, it’s more about identifying the right questions. What are the experiments or proofs of concept (POCs) that you should be running? Where are the fruitful targets? This is what ensures that your first 100 days is the warm-up that lets you accelerate out of the gate.

You may also be interested in our upcoming series of webinars called Data Dialogues, which will focus both on Data Strategy and Data in Practice. Register now to reserve your seat and take part in the live Q&A at each session.

The post The First 100 Days: FAQs appeared first on Silicon Valley Data Science.


August 07, 2017

Revolution Analytics

How to make best use of the byte compiler in R

Tomas Kalibera, the newest member of the R Core Team, has been working for the last several years with fellow Core Team member Luke Tierney implementing R's byte-code compiler and interpreter....


Mario Meir-Huber

Why are PVC windows in Minsk and Mogilev the best choice?

PVC windows in Minsk

First, it is worth understanding what plastic windows actually are. Such windows are also called reinforced: the frame is reinforced so that it acquires the necessary rigidity. The first component of a Euro-window is thermoplastic, a material derived from natural raw materials. The second component is a polymer material, which is synthetic and produced chemically.
In the 21st century a huge number of items are made of polymer materials: children's toys, tableware, computers, refrigerators, and much more. Although this material is present almost everywhere, it poses no particular danger.
Almost all plastic windows in the country are made by Russian companies, all of which hold the necessary certificates and comply with all applicable standards and norms.

So what are the main advantages of Euro-windows?
The first significant advantage is a long service life.
According to the media, Euro-windows can last quite a long time, around fifty years. Note that achieving such a long service life requires regular, proper care.

PVC windows in Mogilev

The second, no less significant advantage is resistance to the weather: no weather conditions can spoil these windows' appearance.
The third advantage of Euro-windows is their air-tightness, thermal insulation, and sound insulation.
Plastic windows handle these functions well: they shield residents from street noise and retain heat. It is safe to say that with the window closed, no street dirt will get into the room.

The fourth advantage is environmentally friendly production. Manufacturing Euro-windows does not pollute the environment, since plastic is easily and fully recyclable; as a result, its production is safe for the environment and causes no harm.
This is a clear advantage over wooden windows, whose manufacture consumes vast forest reserves.

The fifth advantage of these windows is their reliability and fire protection.
Flame retardants are added during production, which helps prevent ignition. Many people know that when windows shatter during an indoor fire, the inflow of air from outside fans the flames. Of course, this applies to the higher-quality glazing units that can withstand very high temperatures.

And the sixth advantage is the price. Compare the price of a modern wooden Euro-window with that of the most expensive plastic one, and the former is significantly higher. An ordinary wooden window and a plastic one cost roughly the same, but wooden windows fall well short of the plastic windows' advantages described above.

Thus, by choosing plastic windows you can ensure maximum comfort and protection from the elements.
Browse the options to buy PVC windows in Borisov at a good price!

The post Why are PVC windows in Minsk and Mogilev the best choice? appeared first on Techblogger.

Simplified Analytics

Do you want to hire a Data Scientist?

As mentioned by Tom Davenport a few years back, Data Scientist is still the hottest job of the century. Data scientists are those elite people who solve business problems by analyzing tons of data and...


August 04, 2017

Revolution Analytics

Because it's Friday: People remain awesome

The People are Awesome people are making the rounds again: the Best of 2017 so far video is popping up all over the place. But that made me realize I'd missed the 2016 video late last year, and if...


Revolution Analytics

Painting with Data

The accidental aRt tumblr (mentioned here a few years ago) continues to provide a steady stream of images that wouldn't look out of place in a modern art gallery, but which in fact are data...


The Apps on your Mobile that use Machine Learning

Seems like the term Machine Learning is popping up in mainstream media as the next big thing. The fact is, however, that Machine Learning went mainstream a long time ago. You don’t think so? Check your mobile phone. Chances are you’ve been using and benefiting from Machine Learning all this time without even knowing it. 

In this blog post, I go through some of the many apps on your mobile phone that use Machine Learning to make recommendations, get you to your destination quickly and safely, improve your photos, tell you what song you’re listening to and more. You’ll see, Machine Learning is not so far away. It’s already in the palm of your hand.


August 03, 2017

Revolution Analytics

Text categorization with deep learning, in R

Given a short review of a product, like "I couldn't put it down!", can you predict what the product is? In that case it's pretty easy — it's for a book — but this general problem of text...
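The post itself tackles this with deep learning in R; as a minimal, language-agnostic flavor of the text-categorization task it describes, here is a toy word-overlap classifier in Python. The training snippets and category names are made up for illustration only:

```python
# A toy text-categorization sketch. The post itself uses deep learning in
# R; this tiny word-overlap classifier (with made-up training snippets)
# only illustrates the task of predicting a category from a short review.
TRAIN = {
    "book":  ["couldn't put it down", "a gripping read", "great author"],
    "phone": ["battery lasts all day", "great camera", "screen cracked"],
}

def predict(review, train=TRAIN):
    words = set(review.lower().split())
    def overlap(category):
        # Count training words shared with the review, across all snippets.
        return sum(len(words & set(doc.split())) for doc in train[category])
    return max(train, key=overlap)

print(predict("I couldn't put it down!"))  # -> book
```

A real model would learn weighted features (or embeddings) instead of raw overlaps, but the input/output shape of the problem is the same.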


August 02, 2017

Revolution Analytics

Applications in energy, retail and shipping

The Solutions section of the Cortana Intelligence Gallery provides more than two dozen working examples of applying machine learning, data science and artificial intelligence to real-world problems....



EVENT: An Engaging Opportunity to Learn About Deep Web Data Extraction

Content and data. What do these two words have in common? Maybe the more appropriate question to ask is, what don’t these two words have in common with each other? We know that there is plenty of content written on the web, including the Deep Web and Dark Web. We also know that where there […] The post EVENT: An Engaging Opportunity to Learn About Deep Web Data Extraction appeared first on BrightPlanet.

Read more »

Rob D Thomas

The Fortress Cloud

In 1066, William of Normandy assembled an army of over 7,000 men and a fleet of over 700 ships to defeat England's King Harold Godwinson and secure the English throne. King William, recognizing his...

Silicon Valley Data Science

From Defense to Offense: Shifting the CDO Mindset

A couple of weeks ago, I attended the MIT CDOIQ Symposium in Cambridge, MA. This annual gathering of Chief Data Officers (CDOs) and other data leaders from a variety of companies and industries is always one of the highlights of my conference calendar. It is a focused time of learning and sharing among a community that is pushing the envelope in consequential ways in the enterprise, the military, and higher education.

Throughout all the sessions this year, I noticed three key themes:

  • A shift from playing defense to playing offense
  • Data should appear on your company balance sheet
  • The metaphor of the car-racing pit crew

In this post, I’ll summarize the discussion around each of these themes and share some of the highlights that have stuck with me.

From Defense to Offense

In our report on the landscape of the CDO role, I described how the first CDOs were minted to oversee data governance and compliance with regulatory laws in industries such as finance and healthcare, where the abundance of personally identifiable information (PII) makes data particularly sensitive. I called this “the stick,” comparing it to “the carrot” of the new revenue streams, efficiencies, and business opportunities becoming possible in the age of digital customer interactions.

This same idea of shifting perspective from stick to carrot is being expressed more and more widely now—and at the MIT CDOIQ Symposium, I heard it expressed repeatedly as a shift from playing defense to playing offense. Milind Kamkolkar, CDO of Sanofi, was one of the speakers to use this phrase. He sees his mission as empowering Sanofi to monetize data and insights at an industrial scale. Part of what that means is building the capability not only to investigate new opportunities, but also to move those investigation results from pilot projects into full-scale production. We have written about this before (borrowing language from Thomas C. Redman and Bill Sweeney) as moving from the Lab to the Factory.

In a panel session, however, Venkat Varadachary, CDO of American Express, pointed out that defense and offense don’t have to be entirely separate. “Regulators and internal auditors do ask good questions, and push our thinking in ways that transfer to our other efforts,” he said. Christina Clark, CDO of GE and another member of the panel session, agreed. “Fundamentals still matter,” she said, so defense and offense don’t have to exist in opposition to one another.

Still, the trend I saw over the course of the Symposium represented a growing awareness among CDOs and other data leaders that—as important as governance and compliance are—the potential of data to impact the bottom line of a business is huge, and a significant part of their role needs to be focused on driving new value for their companies with data.

Data on the Balance Sheet

This brings me to the second big trend I saw: the need to represent data as a company asset on the balance sheet. If you’re going to use data to drive value for the business, then it needs to be accounted for. Of course, how to do this data valuation is still a very open question. But the need to do it is gaining traction with data leaders across multiple sectors.

In a panel on CDOs in the Boardroom, Joan Del Bianco, SVP and Head of the U.S. Office of the Chief Data Officer at TD Bank, and Brandon Thomas, CDO of Zions Bancorporation, discussed how they’re reporting to their respective stakeholders. Thomas in particular described how he had recently been given a brief timeslot to present at the regular board meeting on his activities for the quarter. In his allotted minutes, he focused on traditional metrics. “Is that your data strategy?” he was asked. Of course a few metrics didn’t begin to cover his actual strategy, but the fact that his board was aware and curious for details is an interesting signal.

Part of the need to represent data on the balance sheet is so companies can make informed decisions about how to invest in its cultivation and use. As several attendees pointed out in the event’s roundtable discussions, there is often a delay between data exploration and experimentation (the Lab) and shifting that into production mode (the Factory), so helping stakeholders understand longer-term ROI, and why those investments are still worth making, is easier when data can be counted as a durable asset.

Why You Need a Racing Pit Crew

Sports metaphors are all over the business world, but I’m used to encountering them in the form of “at-bats” or “slam dunks.” I was intrigued to notice this same metaphor of a pit crew from the slightly more obscure sport of car-racing used by at least two speakers. The metaphor points to the game-winning strategy of building a holistic and highly-coordinated team. For CDOs, this means that it’s not enough to hire a bunch of really smart data scientists—the best work happens when each team member has a specialized role and operates in a culture of cooperation.

Kamkolkar, CDO of Sanofi, spoke about how he has embraced the holistic team concept by hiring not only data scientists and engineers but also journalists, designers, and even an anthropologist. They contribute important skills and expertise to the team in order to help create “delightful information engagement experiences”—views into the data that help promote data literacy and contribute to the culture change necessary for any organization to become more data-driven. These team members also play a crucial role in ensuring that artificial intelligence and machine learning are used in ethical ways.

The racecar driver is critical, and often receives the spotlight, but the pit crew provides the support that makes it all possible. Hiring a team of individuals with specialized roles who can work together in an efficient, coordinated way toward a common goal is what ultimately produces a win. Said Kamkolkar, “Innovation shouldn’t live in one team or title, but should be built into everything.”


There are still lots of ways to be a CDO, and the position looks different from organization to organization. There is still no standard reporting structure, for instance. And not all CDOs even hold that particular title—there are many data leaders who perform CDO-like functions under a different name. However, the most interesting part of observing these three particular trends at this year’s MIT CDOIQ Symposium was the emergence of consensus they represent.

Regardless of industry, size, maturity, or a variety of other organizational factors, CDOs are converging around some core ideas of what their role is—and what that means for their mission. These are the importance of data for adding new business value, the importance of reporting that value accurately to their respective boards and other stakeholders, and the importance of forming diverse and holistic teams in order to maximize and execute on delivery of that value.

The post From Defense to Offense: Shifting the CDO Mindset appeared first on Silicon Valley Data Science.


August 01, 2017

Revolution Analytics

A modern database interface for R

At the useR! conference last month, Jim Hester gave a talk about two packages that provide a modern database interface for R. Those packages are the odbc package (developed by Jim and other members...
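For readers outside the R ecosystem, the general pattern of a standard database interface can be sketched with Python's DB-API and an in-memory SQLite database; this is an analogy to the DBI/odbc approach the talk covers, not those packages' own API:

```python
import sqlite3

# Generic database-interface pattern shown with Python's DB-API and an
# in-memory SQLite database as a stand-in for DBI/odbc connections in R.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [("AA", 5.0), ("AA", 15.0), ("UA", 2.0)])
# A parameterized, set-based query, comparable to DBI::dbGetQuery().
rows = conn.execute(
    "SELECT carrier, AVG(delay) FROM flights GROUP BY carrier ORDER BY carrier"
).fetchall()
print(rows)  # -> [('AA', 10.0), ('UA', 2.0)]
conn.close()
```

The point of such interfaces is that the query code stays the same while the driver behind the connection object changes.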


July 31, 2017

Revolution Analytics

How to use H2O with R on HDInsight

H2O is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft's fully-managed Apache...

InData Labs

Data Science experts on how business can benefit from big data

InData Labs data science experts Denis Pirshtuk and Denis Dus talked to Bel.Biz about how InData Labs solves business problems based on our own algorithms, using advanced technologies in the fields of big data and data science. Denis Pirshtuk and Denis Dus spoke about the prospects for the world of big data and shared their advice...

The post Data Science experts on how business can benefit from big data appeared first on InData Labs.


July 30, 2017

Simplified Analytics

How Customer Analytics has evolved...

Customer analytics has been one of the hottest buzzwords for years. A few years back it was solely the marketing department’s monopoly, carried out with limited volumes of customer data, which was stored in...


July 28, 2017

Silicon Valley Data Science

Make the Most of Your Data

I recently read an article by Jeff Haden, A Brutal Truth About Success That Few People are Willing to Admit, in which he wrote: “Many people feel that buying something new—new technology, new devices, new systems, new ‘stuff’—is the key to success. They’re convinced that what they have is not enough. What holds them back are the things they don’t have. But, that is rarely true. Most people can’t work to the limits of what they already have, much less approach the limits of whatever is newer, or faster, or better.”

This sentiment runs parallel to our view at Silicon Valley Data Science as to how to get to outcomes quickly. We don’t peddle any given partner’s newest, shiniest tool. We don’t peddle a product or a one-size-fits-all solution. Rather, we work with each client to look at their business objectives first. Then we build out a sound data strategy, making the most of the technology investments the client has already made. Our agile process means that we deliver results back to the bottom line quickly and efficiently, then make incremental improvements toward long-term goals. We work collaboratively with clients, resulting in bilateral learning in every project.

Because I run marketing and sales for SVDS, I get to see this play out repeatedly across a variety of companies. And I see the kinds of questions organizations grapple with in their quests for success. Do you try to improve on the status quo through small steps? Do you throw out the old and buy a new solution? Can you “do data science” to give you actionable insights? What is the right avenue that will lead you to success?

In order to help address these questions—and to spread the benefit of our cooperative learning even farther—SVDS is proud to introduce a new webinar series, Data Dialogues. This series will help you address these questions in a digestible format, and provide an opportunity for live Q&A with our leading experts. It is dedicated to imparting knowledge, spurring more questions, and helping our clients on their journey to being data-driven organizations. It will include two distinct tracks.

The Data Strategy track focuses on creating and continuously updating your data strategy. The first four sessions in that track will be:

  • 1A&B: What it Means to be Data-Driven
  • 2A: Getting Real World Results with Data Science
  • 3A: Simplifying the Analytics Ecosystem
  • 4A: The One Key Skill of the CDO

The Data Practice track focuses on modern techniques for efficient execution of your data strategy. The first four sessions in that track will be:

  • 1A&B: What it Means to be Data-Driven
  • 2B: Connecting Data Science to Business
  • 3B: Building Data Products
  • 4B: Improving User Engagement and Retention

What you do with your data is more important than what you do to your data. Making fuller use of the data you already have often allows you to derive the insights that can fundamentally change your business. And it’s certainly better than chasing the latest technology fad.

We look forward to learning more about what questions your organization is grappling with as you work to make better use of your data, and to starting these conversations with you. If you’d like to talk with us directly at any point, please don’t hesitate to contact us—we’d be delighted to hear from you.

The post Make the Most of Your Data appeared first on Silicon Valley Data Science.


July 27, 2017

Revolution Analytics

Because it's Friday: Rolling shutters, explained

I've written about the shutter effect a couple of times here on the blog, but this video from SmarterEveryDay provides the best explanation I've seen yet of why digital cameras can distort...


Revolution Analytics

Learn parallel programming in R with these exercises for "foreach"

The foreach package provides a simple looping construct for R: the foreach function, which you may be familiar with from other languages like JavaScript or C#. It's basically a function-based version...
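The function-based looping idea behind foreach can be sketched in Python with `concurrent.futures`; this is an analogy for readers outside R, not the foreach package's own API:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # The "loop body": a pure function applied to each element.
    return x * x

# Sequential, function-based loop (akin to foreach(...) %do% in R).
sequential = [square(x) for x in range(5)]

# Parallel version (akin to %dopar% with a registered parallel backend).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, range(5)))

print(sequential == parallel)  # -> True; both are [0, 1, 4, 9, 16]
```

Because the body is a function of one element, switching between sequential and parallel execution changes nothing about the loop itself, which is exactly what makes the construct good practice for parallelism.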


Revolution Analytics

The R6 Class System

R is an object-oriented language with several object-orientation systems. There's the original (and still widely-used) S3 class system based on the "class" attribute. There's the somewhat stricter,...
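To give non-R readers a feel for S3's attribute-based dispatch, here is a toy sketch in Python; the `summary_<class>` naming mimics R's `generic.class` method convention, and every name in it is hypothetical:

```python
# Toy illustration of S3-style, attribute-based dispatch: a generic
# function looks up a method named after the object's "class", falling
# back to a default. All names here are hypothetical, not an R API.
def summary(obj):
    for cls in obj.get("class", []):
        method = globals().get(f"summary_{cls}")
        if method is not None:
            return method(obj)
    return "no summary available"

def summary_lm(obj):
    # The method the generic dispatches to for class "lm".
    return f"linear model with {obj['coefficients']} coefficients"

model = {"class": ["lm"], "coefficients": 3}
print(summary(model))  # -> linear model with 3 coefficients
```

The looseness on display here (any object can claim any class) is the flexibility, and the fragility, that stricter systems like R6 are designed to tighten up.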



VIDEO: How to Set Up a Windows 10 Virtual Machine on Your Desktop

Do you ever find yourself sifting through web data, but you’re worried about the possibilities of something going wrong within your desktop, such as a virus? If so, installing the Windows 10 burner box could offer you some much-needed reassurance while working with Deep Web technologies and open source intelligence tools. This virtual desktop runs […] The post VIDEO: How to Set Up a Windows 10 Virtual Machine on Your Desktop appeared first on BrightPlanet.

Read more »

What is Mathematical Optimisation and how does it benefit business?

The word optimisation is used quite loosely and can relate to many different areas.  For example, there is search engine optimisation (getting your website pages to the top of online search results), process optimisation (making existing processes more efficient), code optimisation (making your code run more efficiently) and then there is mathematical optimisation. 

In this blog post, we'll be focusing on mathematical optimisation: what it is, how it can be applied in making more optimal business decisions at a customer level, and specifically how it's applied in credit risk. And you can even try using optimisation yourself - using an optimisation tool we've shared in this post - to see the various scenarios resulting from your decisions.

Scroll to the bottom to try it for yourself!
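To give a flavor of what optimising a customer-level decision means, here is a toy, brute-force sketch in Python; the profit model and every number in it are illustrative assumptions, not real credit-risk figures:

```python
# A toy, brute-force sketch of customer-level decision optimisation.
# The profit model and all numbers are illustrative assumptions, not
# real credit-risk parameters.
def expected_profit(credit_limit):
    # Revenue grows with the limit; expected loss grows faster because
    # default risk is assumed to rise with exposure.
    revenue = 0.12 * credit_limit
    default_prob = min(0.02 + credit_limit / 100_000, 1.0)
    expected_loss = default_prob * credit_limit * 0.5
    return revenue - expected_loss

candidates = range(0, 20_001, 500)
best = max(candidates, key=expected_profit)
print(best, round(expected_profit(best), 2))  # -> 11000 605.0
```

Real optimisation engines replace this exhaustive search with linear or nonlinear programming over millions of customers and constraints, but the shape of the problem, a decision variable, an objective, and trade-offs, is the same.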

Jean Francois Puget

What Is Artificial Intelligence?

Here is a question I was asked to discuss at a conference last month: what is Artificial Intelligence (AI)? Instead of trying to answer it, which could take days, I decided to focus on how AI has been defined over the years. Nowadays, most people probably equate AI with deep learning. This has not always been the case, as we shall see.

Most people say that AI was first defined as a research field in a 1956 workshop at Dartmouth College. In reality, it had been defined six years earlier, by Alan Turing in 1950. Let me cite Wikipedia here:

The Turing test, developed by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine's ability to render words as speech.[2] If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give.

The test was introduced by Turing in his paper, "Computing Machinery and Intelligence", while working at the University of Manchester (Turing, 1950; p. 460).[3] It opens with the words: "I propose to consider the question, 'Can machines think?'" Because "thinking" is difficult to define, Turing chooses to "replace the question by another, which is closely related to it and is expressed in relatively unambiguous words."[4] Turing's new question is: "Are there imaginable digital computers which would do well in the imitation game?"[5] This question, Turing believed, is one that can actually be answered. In the remainder of the paper, he argued against all the major objections to the proposition that "machines can think".[6]



So, the first definition of AI was about thinking machines. Turing decided to test thinking via a chat.

The definition of AI rapidly evolved to include the ability to perform complex reasoning and planning tasks. Early success in the 50s led prominent researchers to make imprudent predictions about how AI would become a reality in the 60s. When those predictions failed to materialize, the result was the funding cuts known as the AI winter of the 70s.

In the early 80s, building on some success in medical diagnosis, AI came back with expert systems. These systems tried to capture the expertise of humans in various domains, and were implemented as rule-based systems. Those were the days when AI focused on the ability to perform tasks at the level of the best human experts. Successes like IBM Deep Blue beating the chess world champion, Garry Kasparov, in 1997 were the acme of this line of AI research.

Let's contrast this with today's AI. The focus is on perception: can we have systems that recognize what is in a picture, what is in a video, what is said in a soundtrack? Rapid progress is underway on these tasks thanks to the use of deep learning. Is it still AI? Are we automating human thinking? In reality, we are working on automating tasks that most humans can do without any thinking effort. Yet we see lots of bragging about AI being a reality when all we have is some ability to mimic human perception. I find it ironic that our definition of intelligence has become one of mere perception rather than thinking.


Granted, not all AI work today is about perception. Work on natural language processing (e.g., translation) is a bit closer to reasoning than the mere perception tasks described above. Successes like IBM Watson at Jeopardy!, or Google AlphaGo at Go, are two examples of the traditional AI aim of replicating tasks performed by human experts. The good news (to me at least) is that progress on perception is so rapid that it will move from a research field to an engineering field in the coming years. We will then see researchers re-positioning onto other AI topics such as reasoning and planning. We'll be closer to Turing's initial view of AI.


July 26, 2017

Revolution Analytics

Introducing Joyplots

This is a joyplot: a series of histograms, density plots or time series for a number of data segments, all aligned to the same horizontal scale and presented with a slight overlap. Peak time for...
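The key mechanical idea, every segment binned against one shared horizontal scale, can be sketched in a few lines of Python; plotting itself (done in R with packages such as ggjoy) is omitted, and the function name here is illustrative:

```python
# Minimal sketch of the shared-scale idea behind a joyplot: every
# segment's histogram uses the SAME bin edges, so the stacked,
# overlapping plots line up. Assumes the data spans a nonzero range.
def shared_bins(segments, n_bins=10):
    lo = min(min(s) for s in segments.values())
    hi = max(max(s) for s in segments.values())
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = {}
    for name, values in segments.items():
        row = [0] * n_bins
        for v in values:
            row[min(int((v - lo) / width), n_bins - 1)] += 1
        counts[name] = row
    return edges, counts

segments = {"Mon": [1, 2, 2, 3], "Tue": [2, 8, 9, 9]}
edges, counts = shared_bins(segments, n_bins=4)
print(edges)          # -> [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts["Mon"])  # -> [3, 1, 0, 0]
```

With per-segment bins instead, the stacked curves would not be comparable, which is exactly what the shared horizontal scale prevents.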

Silicon Valley Data Science

You Have 100 Days to Lead a Data Revolution

Most data resides unused on corporate servers—information that, if unlocked, could create sustainable competitive advantage. That’s why smart companies are looking for digital change leaders to guide their organizations through the transformation around using data rather than just collecting it.

We’ll call these change agents Chief Data Officers, or CDOs, although they aren’t always called by that title—they may be the CIO, Chief Digital Officer, or lead a line of business. It might be you. Whoever it is, the organization is relying on this person to lead change that probably includes parts of all of these:

  • Extracting more value from existing data
  • Monetizing data as a new revenue source
  • Leading a cultural shift around data-driven decision making
  • Tearing down data silos
  • Ensuring data privacy and security
  • Taking on dry but crucial issues such as data governance

This all means big issues, lots of moving parts, and a high-risk challenge for your career. In all likelihood, the board is watching intently.

If you’ve been asked to take on this challenge, where do you start? We believe CDOs have 100 days to get this digital transformation rolling downhill and towards a successful conclusion. If the basic building blocks aren’t in place and moving towards real progress by then, there is trouble ahead.

But before we start on how to make this crucial transition happen, we need a brief look at why it is occurring now.

Technology Is Changing the DNA of Business

Chances are the company you work for has evolved dramatically over the last decade, spurred by transformational shifts in technology. These shifts started with open source software, allowing enterprises to access innovation at lower cost and without having to invest in long-term platform decisions. Then, the development of public and hybrid clouds enabled companies large and small to compete at scale and reduce time-to-market for technology solutions. And now, the primary change-driver has shifted once again: to the availability, analysis, and use of data.

Data is everywhere—it flows in from products reporting on their own performance, from digitally-linked business partnerships, from video of shoppers’ traffic patterns around a store, from IoT sensors around the world, and from field reps using mobile apps to retrieve up-to-the-minute customer information before making a call.

Companies are rushing to digitize big swaths of the business that haven’t been digitized before, instrumenting more and more business processes to capture the data they need to fuel improvements and make better decisions. Some have realized they know more about how a user moves through their website than they know about how a new product makes it to market.

Because of these technological earthquakes, the expectations and skills of today’s knowledge workers around data have evolved rapidly. Individuals who are data-conversant and understand how to connect data, systems, and decisions together are in high demand—just ask GE, which last year moved its headquarters from Connecticut to Boston primarily to be closer to technical talent.

The First 100 Days

A CDO’s first 100 days should lay a foundation for iterative—but fast—action. The change agent must educate leadership about what needs to be done, and drive home the idea that nothing less than a new way of thinking about achieving business goals is being introduced.

The CDO needs to be an evangelist for the art of the possible. What business actions and decisions can be improved by better data? What unused data can be put to work? By evangelizing what can be done and the ensuing benefits, the CDO helps people overcome their resistance to change.

The value of experimentation is another critical message of this evangelism. Enterprises that can formulate a new hypothesis, construct capabilities to test the hypothesis, gather data about outcomes, and then iterate and try again, have a strategic advantage over their competitors.

One hundred days is not a long time. Let’s get down to specifics. Starting down the road to implementation requires an understanding of three things:

Understand the Business Goals

First stop on this journey is understanding how data advances business goals. In short, what does the business need to accomplish? All explorations of data and capabilities should be grounded in what the business needs to ensure you aren’t building capability for capability’s sake. How will you demonstrate to the business the impact you’ve made without an eye on what matters to them?

Once you know the business goals, you’ll also have a good idea of important stakeholders to invite along. Establish strong partnerships with them. These will likely include internal business owners, IT experts, and outside business partners.

Understand Your Data

Get started conducting data audits and catalogs—but delegate this work and don’t let it distract from your true purpose. (In our experience, some people think an inventory of assets is the entire job. Believe us, that is just the start.)

More than just finding data, you need to comprehend what it can do. Does it have the predictive power needed to meet the business goals ascertained above? You’ve also got to turn that question on its head. It’s tempting to start from the data, but the big question is: Does the enterprise have the data needed to accomplish its business objectives? If the business wants to improve the customer experience, are you collecting the data that tells you where customers are frustrated today?

Knowing where data gaps exist, by business objective, is a great starting point for your action plan. There’s a good chance your board wants to know, too.
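To make the idea concrete, a gap audit like this can start as something very simple: map each business objective to the datasets it requires, and compare that against what the organization actually collects. The objectives and dataset names below are purely illustrative, not from the post.

```python
# Hypothetical sketch of a data-gap audit by business objective.
# Objective names and dataset names are invented for illustration.

REQUIRED = {
    "improve customer experience": {"support_tickets", "web_clickstream", "nps_surveys"},
    "reduce churn": {"subscriptions", "usage_logs", "support_tickets"},
}

# Datasets the organization currently collects (from the data inventory).
INVENTORY = {"support_tickets", "subscriptions", "web_clickstream"}

def data_gaps(required, inventory):
    """For each objective, list the datasets the business needs but doesn't yet collect."""
    return {obj: sorted(needed - inventory) for obj, needed in required.items()}

if __name__ == "__main__":
    for objective, missing in data_gaps(REQUIRED, INVENTORY).items():
        status = "ready" if not missing else "missing: " + ", ".join(missing)
        print(f"{objective}: {status}")
```

Even a toy report like this gives the board the answer it wants: which objectives the data can support today, and which ones need new collection efforts first.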

Understand Your Organizational Capabilities

Even if you already know what data is available, is your organization capable of using it effectively, given your current level of data maturity? This is more than grading technology—it’s understanding your complete capabilities: people, process, and systems.

One of your tasks is to evaluate how data, information, and insight are put to use (or not). This is more difficult than it sounds, because the opinions you hear will be influenced by their holder’s place and experiences in the organization. We’ve all heard the story of the CEO who, when asked to describe how a process works on the manufacturing floor, gives an answer that rolls the eyes of the shop foreman. It’s vital that both voices are heard.

It is also vital to identify where talent is misdeployed. You might have the best and brightest PhDs on your machine learning team, but if they aren’t working on problems that impact the bottom line, then why are they there?

The capability to turn data into decisions is not something accomplished by a single department. It’s not the business side of the house that does it, and it’s not IT that does it. It’s both, along with a cast of supporting players that can include engineering, your data science team, business analysts, floor managers, and top executives. Having the right data available at the right time empowers potentially everyone in the company to make decisions quickly and confidently.

Lay the Groundwork for Investment

The final step in the first 100 days is commonly called putting your money where your mouth is. Saying, “We are becoming a data-driven company” but not asking for data and analysis to support key decisions is pointless and also demoralizing to those who have signed on to follow your lead. The CDO has to evangelize for a new culture that values actionable data as much as previous business models prioritized cash flow.

Assessing the Assessment

As you come to the end of the initial assessment phase, it’s time to apply what you’ve learned.

Start by translating your technical and business assessment into a formal action agenda. Without a battle plan, you’re wasting effort. Use business results as a way to focus on what’s important. Finally, think of your action plan as a roadmap: occasional course corrections are inevitable as the competitive environment, internal strategy and tactics, and financial realities change.

Here is what you should be able to show stakeholders at the end of the period:

  1. You have demonstrated how integral data is to achieving their business goals, and specifically where the opportunities lie.
  2. Your organization now has a survey of its critical information, where it resides, how it can be accessed, and by whom.
  3. You have delivered a clear illustration of how data moves through the organization, and have identified problems such as dead-ends, misdirections, and black holes.

The Next 100 Days

With specificity developed around these three areas and links drawn between them, you should have a compelling story for stakeholders and the executive suite to win support for the next steps, whatever they may be.

As your initial three-month run comes to an end, set goals for yourself and continue generating momentum. When you hit 100 days, pause, take a deep breath, reflect on your accomplishments, and visualize where the road leads next.

You may also be interested in our upcoming series of webinars called Data Dialogues, which will focus both on Data Strategy and Data in Practice. Register now to reserve your seat and take part in the live Q&A at each session.

The post You Have 100 Days to Lead a Data Revolution appeared first on Silicon Valley Data Science.


July 25, 2017

Revolution Analytics

SQL Server 2017 release candidate now available

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for...


July 24, 2017

Revolution Analytics

Analyzing Github pull requests with Neural Embeddings, in R

At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text...


Simplified Analytics

Go Digital or Die - What will you choose?

Just before 2007, we didn't have access to smartphones like the iPhone or social media apps like Instagram and WhatsApp, and even email was largely limited to desktops.  Zoom in to Today -...


July 21, 2017

Revolution Analytics

Because it's Friday: How Bitcoin works

Cryptocurrencies have been in the news quite a bit lately. Bitcoin prices have been soaring recently after the community narrowly avoided the need for a fork, while $32M in the rival currency Ethereum...


Revolution Analytics

IEEE Spectrum 2017 Top Programming Languages

IEEE Spectrum has published its fourth annual ranking of top programming languages, and the R language is again featured in the Top 10. This year R ranks at #6, down a spot from its 2016 ranking...


Forrester Blogs

Calling All Endpoint Detection & Response Vendors

On July 25, we’re going to start sending out detailed questionnaires to vendors who qualify for our upcoming report entitled, Vendor Landscape: Endpoint Detection & Response, 2017. This report...


Forrester Blogs

Does Your Channel Run In A Silo?

I have spent my entire career in the channel. With 75% of world trade flowing through indirect channels according to the World Trade Organization, I’m always interested in seeing how businesses...


Forrester Blogs

We assess market for Hosted Private Cloud in Europe with new Forrester Wave

(Photo Credit: justusbluemer Flickr via Compfight cc) Working closely with colleagues in North America and Asia, we just published the third of three related Forrester Waves on Hosted Private Cloud....