
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

June 27, 2017

Big Data University

Cognitive Class Uses Machine Learning to Help SETI Find Little Green Men

This month the team at CognitiveClass.ai was at Galvanize San Francisco with Adam Cox and Patrick Titzler, running a code challenge to help SETI (the Search for Extraterrestrial Intelligence) look for aliens.

The goal of the event was to help the SETI Institute develop new signal classification algorithms and models that can aid in the radio signal detection efforts at the SETI Institute’s Allen Telescope Array (ATA) in Northern California.

Our chief data scientist Saeed Aghabozorgi developed several Jupyter notebooks, including one that transforms the signals into spectrograms using a Spark cluster. In addition, Saeed provided several TensorFlow notebooks, one of which used a convolutional neural network [1] to classify the spectrograms. Check out the GitHub page to see all the scripts from Saeed Aghabozorgi, Adam Cox and Patrick Titzler.

Our developer Daniel Rudnitski built a scoreboard that evaluates everyone's algorithms. The scoreboard works by comparing the predicted results with the true labels of a holdout set whose labels were withheld from the participants (shown in Figure 1). I gave a tutorial on neural networks and TensorFlow, helped the participants debug their code, and enjoyed the free food. 😄😁😀😁

Figure 1: Cognitive Class’ leaderboard used to assess results of Hackathons
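For readers curious how such a scoreboard works, here is a minimal sketch of holdout scoring in Python. The file names and the uuid and signal_classification columns are hypothetical placeholders; the actual leaderboard code may differ.

import pandas as pd

def score_submission(submission_csv, holdout_csv):
    # Compare a participant's predictions against the hidden holdout labels.
    # Both CSVs are assumed to have a 'uuid' column identifying each signal
    # and a 'signal_classification' column with the class label.
    pred = pd.read_csv(submission_csv)
    truth = pd.read_csv(holdout_csv)

    # Align each prediction with its true label on the signal id
    merged = truth.merge(pred, on="uuid", suffixes=("_true", "_pred"))

    # Simple accuracy; a real leaderboard might also report a confusion matrix
    correct = (merged["signal_classification_true"] ==
               merged["signal_classification_pred"]).sum()
    return correct / len(merged)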

 

SETI searches for E.T. by scanning star systems with known exoplanets. The idea is that nature does not produce sine waves, so the system looks for narrow-band carrier waves such as sine waves. The detection system sometimes triggers on signals that are not narrow-band signals. The goal of the event was to classify these signals accurately in real time, allowing the signal detection system to make better-informed observational decisions. [2]

We transformed the observed time-series radio signals into spectrograms. A spectrogram is a 2-dimensional chart that represents how the power of the signal is distributed over time and frequency [3]. An example is shown in Figure 2. The top chart is a spectrogram in which bright green represents higher intensity values and blue represents lower intensity values. The bottom chart contains two amplitude-modulated signals labeled A and B. The two brightly colored patches in the spectrogram directly above the signals represent the distribution of the signal energy in time and frequency. The horizontal axis represents time, while the vertical axis represents frequency. If we examine signal A, we see that it oscillates at a much lower rate than signal B, meaning that it has a much lower frequency. This shows up as a much lower position of the energy on the vertical axis of the spectrogram.

 

Fig 2: Spectrogram (top) of two amplitude modulated Gaussian signals (bottom)
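To make this concrete, here is a minimal sketch of how a spectrogram like the one in Figure 2 can be computed in Python with SciPy. The signal here is a synthetic amplitude-modulated tone, not the actual SETI data.

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 8000                        # sampling rate in Hz
t = np.arange(0, 2.0, 1.0 / fs)  # two seconds of samples

# Amplitude-modulated Gaussian pulse: a carrier tone whose envelope
# rises and falls, similar in spirit to signal A in Figure 2
carrier = np.sin(2 * np.pi * 400 * t)
envelope = np.exp(-((t - 1.0) ** 2) / 0.05)
x = envelope * carrier

# Power of the signal distributed over time and frequency
f, tt, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)

plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # log scale for visibility
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.show()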

 

The 2D representation provided by the spectrogram allows us to turn the problem into a visual recognition problem, which means we can apply methods such as convolutional neural networks. Individuals without expertise in designing and implementing deep neural networks could focus on the signal processing problem and let the IBM Watson Visual Recognition tool handle the complex problem of image classification. The process is demonstrated in Figure 3 with a chirp signal (a signal in which the frequency increases or decreases over time). After the spectrogram is computed, several convolutional layers are applied to extract features from the image; the output is then flattened and fed into a fully connected neural network. To learn more about deep learning check out our Deep Learning 101 and Deep Learning with TensorFlow courses.

Figure 3: Example architecture used in the event. (Source: Wikipedia)
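As an illustration of the architecture sketched in Figure 3, here is a minimal Keras example of a convolutional network that takes spectrogram images and predicts a signal class. The input shape and number of classes are placeholders, not the values used at the event.

import tensorflow as tf

NUM_CLASSES = 7  # placeholder: one output per signal type

model = tf.keras.Sequential([
    # Convolutional layers extract local features (edges, lines) from the spectrogram
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Flatten and feed into a fully connected classifier
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(spectrograms, labels, epochs=10)  # with prepared training data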

 

To speed up the process of developing and testing these neural networks, participants were given access to GPUs on IBM PowerAI Deep Learning. Participants used libraries such as Caffe, Theano, Torch, and TensorFlow. In addition, given the vast amounts of data involved in signal processing, participants were also given access to an IBM Apache Spark Enterprise cluster. For example, the spectrograms were calculated on several nodes, as shown in Figure 4.

Figure 4: Example architecture used in the event.
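The following is a rough sketch of how spectrogram computation can be spread across a Spark cluster with PySpark. The file listing and the compute_spectrogram helper are hypothetical stand-ins, not the notebooks used at the event.

import numpy as np
from pyspark.sql import SparkSession
from scipy import signal

spark = SparkSession.builder.appName("spectrograms").getOrCreate()
sc = spark.sparkContext

def compute_spectrogram(path):
    # Load one raw time series and return its spectrogram.
    # Assumes each signal is stored as a .npy file (hypothetical layout).
    x = np.load(path)
    _, _, Sxx = signal.spectrogram(x, fs=8000, nperseg=256)
    return path, Sxx

# Distribute the list of signal files across the cluster and compute
# one spectrogram per file on the worker nodes
paths = ["signals/sig_%05d.npy" % i for i in range(10000)]  # placeholder paths
spectrograms = sc.parallelize(paths, numSlices=64).map(compute_spectrogram)

# Trigger the computation; in practice the results would be written back to storage
print("computed", spectrograms.count(), "spectrograms")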

 

The top team was Magic AI. This team used a wide neural net, a network that has fewer layers than a deep network but more neurons per layer. According to Jerry Zhang, a graduate researcher at the UC Berkeley Radio Astronomy Lab, the spectrograms exhibited less complex shapes than a standard image dataset like the Modified National Institute of Standards and Technology database (MNIST); as a result, fewer convolutional layers were required to encode features like edges. We can see this by examining Figure 5: the left image shows 5 spectrograms and the right image shows 5 images from MNIST. The spectrograms are colored using the standard grayscale, where white represents the largest values and black the lowest. We see that the edges of the spectrograms are predominantly vertical and straight, while the digits exhibit horizontal lines, parallel lines, arcs and circles.

 

Figure 5: Spectrograms (left) and 5 images from the MNIST database (right)
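For contrast with the convolutional model above, here is a minimal sketch of a wide network in the spirit of the winning approach: fewer layers, more neurons per layer, operating on flattened spectrograms. The sizes are illustrative guesses, not Magic AI's actual architecture.

import tensorflow as tf

NUM_CLASSES = 7  # placeholder

wide_model = tf.keras.Sequential([
    # The flattened spectrogram goes straight into a large dense layer
    tf.keras.layers.Flatten(input_shape=(128, 128, 1)),
    tf.keras.layers.Dense(2048, activation="relu"),  # one very wide hidden layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

wide_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])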

 

The Best Signal Processing Solution was by the Benders. They applied a method for detecting earthquakes to improve signal processing. Arun Kumar Ramamoorthy, one of the members, also made an interesting discovery while plotting out some of the data points. Check out their blog post here.

The prize for the best non-neural network/Watson solution went to team Explorers, and the most interesting solution went to team Signy McSigFace. The trophies are shown in Figure 6.

Figure 6: Custom trophies designed for winners of this hackathon.

 

The weekend was quite interesting, with talks from Dr. Jill Tarter, Dr. Gerry Harp, and Jon Richards about SETI, radio data processing, and operations. They were also available to answer questions from participants. Kyle Buckingham gave a talk about the radio telescope he built in his backyard! Everyone who participated is shown in the image below.

Figure 7: SETI Hackathon participants

 

Check out the event GitHub page: https://github.com/setiQuest/ML4SETI/

For more information on SETI, please check out: https://www.seti.org/

To donate to SETI: https://www.seti.org/donate/astrobiology-sb

Would you like to make your own predictions? Learn about Deep Learning with our Deep Learning 101 and Deep Learning with TensorFlow courses.

References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

[2] Aghabozorgi, Saeed, Adam Cox, and Patrick Titzler. ML4SETI. https://github.com/setiQuest/ML4SETI

[3] Cohen, Leon. “Time-frequency distributions-a review.” Proceedings of the IEEE 77.7 (1989): 941-981.

The post Cognitive Class Uses Machine Learning to Help SETI Find Little Green Men appeared first on Cognitive Class.

Silicon Valley Data Science

How to Grow Your Data Capital

The twin tides of digitization and artificial intelligence have risen so high that no business can ignore them. Information technology has come out from behind the curtain as the driver of value and innovation. Little wonder, then, that a recent issue of The Economist proclaimed that “the world’s most valuable resource is no longer oil, but data”.

As practically every human activity now generates a digital trace, the advantage lies with those who can exploit the insights that lie in the data. When will a customer buy? What’s the best product mix? When should we do maintenance? What makes my workforce the most effective?

Surprisingly, the algorithms that drive these insights aren’t the secret sauce. And although priced at a premium, neither are the data scientists who wield them.

The value lies in the data itself, in its potential when applied to your business.

Though there is much to be gained from using data to reduce costs, or optimize processes, that’s not the full story either. A well of data creates future possibilities and competitive advantage. For example, by delivering a connected car, Tesla gains unique and massive insight into driving patterns: a bulwark against competitors, and a rich resource to leverage in future innovation.

The ability to generate this future potential through operating your current business is the ultimate definition of what it means to be data-driven: when value, and not solely decision-making, is being driven by data.

Growing data is a pressing business priority

The true value of data places it as a burning issue for every business. A plan to grow data reserves merits a prime place in every corporate strategy. But how exactly can you make that a reality?

There are four levels of growing data investment: buy, integrate, instrument, and design. The truly data-driven engage across all of these.

Grow Your Data

Buy data

One of the quickest ways to gain insight is to buy data, and it’s nothing new to most businesses. Buying data has been industry practice for decades in the arena of consumer and competitive intelligence, thanks to information companies such as Acxiom, LexisNexis, or Hoovers. Financial data, weather and geographical data are all staple requirements for many businesses.

Through acquiring third party data, you can gain more insight into existing operations and improve decision support. However, available data for sale tends to be limited to large scale data sets, predominantly environmental in nature. Buying data enables the business, but you can’t buy your way to competitive differentiation.

Integrate your data

Even before the era of big data, companies generated a wealth of information that could be leveraged to fuel growth and create more value. The potential of integrating this information is high. Combining data sets tends to have a multiplicative effect on the data’s power, rather than merely additive.

Every system in a company is constantly creating data. However, data silos are a systemic problem that heavily constrains how useful this data is. Silos come from software applications and processes built in an era before the value of data was truly recognized, and one where resource constraints were more restrictive. The design of these systems is centered around the process they perform, in order to be most efficient. A natural side effect of being process-oriented is semantic differences between business functions, which adds further complication: it is not unusual to encounter multiple different definitions of concepts such as “customer”.

Instrument for data

When you are really focused on using data to drive decisions and value, instrumentation is the next step to take. Instrumentation is the act of embedding data generation into processes and systems, reporting on their status over time. Thanks to the field of operations research, this concept is nothing new in business, but with increasing digitization, the scope for instrumentation is vast.

To give an example from software: the extent of software instrumentation used to be logging errors, in order to give support technicians hope of diagnosis. Now, it’s feasible to record every user interaction, every move of the mouse, and the questions we can ask extend further: not only “why did something go wrong”, but also “how easy is this product to use?” Or another example from retail, where radio beacons now allow the monitoring of foot-traffic through a store, enabling assessment of the effectiveness of a physical layout.

The combination of instrumentation and data science allows us to gather detail and solve problems which would previously require prohibitively expensive human intervention.

Design for data

Any organization deeply invested in the above three levels is already in a strong leading position, but there is one step further in realizing the potential of data. If you truly comprehend your data’s power, then design activities around its accumulation and exploitation, building data as a strategic asset.

To reach this point is to begin to have an answer to the tech giants: Google, Amazon, and Facebook. As data-native companies, their agility, accumulation of data, and financial power allow them to challenge convincingly in new markets.

The essence of a business designed for data is that the delivery of the product or service generates data that further enhances the service, and creates a platform for the next steps of innovation and expansion. Customers aren’t viewed as isolated recipients, but as part of an ecosystem in which their participation makes the product better for everyone. Feeding off this customer-generated data, analytics and artificial intelligence are used as part of the product, in comparison to the out-of-band applications of the previous three stages.

Data is a defensible advantage

The effects of building a company with data can be remarkable. Today’s advances in artificial intelligence bring major efficiencies and new capabilities, but have one driving trait—an insatiable thirst for data. Data is not a fungible commodity, and so its accumulation provides a daunting hurdle for competitors. This is such a compelling advantage that perhaps, as The Economist observes, we may even need a new kind of anti-trust approach in the future.

The post How to Grow Your Data Capital appeared first on Silicon Valley Data Science.

Knoyd Blog

Traveling The World In D3 - Part 1: Making A Map

The Knoyd team is currently spread out all over the world (Chile, Peru and Austria), and I have been living on the road for quite a while now. I had the idea to code up an interactive map for my travel blog and thought it would be nice to share with you how to do it yourself.

This is a multipart tutorial. To get started, you can get all the code from GitHub and also see the final product here.
 

What is D3?

D3.js (or just D3 for Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It uses the widely implemented SVG, HTML5, and CSS standards. In contrast to many other libraries, D3.js allows great control over the final visual result.

Overall the library is very low level and it takes a while to get used to, so I wouldn't blame anyone for using some of the libraries built on top of D3 (such as Plotly or Bokeh in Python). But if you want to make something truly custom, you will end up learning D3 anyways.
 

 

Part 1: Making a map

1.1 Basic world projection

We start by plotting the map of the world in the Mercator projection:

<!DOCTYPE html>
<meta charset="utf-8">

<body>
<script type="text/javascript" src="https://d3js.org/d3.v4.min.js"></script>
<script src="trip_data.js"></script>
<script src="https://unpkg.com/topojson-client@3"></script>
<script>
var width = window.innerWidth,
    height = window.innerHeight,
    centered,
    clicked_point;

var projection = d3.geoMercator()
    .translate([width / 2.2, height / 1.5]);
    
var plane_path = d3.geoPath()
        .projection(projection);

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height)
    .attr("class", "map");
    
var g = svg.append("g");
var path = d3.geoPath()
    .projection(projection);
    
// load and display the World
d3.json("https://unpkg.com/world-atlas@1/world/110m.json", function(error, topology) {
    g.selectAll("path")
      .data(topojson.feature(topology, topology.objects.countries)
          .features)
      .enter()
      .append("path")
      .attr("d", path)
      ;
 });
 
</script>
</body>
</html>

 

Let's add some color. The trip data must already be loaded by the code above (trip_data.js); it provides the list of visited countries used below. We want to color the countries we have visited in a different color. This is a helper function that will do that:

// color country
function colorCountry(country) {
    if (visited_countries.includes(country.id)) {
        // hack to avoid coloring Ethiopia
        if (country.id == '-99' && country.geometry.coordinates[0][0][0] != 20.590405904059054){
            return '#e7d8ad'    
        } else {
            return '#c8b98d';
        };
    } else {
        return '#e7d8ad';
    }
};

 

Now we can select all the path elements (which are the country polygons) and color them. Put this inside the d3.json() call above.

g.selectAll('path')
    .attr('fill', colorCountry);

 

Voila! The results should look something like this:

The World map after the first step.

 

1.2 Zooming into countries

The next thing we would like is to be able to click and zoom into particular countries. We also want the zoomed-in country to have a different color. For this we need to run a transformation on the map projection. First, add the click action to the generated path (country) objects.

g.selectAll("path")
      .data(topojson.feature(topology, topology.objects.countries)
          .features)
      .enter()
      .append("path")
      .attr("d", path)
      .on("click", clicked) //adding the click action

 

We introduced a new function called clicked, so let's take a look at what it looks like.

//clicked
function clicked(d) {
      var x, y, k;
      //if not centered into that country and clicked country in visited countries
      if ((d && centered !== d) && (visited_countries.includes(d.id))) {
        var centroid = path.centroid(d); //get center of country
        var bounds = path.bounds(d); //get bounds of country
        var dx = bounds[1][0] - bounds[0][0], //get bounding box
            dy = bounds[1][1] - bounds[0][1];
        //get transformation values
        x = (bounds[0][0] + bounds[1][0]) / 2;
        y = (bounds[0][1] + bounds[1][1]) / 2;
        k = Math.min(width / dx, height / dy);
        centered = d;
      } else {
        //else reset to world view
        x = width / 2;
        y = height / 2;
        k = 1;
        centered = null;
      }
      //set class of country to .active
      g.selectAll("path")
       .classed("active", centered && function(d) { return d === centered; })
   
   
      // make contours thinner before zoom for smoothness
      if (centered !== null){
        g.selectAll("path")
         .style("stroke-width", (0.75 / k) + "px");
      }
  
      // map transition
      g.transition()
        //.style("stroke-width", (0.75 / k) + "px")
        .duration(750)
        .attr("transform", "translate(" + width / 2 + "," + height / 2 + ")scale(" + k + ")translate(" + -x + "," + -y + ")")
        .on('end', function() {
            if (centered === null){
              g.selectAll("path")
               .style("stroke-width", (0.75 / k) + "px");
      }
        });
}

 

You can see that we have introduced an active CSS class, which is assigned to the zoomed-in country and will help us change the color of that country.

.active {
    fill: #98f5ff;
}

We now have working zooming, and the countries change color on zoom too ;-)

Next time:

In the next part of this tutorial, we will draw points for the places we have visited in the zoomed-in country, add a tooltip with a description, and add icons with links to the actual blog posts about each place.

If you don't wanna miss out, sign up for our newsletter in the bottom right corner.


Ronald van Loon

AI – The Present in the Making

I attended the Huawei European Innovation Day recently, and was enthralled by how new technology is giving rise to industrial revolutions. These revolutions are what will eventually unlock development potential around the world. It is important to leverage emerging technologies, since they are the resources that will lead us to innovation and progress. Huawei is innovative in its partnerships and collaborations to define the future, and the event was a huge success.

For many people, the concept of Artificial Intelligence (AI) is a thing of the future, a technology that has yet to be introduced. But Professor Jon Oberlander disagrees. He was quick to point out that AI is not in the future; it is now in the making. He began by mentioning Alexa, Amazon’s star product. It is an artificially intelligent personal assistant, made popular by the Amazon Echo devices. With a plethora of functions, Alexa quickly gained much popularity and fame. It is used for home automation, music streaming, sports updates, messaging and email, and even to order food.

With all these skills, Alexa is still being updated as more features and functions are added to the already long list. This innovation has certainly changed the perception of AI as a technology of the future. AI is the past, the present, and the future.

Valkyrie is another example of how AI exists in the present. There are only a handful of these robots in the world, and one of them is owned by NASA. Valkyrie is a platform for human-robot interaction, built in 2013 by the Johnson Space Center (JSC) Engineering Directorate. This humanoid robot is designed to work in damaged and degraded environments.

The previous two were a bit too obvious. Let’s take it a notch higher.

The next thing on Professor Jon Oberlander’s list was labeling images on search engines. For example, if we search for an image of a dog, the search engine will show all the images that contain a dog, even if the dog is not the focal point. Connected-component labeling is used in computer vision, and this is another great example of how AI is developing in the present.

Over the years, machine translation has also gained popularity, as numerous people around the world rely on these translators. Over the past year, there has been a massive leap forward in the quality of machine translation, as algorithms are revised and new technology is incorporated to enhance the service.

Start with a guess, and end up close to the truth. That’s the basic idea behind Bayes’ rule, a law of conditional probability.
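For the curious, Bayes’ rule can be written in one line and applied in a few more; here is a tiny Python illustration with made-up numbers.

# Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D)
# Start with a guess (the prior) and update it with evidence.

prior = 0.01            # initial guess: 1% of cases are of the kind we care about
likelihood = 0.9        # P(evidence | hypothesis true)
false_alarm = 0.05      # P(evidence | hypothesis false)

evidence = likelihood * prior + false_alarm * (1 - prior)
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # 0.154: closer to the truth than the initial guess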

But how did we get here? All these great inventions and innovations have played a major role in making AI a possibility in the present. These four steps led us to this technological triumph:

  • Starting
  • Coding
  • Learning
  • Networking

Now that we are here, where will this path take us? It has been a great journey so far, and it’s bound to get more exciting in the future. The only way we can eventually end up fulfilling our goals is through:

  • Application
  • Specialization
  • Hybridization
  • Explanation

With extensive learning systems, it has become imperative to devise fast-changing technologies, which will in turn facilitate the spread of AI across the world. With technologies such as deep fine-grained classifiers and the Internet of Things, AI is readily gaining coverage. And this is all thanks to Thomas Bayes, who laid the foundations of this intelligent technology.

If you would like to read more from Ronald van Loon on the possibilities of AI, please click “Follow” and connect with him on LinkedIn and Twitter.

Ronald

Ronald helps data driven companies generating business value with best of breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining and is guest author on leading Big Data sites, is speaker/chairman/panel member on national and international webinars and events and runs a successful series of webinar on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a Data Consultancy company, leader in Big Data & data process management solutions. Broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.



The post AI – The Present in the Making appeared first on Ronald van Loons.

Cloud Avenue Hadoop Tips

Processing the Airline dataset with AWS Athena

AWS Athena is an interactive query engine for processing data in S3. Athena is based on Presto, which was developed by Facebook and then open sourced. With Athena there is no need to start a cluster or spawn EC2 instances. Simply create a table, point it to the data in S3 and run the queries.

In the previous blog, we looked at converting the Airline dataset from the original csv format to a columnar format and then ran SQL queries on the two data sets using the Hive/EMR combination. In this blog we will process the same data sets using Athena. So, here are the steps.

Step 1 : Go to the Athena Query Editor and create the ontime and the ontime_parquet_snappy tables as shown below. The DDL queries for creating these two tables can be found in this blog.



Step 2 : Run the query on the ontime and the ontime_parquet_snappy tables as shown below. Again, the queries can be found in the blog mentioned in Step 1.



Note that processing the csv data took 3.56 seconds and scanned 2.14 GB of S3 data, while processing the Parquet Snappy data took 3.07 seconds and scanned only 46.21 MB.

There is not a significant difference in the time taken to run the queries on the two datasets. But Athena pricing is based on the amount of data scanned in S3, so it is significantly cheaper to process the Parquet Snappy data than the csv data.

Step 3 : Go to the Catalog Manager and drop the tables. Dropping them simply deletes the table definitions, not the associated data in S3.


Just out of curiosity I created the two tables again and ran a different query this time. Below are the queries with the metrics.
select distinct(origin) from ontime_parquet_snappy;
Run time: 2.33 seconds, Data scanned: 4.76MB

select distinct(origin) from ontime;
Run time: 1.93 seconds, Data scanned: 2.14GB

As usual, there is not much difference in the time taken for the query execution, but the amount of data scanned in S3 for the Parquet Snappy data is significantly lower. So, the cost of running the query on the Parquet Snappy data is significantly less.
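As a side note, these runs can also be scripted. Below is a minimal sketch using boto3; the region, database name and S3 output location are placeholders, and the tables are assumed to exist already.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder query against the Parquet Snappy table created above
response = athena.start_query_execution(
    QueryString="select distinct(origin) from ontime_parquet_snappy",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://airline-dataset/athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then report its state and the data scanned
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

stats = execution["QueryExecution"]["Statistics"]
print(state, stats.get("DataScannedInBytes"), "bytes scanned")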
 

June 26, 2017


Revolution Analytics

Useful tricks when including images in Rmarkdown documents

Rmarkdown is an enormously useful system for combining text, output and graphics generated by R into a single document. Images, in particular, are a powerful means of communication in a report,...

...

Rob D Thomas

The Data Science Renaissance

“If people knew how hard I worked to get my mastery, it wouldn’t seem so wonderful at all.” -Michelangelo. Renaissance means rebirth. A variety of factors, coming together at the same time, can...

...
Cloud Avenue Hadoop Tips

Converting Airline dataset from the row format to columnar format using AWS EMR

To process Big Data, a huge number of machines is required. Instead of buying them, it's better to process the data in the Cloud, as it provides lower CAPEX and OPEX costs. In this blog we will look at processing the airline data set in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service. We don't need to worry about installing, configuring, patching, or securing the Big Data software; EMR takes care of that. We just need to specify the size and the number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as that.

The Airline dataset is in a csv format, which is efficient for fetching the data row-wise based on some condition, but not really efficient when we want to do some aggregations. So, we will convert the CSV data into the Parquet format and then run the same queries on the csv and Parquet formats to observe the performance improvements.

Note that using AWS EMR will incur cost and doesn't fall under the AWS free tier, as we will be launching not t2.micro EC2 instances but somewhat bigger ones. I will try to keep the cost to a minimum as this is a demo. Also, I prepared the required scripts ahead of time and tested them on my local machine on small data sets instead of on AWS EMR. This saves on the AWS expenses.

So, here are the steps

Step 1 : Download the Airline data set from here and uncompress the same. All the data sets can be downloaded and uncompressed. But, to keep the cost to the minimum I downloaded the 1987, 1989, 1991, 1993 and 2007 related data and uploaded to S3 as shown below.



Step 2 : Create a folder called scripts in S3 and upload the script files to it.


The '1-create-tables-move-data.sql' script will create the ontime and the ontime_parquet_snappy table, map the data to the table and finally move the data from the ontime table to the ontime_parquet_snappy table after transforming the data from the csv to the Parquet format. Below is the SQL for the same.
create external table ontime (
Year INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime INT,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Dest STRING,
Distance INT,
TaxiIn INT,
TaxiOut INT,
Cancelled INT,
CancellationCode STRING,
Diverted STRING,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
Year INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime INT,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Dest STRING,
Distance INT,
TaxiIn INT,
TaxiOut INT,
Cancelled INT,
CancellationCode STRING,
Diverted STRING,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script will run the query on the ontime table which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script will run the query on the ontime_parquet_snappy table which maps to the Parquet-Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.


Step 4 : Here select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1. Make sure these are selected; the rest are optional. Here we can add a step. According to the AWS documentation, the definition of a Step is: 'Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.' This can be an MR program, Hive query, Pig script or something else. The steps can be added here or later. We will add steps later. Click on Next.


Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them as default and click on next.


Step 6 : In this step, we can select additional settings like the cluster name, the S3 log path location and so on. Make sure the 'S3 folder' points to the log folder in S3 and click on Next. Uncheck the 'Termination protection' option.


Step 7 : In this screen again all the default options are good enough. If we want to ssh into the EC2 instances then the 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally click on 'Create cluster' to launch the cluster.


Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.



Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.


Step 8 : Add a Step to the cluster by clicking on the 'Add step' and pointing to the '1-create-tables-move-data.sql' file as shown below and click on Add. The processing will start on the cluster.



The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.



Once the processing is complete, the csv data will have been converted into the Parquet format with Snappy compression and put into S3 as shown below.


Note that the csv data was close to 2,192 MB and the Parquet Snappy data is around 190 MB. The Parquet data is in a columnar format and provides higher compression compared to the csv format. This makes it possible to fit more data into memory, and so produces quicker results.

Step 9 : Now add 2 more steps using the '2-run-queries-csv.sql' and the '3-run-queries-parquet.sql'. The first sql file will run the query on the csv data table and the second will run the query on the Parquet Snappy table. Both the queries are the same, returning the same results in S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data when compared to the csv data. I was expecting the query to run even faster; I need to look into this a bit more.

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.


Note that the EMR cluster will be terminated and EC2 instances will also terminate.



Step 12 : Go back to the S3 management console; the folders below should be there. Clean up by deleting the bucket. I will be keeping the data, so that I can try Athena and Redshift on the CSV and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier here.


In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster, so it follows the serverless model; AWS Athena automatically provisions the servers. We simply create a table, map it to the data in S3 and run SQL queries on it.

With EMR, pricing is rounded up to the hour, so a query that takes about 1 hour and 5 minutes is billed for a full 2 hours. With Athena we pay by the amount of data scanned. So, converting the data into a columnar format and compressing it will not only make the query run a bit faster, but also cut down the bill.

Update: Here and here are articles from the AWS documentation on the same topic. They include some additional commands.
 

June 25, 2017


Simplified Analytics

What are Digital Twins?

Digital Transformation has brought in all the new concepts and technologies at the hands of consumers and businesses alike. Digital Twin is one of them. It is a virtual image of your machine or...

...
 

June 23, 2017


Revolution Analytics

Because it's Friday: Mario in the Park

I got my first chance to use HoloLens just a couple of weeks ago. It was pretty amazing to see a virtual wind turbine appear in the room with me, and to be able to walk around it and see how it was...

...

Revolution Analytics

The R community is one of R's best features

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And...

...
Cloud Avenue Hadoop Tips

Algorithmia - a store for algorithms, models and functions

I came across Algorithmia a few months back and didn't get a chance to try it out. It came back into focus with a Series A funding round of $10.5M. More about the funding here.

Algorithmia is a place where algorithms, models or functions can be discovered and used for credits, which we can buy. We get 5,000 credits every month for free. For example, if a model costs 20 credits, then it can be called 250 times a month.

Create a free account here and get the API key from your profile. Now we should be able to call the different models using languages like Python, Java and R, or commands like curl. Below are the curl commands to do sentiment analysis on a sentence. Make sure to replace API_KEY with your own key.

curl -X POST -d '{"sentence": "I really like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,"positive":0.358,"sentence":"I really like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.010212005}}

curl -X POST -d '{"sentence": "I really dont like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":-0.3374,"negative":0.285,"neutral":0.715,"positive":0,"sentence":"I really dont like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.009965723}}
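The same call can be made from Python with the requests library; here is a minimal sketch (API_KEY is a placeholder for your own key).

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: use the key from your Algorithmia profile

response = requests.post(
    "https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis",
    json={"sentence": "I really like this website called algorithmia"},
    headers={"Authorization": "Simple " + API_KEY},
)
print(response.json())  # contains compound, negative, neutral and positive scores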
Algorithmia is more like the Google Play Store or the Apple App Store, where individuals and companies can upload mobile applications and the rest of us can download them. It's an attempt to democratize Artificial Intelligence and Machine Learning.

Here is a service that converts black and white images to color.

BrightPlanet

Proxy Servers: Defying the Data-Collection Odds

At BrightPlanet, data harvesting is our specialty. But data harvesting is anything but simple. That’s where proxy servers come in. We have an anonymous proxy server and a TOR proxy cluster that we leverage for most of our harvesting. Friendly Crawling: A Harvesting Necessity Today many servers will automatically block crawlers (like ours) based on […] The post Proxy Servers: Defying the Data-Collection Odds appeared first on BrightPlanet.

Read more »
 

June 22, 2017

Silicon Valley Data Science

Three Ways the C-Suite Can Embrace (Gulp) Failure

In today’s hero culture, failure can lead to demoralization, loss of status, or loss of job. Teams that fail too much lose resources and ultimately fall apart. Fear of failure may be especially acute among company leaders, those women and men who are paid handsomely to produce success stories. So, yes, failing hurts and has real consequences. No wonder so many of us fear failure and manage our projects with safety nets and guard rails that rein in risk.

But failure has a bad rap. Failure is not only inevitable, it’s something to be embraced. And in almost all cases, success springs from a series of failed experiments, design iterations, or chances taken that led to true innovation—the ultimate market conqueror. Thirty-nine versions preceded the spray lubricant megahit WD-40.

As leaders work toward building their own experimental enterprise, it’s important to have a perspective on success and failure. You can’t avoid failure, but you can learn to cope and make the most of it. In this post, we’ll look at three ways you can embrace failure:

  • Understand that failing is learning.
  • Design processes and organizational structures that learn by failing.
  • Gain control of failure anxiety.

In the early 2000s, the first three rocket launches made by Elon Musk’s SpaceX were spectacular fiascos, and left the company one more failure away from bankruptcy. Test four, of course, succeeded, thanks to the culture he built in his company of learning from mistakes and iterating forward to build a better product. SpaceX eventually got its first break: a $1.5 billion contract with NASA. Innovative leaders like Musk, Richard Branson, and Jeff Bezos probably have had many more failures than successes—but their wins changed the world, and they didn’t get there on the first try.

Understand that failing is learning

It’s easy to comprehend why we avoid talking about failure. For example, we may fear seeming negative in a culture of “positive thinking.” In truth, though, planning for negative outcomes is just good planning. Even CEOs don’t control the world as individuals, and we can’t make everything go our way.

And sometimes, if you look closely enough, failure really isn’t. A technology client of ours, a global travel integrator, asked if we could predict costly resource-allocation failures from their business and systems data. Weeks spent on an analysis concluded that we couldn’t predict any failures. It turned out that the data we’d been given didn’t have enough breadth to cover all the required variables. Finding nothing is an uncomfortable situation for any consultant, but the insight was that the data wasn’t correlative with the predictions that the client wanted. We shouldn’t be afraid to say that to our client. And our client shouldn’t be afraid to say that to the boss. The true failure would be not deciding what to do next.

Design your process to learn by failing

It’s not enough to accept failure if it happens; you must build in the possibility at the start of a project. When preparing, pretend there are binary outcomes—a positive and a negative. You must be prepared for both. What must go perfectly for the whole project to come together? What are those things so critical that, if they go wrong, the project craters?

But it’s not just risk planning around individual projects that we need to be concerned about. Pivotal personnel and organizational components must be addressed to create a culture around learning from mistakes.

Who should be spearheading this methodology? In traditional project management, the project manager better be thinking about both sides of what you are trying to execute. I think the true issue is the leader—the sponsor, or the person that is responsible and accountable ultimately for the investment being made.

Roche CEO Severin Schwan told an interviewer that he holds celebrations for failed projects to underscore that risk taking is endorsed from the very top of the organization. He said, “I would argue, from a cultural point of view, it’s more important to praise the people for the nine times they fail, than for the one time they succeed.”

Another way to overcome that fear is to make failing the objective. In the true agile sense, you do want failure–you want to test fast and find out what doesn’t work so you can go down the positive path. (As my former colleague, Mike Bechtel, puts it, “Failure isn’t what you’re after. You’re after big, honking, 100x success.”) We’ve spoken extensively about working with agile teams.

If you are truly agile and everything goes without a hitch to, say, roll out a new user experience, then you probably haven’t really tested the boundaries of where else success might be found. Failure can propel you to grow faster by accomplishing what innovation consultant Mike Maddock terms “failing forward.”

There is a right way to fail forward; I think of what scientist Max Delbrück called the Principle of Limited Sloppiness. He meant that your research and development activities should be open to and encourage unexpected serendipitous possibilities that appear out of nowhere. But you shouldn’t be chasing rainbows to the point where the results can’t be reproduced.

This isn’t about dressing up failure—just saying, “Well, I learned a lot!,” and moving on. You must dig deep into exactly what you learned. Maybe you discover from a project that you are not especially good at building product or increasing sales, and you need to either strengthen those skills or team up with the right person next time. Rather than throwing up your hands and saying it was out of your control, you must use your perspective to empower yourself to do better in the future.

Great leaders shouldn’t penalize well-conceived risks; they should penalize not taking risks or making the same mistakes twice.

Gain control of failure anxiety

You’ve acknowledged that failure must be planned for, and you’ve changed your perspective so that you see failure as an opportunity to learn. That’s great, but moments of anxiety will still pop up. How can those doubts be mitigated?

As Dr. Guy Winch says, focus on what you can control. Specifically, “Identify aspects of the task or preparation that are in your control and focus on those. Brainstorm ways to reframe aspects of the task that seem out of your control such that you regain control of them.” Have a plan and iterate quickly—that’s what gives you more control. Planned iteration helps you knock out unknown parameters and move to success (or failure) with speed and certainty.

Also, try visualizing your obstacles, just as an Olympics luge driver visualizes shooting down the icy, twisty track at 90 mph. In that early planning phase, when you’re acknowledging what could go wrong, really sit with it. Think about what it will mean for the project, the conversations you’ll need to have, and the solutions you’ll need to develop in order to get down your own critical path.

What’s next?

Becoming comfortable with failure is not easy, not for you or for the organization you lead. It won’t happen overnight. Use the skills we’ve detailed here to fight the knee-jerk reaction of fear, so you can commit yourself and your team to leverage failure into ultimate success.

(For more on the experimental enterprise, and how to build your own, watch our video.)

The post Three Ways the C-Suite Can Embrace (Gulp) Failure appeared first on Silicon Valley Data Science.


Forrester Blogs

The Age of Alt: Data Commercialization Brings Alternative Data To Market

We all want to know something others don't know. People have long sought "local knowledge," "the inside scoop" or "a heads up" - the restaurant not in the guidebook, the real version of the story, or...

...

Revolution Analytics

Interactive R visuals in Power BI

Power BI has long had the capability to include custom R charts in dashboards and reports. But in sharp contrast to standard Power BI visuals, these R charts were static. While R charts would update...

...

Revolution Analytics

Updated Data Science Virtual Machine for Windows: GPU-enabled with Docker support

The Windows edition of the Data Science Virtual Machine (DSVM), the all-in-one virtual machine image with a wide-collection of open-source and Microsoft data science tools, has been updated to the...

...
Principa

Your Debtor is still your Customer: Personalisation and the Customer Experience in Debt Collection

I recently attended the First Party Summit – a Collections summit in the USA - which was an exciting conference focusing on the unique challenges of Collections, Outsourcing, and Customer Care. Over the 3-day period I learned quite a lot from the presentations and round table discussions covering topics from innovative communication strategies to the role of human and artificial intelligence in collections.

Throughout the conference, I noticed the following two points were consistent through all aspects of the discussions: Personalisation and the Customer Experience are on the rise in debt collection. In this blog post, I’ll explain why.

 

June 20, 2017


Revolution Analytics

R leads, Python gains in 2017 Burtch Works Survey

For the past four years, recruiting firm Burtch Works has conducted a simple survey of data scientists with just one question: "Which do you prefer to use — SAS, R or Python". The results for this...

...
Ronald van Loon

GDPR – A Change in the Making

Organizations all over the EU must be aware by now that the Data Protection Act (DPA) will be replaced by the GDPR (General Data Protection Regulation). Some of these changes might cause compliance issues, but there’s an easy way to avoid any problems: raising awareness.

The more your staff and employees know about the GDPR, the less chance you have of ever violating the conditions of the reform.

What is GDPR?

GDPR gives customers control over their personal data, to modify, restrict or withdraw consent, and transfer data. For example, you decide to contact Apple to ask how they’re using your personal data because you frequently shop online on their site, and use iTunes. You tell them that they can no longer use your data because you won’t be using their services anymore, and request for them to send your personal information to Spotify instead.

Now Spotify can use your personal data to start making customized music recommendations for you. You also contact Spotify and limit how they use your data, and for what purpose.

How does it help?

GDPR (General Data Protection Regulation) was drafted to ensure that the privacy rights of EU citizens aren’t threatened in anyway. This new reform was designed to enable EU citizens to have better control over their personal data. The basic concept behind this instrument is to reduce regulation and to reinforce consumer trust.

In the wake of their reforms, data processors and controllers have been ordered to “implement appropriate technical and organizational measures” taking into account “the state of the art and the costs of implementation” and “the nature, scope, context, and purposes of the processing as well as the risk of varying likelihood and severity for the rights and freedoms of individuals.”

A number of security actions were suggested by the regulation that can be considered appropriate to the risk, such as encryption of personal data, ensuring the confidentiality and resilience of systems and services, the timely restoration of data after a technical issue etc.

Importance of Unified Governance

It has been established that unified governance is essential for gaining better business insights and enabling compliance with many complex regulations and laws, such as GDPR or HIPAA.

Without unified data governance, businesses will not be able to comply with the regulations that are redefining how client personal data may be used. They will also be at risk of potential data breaches, penalties, and loss of client trust. Moreover, without client consent to access their data, companies cannot use personal client information to gain business insights and improve the Customer Experience.

Preparing for GDPR

Most organizations aren’t adequately prepared for the May 25th 2018 deadline, and should see this as an opportunity to begin managing their data properly. GDPR makes it even more imperative for companies to implement data and analytics solutions that help them effectively analyze, classify, and manage their data.

They need to have the technologies, processes, and advanced data and analytics capabilities in place to support proper data governance and management, and better provide a positive Customer Experience across channels.

Present & Future Impacts of GDPR

Currently, organizations need to begin preparation measures regarding their data management. In the long term, there’s an opportunity to differentiate your organization from your competition, and secure a competitive advantage by gaining client consent to use personal data and improve the Customer Experience. GDPR increases awareness of the value of personal data, giving customers more control over their own data, which is becoming a “currency” in this digitally driven era.

Want to learn more about the topic? Register now to join me at the live session with Hilary Mason, Dez Blanchfield, Rob Thomas, Kate Silverton, Seth Dobrin and Marc Altshuller.

Follow me on Twitter and LinkedIn for more interesting updates about machine learning and data integration.

Ronald

Ronald helps data driven companies generating business value with best of breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining and is guest author on leading Big Data sites, is speaker/chairman/panel member on national and international webinars and events and runs a successful series of webinar on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a Data Consultancy company, leader in Big Data & data process management solutions. Broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.



The post GDPR – A Change in the Making appeared first on Ronald van Loons.

 

June 19, 2017


Revolution Analytics

Using sparklyr with Microsoft R Server

The sparklyr package (by RStudio) provides a high-level interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In...

...
Ronald van Loon

Digital Transformation Starts with Customer Experience

At the Adobe Summit, I attended an interview with Nick Drake, Senior Vice President, Direct to Consumer at T-Mobile, and Otto Rosenberger, CMO at the Hostelworld Group. The key takeaway of the entire session was that customer experience is the beginning and the core of digital transformations – it is where it all begins.

T-Mobile and Hostelworld are completely different companies, but what connects them is the fact that they both focused on customer experience when transforming their companies.

So why is customer experience the key to it all? Because it links organizations to customers at an emotional and psychological level.

The story of Hostelworld

Hostelworld is now a leading hostel booking platform. Three years ago, it was set up as just a booking engine, as a transactional business. Today, the company accompanies their customers throughout the entire trip. Hostelworld operates globally, with most of the customers based in North America, whereas 30% to 35% come from Europe. What fueled their growth? They went beyond booking, and helped their customers out in each and every way so that they get the best offers around, and can enjoy invitations and group tours during the trip. Almost 50% of their bookers use the app when they are travelling, and 90% of these people say the app made their trip so much more fun.

So what did Hostelworld do differently? They tapped into the emotions of their customers, and offered them the experiences they were looking for. Yes, a lot of internal changes were required, but it was worth it. They had to work on their business goals, operating principles, and the team they had. Additionally, they had to divide the budget appropriately between marketing and tech.

T-Mobile’s Journey

Drake shared T-Mobile's journey. When Drake joined T-Mobile, the company was doing well in terms of customer acquisition, but it wasn't living up to its potential. Only 35% of all acquisitions were made through the digital channel, so Drake's task was to raise the bar.

T-Mobile had to radically transform its business, giving the IT team enough breathing space to re-platform its legacy systems. They decided to go forward with multiple technology platforms, taking a radically different approach to customer experience, but they had to bring about a lot of changes. They understood their audience and figured out ways to interact with them over various channels, while reinventing and customizing their product offerings.

T-Mobile has seen surprising results, and doubled their subscriber base since they stepped into the race. They completely redesigned themselves using the Adobe Marketing Cloud. Using personalized content, they reduced clicks by 60%, and drove higher engagement levels. From a technical perspective, they redeveloped their mobile app in order to provide a better service. A new feature called asynchronous messaging was introduced, which allows users to strike up a conversation with customer services.

Drake advised that it is important to think about what kind of business you are in, and then invest in both the present and the future. There should also be a balance between the commitments made to shareholders and ensuring that those commitments are met over the next few years.

So what does this boil down to in the end?

Experiences impact the way in which people feel and respond. But businesses must provide rich and immersive experiences that go deeper than redesigning and managing interactions. Experience is more about building, and then nurturing an emotional connection with your audience – so that they completely connect to your brand.

Your business may not have begun to transform digitally, but sooner or later you'll have to take the step. And if you don't, you'll be eaten up by the competition; that is what it comes down to.

Stay updated with what's going on in the digital world with Ronald van Loon, Top Ten Global Influencer for Big Data, the Internet of Things (IoT), and Predictive Analytics, on LinkedIn and Twitter.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for data science, big data, business intelligence, and data mining. He is a guest author on leading big data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on big data and digital transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy firm that is a leader in big data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience, and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Digital Transformation Starts with Customer Experience appeared first on Ronald van Loons.

 

June 18, 2017


Simplified Analytics

Top 10 tips for Digital Transformation

Digital Transformation is changing the way customers think & demand new products or services. There is so much discussed in various forums as to how to go digital. No business will...

...
 

June 17, 2017

Cloud Avenue Hadoop Tips

Hardware of the AWS EMR instances

HDFS, which follows a master-slave architecture, has different requirements for the master and the slaves. The master (NameNode) should be a durable machine with more RAM, while the slaves (DataNodes) can be commodity machines and should have more disk to store the blocks of data.

According to the AWS EMR documentation

The collection of EC2 instances that host each node type is called either an instance fleet or a uniform instance group. The instance fleets or uniform instance groups configuration is a choice you make when you create a cluster. It applies to all node types, and it can't be changed later.

It looks like having different EC2 instance types for the NameNode and the DataNodes is not possible, at least in the case of Amazon EMR, which doesn't make much sense. Maybe AWS will add this feature in the future; it should not be a difficult thing to implement, and it is not clear why they haven't done so yet.

One way to get around this problem is to create a fleet of EC2 instances of different types, based on the roles and responsibilities, and install the required software on them. The con of this approach is that we have to maintain everything on our own.
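
As a rough sketch (the AMI ID and the instance types below are hypothetical placeholders, not recommendations), such a fleet could be launched with the AWS CLI and Hadoop installed on the machines afterwards:

$ # one durable, memory-heavy instance for the NameNode
$ aws ec2 run-instances --image-id ami-12345678 --instance-type r4.xlarge --count 1

$ # several storage-heavy commodity instances for the DataNodes
$ aws ec2 run-instances --image-id ami-12345678 --instance-type d2.xlarge --count 4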

In this article the context is HDFS, but the same applies to other software such as Spark and MapReduce, which may also require different machine types for different roles.
 

June 16, 2017


Revolution Analytics

Because it's Friday: Dry Martini Specifications

"Standards are Serious Business" was once the tagline of ANSI, the American National Standards Institute, but this tongue-in-cheek standard (ANSI K100.1-1974, an update to ASA K100.1-1966) is...

...

Revolution Analytics

Applications of R at EARL San Francisco 2017

The Mango team held their first instance of the EARL conference series in San Francisco last month, and it was a fantastic showcase of real-world applications of R. This was a smaller version of the...

...

Revolution Analytics

Schedule for useR!2017 now available

The full schedule of talks for useR!2017, the global R user conference, has now been posted. The conference will feature 16 tutorials, 6 keynotes, 141 full talks, and 86 lightning talks starting on...

...

Curt Monash

Generally available Kudu

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including: Security is an ever...

...
InData Labs

Why Start a Data Science Project with Exploratory Data Analysis?

How to ensure you are ready to use machine learning algorithms in a project? How to choose the most suitable algorithms for your data set? How to define the feature variables that can potentially be used for machine learning? Exploratory Data Analysis (EDA) helps to answer all these questions, ensuring the best outcomes for the...

The post Why Start a Data Science Project with Exploratory Data Analysis? appeared first on InData Labs.


BrightPlanet

Your Data Management Strategy: A Review

In recent years, many organizations have progressed in their data strategies, resulting in much more mature strategies. This is especially true with BrightPlanet customers. As such, the insights in the recent Harvard Business Review article certainly ring true with what we see in the market today. We thought it would be useful to share with you some of […] The post Your Data Management Strategy: A Review appeared first on BrightPlanet.

Read more »
Cloud Avenue Hadoop Tips

GCP for AWS Professionals

As many of you know, I have been working with AWS (the public cloud from Amazon) for quite some time, so I thought of getting my feet wet with GCP (the public cloud from Google). I tried to find a free MOOC around GCP and didn't find many.

Google partnered with Coursera and started a MOOC on the GCP fundamentals. It covers the different GCP fundamentals at a very high level with a few demos; here is the link. There is also documentation from Google comparing the GCP and AWS platforms here.

As I was going through the above mentioned MOOC, I could clearly map many of the GCP services to the AWS services, which I am more comfortable with. Here is the mapping between the different GCP and AWS services. The mappings are really helpful for those who are comfortable with one of the cloud platforms and want to get familiar with the other.

AWS provides free resources for developers to get familiar with and get started on its platform. The same is the case with GCP, which provides a $300 credit valid for 1 year. Both cloud vendors provide a few services for free for the first year, and there are some services that are free for life with some usage limitations.

Maybe it's my perspective, but the content around AWS seems to be much more robust and organized compared to the GCP documentation. The same is the case with the web UI. Anyway, here is the documentation for AWS and here is the documentation for GCP.

Happy clouding.
Cloud Avenue Hadoop Tips

AWS Lambda with the Serverless framework

Google Cloud Functions, IBM OpenWhisk, Azure Functions, and AWS Lambda allow building applications in a serverless fashion. Serverless doesn't mean that there is no server; rather, the cloud vendor takes care of provisioning the servers and scaling them up and down as required. In the serverless model, all we have to do is author a function, the event that triggers it, and the resources the function uses.

Since no servers are being provisioned, there is no constant cost. The cost is directly proportional to the number of times the function is called and the amount of resources it consumes. This model is useful for a startup with limited resources or for a big company that wants to deploy applications quickly.

Here is a nice video from the founder of aCloudGuru on how they built their entire business on a serverless architecture. Here is a nice article/tutorial from AWS on using a Lambda function to shrink images uploaded to an S3 bucket.


In the above workflow, as soon as an image is uploaded to the S3 source bucket, it fires an event that calls the Lambda function. The function shrinks the image and then puts it in the S3 target bucket. In this scenario there is no sustained cost, as we pay based on the number of times the function is called and the amount of resources it consumes.
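
To make the workflow concrete, here is a minimal Python sketch of what such a Lambda handler might look like. It is an illustration only, not the code from the AWS tutorial; it assumes the Pillow library is packaged with the function, and the target bucket name is hypothetical.

import boto3
from io import BytesIO
from PIL import Image

s3 = boto3.client('s3')
TARGET_BUCKET = 'my-target-bucket'  # hypothetical target bucket

def handler(event, context):
    # The S3 PUT event on the source bucket carries the bucket name and object key
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Read the uploaded image into memory
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        image = Image.open(BytesIO(body))

        # Shrink it and write the result to the target bucket
        image.thumbnail((128, 128))
        out = BytesIO()
        image.save(out, format='PNG')
        s3.put_object(Bucket=TARGET_BUCKET, Key=key, Body=out.getvalue())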

For the last few days I have been intrigued by the serverless architecture and have been trying the use cases mentioned in the AWS Lambda documentation here. It was fun, but not straightforward or as simple as just uploading a function. The function has to be authored, packaged, and uploaded; permissions have to be set for the events and resources; the Lambda has to be tested; and finally it has to be integrated with the API Gateway. It's not an impossible task, but it is definitely tedious, and that is before even mentioning the debugging of Lambda functions, which is a real pain.

To the rescue comes the Serverless framework, which makes working with Lambda functions easy. Setting up the Serverless framework on Ubuntu was a breeze, and creating a HelloWorld Lambda function in AWS with all the required dependencies was even easier.

Note that the Serverless framework supports other platforms besides AWS, but in this blog I will provide some screenshots with a brief write-up for AWS. I go with the assumption that the Serverless framework and all the dependencies, like NodeJS and the integration with the AWS security credentials, have already been set up.

So, here are the steps.

Step 1: Create the skeleton code and the Serverless Lambda configuration file.
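
For reference, the skeleton can be generated with the serverless CLI roughly as follows; the template and service path below are just an example, not necessarily the ones used in the screenshots.

$ serverless create --template aws-python --path hello-world
$ cd hello-world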



Step 2: Deploy the code to AWS using CloudFormation.
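
Assuming the AWS credentials are already configured for the CLI, the deployment itself is a single command run from the service directory:

$ serverless deploy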

 
Once the deployment is complete, the AWS resources will be created, as shown below in the CloudFormation console.


As seen in the below screen, the Lambda function also gets deployed.


Step 3: Now it is time to invoke the Lambda function, again using the Serverless framework, as shown below.
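
With the default template, the generated function is named hello in serverless.yml, so the invocation looks roughly like this:

$ serverless invoke --function hello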


In the Lambda management console, we can observe that the Lambda function was called once.


Step 4: Undeploy the Lambda function using the Serverless framework.
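
Tearing everything down is again a single command, which deletes the CloudFormation stack and the resources it created:

$ serverless remove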


The CloudFormation stack should be removed and all the associated resources should be deleted.


This completes the steps for a simple demo of the Serverless framework. A couple of things to note: we simply uploaded a Python function to the AWS cloud and never created a server ourselves, hence the serverless architecture. Here are a few more resources to get started with serverless architecture.

On a side note, I was really surprised that the Serverless framework was started by CocaCola, which at some point decided to open source it on GitHub. Until now it was companies like Facebook, Google, Twitter, LinkedIn, Netflix, and Microsoft :) opening their internal software to the public, but this was the first time I saw something from CocaCola.

Maybe I should drink more CocaCola to encourage them to publish more such cool frameworks.
 

June 15, 2017

Silicon Valley Data Science

Five Business Challenges Data Can Solve

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

It’s 2017, and there’s an AI assistant in your pocket. You don’t need convincing that there’s power in data to change your business; you’ve probably got one or more big data projects running. Kicking the tires has been relatively straightforward, and you’re ready to take things to the next level.

The next level, however, is a much bigger deal than those islands of encouraging success. Moving the business forward meaningfully—new product offerings, for example—usually means collaboration across functions and silos, and the IT and business sides of the house working closely together. That’s not an everyday experience, and the journey to using data effectively means adopting many of the traits of successful software and web companies: figuring out how to deploy technology in strategic support of the business.

At SVDS, we’ve helped numerous clients through these data transformations. Before we get going writing code or unleashing our data scientists, the first step we take is to create a modern data strategy. For us, that means a living roadmap of what to tackle, and in which order, to get results.

In today’s business climate, executives understandably want to see both early results and a long-term direction. A data strategy helps meet business needs, while ordering work in a way that respects constraints and creates future opportunities. If that sounds a little like motherhood and apple pie, get some business and IT leaders in the same room and see if they see priorities and resource requirements the same way! Even if it provides nothing else novel, creating a data strategy enables agreement across these stakeholders.

When do you need a data strategy?

While it’s not hard to see the need for data strategy, it can be difficult to map that to your current position. In this section, we present some of the more common situations that lead to charting a data strategy; use these to start planning your own path. The key similarity in all these situations is that the end goal is to create new data capabilities that enable progress, and that the stakeholders are drawn from across the business.

Infrastructure is inhibiting growth

The more customers you get, the more data you get. The problem is that traditional data systems rarely scale linearly: when you reach a certain point, scaling issues become pathological, and it’s time to move to a new platform. Doing that at the same time as maintaining existing systems is a challenge that requires careful planning.

Infrastructure is inhibiting development

It might not be data size that’s the challenge, but keeping your offerings competitive. In an era where great user experiences, personalization, and “always connected” are the expectations for both consumer and business customers, keeping up is essential. Typically, these capabilities are not simple additions, but part of a move towards a new platform. And new data streams often mean the involvement of many more stakeholders than before, across the organization, from R&D through to marketing and finance.

Undergoing an analog to digital transformation

The march of digitization is uneven: though we interact via smartphones every day, many processes and systems still operate in the analog world. Through moving paper processes into software, or bringing hardware online into the Internet of Things, great efficiencies and insight can be gained. The challenge is not just to catch up to the last decade, but to harness the capabilities of today’s technologies, such as machine intelligence, to leapfrog into a competitive position.

Changing business models

There are now new models for businesses, created by the directly connected internet consumer, the scalability of big data technologies, and the application of machine intelligence. Ask anyone if they thought five years ago that a Silicon Valley firm would reinvent the taxi industry. Stores have progressed to make the things they used to just resell, and publications can now sell the things they used to just advertise. Businesses can find new models thanks to the new capabilities available, but they require a technology path that supports and validates these ambitions.

Unifying fragmented offerings

No business grows in a perfectly planned evolution. Businesses grow lumpily, adding on developments that met a market need at the time but don’t quite connect up to new product offerings, or through M&A, where every acquired system and product generates both redundant cost and operational incompatibility. Operational costs are raised as a result, but crucially, it’s the customer that suffers as well. For example, consider the operational and patient impact caused by the lack of integration between medical devices. New data capabilities mean we can create a unified infrastructure, giving new life to existing products and creating future opportunity.

Digital is in the driving seat

The rapid sweep of digitization across industries means that data infrastructure is a key component for creating efficiencies, growth and future potential. It can’t be managed in isolation as an IT concern, but as part of the overall company strategy. That’s where data strategy comes in: bridging and coordinating between business ambitions and the necessary investments in systems, code and data science.

If any of the situations in this article resonate with your challenges, and you’d like help or advice, please reach out to us to talk. To discover more about how our approach to data strategy can help, you can also download our position paper.

The post Five Business Challenges Data Can Solve appeared first on Silicon Valley Data Science.

 

June 14, 2017


Revolution Analytics

Studying disease with R: RECON, The R Epidemics Consortium

For almost a year now, a collection of researchers from around the world has been collaborating to develop the next generation of analysis tools for disease outbreak response using R. The R Epidemics...

...

Revolution Analytics

Demo: Real-Time Predictions with Microsoft R Server

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to...

...

Curt Monash

The data security mess

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:...

...

Curt Monash

Light-touch managed services

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model: Altus manages jobs for you. But you actually run them on your own cluster, and so you...

...

Curt Monash

Cloudera Altus

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to: Provide all the important advantages of on-premises Cloudera. Provide all the...

...
Jean Francois Puget

Kaggle Master


 

Do you have spare time in the evenings and on weekends?  Here is a great way to use it: enter machine learning competitions.  That's what I have been doing for a year, as often as I can.

The latest competition I entered, the Quora competition on Kaggle, went quite well for me, as my team finished in the gold zone, placing 12th among more than 3,300 teams.  Here is how Quora describes the problem:

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high quality answers to questions resulting in an improved experience for Quora writers, seekers, and readers.

The problem was to predict, given a pair of questions, if the two questions are about the same topic or not. 

We were given approximately 400,000 question pairs to train our models.  For each pair we knew the target value: 1 if the two questions are duplicates, 0 if they are not.  The test set was made of about 2,400,000 question pairs, and we needed to build models that predict which of these pairs are duplicates.

Competition duration was 3 months.  I entered that competition to learn about natural language processing (NLP), a domain entirely unknown to me at the start of the competition.

In order to get a good result I teamed up with 3 other great data scientists, with varying skills: Les 4 Mousquetaires.  A short technical overview of our solution can be found here.  We had to combine techniques from deep learning (LSTM and GRU neural networks) and classical machine learning (XGBoost and LightGBM) to perform well.  Covering all of them at an expert level in such a short time was beyond my skills, which is why teaming up was necessary.  I thank my teammates, as I would not have reached that level without them.

That competition was one of the most popular ever on Kaggle.  It was also a tough one, with the top 9 Kaggle competitors engaged.  We had to wait until the very last day, when we made significant progress, to get into the top 16 teams and earn a gold medal.  As a result I was promoted to Kaggle Competition Master.

This alone is not worth a blog entry, but I'm sharing it because Kaggle enforces a rather sound machine learning methodology, similar to what I described in Be Brave In Machine Learning.

Let me expand a bit on it.  For each competition, we are given a training data set, i.e. a set of examples, and for each training example a target value. This is typical of supervised machine learning.  A second data set is provided, called the test set.  The test set is a set of examples similar to the training set, except that the target is kept hidden from participants.  The goal of competitors is to create machine learning models and use these models to predict the target on the test set. 

The test set is split into a public test set and a private test set.  The split isn't disclosed to participants.  When one submits predictions for the test set, the public test set target is used as a benchmark.  The participant gets feedback on how his or her predictions fare compared to the true target on the public test set.  This feedback can be used to select among various models.  Competitors need to select 2 of their models before the competition ends.

After competition ends, the private test set is used for final evaluation of the selected models.  At no point in time participants have any feedback on how their models behave on the private test set.

This process is very close to recommended ML practice.  See for instance Elements of Statistical Learning, 10th print, page 241 (emphasis is mine):

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

Kaggle's public test set plays the role of the validation set, while the Kaggle private test set plays the role of the test set.  Entering one of their competitions (or competitions hosted by other sites) is a good way to practice the right machine learning methodology.
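
As a minimal sketch of the same methodology outside of Kaggle (with purely hypothetical data), one can carve out a validation set for model selection and a test set that is only touched once at the end:

import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical features and binary target
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)

# hold out a test set that plays the role of Kaggle's private test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# split the remainder into a training set and a validation set for model selection
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)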

All in all, this competition has been a great experience. I learned a lot about NLP, and I learned a lot about teaming up for Kaggle.  The outcome was great.  It was also very time consuming, eating most of my evenings and weekends during the last month.  I thank my wife, who has been extremely patient and supportive during that period.  Last but not least, now that it is over I have more time to blog again ;)

 

June 13, 2017


Revolution Analytics

Syberia: A development framework for R code in production

Putting R code into production generally involves orchestrating the execution of a series of R scripts. Even if much of the application logic is encoded into R packages, a run-time environment...

...
 

June 12, 2017


Revolution Analytics

Interfacing with APIs using R: the basics

While R (and its package ecosystem) provides a wealth of functions for querying and analyzing data, in our cloud-enabled world there's now a plethora of online services with APIs you can use to...

...
Ronald van Loon

How to Start Incorporating Machine Learning in Enterprises

The world is long past the Industrial Revolution, and now we are experiencing an era of Digital Revolution. Machine Learning, Artificial Intelligence, and Big Data Analysis are the reality of today’s world.

I recently had a chance to talk to Ciaran Dynes, Senior Vice President of Products at Talend and Justin Mullen, Managing Director at Datalytyx. Talend is a software integration vendor that provides Big Data solutions to enterprises, and Datalytyx is a leading provider of big data engineering, data analytics, and cloud solutions, enabling faster, more effective, and more profitable decision-making throughout an enterprise.

The Evolution of Big Data Operations

To understand more about the evolution of big data operations, I asked Justin Mullen about the challenges his company faced five years ago and why they were looking for modern integration platforms. He responded with, “We faced similar challenges to what our customers were facing. Before Big Data analytics, it was what I call ‘Difficult Data analytics.’ There was a lot of manual aggregation and crunching of data from largely on premise systems. And then the biggest challenge that we probably faced was centralizing and trusting the data before applying the different analytical algorithms available to analyze the raw data and visualize the results in meaningful ways for the business to understand.”

He further added that, “Our clients not only wanted this analysis once, but they wanted continuous refreshes of updates on KPI performance across months and years. With manual data engineering practices, it was very difficult for us to meet the requirements of our clients, and that is when we decided we needed a robust and trustworthy data management platform that solves these challenges.”

The Automation and Data Science

Many economists and social scientists are concerned about automation taking over manufacturing and commercial processes. If digitalization and automation continue to grow at their current pace, there is a high probability of machines partly replacing humans in the workforce. We are already seeing some examples of this phenomenon today, but it is predicted to become far more prominent in the future.

However, Dynes says, “Data scientists are providing solutions to intricate and complex problems confronted by various sectors today. They are utilizing useful information from data analysis to understand and fix things. Data science is an input and the output is yielded in the form of automation. Machines automate, but humans provide the necessary input to get the desired output.”

This creates a balance in the demand for human and machine services. Automation and data science go hand in hand; one is incomplete without the other. Raw data is worth nothing if it cannot be manipulated to produce meaningful results, and similarly, machine learning cannot happen without sufficient and relevant data.

Start Incorporating Big Data and Machine Learning Solutions into Business Models

Dynes says, “Enterprises are realizing the importance of data, and are incorporating Big Data and Machine Learning solutions into their business models.” He further adds that, “We see automation happening all around us. It is evident in the ecommerce and manufacturing sectors, and has vast applications in the mobile banking and finance.”

When I asked him about his opinion regarding the transformation in the demand of machine learning processes and platforms, he added that, “The demand has always been there. Data analysis was equally useful five years ago as it is now. The only difference is that five years ago there was entrepreneurial monopoly and data was stored secretively. Whoever had the data, had the power, and there were only a few prominent market players who had the access to data.”

Justin has worked with different companies. Some of his most prominent clients were Calor Gas, Jaeger and Wejo. When talking about the challenges those companies faced before implementing advanced analytics or machine learning he said, “The biggest challenges most of my clients face was the accumulation of the essential data at one place so that the complex algorithms can be run simultaneously but the results can be viewed in one place for better analysis. The data plumbing and data pipelines were critical to enable data insights to become continuous rather than one-off.”

The Reasons for Rapid Digitalization

Dynes says, “We are experiencing rapid digitalization because of two major reasons. The technology has evolved at an exponential rate in the last couple of years and secondly, organization culture has evolved massively.” He adds, “With the advent of open source technologies and cloud platforms, data is now more accessible. More people have now access to information, and they are using this information to their benefits.”

In addition to the advancements and developments in the technology, “the new generation entering the workforce is also tech dependent. They rely heavily on the technology for their everyday mundane tasks. They are more open to transparent communication. Therefore, it is easier to gather data from this generation, because they are ready to talk about their opinions and preferences. They are ready to ask and answer impossible questions,” says Dynes.

Integrating a New World with the Old World

When talking about the challenges that companies face while opting for Big Data analytics solutions Mullen adds, “The challenges currently faced by industry while employing machine learning are twofold. The first challenge they face is related to data collection, data ingestion, data curation (quality) and then data aggregation. The second challenge is to combat the lack of human skills in data-engineering, advanced analytics, and machine learning”

“You need to integrate a new world with the old world.

The old world relied heavily on data collection in big batches

while the new world focuses mainly on the real-time data solutions”

Dynes says, “You need to integrate a new world with the old world. The old world relied heavily on data collection while the new world focuses mainly on the data solutions. There are limited solutions in the industry today that deliver on both these requirements at once right now.”

He concludes by saying that, “The importance of data engineering cannot be neglected, and machine learning is like Pandora’s Box. Its applications are widely seen in many sectors, and once you establish yourself as a quality provider, businesses will come to you for your services. Which is a good thing.”

Follow Ciaran Dynes, Justin Mullen, and Ronald van Loon on Twitter and LinkedIn for more interesting updates on Big Data solutions and machine learning.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for data science, big data, business intelligence, and data mining. He is a guest author on leading big data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on big data and digital transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy firm that is a leader in big data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience, and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post How to Start Incorporating Machine Learning in Enterprises appeared first on Ronald van Loons.

Jean Francois Puget

Zhihu Live


 

I went to China in April to meet colleagues and participate in various conferences and events.  I was very happy and honored to have a session on Zhihu Live, a very respected Chinese media platform.  I spoke about artificial intelligence, machine learning, and deep learning.  The most important part was a Q&A session with attendees.

The questions themselves are very interesting to me, as they paint a landscape of what is hot in China now.  Not surprisingly, deep learning is hot, but I think the focus on deep learning is stronger than in Western countries.  I reproduce the questions below together with the answers I gave. 

The questions were most often asked in Chinese (Mandarin), and were translated by my colleagues Ke Wei Wei and Henry Zeng.  They also translated my answers back to Chinese.

 

  1. Are the mature machine learning algorithms used commercially?

    Yes.  Machine learning is used in several areas.  One of them is product recommendation, where matrix factorization algorithms are routinely used.  Other areas where machine learning is now used commercially include natural language processing, image recognition, sales forecasting, predictive maintenance, and customer churn prediction.

 

  2. What can be done in machine learning? What products?

    A lot can be done as soon as one has a clear business goal and the data that supports this business goal.  For instance, if your business goal is to reduce the time it takes to ship goods once they are ordered, then you must have enough data from the past to learn what influences the time to delivery. 

 

  3. How does one implement and realize machine learning?

    Start small with a well-defined, small-scope project.  Then use open source to build models.  Then use an industry platform like IBM Machine Learning to manage the lifecycle of your models.

 

  4. Voice recognition, natural language processing, and image recognition in e-commerce currently amount to little more than voice-based customer service, plus deep-learning-based search, recommendation, identification, and so on. Are there any other application directions?

    Customer service is key but there are other areas where ML is relevant.  For instance, predictive maintenance is a great area for machine learning. The idea is to use IoT to gather information on various equipment, and predict their health condition so that failures can be prevented.  Another area is health, where machine learning can help diagnosis, and help select best treatment.

 

  5. So for classification, what are the classic cases? Do you have any thoughts or suggestions? When do we need to consider complex models?

    Anomaly detection is a classical use case where you want to distinguish between what is normal and what is not.  This is a two class, or binary classification, problem.  This includes fraud detection (normal vs fraud), predictive maintenance (normal operation vs failure), health (normal vs disease), etc.  I recommend starting with simple models, e.g. logistic regression, then look for more complex models, e.g. gradient boosted decision trees or deep learning, if accuracy isn’t good enough and if there is lots of training data.
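
    As a small illustration of that progression (using synthetic data, purely a sketch and not part of the original Q&A), one might compare the simple baseline against a more complex model and only keep the complexity if it pays off:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # synthetic binary classification data standing in for a real two-class problem
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())   # simple baseline
    print(cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean())        # more complex model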

 

 

  6. What is the progress of the cancer diagnosis project jointly studied by IBM Watson and the Department of Medicine of the University of Tokyo last year?

    I don’t know, I need to check.

 

  7. Watson Pepper is able to get images and text from social media. I wonder how Pepper processes the information and what it is for?

    Watson Pepper uses deep learning to process that data.

 

  8. Do you think there is any bubble/hype in machine learning now?

    Yes.  I think deep learning is oversold and that people have unrealistic expectations.  Deep learning is great, and it enables breakthroughs in computer vision and natural language processing.  But this comes with significant investment and lots of data.  Most companies do not have enough data to make deep learning relevant.  Moreover, deep learning isn’t the technology of choice in many areas where other machine learning techniques are better suited.  I wish the power and limits of deep learning were better explained in general.

 

  9. Why is deep learning more academic than industrial?

    This is changing fast. The most advanced teams work in companies like IBM, Facebook, Google, etc, not academia.  Yet, deep learning is still in the hands of researchers instead of engineers.  One reason is that deep learning is not well understood. Designing the right network architecture is still an art that few master.

 

  10. What do you think of transfer learning?

    It is a great idea.  It can save lots of time when training complex models.

 

  11. If deep learning has better performance than other algorithms, is it possible for it to replace the other classic ML algorithms?

    Deep learning has greater performance for sound and images, but not for the rest.  The other classic ML algorithms are here to stay for a while for many ML applications,  either because deep learning isn’t yielding good results, or because there isn’t enough training data.

 

  12. What are the industry application directions of unsupervised learning?
    I don’t think unsupervised learning is used much as a standalone technique.  Unsupervised learning is used a lot as a preprocessing step for supervised learning.  For instance, clustering data then using cluster id as a new feature may help the performance of supervised ML algorithms.
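
    A tiny sketch of that preprocessing idea (with synthetic data, added here for illustration rather than taken from the talk): cluster first, then feed the cluster id to a supervised model as an extra feature.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import GradientBoostingClassifier

    X = np.random.rand(500, 10)           # hypothetical features
    y = np.random.randint(2, size=500)    # hypothetical binary target

    # unsupervised step: cluster the rows and keep the cluster id
    cluster_id = KMeans(n_clusters=5, random_state=0).fit_predict(X)

    # supervised step: train on the original features plus the cluster id
    X_aug = np.column_stack([X, cluster_id])
    clf = GradientBoostingClassifier().fit(X_aug, y)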

 

  13. If enterprises use machine learning, how should they start? Is the technical threshold high? Which industries have the opportunity?
    Enterprises need to start with trained data scientists on a small, well-defined project.  Enterprises can train their employees to become data scientists via online courses like the Stanford ML course on Coursera.  But training isn’t enough; people must practice.  A good way to practice is to enter ML competitions; several websites host such competitions.

 

  14. What is the difficulty of reinforcement learning? Is it closer to general AI?

    Reinforcement learning aims at learning next best action.  It demonstrated great success in domains where the number of possible actions is limited, like board games (Go), or Poker. It remains to be seen how these successes can be extended to real world situations where the number of possible actions is endless.  If we can do it, then yes, we would be closer to general AI.

 

  15. Is there a plan to release IBM NN chips to the market?

    IBM does not disclose plans about potential future products.

 

  16. I’m trying to predict the price of artworks with machine learning. In the training data, the price of the works and other parameters are known. I would like to know which kind of algorithm I should use, supervised or unsupervised? Can IBM’s current products do it?

    You need to use regression algorithms.  I guess you want to learn both from art images, and meta data such as artist, year of creation, dimensions, material, etc.  I would recommend a mix of deep learning to process the images, and classical ML for the rest.  My favorite classical ML algorithm is gradient boosted decision trees like XGBoost or LightGBM.  We intend to support these in IBM ML.

 

  17. Do you think there is any privacy issue in machine learning?

    Yes, definitely. Think of using ML for health, for instance for diagnosing cancer from lung X-rays.  In order to train an ML model you need to get a large sample of lung X-rays.  If not dealt with carefully, it can be possible to identify who has cancer and who does not from the training data.  This would be a major privacy breach, and could be unlawful in some countries.  One way to deal with it is to anonymize the data before it is sent to the machine learners.

     
  18. Is it possible to combine deep learning with traditional programming? Will the development of NTM replace that of some programs?

    Not sure I understand the question right.  If you ask about combining deep learning with traditional machine learning, the answer is definitely yes.  For instance, if you have training data that is a mix of pictures and structured data, you would use an ensemble approach.  Train a deep learning model on the pictures, train a classical ML model on the rest of the features, then use a third classifier that takes the predictions of the first two models as input.

 

  19. Deep learning is a multi-layer NN; is it possible to use other multi-layer algorithms, say, multi-layer trees?

    Yes, see for instance deep forests: https://arxiv.org/abs/1702.08835

Jean Francois Puget

Data Science Is Not Dead


 

Is data science really dead?  One can wonder after reading Jeroen ter Heerdt's Data Science is Dead.  If you haven't read it, then you probably should.  Jeroen's point is that lots of business use cases for data science are now served by cloud services that are very simple to use for a non data scientist. 

This is true.  Lots of companies now provide APIs that rely on machine learning models internally.  IBM is no exception with its Watson Developer Cloud APIs.  When you have a use case that is covered by one of these APIs, you no longer need a data science team for it.

I disagree with Jeroen when he concludes that the world no longer needs data scientists.  True, the most common use cases will be, or already are, packaged as ready-to-use services.  But there is a very long tail of use cases where developing a packaged service isn't worth it.  We still need data scientists to solve these remaining use cases. 

This situation reminds me of operations research (OR).  Lots of operations research applications are delivered as prepackaged solutions, be it for supply chain management, price optimization, aircraft scheduling, crew scheduling, production planning, you name it.  Yet many companies still have an OR department (under that name, or as part of their data science department) because they have operations problems that cannot be optimized with off-the-shelf software or services. 

Jean Francois Puget

Installing XGBoost For Anaconda on Windows

Update on June 12, 2017.  Guido Tapia builds xgboost executables for Windows every night: http://www.picnet.com.au/blogs/guido/post/2016/09/22/xgboost-windows-x64-binaries-for-download/ .  This may be easier than what follows.

XGBoost is a recent implementation of Boosted Trees.  It is a machine learning algorithm that yields great results on recent Kaggle competitions.  I decided to install it on my computers to give it a try.  Installation on OSX was straightforward using these instructions (as a matter of fact, reality is a bit more complex, see the update at the bottom of this post).  Installation on Windows was not as straightforward.  I am sharing what worked for me in case it might help others.  I describe how to install for the Anaconda Python distribution, but it might work as-is for other Python distributions. 

In order to install and use XGBoost with Python you need three pieces of software on your Windows machine:

  • A Python installation such as Anaconda.
  • Git
  • MINGW

I assume you have Anaconda up and running.  I am using Anaconda for Python 3.5.

Git installation is quite easy.  There are several options, one is to use Git for Windows.  Just download and save the installer file on your disk, then launch it by double clicking it.  You may need to authorize this operation.  Then follow the installer instructions. 

Once the installation has completed look for a program called Git Bash in your start menu.  Launch it.  It starts a terminal running the Bash shell.  This is different from the regular Windows terminal, but it is more handy for what we need to do.  First, go to the directory where you want to save XGBoost code by typing the cd command in the bash terminal.  I used the following.

 $ cd /c/Users/IBM_ADMIN/code/

Then download XGBoost by typing the following commands. 

$ git clone --recursive https://github.com/dmlc/xgboost
$ cd xgboost
$ git submodule init
$ git submodule update

The next step is to build XGBoost on your machine, i.e. compile the code we just downloaded.  For this we need a full-fledged 64-bit compiler, provided with MinGW-W64.  I downloaded the installer from this link.  Save the file on your disk, then launch it by double clicking it.  You may need to authorize this operation.  Then click Next on the first screen.


Then select the x86_64 item in the architecture menu.  Do not modify the other settings. 


Then click Next and follow the instructions.  On my machine, it installed the compiler in the C:\Program Files\mingw-w64\x86_64-5.3.0-posix-seh-rt_v4-rev0 directory.  The make command and the runtime libraries are in this directory (look for the directory that contains mingw32-make):

C:\Program Files\mingw-w64\x86_64-5.3.0-posix-seh-rt_v4-rev0\mingw64\bin

Use these instructions, depending on your Windows version, to add the above to the Path system variable.

Then close the Git Bash terminal, and launch it again.  This will take into account the new Path variable.  To check you are fine, type the following

$ which mingw32-make

It should return something like:

/c/Program Files/mingw-w64/x86_64-5.3.0-posix-seh-rt_v4-rev0/mingw64/bin/mingw32-make

To make our life easier, let us alias it as follows:

$ alias make='mingw32-make'

We can now build XGBoost.  We first go back to the directory where we downloaded it:

 $ cd /c/Users/IBM_ADMIN/code/xgboost

The command given in the instructions does not work as I write this blog entry.  Until this is fixed, we need to compile each sub module explicitly with the following commands.  Wait until each make command is completed before typing the next command.

$ cd dmlc-core

$ make -j4

$ cd ../rabit

$ make lib/librabit_empty.a -j4

$ cd ..

$ cp make/mingw64.mk config.mk

$ make -j4

Once the last command completes the build is done. 

We can now install the Python module.  What follows depends on the Python distribution you are using.  For Anaconda, I will simply use the Anaconda prompt, and type the following in it (after the prompt, in my case [Anaconda3] C:\Users\IBM_ADMIN>):

[Anaconda3] C:\Users\IBM_ADMIN>cd code\xgboost\python-package

The point is to move to the python-package directory of XGBoost.  Then type:

[Anaconda3] C:\Users\IBM_ADMIN\code\xgboost\python-package>python setup.py install

We are almost done.  Let's launch a notebook to test XGBoost.  Importing it directly causes an error.  In order to avoid it we must add the path to the g++ runtime libraries to the os environment path variable with:

import os

mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'

os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

We can then import xgboost and run a small example.

import xgboost as xgb
import numpy as np

data = np.random.rand(5,10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix( data, label=label)

dtest = dtrain

param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' }
param['nthread'] = 4
param['eval_metric'] = 'auc'

evallist  = [(dtest,'eval'), (dtrain,'train')]

num_round = 10
bst = xgb.train( param, dtrain, num_round, evallist )

bst.dump_model('dump.raw.txt')

We are all set!

 

Update on April 15, 2016.  The OSX installation is a bit more complex than I wrote here if we want to be able to use XGBoost in multi-threaded mode.  I provide instructions for it in Installing XGBoost on Mac OSX.

 

 

June 11, 2017


Simplified Analytics

Top 5 uses of Internet of Things!!

While many organizations are creating tremendous value from the IoT, some organizations are still struggling to get started.  It has now become one of the key elements of Digital Transformation...

...
 

June 09, 2017


Revolution Analytics

Because it's Friday: Subway maps to scale

Many subway maps are masterpieces of information design, but inevitably make compromises in geographic fidelity for clarity. Inspired by a viral post on Reddit, you can now find a collection of...

...
 

June 08, 2017


Revolution Analytics

Run massive parallel R jobs cheaply with updated doAzureParallel package

At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package . As we've noted here before, this package allows you to...

...
Silicon Valley Data Science

Exploratory Data Analysis in Python

Earlier this year, we wrote about the value of exploratory data analysis and why you should care. In that post, we covered at a very high level what exploratory data analysis (EDA) is, and the reasons both the data scientist and business stakeholder should find it critical to the success of their analytical projects. However, that post may have left you wondering: How do I do EDA myself?

Last month, my fellow senior data scientist, Jonathan Whitmore, and I taught a tutorial at PyCon titled Exploratory Data Analysis in Python—you can watch it here. In this post, we will summarize the objectives and contents of the tutorial, and then provide instructions for following along so you can begin developing your own EDA skills.

Tutorial Objectives

Our objectives for this tutorial were to help those attending:

  1. Develop the EDA mindset by walking through questions to ask yourself at the various stages of exploration and pointing out things to watch out for
  2. Learn how to invoke some basic EDA methods effectively, in order to understand datasets and prepare for more advanced analysis. These basic methods include:
    • slicing and dicing
    • calculating summary statistics
    • basic plotting for numerical and categorical data
    • basic visualization of geospatial data on maps
    • using Jupyter Notebook widgets for interactive exploration
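
As a minimal, hypothetical illustration of the first three of these methods with pandas (the file name and column names below are made up, not taken from the tutorial datasets), in a notebook you might run:

import pandas as pd

# hypothetical country-level panel data, loosely in the spirit of the AQUASTAT dataset
df = pd.read_csv('water_indicators.csv')

# slicing and dicing: one country, a couple of variables
df[df['country'] == 'Iceland'][['year', 'total_renewable_water']]

# summary statistics for the numerical columns
df.describe()

# a basic plot of one variable over time
df.groupby('year')['total_renewable_water'].mean().plot()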

We view EDA very much like a tree: there is a basic series of steps you perform every time you perform EDA (the main trunk of the tree) but at each step, observations will lead you down other avenues (branches) of exploration by raising questions you want to answer or hypotheses you want to test.

Which branches you pursue will depend on what is interesting or pertinent to you. As such, the actual exploration you do while following along on this tutorial will be yours. We have no answers or set of conclusions we think you should come to about the datasets. Our goal is simply to aid in making your exploration as effective as possible, and to let you have the fun of choosing which branches to follow.

Tutorial Outline

The talk consists of the following:

  1. Introduction to exploratory data analysis: I summarize the motivation for EDA and our general strategy that we dive deeper into throughout the tutorial.
  2. Introduction to Jupyter Notebooks: our tutorial entails working through a series of Jupyter Notebooks and so Jonathan gives a quick introduction to using them for those who haven’t seen them before. We even learn a new trick from an attendee!
  3. Exploratory analysis of the Redcard dataset: Jonathan works through an exploratory analysis of a dataset that comes from a fascinating paper published with commentary in Nature. The core question of the paper is reflected in the title, “Many analysts, one dataset: Making transparent how variations in analytical choices affect results”. The authors recruited around 30 analytic teams who were each tasked with the same research question: “Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players?” and given the same data. The dataset came from the players who played in the 2012–13 European football (soccer) professional leagues. Data about the players’ ages, heights, weights, position, skintone rating, and more were included. The results from the teams were then compared to see how the different ways of looking at the dataset yielded different statistical conclusions. The rich dataset provides ample opportunity to perform exploratory data analysis. From deciding hierarchical field positions, to quantiles in height or weight. We demonstrate several useful libraries including standard libraries like pandas, as well as lesser known libraries like missingno, pandas-profiling, and pivottablejs.
  4. Exploratory analysis of the AQUASTAT dataset: In this section, I work through exploration of the Food and Agriculture Organization (FAO) of the United Nation’s AQUASTAT dataset. This dataset provides metrics around water availability and water use, as well as other demographic data for each country, reported every five years since 1952. This dataset is often called panel or longitudinal data because it is data that is repeatedly collected for the same subjects (in this case, countries) over time. We discuss methods for exploring it as panel data, as well as methods focused on looking at only a cross-section of the data (data collected for a single time period across the countries). The data also is geospatial, as each observation corresponds to a geolocated area. We show how to look at very basic data on maps in Python, but geospatial analysis is a deep field and we scratch only the surface of it while looking at this dataset. We recommend the PySAL tutorial as an introduction to geospatial analysis in Python.
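
The snippet below is a hedged sketch of how the lesser-known libraries mentioned above are typically invoked inside a notebook, plus a basic cross-section of panel data. The file names (redcard.csv, aquastat.csv) and column names are placeholders, not the tutorial's actual notebook code.

    import pandas as pd
    import missingno as msno
    import pandas_profiling
    from pivottablejs import pivot_ui

    # Redcard-style data: quick structural overviews
    redcard = pd.read_csv("redcard.csv")      # hypothetical extract
    msno.matrix(redcard)                      # visualize missing values per column
    report = pandas_profiling.ProfileReport(redcard)  # distributions, correlations, missingness
    report                                    # renders inline as a notebook cell output
    pivot_ui(redcard)                         # drag-and-drop pivot tables in the notebook

    # AQUASTAT-style panel data: take a single cross-section, then pivot
    aquastat = pd.read_csv("aquastat.csv")    # hypothetical long-format extract
    latest = aquastat[aquastat["time_period"] == "2013-2017"]
    cross_section = latest.pivot_table(index="country",
                                       columns="variable",
                                       values="value")
    print(cross_section.describe())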

Following Along at Home

To get full value out of this tutorial, we recommend actually working through the Jupyter notebooks that we have developed. You can do this in one of two ways:

  1. In the cloud via Microsoft Azure notebooks: Set up an account and then clone this library. Cloning the library lets you open, edit, and run each Jupyter notebook online without having to set up Jupyter or a Python environment yourself. The service is free and your notebooks are saved for future use. The one caveat is that while notebooks persist indefinitely, data and other non-notebook files are not kept beyond the working session: you can import and analyze data, but any results stored outside the notebook must be downloaded before you leave.
  2. Locally on your computer: Clone the github repo here and set up your Python environment according to the instructions found in the README.

Following along at home, you have the benefit of being able to put us on pause. We went through a lot of material in the three hours of the tutorial (and had to deal with some of the technical troubles inevitable during a hands-on tutorial of 65+ people using computers with different operating systems and various company firewalls!). To get full value out of the content, we suggest you pause throughout the tutorial when there are suggestions to try certain analyses yourself.

The possibilities for EDA are endless, even for a single dataset. You may want to look at the data in different ways and we welcome you to submit your own EDA notebooks for either or both of the datasets through a pull request in the github repo. We will provide feedback and approve PRs for your approaches to be shared with others developing their EDA skills.

The post Exploratory Data Analysis in Python appeared first on Silicon Valley Data Science.

Principa

What is Scorecard Monitoring and why is it so critical?

Scorecards form the backbone of decision making for many financial institutions. They are used in account management and in key decision areas such as collections and authorisations. They can tell us whether to accept or decline a customer for a particular credit-based product, or the percentage of a customer’s outstanding balance that is likely to be recovered over a certain period of time.

In this blog post, we’ll be covering what scorecard monitoring is, its importance and the consequences of not carrying out the exercise regularly.

Ronald van Loon

Securing Competitive Advantage with Machine Learning

Business dynamics are evolving with every passing second. There is no doubt that the competition in today’s business world is much more intense than it was a decade ago. Companies are fighting to hold on to any advantages.

Digitalization and the introduction of machine learning into day-to-day business processes have created a prominent structural shift in the last decade. The algorithms have continuously improved and developed.

Every idea that has completely transformed our lives was initially met with criticism. Skepticism always precedes acceptance, and only when an idea becomes reality does the mainstream truly embrace it. Data integration, data visualization, and data analytics were no different at first.

Incorporating data structures into business processes to reach valuable conclusions is not a new practice; the methods, however, have continuously improved. Initially, such data was available only to governments, which used it to devise defense strategies. Ever heard of Enigma?

In the modern day, continuous development and improvement in data structures, along with the introduction of open-source, cloud-based platforms, have made it possible for everyone to access data. The commercialization of data has minimized public criticism and skepticism.

Companies now realize that data is knowledge and knowledge is power. Data is probably the most important asset a company owns. Businesses go to great lengths to obtain more information, improve the processes of data analytics and protect that data from potential theft. This is because nearly anything about a business can be revealed by crunching the right data.

It is impossible to reap the maximum benefit from data integration without incorporating the right kind of data structure. The foundation of a data-driven organization is laid on four pillars. It becomes increasingly difficult for any organization to thrive if it lacks any of the following features.

Here are the four key elements of a comprehensive data management system:

  • Hybrid data management
  • Unified governance
  • Data science and machine learning
  • Data analytics and visualization

Hybrid data management refers to the accessibility and repeated usage of data. The first step in building a data-driven structure in your organization is to ensure that the data is available; then you bring all of the business’s departments on board. The underlying data structure unifies the individual departments in a company and streamlines the flow of information between them.

If there is a communication gap between departments, it will hinder the flow of information, and mismanaged communication results in chaos rather than more efficient business operations.

Initially, strict rules and regulations governed data and restricted people from accessing it. The new form of data governance makes data accessible, but it also ensures security and protection. You can learn more about the new European Union General Data Protection Regulation (GDPR) and unified data governance in Rob Thomas’ GDPR session.

The other two aspects of data management are concerned with data engineering. A spreadsheet full of numbers is of no use if it cannot be distilled into useful insights about business operations; this requires analytical skill to filter out irrelevant information. Various visualization technologies make it easier for people to handle and comprehend data.

Want to learn more about the topic? Register now to join me at the live session with Hilary Mason, Dez Blanchfield, Rob Thomas, Kate Silverton, Seth Dobrin and Marc Altshuller.

Follow me on Twitter and LinkedIn for more interesting updates about machine learning and data integration.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers for predictive analytics by DataConomy, and by Klout for data science, big data, business intelligence and data mining. He is a guest author on leading big data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on big data and digital transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in big data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Securing Competitive Advantage with Machine Learning appeared first on Ronald van Loon.

 

June 07, 2017


Revolution Analytics

How to create dot-density maps in R

Choropleths are a common approach to visualizing data on geographic maps. But choropleths — by design or necessity — aggregate individual data points into a single geographic region (like a country...

...
 

June 06, 2017


Revolution Analytics

In case you missed it: May 2017 roundup

In case you missed them, here are some articles from May of particular interest to R users. Many interesting presentations recorded at the R/Finance 2017 conference in Chicago are now available to...

...
Ronald van Loon

What does new GDPR European Union law mean for your business?

Today’s consumers are more powerful than ever before, and they gather every bit of information they can before making a purchase. The Internet helps them greatly, and much of this buying is now done online; at the current pace, it won’t be long before online purchases are more common than offline ones.

What does this mean for businesses?

You have to unify your marketing and sales channels so that you can understand your consumers better and offer them a personalized cross-channel experience. Customer experience is the key to mastering this: if you want a larger share of the market, you must find ways to deliver a seamless customer experience across all channels that outdoes your competitors.

Consider the examples of Apple, Amazon, and other giant retailers. What are the common elements of their marketing and sales campaigns? They follow potential leads and customers across multiple channels and send them personalized messages, and they have advanced analytics systems in place that provide insights, which are then used to deliver a better customer experience than before.

As of now, current technologies allow businesses to control their clients’ data. But this is going to change next year when a new law, the General Data Protection Regulation (GDPR), comes into force in the European Union. The GDPR gives consumers control over their own data, and companies must become aware of the new rules, understand how they will be affected, and take the necessary steps to achieve compliance.

Understanding the New Law

The GDPR law shifts data control in favor of clients, and using that control, they will be able to decide which companies can store and use their personal data. They’ll be able to specify the exact manner in which their data could be used by businesses.

The GDPR Standards

As a part of becoming compliant, businesses have to meet GDPR standards.

  • Implement correct data management policies.
  • Understand clients’ rights under the new law, so that you can take appropriate action at a client’s request.

Clients’ Rights

The GDPR law gives the following rights to all clients.

  • Submit a formal request to access the personal information a company holds about them.
  • Rectify their data and restrict the company from processing it.
  • Ask a company to completely remove their data.
  • Withdraw consent for any reason at any time.
  • Obtain and reuse their data across different platforms for individual purposes.

Building Trust and Gaining Client Consent are Important

Organizations should manage their processes efficiently so that they can become compliant. They must understand and mitigate risks, while simultaneously building trust with clients and gaining their consent. This should be a key focus, because without client consent no business is allowed to process personal information for anything other than contractual or legal obligations. Once consent is obtained, businesses can collect, use, process, and store the data, but only in the ways their clients allow.

Consequences of Not Complying

What if businesses decide to ignore all this and not bother with compliance? Data Protection Authorities have several measures to enforce GDPR provisions, ranging from a reprimand to a ban on data processing altogether, and fines up to four percent of the global annual turnover.

And it doesn’t end here; you will lose client trust, and may well end up damaging your reputation. All of this will affect potential and new customers as well, who may decide not to buy from you, meaning you will lose both leads and money. Data breaches, for instance, cause a permanent 1.8% drop in stock prices due to reputational damage (Oxford Economics and CGI).

So it is absolutely essential to maintain client trust and stay compliant, and this is true irrespective of the industry you operate in.

The Challenges Involved

Locating Information 

You are compliant with GDPR when you can respond to clients and tell them what information you hold on them. The problem is that many businesses may not even know where this data is stored, which can prevent them from responding promptly when clients want their data removed. Consider the banking industry, for instance, where institutions typically hold files upon files of data dating back more than ten years, and the older records may not even be digital. Locating a client’s data quickly can therefore be difficult.

Managing Data Streams 

Businesses usually have numerous data streams to handle, and when working towards compliance, managing these effectively will be a challenge. Also, since you would need clients’ consent, you may not be able to use any sensitive details in any of your application systems; it all depends on how the clients want you to handle their data.

What solutions can an organization implement?

Achieving GDPR compliance requires businesses to take several steps. At the highest level, these can be defined as follows.

  • Locate and document the processing of personal data, and make it transparent to your consumers.
  • Ensure that personal data can be accessed, transported, and deleted, so that you can quickly respond to clients’ requests.
  • Store all personal details in a manner which complies with GDPR.
  • Gain protection from data breaches, and minimize the risks involved.
  • Monitor and manage data continuously to ensure that GDPR standards are being met.

Protecting Client’s Data

Protecting clients’ data is crucial if you want to gain their trust. Protection by Design is a recommended approach because it builds privacy and compliance in throughout the data lifecycle. The two most common techniques are pseudonymisation and data minimization.

Pseudonymisation lowers risk by translating data into information that does not directly identify a person. It remains personal data, because it can still be combined with other information, such as the mapping from pseudonyms back to identities, but without that additional information the data stays effectively anonymous if it falls into the wrong hands. Data minimization, on the other hand, lowers risk by using only what is strictly necessary to fulfill the intended purpose; datasets stay as small as possible, reducing the chance of unintended use or damage in case of a data breach. When privacy risks are minimized, clients trust you more and are assured that their data will remain secure throughout the process.
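
As a rough illustration only (not a complete compliance solution or any particular vendor's implementation), the Python sketch below applies both techniques to a hypothetical customer table: a salted hash stands in for the direct identifier, and the analytics copy keeps only the columns the analysis strictly needs.

    import hashlib
    import pandas as pd

    customers = pd.DataFrame({
        "email": ["a@example.com", "b@example.com"],   # direct identifier
        "postcode": ["1011 AB", "2012 CD"],
        "monthly_spend": [120.0, 80.5],
    })

    # The salt (and any pseudonym-to-identity mapping) must be stored
    # separately from the analytics dataset.
    SECRET_SALT = "rotate-and-store-separately"

    def pseudonymise(value: str) -> str:
        """Replace a direct identifier with a salted hash (a pseudonym)."""
        return hashlib.sha256((SECRET_SALT + value).encode()).hexdigest()[:16]

    # Pseudonymisation: the analytics copy carries pseudonyms, not emails.
    analytics = customers.assign(customer_id=customers["email"].map(pseudonymise))

    # Data minimization: keep only the columns the analysis strictly needs.
    analytics = analytics[["customer_id", "monthly_spend"]]
    print(analytics)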

Implementing the Technical Infrastructure

Your infrastructure should be compliant, controlled, and portable. Collect data only for specific purposes, and give your customers the right to object. The information you do gather should be stored in a self-controlled environment and be subject to protection regulations. You can also implement a data governance solution to get deeper insight into the entire data lifecycle; this will also help you build a searchable catalogue of all information and an access and control point for data-related tasks.

Minimizing the Risks

  • Review your current processes, and create documentation on personal data your company handles and the methods through which you obtain it.
  • Bring data protection officers or DPOs on board so that they can help you define personal data and achieve compliance.
  • Use data stream manager applications to handle all your data streams. Doing so, you will be able to process these streams in real time, allowing you to respond to clients’ requests more quickly.

Want more useful information on GDPR?

Follow Ronald van Loon and Janus de Visser to stay updated with the latest on GDPR, understand your role, and learn the tricks to gain compliance.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers for predictive analytics by DataConomy, and by Klout for data science, big data, business intelligence and data mining. He is a guest author on leading big data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on big data and digital transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in big data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post What does new GDPR European Union law mean for your business? appeared first on Ronald van Loon.


BrightPlanet

Four Questions You Need to Answer Before Starting an Open Source Anti-Fraud Program

You’ve faced your fair share of fraudulent situations. They have negatively impacted your business. You’re sick of the detrimental and long-lasting effects of fraud. You’re ready to take action. But how? An open source anti-fraud program can be of great help in situations such as yours. Being proactive allows you to halt fraud attempts in […] The post Four Questions You Need to Answer Before Starting an Open Source Anti-Fraud Program appeared first on BrightPlanet.

Read more »
 

June 05, 2017


Revolution Analytics

Powe[R] BI: Free e-book on using R with Power BI

A new (and free!) e-book on extending the capabilities of Power BI with R is now available for download, from analytics consultancy BlueGranite. The introduction to the book explains why R and Power...

...
 

June 04, 2017


Simplified Analytics

Cybersecurity in Digital age

You must have heard about the global cyberattack of WannaCry ransomware in over 200 countries. It encrypted all the files on the machine and asked for payment. Ransomware, which demands payment after...

...
 

June 02, 2017


Revolution Analytics

Because it's Friday: Disappearing Dots

It's been a while since we posted an optical illusion, and this one (via Max Galka) is just too good to pass up. Here are the instructions, from the source: First, look at any yellow dot as the...

...