Planet Big Data

Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


May 23, 2017

Revolution Analytics

Create smooth animations in R with the tweenr package

There are several tools available in R for creating animations (movies) from statistical graphics. The animation package by Yihui Xie will create an animated GIF or video file, using a series of R...



Nicole Job: Leveraging Data Harvesting and Deep Web Technologies

Since her college days at SDSU, Nicole Job has lived for a worldly perspective. She tried her hand at many paths, from nursing to psychology, but no one subject could truly define her. She ended up majoring in global studies with the ambition to find a career that would keep her mind sharp. Little did she […] The post Nicole Job: Leveraging Data Harvesting and Deep Web Technologies appeared first on BrightPlanet.


Revolution Analytics

AzureDSVM: a new R package for elastic use of the Azure Data Science Virtual Machine

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and...

Big Data University

This Week in Data Science (May 23, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Featured Courses From Cognitive Class

The post This Week in Data Science (May 23, 2017) appeared first on BDU.


May 22, 2017

Revolution Analytics

Preview of EARL San Francisco

The first ever EARL (Enterprise Applications of the R Language) conference in San Francisco will take place on June 5-7 (and it's not too late to register). The EARL conference series is now in its...


May 21, 2017

Simplified Analytics

Top 7 Virtual Reality Industry use cases

Today, digital transformation has entered our lives, and we use it subconsciously every day: smartphones, smart cars, internet-connected devices, and more. Virtual Reality...


May 19, 2017

Revolution Analytics

The history of the universe, in 20 minutes

Not content with merely documenting the history of Japan, Bill Wurtz is back with a history of the entire universe. It gets rather Earth-centric once it gets going, but I guess that was inevitable....


Revolution Analytics

R/Finance 2017 livestreaming today and tomorrow

If you weren't able to make it to Chicago for R/Finance, the annual conference devoted to applications of R in the financial industry, don't fret: the entire conference is being livestreamed (with...


Rob D Thomas

Suggesting a 'pilot' is a weakness

This was my observation on Twitter/LinkedIn a couple of weeks ago: Nearly everything I share in such forums is based on actual events/observations, not theory. I didn’t expect any reaction. 82...

Cloud Avenue Hadoop Tips

AWS and Big Data Training

I have been in the training business (along with consulting and projects) for a few years, both online and in the classroom, using a Wacom tablet I bought five years ago. The tablet helps me convey the message better and faster, whether the participant is in another part of the world or right in front of me. I am really passionate about training, as it makes me think about a particular concept from different perspectives.

So, here is an AWS demo on how to create an ELB (Elastic Load Balancer) and EC2 instances (Linux servers) in the cloud. This combination can be used for high availability, or for scaling up as more and more users start using your application.

Similarly, below is a Big Data demo on creating an HBase Observer, which is very similar to a trigger in an RDBMS. For those new to HBase: it's a columnar NoSQL database, one of many NoSQL databases available.

If you are interested in any of these trainings, please contact me for further details. Both Big Data and Cloud are all the rage, so don't miss the opportunities around them.

See you soon!

May 18, 2017

Silicon Valley Data Science

Getting Value Faster with a Data Strategy

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

There’s a running joke in our industry about calls to IT service companies from client executive teams who “want to install some Hadoops.” The joke stems from the many businesses, eager to get in on the advantages of using data, who want to quickly install some software and get magic results. If you work with data then you know that these “Hadoops” (or any other tool or platform) can’t magically address underlying business concerns.

A McKinsey-Oxford survey discovered that, of the major IT projects that had significant cost overruns, the largest contributing factor was unclear objectives and a lack of business focus. If you don’t take the time to carefully define the set of problems you need to solve up front, then you’ll either be left with a system that was designed for only a narrow sliver of use cases, or one that becomes bloated with add-ons meant to address a multitude of issues in whack-a-mole style. Either way, the resulting system will probably be abandoned shortly after the initial project ends.

Companies that demand immediate technical results often don’t want to take the time to develop a data strategy up front, but doing so will actually accelerate business value for your company. It will ensure that each technical build-out directly addresses your most critical business goals.

You can find the significant details of a data strategy in our position paper, but as a quick overview, the process of creating a data strategy consists of the following components:

  • Identifying your business goals and current capabilities
  • Determining future requirements
  • Designing a project roadmap prioritized to deliver value early
  • Aligning business and technology stakeholders around technology investment

In this post, I’ll describe how these components fit together and work to bring you business value quickly.

Identifying your Goals and Capabilities

Pinpointing your priorities

Many of our prospective clients have related a similar story: a team at a product company hears about a new customer intelligence platform that will yield new capabilities for integrating and analyzing consumer data. It will provide a holistic customer view, predictive analytics, campaign analytics, and more. Unsurprisingly, the team jumps at the opportunity and kicks off an IT project to install or build the tool, despite the substantial effort involved.

At first, everyone seems to win. The company checks a box on IT investment, and the business analysts can now get a better view of their customer ecosystem. However, difficulties quickly pop up. The tool doesn’t work as advertised, for example, or it doesn’t produce results as quickly as the company had hoped. Perhaps the analysts need to create or modify models, but the tool will not easily allow it. Many times, the company has forgotten to ask key questions regarding the purchase in the first place, such as: What exactly do I want to learn about my customers? Do I have the right customer data to begin with?

We’ve encountered many situations in which no one asked such questions. Yes, the system helped the product company to answer a lot of questions about its customers, but those questions didn’t match the company’s key needs and priorities. We’ve seen many businesses with their needs unmet, despite their having purchased promising technical solutions.

When your strategy begins with your business priorities, you will ensure that you are solving your business’ most important problems instead of just picking one to address at random. A good data strategy helps you gain the biggest return by forcing you to address the items that will produce the largest impact first. This gives you the chance to plan ahead and identify which company groups and functions will be impacted by any new tools, and it allows you to adjust your strategy accordingly.

Devising a data strategy takes time, but a focused data strategy effort should take only 6–8 weeks. Yes, that is time in which you could be kicking off an engineering project, but at the end of that investment you will know which project makes the most sense as a starting point and how it fits into your overall roadmap. In the end, there is no wasted time and you’ve minimized a common risk for many large platform-led projects: surprises.

Cataloging your capabilities

After you prioritize your business goals, the next step is to identify the functional and technical capabilities required to meet those goals, along with the high-level use cases that bring those capabilities to life. This process has several advantages over simply building or installing a technical system right away. First and foremost, once you explicitly lay it out, you will know what specific capabilities you need to meet your goals—and those may not necessarily include the latest and greatest big data tool.

An important part of strategy assessment is also discussing these capabilities with project stakeholders (more on stakeholders later). This gives you the chance to verify whether multiple internal groups would benefit from the same capabilities, and whether those capabilities already exist somewhere in your company. You can then take those teams’ technical learnings into account when building out your future project roadmaps.

When identifying your desired functional and technical capabilities, a data strategy also gives you a comprehensive look into your data ecosystem. You have the opportunity to identify any data gaps up front. A goal may be to analyze the effectiveness of your consumer marketing campaign, but if you haven’t been capturing the proper data from your customer interactions, then even developing the best analytical models will not give you the results you need. Going through a data strategy assessment will keep you from losing valuable time scrambling to acquire necessary data after your new technical capabilities are already built.

Identifying Dependencies: Doing the foundational projects first

The next part (and benefit) of the data strategy process is mapping dependencies between priorities, capabilities, and projects. With these dependency mappings, you make sure you are solving your prioritized problems in the optimal order. For example, consider a scenario in which you’ve listed your business priorities, along with the technical capabilities needed to achieve them. It’s clear that you need a new customer analytics platform, and there are many small projects involved. Your predictive modeling project and customer data integration project are of equal priority, but for your purposes, the data integration will enable many more additional projects and capabilities than the predictive analytics. Thus, you can easily see that you should tackle the data integration project first.

Similar to how product startups focus on the minimum viable product (MVP), imagine a minimum viable capability (MVC). Mapping all of the dependencies while building the roadmap allows you to produce an MVC and then build upon that. Then you can combine dependent projects to produce results even faster. Through each subsequent project, the existing MVC is utilized and expanded to incrementally create your complete list of capabilities. As a result, you get to test your hypotheses early, measure results, and tweak as necessary.

For instance, instead of fully installing a large system before seeing any benefit, you can use a data strategy to plan out how to build smaller sets of functionality as they are needed. As an example, when integrating a subset of customer data, you can perform predictive modeling on that subset, then scale out the modeling as you integrate additional consumer data. As a result of these incremental projects, you end up seeing ROI and business results fairly quickly and can adjust your future strategy as necessary during each iteration. There is a small delay in that you haven’t built a technical system right away, but planning for these dependencies and the corresponding small, incremental projects assists you in achieving your desired results much more quickly. As an added note, these small IT projects are also seven times more likely to be successful (i.e., delivered on time and within budget) than larger, all-inclusive ones, according to a Standish Group research report.

Involving the Right Stakeholders

In addition to the above benefits, one final point to note is that creating a data strategy involves bringing together many different business and technical stakeholders from your organization. When you do so, you obtain everyone’s input at the beginning, instead of getting critical feedback after it’s too late.

Too often, data and analytics projects get kicked off without involving all of the affected parties. For example, the marketing analytics department plans and scopes a campaign analysis initiative, but they don’t involve the key IT or engineering stakeholders until later stages. This ultimately results in project delays, as the IT department may not have reserved infrastructure, allocated storage and processing, etc. Thinking through a data strategy with contributions from all relevant employees is key to identifying and avoiding problems early.

Roadmap: Tying it all together

After looking at the data, capabilities, use cases, and stakeholder concerns, the final, tangible deliverable from the data strategy process is a project roadmap directly based on your business objectives. Although there are many ways a roadmap can quickly deliver value for your company, the most significant is that it gives you real, meaningful use cases to tackle—instead of you simply picking an issue to go after without really looking at all your key business priorities holistically.

By giving you a comprehensive look at your business priorities, technical ecosystem, and stakeholder inputs first, a data strategy will help you avoid big problems later. Ultimately, this leads to faster value.

The post Getting Value Faster with a Data Strategy appeared first on Silicon Valley Data Science.

Revolution Analytics

Clean messy data by providing examples in Excel

While Excel isn't usually my tool of choice for manipulating or analyzing data (I prefer to use it as a data source for R), it has just learned a new trick that's likely to prove useful from time to...


How Machine Learning is helping Call Centres improve the Customer Experience

The call centre world, unsurprisingly, ranks as one of the highest adopters of data analytics platforms year on year. This is largely due to the invaluable insights gained through the analysis of the thousands of calls a typical call centre receives each day. With speed of the essence in making the right decision at the right time for each caller, many call centres are turning to machine learning to automate their data analysis and make crucial customer-experience decisions within seconds.


May 17, 2017

Revolution Analytics

An Introduction to Spatial Data Analysis and Visualization in R

The Consumer Data Research Centre, the UK-based organization that works with consumer-related organisations to open up their data resources, recently published a new course online: An Introduction to...


Revolution Analytics

R in Financial Services: Challenges and Opportunities

At the New York R Conference earlier this year, my colleague Lixun Zhang gave a presentation on the challenges and opportunites financial services companies encounter when using R. In the talk, he...


May 16, 2017

Revolution Analytics

R and Python support now built in to Visual Studio 2017

The new Visual Studio 2017 has built-in support for programming in R and Python. For older versions of Visual Studio, support for these languages has been available via the RTVS and PTVS add-ins, but...



How We Do Data Harvesting

Here’s the scenario: you’re faced with a data problem that is negatively affecting your business. You don’t know where to find the data, let alone how you’ll be able to obtain it. That’s where we can help. Data harvesting is a meticulous process and at the core of all we do. In order to discover and gather […] The post How We Do Data Harvesting appeared first on BrightPlanet.

Silicon Valley Data Science

Getting Started with Predictive Maintenance Models

In a previous post, we introduced an example of an IoT predictive maintenance problem. We framed the problem as one of estimating the remaining useful life (RUL) of in-service equipment, given some past operational history and historical run-to-failure data. Reading that post first will give you the best foundation for this one, as we are using the same data. Specifically, we’re working with sensor data from the NASA Turbofan Engine Degradation Simulation dataset.

In this post, we’ll start to develop an intuition for how to approach the RUL estimation problem. As with everything in data science, there are a number of dimensions to consider, such as the form of model to employ and how to evaluate different approaches. Here, we’ll address these sub-problems as we take the first steps in modeling RUL. If you’d like to follow along with the actual code behind the analysis, see the associated GitHub repo.


Before any modeling, it’s important to decide how to compare the performance of different models. In the previous post, we introduced a cost function J that captures the penalty associated with a model’s incorrect predictions. We are also provided with a training set of full run-to-failure data for a number of engines and a test set with truncated engine data and their corresponding RUL values. With these in hand, it’s tempting to simply train our models with the training data and benchmark them by how well they perform against the test set.

The problem with this strategy is that optimizing a set of models against a static test set can result in overfitting to the test set. If that happens, the most successful approaches may not generalize well to new data. To get around this, we instead benchmark models according to their mean score over a repeated 10-fold cross validation procedure. Consequently, each cross validation fold may contain a different number of test set engines, if the training set size is not divisible by the number of folds. Since the definition of J from the previous post involves the sum across test instances (and thus depends on the test set size), we instead modify it to be the mean score across test instances.
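The post doesn't reproduce J explicitly. Below is a minimal Python sketch of an exponential, asymmetric penalty averaged across test instances, using the under/over-estimate scale constants (13 and 10) from the NASA PHM08 challenge this dataset comes from — an assumption, since the post doesn't state them:

```python
import math

def rul_score(rul_true, rul_pred, a_under=13.0, a_over=10.0):
    """Asymmetric exponential penalty for one prediction: overestimates
    (pred > true) grow faster than underestimates of the same size."""
    d = rul_pred - rul_true
    if d < 0:
        return math.exp(-d / a_under) - 1.0
    return math.exp(d / a_over) - 1.0

def mean_cost(rul_true_list, rul_pred_list):
    """Mean score across test instances, so folds with different
    numbers of engines remain comparable."""
    scores = [rul_score(t, p) for t, p in zip(rul_true_list, rul_pred_list)]
    return sum(scores) / len(scores)
```

A perfect prediction scores 0, and an overestimate of 20 cycles is penalized more than an underestimate of 20 cycles, matching the cost behavior described later in the post.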

The benchmarking will take place only on data from the training set, and the test set will be preserved as a holdout which can be used for final model validation. This procedure requires us to be able to generate realistic test instances from our training set. A straightforward method for doing this is described in this paper.
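Generating realistic test instances from full run-to-failure histories amounts to truncating each history at a random cycle and recording the cycles that remain. A hedged sketch (function and data-structure names are illustrative, not from the post's repo):

```python
import random

def make_test_instances(engine_cycles, n_instances, seed=0):
    """Given full run-to-failure histories (a dict mapping engine id to a
    list of per-cycle readings), draw truncated histories and their
    ground-truth RUL, mimicking the structure of the real test set."""
    rng = random.Random(seed)
    instances = []
    engine_ids = list(engine_cycles)
    for _ in range(n_instances):
        eid = rng.choice(engine_ids)
        cycles = engine_cycles[eid]
        # Truncate at a random cycle, keeping at least one observation
        # and leaving at least one cycle of remaining life.
        cut = rng.randrange(1, len(cycles))
        truncated = cycles[:cut]
        rul = len(cycles) - cut  # cycles remaining until failure
        instances.append((eid, truncated, rul))
    return instances
```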

Information in the sensors

Most of the sensor readings change over the course of an engine’s lifetime, and appear to exhibit consistent patterns as the engine approaches failure. It stands to reason that these readings should contain useful information for predicting RUL. A natural first question is whether the readings carry enough information to allow us to distinguish between healthy and failing states. If they don’t, it’s unlikely that any model built with sensor data will be useful for our purposes.

One way of addressing this is to look at the distribution of sensor values in “healthy” engines, and compare it to a similar set of measurements when the engines are close to failure. In the documentation provided with the data, we are told that engines start in a healthy state and gradually progress to an “unhealthy” state before failure.

[Figure: sensor 2 value distributions for healthy vs. failing cycles]

The figure above shows the distribution of the values of a particular sensor (sensor 2) for each engine in the training set, where healthy values (in blue) are those taken from the first 20 cycles of the engine’s lifetime and failing values are from the last 20 cycles. It’s apparent that these two distributions are quite different. This is promising—it means that, in principle, a model trained on sensor data should be able to distinguish between the very beginning and very end of an engine’s life.
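The healthy-vs-failing comparison just described can be sketched in a few lines: take the first 20 and last 20 cycles of each engine for one sensor and compare the two samples (names are illustrative):

```python
def healthy_vs_failing(engine_sensor, window=20):
    """Split each engine's per-cycle readings for one sensor into a
    'healthy' sample (first `window` cycles) and a 'failing' sample
    (last `window` cycles), pooled across engines."""
    healthy, failing = [], []
    for readings in engine_sensor.values():
        healthy.extend(readings[:window])
        failing.extend(readings[-window:])
    return healthy, failing

def mean(xs):
    return sum(xs) / len(xs)
```

Comparing `mean(healthy)` to `mean(failing)` (or plotting the two histograms, as in the figure) reveals whether the sensor carries any end-of-life signal at all.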

Mapping sensor values to RUL

The above results are promising, but of course insufficient for the purposes of RUL prediction. In order to produce estimates, we must still determine a functional relationship between the sensor values and RUL. Specifically, we’ll start by investigating models f(\cdot) of the form described below, where \text{RUL}_{j}(t) is the RUL for engine j at time t and \text{S}_{j,i}(t) is the value of sensor i at that time.

(1)   \[\text{RUL}_{j}(t) = f\left(\text{S}_{j,1}(t), \text{S}_{j,2}(t), \ldots, \text{S}_{j,21}(t)\right)\]

It’s important to note here that this isn’t the only way to estimate RUL with this kind of data. There are many other ways to frame the modeling task that don’t require a direct mapping from sensor values to RUL values at each time instant, such as similarity or health-index based approaches. There is also a whole class of physical models that incorporate knowledge about the physics of the underlying components of the system to estimate its degradation. To preserve generality and keep things simple in the discussion below, we won’t consider these approaches for the moment.

Linear regression

A simple first approach is to model \text{RUL}_{j}(t) as a linear combination of \text{S}_{j,i}(t)’s:

(2)   \[\text{RUL}_{j}(t) = \theta_{0} + \sum_{i=1}^{21}\theta_{i}\text{S}_{j, i}(t) + \epsilon \]

This is illustrated in the figure below. In blue are the values of a particular sensor (sensor 2 in this case) plotted against the true RUL value at each time cycle for the engines in the training set. Shown in red is the best fit line to this data (with slope \theta_{2}).

[Figure: sensor 2 values vs. true RUL, with best-fit regression line]

The figure below shows the performance of the linear regression model (using all sensor values as features). Each point represents a single prediction made by the model on a test engine in the cross validation procedure. The diagonal line represents a perfect prediction (\text{RUL}_{\text{predicted}} = \text{RUL}_{\text{true}}). The further a point is from the diagonal, the higher its associated cost.

[Figure: predicted vs. true RUL for the linear regression model]

Given the plots above, there are a few reasons to think this approach is perhaps too simple. For example, it is clear that the relationship between the sensor values and RUL isn’t really linear. Also, the model appears to be systematically overestimating the RUL for the majority of test engines. Using the same inputs, it’s reasonable to think that a more flexible model should be able to better capture the complex relationship between sensor values and RUL.
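Equation (2) is an ordinary least-squares problem. A minimal sketch of fitting it with NumPy follows; the post's actual code lives in its linked repo, so treat this as illustrative:

```python
import numpy as np

def fit_linear_rul(S, rul):
    """Fit RUL(t) = theta_0 + sum_i theta_i * S_i(t) by ordinary least
    squares. S has shape (n_samples, n_sensors); rul has shape (n_samples,)."""
    X = np.hstack([np.ones((S.shape[0], 1)), S])  # prepend intercept column
    theta, *_ = np.linalg.lstsq(X, rul, rcond=None)
    return theta

def predict_linear_rul(theta, S):
    """Apply the fitted coefficients to new sensor readings."""
    X = np.hstack([np.ones((S.shape[0], 1)), S])
    return X @ theta
```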

More complex models

For the reasons just mentioned, the approach above is likely to be leaving a lot on the table. To evaluate this idea, two other models were trained on the same task: a regression tree and a feedforward neural network. In both cases, hyperparameters were optimized via cross validation (not shown), resulting in a tree with a maximum depth of 100 and neural network with two 10-unit hidden layers. The figure below shows diagnostic plots of these models, along with their scores.

[Figure: predicted vs. true RUL for the regression tree and neural network models, with CV scores]

Notice that in both cases, the models appear to capture some characteristics of the expected output reasonably well. In particular, when the true RUL value is small (~25 to 50) they exhibit a high density of predictions close to the diagonal. However, both of these models score worse than the simpler linear regression model (higher CV score).

To get a feel for exactly how these models are failing, recall that our cost function scales exponentially with the difference between predicted and true RUL, and overestimates are penalized more heavily than underestimates. In the figure above, this difference corresponds to the distance along the y-axis from each point to the diagonal. Notice that when the true RUL is in the range of about 50 to 100, both models tend to overestimate the RUL by a lot, leading to poor performance.
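The hyperparameter optimization mentioned above can be sketched as a cross-validated grid search. This version uses scikit-learn and plain mean absolute error standing in for the cost J — both assumptions, since the post doesn't show its tuning code:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def select_tree_depth(S, rul, depths, n_splits=5, seed=0):
    """Pick the regression-tree max_depth with the lowest mean
    cross-validated error (mean absolute error here)."""
    best_depth, best_err = None, float("inf")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for depth in depths:
        fold_errs = []
        for train_idx, test_idx in kf.split(S):
            model = DecisionTreeRegressor(max_depth=depth, random_state=seed)
            model.fit(S[train_idx], rul[train_idx])
            pred = model.predict(S[test_idx])
            fold_errs.append(np.mean(np.abs(pred - rul[test_idx])))
        err = float(np.mean(fold_errs))
        if err < best_err:
            best_depth, best_err = depth, err
    return best_depth, best_err
```

The same loop applies to the neural network's hidden-layer sizes by swapping in a different estimator.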

Improving the models

To understand how to improve our models, it’s necessary to take a step back and examine exactly what we are asking them to do. The figure below shows some example sensors from an engine in the test set plotted over time. The dashed line shows the corresponding RUL at each time step which, as we’ve (implicitly) defined it, is linearly decreasing throughout the life of the engine, ending at 0 when the engine finally fails.

[Figure: example test-engine sensor readings over time, with linearly decreasing RUL (dashed line)]

Notice that only near the end of the engine’s life do the sensor readings appear to deviate from their initial values. At the beginning (about cycles 0 through 150 in the figure above), the sensor readings are not changing very much, except for random fluctuations due to noise. Intuitively, this makes sense—even though the RUL is constantly ticking down, a normally-operating engine should have steady sensor readings.

Given how we’ve framed the predictive modeling task, this suggests a major problem. For a large portion of each engine’s life, we’re attempting to predict a varying output (RUL), given more or less constant sensor values as input. What’s more, reliably predicting the RUL of healthy engine cycles won’t necessarily guarantee better performance on the test set.¹ As a result, these early engine cycles are likely contributing a large amount of training set error, the minimization of which is irrelevant to performance on the test set.

[Figure: the same sensor readings with RUL capped at a maximum value (dotted line)]

One approach to this issue is illustrated by the dotted line in the figure above, in which we redefine RUL. Instead of assigning an arbitrarily large RUL value across the life of the engine, we can imagine limiting it to a particular maximum value (85 in this case). If RUL is related to underlying engine health, then this is intuitive: the entire time an engine is healthy, it might make sense to think of it as having a constant RUL. Only when an engine begins to fail (some set amount of time before the end of its life) should its declining health be reflected in the RUL.
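The capped labeling scheme just described is a one-line transformation of the training labels. A minimal sketch (the cap of 85 follows the post; everything else is illustrative):

```python
def capped_rul_labels(total_cycles, max_rul=85):
    """Assign a training RUL label to each cycle of a run-to-failure
    history: linearly decreasing near failure, but clipped at max_rul
    while the engine is still 'healthy'. Ends at 0 at the final cycle."""
    return [min(max_rul, total_cycles - t) for t in range(1, total_cycles + 1)]
```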

[Figure: predicted vs. true RUL for all models after re-training with the capped RUL labels]

The models described above were re-trained using this modified RUL assignment scheme (max RUL = 85), and results are shown in the figure above. Notice that this simple tweak yields an order of magnitude gain in performance across all models, as measured by the overall cost. From the figure it’s apparent that this alternative labeling results in an upper limit on the RUL predictions that the model can make, which makes sense given the inputs. As a result, these models are less prone to the drastic overestimates that plagued the previous versions.

There are still many aspects of this problem left to explore. From the figures above, it’s clear that the models struggle to produce accurate estimates when the RUL is especially large or small. Also, we’re treating each time point as an independent observation, which isn’t quite right given what we know about the data. As a result, none of our models take advantage of the sequential nature of the inputs. Additionally, we aren’t doing any real preprocessing or feature extraction. Nevertheless, compared to our initial models, we’ve managed to demonstrate a significant improvement in predicting RUL.

What’s next?

In future posts, we’ll dig a bit deeper into some of the nuances of this problem and develop a better intuition for approaching predictive maintenance in IoT systems. Want to play with the data yourself? In this repository, you’ll find some scripts and notebooks to get you started.


1. Recall that the challenge is to predict a single RUL value after observing a series of sensor measurements. A quick look at the test set shows that many of the true RUL values are in the range of 25 to 100, presumably when these engines are no longer healthy.

The post Getting Started with Predictive Maintenance Models appeared first on Silicon Valley Data Science.

Knoyd Blog

Data Science For Everyone

Gut feeling used to be the biggest asset of successful businesspeople. Nowadays, intuition still plays an important role, but with all the available knowledge and technologies, there has been a significant shift. One of the most important sources of competitive advantage these days is data. Big Data is hyped, and undoubtedly a bandwagon worth jumping on. But how?

There is hardly a business that does not deal with data on a daily basis. Firms collect data on their customers, employees, operations, machinery performance, energy consumption, processes - the list goes on and on. With a touch of data science, this data can be transformed from annoying storage costs into useful insights. And we are no longer talking only about corporate giants, such as banks, big manufacturers or telecommunication service providers. It is true that data science used to be a privilege accessible only to the established companies, who could afford to employ their own team of data experts or hire an external consulting company. But this has changed as well.

Even smaller e-shops, retailers or startups collect more and more data every day. Small and medium-sized enterprises represent 99% of all businesses in the EU and the ratio is more or less similar across the globe. “We have realized that these firms come across the same challenges as their larger counterparts albeit on a smaller scale. In order to keep the business going, they need to keep the customers happy, the costs low, and the profits high. Therefore they too can benefit from data science solutions,” says Juraj Kapasny, co-founder of data science consulting startup Knoyd. The most important thing to figure out was how to make data science affordable for everyone. “When dealing with smaller companies, personal approach is the key. They do not expect a team of sales reps in fashionable suits to impress the non-technical people in the room with catchy slogans and presentations. They want to see that we care about them and that we can deliver on our promises. Another major difference is that we do not use expensive software, but rather open source frameworks such as R, Python or Spark, which are the alpha and omega for the data scientists anyway. Consequently, you pay only for the time spent working on your project. In the end, it is the same data scientist delivering the same high-quality solution only without the glitter that makes it all so expensive.”


The problem is that the companies often do not know where to start. Although nearly all entrepreneurs are now familiar with the term data science, its meaning still remains rather obscure. It is used to describe pretty much anything from Google Analytics and business intelligence to predictive modeling, data engineering or machine learning. The reason is that, when it comes to data science, there is no one-size-fits-all solution. Naturally, there are some common business applications, for instance, credit scoring for financial institutions, churn prediction for telco and internet providers, or predictive maintenance for manufacturers, which can be, after some customization and adjustments, applied across these industries. However, most of the solutions are designed from scratch and tailored to meet the specific needs of a given company. Besides, companies seek help in different stages of their data science efforts. Some are data science newbies who need a complete solution including advice on what data to collect and how, while others just need a push in the right direction, like the refinement of an existing recommendation engine for e-commerce or enhancement of a current credit scoring model.


Most of the companies are still closer to the newbie end of the spectrum and do not yet have the perfect recipe to successfully use data science to their advantage. Hence, if an innovative entrepreneur wants to enter the realm of data science without previous experience, we face a seemingly unsolvable situation. On the one hand, we have a company which wants to know what a data scientist can do for them, and on the other hand, we have a data scientist who needs to see the data before drawing any conclusions. “After signing an NDA, it usually takes us about a week or two, depending on the size and structure of the client’s data, to analyze it, determine what we can do for them, and estimate the time and cost of such a project. We call this step the opportunity analysis and we guarantee that if we do not find any opportunities, or if we do find something and the client decides to proceed with the suggested project with us, this analysis is free of charge. So there is basically no risk from the customer’s perspective. This phase is very important for both sides − we are establishing a relationship with the customer and showing them what their data is capable of. After that it is up to the client to decide whether they wish to go further or not,” Juraj explains.

As is often the case, the first step is usually the hardest. As soon as you try it, you will see for yourself that there is no black magic behind data science. There is just a skilled person, acting as a middleman, translating from the language of your data to the business terms you are familiar with. You will be able to better understand both your company and your customers and to achieve more with fewer resources. Once you see it works, you can start thinking about other aspects of your business where data science can be deployed. Or not. If data does not persuade you, you still have your intuition to rely on.

Big Data University

This Week in Data Science (May 16, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News


Featured Courses From BDU

The post This Week in Data Science (May 16, 2017) appeared first on BDU.

Ronald van Loon

Enterprise Journey to Becoming Digital

Do you want to be a digital enterprise? Do you want to master the art of transforming yourself and be at the forefront of the digital realm?

How can you change your business to achieve this?

Derive new values for yourself, and find better and more innovative ways of working. Put customer experience above and beyond everything as you find methodologies to support the rapidly changing demands of the digital world.

Your transformation will be successful only when you identify and practice appropriate principles, embrace a dual strategy that enhances your business capabilities, and switch to agile methodologies if you have not done so already.

The journey to becoming a digital maestro and achieving transformation passes through four main phases.

  • Becoming a top-notch expert with industrialized IT services – by adopting five main principles
  • Switching to agile operations to achieve maximum efficiency – so that you enjoy simplicity, rationality and automation
  • Creating an engaging experience for your consumers using analytics, revenue and customer management – because your customers come first; their needs and convenience should be your topmost priority
  • Availing opportunities for digital services – assessing your security and managing your risks

Becoming a top-notch expert with industrialized IT services

There are five key transformation principles that can help you realize the full potential of digital operations and engagement.

  • Targeting uniqueness that is digitized
  • Designing magical experiences so as to engage and retain your consumers
  • Connecting with digital economics, and collaborating so as to leverage your assets
  • Operating your business digitally, with customer experience at the core
  • Evolving into a fully digital organization through a side-by-side or incremental approach

Initially a digital maturity analysis has to be performed, followed by adoption of a targeted operational model. Maturity can be divided into five different levels: initiating, enabling, integrating, optimizing and pioneering, which are linked to seven different aspects: strategy, organization, customer, technology, operations, ecosystem and innovation, of which the last two are the most critical. The primary aim should be to cover all business areas that are impacted by and impact digital transformation.

Before taking a digital leap, the application modernization wheel should be adopted. Identify your targets, which will act as main drivers. Determine application states, and then come up with a continuous plan. This is referred to as the Embark phase, during which you understand the change rationale of your applications, and then improve metrics, which drive changes. During the Realize phase, you analyze ways in which you can change your operations and speed up your delivery. In the process, you have to improve quality, while ensuring your product line is aligned with your business needs. You establish DevOps, beginning from small teams, and then moving forward using new technologies.

The third phase is Modernize, during which you plan and implement your architecture such that your apps are based on API services. The last stage is Optimize in which performance is monitored, and improvements are made when and where they are necessary.

Switching to agile operations to achieve maximum efficiency

Data centers now feature several applications, suitable for the IT, telecommunication and enterprise sectors, but their offered services have to be responsive to changing trends and demands. Ericsson brings agility into the picture so as to achieve efficiency through automation. This can be made possible with the NFV Full Stack, which includes a cloud manager, execution environment, SDN controllers and NFV hardware. The solution can support automated deployment while providing flexibility through multi-VIM support. Check out this blog post to see a demonstration of a virtualized datacenter and explore their vision of future digital infrastructure.

NFV’s potential can be fully achieved only when hybrid networks are properly managed, which dynamic orchestration makes possible. The approach taken automates service design, configuration and assurance for both physical and virtual networks. Acceleration of network virtualization is being realized through the Open Platform for Network Functions Virtualization (OPNFV), a collaborative project under the Linux Foundation that is transforming global networks through open source NFV. Ericsson is a platinum-level founding OPNFV member, along with several other telecom vendors, service providers and IT companies leading the charge in digitalized infrastructure.

Creating an engaging experience for your consumers

Customer experience is the central focus when you are in the digital realm. Customer experience should be smooth, effortless and consistent across all channels.

Design a unique omnichannel approach for your customers. This means that you should be able to reach out to your customers through mobile apps, social media platforms and even wearable gadgets. Analyze real-time data, and use the results to improve purchase journeys across different channels like chatbots and augmented reality. Advanced concepts like clustering and machine learning are used to cross data over different domains, and then take appropriate actions. For instance, if you are a telco, you should be able to offer a new plan, bundle or upgrade to each customer at the right time. All of the analytics data can also be visualized for a complete understanding through which the customer journey can be identified, and the next best action can be planned out.

Availing opportunities for digital services

Complexity increases when all your systems are connected, and security becomes a more important concern. You should be able to identify new vulnerabilities and threat vectors, and then take steps to protect your complete system. And this protection should extend to your revenues, and help you prevent fraud.

A Security Manager automates security over the cloud as well as physical networks. The two primary components are Security Automation and 360 Design and Monitoring. New assets are detected and then monitored continuously as security is hardened.

Additionally, Digital Risk and Business Assurance enables your business to adapt to a dynamic environment while reducing the impact on your bottom line. Assurance features three levels: marketplace, prosumer and wholesale assurance. The end result is delivery of a truly digital experience.

Want proof that the above methodologies do work wonders? Two of Ericsson’s customers, Verizon and Jio, have already been nominated as finalists for the TM Forum EXCELLENCE Awards.

I also encourage you to join and/or follow TM Forum Live this week. If you’re headed to the conference, be sure to check out the Ericsson booth and connect with the team to learn more and discuss your digital transformation journey.

If you would like to read more from Ronald van Loon on the possibilities of Big Data and IoT please click ‘Follow‘ and connect on LinkedIn and Twitter.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for data science, big data, business intelligence and data mining. He is a guest author on leading big data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on big data and digital transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in big data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Enterprise Journey to Becoming Digital appeared first on Ronald van Loon.


May 13, 2017

Simplified Analytics

Internet of (Medical) things in Healthcare

Over the past few decades, we’ve gotten used to the Internet and cannot imagine our lives without it. Millennials and new age kids don’t even know what life is without being online. With the...


May 12, 2017

Revolution Analytics

Because it's Friday: Video projection

Images are just data. That's why, using machine learning, we can predict what an image represents by pushing its pixels into a sufficiently-trained neural network. In much the same way, given one...


Revolution Analytics

Analyzing the home advantage in English soccer, with R

It's well-known that the home team has an advantage in soccer (or football, as it's called in England). But which teams have made the most of their home-field advantage over the years? Evolutionary...



May 11, 2017

Revolution Analytics

Analyzing data on CRAN packages

There's a handy new function in R 3.4.0 for anyone interested in data about CRAN packages. It's not documented, but it's pretty simple: tools::CRAN_package_db() returns a data frame with one row for...

Silicon Valley Data Science

Managing Spark and Kafka Pipelines

Do you fully understand how your systems operate? As an engineer, there is a lot you can do to aid the person who is going to manage your application in the future. In a previous post we covered how exposing the tuning knobs of the underlying technologies to operations will go a long way to making your application successful. Your application is a unique project—it’s easier for you to learn the operational aspects of the underlying technologies than for others to learn the specifics of all the applications.

Notice I said “the person who is going to manage your application in the future” and not “operations.” The line between development and operations is blurring. The importance of engineering knowledge in operations, and the importance of getting engineers to put operational thinking into their solutions, is really what is behind Amazon CTO Werner Vogels’ famous edict:

“Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software.”

Monitoring and alerting is not traditionally seen as a development concern, but if you had to manage your application in production, you would quickly want to figure it out. In this post, we will cover some of the basics of monitoring and alerting as it relates to data pipelines in general, and Kafka and Spark in particular.

What needs to be monitored?

Besides alerting for the hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.

Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.
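As a sketch of that idea (the record shape and stage names here are hypothetical, not from any particular library): each pipeline stage appends a timing record keyed by the message ID, and end-to-end latency falls out of joining those records on the ID.

```python
import time

def record(log, message_id, stage):
    """Append a timing record for one stage of the pipeline."""
    log.append({"id": message_id, "stage": stage, "ts": time.time()})

def end_to_end_latency(log, message_id):
    """Latency = last timestamp minus first timestamp for one message."""
    stamps = sorted(r["ts"] for r in log if r["id"] == message_id)
    return stamps[-1] - stamps[0]

# Simulate one message flowing through three stages of a pipeline.
log = []
record(log, "msg-42", "ingest")
record(log, "msg-42", "transform")
record(log, "msg-42", "sink")
print(f"latency: {end_to_end_latency(log, 'msg-42'):.6f}s")
```

In a real pipeline the "log" would be your centralized metrics or logging system, and the records would come from many hosts, which is why a shared work-unit identifier is the essential ingredient.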

Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.

Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out at the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline but need to be fed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.
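One way this can look in application code (the event schema and field names below are illustrative assumptions): a tolerated error is corrected in place so the pipeline keeps flowing, but it is still emitted as a structured error event for the monitoring system to count.

```python
error_events = []  # stand-in for a metrics/monitoring sink

def emit_error_event(stage, kind, detail):
    """Surface a tolerated error as a structured event for monitoring."""
    error_events.append({"stage": stage, "kind": kind, "detail": detail})

def parse_amount(raw, stage="transform"):
    """Parse a currency field; tolerate bad input but record it as an event."""
    try:
        return float(raw)
    except ValueError:
        emit_error_event(stage, "bad_amount", raw)
        return 0.0  # corrected/defaulted value, pipeline keeps flowing

values = [parse_amount(x) for x in ["10.5", "oops", "3"]]
print(values, len(error_events))  # → [10.5, 0.0, 3.0] 1
```

The key point is the split: the return value feeds the pipeline, while the event feeds monitoring, so error rates stay visible even when nothing crashes.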

Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.

In each of these cases, imagine an operations person trying to determine what these are for your app, from only the deployment information. As a developer of the data pipe, you have unique information about the functioning of the application, and also have an opportunity to produce and expose this key information in your code.

Kafka and Spark monitoring

On the Kafka side, dials and status aren’t enough for a pipeline—we need to see end to end. You owe it to yourself to look at the Kafka Control Center. This tool adds monitors at each producer and consumer, and gives end-to-end metrics explorable by time frame, topic, partition, consumer, and producer. If you are not getting the enterprise license, study the Control Center’s features and replicate them in your custom efforts.

Another framework to look into for monitoring Kafka is the Kafka Monitor, which sets up a test framework that essentially allows you to build end to end tests. You will want tests that exercise the elements listed in the paragraph above.

Spark and Kafka both have the ability to be monitored via JMX, Graphite, and/or Ganglia. JMX makes sense if you are integrating into an existing system. Graphite excels at building ad hoc dashboards, and Ganglia is a good monitoring infrastructure with deep roots in the Hadoop monitoring ecosystem. These systems will give you a good feel for overall throughput and the efficiency of reads and writes, and provide the values needed to match batch duration against the amount of compute per batch.
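Graphite in particular is simple to feed by hand: its plaintext protocol is one line per datapoint, `metric.path value timestamp`, sent to the Graphite host (TCP port 2003 by default). A minimal sketch, with the metric name chosen purely for illustration:

```python
import time

def graphite_line(path, value, ts=None):
    """Format one datapoint in Graphite's plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\n' (normally sent to TCP 2003)."""
    ts = int(ts if ts is not None else time.time())
    return f"{path} {value} {ts}\n"

line = graphite_line("pipeline.kafka.consumer.lag", 1234, ts=1494547200)
print(line, end="")
# To ship it for real: open a TCP socket to your Graphite host on port 2003
# and send the encoded line (omitted here so the sketch stays self-contained).
```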

Finally, if you really want to get a picture of the data graph as a whole, you can implement distributed tracing. Open Zipkin is an implementation of the Google Dapper paper and is a whole other level of tracing. With distributed tracing, your monitoring/tracing system pulls together your event stream into a single trace that includes all nodes through the system. Distributed traces provide a tree of all the events that happened to a packet of data as it goes through the system, and also provide information from all the nodes of a system for a given message.

While this provides all the info you might need about a single message, it produces a lot of data, so Dapper-based systems rely on sampling. Since not all events are sampled, distributed tracing can only be used effectively for persistent problems.
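A hedged sketch of how Dapper-style head-based sampling is often done (the hashing scheme below is illustrative, not Zipkin's actual code): the keep/drop decision is derived deterministically from the trace ID, so every node in the system makes the same decision for a given trace without any coordination.

```python
import hashlib

def sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head-based sampling: hash the trace ID into [0, 1)
    and keep the trace if that value falls below the sample rate."""
    h = int(hashlib.md5(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < rate

# Roughly 10% of traces survive at a 10% sample rate, and the decision
# for any one trace ID is always the same on every node.
kept = sum(sampled(f"trace-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces at a 10% sample rate")
```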

Alerting for data pipelines

Alerting systems should alert at the business service level (“customer data is not getting to the database”), not at the error event level (“process on node 3 failed”).

A business service is made up of one or more jobs; those jobs are composed of one or more instances of code working on some infrastructure. When errors start happening on various instances of code, or some key computers go down, an avalanche of error events starts accumulating. These events need to be easy to find when the conditions kick off a service failure alert. A good alerting system may record many events, but should raise only one alert per service-level problem. Additionally, events should be indexable by the services they impact.
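A toy rollup illustrating that hierarchy (the job-to-service mapping and the threshold are invented for the example): many instance-level error events collapse into at most one alert per affected business service.

```python
from collections import defaultdict

# Hypothetical mapping from jobs to the business services they compose.
JOB_TO_SERVICE = {
    "ingest-worker": "customer-data-pipeline",
    "db-writer": "customer-data-pipeline",
    "report-builder": "nightly-reporting",
}

def alerts_from_events(events, threshold=3):
    """Fire one alert per service whose error-event count crosses threshold."""
    counts = defaultdict(int)
    for ev in events:
        counts[JOB_TO_SERVICE[ev["job"]]] += 1
    return sorted(svc for svc, n in counts.items() if n >= threshold)

# Four events across two jobs of one service, one event on another service:
events = [{"job": "ingest-worker"}] * 2 + [{"job": "db-writer"}] * 2 \
       + [{"job": "report-builder"}]
print(alerts_from_events(events))  # → ['customer-data-pipeline']
```

The events themselves remain queryable by service for the person handling the incident; only the alert is deduplicated.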

For your pipeline to succeed in production, you must create documentation of what to look for, as well as the hierarchy of events—from instances, to jobs, to services. When an alert triggers incident management, the person handling it should have enough information to know what to look for, and then be able to quickly look for the problem. For example, they may ask themselves:

  • Are any of the hosts associated with this service down?
  • Is data flowing end to end, are any queues backed up?
  • Which jobs in this service are showing error events?

The systems identified in monitoring will also have alerting mechanisms. Your infrastructure team likely already has an alerting system and so you will need to find out what is being used in your organization and conform to that. Generally, these systems have simple ways for you to submit your alerts; that is the easy part. You will also need to work out what happens when someone gets alerted.

The reality of alerting without developer input

What I’ve described thus far is what would happen in an ideal world (and what you should work toward in your own systems). The reality, however, is that we often encounter alerting systems that do not take these precautions. Without the critical information about the specifics of the applications, operations builds a system based on trial and error.

Most of us are in organizations somewhere between these two extremes, and hopefully this post has encouraged you to learn more about how to instrument your system (or has validated you if you’re already doing so). Not sure where to start? Contact us to learn more about how we can help.

Further resources

A particularly inspirational talk on adopting an operational development outlook is from Theo Schlossnagle: A Career in Web Operations.



The post Managing Spark and Kafka Pipelines appeared first on Silicon Valley Data Science.

Ronald van Loon

How Artificial Intelligence will Transform IT Operations and DevOps

To state that DevOps and IT operations teams will face new challenges in the coming years sounds a bit redundant, as their core responsibility is to solve problems and overcome challenges. However, with the dramatic pace at which the current landscape of processes, technologies, and tools is changing, it has become quite difficult to keep up. Moreover, the pressure business users have been putting on DevOps and IT operations teams is staggering, demanding that everything should be solved with a tap on an app. At the backend, however, handling issues is a different ball game; the users can’t even imagine how difficult it is to find a problem and solve it.

One of the biggest challenges IT operations and DevOps teams face nowadays is being able to pinpoint the small yet potentially harmful issues in large streams of Big Data being logged in their environment. Put simply, it is just like finding a needle in the haystack.

If you work in the IT department of a company with an online presence that boasts 24/7 availability, here is a scenario that may sound familiar to you. Assume that you get a call in the middle of the night from an angry customer or your boss complaining about a failed credit card transaction or an application crash. You go to your laptop right away and open the log management system. You see that more than a hundred thousand messages were logged in the relevant timeframe – a data set impossible for a human being to review line by line.

So what do you do in such a situation?

This is the story of every IT operations and DevOps professional: they spend many sleepless nights navigating through a sea of log entries to find the critical entries that triggered a specific failure. This is where real-time, centralized log analytics comes to the rescue. It helps them understand the essential aspects of their log data and easily identify the main issues. With this, the troubleshooting process becomes shorter and more effective, and experts can even predict future problems.
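The first step of such an analysis can be as simple as filtering and counting. A minimal sketch, assuming a hypothetical `timestamp level message` log format: narrow the raw entries to the error lines inside the incident window, then rank them by frequency to surface the dominant failure.

```python
from datetime import datetime

def triage(lines, start, end):
    """Count ERROR messages inside [start, end], most frequent first."""
    counts = {}
    for line in lines:
        ts_str, level, msg = line.split(" ", 2)
        ts = datetime.fromisoformat(ts_str)
        if level == "ERROR" and start <= ts <= end:
            counts[msg] = counts.get(msg, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

lines = [
    "2017-05-12T02:14:01 INFO heartbeat ok",
    "2017-05-12T02:14:03 ERROR card transaction declined: gateway timeout",
    "2017-05-12T02:14:04 ERROR card transaction declined: gateway timeout",
    "2017-05-12T02:14:05 ERROR app crash in checkout",
]
window = (datetime(2017, 5, 12, 2, 0), datetime(2017, 5, 12, 3, 0))
print(triage(lines, *window))
```

Real log management systems do exactly this at scale (plus indexing and pattern learning); the sketch only shows why narrowing by severity and time window collapses a hundred thousand entries to a handful of suspects.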

AI and Its Effect on IT Operations and DevOps

While Artificial Intelligence (AI) used to be a buzzword a few decades ago, it is now being commonly applied across different industries for a diverse range of purposes. Combining big data, AI, and human domain knowledge, technologists and scientists have become able to create astounding breakthroughs and opportunities which used to be possible only in science fiction novels and movies.

As IT operations become agile and dynamic, they are also getting immensely complex. The human mind is no longer capable of keeping up with the velocity, volume, and variety of Big Data streaming through daily operations, making AI a powerful and essential tool for optimizing analysis and decision-making. AI helps fill the gap between humans and Big Data, giving teams the operational intelligence and speed needed to significantly lighten the burden of troubleshooting and real-time decision-making.

Addressing the Elephant in the Room – How AI can Help

In all the above situations, one thing is common: these companies need a solution – as discussed in the beginning – that helps IT and DevOps teams quickly find problems in the mountain of log data entries. To identify that single log entry putting cracks in the environment and crashing your applications, wouldn’t it be easier if you just knew what kind of error to look for when filtering your log data? Of course it would; it could cut the amount of work in half.

One solution can be to have a platform that has collected data from the internet about all kinds of related incidents, observed how people using similar setups resolved them in their systems, and scanned through your system to identify the potential problems. One way to achieve this is to design a system that mimics how a user investigates, monitors, and troubleshoots events, and allow it to develop an understanding of how humans interact with data instead of trying to analyze the data itself. For example, this technology can be similar to Amazon’s product recommendation system and Google’s PageRank algorithm, but focused on log data.

Introducing Cognitive Insights

A recent technology implements a solution as envisioned by this post. The technology – which has been generating quite a lot of buzz lately – is called Cognitive Insights. This groundbreaking technology uses machine-learning algorithms to match human domain knowledge with log data, along with open source repositories, discussion forums, and social threads. Using all this information, it builds a data reservoir of relevant insights that may contain solutions to a wide range of critical issues faced by IT operations and DevOps teams on a daily basis.

Overview of top insights that contain solutions to critical issues 

The Real-Time Obstacles

DevOps engineers, IT operations managers, CTOs, VPs of engineering, and CISOs face numerous challenges, which can be mitigated effectively by integrating AI into log analysis and related operations. While there are several applications of Cognitive Insights, the two main use cases are:

  • Security

Distributed Denial of Service (DDoS) attacks are increasingly becoming common. What used to be just limited to governments, high-profile websites, and multinational organizations is now targeting prominent individuals, SMBs and mid-sized enterprises.

To ward off such attacks, having a centralized logging architecture to identify suspicious activities and pinpoint potential threats from thousands of entries is essential. For this, anti-DDoS mitigation through Cognitive Insights has proven to be highly effective. Leading names, such as Dyn and British Airways, that sustained significant damage from DDoS attacks in the past now have a full-fledged, ELK-based anti-DDoS mitigation strategy in place to keep hackers at bay and secure their operations from any future attacks.

Cognitive Insights pinpoints the potential threats from thousands of entries 

  • IT Operations

Wouldn’t it be great to have all your logs compiled into a single place, with each entry carefully monitored and registered? Well, certainly. You would be able to view the process flow clearly and execute queries against the logs from different applications all from one place, dramatically increasing the efficiency of your IT operations. One of the biggest challenges IT operations and DevOps teams face is pinpointing the small yet potentially harmful issues in large streams of log data in their environment, and this is precisely what Cognitive Insights does. Since the core of this program is based on the ELK stack, it sorts and simplifies the data and makes it easy to get a clear picture of your IT operations. Asurion and Performance Gateway are perfect examples of companies that have leveraged Cognitive Insights and taken their IT game up a notch.
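To make the "query all applications from one place" idea concrete: with logs centralized in Elasticsearch (the "E" in ELK), a single search body can pull error entries across every application for a time window and bucket them by application. The field names used here (`@timestamp`, `level`, `app`) are assumptions about your index mapping, not anything fixed by ELK itself.

```python
import json

# Illustrative Elasticsearch Query DSL body: errors in a one-hour window,
# aggregated per application instead of returning raw hits (size: 0).
query = {
    "query": {
        "bool": {
            "must": [{"match": {"level": "ERROR"}}],
            "filter": [{"range": {"@timestamp": {
                "gte": "2017-05-12T02:00:00", "lte": "2017-05-12T03:00:00"
            }}}],
        }
    },
    "aggs": {"by_app": {"terms": {"field": "app"}}},
    "size": 0,
}
body = json.dumps(query)
print(body[:60], "...")
# You would POST this body to <your-es-host>/<your-log-index>/_search.
```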

Quickly find the needle in the “IT operations” haystack and eliminate the main problems

The Good AI Integration can Yield

Using AI-driven log analytics systems, it becomes considerably easier to find the needle in the haystack and efficiently solve issues. Such a system will have a considerable impact on the management and operations of the entire organization. As with the companies discussed above, integrating AI into a log management system brings:

  • Improved customer success
  • Better monitoring and customer support
  • Risk reduction and resource optimization
  • Maximized efficiency through accessible logging data

In other words, Cognitive Insights and other similar systems can be of great help in data log management and troubleshooting.

Rent-A-Center (RAC) is a Texas-based, Fortune 1000 company that offers a wide range of rent-to-own products and services. It has over 3000 stores and 2000 kiosks spread across Mexico, Puerto Rico, Canada, and the United States. The company tried integrating two different ELK stacks, but handling 100GB of data every day was too much of a hassle, not to mention the exorbitant cost and time spent every day on disk management, memory tuning, additional data input capabilities, and other technical issues. RAC transitioned to Cognitive Insights, which gave them confidence that they would be able to detect future anomalies and made it quite easy to scale with the constantly growing volume of data. They also benefitted from a dedicated IT team managing on-premise and off-premise ELK stacks.

The Role of Open Source in Data Log Management

Many reputed vendors are proactively researching and testing AI in different avenues to enhance the efficiency of data log management systems.

It is no surprise that ELK is fast becoming part of the trend, with more and more vendors offering logging solutions: it gives companies a way to install a setup without incurring a staggering upfront cost. It also allows for some basic graphing and searching capabilities, and to recognize the issues in their haystack of log data, organizations can opt for the latest technologies, like Cognitive Insights, to quickly find the needle and eliminate the main problems.

Make sure you join in the discussion online about AI. For even more insights and information on Artificial Intelligence and Big Data, connect with Ronald van Loon on LinkedIn and Twitter.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers for predictive analytics by DataConomy, and for Data Science, Big Data, Business Intelligence and Data Mining by Klout. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in Big Data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience, and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post How Artificial Intelligence will Transform IT Operations and DevOps appeared first on Ronald van Loons.

Ronald van Loon

Data Analytics is Transforming Healthcare Systems

Big Data Analytics is entirely transforming business paradigms. Automated databases enable businesses to perform mundane tasks more efficiently. And the commercial sector isn't the only area benefiting from data analytics: its impact is widespread and is being felt across many different sectors, including healthcare.

Access to healthcare is a basic human need. Yet the U.S. healthcare sector is extremely expensive, even compared to other developed economies. In the United States, the burden of the expense ultimately falls on the consumer, since the sector is mostly dominated by private companies. America, however, ends up spending more on its public healthcare system than countries where the system is fully publicly funded.

Under such circumstances, where people are paying a significantly higher price, they deserve a service that matches the price tag. The question, then, is: how can data analytics help increase the efficiency of healthcare systems in the United States and around the world?

Performance Evaluation

Keeping a tab on hospital activities by maintaining relevant databases can help administrators find inefficiencies in service provision. Based on the results found from data analysis, specific actions can be taken to reduce the overall costs of a healthcare facility. Reduced costs may be reflected in the form of reduced expense burden on the consumers of those healthcare facilities.

According to U.S. News and World Report, Beaufort Memorial Hospital in South Carolina found that it could save approximately $435,000 annually simply by discharging patients half a day early. The hospital administration didn't make a random decision; they reached that conclusion after carefully analyzing the data. Such is the scope of data analytics in healthcare.

Financial Planning

Data analysis also helps administrators allocate funds efficiently. An organization can achieve maximum transparency in its finances by incorporating automated database systems, significantly reducing the chances of embezzlement and fraud. The United States directs approximately 7.4% of public spending to the healthcare sector, and a significant amount of money is lost annually to fraudulent activities in the health sector. George Zachariah, a consultant at Dynamics Research Corporation in Andover, says: “In recent times it has become imperative for the organizations to use analytics tools to track fraudulent behavior and incorrect payments.”

Patient Satisfaction

We have already established that incorporating analytics into healthcare systems can improve an organization's efficiency at administrative tasks. However, the core objective of a healthcare facility is to cater to the needs of patients. Data analytics has proven beneficial not only for mundane administrative tasks but also for a patient's overall experience. By maintaining a database of patients' records and medical histories, a hospital can cut the cost of unnecessary, repetitive processes.

In addition, analytics can help in keeping an updated record of a patient’s health. With the adoption of more advanced analytics techniques, healthcare facilities can even remind patients to maintain a healthy lifestyle and provide lifestyle choices based on their medical conditions. After all, the whole purpose of introducing digital technology in the health sector is to make sure that people are getting the best facilities at a subsidized cost.

Healthcare Management

Automating processes can help healthcare organizations obtain useful metrics about the population, such as whether a certain segment of the population is more prone to a certain disease. Moreover, if a healthcare facility operates multiple units, analytics can help ensure consistency across all facilities and specific departments. For example, Blue Cross of Idaho used Pyramid Analytics BI Office to create a population health program; the results were evident in reduced ER visits and emergency cases.

Quality Scores and Outcome Analysis

Under certain circumstances, a patient requires consultation with various specialists, and analytics can play a significant role in filling the communication gaps between them. It is important for the consultants to coordinate for the patient's quick recovery, but given their busy schedules, it often proves inconvenient for them to communicate. Digitalization can solve this by providing a shared medium where a patient's data, medical history, and current progress can be stored and reviewed by each individual working on the case. By keeping a continuous check on patients' health, readmissions can also be greatly reduced. A regional medical center can employ analytics to strategically classify patients by their quality scores and allocate maximum resources to the patients most at risk.

Labor Utilization

In addition to patient data, hospitals and healthcare facilities can also store staff data. They can observe staff performance, find loopholes or inefficiencies in the system, and, based on the results, allocate staff strategically across departments. Some healthcare units call for more staff than others, and failure to understand the organization's requirements leads to a loss for both stakeholders.

If you are interested in learning more about how data analytics will transform healthcare systems in the United States and across the world, join the Pyramid Analytics webinar hosted by Ronald van Loon and Angelika Klidas.

For more interesting insights about Big Data Analytics, follow me on LinkedIn and Twitter.



The post Data Analytics is Transforming Healthcare Systems appeared first on Ronald van Loons.


Three ways data improves Customer Retention

The amount of data now available to us is overwhelming: every two days we create as much information as we did from the beginning of time until 2003. As a marketer, the challenge is determining which data is useful and how to turn it into marketing wisdom that leads to customer retention and growth. Considering that it costs five times as much to onboard a customer as it does to retain one, companies would do well to leverage their data to develop and drive retention strategies.

In this post, I look at 3 ways data can be used to build and drive customer retention strategies that result in reduced churn rates and open new avenues for meaningful engagement with target markets.


May 10, 2017


Providing Structure and Security to Web Data Harvesting

Over the years, one of our main focuses has been getting the most out of our Deep Web search. As multiple industries discover the growing need for data harvesting and for managing their Big Data, we’ve taken the necessary steps to provide structure and security for their respective businesses. Here are some of the new and […] The post Providing Structure and Security to Web Data Harvesting appeared first on BrightPlanet.

Read more »

May 09, 2017

Revolution Analytics

Predicting Hospital Length of Stay using SQL Server R Services

Last week, my Microsoft colleagues Bharath Sankaranarayan and Carl Saroufim presented a live webinar showing how you can predict a patient's length of stay at a hospital using SQL Server R Services....

Silicon Valley Data Science

From Data Managers to Platform Providers

One of the benefits of being co-chair for this June’s Spark Summit (June 5-7, San Francisco) is the insight into how this technology is being used in leading companies. In particular, I am excited to see evidence of an important pattern: the creation of internal service platforms to meet the data science and analytic needs of organizations.

These data science platforms give access to data and computation power to communities of business analysts and data scientists within an organization, while conferring the benefits of managed access and scalability to the organization.

I expect this model to become the norm for analytics organizations in enterprises over the next five years. There are two factors that are driving this change.

Firstly, there is the rapid increase in demand for data. In a digital world, data is the way we understand ourselves, our customers, and our competition. Everyone needs data to do their jobs. Sounds like a great thing, but it comes with headaches: if it’s hard to get at data, people will hoard it; point-to-point sharing of data means people will duplicate it, stash it where they can. It’s a data governance nightmare, and makes it really hard for people to build on each other’s work. Strict rules or poor service levels just drive bad behavior underground. The answer is to provide an organized data platform that gives better service.

Secondly, we have the technology now to make organization-wide data platforms economic. As we move to commodity, scale-out, analytics platforms, we have to worry less about guarding resources and policing usage. We can develop a more sophisticated infrastructure that looks a lot more like a data community. Analysts can help each other, sharing data and models.

The advantages of a successful analytic platform are clear: better service levels for data users in the organization, and the prospect of making data governance a feasible endeavor.

Data management professionals face a transition in their roles: from data custodians, to data evangelists; from functioning as a utility, to providing a user-facing product—data as a service. It’s an exciting time, and I’m glad there are some great examples to learn from. These sessions at Spark Summit tell this story further:

To see these talks and more, consider joining me at Spark Summit. Register with the code EDD2017 and get a 15% discount.

The post From Data Managers to Platform Providers appeared first on Silicon Valley Data Science.

Big Data University

This Week in Data Science (May 9, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News


Featured Courses From BDU


Upcoming Data Science Events

The post This Week in Data Science (May 9, 2017) appeared first on BDU.


May 08, 2017

Revolution Analytics

In case you missed it: April 2017 roundup

In case you missed them, here are some articles from April of particular interest to R users. The rxExecBy function (in Microsoft R Server) deploys "embarrassingly parallel" problems to remote compute...


May 07, 2017

Simplified Analytics

Terminator or Iron Man – What will AI bring in future?

In the age of Digital Transformation, Artificial Intelligence has come a long way from Siri to driverless cars. If you have used a GPS on Google Maps to navigate in your car, purchased a book...


May 05, 2017

Revolution Analytics

Because it's Friday: Bayesian Trap

If you get a blood test to diagnose a rare disease, and the test (which is very accurate) comes back positive, what's the chance you have the disease? Well if "rare" means only 1 in a thousand people...
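The surprise in this teaser comes straight from Bayes' theorem, and the arithmetic is quick to sketch. The numbers below are illustrative, not taken from the post: a prevalence of 1 in 1,000, a test with 99% sensitivity and a 1% false-positive rate.

```shell
# P(disease | positive) = P(pos | disease) * P(disease) / P(pos)
awk 'BEGIN {
  prior = 0.001          # 1 in 1000 people have the disease
  sens  = 0.99           # P(positive | disease)
  fp    = 0.01           # P(positive | no disease)
  post  = sens * prior / (sens * prior + fp * (1 - prior))
  printf "%.3f\n", post  # prints 0.090: about a 9% chance, despite an accurate test
}'
```

Because the disease is rare, the small false-positive rate applied to the large healthy population swamps the true positives, which is the Bayesian trap the video describes.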


Revolution Analytics

Statistician ranked as best job in 2017 by CareerCast

According to job hunting site CareerCast, the best job to have in 2017 is: Statistician. This is according to their 2017 Jobs Rated report, based on an evaluation of Bureau of Labor Statistics...

Ronald van Loon

DES-Digital Business World Congress 2017

Anybody who has been following my social media accounts knows that I am a big supporter of incorporating data analytic techniques into everyday business processes. I am really excited about attending the biggest digital conference, DES 2017 (Digital Business World Congress), as a speaker and co-host. This conference will allow me to communicate with a global audience and network with like-minded, digitally savvy, data-driven geeks (I mean professionals).

We are in the midst of a digital revolution, and companies often confront a contradictory choice between operating the business the traditional way and jumping on the digital transformation bandwagon. The obvious and smartest move is to adapt to the changing business environment so that you don’t become obsolete. However, completely altering existing business dynamics is often not an easy task for business owners; it requires determination and a comprehensive understanding of the transformation process.

It is my honor to be invited to co-host the Big Data & Analytics series at the Digital Business World Congress 2017. I will share my experiences and insights about accepting and incorporating the technological advancements in the existing business models.

Turning data into action

Are you a business owner who is often unsure about business decisions? Are you torn between two options, uncertain which is the right way to go? Instead of wasting your time deliberating, analyze the relevant data. The answer is always hidden behind the cryptic numerals. Take useful insights from the data and fuse them with focused action to achieve the growth you desire in your business.

Three Pillars of data-driven culture

I always say that there are three main pillars of creating a data-driven culture in an organization: People, Processes, and Technology. It is absolutely imperative that your organization is equipped with the relevant human capital who can grasp the idea of the manual-to-digital transformation, and who possess the skills to successfully continue and further the transition. It is important to understand that incorporating relevant technologies into the business process is an ongoing phenomenon, so the company has to stay constantly aware of the changes happening in the tech world.

I will be talking about these topics in detail at the conference, along with discussing other useful and interesting topics such as Artificial Intelligence, Data-driven culture in organizations, Data-driven customer experience and the overall Data governance journey.

DES 2017 is the right platform for you if you want to learn more about digitalizing your business and instilling a data-driven culture in your organization. Attend the event on May 23rd for a chance to hear the experiences of some of the best industry professionals.

For more updates about the conference and a heads-up on the digital world, follow me on Twitter and LinkedIn.



The post DES-Digital Business World Congress 2017 appeared first on Ronald van Loons.


May 04, 2017

Silicon Valley Data Science

Noteworthy Links: Artificial Intelligence

Interested in how AI is being applied out in the real world? Check out the stories below, ranging from fighting food insecurity, to a very low-level version of a butler.

Farm robots—The FarmView project out of Carnegie Mellon University is focused on how technology can help farmers. This article looks at how one of their robots is helping out on a sorghum farm in South Carolina. Among other things, the robot helps the farmer gather data on their crops faster and more efficiently. The FarmView researchers hope that their work can eventually tackle food insecurity.

Dance Dance Convolution—Remember Dance Dance Revolution (DDR)? Well, more machines have entered the world of dance. DDR’s continued popularity inspired a group at UC San Diego to create a choreographing program. While this is a cool idea, the article points out that this is another area of human creativity that technology now has a hand in. What that means for the art of dance is debatable, but an interesting thought exercise.

Guide to building a chess AI—Using JavaScript, this post walks you through building a chess AI, and includes some next steps at the end if you want to get even more complicated and refined. If you try this out, comment and let us know if you’re able to beat what you’ve created!

Build a talking, face-recognizing doorbell for about $100—Want your doorbell to alert you to who is visiting? This post walks through how to set that up, using Amazon Echo and Raspberry Pi. It includes the hardware, as well as suggestions for storing your video.

Want to keep up with what we’re up to? Sign up for our newsletter to get updates on blog posts, conferences, and more.

The post Noteworthy Links: Artificial Intelligence appeared first on Silicon Valley Data Science.


How OSINT Strengthens Your Security Risk Management

We spend our days focused on finding, harvesting, curating and helping clients develop insights from open source intelligence (or OSINT). If you aren’t leveraging OSINT, you are missing large amounts of relevant data that is freely available to you. OSINT is under-used as a foundational tool for security and risk management. Why is that? To begin, […] The post How OSINT Strengthens Your Security Risk Management appeared first on BrightPlanet.

Read more »

Forrester Blogs

Cloudera IPO Highlights The Big Data And Hadoop Opportunity

Read more »

InData Labs

How AI is Revolutionizing Healthcare

How artificial intelligence and machine learning techniques are shaping the healthcare industry and health outcomes for millions of people around the world.

The post How AI is Revolutionizing Healthcare appeared first on InData Labs.

Revolution Analytics

Real-time scoring with Microsoft R Server 9.1

Once you've built a predictive model, in many cases the next step is to operationalize the model: that is, generate predictions from the pre-trained model in real time. In this scenario, latency...


May 03, 2017

Revolution Analytics

Technical Foundations of Informatics: A modern introduction to R

Informatics (or Information Science) is the practice of creating, storing, finding, manipulating and sharing information. These are all tasks that the R language was designed for, and so Technical...


What is Customer Segmentation?

Effective communication helps us better understand and connect with those around us. It allows us to build trust and respect, and to foster good, long-lasting relationships. Imagine having this ability to connect with every customer (or potential customer) you interact with through communication that addresses their motivators and desires. In this blog post, I take a brief look at ‘customer segmentation’ and how it can foster the type of communication that leads to greater customer retention and conversion rates.

Cloud Avenue Hadoop Tips

Attaching an EBS Disk to a Linux Instance

In the previous blog, we looked at the sequence of steps to create a Linux instance and log into it. In this blog, we will create a new hard disk (actually an EBS volume) and attach it to the Linux instance. I am going with the assumption that the EC2 instance has already been created.

1. Go to the EC2 management console and click on Volumes in the left pane. Then click on `Create Volume`, change the volume size to 1 GB, and click on Create.

2. It takes a couple of seconds, but there will then be two EBS volumes: an 8 GB volume (in-use) which was automatically created at the time of EC2 creation, and a 1 GB volume (available) which was created in the above step.

3. Select the 1 GB volume, click on Actions, and then Attach Volume. We have created a Linux EC2 instance and an EBS volume; now we need to attach the two.

4. Move the cursor to the instance box, select the EC2 instance to which the EBS volume has to be attached, and then click Attach. The state of the 1 GB volume should change from available to in-use.

5. Get the IP address of the EC2 instance and log into it using Putty as mentioned here.

6. Change to root (sudo su) and get the list of partition tables (fdisk -l). The commands are mentioned in the parentheses. Note from the output that the device name here is /dev/xvdf. It may vary, so the fdisk -l command has to be run to confirm.

7. Build the Linux filesystem (mkfs /dev/xvdf), create a folder (mkdir /mnt/disk100), and finally mount the filesystem (mount /dev/xvdf /mnt/disk100). The commands are mentioned in the parentheses. Note that I have chosen to create the disk100 folder; you can replace it with any other folder name.

Now that the device has been mounted at /mnt/disk100, data written to this folder will be stored on the 1 GB EBS volume we created in one of the previous steps. Even after stopping the EC2 instance, the data will still be there on the EBS volume. The EBS volume can also be attached to another EC2 instance.
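Collected in one place, the device-side commands from steps 6 and 7 look like the sketch below. It assumes the new volume appears as /dev/xvdf (confirm with fdisk -l first, since the device name can vary), and it passes an explicit ext4 filesystem type to mkfs for clarity, which the steps above left implicit.

```shell
# Run as root (sudo su), after the 1 GB volume has been attached in the console.
fdisk -l                        # confirm the device name of the new volume
mkfs -t ext4 /dev/xvdf          # build a Linux filesystem on it
mkdir -p /mnt/disk100           # create a mount point (any folder name works)
mount /dev/xvdf /mnt/disk100    # mount it; writes here land on the EBS volume
df -h /mnt/disk100              # verify the mount
```

Note that this mount does not survive a reboot; to make it permanent, an entry for the device would also be added to /etc/fstab.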

Note that in the case of AWS, an EBS volume cannot be attached to multiple EC2 instances. But, the same thing can be done in the Google Cloud Platform.

Don't forget to terminate the Linux EC2 instance and delete the EBS volume. In the next blog, we will attach an EBS volume to a Windows EC2 instance, which is a bit easier.

May 02, 2017

Silicon Valley Data Science

SVDS Data Strategy: New Video Available

If you’ve attended a data industry conference in the last three years, you may well have noticed our Developing a Modern Enterprise Data Strategy tutorial. In it, we explain our method for data strategy, and equip you with the language and tools to make the case for data strategy in your own organization.

We’re happy to announce that we have produced Developing a Modern Enterprise Data Strategy as a video product, available from O’Reilly Media and Safari Books Online.

Get the Video

Over four chapters and three hours, we cover the role of data strategy in modern business, how to map data investments to business goals, the role of tools and architecture, and the end game of building a data-driven organization.

And if you want to hear this tutorial in person, join us at an upcoming event, such as Interop ITX (Las Vegas, 15 May 2017).


The post SVDS Data Strategy: New Video Available appeared first on Silicon Valley Data Science.

Revolution Analytics

The Datasaurus Dozen

There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly...


Revolution Analytics

Using Microsoft R with Alteryx

Alteryx Designer, the self-service analytics workflow tool, recently added integration with Microsoft R. This allows you to train models provided by Microsoft R, and create predictions from them,...

Big Data University

This Week in Data Science (May 2, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News


Featured Courses From BDU


Upcoming Data Science Events

Cool Data Science Videos

The post This Week in Data Science (May 2, 2017) appeared first on BDU.


April 30, 2017

Ricky Ho

An output of a truly random process

Recently I had a discussion with my data science team about whether we can determine if a set of observations follows a random process or not. Basically, data science is all about learning hidden...


April 29, 2017

Simplified Analytics

5 ways to improve the model accuracy of Machine Learning!!

Today we are in the digital age; every business is using big data and machine learning to effectively target users with messaging in a language they really understand, and to push offers, deals and ads that...


April 28, 2017

Revolution Analytics

Where Europe lives, in 14 lines of R code

Via Max Galka, always a great source of interesting data visualizations, we have this lovely visualization of population density in Europe in 2011, created by Henrik Lindberg: Impressively, the chart...


Revolution Analytics

Because it's Friday: Powerpoint Punchcards

A "Turing Machine" — a conceptual data processing machine that processes instructions encoded on a linear tape — is capable of performing the complete range of computations of any modern computer...


Revolution Analytics

Make pleasingly parallel R code with rxExecBy

Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these "embarrassingly...


Revolution Analytics

dv01 uses R to bring greater transparency to the consumer lending market

The founder of the NYC-based startup dv01 watched the 2008 financial crisis and was inspired to bring greater transparency to institutional investors in the consumer lending market. Despite being an...


Finovate Spring 2017: Highlights from Day 2

In this blog post, I will be covering some of the highlights from day 2 of the Finovate Spring Conference. Whereas the previous review of day 1 mostly covered innovations that assist the loans origination process, in this blog I’ll cover some of the analytical offerings. I’ll also cover some of the other offerings including training, investment platforms, and bot-based technology. 


April 27, 2017

Silicon Valley Data Science

How Mature Are Your Data Capabilities?

In a previous post on data maturity, we discussed a company that was just embarking on a transformation: launching a new services business and building data capabilities to support that business. But what if you’re not starting from the beginning? What if you’ve already been embracing new technology, conducting pilots, and launching new analytical platforms? Recently, we were working with a Fortune 500 industrial company in the midst of developing software services to improve product R&D and enrich the customer experience.

Their goal was to use data to empower decision makers across every part of the organization to make robust, data-driven choices. The company had great talent, technical vision, and infrastructure. Still, they weren’t making the progress they would have liked at the rate they expected. What was wrong?

We were asked to perform an assessment of their capabilities and architecture to help them become truly data-driven.

Assess Your Data Maturity

Understanding their overall data maturity shined a light on the areas requiring attention to get the most out of their technology investments:

  • Missing links between projects and metrics: The initiative's overall success was being measured by a single metric that they could only begin tracking in 2020, at the completion of the transformation. This led to significant uncertainty within the project teams building new capabilities and platforms. Many teams were unsure where their analytical work fit into the larger effort and, more importantly, whether they were contributing to the overall success of the transformation.
  • Lack of cross-functional teams: The analytical infrastructure built by the engineering team was impressive, but was sorely underutilized. The data scientists had not been trained to use it and did not know how to access it. We heard from an analytics manager: “Seventy percent of my team’s time is spent on writing UDFs and Pig scripts to access data!” Creating teams that facilitated collaboration between engineers and data scientists was an opportunity for quick productivity gains with expensive talent.
  • Siloed business functions: Teams felt a lack of clear objectives that stemmed from communication and information sharing issues with other teams. For example, one team integral to product development described their view of the future as “a dusty window.” Business units on the consumption side of application development experienced very uneven usage of analytical tools. Strengthening these partnerships was crucial as the overarching project’s success relied specifically on strong analytical capabilities throughout the entire organization.

In working together, we helped the industrial company link their data and analytical capabilities with their ultimate business objectives, allowing them to create the right metrics to truly understand their progress. We helped them improve their DevOps capabilities and better integrate their engineering and data science teams. Collectively, this helped them break down the technical and organizational silos that were hampering progress.

Understanding the uneven maturity of their capabilities across people, process, and systems gave them the answer to the question on everyone's minds: How can we see real results faster? A view of their current state of maturity, along with a clear target for where they needed to be, set a baseline and gave them a way to measure progress.

Frustrated with the progress you’re seeing from your data and analytics investments? You don’t have to be. Take our data maturity assessment, and learn where you stand. Understanding where your capabilities are strong and where they’re lacking gives you a great lens for directing and prioritizing investments. Doubling down on your strengths is often a good strategy, but not if immature capabilities are holding you back.

The post How Mature Are Your Data Capabilities? appeared first on Silicon Valley Data Science.