
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

September 21, 2018


Revolution Analytics

Because it's Friday: Fly Strong

I was about the same age as student pilot Maggie Taraska when I had my first solo flight. Unlike Maggie, I didn't have to deal with a busy airspace, or air traffic control, or engines (I was in a...

...

Revolution Analytics

Applications of R presented at EARL London 2018

During the EARL (Enterprise Applications of the R Language) conference in London last week, the organizers asked me how I thought the conference had changed over the years. (This is the conference's...

...

Forrester Blogs

Nike Scores A Customer-Values Touchdown

Unless you were on the Appalachian Trail for a few weeks, you know what I’m talking about. I’ll start with this paragraph from our newly published analysis of Nike’s “Just Do It” campaign featuring...

...

Forrester Blogs

Bad Bots Are Stealing Data And Ruining Customer Experience

Bad Bots Are Affecting Your Company More Than You Might Think Every online customer touchpoint — including websites, mobile apps, and APIs — is being attacked by bots. What are these bad bots doing?...

...

Forrester Blogs

Adobe Changes Its Marketing Cloud Trajectory With Marketo Acquisition

I am still catching my breath. Adobe has agreed to acquire Marketo for $4.75 billion. The deal is the biggest in Adobe’s history, and a massive encore to the acquisition of Magento for a mere $1.7...

...


September 20, 2018


Revolution Analytics

AI, Machine Learning and Data Science Roundup: September 2018

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications from...

...

Forrester Blogs

It’s Mostly Quiet On The GDPR Front — But This Is Not The Time For Complacency

So where are all the GDPR enforcement actions? The General Data Protection Regulation (GDPR) entered into force at the end of May 2018, giving unprecedented powers to regulators. From ongoing...

...
Cloud Avenue Hadoop Tips

Node fault tolerance in K8S & Declarative programming

In K8S, everything is declarative and not imperative. We specify the target state to K8S and it will make sure that the target state is always there, even in the case of failures. Basically, we specify what we want (as in the case of SQL) and not how to do it.
In the above scenario, we have 1 master and 2 nodes. We can ask K8S to deploy 6 pods (application instances) onto the nodes and K8S will automatically schedule the pods across them. In case one of the nodes goes down, K8S will automatically reschedule the pods from the failed node to a healthy node. I reiterate: we simply specify the target state (6 pods) and not where to deploy, how to address the failure scenarios etc. Remember, declarative and not imperative.
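
As a minimal sketch of what "declarative" means in practice, the Deployment manifest below states only the desired state (the image and the replica count); the names are illustrative, chosen to match the demo steps that follow, and K8S works out the scheduling and rescheduling on its own:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ghost              # illustrative name, matching the demo below
spec:
  replicas: 6              # the target state; K8S keeps 6 pods running, even across node failures
  selector:
    matchLabels:
      app: ghost
  template:
    metadata:
      labels:
        app: ghost
    spec:
      containers:
      - name: ghost
        image: ghost       # the same Docker image used in the demo steps
EOF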

For some reason it takes about 6 minutes for the pods to be rescheduled on the healthy node, even after the configuration changes mentioned here. I need to look into this a bit more.
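A likely explanation (my assumption, not verified on this cluster): the kube-controller-manager only marks a node NotReady after the node monitor grace period (40s by default) and then waits out the pod eviction timeout (5 minutes by default) before rescheduling the pods, which adds up to roughly the 6 minutes observed. These are the relevant controller-manager flags, shown with their stock defaults:

kube-controller-manager --node-monitor-grace-period=40s --pod-eviction-timeout=5m0s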

Here is a video demoing the same in a small cluster. We can notice that when one of the nodes goes down, K8S automatically reschedules the corresponding pods to a healthy node. We don't need to wake up in the middle of the night to rectify a problem, as long as we have additional resources to absorb failures.


Here is the sequence of steps. The same steps can be executed on a K8S cluster in the Cloud or locally on your laptop; in this scenario, I am running the K8S cluster on my laptop. The sequence of steps may seem lengthy, but it can be automated using Helm, which is a package manager for K8S.

Step 1 : Start the K8S cluster in VirtualBox.

Step 2 : Make sure the cluster is up. Wait for a few minutes for the cluster to be up. Freezing the recording here.
kubectl get nodes

Step 3 : Clean the cluster of all the resources
kubectl delete po,svc,rc,rs,deploy --all

Step 4 : Deploy the Docker ghost image (default replica is 1)
kubectl run ghost --image=ghost

Step 5 : Check the number of pods (should be 1)
kubectl get rs

Step 6 : Check the node in which they are deployed
kubectl get pods -o wide | grep -i running

Step 7 : Scale the application (replicas to 6)
kubectl scale deployment --replicas=6 ghost

Step 8 : Check the number of pods again (should be 6)
kubectl get rs

Step 9 : Check the node in which they are deployed (The K8S scheduler should load balance the pods across slave1 and slave2)
kubectl get pods -o wide | grep -i running

Step 10 : ssh to one of the slave nodes and bring it down
sudo init 0

Step 11 : Wait for a few minutes (default ~6min). Freezing the recording here.

Step 12 : Check if the pods are deployed to healthy node
kubectl get pods -o wide | grep -i running

Hurray!!! The pods have been automatically deployed on a healthy node.

Additional steps (not required for this scenario)

Step 1 : Expose the pod as a service
kubectl expose deployment ghost --port=2368 --type=NodePort

Step 2 : Get the port of the service
kubectl get services ghost

Step 3 : Access the webpage using the above port
http://master:port
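
As a small usage sketch (assuming the 'ghost' service created above; the jsonpath lookup is just one way to grab the NodePort):

PORT=$(kubectl get svc ghost -o jsonpath='{.spec.ports[0].nodePort}')
curl http://master:$PORT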

In the upcoming blogs, I will try to explain a few more features of K8S using demos. Keep looking !!!
 

September 19, 2018


Forrester Blogs

Who Will Win The Future Of Payments?

Autonomous payments is the future state of payments – and banks, merchants and payment vendors need to prepare now. Venture funding in this arena already exceeds $28 billion. Coupled with...

...

Forrester Blogs

Wanted: Fintech Expert

Financial services firms are on the hunt. They’re hunting unicorns. As firms like Revolut and TransferWise attain valuations of $1 billion or more, the pressure to find the Stripes of this...

...

Forrester Blogs

Is The Data Ops Workbench A Thing?

Data ops, data engineering, data development — oh my! From new roles and teams to new skills and processes, the hot topic on everyone’s mind is data ops. I started to notice the data ops emergence...

...
 

September 18, 2018


Revolution Analytics

Not Hotdog: A Shiny app using the Custom Vision API

I had a great time at the EARL Conference in London last week, and as always came away invigorated by all of the applications of R that were presented there. I'll do a full writeup of the conference...

...

Forrester Blogs

Media Consultancies Come Of Age

The first Forrester Wave™ on global media agencies reveals the category-adding technology, services, and capabilities to leap beyond conventional media planning and buying. Media agencies are...

...

Cloud Avenue Hadoop Tips

'Kubernetes in Action' Book Review

What is Kubernetes?

Previously we looked at what Docker and Containers are all about. They are used to deploy microservices. These microservices are lightweight services and can run in the tens, with hundreds of replications. Ultimately this leads to thousands of containers on hundreds of nodes/machines, which brings some complex challenges.
  • The services in the containers are deployed on whichever nodes are least utilized and have the required resources (like SSD/GPU), so placement is not deterministic. How then do the services discover each other?
  • There can be different failures, like network or hardware. How do we make sure that at any point of time a fixed number of containers is available, irrespective of the failures?
  • Application updates are the norm. How do we update such that there is no downtime for the application? Blue-Green, Canary deployment ....
Likewise, there are many challenges which are a common concern across applications when working with microservices and containers. Instead of solving these common concerns in each application, Kubernetes (K8S) does it for us. K8S was started by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). Google recently took a step back on K8S to let others in the ecosystem get more involved.

Although Google started K8S, a lot of other companies have adopted it: AWS EKS, Google Kubernetes Engine (GKE) and Azure AKS, to name a few managed offerings. This is another reason why we will be seeing more and more of K8S in the future.

Who doesn't love comics? Here is one on K8S from Google and another here. There is also a simple YouTube video here.

Review of Kubernetes in Action Book

  • The Kubernetes in Action book starts with a brief introduction to Docker and K8S, and then jumps into the practical aspects of K8S. I wish there was a bit more about Docker.
  • As with any other book, it starts with simple concepts and gradually discusses the complex topics. Lots of examples are included in the book.
  • The K8S ecosystem is growing rapidly. My only gripe is that the ecosystem is not covered in the book.

Conclusion

K8S is a complex piece of software to set up if one is new to Linux. There are multiple ways of setting up K8S, as mentioned here. One easy way is to use a preinstalled K8S cluster in the Cloud. But this comes at a cost, and not everyone is comfortable with the concepts of the Cloud.

So, there is Minikube, which is a Linux virtual machine with K8S and the required software already installed and configured. Minikube is easy to set up and runs on Windows, Linux and Mac. In future blogs, we will look at the different ways of setting up K8S and how to use them. Keep looking !!!

Finally, I would recommend the Kubernetes in Action book to anyone who wants to get started with K8S. The way we build applications has been moving from monolithic to microservices, and K8S accelerates that shift. So the book is a must for those who are into software.
Cloud Avenue Hadoop Tips

Getting started with K8S with Minikube on Linux

Why Minikube?

As mentioned in the previous blog, setting up K8S is a complex task, and for those new to Linux it might be a bit of a challenge. And so we have Minikube to the rescue. The good thing about Minikube is that it requires very few steps and runs on multiple Operating Systems. For the curious, there are tons of ways of installing K8S, as mentioned here.

Minikube sets up a Virtual Machine (VM). The VM is very similar to those from Cloudera, HortonWorks and MapR which are used for Big Data. These VMs have the different software packages already installed and configured, which makes them handy for those who want to get started with the respective software and also for demos. But these VMs are not meant for production use.

Minikube is easy to use, but it has a few disadvantages. It runs on a single node, so we won't be able to try some features like the response to a node failure or some of the advanced scheduling. Still, Minikube is a nice way to get started with K8S.

Installing Minikube on Ubuntu

I tried out the instructions mentioned here and they work as-is for Ubuntu 18.04, so I won't repeat them in this blog. Go ahead and follow the instructions to complete the setup of Minikube. Here are a few pointers, though.

  • When we run 'minikube start' for the first time it has to download the VM and so is a bit slow; from then on it's fast.
  • In the VirtualBox UI, the minikube VM will be shown as below in the running state. Note that the VM image has been downloaded, configured and started.
  • By default, not much memory and CPU are allocated to the VM. So first we need to shut down the VM as shown below. The status of the VM should change to powered off.

  • Now go to the settings of this particular VM and change the memory and CPU settings. Make sure not to cross the green line, per the VirtualBox recommendations. After making the resource changes, start minikube again. Notice that the VM will not be downloaded this time. (A command-line alternative is sketched after this list.)
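
If you prefer the command line to the VirtualBox UI, minikube itself can persist the resource settings; a minimal sketch (the sizes are illustrative, not a recommendation):

minikube stop                      # the VM must be stopped before resizing
minikube config set memory 4096    # MB to allocate on the next start
minikube config set cpus 2
minikube start                     # starts the VM with the new resources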


Conclusion

Now that we have looked at how to set up Minikube on Ubuntu, and since I am aware not everyone has Ubuntu, we will also explore installing Minikube on Windows.
Also, we will slowly explore the different features of K8S in the upcoming blogs. So, keep looking.
Cloud Avenue Hadoop Tips

Installing Minikube on Windows

Introduction

In the previous blog, we looked at installing Minikube on Linux. In this blog we will install Minikube on a Windows machine. To my pleasant surprise, the installation was as easy as in the case of Linux.

Installing Minikube on Windows

  • Install VirtualBox as mentioned here. The blog is somewhat old, but the instructions are more or less the same for installing VirtualBox.
  • Install Chocolatey, which is a Package Manager for Windows, using the instructions here. From here on, Chocolatey can be used to install/update/delete Minikube. It's somewhat similar to apt and yum in the Linux environment. I did this using PowerShell, but the same can be done from the command prompt also.
  • Now it's time to install Minikube as mentioned here; we will use Chocolatey for the same. The 'choco install minikube' command installs Minikube itself, not the VM in VirtualBox.
  • Now run the 'minikube start' command. This will download and configure the K8S VM, log into the VM, start a few services and also set up kubectl on the host to point at the VM. Although the VM has started, the status in VirtualBox is shown as 'Powered Off'. Not sure why.

  • Log into the VM using the 'minikube ssh' command and issue 'sudo init 0' to terminate the VM. Run the 'minikube start' command to start the VM again. (The commands are collected in the sketch after this list.)
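
Collected in one place, the Windows install is roughly the following PowerShell session (a sketch of the steps above; it assumes VirtualBox and Chocolatey are already installed):

choco install minikube   # installs the minikube binary via Chocolatey
minikube start           # downloads/configures the K8S VM and points kubectl at it
minikube status          # sanity check that the cluster components are running
minikube ssh             # log into the VM; 'sudo init 0' inside it terminates the VM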

Conclusion

In the earlier blog, we installed Minikube on Linux, and this time on a Windows machine. In both cases it runs a Linux OS in VirtualBox, so only Linux containers (not Windows containers) can be run with Minikube; still, we can learn many aspects of K8S this way. In the upcoming blogs, we will look at the different concepts around K8S and try them out.

In an upcoming blog, we will also explore running a Windows container, obviously on the Windows OS.
 

September 17, 2018


Forrester Blogs

Stripe Moving Into Point-Of-Sale Transaction Processing

Stripe, one of the leading payment providers for online and platform businesses, today announced “Stripe Terminal” to tackle the merchant challenges around in-person payments....

...

Forrester Blogs

And Now . . . Presenting The 2018 ESM Forrester Wave™!

“The Forrester Wave™: Enterprise Service Management, Q3 2018” is live! Take IT service management, add the age of the customer, stir in some employee experience (EX) and a healthy dose of...

...

Forrester Blogs

Sales And Service Tech: Two Sides Of The Same Coin?

Automation and AI are changing the nature of work. Every company job — including every front-office job — will be impacted. Always-available commerce and intelligent guided-selling solutions erode...

...

Forrester Blogs

Stay Ahead Of Your Customers With Continuous Delivery

If you regularly follow the financial markets, it’s hard to ignore the growth generated by the FAANG stocks (Facebook, Apple, Amazon, Netflix, Google). When you look more closely at the “ANG” part of...

...

Forrester Blogs

Four Things You Must Do Right Now To Rock Your 2018 Holiday

Over $1 trillion (no typo) — that’s how big 2017 holiday topline sales were in the US and Europe combined. As for online sales specifically last year, Forrester projected $129 billion in the US and...

...

Forrester Blogs

It’s Digital Go Time: The Six Vectors Of Investment

It’s been 30 years since the internet made customer self-service and automation the watchwords of business. But most of you are still in the starting gates of digital transformation. Only 15%...

...
     

September 15, 2018

Cloud Avenue Hadoop Tips

K8S Cluster on Laptop

Why K8S Cluster on Laptop?

A few years back I wrote a blog on setting up a Big Data Cluster on the laptop. This time it's about setting up a K8S Cluster on the laptop. There are a few zero-installation K8S setups which can be run in the browser, like Katakoda Kubernetes and Play with Kubernetes, and K8S can also be run in the Cloud (AWS EKS, Google GKE and Azure AKS). So, why install a K8S Cluster on the laptop? Here are a few reasons I can think of.
  • It's absolutely free
  • We get comfortable with the K8S administration concepts
  • We learn, to some extent, what happens behind the scenes
  • The above-mentioned Katakoda and Play with Kubernetes were slow
  • Finally, because we can :)

More details

As mentioned in the K8S documentation, there are tons of options for installing it. I used a tool called kubeadm, which is part of the K8S project. The official documentation for kubeadm is good, but it's a bit too generic, with a lot of options, and also too lengthy. I found the documentation from linuxconfig.org to be good and to the point. There are a few things missing in it, but it's good enough to get started.

I will write a detailed article on the setup procedure, but here are a few highlights for anyone to get started (a minimal command sketch follows this list).

  • Used Oracle VirtualBox to set up three VMs, and installed the master on one of them and the slaves on the other two.
  • Used a laptop with the below configuration. It has an HDD; an SSD would have saved a lot more time during the installation process and also the K8S Cluster boot process (< 2 minutes on the HDD).
  • Even after the K8S Cluster was started, the laptop was still responsive. Below is the System Monitor after starting the K8S Cluster.
  • Below are the kubectl commands to get the list of nodes and services, and also to invoke the service.
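
For orientation, the kubeadm flow boils down to something like the following sketch (not the full linuxconfig.org procedure; the pod network CIDR shown is the one commonly used with the Flannel network add-on and is an assumption here):

# on the master VM
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# configure kubectl for the regular user, as the kubeadm init output instructs
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# on each slave VM, join using the token printed by kubeadm init
sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
# back on the master, verify
kubectl get nodes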

Final thoughts

There are two K8S certifications from CNCF: the Certified Kubernetes Application Developer (CKAD) Program and the Certified Kubernetes Administrator (CKA) Program. The CKAD Certification was started recently and is much easier than the CKA Certification.

The practice for the CKAD Certification can be done in Minikube, which was discussed in the earlier blogs (Linux and Windows). But the CKA Certification requires setting up a cluster with different configurations and troubleshooting it, so setting up a cluster is a must.

Installation using kubeadm was easy; it automates the entire installation process. Installing from scratch would definitely be interesting (here and here), as we would get to know what happens behind the scenes.

It took a couple of hours to set up the K8S Cluster. Most of the time was spent on installing the Guest OS, cloning it, and fine-tuning to make sure K8S runs on a laptop. The actual installation and basic testing of the K8S Cluster took less than 10 minutes.

In the upcoming blog, I will walk through the detailed setup procedure. Keep looking !!!
     

September 14, 2018


Forrester Blogs

The Algorithm Of You

Algorithm. It’s a buzzword you hear frequently these days. But does the average consumer understand the impact algorithms have on her life? Absolutely not. Consumers enjoy the illusion of...

...

Revolution Analytics

How many deaths were caused by the hurricane in Puerto Rico?

President Trump is once again causing distress by downplaying the number of deaths caused by Hurricane Maria's devastation of Puerto Rico last year. Official estimates initially put the death toll at...

...

Revolution Analytics

Who wrote that anonymous NYT op-ed? Text similarity analyses with R

In US politics news, the New York Times took the unusual step this week of publishing an anonymous op-ed from a current member of the White House (assumed to be a cabinet member or senior staffer)....

...

Revolution Analytics

Because it's Friday: Hurricane Trackers

With Hurricane Florence battering the US and Typhoon Manghkut bearing down on the Philippines, it's a good time to take a look at the art of visualizing predicted hurricane paths. (By the way, did...

...
     

September 13, 2018

Cracking Hadoop

[Video] GraphQL Full Course

Many different programming languages support GraphQL. This list contains some of the more popular server-side frameworks, client libraries, services, and other useful stuff: https://graphql.org/code/

The post [Video] GraphQL Full Course appeared first on Cracking Hadoop.

     

September 12, 2018

Ronald van Loon

Digital Transformation: The Ultimate Guide to Becoming an Information Company

The data revolution is gaining pace at breakneck speed, and we are finally in the latter stages of its implementation. Many inroads have been made by important stakeholders, and numerous organizations are currently trying to incorporate a data culture within their workplace. Future predictions suggest that every company will eventually be part of the data brigade and will benefit from the use of data analysis tools.

How to Become a Data/Information Company

With growing hype across the market, many organizations want to know as much as they can about the intricacies involved in becoming a data/information company and what they need to do to become one. Although the questions are many, they are easy to answer.

Being a part of the data revolution is no different from what change has always required: similar motivation is needed, and organizations need to understand the importance of the process and the benefits it will bring them.

The biggest benefit motivating organizations to take the leap forward is the promise of enhanced customer experiences. Customer expectations are growing, but organizations currently seem at a loss as to what to do in the face of such challenges. They do have the willpower for change, but that willpower requires extensive investment and the promise of better operations to succeed. The use of data related to customer information has really helped organizations customize offerings and give their customers the experience they need. Business models that didn’t use data previously have now latched on to the data revolution and have started crafting their offerings in such a way that they feel personalized to each and every customer.

The data process starts with the extraction of data, but this is not where it stops. Most organizations believe that they have done their job by extracting data but, believe me, this is only a quarter of the job done. The main focus of the data revolution is on the value that is extracted from the data. The data wouldn’t amount to much if the organization wasn’t able to find any value in it. Only if value is being extracted from the data can the process be considered successful.

With information from the data in front of them, organizations are then expected to act upon it. These actionable insights propel organizations into action and push them to alter their products and services. Again, customer experience sits at the center of the process, as the ultimate aim is to satisfy customers and give them an unparalleled experience.

Maturity Phases to Becoming an Information Company

Becoming an information company, and eventually achieving maturity in monetizing data, is definitely not for the faint-hearted. The process requires a lot of primary and secondary skills, and needs a lot of human investment on top of the capital investment being provided.

Any organization that is looking for maturity in data monetization needs the following steps. We are focusing more on the internal facets of being a mature data company than on the external aspects.

1. Perform Business Intelligence: Look at all systems, external and internal, and perform a business intelligence trial by creating dashboards to see where most of your data is going. Change things around you based on what you are learning. The use of analytical tools and data management processes for gathering internal insights can be extremely useful in the long run. The dashboards you create will help you see who’s using your products or services and how you can serve them better.
2. Extend Products and Services with Data and Insights: The insights you get from the data can be used to extend products and services to your clients. Once you know where most of your services are going, and have a thorough plan for how to leverage this opportunity, you can use the insights to extend products and services to the right people.
3. Deliver Information Services: Finally, the most important step to becoming an information company is to deliver information services to clients. There have been numerous success stories of companies jumping on the information bandwagon by providing prescriptive and predictive analysis to their clients. You, too, can build a list of clients to whom you can provide information services going into the future.

Implementing Data Monetization Strategies

Once you have reached the maturity stage, you will have to move on to implementing data monetization strategies. The tips above may be useful here as well, but the following points will help you in the monetization process:

  • Establish a Vision: Corporate executives within a company should set out the vision for monetizing data and should allocate resources, including workforce time and investment, towards ensuring that the data is properly monetized.
  • Agile Multi-Disciplinary Teams: Data can be monetized through the use of multi-disciplinary teams made up of agile data architects, analytics specialists, product managers, marketing professionals and application developers.
  • Develop a Competitive Culture: Unless it is properly communicated and made functional across the workplace, data remains worthless. To extract the most out of data, you need to create a data-driven culture.
  • Convenient, Secure Access to Data: Data can only be monetized if it is clean, consistent and accessible, as well as voluminous.
  • Management and Advanced Analytics of Data: The five data management layers of engagement, development, integration, modern core IT and data are the key components of a digital business. Data is only valuable after it has been analyzed and when it is managed carefully.

Use Cases of Data/Information Implementation

Real-life examples of organizations joining the digital revolution and altering their offerings in the process serve as examples for other organizations to follow. The RELX Group has set a major precedent for any organization looking to become an information company. RELX, which initially started as a company that provided printed information to clients, moved on to digital information services. RELX realized the potential in the industry and added information to its selling process. Now RELX operates as a successful provider of predictive services for researchers, doctors and lawyers, running those services on its expertise in information and data analysis.

Otis Elevator is another example of an organization that joined the digital revolution and benefitted from it. Otis faces the challenge of maintaining more than 2 million elevators across organizations and cities. To keep pace and to raise its level of service, Otis underwent a major technological revamp.

Accordingly, Otis started by analysing data trends from the company’s more than 300,000 connected elevators, all of which generate data of their own. The data generated from these elevators, and from other newly connected elevators that leverage IoT for real-time data collection, can predict when an elevator will go out of order based on its health and what can be done to prevent the downtime: in other words, the need for predictive maintenance. The data that Otis generates can also help buildings control the flow of people across floors and learn what can be done to ensure that flow is maintained and optimized. By stepping into the field of data and information, Otis has not only broadened the horizon of what it was doing but has also opened up a whole new realm of better building management in the smart world.

Finally, it would be unfair to leave Netflix, the entertainment giant, off this list. Calling Netflix’s data strategy a transformation would be wrong, because data is something its culture has always been built on, but Netflix has never shied away from taking risks to adapt to the needs of clients and deliver a stellar performance. It has made huge strides through its personalized offering and exceptional understanding of data insights, achieved through a dedicated team and exceptional data analysis tools. Such is its reliance on data analytics now that its data algorithms help save over $1 billion every year in the form of customer retention.

With success stories to learn from, this is your chance to plan your data transformation and alter your offerings to please your customers and give them the customer experience they require. With due effort, you, too, can achieve success in this regard.

About the Authors

Andrea Monaci is the Marketing Director for IT Services at Ricoh Europe.

Connect with Andrea on LinkedIn and Twitter to learn more about B2B ecosystems.

If you would like to read more from Ronald van Loon on the possibilities of Big Data and the Internet of Things (IoT), please click “Follow” and connect on LinkedIn, Twitter and YouTube.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining; he is a guest author on leading Big Data sites, a speaker/chairman/panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a Data Consultancy company, a leader in Big Data & data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.

The post Digital Transformation: The Ultimate Guide to Becoming an Information Company appeared first on Ronald van Loons.


Revolution Analytics

If not Notebooks, then what? Look to Literate Programming

Author and research engineer Joel Grus kicked off an important conversation about Jupyter Notebooks in his recent presentation at JupyterCon: There's no video yet available of Joel's talk, but you...

...
Cracking Hadoop

[Video] Wrangling Data with Python’s Pandas

Pandas is an open-source Python library that provides high-performance tools for handling Big Data structures.
$ pip install pandas
It uses DataFrames as a representation of data schemas. Documentation: http://pandas.pydata.org/pandas-docs/stable/

The post [Video] Wrangling Data with Python’s Pandas appeared first on Cracking Hadoop.

Cracking Hadoop

[Video] Making your Mark as a Woman in Big Data

Topics covered: numbers about women in tech jobs, Women in Big Data, mentoring, why there is still a gender gap, fewer girls entering STEM fields, work-life balance, retention, networking, and a plan of action.

The post [Video] Making your Mark as a Woman in Big Data appeared first on Cracking Hadoop.

     

September 11, 2018


Revolution Analytics

Video: R and Python in Azure HDInsight

Azure HDInsight was recently updated with version 9.3 of ML Services in HDInsight, which provides integration with R and Python. In particular, it makes it possible to run R and Python within...

...
Cracking Hadoop

[VIDEO] Scalable Data Science with SparkR

R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset? In this talk we will walk through many examples of how several new features

The post [VIDEO] Scalable Data Science with SparkR appeared first on Cracking Hadoop.

Cloud Avenue Hadoop Tips

Where is Big Data heading?

During the initial days of Hadoop, MapReduce was the only supported framework; later, Hadoop was extended with YARN (a kind of Operating System for Big Data) to support Apache Spark and others. YARN also increased the resource utilization of the cluster. YARN was developed by HortonWorks and later contributed to the Apache Software Foundation. Other Big Data vendors like Cloudera and MapR slowly started adopting it and making improvements to it. YARN was an important turn in Big Data.

Along the same lines, another major change is happening in the Big Data space around containerization, orchestration and the separation of storage and compute. HortonWorks published a blog on this and calls it the Open Hybrid Architecture Initiative. There is also a nice article from ZDNet on the same.

The blog from HortonWorks is full of detail, but the crux, as mentioned in the blog, is as below:

Phase 1: Containerization of HDP and HDF workloads with DPS driving the new interaction model for orchestrating workloads by programmatic spin-up/down of workload-specific clusters (different versions of Hive, Spark, NiFi, etc.) for users and workflows.

Phase 2: Separation of storage and compute by adopting scalable file-system and object-store interfaces via the Apache Hadoop HDFS Ozone project.

Phase 3: Containerization for portability of big data services, leveraging technologies such as Kubernetes for containerized HDP and HDF. Red Hat and IBM partner with us on this journey to accelerate containerized big data workloads for hybrid. As part of this phase, we will certify HDP, HDF and DPS as Red Hat Certified Containers on Red Hat OpenShift, an industry-leading enterprise container and Kubernetes application platform. This allows customers to more easily adopt a hybrid architecture for big data applications and analytics, all with the common and trusted security, data governance and operations that enterprises require.


My Opinion

The different Cloud vendors have been offering Big Data as a service for quite some time. Athena, EMR, RedShift and Kinesis are a few of the services from AWS. There are similar offerings from Google Cloud, Microsoft Azure and other Cloud vendors. All these services are native to the Cloud (built for the Cloud) and provide tight integration with the other services from the Cloud vendor.

In the case of Cloudera, MapR and HortonWorks, the Big Data platforms were not designed with the Cloud in mind from the beginning, and the platforms were later plugged or force-fitted into the Cloud. The Open Hybrid Architecture Initiative is HortonWorks' effort to make their Big Data platform more and more Cloud native. The below image from the ZDNet article says it all.
It's a long road before the different phases are designed, developed and customers move to them. But the vision gives an idea of where Big Data is heading.

Two of the three phases involve Kubernetes and Containers. As mentioned in the previous few blogs, the way applications are built is changing a lot, and it's extremely important to get comfortable with the technologies around Containers.
     

September 10, 2018

Cloud Avenue Hadoop Tips

Abstraction in the AWS Cloud, in fact any Cloud

Sharing of responsibility and abstraction in the Cloud

One of the main advantages of the Cloud is the sharing of responsibilities between the Cloud vendor and the consumer of the Cloud. This way the consumer of the Cloud needs to worry less about routine tasks and can think more about the application business logic. Look here for more on the AWS Shared Responsibility Model.

EC2 (Virtual Server in the Cloud) was one of the oldest services introduced by AWS; with EC2 there is less responsibility on AWS and more on the consumer. As AWS became more mature and more services were introduced, responsibility slowly shifted from the consumers towards AWS. AWS has also been abstracting away more and more of the different aspects of the technology from the customer.

When we deploy an application on EC2, we need to think about:
  • Number of servers
  • Size of each server
  • Patching the server
  • Scaling the server up and down
  • Load balancing and more

On the other end of the spectrum, with Lambda, we simply create a function and upload it to AWS. The above concerns, and a lot more, are taken care of by AWS automatically. With Lambda we don't need to think about the number of EC2 instances, the size of each EC2 instance and a lot of other things. (A minimal sketch of the workflow follows.)
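
To make the contrast concrete, here is a minimal sketch of deploying a function with the AWS CLI (the function name, role ARN and file names are illustrative assumptions, not from the original post):

zip function.zip lambda_function.py   # package a single handler file
aws lambda create-function \
    --function-name hello-abstraction \
    --runtime python3.6 \
    --role arn:aws:iam::123456789012:role/lambda-basic-execution \
    --handler lambda_function.lambda_handler \
    --zip-file fileb://function.zip
# note: no server counts, sizes, patching or load balancers anywhere above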

While driving a regular car we don't need to worry about how the internal combustion engine works. A car provides us with an abstraction via the steering wheel, brakes, clutch etc. But it's better to know what happens under the hood, just in case the car stops in the middle of nowhere. The same is the case with the AWS services. The new autonomous cars provide an even higher level of abstraction: we just specify the destination and the rest is taken care of. Similar is the progress in the different AWS services, and in fact in any Cloud's services.

Recently I read an AWS article detailing the above abstractions and responsibilities here. It's a good read, introducing the different AWS services at a very high level.


Abstraction is good, but it comes at the cost of less flexibility. Abstraction hides a lot of underlying details. With Lambda we simply upload a function and don't really care which machine it runs on, nor do we have a choice of what type of hardware to run it on. So we won't be able to do Machine Learning inference using a GPU in a Lambda function, as that requires access to the underlying GPU hardware, which Lambda doesn't provide.

Conclusion

In the above diagram with the different AWS services, as we move from left to right the flexibility of the services decreases. This is the dimension I would like to add to the original discussion in the AWS article.

Bare Metal on the extreme left is very flexible but puts a lot of responsibility on the customer; at the other extreme, the Lambda function is less flexible but puts less responsibility on the customer. Depending on the requirements, budget and a lot of other factors, the appropriate AWS service can be picked.

We have Lambda, which is a type of FaaS, as the highest level of abstraction. I was wondering what the next abstraction on top of Lambda/FaaS would be. Any clue?
     

September 09, 2018

Cracking Hadoop

[Intro to Programming] Learn Python – Full Course for Beginners

⌨️ (0:00) Introduction
⌨️ (1:45) Installing Python & PyCharm
⌨️ (6:40) Setup & Hello World
⌨️ (10:23) Drawing a Shape
⌨️ (15:06) Variables & Data Types
⌨️ (27:03) Working With Strings
⌨️ (38:18) Working With Numbers
⌨️ (48:26) Getting Input From Users
⌨️ (52:37) Building a Basic Calculator
⌨️ (58:27) Mad Libs Game
⌨️ (1:03:10) Lists
⌨️ (1:10:44) List Functions
⌨️

The post [Intro to Programming] Learn Python – Full Course for Beginners appeared first on Cracking Hadoop.

     

September 07, 2018


Revolution Analytics

Because it's Friday: Stars in Motion

In the center of our galaxy, in the direction of the constellation Sagittarius, lies the massive black hole around which the entire galaxy revolves. The European Space Agency has observed that region...

...
Cracking Hadoop

[Video] Data Engineering using Azure Databricks and Apache Spark

Source: https://databricks.com/etl-2-0-data-engineering-using-azure-databricks-and-apache-spark

The post [Video] Data Engineering using Azure Databricks and Apache Spark appeared first on Cracking Hadoop.

Cloud Avenue Hadoop Tips

How to run Windows Containers? Not using Minikube !!!

How we ran Minikube?

In the previous blogs we looked at installing Minikube on Windows and also on Linux. In both cases, the software stack is the same except that the Linux OS is replaced with the Windows OS (highlighted above). The container ultimately runs on a Linux OS in both cases, so only Linux containers, and not Windows containers, can be run with Minikube.

How do we run a Windows Container then?

For this we have to install Docker for Windows (DFW); instructions for installing it are here. The prerequisite for DFW is support for Hyper-V, which is not available in Windows Home Edition; you need to upgrade to a Pro edition. In DFW, K8S can be enabled as mentioned here.
There are two types of Containers in the Windows world: Windows Containers, which run directly on Windows and share the host kernel with other Containers, and Hyper-V Containers, which have one Windows kernel per Container. Both types of Containers are highlighted in the above diagram and are detailed here.

The same Docker Windows image can be run as both a Windows Container and a Hyper-V Container, but the Hyper-V Container provides extra isolation. A Hyper-V Container is as good as a Hyper-V Virtual Machine, but uses a lightweight, tweaked Windows kernel. Microsoft documentation recommends using Windows Containers for stateless applications and Hyper-V Containers for stateful ones. (A command-line sketch of the choice follows.)
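
The choice between the two is made per container with Docker's isolation mode; a minimal sketch (the nanoserver image tag is an illustrative assumption):

# Windows Container (process isolation): shares the host kernel
docker run --isolation=process mcr.microsoft.com/windows/nanoserver:1809 cmd /c echo hello
# Hyper-V Container: the same image, but with its own lightweight kernel
docker run --isolation=hyperv mcr.microsoft.com/windows/nanoserver:1809 cmd /c echo hello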

As seen in the above diagram, the Windows Container runs directly on top of the Windows Pro OS and doesn't use any virtualization; yet Hyper-V is a prerequisite for installing Docker for Windows, and I'm not sure why. If I find out, I will update the blog accordingly.

Conclusion

In this blog, we looked at running Windows Containers at a very high level. Currently, I have Windows Home and Ubuntu as a dual-boot setup. Since I don't have Windows Pro with Hyper-V enabled, I am not able to install Docker for Windows. I will get Windows updated to Pro and will write a blog on installing and using Docker for Windows. Keep looking !!!

On a side note, I was thinking about setting up an entire K8S cluster on Windows, and it looks like for now it is not possible. The K8S documentation mentions that the K8S control plane (aka the master components) has to be installed on a Linux machine. But Windows-based worker nodes can join the K8S cluster. Maybe down the line, running an entire K8S cluster on Windows will be supported.

Note: Finally I was able to upgrade my Windows Home to Professional (here), enable Hyper-V (here) and install Docker for Windows (here).
     

September 06, 2018


Revolution Analytics

In case you missed it: August 2018 roundup

In case you missed them, here are some articles from August of particular interest to R users. A guide to installing R and RStudio with packages and multithreaded BLAS on various platforms. Some tips...

...


September 04, 2018


Revolution Analytics

Book review: SQL Server 2017 Machine Learning Services with R

If you want to do statistical analysis or machine learning with data in SQL Server, you can of course extract the data from SQL Server and then analyze it in R or Python. But a better way is to run R...

...


September 02, 2018


Ricky Ho

Structure Learning and Imitation Learning

In classical prediction use cases, the predicted output is either a number (for regression) or a category (for classification). A set of training data (x, y) where x is the input and y is the...

...
     

August 31, 2018


Revolution Analytics

Because it's Friday: The Curiosity Show

This one's all about nostalgia. When I was about 7 or 8 years old back in Australia, my favorite TV show by far was The Curiosity Show, a weekly series hosted by Rob (Morrison) and Deane (Hutton) —...

...

Revolution Analytics

Guide to a high-performance, powerful R installation

R is an amazing language with 25 years of development behind it, but you can make the most from R with additional components. An IDE makes developing in R more convenient; packages extend R's...

...

Revolution Analytics

Tips for analyzing Excel data in R

If you're familiar with analyzing data in Excel and want to learn how to work with the same data in R, Alyssa Columbus has put together a very useful guide: How To Use R With Excel. In addition to...

...
Cloud Avenue Hadoop Tips

'Learn Docker - Fundamentals of Docker 18.x' Book Review

What is Docker?

In the previous blog we looked at Docker/Containers at a high level and also compared VirtualBox with Docker. VirtualBox and others like Xen, KVM and Hyper-V provide hardware-level virtualization, while Docker/Containers provide OS-level virtualization, which is why Docker/Containers are lightweight.

Below is the virtualization using VirtualBox. As can be seen, multiple OS kernels, which provide the application-level isolation, run on the same hardware, making it heavy and also inefficient.


Here is the application-level isolation provided by containers. As can be seen, the OS kernel runs only once for all the applications. This makes it efficient.


How to install Docker?

Docker can be installed and run on Windows, Mac and Linux. As I have been using Ubuntu as my primary OS, I followed the instructions for Linux (a minimal sketch follows below). There is a much easier way to try Docker without any installation, all in the browser, by using Play with Docker; everything runs in the Cloud. It uses a concept called Docker in Docker (DinD). All we need to do is create an account with Docker and get started for free. Here we can try Docker on a single node or on a cluster of 5 nodes. Play with Docker is for learning, prototyping and demos, not for production purposes.
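
For a local Ubuntu install, Docker's convenience script is the shortest path; a minimal sketch (the official instructions the post links to remain the authoritative source):

curl -fsSL https://get.docker.com | sh    # installs the Docker engine
sudo usermod -aG docker $USER             # optional: run docker without sudo (re-login needed)
docker run hello-world                    # verifies the installation end to end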

Review of the Learn Docker Book

To explore Docker further, I recently completed reading Learn Docker - Fundamentals of Docker 18.x, and I review the book here.

  • The book starts with a light note on containers and the ecosystem, and then deep dives into Docker. The good thing about the book is that it slowly increases the complexity towards the end, which makes it easy for those who don't know what Docker is all about.
  • The book has equal emphasis on theory and practice. As soon as a concept is discussed, the complete example code, and how to execute it where appropriate, is given. Once Docker has been installed, the examples can be tried out. In most cases the code can be executed as-is.
  • The book doesn't end at Docker, but also explains container orchestration. There are in fact a few chapters on Docker's inbuilt orchestration layer, Swarm, and also on the latest Kubernetes, again with examples. There are also a few chapters on Docker and orchestration in the Cloud.
  • It's not just about development against Docker, but also about making it production-ready. There are sections on Security, Load Balancing, Blue-Green/Canary deployments and Secrets, to name a few.
  • At the end of each chapter, there is a section for further reading. Also included is a small quiz with answers.

Conclusion

I would definitely recommend Learn Docker - Fundamentals of Docker 18.x to anyone trying to get started with Docker. As mentioned, Docker can be installed on Windows, Mac and Linux. If we don't want to install Docker, it can be tried for free at PWD (https://labs.play-with-docker.com/).
     

August 29, 2018

Ronald van Loon

Six Ways to Make Smarter Decisions

The advent of data and analytics has opened the doors to smarter, more progressive decisions. Basic business decisions related to profitability and other facets of the business process can significantly benefit from the use of data and analytics.

The presence of data and analysis tools has changed the way decisions are made. Not only have they provided greater room for reasoning, but they’ve also ensured a more authentic process for decision-making. Now that we have entered this stage of heightened reasoning, we have finally realized how much we can improve decision-making through the use of data and analysis.

Here, we look at six ways to drive smarter and more authentic decisions.

1. Combine Solutions to Drive Innovations

Most executives, analytics leaders, and managers can combine solutions to drive innovation forward. The use of data and analysis technologies to handle processes and strategies is the order of the day. By working with AI tools and other analysis algorithms, you can catapult your organization onto the digital bandwagon. The best way forward is to recognize the need to combine solutions in order to derive and drive innovation.

2. Plan and Report: Enabling Agile and Continuous Planning

Most finance professionals, executives, business users, and analytics leaders can benefit from better budgeting, forecasting, financial closing and reporting processes. The use of flexible analysis and automated visualizations helps you uncover new insights and work with them, improving the management of financial and operational matters. This leads to better financial planning for future years, with increased consideration of budgetary needs and requirements.

3. Explore and Visualize: Build Reports, Dashboards, and Visualizations

Most business analysts, IT administrators and business users can benefit from the ability to explore and visualize dashboards, visualizations and reports. Today’s users demand that their BI solution provide enhanced dashboarding and reporting, while maintaining the security and scalability that is essential in a self-service world.

4. Predict and Optimize: Examine, Model and Implement

Most data science managers, data analysts, business analysts and business users can predict and optimize efficiently by learning to develop and deliver visible contributions to the business. These contributions are to be developed with a portfolio of prescriptive, predictive and Machine Learning tools for both coders and non-coders. Once the contributions have been developed, the organization can put the products into deployment faster. This cuts any downtime involved in the process and leads to better predictions.

5. Manage: Managing your Data

Business users and IT professionals can manage data proficiently by adopting a hybrid data management approach for the future, complemented by analytical and operational workloads, in order to optimize large, diverse data volumes and to uncover actionable, data-driven insights. All of this is to be done while remaining compatible with present systems. The management of data is important for the overall decision-making process, as it brings forth better insights.

6. Trust: Form a Trusted Analytics Foundation

Most chief data analytics officers, data architects, data engineers and chief marketing officers can implement methods to help organizations know, trust, and use data built on an analytics foundation with unified governance and integration. This helps information stakeholders find both unstructured and structured high-quality data from any multi-cloud environment.

Stay up to date with the best practices, industry stories, and innovative learning in order to achieve proficiency in decision making. Register for IBM Analytics University 2018 to learn from experts including Anna Rosling Rönnlund, Co-founder and Vice President at Gapminder.

Ronald is an IBM Analytics partner, but all opinions expressed are his own.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining; he is a guest author on leading Big Data sites, a speaker/chairman/panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a Data Consultancy company, a leader in Big Data & data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.

The post Six Ways to Make Smarter Decisions appeared first on Ronald van Loons.

     

August 28, 2018


Revolution Analytics

Videos from NYC R Conference

The videos from the NYC R conference have been published, and there are so many great talks there to explore. I highly recommend checking them out: you'll find a wealth of interesting R applications,...

...
Algolytics

Algorithm maximizing revenue from marketing campaign – application of GDBase

Creating a marketing campaign to gain leads may involve several problems which are tricky and tedious to solve. We therefore want to build an algorithm that maximizes the expected revenue while taking into account the limit on the number of offers sent to each lead in one message.

AdvancedMiner has a built-in GDBase database that supports triggers and queries of any complexity. It lets you freely write Python-based scripts and incorporate most SQL queries into them. As a result, it is possible to perform operations with many limitations resulting from information contained in databases.

The Python code (called Gython in AdvancedMiner, due to modifications made to simplify analytical operations) can be entered directly into the script, or during SQL use after TRANSFORM and GLOBALS.

GDBase, unlike other databases, allows you to create a simple algorithm that maximizes revenue even with many limitations.

Inserting a table from another database into GDBase

To insert a table from another database into GDBase, you need to enter a query in the script:

sql: IMPORT COMPRESSED Query: SELECT * FROM another_database.name_of_the_table AS s1 AS name_of_the_table_in_GDBase USING ODBC('DSN=RP')

An example of a table used in the marketing campaign algorithm:

(Figure: GDBase – algorithm maximizing revenue from a marketing campaign – 1)

Download the data used in this post here.

The client_id column contains the client ID, which can be, for example, an e-mail address or a telephone number. Each customer is assigned identifiers for all campaigns (campaign_id), the probability of obtaining a lead from the customer for the specific campaign (lead_probability), the revenue that the campaign may generate (income), the expected revenue taking the probability into account (expected_income) and a daily lead limit for each campaign (lead_limit).

The following algorithm is based on a sample table called “example”. The steps below contain comments to help you understand the algorithm.

First, you must initialize the constraint variables.

max_number_of_offers = 3 #maximum number of offers per one message
cups = { } #a bucket containing expected number of responses per lead
limits = { } #daily lead limits for each campaign

We add the list of offers with limits and fill in the cups and limits values.

sql a: SELECT DISTINCT campaign_id, lead_limit FROM example
print a
for r in a:
    cups[r[0]] = 0
    limits[r[0]] = r[1]

We sort the table on the basis of the maximum scoring and the scoring for each campaign.

sql: REPLACE TABLE example_recommendations AS
SELECT * FROM (
    SELECT client_id, max(expected_income) AS max_score
    FROM example GROUP BY 1 ORDER BY max_score DESC
) AS s1
LEFT JOIN (SELECT * FROM example ORDER BY expected_income DESC) AS s2 USING (client_id)
TRANSFORM: #Gython use
    if current_client_id <> client_id:
        number_of_offers = 0
    for offer, contents in cups.items(): #we place an offer in the bucket, which will be sent
        if (offer == campaign_id) and (number_of_offers < max_number_of_offers):
            #however, we must check whether there is still room in the bucket
            if contents < limits[offer]:
                chosen_offer = offer
                cups[offer] = cups[offer] + lead_probability #the bucket filling criterion is the sum of the probability of obtaining a lead from a given campaign for all customers
                number_of_offers = number_of_offers + 1
                __save__()
    current_client_id = client_id
    __skipRow__ = 1
GLOBALS: #the initialisation of global variables
    current_client_id = ''
    number_of_offers = 0
    max_number_of_offers = $max_number_of_offers
    cups = $cups
    limits = $limits
KEEP chosen_offer, client_id, lead_probability, expected_income
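
Read as pseudocode, this is a greedy assignment: clients are scanned in descending order of their best expected income, at most max_number_of_offers campaigns are chosen per client, and each campaign's "cup" accepts a client only while the accumulated lead probability is still below that campaign's daily limit.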

As a result, we obtain an output table with recommendations of the offers that will generate the highest revenue. For each client, there are 3 selected offers, and the limits for each campaign are maintained.

(Figure: GDBase – algorithm maximizing revenue from a marketing campaign – 2)

For comparison, without using the algorithm when selecting a campaign, the response to offers is ten times lower, which results in much lower revenue from each of these campaigns. This solution makes it possible to deal with all the limitations, and the possibility of introducing small modifications makes the algorithm extremely versatile.

The post Algorithm maximizing revenue from marketing campaign – application of GDBase appeared first on Algolytics.