 


Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

June 28, 2016


Revolution Analytics

Livestreaming of useR! 2016 conference begins tomorrow, June 28

The useR! 2016 conference, the annual gathering of R users from around the world, is already underway at Stanford University. Today is a day of interactive tutorials, and the presentation program...

...
Principa

Data Analytics now fuels Customer Loyalty in Banking


As the banking industry pursues improved customer engagement, unlocking the value of data becomes critical in designing a successful loyalty programme.

The balance of power in banking has changed. What customers expect, how they want to be serviced, what information they are prepared to share, and how loyal they are prepared to be have all changed radically. According to the industry analyst firm Forrester Research, we are in the age of the customer, in which the only sustainable competitive advantage is knowledge of and engagement with customers.

Teradata ANZ

Which Open Source technologies are suitable for your Big Data roadmap?

Recently, I developed a handful of demos using open source technologies for detecting and alerting on fraudulent events, incidents of poor customer experience, and the arrival of target subjects in geo-fenced locations for marketing purposes. The use cases required detecting individual events from streaming data sources and processing a complex set of rules to identify events of interest and create alerts that enable data-driven insights and actions.

I selected Apache Hadoop ecosystem technologies, namely Kafka, Storm, HDFS and HBase, as they were the best fit for these use cases and had already been deployed at large scale by reputable multinational organisations. In addition, I found a vast array of pre-integrated libraries, source-code examples and “lessons learned” freely available on the Internet, which was instrumental in improving my productivity.


When the rubber hit the road…

As I showed the demos to my colleagues and customers, my decision to use these Big Data technology tools was put to the test. Common questions raised during the demos were: “Why not use Apache Spark Streaming instead of Apache Storm?”; “What would you recommend as the open source technologies to bet on in our 3-year Big Data roadmap?”; “Why did you not use Apache Flink, which integrates complex event processing, stream processing and machine learning?”

When all else fails – ask the experts

As Open Source technologies dominate the C-level agenda, questions such as these are not limited to streaming and CEP technologies; they are becoming a common prerequisite for developing Big Data architecture in the enterprise. Unfortunately, for every such question there are myriad opinions and no clear answer, and with new Open Source projects appearing every week the task only gets harder! So, I took advice from the experts on Big Data and Open Source, Think Big Analytics.

Below are my key takeaways from their advice.

There is no free lunch – the ‘free puppy’ still needs to be fed though!

Organisations fall into the trap of thinking that selecting and implementing open source is trivial or ‘free’. But open source is as free as a ‘free puppy’ – it comes with all kinds of hidden costs that keep popping up. Every technology comes at a cost that includes acquiring the skillsets to use it, develop for it, and maintain and operate it. Open source is not any different.

Tea leaves and the taste bud

Experimentation to compare open source technologies can open up opportunities, but no organisation has nearly enough time or resources to try everything. While you may be able to install a large number of tools with a few mouse clicks, determining their strengths and limitations can take weeks, even months. Have you acquired a taste for that tea yet?

After all you only need a wrench or two to fix the sink

There are more than a dozen query engines for SQL on Hadoop. You do not need that many and selecting the right one is crucial. The ideal approach is to adopt a relatively small number of technologies, be it for SQL engine or other big data technologies, and optimise how the organisation uses them to gain a return on investment. The shiny new wrench may look appealing but the guy who knew how to use it just left!

Stay the night or build your own blueprint for long-term living

Renting a room may be a good idea for a few nights of temporary stay here and there. But the idea of living in your dream home is quite different. It generally starts with a blueprint and a well-defined plan and path to get there. The same is true for building a big data architecture for your enterprise that differentiates your organisation from competitors. After all, you want your dream home to be different to Mr & Mrs Jones next door.

The problem with adopting an open source tool in response to each immediate need is that it is easy to end up with a multitude of tools that do not work well together. Instead of a “tool mentality,” organisations should take the approach of building a blueprint for big data. Your business objectives and requirements are critical to selecting technologies that meet your needs.

Anyway, these are just a few lessons I picked up as I ventured into the world of Open Source. If you are interested in gaining a detailed understanding of the approach to selecting the right tools and technologies, please download this paper.

The post Which Open Source technologies are suitable for your Big Data roadmap? appeared first on International Blog.

 

June 27, 2016


Revolution Analytics

Apache Spark integrated with Microsoft R Server for Hadoop

by Bill Jacobs, Microsoft Advanced Analytics Product Marketing. They say that time is infinite. Seems to me data is fast becoming the same. Or perhaps it's becoming true that our thirst for speed is...

...
 

June 26, 2016


Simplified Analytics

5 mistakes to avoid in Digital Transformation

Digital Transformation is happening everywhere you look. It is impacting businesses of any size, in any industry, any market and every geography. Many organizations recognize the importance of...

...
 

June 24, 2016


Revolution Analytics

Because it's Friday: A robot writes a movie

What happens when you take the scripts from dozens of sci-fi movies and TV series, and feed them (along with a couple of seed prompts) into a long short-term memory recurrent neural network? You get...

...

Revolution Analytics

Amazon X-ray data provides insight into movie characters

I'm a regular user of Amazon Video: as someone who spends a fair bit of time on planes, it's great to be able to download some of my favourite shows (hello, Orphan Black and Vikings) and catch up on...

...
Ronald van Loon

How You Can Improve Customer Experience With Fast Data Analytics


In today’s constantly connected world, customers expect more than ever before from the companies they do business with. With the emergence of big data, businesses have been able to better meet and exceed customer expectations thanks to analytics and data science. However, the role of data in your business’ success doesn’t end with big data – now you can take your data mining and analytics to the next level to improve customer service and your business’ overall customer experience faster than you ever thought possible.

Fast data is basically the next step for analysis and application of large data sets (big data). With fast data, big data analytics can be applied to smaller data sets in real time to solve a number of problems for businesses across multiple industries. The goal of fast data analytics services is to mine raw data in real time and provide actionable information that businesses can use to improve their customer experience.

“Fast data analytics allows you to turn raw data into actionable insights instantly”

Connect with Albert Mavashev
Co-author, CTO & Evangelist at jKool

Analyze Streaming Data with Ease

The Internet of Things (IoT) is growing at an incredible rate. People are using their phones and tablets to connect to their home thermostats, security systems, fitness trackers, and numerous other things to make their lives easier and more streamlined. Thanks to all of these connected devices, there is more raw data available to organizations about their customers, products, and their overall performance than ever before; and that data is constantly streaming.

With big data, you could expect to take advantage of at least some of that machine data, but there was still an expected lag in analysis and visualization to give you useable information from the raw data. Basically, fast data analytics allows you to turn raw data into actionable insights instantly.

With fast data analytics services, businesses in the finance, energy, retail, government, technology, and managed services sectors may create a more streamlined process for marketing strategies, customer service implementation, and much more. If your business has an application or sells a product that connects to mobile devices through an application, you can see almost immediate improvements in how your customers see you and interact with your business, all thanks to fast data analytics.

Consider a few real-world examples of how fast data analytics have helped companies across business sectors improve their performance.

A Financial Firm Monitors Flow of Business Transactions in Real-Time

The world of finance has always been fast-paced, and today a financial firm can have many millions of transactions each day. There’s no way to spare the time or effort to constantly search for breaks and/or delays in these transactions at every hour of the business day. However, with fast data analytics, one firm found that it could consistently monitor the flow of business throughout the day, including monitoring of specific flow segments as well as complete transactions.

With the right fast data analytics service, the firm was able to come to a proactive solution in which they could monitor the production environment using their monitoring software’s automated algorithms to keep a constant eye on transaction times. The software’s algorithms determined whether transaction flows were within acceptable parameters or if something abnormal had occurred, giving the firm the ability to respond immediately to any problems or abnormalities to improve their customer experience and satisfaction.

[Figure: Monitored transaction flows using the jKool fast data analytics solution]

A Large Insurance Firm Ensures Faster Claim Processing

In another case, a large health insurance provider with over three million clients was in the process of a massive expansion. As the firm expanded, though, they noticed a disturbing trend. Over the span of a single month, the average processing time for claim payments had increased by a dramatic 10%, but only for a single type of transaction. While they had the tools necessary to analyze the operating system problems, the servers’ hardware, application servers, and other areas where the problem could be originating, they were dealing with monitoring tools that were half-a-decade old.

Thanks to these outdated monitoring tools, the insurance provider had a very expensive problem on their hands, as finding the solution was taking up over 90% of their tier-three personnel’s time and energy. Not only that, but customers were actually finding the majority of their application problems before the provider’s IT support could detect them.

To immediately diagnose the problem and get ahead of it, the insurance firm deployed a fast data monitoring service that quickly identified what was causing the delays in claims-processing transactions. The problem was found promptly: one claim type was sitting in a queue long enough that it would time out, and the addition of new branch locations was over-saturating their architecture. By reconfiguring their middleware, they were able to accommodate the increased load and solve the problem without taking up valuable employee time and resources.

Just a few of the benefits of deploying this service were:

  • A 40% decrease in the mean time to repair for the software problem.
  • A 60% decrease in the time spent by third-tier personnel to solve the problem.
  • A 35% decrease in the number of open tickets at the help desk.
  • More than a 30% improvement in the average processing time for claims.

A Securities Firm Ensures Dodd-Frank Compliance

Enacted in 2010, the Dodd-Frank Wall Street Reform and Consumer Protection Act is a US federal law that was enacted to regulate the financial industry and prevent serious financial crises from occurring in the future. Securities firms and other financial institutions must ensure that they are Dodd-Frank compliant in order to stay in business and avoid the risk of serious litigation. One such securities firm implemented fast data monitoring and analysis for Dodd-Frank compliance for all of their SWAP trades.

To be Dodd-Frank compliant, firms must report all SWAP trades “as soon as technologically possible”. Within a few minutes of the execution of a trade, a real-time message and a confirmation message must be reported, as well as primary economic terms (PET). If a message is rejected for any reason, it must be resubmitted and received within minutes of execution.

Without monitoring of their reporting systems, the securities firm found that they were in danger of being found non-compliant should anything go wrong within their internal processes. A fast data analytics solution gave them the real-time monitoring they needed to stay compliant.

How Can Fast Data Analytics Help Your Business?

As you can see from these examples, fast data analytics makes it possible for businesses to quickly turn raw machine data into actionable insights by tracking transactions, identifying issues with hardware and software, and reducing customer complaints. With the ability to identify and solve these issues faster and more efficiently, fast data analytics services can significantly improve any business’ customer experience.

These processes can all be monitored in real time, giving you access to useful analytics and insights for time-sensitive activities. Fast data analytics can help you stay compliant with government and/or industry regulations, avoid preventable losses, and improve your personnel’s efficiency by pinpointing errors and problems without taking up a lot of employees’ time and energy.

How do you want to use fast data analytics to improve your customer experience? Let us know your experiences!

Connect with the authors

Connect with co-author Albert Mavashev to learn more about the world of fast data and all that it can do for you.
Co-author, CTO & Evangelist at jKool

 

 

 

Connect with author Ronald van Loon to learn more about the possibilities of Big Data.
Co-author, Director at Adversitement


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers for predictive analytics by DataConomy, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, speaks on and chairs national and international webinars and events, and runs a successful series of webinars on Big Data and Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now a director at Adversitement, a leader in Big Data and data process management solutions. His interests span big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post How You Can Improve Customer Experience With Fast Data Analytics appeared first on Ronald van Loons.

 

June 23, 2016

Silicon Valley Data Science

Building Pipelines to Understand User Behavior

In early March, I spoke at the Hadoop with the Best online conference. I had fun sharing one of my total passions: data pipelines! In particular, I talked about some techniques for catching raw user events, acting on those events, and understanding user activity from the sessionization of such events. Here, I’ll give just a taste of what was covered. For more detail, please check out the video above (courtesy of With the Best).1

First, some background and motivation on my end: Silicon Valley Data Science (SVDS) is a boutique data science consulting firm. We help folks with their hardest data strategy, data science, and/or data engineering problems. In this role, we’re in a unique position to solve different kinds of problems across various industries, and start to recognize the patterns of solution that emerge. I’m very interested in cultivating and sharing the data engineering best practices that we’ve learned.

Key takeaways from this post include knowing what’s needed to understand user activity, and seeing some pipeline architectures that support this analysis. To achieve these goals, let’s walk through three data pipeline patterns that form the backbone of a business’ ability to understand user activity and behavior:

  • Ingesting events
  • Taking action on those events
  • Recognizing activity based on those events

Ingesting Events

The primary goal of an ingestion pipeline is simply to ingest events; all other considerations are secondary for now. We will walk through an example pipeline and discuss how that architecture changes as we scale up to handle billions of events per day. We’ll note along the way how general concepts of immutability and lazy evaluation can have large ramifications on data ingestion pipeline architecture.

Let’s start by covering typical classes of and types of events, some common event fields, and various ways that events are represented. These vary greatly across current and legacy systems, and you should always expect that munging will be involved as you’re working to ingest events from various data sources over time.

For our sessionization examples, we’re interested in user events such as
`login`, `checkout`, `add friend`, etc.

These user events can be “flat”:

```
    {
      "time_utc": "1457741907.959400112",
      "user_id": "688b60d1-c361-445b-b2f6-27f2eecfc217",
      "event_type": "button_pressed",
      "button_type": "one-click purchase",
      "item_sku": "1 23456 78999 9",
      "item_description": "Tony's Run-flat Tire",
      "item_unit_price": ...
      ...
    }
```

Or have some structure:

```
    {
      "time_utc": "1457741907.959400112",
      "user_id": "688b60d1-c361-445b-b2f6-27f2eecfc217",
      "event_type": "button_pressed",
      "event_details": {
        "button_type": "one-click purchase",
        "puchased_items": [
          {
            "sku": "1 23456 78999 9",
            "description": "Tony's Run-flat Tire",
            "unit_price": ...
            ...
          },
        ],
      },
      ...
    }
```

Both formats are often found within the same systems out in the wild, so you have to intelligently detect or classify events rather than just making blanket assumptions about them.
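The post doesn’t show detection code, so here is a minimal sketch of my own (assuming Jackson for JSON parsing; the class and method names are invented) that flags whether an incoming event is flat or carries nested structure:

```
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.Iterator;

public class EventShape {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns true if no top-level field is itself an object or array,
    // i.e. the event is "flat" in the sense used above.
    public static boolean isFlat(String json) throws IOException {
        JsonNode event = MAPPER.readTree(json);
        Iterator<JsonNode> values = event.elements();
        while (values.hasNext()) {
            if (values.next().isContainerNode()) {
                return false; // nested object or array found: structured event
            }
        }
        return true;
    }
}
```

A real pipeline would typically grow a check like this into proper event-classification logic rather than a single boolean.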

Stages of an ingestion pipeline

Before we dive into what ingestion pipelines usually look like, there are some things to keep in mind. You want to be sure to build a pipeline that is immutable, lazily evaluated, and made up of simple/composable (testable) components. We’ll see below how these abstract (CS-101) concepts really matter when it comes to pipeline implementation, maintenance, and scale.

We’ll start by focusing on just ingesting or landing events. Start with a simple pipeline such as:

[Figure: Event ingestion without streaming]

At first glance, it seems like that’s all you would need for most problems. Especially since query-side tools are so fast and effective these days. Ingestion should be straightforward. The ingest pipeline simply needs to get the events as raw as possible as far back as possible in a format that’s amenable to fast queries.

Let’s state that again—it’s important.

The pipeline’s core job is to get events that are as raw as possible (immutable processing pipeline) as far back into the system as possible (lazily evaluated analysis) before any expensive computation is done. Modern query-side tools support these paradigms quite well. Better performance is obtained when events land in query-optimized formats and are grouped into query-optimized files and partitions where possible:

[Figure: Event ingestion without streaming]

That’s simple enough and seems pretty straightforward in theory. In practice, you can ingest events straight into HDFS only up to a certain scale and degree of event complexity.
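As a simplified illustration of what “query-optimized formats and partitions” can mean in practice (my own sketch, assuming Spark 1.6; the HDFS paths and the derived event_date column are invented), raw JSON events can be landed as date-partitioned Parquet:

```
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

import static org.apache.spark.sql.functions.*;

public class LandEvents {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("land-raw-events");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read raw, newline-delimited JSON events from a staging directory.
        DataFrame events = sqlContext.read().json("hdfs:///staging/events/");

        // Derive a partition column from the epoch-seconds timestamp; leave raw fields untouched.
        DataFrame withDate = events.withColumn(
                "event_date",
                to_date(from_unixtime(col("time_utc").cast("double").cast("long"))));

        // Land as Parquet, partitioned by date, so query engines can prune partitions.
        withDate.write()
                .mode(SaveMode.Append)
                .partitionBy("event_date")
                .parquet("hdfs:///data/events/");

        sc.stop();
    }
}
```

Partition pruning on the date column is what keeps the downstream queries fast.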

As scale increases, an ingestion pipeline has to effectively become a dynamic impedance matching network. It’s the funnel that’s catching events from what can be a highly distributed, large number of data sources and trying to slam all these events into a relatively small number of filesystem datanodes.

How do we catch events from a large number of data sources and efficiently land them into HDFS?

[Figure: Events without streaming]

Use Spark!

No, but seriously, add a streaming solution in-between (I do like Spark Streaming here):

[Figure: Streaming bare]

And use Kafka to decouple all the bits:

 

[Figure: Streaming events at scale]

Kafka decouples the data sources on the left from the data nodes on the right. And they can scale independently. Also, they scale independently of any stream computation infrastructure you might need for in-stream decisions in the future.
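The post doesn’t include code for this stage, so here is a minimal sketch of the Kafka-to-Spark-Streaming-to-HDFS pattern (my own illustration using the Spark 1.6 direct Kafka API; the broker list, topic name, and output path are assumptions, and offset management is omitted):

```
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class RawEventIngest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("raw-event-ingest");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "kafka1:9092,kafka2:9092"); // assumed brokers
        Set<String> topics = Collections.singleton("raw_events");           // assumed topic

        // Direct (receiver-less) stream: one RDD partition per Kafka partition.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // Keep events as raw as possible: drop the keys and land each micro-batch in HDFS.
        stream.map(record -> record._2())
              .foreachRDD(rdd -> {
                  if (!rdd.isEmpty()) {
                      rdd.saveAsTextFile("hdfs:///staging/events/" + System.currentTimeMillis());
                  }
              });

        jssc.start();
        jssc.awaitTermination();
    }
}
```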

Impedance or size mismatches between data sources and data storage are really only one half of the story. Note that another culprit, event complexity, can limit ingest throughput for a number of reasons. A common example of where this happens is when event “types” are either poorly defined or are changing so much they’re hard to identify. As event complexity increases, so does the logic you use to group or partition the events so they’re fast to query. In practice, this quickly grows from simple logic to full-blown event classification algorithms. Often, those classification algorithms have to learn from the body of events that’ve already landed. You’re making decisions on events in front of you based on all the events you’ve ever seen. More on that in the “Recognize Activity” section later.

Ingest pipelines can get complicated as you try to scale in size and complexity; expect it and plan for it. The best way to do this is to build or use a toolchain that lets you add a streaming and queueing solution without a lot of rearchitecture or downtime. Folks often don’t try to solve this problem until it’s already painful in production, but there are solutions available. My current favorite uses a hybrid combination of Terraform, Ansible, Consul, and Cloudera Manager/Ambari.

Note also that we haven’t talked about any real-time processing or low-latency business requirements here at all. The need for a stream processing solution arises when we’re simply trying to catch events at scale.

Taking Action

Catching events within the system is an interesting challenge all by itself. However, just efficiently and faithfully capturing events isn’t the end of the story. Consider the middle-of-the-road version of the ingestion pipeline we discussed above:

[Figure: Streaming bare]

That’s sorta boring if we’re not taking action on events as we catch them. Actions such as Notifications, Decorations, Routing/Gating, etc. can be taken in either “batch” or “real-time” modes (see figure below).

[Figure: Streaming simple]

Unfortunately, folks have all sorts of meanings of the terms “batch” and “real-time.” Let’s clear that up and be a little more precise.

For every action you intend to take, and really every data product of your pipeline, you need to determine the latency requirements. What is the timeliness of that resulting action? Meaning, how soon after either a.) an event was generated, or b.) an event was seen within the system will that resulting action be valid? The answers might surprise you.

Latency requirements let you make a first-pass attempt at specifying the execution context of each action. There are two separate execution contexts we talk about here:

  • Batch: Asynchronous jobs that are potentially run against the entire body of
    events and event histories. These can be highly complex, computationally
    expensive tasks that might involve a large amount of data from various
    sources. The implementations of these jobs can involve Spark or Hadoop
    map-reduce code, Cascading-style frameworks, or even SQL-based analysis via
    Impala, Hive, or SparkSQL.
  • Stream: Jobs that are run against either an individual event or a small
    window of events. These are typically simple, low-computation jobs that
    don’t require context or information from other events. These are typically
    implemented using Spark-streaming or Storm.

When I say “real-time” here, I mean that the action will be taken from within the stream execution context.

It’s important to realize that not all actions require “real-time” latency. There are plenty of actions that are perfectly valid even if they’re operating on “stale” day-old, hour-old, 15min-old data. Of course, this sensitivity to latency varies greatly by action, domain, and industry. Also, how stale stream versus batch events are depends upon the actual performance characteristics of your ingestion pipeline under load. Measure all the things!

An approach I particularly like is to initially act from a batch context. There’s generally less development effort, more computational resources, more robustness, more flexibility, and more forgiveness involved when you’re working in a batch execution context. You’re less likely to interrupt or congest your ingestion pipeline.

Once you have basic actions working from the batch layer, then do some profiling and identify which of the actions you’re working with really require less stale data. Selectively bring those actions or analyses forward. Tools such as Spark can help tremendously with this. It’s not all fully baked yet, but there are ways to write Spark code where the same business logic can be optionally bound in either stream or batch execution contexts. You can move code around based on pipeline requirements and performance.
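One way to realize that idea (a sketch of my own with invented names, not the author's code) is to keep the business logic as a plain function and bind it to either execution context only at the edges:

```
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public class Scoring {

    // Pure business logic: no Spark types, so it is trivially testable and reusable.
    public static String enrich(String rawEvent) {
        // ... parse, validate, decorate ...
        return rawEvent.trim();
    }

    // Batch binding: run over the full history of landed events.
    public static JavaRDD<String> enrichBatch(JavaRDD<String> events) {
        return events.map(Scoring::enrich);
    }

    // Stream binding: run over each micro-batch as events arrive.
    public static JavaDStream<String> enrichStream(JavaDStream<String> events) {
        return events.map(Scoring::enrich);
    }
}
```

Moving an action from the batch binding to the stream binding is then a question of where you bind it, not a rewrite of the logic.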

In practice, a good deal of architecting such a pipeline is all about preserving or protecting your stream ingestion and decision-making capabilities for when you really need them.

A real system often additionally protects and decouples your stream processing from making any potentially blocking service API calls (sending emails, for example) by adding Kafka queues for things like outbound notifications downstream of ingestion:

[Figure: Streaming with notify queues]

As well as isolating your streaming system from writes to HDFS using the same trick we used for event ingestion in the previous section:

[Figure: Streaming two layers]

Recognizing Activity

What’s user activity? Usually it’s a sequence of one or more events associated with a user. From an infrastructure standpoint, the key distinction is that activity usually needs to be constructed from a sequence of user events that don’t all fit within a single window of stream processing. This can either be because there are too many of them or because they’re spread out over too long a period of time.

Another way to think of this is that event context matters. In order to recognize activity as such, you often need to capture or create user context (let’s call it “state”) in such a way that it’s easily read by (and possibly updated from) processing in-stream.

We add HBase to our standard stack, and use it to store state:

[Figure: Classifying with state]

Which is then accessible from either stream or batch processing. HBase is attractive as a fast key-value store. Several other key-value stores could work here. I’ll often start using one simply because it’s easier to deploy/manage at first, and then refine the choice of tool once more precise performance requirements of the state store have emerged from use.

It’s important to note that you want fast key-based reads and writes. Full-table scans of columns are pretty much verboten in this setup; they’re simply too slow to deliver value from the stream.

The usual approach is to update state in batch. My favorite example when first talking to folks about this approach is to consider a user’s credit score. Events coming into the system are routed in stream based on the associated user’s credit score.

The stream system can simply (hopefully quickly) look that up in HBase keyed on a user ID of some sort:

[Figure: HBase state credit score]

The credit score is some number calculated by scanning across all a user’s events over the years. It’s a big, long-running, expensive computation. Do that continuously in batch and just update HBase as you go. If you do that, then you make that information available for decisions in stream.
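A rough sketch of that split, using the HBase client API (my own illustration; the table, column family, and qualifier names are invented):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserStateStore {

    private final Table table;

    public UserStateStore() throws java.io.IOException {
        Configuration conf = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(conf);
        this.table = connection.getTable(TableName.valueOf("user_state"));
    }

    // Stream side: a fast point read keyed on the user ID -- no scans.
    public long creditScore(String userId) throws java.io.IOException {
        Result row = table.get(new Get(Bytes.toBytes(userId)));
        byte[] value = row.getValue(Bytes.toBytes("s"), Bytes.toBytes("credit_score"));
        return value == null ? -1L : Bytes.toLong(value);
    }

    // Batch side: the long-running job recomputes the score and writes it back.
    public void updateCreditScore(String userId, long score) throws java.io.IOException {
        Put put = new Put(Bytes.toBytes(userId));
        put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("credit_score"), Bytes.toBytes(score));
        table.put(put);
    }
}
```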

Note that this is effectively a way to base fast-path decisions on information learned from slow-path computation. A way for the system to quite literally learn from the past.

Another example of this is tracking a package. The events involved are the various independent scans the package undergoes throughout its journey.

For “state,” you might just want to keep an abbreviated version of the raw history of each package:

[Figure: HBase state tracking package]

Or just some derived notion of its state:

[Figure: HBase state tracking package derived]

Those derived notions of state are tough to define from a single scan in a warehouse somewhere, but make perfect sense when viewed in the context of the entire package history.
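As a toy illustration of such a derived state (my own sketch; the scan types and statuses are invented), folding the full scan history into a single status might look like:

```
import java.util.List;

public class PackageState {

    enum Status { CREATED, IN_TRANSIT, OUT_FOR_DELIVERY, DELIVERED, EXCEPTION }

    // Fold the full scan history (ordered by time) into a single derived status.
    // Any individual scan is ambiguous on its own; the whole history is not.
    public static Status derive(List<String> scanTypes) {
        Status status = Status.CREATED;
        for (String scan : scanTypes) {
            switch (scan) {
                case "delivered":        return Status.DELIVERED;            // terminal state
                case "out_for_delivery": status = Status.OUT_FOR_DELIVERY; break;
                case "damaged":
                case "lost":             status = Status.EXCEPTION; break;
                default:                 if (status == Status.CREATED) status = Status.IN_TRANSIT;
            }
        }
        return status;
    }
}
```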

Wrap Up

We’ve had a glimpse of a selection of pipeline patterns to handle the following, at various scales:

  • Ingesting events
  • Taking action on those events
  • Recognizing activity based on those events

There are more details covered in the video.

These patterns are high-level views of some of our best practices for data engineering. These are far from comprehensive at this point, so I’m interested in learning more about what other folks are doing out there. Do you have other pipeline patterns you’d add to this list? Other ways to solve some of the problems we’ve investigated here? Please post comments or links!

You can find the slides here. Note that this isn’t an ordinary slide deck—be sure to use the “down” arrow to drill deeper into the content.

1. Note Mark’s audio drops out at the very end, due to connectivity issues. The content in the lost bit is covered in the “Wrap Up” of this post.

The post Building Pipelines to Understand User Behavior appeared first on Silicon Valley Data Science.


BrightPlanet

CASE STUDY: Combating Pharmaceutical Fraud with Deep Web Data Harvesting

We just added a new case study to our website. This case study focuses on how Deep Web data harvesting is helping combat online pharmaceutical fraud. Keep in mind that this same process can be applied to help protect any intellectual property that may be vulnerable online. The case study takes you through the entire BrightPlanet process […] The post CASE STUDY: Combating Pharmaceutical Fraud with Deep Web Data Harvesting appeared first on BrightPlanet.


Jean Francois Puget

Analytics For The Perfect Race Across America

 


 

Applying analytics to sports is one of the fun parts of my work. I had a great opportunity last year to work as part of an IBM team helping ultra-cyclist Dave Haase race across America. The Race Across America is quite a challenge: imagine a 3,000+ mile, non-stop race across the USA with over 110,000 feet of elevation gain (see pictures below). Cyclists can race as they wish and rest only when they choose to. Last year's winner slept about one hour every 24 hours, for 8 days. Dave Haase finished a close second, having rested for two hours every 24 hours, for 8 days too. I wouldn't be able to sustain this even without the cycling: I need far more sleep per day.

Just to give you a hint on the difficulty of the race, here is the elevation profile along the route, from RAAM site:

[Image: RAAM elevation profile]

and here is the route itself:

[Image: RAAM route map]

 

It was kind of nuts to be part of this last year, but what is even crazier is that Dave decided to race again this year. Of course, we decided to help him again. You can find more details on how IBM is helping on this blog. I will focus here on my part.

I basically help Dave's team decide when and where he should rest. The decision is based on road conditions (elevation, slope, etc.), Dave's condition (the power he can produce when cycling), and weather conditions, especially wind. For this I had to develop a series of analytics models. We are reusing these models this year, albeit with this year's data. Data come from various sources:

  • data about Dave's physical condition, from wearable sensors
  • data about current and forecast weather, from The Weather Company (TWC), now part of IBM
  • data about the road, from the GPS track provided by the RAAM organizers
  • etc.

We use the following models:

  • A model of how Dave's power (his muscle work) translates into speed on the road. This is a physical model, described in Modeling Cyclist Power (a generic sketch of this kind of model appears after this list).
  • A model of how Dave adjusts his power according to the slope of the road and weather conditions. This is a machine learning model, described in Predicting Cyclist Speed.
  • A weather prediction model (coming from The Weather Company). We use this as a black box to get the predicted wind along the road.
  • A combination of the three models above and a simulator that predicts how fast Dave moves along the route and where he should rest. It is also described in Predicting Cyclist Speed.
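The physical model itself isn't reproduced here, but a generic textbook force balance (rolling resistance + gravity + aerodynamic drag) gives the flavour. The sketch below is my own illustration with assumed coefficients, not the model from the linked articles; it recovers speed from a given power output by bisection.

```
public class PowerToSpeed {

    // Generic road-cycling force balance; all coefficients are illustrative assumptions.
    static double powerRequired(double v, double slope, double headwind) {
        double mass = 80.0;   // rider + bike, kg
        double g = 9.81;      // gravity, m/s^2
        double crr = 0.004;   // rolling resistance coefficient
        double cdA = 0.30;    // drag area, m^2
        double rho = 1.2;     // air density, kg/m^3
        double rolling = crr * mass * g * Math.cos(Math.atan(slope)) * v;
        double gravity = mass * g * Math.sin(Math.atan(slope)) * v;
        double air = 0.5 * rho * cdA * Math.pow(v + headwind, 2) * v;
        return rolling + gravity + air;
    }

    // Invert the model: find the speed sustainable at a given power, by bisection.
    static double speedFor(double watts, double slope, double headwind) {
        double lo = 0.0, hi = 30.0; // m/s search bracket
        for (int i = 0; i < 60; i++) {
            double mid = (lo + hi) / 2;
            if (powerRequired(mid, slope, headwind) < watts) lo = mid; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        // e.g. 220 W on a 2% grade into a 2 m/s headwind
        System.out.printf("%.1f m/s%n", speedFor(220, 0.02, 2.0));
    }
}
```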

The result is displayed on a dashboard that Dave's team can access anytime. 

The race started well for Dave, but he had a major health issue that made him stop for 17 hours. When he resumed racing (yes, he resumed!) he was 10th. As I write this, he is back in second place, which is absolutely incredible. I am proud we are helping him a bit, but he is doing the job, not us.

 

Lemoxo Technologies

What Is Dissertation Means DissertationMall.com

Dissertations Online Free

When you found us on the internet, you probably noticed that there are plenty of other writing services on the market. You must be careful! A great number of them do not value your business as much as Experts Essay does. They typically promise extremely low prices and great quality, but we know from experience that we usually get what we pay for, and writing companies are no different.

A great number of suppliers use overseas authors and pay them pennies per page. Sometimes, English is not even their first language. Many customers of these suppliers have been dismayed after receiving work that was badly written, plagiarised, or simply recycled. These businesses are also notorious for missing deadlines, something most people can ill afford.

What’s more, with these low-end firms you may be putting your data at risk. Several do not use secure connections to process your order. Many more will not protect valuable details such as your identity, email account, and telephone number. By using these companies you may be opening yourself up to considerable risk!

Experts Essay uses only professional freelance writers who are ultimately accountable to YOU! They know we expect the highest quality from them, and they deliver. We protect your data (see http://dissertationmall.com/dissertations-online/) and never release it to third parties. If you choose Masters Essay you can sleep easy, knowing your project will be completed on time and your personal information is safe.

  • Business Management Dissertation
  • What Is A Dissertation Abstract
  • Dissertation Critique Example
  • Electronic Thesis And Dissertation Database
  • Dissertation Discussion Chapter
  • Dissertation Topics In Banking

The post What Is Dissertation Means DissertationMall.com appeared first on Lemoxo.

 

June 22, 2016


Revolution Analytics

R, Stan and Bayesian Statistics

by Joseph Rickert Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by...

...

Revolution Analytics

R 3.3.1 now available

Peter Dalgaard announced yesterday on behalf of the R core team that R 3.3.1, the latest update to the R language, is now available for download from your local CRAN mirror. As of this writing,...

...
The Data Lab

New DataFest cooking after Data Talent Scotland success

Data Talent Scotland 2016

Here at The Data Lab, we want to make sure that the work we do has a positive impact on Scotland, so we conducted a survey with those who attended to find out whether the event helped them find a job, recruit talent or meet interesting, like-minded data-people.  

 

 

Below you can read through the complete survey feedback and watch a video summary of Data Talent Scotland 2016. You can also read a previous blog post summarising the event.

 

 

As you can see from the feedback, the event was a huge success and we can already see jobs being created and partnerships being formed. In 2017, we want to improve on this and make Data Talent Scotland even more useful for Scottish companies and students looking to recruit talent or find the job they want, but also to celebrate data innovation, bring the Scottish Data Science community closer together, and support internationalisation. That’s why we are launching DataFest 2017, a week-long series of events celebrating all things data, taking place from 20th to 24th March 2017 to show the world that Scotland truly is a data destination.

DataFest is shaping up to be a first-of-its-kind international data event in Scotland, providing attendees with the opportunity to learn from renowned international speakers and best practice, while offering an unprecedented networking platform where you will be able to interact with local and international talent, industry, academia and data enthusiasts. Data Talent Scotland 2017 will be part of DataFest 2017 and will take place on Wednesday 22nd March 2017.

If you are interested in being part of DataFest and/or Data Talent Scotland in 2017, whether as a sponsor, exhibitor, student or attendee, register your interest now to be the first to hear more news and secure your space at the events.

 

 

 

Teradata ANZ

Game Of Thrones – Who Dares Dies. Okay, So Who’s Next?

By Dr. Clement Fredembach

“Who shot JR?”

Two generations ago, Dallas had the whole global village on tenterhooks (his killer turned out to be the oilman’s wife’s sister who was pregnant with his child… ask your Dad).

Today, Game of Thrones is keeping the deadly tradition alive. Now, the question before each episode is “Who’s next?”. So, I thought I’d run the data and use data science to help me come up with the answer (by the time you read this you may already know, of course).


The story so far

Actually, the TV production is based on a series of books by G.R.R. Martin called ‘A Song of Ice and Fire’. Both the books (1,400,000 words) and the TV box set would take around 50 hours to digest. However, the books are more faithful to the author’s original concept, and as text is much easier to analyse, I chose the printed version.

Making connections

The Game of Thrones narrative twists and turns, navigating a complicated landscape of events and characters. For our purposes, however, complexity is no bad thing because instead of pronoun substitutions, character names are generally written in full and mentioned frequently, to the benefit of the analytical process. The 102 frequently-mentioned characters I followed also included nicknames from the appendices (e.g. Kingslayer, Khaleesi). This groundwork helped build a series of social networks.

In the visualisation below, you’ll see that:

  • each node (dot) colour represents a different community (reflecting the output of the modularity algorithm)
  • node size represents the number of character-mentions
  • edge (connecting lines) colour and thickness represent the strength of character connections (thicker and redder = stronger).


This social-network graph spans the entire collection of five books. It tells the story of the Lannisters (red), the dispersed Starks (dark blue, purple), the North (green), and a woman with dragons (gold). As far as the network structure goes, Tyrion and Jon are the heroes, and dead characters are either centrally located within their network, or are leaf nodes (nodes that have no ‘child’ nodes). In other words, few ‘hub’ characters are killed.

Was Eddard’s death a surprise?

Being able to calculate the state of the social network at any point in the story allowed me to determine whether the death of a major character is a surprise or an expected consequence of the network structure and previous deaths. To predict character deaths, I used a Loopy Belief Propagation (LBP) algorithm, which propagates beliefs (characters are ‘dead’ or ‘alive’) from a few key nodes throughout the network. Initial values for the algorithm are provided by:

  • When a character dies, his or her value is set to ‘dead’
  • Jon Snow (Ice) and Daenerys (Fire) are set to ‘alive’
  • All other characters are ‘unknown’.

Computing LBP for the social network developed in the first book (up to the death of Eddard) shows that Eddard himself is the character most likely to die, with a 95% belief.
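To make the mechanics concrete, here is a deliberately simplified, neighbour-averaging sketch of my own (a toy stand-in, not the author's actual LBP implementation): seeded characters are clamped to dead (1) or alive (0), everyone else starts at 0.5, and unknown nodes repeatedly take the average of their neighbours' beliefs.

```
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeathBelief {

    // belief: 1.0 = dead, 0.0 = alive; every character is assumed to appear as a key in graph.
    public static Map<String, Double> propagate(Map<String, List<String>> graph,
                                                Map<String, Double> seeds,
                                                int iterations) {
        Map<String, Double> belief = new HashMap<>();
        for (String node : graph.keySet()) {
            belief.put(node, seeds.getOrDefault(node, 0.5));
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>(belief);
            for (String node : graph.keySet()) {
                if (seeds.containsKey(node)) continue;      // clamped: known dead/alive
                List<String> neighbours = graph.get(node);
                if (neighbours.isEmpty()) continue;
                double sum = 0.0;
                for (String n : neighbours) sum += belief.get(n);
                next.put(node, sum / neighbours.size());    // average of neighbours' beliefs
            }
            belief = next;
        }
        return belief;
    }
}
```

In this toy version, characters whose converged value sits close to 1 are the ones the network structure “expects” to die.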

Over the first three books, LBP was remarkably accurate, successfully predicting the deaths of major characters (Catelyn, Renly, etc). In fact, only Joffrey’s death was missed (ranked seventh out of around 80 characters). LBP accuracy declined over the fourth and fifth books because:

  • the death toll kept growing
  • there were no obvious, additional survivors
  • the story structure changed (from significant major character interactions to a side-quest structure where the major characters became more distant).

Who dies in Book Six?

The end of Book Five left us pondering the possibility that Jon Snow could be dead; a problem for our analysis, since Jon was manually set to ‘alive’. Book Six deaths can therefore be predicted in two ways: keeping Jon ‘alive’, or marking Jon as ‘unknown’ (in this case, I mark Tyrion as ‘alive’ for an alternate ground truth). The results (alive: 0, dead: 1):

It’s close. But that’s not surprising considering the low number of ‘alive’ characters. However, as relationships are built over all five books, small differences become more meaningful.

Basically, if Jon Snow isn’t kept alive manually, he’s dead meat.

And if Jon’s alive then Brienne, Walder (Frey), Sam, Edmure, and the Lannisters, are next on the hit list.

This blog is based on a previous, much fuller account of the analytical process: http://bit.ly/24jLcfY

Read the second blog in the ‘Predicting Deaths in Game of Thrones’ series: Event-Based Survival Analysis And Conclusion: http://bit.ly/1SHtTAg

The Art of Analytics: http://bit.ly/1X2YDON

Dr. Clement Fredembach is a data scientist at Teradata ANZ Advanced Analytics Group

This post first appeared on Forbes TeradataVoice on 17/05/2016.

The post Game Of Thrones – Who Dares Dies. Okay, So Who’s Next? appeared first on International Blog.

 

June 21, 2016

Big Data University

This Week in Data Science (June 21, 2016)

Here’s this week’s news in Data Science and Big Data.

[Image: Smart Toothbrush]

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Cool Data Science Videos

The post This Week in Data Science (June 21, 2016) appeared first on Big Data University.

Silicon Valley Data Science

Kafka Simple Consumer Failure Recovery

A modern data platform requires a robust Complex Event Processing (CEP) system, a cornerstone of which is a distributed messaging system. Today, many people use Kafka to fill this latter role. Kafka can be complicated, particularly when it comes to understanding how to recover from a failure during consumption of a message. There are numerous options for such recovery; in this post we’ll just be looking at one of them.

We’ll walk through a simple failure recovery mechanism, as well as a test harness that allows you to make sure this mechanism works as expected. First, we’ll create a test Kafka producer and consumer with failure recovery logic in Java. Then, we’ll discuss a bash script that starts up a local Kafka cluster using Docker Compose, sends a set of test messages through the producer, and finally kills the consumer and resurrects it again in order to simulate a recovery.

The script will also provide a way to see if the recovery logic worked correctly. With a few small changes, you can use this script as a template for your Kafka consumers in order to see if the consumer recovery will perform as expected in production.

Local Kafka Cluster

The first part of this project involves finding a way to create a realistic Kafka cluster locally. I did this by forking a Kafka Compose project that contains all the Confluent components, and upgrading it to Kafka 0.9.0.0 (because of the fast pace of Kafka platform development, I haven’t had a chance to upgrade further to 0.10.0.0). You can find the project here.

Note: if you’re using OSX, spinning up the cluster requires a docker-machine install. I have added a small script that simplifies the installation of this cluster inside docker-machine.

All the below code is available in this repository.

After cloning, run the following script to bring up your development environment:
./dev_start.sh

Producer

The next step is to create your Kafka producer, which will be sending your sample events. Here is the relevant code:

     for (int i = 0; i < iterations; i++) {
         producer.send(new ProducerRecord<String, String>(topic,
                 Integer.toString(i), "Message_" + i));
         messagesSent++;
     }

     producer.flush();

As you can see, there is not much to this. We are simply creating a set of events that are easy to track in the output using their equivalent message number. Numbering the messages also allows us to see whether they were consumed in the correct order.
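In the harness itself, these producer settings come from the producer.properties file passed on the command line (as the test script later shows). For context, a hard-coded equivalent would look roughly like the sketch below; the property names are standard Kafka 0.9 producer settings, while the values and the usual producer imports are assumed rather than copied from the repository.

     Properties producerProps = new Properties();
     // Broker list for the local Docker Compose cluster (illustrative value)
     producerProps.put("bootstrap.servers", "localhost:9092");
     // Wait for the full acknowledgement of each record
     producerProps.put("acks", "all");
     producerProps.put("key.serializer",
             "org.apache.kafka.common.serialization.StringSerializer");
     producerProps.put("value.serializer",
             "org.apache.kafka.common.serialization.StringSerializer");

     KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);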

Consumer

Now for the consumer portion. We are going to be using the new consumer API introduced in Kafka 0.9.0.0. Whereas with Kafka 0.8.2.1 we had to consume every partition of a topic in a separate, manually created thread, this threading is now abstracted away by the consumer library, which makes the consumer logic a lot easier to implement and to understand. This consumer will implement an “at-least-once” recovery pattern: while we risk getting duplicate events after recovery, we are sure to ingest every single event that is produced.

To make sure that we’re doing “at-least-once” ingestion, we have to manually control how the latest consumed offset is committed back to Kafka (the new consumer stores offsets in Kafka itself rather than in ZooKeeper). The key is that we commit the latest offset only after we’re sure that we have persisted a set of pulled messages. This is done by setting the ‘enable.auto.commit’ consumer setting to false, and adding logic that manually commits our offset using the ‘commitSync()’ function. The logic that we use for consumption is the following:

     final int WRITE_BATCH_SIZE = 1000;
     Properties kafkaConsumerProps = new Properties();
     kafkaConsumerProps.load(new java.net.URL(args[0]).openStream());
     KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(
             kafkaConsumerProps);
     consumer.subscribe(Arrays.asList(topic));
     Path outFile = Paths.get(args[2] + "/" + topic + ".out");
     ArrayList<String> outputBuffer = new ArrayList<>();

Here, we set the batch size for the messages written out to file at a time. This also shows the new style of simple Kafka consumer logic: there is no longer any need to listen to every partition with a separate thread.
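For reference, the consumer.properties file loaded above would typically carry settings along the lines of the sketch below. The key names are standard new-consumer configuration properties; the values (and the producer/consumer imports) are illustrative assumptions rather than copies from the repository.

     // Illustrative equivalent of the consumer.properties file (values are assumptions)
     Properties kafkaConsumerProps = new Properties();
     kafkaConsumerProps.put("bootstrap.servers", "localhost:9092");
     kafkaConsumerProps.put("group.id", "failure-recovery-harness");
     // Disable auto-commit so offsets are only committed after a successful write
     kafkaConsumerProps.put("enable.auto.commit", "false");
     kafkaConsumerProps.put("key.deserializer",
             "org.apache.kafka.common.serialization.StringDeserializer");
     kafkaConsumerProps.put("value.deserializer",
             "org.apache.kafka.common.serialization.StringDeserializer");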

Now, let’s start listening to the topic. We will poll with a one-second timeout, fill our buffer with the returned records, and write them out to a file in batches using a synchronous helper method, allowing the write to finish before committing the offset.

    try {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            if (records.count() > 0) {
                for (ConsumerRecord<String, String> record : records) {
                    // Write out some details about the message so we can
                    // understand how the consumer is behaving
                    outputBuffer.add("Partition: " + record.partition() +
                            ",Offset: " + record.offset() + ",Key:" + record.key() +
                            ",Value:" + record.value());
                }
                if (outputBuffer.size() >= WRITE_BATCH_SIZE) {
                    writeBufferToFile(outFile, outputBuffer);
                    // Now that the records in the buffer are written, commit the latest offset
                    consumer.commitSync();
                    outputBuffer.clear();
                }
            } else {
                if (outputBuffer.size() > 0) {
                    // Make sure we write out the non-empty buffer
                    writeBufferToFile(outFile, outputBuffer);
                    outputBuffer.clear();
                }
            }
        }
    } finally {
        if (outputBuffer.size() > 0) {
            // Flush whatever is left in the buffer before closing
            writeBufferToFile(outFile, outputBuffer);
        }
        consumer.close();
    }

As you can see, instead of letting the consumer auto-commit, we manually commit the offset back to Kafka once we’re sure that the latest buffer has been written to the output. In the ‘finally’ clause we make sure that any buffer that may be left over is flushed before the consumer closes.

Test Harness

With the consumer and producer in place, let’s write the script which actually runs our test.

First, set some variables. Here, we’re passing the name of the test, the number of iterations to run through, and a flag that tells us whether to rebuild our consumer and producer.

RUN=$1
ITERATIONS=$2
REBUILD=$3

TOPIC=test_failure_$RUN
PARTITIONS=1
REPLICATION_FACTOR=2

Then we create the necessary topic:

kafka-compose/confluent-1.0.1/bin/kafka-topics --create --zookeeper $DOCKER_IP:2181 \
  --topic $TOPIC --partitions $PARTITIONS --replication-factor $REPLICATION_FACTOR

Next, we create the output folder for the current run, launch our consumer, and then producer:

CONSUMER_PROPS=file://$(pwd)/kafka-consumer-harness/src/main/resources/consumer.properties
PRODUCER_PROPS=file://$(pwd)/kafka-producer-harness/src/main/resources/producer.properties
if [ -d output_$RUN ]; then
 seconds=`date +%s`
 mv output_$RUN archive/output_${RUN}_${seconds}
fi

mkdir output_$RUN

java -jar kafka-consumer-harness/target/kafka-consumer-harness-$HARNESS_VERSION.jar \
  $CONSUMER_PROPS $TOPIC $__dir/output_$RUN >> consumer.out 2>&1 &
consumer_pid=$!

java -jar kafka-producer-harness/target/kafka-producer-harness-$HARNESS_VERSION.jar \
  $PRODUCER_PROPS $TOPIC $ITERATIONS >> producer.out 2>&1 &
producer_pid=$!

We use the cleanup function to remove the running producer and consumer after the script finishes:

cleanup() {
  kill -1 $consumer_pid
  kill -1 $producer_pid
  exit 1
}

trap cleanup EXIT
trap cleanup INT

After 15 seconds, let’s kill our consumer in order to simulate a failure:

sleep 15
#kill consumer
kill -9 $consumer_pid

And then start it back up again:

java -jar kafka-consumer-harness/target/kafka-consumer-harness-$HARNESS_VERSION.jar \
  $CONSUMER_PROPS $TOPIC $__dir/output_$RUN >> consumer.out 2>&1 &
consumer_pid=$!

This loop lets us watch the number of messages being output, so we can check that the recovery worked correctly:

while true
do
  echo messages processed: $(wc -l output_${RUN}/${TOPIC}.out)
  sleep 10
done

If the recovery worked correctly, we should see at least as many messages in the output as the number of iterations we passed in as an initial parameter to the script. The script can only work as intended if we pass enough messages in the input. Typically, passing 30,000 or more messages has worked pretty well for me, but that number would fluctuate based on the type of Kafka cluster you’re spinning up, or the type of message you’re passing.
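To automate that check, a small helper along the lines below could be pointed at the output file once the run finishes. It is a hypothetical addition, not part of the original repository: it counts the lines written, extracts the message keys, and reports how many duplicates the at-least-once recovery produced.

    import java.nio.file.*;
    import java.util.*;

    /** Hypothetical checker for the harness output (not part of the original project). */
    public class RecoveryCheck {
        public static void main(String[] args) throws Exception {
            List<String> lines = Files.readAllLines(Paths.get(args[0]));   // output_<run>/<topic>.out
            int expected = Integer.parseInt(args[1]);                      // iterations passed to the script

            Set<String> uniqueKeys = new HashSet<>();
            for (String line : lines) {
                // Each line looks like "Partition: p,Offset: o,Key:k,Value:Message_k"
                String key = line.split(",Key:")[1].split(",Value:")[0];
                uniqueKeys.add(key);
            }
            System.out.println("lines written:        " + lines.size());
            System.out.println("distinct messages:    " + uniqueKeys.size());
            System.out.println("duplicates:           " + (lines.size() - uniqueKeys.size()));
            // At-least-once: every produced message must appear at least once
            System.out.println("all messages present: " + (uniqueKeys.size() == expected));
        }
    }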

Conclusion

You should now have a basic idea of how to write a harness that tests a Kafka consumer failure recovery scenario. Admittedly, the scenario here is a bit contrived, but the above can be used as a template and expanded to whatever needs you may have: scaling out your consumer and producer, using a different-sized Kafka cluster, passing messages of different sizes and formats, or using different persistence mechanisms.

Consumer failure is a very real possibility in a production system. We hope that you can use the mechanism above to protect yourself against such failures, and to understand how your system will respond in this scenario. There are other failure recovery semantics, such as ‘at-most-once’ and ‘exactly-once’, which we hope to cover in future blog posts. Please share your experiences, or what you’d like to hear more about, in the comments.

The post Kafka Simple Consumer Failure Recovery appeared first on Silicon Valley Data Science.

Teradata ANZ

Transparency across the entire supply chain


Cost-effective, global and, of course, just-in-time: logistics chains have to meet ever more demanding requirements, and they are becoming increasingly complex. This is no longer true only for logistics service providers, but for globally operating companies in all industries. Big data specialist Teradata helps numerous clients bring together and evaluate all their available supply chain data. In doing so, organizations get an overview of all processes and costs in their own network, and they can even include data from their suppliers. In 2012, BMW Group decided to merge their supply chain data on a so-called agile information platform. In our interview, Stefan Betz from the Material Control Overseas Plants department of BMW Group explains why the company introduced this platform and how it paid off in a short time.

Mister Betz, the Material Control Overseas Plants department is responsible for the logistic supply of all BMW overseas locations. What are the main challenges in this area?

Stefan Betz: BMW Group is a very customer-oriented company. All over the world, we accept, and react upon, change requests from our customers until a few days before their vehicle is actually produced. This obviously makes high demands on our supply chain. Therefore, BMW already optimized the supply of its European plants in close proximity to their suppliers in the 1990s. But as the overseas markets grow steadily, thus becoming increasingly important for us, we produce more and more vehicles directly on site. This makes the supply of these plants very demanding: Flexibility must be maintained for our customers despite long transport times. At the same time, the enormous transport volume out of Europe requires the use of numerous service providers. All this is of course expected to run at optimal cost and with a large degree of sustainability. In addition, we need maximum transparency of all important supply chain processes – which was ultimately the main reason why we decided to introduce the agile information platform and selected Teradata as our technology and implementation partner.

So the agile information platform – a cloud solution based on a Teradata Data Warehouse Appliance – was your first step towards supply chain analytics?

Betz: Not the first step, but the logical continuation of what we have done at BMW with simpler tools like Excel, Access and SAP Business Information Warehouse in the past. We have analyzed data along the supply chain for a long time in order to optimize our logistics processes. But our traditional instruments increasingly reached their limits. Thanks to the agile information platform we now have a convenient possibility to bring together countless interfaces and large amounts of data. We can, for example, analyze where stocks are too high or transports are underutilized. In addition, we use new technologies that pave the way for new logistics use cases.

What goals did you set with the introduction of the agile information platform and how does it contribute to the overall success of BMW?

Betz: The central aim of the platform is transparency – in all its dimensions. It starts with the transparency of stocks and processing times, continues with the performance measurement of process steps and service providers, and extends to absolute cost transparency. Moreover, this data gives us fresh impetus, and it serves as both an enabler for and proof of our process improvements. We have a number of big data and analytics initiatives going on at BMW Group, and this project is one of them. Therefore, we also wanted to use this project to gather experience with a big data technology provider that we had not worked with before and to demonstrate the applicability of the technology in the field of logistics.


Have your expectations been met in this respect? Have you already achieved any results?

Betz: The range of positive effects is indeed huge: First of all, we were able to stabilize our supply chain significantly, while increasing transparency in all areas. We also achieved a noticeable reduction of the workload in material control, centralized tracking and tracing of all service providers, generally reduced the supply cost and enabled faster execution of new requirements thanks to ad-hoc reporting. Self-service business intelligence in the departments relieves the IT department, and we were able to improve data quality and standardization through the implementation of process templates. We can now identify deviations such as delayed shipments at a very early stage, so that both the service provider and we at BMW Group can take appropriate action sooner. This reduces the staff cost and financial expenditure considerably, and it creates room for further process improvements.

How is this project perceived internally at BMW?

Betz: In a very positive way. The logistics departments see the agile information platform as an innovation that delivers true added value. Also from a technological point of view, the Teradata solution sets high standards.

Teradata’s logical data model for the transport and logistics sector has been implemented as part of your solution. What are the benefits for you?

Betz: Generally speaking, the logical combination of data objects from different source systems facilitates our analyses along the supply chain and it promotes the implementation of and compliance with data standards. In addition, the Teradata solution offers the possibility to integrate data from our suppliers as well as predictive analytics applications. The latter is already strongly promoted in other areas of the company.

How was the implementation process? What departments were involved in the project, and how was the cooperation with Teradata?

Betz: The Material Control Overseas Plants department assumed the overall project management which also included the IT project management. Furthermore, the purchasing department was involved in the project, and we consulted experts of the BMW IT Research Center at Clemson University Greenville in South Carolina.

In order to ensure smooth cooperation between the different departments, we installed an overarching steering committee that was responsible for the project organization. Overall, the introduction went very smoothly, and the Teradata Professional Services specialists had an important part in this success: We have experienced their cooperation with the departments as very constructive, cooperative and solution-oriented. This was an important success factor and already led to excellent results in early phases of the project – which gave additional boost to the overall project.

The post Transparency across the entire supply chain appeared first on International Blog.

Roaring Elephant Podcast

Episode 18 - MLeap interview: Productionising Data Science - Part 2

In this episode, we have the second part of the interview with Hollin Wilkins and Mikhail Semeniuk, the driving forces behind the MLeap project, where they go into more technical details and give tips on deploying MLeap in your environment. If you are working with Spark, are deep into machine learning and are struggling to put those beautifully trained models into production, you definitely do not want to miss this episode! Read more »
Teradata ANZ

What is in a Name? A Data Scientist by any other name …

The term “data science” was first used by the statistician William S. Cleveland in his 2001 paper entitled “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”. Cleveland emphasized that the “[results in] data science should be judged by the extent to which they enable the analyst to learn from data”.

The scientific discipline of learning from data existed for centuries before the term data science ever came into being. Statisticians have been collecting, processing, analysing, visualising and interpreting vast amounts of diverse data to generate models. In doing so, they developed many algorithms for regression and classification, such as GLM (Generalised Linear Modelling), that are embedded in statistical packages such as SAS and SPSS and used extensively to this day. They also developed fundamental theories that have been the basis of many learning algorithms developed in other fields, e.g., the Support Vector Machine (SVM).

The focus in statistics has been, firstly, on inference, i.e., generating stochastic models to fit the data, and secondly on theoretical rigour in deriving statistically sound inferences, i.e., ensuring that the assumptions behind data distributions and data independence are valid. Machine learning, on the other hand, focuses on learning from data to make predictions without any reference to the underlying mechanism generating the data. Machine learning practitioners employ different predictive algorithms, many of which were developed by statisticians or are based on statistical learning theory. The use of N-fold cross-validation or leave-one-out methodology to compare the accuracy of different algorithms, the development of SVM solvers, ensemble classifiers and deep learning, as well as different metrics for evaluating prediction accuracy, are the result of fundamental machine learning research.

Since the 1990s, “Knowledge Discovery”, “Data Mining” and “Advanced Analytics” have come to encompass a broader field: that of computational discovery of useful patterns from large data-sets. The patterns may be grouping of data (cluster analysis), unusual data (anomaly detection), dependencies (association rule mining), generalizations (prediction models for classification and regression) and compact representations (summarisation).

Since 2012, after the Harvard Business Review article entitled “Data Scientist: The Sexiest Job of the 21st Century”, Data Science is the popular tag for this field. In the article, Thomas Davenport and D. J. Patil describe the combination of the skills needed in the data scientists employed in data-driven companies such as Google, Amazon, Microsoft, Walmart, eBay, LinkedIn, and Twitter. The data scientists are described as a hybrid of data hacker, analyst, communicator, and trusted adviser, i.e., having skills in multiple disciplines namely statistics and programming while at the same time understanding the business need and being able to communicate.

Bhavani Raskutti_Data Science 1
Above: Data Scientist Venn Diagram sourced from Stephen Kolassa’s comment in Data Science Stack Exchange.

Davenport and Patil describe data scientists as curious, self-directed and innovative, i.e., they are not limited by the tools available and, when needed, fashion their own tools and even conduct academic-style research. Not surprisingly, people with this combination of skills and characteristics are rare, as rare and as much in demand as computer programmers were in the 1990s.

This rarity and high demand for data science skills has meant that statisticians, machine learners, data miners, data analysts, DBAs and quantitative analysts, i.e., people with any data or analytics skills, have re-badged themselves as data scientists so that they are more marketable. This is not unlike the pre-Y2K hype, when computer operators and PC users re-badged themselves as computer programmers.

The term “data scientist” itself has become so diffuse that it covers anybody from database administrators, to analysts producing simple summaries in Excel spreadsheets, to data engineers setting up Hadoop infrastructure, to advanced analytics practitioners who discover valuable insights from data using existing tools, as well as those, like the data scientists at Google and Facebook, who derive insights from data using their own enhanced toolkits.

So, is the name really relevant? Apparently not, since Google’s career pages advertise for Decision Support Analysts, Statisticians, Quantitative Analysts, and Data Scientists and they all mean the same thing. Over the last 50 years, many people have been working as the data scientists described by Davenport and Patil, discovering insights from large volumes of diverse data using existing tools as well as new tools that they fashioned. They have been labelled statisticians, artificial intelligence researchers, data miners, machine learners, advanced analytics experts and the list goes on.

What is relevant is to understand where an individual’s interest lies in the broad data science church and where the needs of the organisation are. The individual’s interest may be developing innovative algorithms to solve a new problem (the high-end data scientist described by Davenport and Patil), or identifying new business problems that can be solved with existing tools or distributed programming for Hadoop. The key is to match the organisation’s needs with an individual’s interest and not be bothered with the position title or the candidate’s label.

Finally, as for finding this rare species, let me point out that the characteristics of curiosity, self-direction and innovation are required in all scientific research. Fashioning tools to overcome a challenge has always been the hallmark of a research scientist. Didn’t Newton invent infinitesimal calculus when the mathematical tools at his disposal were insufficient to calculate instantaneous speed? Furthermore, scientific research through a PhD ensures that researchers are able to teach themselves new skills.

So, instead of looking to graduates from the newly designed data science majors, develop your own data scientists: first find someone with a PhD or Masters in a quantitative science such as physics, mathematics, statistics or computer science, and then provide them with data, time and autonomy. It worked for LinkedIn with Jonathan Goldman and for many other data-driven companies, and it can work for you too!

The post What is in a Name? A Data Scientist by any other name … appeared first on International Blog.

 

June 20, 2016


Zementis

Adding arbitrary attributes to the PMML MiningField element


...

Revolution Analytics

Exploring Global Internet Performance Data Using R

by Lourdes O. Montenegro Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of...

...

Rob D Thomas

iPad Pro: Going All-in

Here is my tweet from a few weeks back: I have given it a go, going all-in with the iPad Pro. In short, I believe I have discovered the future of personal computing. That being said, in order to do...

...

Revolution Analytics

Updates to the 'forecast' package for R

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available on CRAN. Statistical forecasting — the...

...
 

June 17, 2016


Revolution Analytics

Because it's Friday: How to get a baby to sleep

We've had our 6-month-old niece staying with us for the past week. She's been a joy, but we've got some catching up to do with this NZ dad when it comes to getting a little one to sleep: Maybe it's...

...

Revolution Analytics

Data Journalism Awards Data Visualization of the Year, 2016

Congratulations to Peter Aldhous and Charles Seife of Buzzfeed News, winners of the 2016 Data Journalism Award for Data Visualization of the Year. They were recognized for their reporting for Spies in...

...

BrightPlanet

What Type of Web Data Can You Collect From Facebook?

With over 1.65 billion monthly active users, it’s no surprise that businesses want to access user-generated content from Facebook. Status updates, likes, reviews, comments, photos, and videos all contribute to a sea of web data generated by Facebook users. But what exactly can be collected from Facebook and how are companies typically accessing that data? […] The post What Type of Web Data Can You Collect From Facebook? appeared first on BrightPlanet.

Read more »
 

June 16, 2016

Silicon Valley Data Science

CDO FAQ

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

At Silicon Valley Data Science, our focus on data strategy has given us a window into how various organizations are thinking about data at the executive level. Many large companies have been hiring Chief Data Officers (CDOs) to oversee the process of creating a data strategy, or to oversee compliance efforts in highly regulated industries.

The CDO is still a relatively new position and consensus is still gathering about the exact job description, reporting structure, or qualification set. Last year, I wrote a report examining some themes and trends I noticed after speaking with a dozen current and former CDOs across multiple industries. I’m currently working on a second edition of this report, which will be published by O’Reilly Media this September.

In the meantime, as I’ve given a handful of talks about the role of the CDO at various tech conferences, I have been impressed by the resonance of the topic and the consistency of the questions people ask about it. I thought it would be useful to compile a list of the most frequently asked questions and major discussion points. The second edition of my report will directly address the questions below—and others—in much more detail, so stay tuned. But this post aims to point you in the right direction right away.

If you have other questions or observations to add, please leave them in the comments below!

What should I look for in hiring a Chief Data Officer?

The ideal CDO exists to drive business value, so business skills and the ability to ask well-formed and relevant business questions are the first key skillset. Of course, the CDO should also be aware of the tools, techniques, and challenges involved in working with data, so sound technological experience is also key. Third, the CDO needs to be able to work well with others in all parts of the organization—above them as well as below them—so diplomacy and other soft skills are essential. Finally, remembering that this is an executive-level position, the CDO should have executive-level experience.

If finding someone like this sounds to you about like finding a purple unicorn with glittery silver wings, you’re not wrong. So start by identifying the key business objectives that are driving you to hire a CDO, match them to the most important skills required, and look for those first. Any missing skills can either be picked up on the job, or added in the form of other people on the CDO’s team. Remember, working with data is almost always a team sport!

How can I become a Chief Data Officer?

Work the process above in reverse. Start by identifying your own skillsets—are they more business-oriented or technology based? Work on your political skills. Identify your relevant project experience. Then look for companies whose business problems most likely match your expertise. Think through the challenges and opportunities they face with data, and look at what their competitors are doing. Find out as much as you can about how they’re already using analytics, and how the relevant teams are structured. Then, tell the company about the business value you would be able to deliver as CDO and describe some potential use cases so that when they bring you in, you’ll be prepared to hit the ground running.

What is the ideal reporting structure for the CDO?

The organizations that already have CDOs use myriad reporting structures: the CDO may report to the CIO, CTO, CEO, COO, or even the CFO. Some large enterprises have more than one CDO—perhaps one per division or one per country or region, reporting up to an executive CDO. In terms of a prescriptive recommendation, however, it seems to me that having the CDO report directly to the CEO is the best overall plan.

The Chief Data Officer forms a critical bridge between the business side and the technical side of the organization. Embedding the position in one side or the other inevitably skews the focus of the role. Also, it is critical for the CDO to have a real seat at the table when important strategic decisions are being made. Making this a true C-suite position with direct reporting to the CEO is the best way to ensure that the CDO’s voice is heard clearly.

What is the difference between the CDO and the CIO?

The title of Chief Data Officer sounds in some ways exactly like the title of Chief Information Officer—it sounds like they should be about very similar goals. Aren’t “data” and “information” the same thing? Indeed, the CDO is in some ways like a second take on the role of the CIO: somehow the CIO role took a left turn 20 years ago and ended up doing something else. But think of the CIO as short for the C-IT-O, and all of a sudden it becomes more clear. Most CIOs now are really about information technology, and the computer systems and hardware used by the organization.

The CDO, by contrast, is focused on the data itself, and, most importantly, how it can be used to drive real value for the business. The CDO is not so much about procurement and vendor contracts as about asking the right questions, making sure the data to answer them is accessible, and turning the resulting insights into actions. The CDO and CIO should work closely together, to be sure. But the CDO also has a strong business focus and works just as closely with other parts of the organization.

 

For more background on the role of the Chief Data Officer and more trends and observations, download the free report here.

The post CDO FAQ appeared first on Silicon Valley Data Science.


Revolution Analytics

Gender ratios of programmers, by language

While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Specifically by how much...

...
Jean Francois Puget

Machine Learning As Prescriptive Analytics

I made a mistake about machine learning.

Repeatedly. 

I said, and I wrote, that machine learning and predictive analytics were almost the same. 

To be more specific, my view was simple: analytics can be divided into four categories, as exemplified below (see Analytics Landscape for details):

image

I put machine learning near predictive analytics in this 2D landscape:

image

Of course, I also put optimization as the queen of all analytics technologies, as it yields the best business value. What else would you expect from someone who has spent nearly three decades working in optimization? No wonder this view became popular in the optimization community...

If this is popular, then why is it wrong?

First, let me reassure readers about my mental health: I still think that optimization is best for computing optimal decisions. The issue lies elsewhere.

I started thinking there was an issue when I met customers wanting to use machine learning to solve every business problem they have. I can't blame them; there is so much buzz around machine learning these days that it is hard not to think it can solve every problem there is. I can't blame them, but I was still a bit shocked when I saw people wanting to use machine learning to compute the best decisions. My mental model was that machine learning should be used to make predictions, and that optimization should be used to optimize decisions given these predictions:

Machine learning for predictions, optimization for decisions.

Let's use an example for the sake of clarity. We can use machine learning techniques to build a model that predicts future demand (future sales levels) for a retail store chain. We can then use optimization to compute optimal inventory management for these stores, making sure the cost of inventory and the risk of empty shelves are kept to a minimum.

Simple and powerful, isn't it? 

Yet, I kept meeting people thinking of machine learning as a way to directly learn what the good decisions are.  It made me think.  Are they really wrong? 

They may not be wrong after all.  To see why, let's look at some facts.

One of the first known examples of machine learning is Arthur Samuel's work on a program that learned how to play checkers better than he could. Samuel used machine learning to learn how to play, i.e. how to make the right decisions. There are other examples of machine learning being used to learn how to make the right decisions. For instance, IBM Watson defeated the best humans at Jeopardy! a few years ago. IBM Watson is a learning machine (we like to say a cognitive machine at IBM). It was given Wikipedia to digest, then it was trained with question-answer pairs from Jeopardy. As it trained, its performance at answering Jeopardy questions improved until it was able to best top human players. IBM Watson learned how to answer questions; it learned how to decide which answer was best for a question it had never seen before.

More recently, Google AlphaGo won a match against one of the top Go players. AlphaGo is also a learning machine. It was first trained on a large set of recorded Go games between top players, and then it trained against itself. As it trained, its performance at Go increased until it became better than a top human player. AlphaGo learned how to decide what the best next move is for any Go board configuration.

Once we look at these examples, we can no longer say that machine learning is just for making predictions. Machine learning can also be used to compute decisions.

Does this mean that machine learning is going to replace optimization? Let's zoom in a little before answering that question.

When using optimization directly, someone, typically an operations research practitioner (a kind of data scientist, actually), creates an optimization model of the business problem to solve. For instance, that person will model all the inventory constraints (available space, cost per stored item, shelf life, cost of transportation, etc.), as well as the business objective (a combination of inventory cost and the risk of lost sales because of empty shelves). Then, on a regular basis, this model is combined with a particular instance of the business problem to yield an optimization problem (for instance, the inventory left from the previous week and the predicted demand for the coming week). That problem is then solved by an optimization algorithm to yield a result (for instance, a replenishment plan). This can be depicted as follows.

image

When using machine learning, we start with a (large) set of instances of the business problem to solve, together with their known answers (yes, this is supervised learning). We then train a model on these examples, with the goal that the model can compute the best solution to new problems. The model is then applied to each new instance to compute its answer. The point is that the training phase is itself an optimization problem: it amounts to finding the model with the best predictive power given the training examples; see the appendix below for more on this. We therefore get this workflow for machine learning:

image

The comparison with the previous picture is striking, isn't it? When we choose between optimization and machine learning to compute the best decisions, we are actually choosing when to use optimization. Optimization will be used either way.

I will let readers ponder about this.

Let me conclude with a question: why would we use optimization one way rather than the other? Should we train a machine learning model on past examples of good and bad decisions, or should we apply optimization to each new instance independently of the others? I don't have a clear generic answer yet, and I welcome suggestions from readers. My gut feeling is that the right answer could be a combination.

Appendix: Machine Learning Is An Optimization Problem

The view of machine learning as optimization is not originally mine. For instance, here is a great quote from John Mount that I found when writing Machine Learning and Optimization:

My opinion is the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265–1281). Good machine learning papers use good optimization techniques and bad machine learning papers (most of them in fact) use bad out of date ad-hoc optimization techniques.

I also found this definition of machine learning as an optimization problem when writing What Is Machine Learning? :

image
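To make that framing concrete, here is a minimal sketch (not from the original post) in which the ‘learning’ step is written out explicitly as an optimization problem: fitting a one-variable linear model by gradient descent on the mean squared error. The data, learning rate and iteration count are illustrative.

    /** Minimal illustration: training a model is just minimising a loss function. */
    public class TrainingAsOptimization {
        public static void main(String[] args) {
            // Toy training examples: y is roughly 2*x + 1
            double[] x = {0, 1, 2, 3, 4};
            double[] y = {1.1, 2.9, 5.2, 7.1, 8.8};
            double w = 0, b = 0, lr = 0.01;   // model parameters and learning rate

            // Gradient descent on the mean squared error: the optimization problem
            // sitting inside the training phase.
            for (int step = 0; step < 5000; step++) {
                double gw = 0, gb = 0;
                for (int i = 0; i < x.length; i++) {
                    double err = (w * x[i] + b) - y[i];
                    gw += 2 * err * x[i] / x.length;
                    gb += 2 * err / x.length;
                }
                w -= lr * gw;
                b -= lr * gb;
            }
            System.out.printf("Learned model: y = %.2f * x + %.2f%n", w, b);
        }
    }

Whether the learned parameters encode a prediction (a demand forecast) or a decision rule (the next move in a game), the training step itself is solved by an optimization algorithm.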

 

 

 

The Data Lab

Data Disruption in TV Advertising

TVSquared

While some of these 60s-era polling tactics have come into question recently, this fascinating piece of history is still a perfect example of the unrivalled power of TV.

Fifty-six years later, we live in a digital world and the way people consume content has changed drastically. Headlines tout the increase in digital ad spending among advertisers (and even political candidates). Many of these articles cite digital’s measurement and targeting opportunities as the reasons for the growth. And that makes sense. Traditionally, advertisers had no timely way of knowing how TV spots performed or the opportunity to make “on-the-fly” changes to improve on-air campaigns. Given those limitations, it’s no wonder advertisers turned to other, more measureable outlets.

But buried within the noise of digital ad spend is the fact that TV is still king for advertisers – and the rise of data-driven technologies is only making it more powerful.

  • TV is unequivocally the biggest outlet for media spending, with TV advertising accounting for a £127B global industry. In fact, it’s estimated that U.S. political ad spending will reach up to $6B (£4.2B) this year. Given TV’s sheer reach and the fact that the average person watches thousands of hours of TV a year, it’s not hard to understand why those numbers are so high.
  • Today’s linear TV is an optimised marketing channel. The use of big data and analytical technologies has enabled advertisers to target, measure and optimise TV campaigns just as they do with digital.

It’s fascinating to see how big data has completely transformed TV advertising – which, up until a few years ago, operated a lot like it did during the Don Draper-era of Madison Avenue. Compared to other industries, it was slow to adopt analytics, but it’s certainly making up for lost time.

Advertisers and, going back to my original example, even political teams, have leveraged technologies that measure same-day spot and response data (phone, app, web, SMS, etc.) to make TV work better for them. They can now answer questions like what network, daypart, program, genre and creative work best to reach their target audiences. Advertisers now know which audiences are the most engaged and where they are located; they even have the insights needed to change on-air spots to improve response and ROI.

While the use of these technologies in TV advertising space is still in its infancy, the impact they’ve had is impressive. Some advertisers have been able to improve TV efficiency by upward of 70 per cent.

Combine TV’s massive reach with today’s data-driven technologies and advertisers have an incredibly powerful tool to work with.  

 

Find out more about how TVSquared is using Data Science to disrupt TV advertising analytics.

 

 

June 15, 2016


Revolution Analytics

The R Packages of UseR! 2016

by Joseph Rickert It is always a delight to discover a new and useful R package, and it is especially nice when the discovery comes with context and a testimonial to its effectiveness. It is also...

...
Ikanow

Video: Take Back Control of Your Information Security

The numbers are staggering. PwC reported there were 60 million information security attacks on enterprises last year, and CyberEdge Group found that over 70% of all companies were breached. The attack surface has expanded rapidly as the number of managed devices increased by another 72% over 2014, according to Cisco. And according to IBM and CSO, the typical enterprise has between 60 and 85 information security tools to manage. It’s no wonder information security teams feel like they have less control. Click the video below to learn more.

If you would like to go deeper into this topic. You can download the white paper Take Back Control of Your Information Security by clicking the button below.

Download White Paper

The post Video: Take Back Control of Your Information Security appeared first on IKANOW.


BrightPlanet

DATA VISUALIZATION: Uncovering Insights from Election Coverage

How often are the presidential candidates being mentioned in the news? And who’s mentioning them? Two questions we asked ourselves at BrightPlanet recently. Fortunately for us, we already had the data in our Global News Data Feed to help answer those questions. Using the Global News Data Feed, we created the following visualization via Tableau to help answer […] The post DATA VISUALIZATION: Uncovering Insights from Election Coverage appeared first on BrightPlanet.

Read more »
Teradata ANZ

Four Lessons From The Fire Monkey For Today’s Energy Businesses

2016 is the Year of the Monkey in the Chinese calendar. More specifically, it’s the Year of the Fire Monkey. So, for those of you unfamiliar with Chinese and wider Asian culture….who or what is the Fire Monkey? Firstly, the Monkey is one of the 12 animals of the Chinese Zodiac. More specifically though:

“The influence of the Fire Element makes the Fire Monkey the most creative and energetic of the Monkey types. Full of confidence, the Fire Monkey is also the most competitive member of its sign. While other signs may take to the sidelines, the Fire Monkey is all about action and initiative.”

Sounds like a smart animal to me. It also sounds like the sort of animal that, in times of massive, unprecedented disruption and opportunity, energy businesses could take a few cues from. OK, maybe that’s just because I’ve spent all my life in utilities. Maybe I need to get out more.

Either way, what could energy companies learn from the ambitious, adventurous (and admittedly, sometimes irritable) Fire Monkey?

Untitled
Be Energetic and Action-oriented

The energy and utilities industry hasn’t changed much in the past 100 years. Sure, we’ve had privatisation in some parts of the world. And retail competition. But nothing like the disruption we see today. Surviving the 21st Century means embracing change and even demonstrating agility through our choices and actions. But it needs to happen now. Whether that’s by employing better methods of managing talent; becoming genuinely customer-centric for the first time; or a host of other key foundations for the change ahead, suddenly the pace of change – and the pace of learning how to manage that change – has increased exponentially.

Be Competitive

Competition is increasing in many forms across the world’s energy businesses. More countries are opening up to retail competition all the time. But the business of energy generation is also facing increasing global competition; so too, major energy project construction work. And what about when Tesla’s Powerwall home battery system becomes a competitor to your own network-based supply model? Even for the remaining state-owned monopoly utilities, the time is probably already here when some form of competition is harming your business. It’s time to learn what it means to be part of the wider, competitive world and to act accordingly.

Be Creative

Today’s utility will only survive if it can innovate. How do you cope with the challenges of Smart Cities and, say, the electrification of transport? (And no, that’s no guarantee of increased demand. Not if energy-autonomous vehicles have anything to do with it.) Are you ready to take advantage of the opportunities presented by the Internet of Things (IoT)? Respected industry player DNV GL has even begun to talk about the Internet of Electricity. No, seriously. In such an environment, utilities must embrace creativity and consider new business models that will be fit for purpose in this new, connected world.

And finally, be focused

“The Fire Monkey’s adventurous side can run the risk of becoming reckless, especially when the Fire Monkey does not think their decisions through. In order to overcome their rash tendencies, the Fire Monkey is encouraged to take their time and focus their ambitions.”

Ay, there’s the rub. In an industry not too familiar with change or with agility, it can be all too easy to rush off in the wrong direction. Big Data & Analytics is the perfect example. It’s not something that’s core business, obviously. But it is something that’s got the potential to transform….and also the potential to do nothing more than burn budget, reputations and even careers. If Big Data is to be a key enabler for the 21st Century utility, it’s got to be managed well, with clear goals and strong governance. My own experience suggests that utilities would do well to remember that.

And so there we have it. All in all, it seems to me that the Fire Monkey could be something of a role model for the 21st century energy business (naughtiness or rashness aside, of course). It’s a time when creativity, drive and…ahem…energy have been needed in utilities like never before. So my advice: be like the Fire Monkey. At least until the next Year of the Dragon!

This post first appeared on Forbes TeradataVoice on 04/02/2015.

The post Four Lessons From The Fire Monkey For Today’s Energy Businesses appeared first on International Blog.

 

June 14, 2016

Data Digest

Do or Die: 5 Data Analytics Project Pitfalls you Must Avoid

“If I had asked people what they wanted, they would have said faster horses.” Henry Ford


I’ve seen this quote referenced so many times with regard to data and analytics technologies, and it seems perfectly apt. With the rise of Big Data and analytics we have seen a surge of investment in the various tools and technologies – en masse. Yet not everyone is delivering the innovative business insights they were anticipating. Open source technology has reduced the barriers to entry, making it all the more tempting to implement these tools in a “me too” style. Implementing the same tools as the rest of the crowd and trying to do it better is not likely to benefit you unless there is a clear need in your business.

During my recent research in developing the Chief Data and Analytics Officer Forum, Melbourne,  I came across some of the key reasons why organisations are unable to leverage their data for innovation.

Top 5 Issues to Address:

1.      Lack of an enterprise-wide strategy. A recent post about the data disconnect touched on the importance of a carefully managed data analytics strategy. A data strategy must be effectively communicated across the business and underwritten by well-developed organisational support in order for it to become an inherent part of the way an organisation operates.

2.      Lacking the right skillset. People are often searching for the perfect blend of IT and business experience and there is much debate around whether that skillset should be recruited or built internally. Not having the right skillset at the right time can be fatal for an analytics project.

3.      Disparate information systems. Uniting all relevant data from various legacy systems and differing technologies is a very common challenge. In order for your business insights to be meaningful, they need to be derived from one source of the truth. Crucially, data governance and data quality must underpin every data analytics project.

4.      Failure to identify the business problem. A point that is always highlighted at our conferences is that you must first identify the business need and then design the analytics project. Collecting data in the hope that something meaningful will emerge that will be of use for the business is an incredibly inefficient way of operating – you need to first know what problem you are trying to solve.

5.     Need for a ‘fast-fail’ analytics culture. Building a ‘fail fast, learn quickly, move forward’ environment will reap the greatest rewards. Pilot your project before scaling up: starting small and failing fast minimises the economic loss and proves the viability of the project before you commit to scaling it.

The pitfalls are many but the general consensus is that an iterative approach to data analytics projects is a must. Don’t be disheartened by failure – expect it.  Focus on the business problem and start asking the right questions in order to tailor your project. If Henry Ford had asked his customers about their day-to-day needs, perhaps he would have got a different answer.


You can hear more on this topic at our Chief Data & Analytics Officer Forum in Melbourne this September. Phil Wilkenden, ME Bank and Richard Mackey, Department of Immigration & Border Protection will be hosting a discussion group addressing how to understand the challenges of data and analytics projects.



Monica Mina is the Content Director for the CDAO Forum Melbourne. Monica is the organiser of the CDAO Forum Melbourne, consulting with the industry about their key challenges and trying to find exciting and innovative ways to bring people together to address those issues – the CDAO Forum APAC takes place in Sydney, Melbourne, Canberra, Singapore and Hong Kong. For enquiries, email:monica.mina@coriniumintelligence.com.
Big Data University

This Week in Data Science (June 14, 2016)

Here’s this week’s news in Data Science and Big Data. Big Data and Love

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (June 14, 2016) appeared first on Big Data University.

VLDB Solutions

Google Cloud Platform Review: Head In The Clouds

Google Cloud Platform

Google Cloud Platform Review: Head In The Clouds

 

Over the last decade, cloud computing services have skyrocketed as people push them to their limits and beyond to find more advantageous uses for them. Providers such as Amazon Web Services (AWS), Microsoft Azure and VMware are constantly refining their services and providing users with better means to their ends. But in recent years, one service in particular has provided cloud computing that boasts both flexibility and performance to a high degree: the Google Cloud Platform (GCP).

 

Flexibility & Performance

 

The Google Compute Engine (GCE) is the main component offered by the Google Cloud Platform that we are interested in. This service allows the general public to create virtual machines (VMs) with an excellent range of selections and prices for the general hardware required for a VM.

 

Some of the most common hardware choices are, of course, CPU and RAM. When creating a new VM, the Google Compute Engine offers a list of fixed machine types to choose from, but once chosen you then have the freedom to slide both RAM and CPU up or down as required, which is a useful benefit.

 

Potentially the most important hardware for a VM are the disks. Disks can be chosen from either HDDs or SSDs, both scaling in IOPS (Input/Output Operations Per Second) as you increase the size of the disk, maxing out at approximately 10,000 read IOPS and 15,000 write IOPS for a 350GB (or higher) SSD.

 

You can also choose whether or not the disk created is persistent. This means that all of the data on your disks will be protected, and as a result will persist through system restarts and shutdowns. Should you not require your VM online at all times, you can shut down your server for a period and you will not be charged the full price of the VM during this time. Persistent disks therefore not only offer security for your data, but also additional benefits that might not be seen with some other cloud computing services.

 

One other option available for disks is the local SSD. This is an SSD that is physically attached to a VM, which in turn offers superior performance on the other available disk options. However, this increased performance does have certain trade-offs.

 

  • First, a local SSD can only be created when the VM itself is initially created. Thus it is not possible to add a local SSD to an existing VM.
  • Second, when creating a local SSD there is [currently] a fixed size of 375GB. This may be more than enough for certain situations, but for others this can be very limiting indeed.
  • Third, and potentially the biggest downside, a local SSD cannot be persistent. This means that any type of system fault could result in all data held on a local SSD being lost. It also means that the VM cannot be shut down, which removes the benefit of temporarily turning off the VM when it isn’t needed.

 

As a result of the options available, the flexibility regarding both performance and price is highly suitable for anyone. It means that you can truly specify a system to your requirements, depending on both the performance and price range that you are looking for.

 

Testing The Platform

 

With all of the options available, the Google Compute Engine sounds like a great environment to build your new VMs in. But how easy is it to set up a VM in the Google Compute Engine? How does it fare for the tasks that you require? To try to answer these basic, but very important, questions, we installed a Greenplum multi-node cluster within the Google Compute Engine using Pivotal’s Greenplum database version 4.3.7.1 and servers provisioned with a CentOS 7.2 image.

 

The very simple reason for using Greenplum as a test is that it ticks all of the boxes that would generally be required of a day-to-day server. Basic system processes can test the general performance of disks, CPU and RAM. By running a set of TPC-H queries in a loop over a few days, it is also possible to see how daily traffic may, or may not, affect the performance of the servers.

 

Furthermore, a Greenplum database requires reliable broadcast capability on the network between the multiple VMs, without any interference. When initially looking into the networking capabilities of VMs in the Google Compute Engine, various posts made it appear that running a Massively Parallel Processing (MPP) Greenplum instance would be difficult, if possible at all. This was therefore essentially make-or-break for the initial testing stages.

 

Google Compute Engine Cluster Setup

 

Setting up a VM using the Google Compute Engine is relatively straight forward. Once you are logged in, simply go into the Google Compute Engine and click Create Instance in the VM instances tab. From here you can select the name of the instance (it is important to note that this is the host name of the server as well), CPU and RAM values, boot disk options (including the OS image as well as the boot disk size and type), and optional extras (such as additional disks and network parameters).

 

That’s all there is to it! Once you’re happy with the VM details, click Create and the Google Compute Engine will provision it in a respectably short period of time. Once provisioned, it is possible to click on the VM to get a more detailed overview if required. From this view you can also choose to edit the VM and alter most settings.

For our test cluster, we provisioned three VMs, each consisting of 2 CPUs, 4GB RAM and one standard persistent boot disk with 15GB space. The master VM has an additional standard persistent disk with 200GB space (which will be used to generate and store 100GB of TPC-H data), whilst the two segment servers each have an additional SSD persistent disk with 250GB space (for the Greenplum database segments).

 

Disk Performance

 

It isn’t always an initial thought that disks can often be a limiting factor in server performance, especially when it comes to databases. Shovelling on more CPUs and RAM may help in part, but there is always an upper limit and disks can often impact that limit.

For a provisioned 250GB SSD disk in the Google Compute Engine, one would expect to achieve approximately 7,500 random-read I/O operations per second (IOPS), which would be a very welcome sight for most database servers. However, using the excellent disk performance measurement tool FIO, it was a disappointment to find that the random-read IOPS seen on both of our SSDs was closer to 3,400, regardless of the range of FIO options we used in an attempt to increase it.
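
For reference, a random-read test along the following lines gives a reasonable indication of what a disk can sustain. The exact job parameters behind the figures above are not reproduced here, so treat this as a minimal sketch: the mount point, file size and queue depth are illustrative values only.

  # 4K random-read test against the SSD persistent disk mounted at /data (path illustrative)
  sudo fio --name=randread-test \
      --filename=/data/fio-testfile --size=10G \
      --rw=randread --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=64 --numjobs=4 \
      --runtime=60 --time_based --group_reporting

The IOPS figure reported for the read job is the number to compare against the published per-disk expectations.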

 

Similar testing on a separate VM provisioned with a 375GB local SSD returned similarly disappointing results.

 

Network Configuration

 

The most important task is to configure the network, as this is essential for Greenplum to run correctly. By default, no user has external or internal SSH access to any server. Whilst generating SSH keys for each individual user (using PuTTYgen) and then applying them to each VM via the Google Compute Engine is relatively straightforward, this only allows SSH access to a VM from an external host.

 

Setting up SSH access between the VMs themselves, which for Greenplum is the more important aspect, is a relatively simple task. First, manually generate an RSA key on each server for each user with the command ssh-keygen -t rsa (the key pair will appear in ~/.ssh). Then share each generated public key between all servers (via SCP to and from an external host) and finally paste all of the public keys into the ~/.ssh/authorized_keys file on every server.
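
Put together, the process looks something like the sketch below. The hostnames and the gpadmin user are placeholders of ours rather than anything mandated by the platform.

  # On each server, as the database user, generate a key pair without a passphrase
  ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

  # From an external host: collect every server's public key, merge them,
  # then push the combined file back out as authorized_keys on every server
  for host in gp-master gp-seg1 gp-seg2; do
      scp gpadmin@$host:.ssh/id_rsa.pub ./$host.pub
  done
  cat ./*.pub > authorized_keys
  for host in gp-master gp-seg1 gp-seg2; do
      scp ./authorized_keys gpadmin@$host:.ssh/authorized_keys
      ssh gpadmin@$host "chmod 600 ~/.ssh/authorized_keys"
  done

Once passwordless SSH works in at least one direction, Greenplum’s own gpssh-exkeys utility (where available in your installation) can exchange keys between all hosts listed in a host file, which saves much of the manual copying.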

 

With this task complete and successful, not only is the most tedious part of the server setup out of the way, but it is also a relatively straightforward procedure from here on to get Greenplum up and running.

 

Greenplum Installation

 

With the network set up as required, all that remains is the system configuration and the Greenplum software installation, neither of which holds any real complication. Once these additional tasks were complete, it was a single, simple command to initialise the Greenplum database across the cluster.
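
For those unfamiliar with Greenplum, that single command is gpinitsystem, driven by a small configuration file. The values below are a minimal, illustrative sketch only: the directories, hostnames, ports and host file names are placeholders rather than the exact configuration used for this cluster.

  # Minimal, illustrative gpinitsystem_config
  ARRAY_NAME="GCE Test Cluster"
  SEG_PREFIX=gpseg
  PORT_BASE=40000
  declare -a DATA_DIRECTORY=(/ssd/primary /ssd/primary)   # two primary segments per host in this illustration
  MASTER_HOSTNAME=gp-master
  MASTER_DIRECTORY=/data/master
  MASTER_PORT=5432

  # hostfile_segments lists the segment hosts (gp-seg1, gp-seg2), one per line
  gpinitsystem -c gpinitsystem_config -h hostfile_segments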
With the database up and running and a 100GB TPC-H data set generated, it was possible to load and query the data without any issues.

 

TPC-H Testing

 

With the data loaded, a continuous loop of the 22 TPC-H queries was run against the data over several days. One thing we specifically looked for was the standard deviation in run times for each individual query, expressed as a percentage of its average. Impressively, this averaged 2% across all queries, with a maximum of 7%. From this we concluded that other tenants’ daily traffic did not noticeably interfere with our servers.
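
The harness for this is not reproduced in the post, but a minimal loop along the following lines is enough to collect per-query timings; the database name, query file layout and log format are illustrative assumptions.

  # Run the 22 TPC-H queries in a continuous loop, logging elapsed seconds per query
  while true; do
      for q in $(seq 1 22); do
          start=$(date +%s)
          psql -d tpch -f queries/q${q}.sql > /dev/null
          end=$(date +%s)
          echo "$(date -Iseconds),q${q},$((end - start))" >> tpch_timings.csv
      done
  done

The per-query mean and standard deviation can then be pulled out of tpch_timings.csv to give the kind of variation figures quoted above.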

 

Less impressively, however, the TPC-H tests once again showed that the Google Compute Engine wasn’t quite as performant as its specifications suggest: it returned an average time of 618 seconds per query, whilst an almost exact replica of the cluster (in terms of Greenplum setup and hardware) on a different cloud provider returned an average time of 374 seconds per query.

 

Overall

 

It is easy to say that the Google Cloud Platform is a flexible and reliable cloud computing service, with options that are more than capable of handling most tasks. Upgrading and scaling out your server(s) is essentially as quick as you can click, meaning you are never truly limited with regard to performance, and an ever-growing server can easily meet ever-growing needs.

However, for all of the benefits on offer, the advertised server performance appears rather more optimistic than the reality we observed. This may not be a problem for the smallest of requirements, but it could prove a real downside for larger ones. Scaling out your server(s) is of course an option, as mentioned above, but how far are you willing to go?

The post Google Cloud Platform Review: Head In The Clouds appeared first on VLDB Blog.

Teradata ANZ

How Can Analytics Help You Become a Happier Retail Shopper?

I shiver just at the thought of shopping for necessities! Yet I do it weekly, and I am not alone. Van Rompaey[1] pointed out that 20% of Belgian citizens consider shopping for food a chore, and according to the Daily Mail, a national survey of 2,500 customers revealed that 62% cannot get out of supermarkets quickly enough.

The main reasons given in the survey for this aversion to supermarket shopping were as follows:

  • 62% of customers – “Get me out of here!”
  • 38% of customers – the unruly children of other customers (presumably the childless portion of those surveyed)
  • 38% of customers – self-service machines
  • 36% of customers – trolleys blocking the aisles

On the self-service machines, they have my deepest sympathy and understanding! How many times have I been faced with:

“Please remove an unexpected item out of the bagging area”. This is your merchandise, what is so unexpected about it?


Here is the truth: “I do not want to be there and I do not want to be doing my grocery shopping”. Unfortunately, the household chores have to be done. What does analytics have to do with it all? Everything! It is the key to my spending, my loyalty and my sanity… Analyse me, help me!

From the supermarket’s point of view, the problem is simple: they want me to keep spending money in the shop, stay loyal, try new products and keep coming back. From a customer’s perspective, it is also simple: make it easy, convenient, relevant and private. Save me time and money, and take away an activity that I don’t want to do. The question is – can these two aims be met?

Analytics to the rescue!

My local supermarket “knows” what I eat, what time I shop, how often I do my grocery shopping, what pets I have, that I have children and where I live – in fact, it knows more about my life than my own mother does. If it does not, I am changing to a different shop.

Most of us have different, but very predictable habits in what we eat, how we shop and what we buy. Some people only shop for organic produce, others buy products which are on sale. This is not a new phenomenon. Stone [2] in 1954 made the first attempt at classifying consumers based on their attitude towards shopping.

However, with all that knowledge and over 60 years of research, I am yet to see the full power of analytics applied in a retail environment. I am not talking about targeted advertising, I want more than that. I want a predictive shopping experience.

Harvard law professor Cass Sunstein defined predictive shopping as “special programs in which you receive goods and services, and are asked to pay for them, before you have actually chosen them”. Call it “the return of the milkman”, backed up by science and technology.

Gone are the days in which a retailer can simply aim to keep a customer in store as long as possible so they will spend more. Welcome to the new dawn of keeping the customer out of the shop, but online where they will spend more and be happier.

[1] Van Rompaey, S. (1998), “Le commerce en ligne: un créneau à potentiel?”, Store Check, (December), 22-23.

[2] Stone, G.P. (1954), “City shoppers and urban identification: observation on the social psychology of city life”, American Journal of Sociology, 60 (July), 36-45.

The post How Can Analytics Help You Become a Happier Retail Shopper? appeared first on International Blog.


Revolution Analytics

Using Microsoft R Server on a single machine for experiments with 600 million taxi rides.

by Dmitry Pechyoni, Microsoft Data Scientist The New York City taxi dataset is one of the largest publicly available datasets. It has about 1.1 billion taxi rides in New York City. Previously this...

...
 

June 13, 2016


Revolution Analytics

R holds top ranking in KDnuggets software poll

The open-source R language is the most frequently used analytics / data science software, selected by 49% of the 2895 voters in the 2016 KDnuggets Software Poll. (R was also the top selection in last...

...
 

June 12, 2016


Simplified Analytics

9 steps to Successful Digital Transformation

Today nobody is bothered about whether you need to go for Digital Transformation, but when and how you are doing it. All new programs or initiatives should now take a ‘Digital First’ approach, with the...

...
 

June 10, 2016


Revolution Analytics

Because it's Friday: The DNA Journey

There is, and always has been, a lot of strife in the world related to how people perceive "nationality" and "other". This moving video suggests that perhaps there might be less strife if only we all...

...
Teradata ANZ

Cyber Security and the Art of Data Analytics

Who remembers the 1983 movie starring Matthew Broderick titled “WarGames”? This was one of the first movies that piqued my interest in computer security. In those days we used acoustic couplers to dial into a remote computer, and the fact that we could actually communicate with a remote PC was fascinating to me. Before long it was dial-up modems, attached via a serial cable to my PC, with which I could literally dial anywhere I wanted. And yes, such power at my fingertips meant I did perform some ethical hacking, partly as a way of learning the inner workings of computer security and partly because I thought I was Matthew Broderick too!

Fast forward 30 years and I am still involved in computer security, albeit from a different angle: firstly through my continuing research on computer security topics as part of my PhD program, and secondly through the application of data analytics capabilities to detect and counter threats.

But let’s take a step back first to understand today’s cyber security threat and what it means to you. The proliferation of device connectivity to the internet over the past 20 years has given rise to huge volumes of information that are accessible from one device to another. It’s a very simple concept: a device connects to another to exchange bits of data across a communications link. In its modern form, however, it is this ready availability of data that attracts the cyber criminals.


We heard last month from the US Department of Justice about the case charging several Chinese nationals identified as stealing trade secrets from US companies and feeding them back to Chinese corporations. This, however, is not your backyard group of ragtag coders. It is a sophisticated, state-backed group using techniques developed in-house. And their targets are not military missile silos like those depicted in WarGames; they are corporate organisations.

Their targets are patent designs and any other corporate information that can be used as an advantage, and they don’t discriminate on organisation size either. I recently spoke to the CEO of a funds management organisation based in Canberra that specialises in rural properties. I asked him what his organisation was doing with regard to the protection of corporate secrets, and his response was very sobering. His view was that they weren’t a target: “What do we have that would be of interest to them?” After I pointed out the value of any form of data to foreign organisations, he got the picture.

A survey by the Ponemon Institute about cyber-attacks highlighted the state of cyber readiness. In this report I note the following finding:

Less than half agreed that their organisation is vigilant in detecting attacks, and slightly fewer agreed that they were preventing attacks. I thoroughly recommend reading this report, as it highlights some fascinating insights into the state of the art of cyber-attack prevention. Download the report here.

And attacks may not come directly into your organisation either. On a local note here in Canberra, we witnessed the building design plans of the new ASIO HQ being accessed not through ASIO directly but via a third-party contractor. Access, then, comes in many forms, shapes and sizes. Protect your data, but also understand the data protection policies of your trading partners.

So, in the context of cyber-attacks on our society, how does data analytics play a role in defending against them? The obvious answer lies in the vast amounts of information we have at our fingertips, and in analysing that data to figure out what is happening. There are a number of key requirements that a data analytics system for combating cyber-attacks should have, and I have outlined a few below:

Speed – Obviously, the quicker we can analyse the data, the quicker we can detect the threat and put counter measures in place. But traditionally, data analytics has taken a historical view of the data. It was OK to send the data off somewhere to be processed and have a result come back a few hours later, but that’s not how we can handle cyber security data. We must now develop processes whereby we can collect, analyse and take action in a fraction of a second; any longer and the attack would be deemed successful. To do this we have to design environments that collect data instantly and process the data “in-flight”. Analytical functions therefore have to be performed at the point of capture, in real time.

Volume – Imagine if we had to walk around our house constantly monitoring every fence line to stop burglars coming over. As soon as we turn our backs, one could slip over in an instant. The same applies to the volume of data we need to keep watch over. Analytics plays a role in analysing web logs, firewall logs, change logs, application logs, packet information and user activity all in one place. Organisations need to centralise security information so that it can be analysed as a single entity and not in isolation. Miss one bit of information and, sure enough, the attack will come through that crack.

Convert to an intelligence-driven security model – Just as the hackers out there evolve quickly, so too must our security models adapt. As organisations, we are far too slow and rigid in our security approaches to adapt quickly to the multiple threats that appear every day. Therefore we must move towards an intelligence-driven security model. This approach relies on security-related information from both internal and external sources being combined to deliver a comprehensive picture of risk and security vulnerabilities. Current security models rely too much on detecting what’s already known and protecting the enterprise against those threats. An intelligence-driven security model will instead help us to detect the unknowns and predict the threats, so that we can strengthen our defences where the attacks are going to come from. Predictive analytics certainly has a role to play in this space, and Teradata leads the way with our Aster platform.

Know the unknowns and be more effective in protecting your organisation through the use of predictive analytics.

On a final note, I recommend you visit a news release from last year that highlighted the next big wave in partnerships on combating cyber-attacks. Teradata has formed a partnership with Novetta to develop next generation cyber security solutions. Combining the benefits of proven Teradata technology with Novetta advanced cyber security solutions is a no brainer. Especially when you consider that if the US military can trust Novetta for their cyber security needs, then surely you can too!

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

The post Cyber Security and the Art of Data Analytics appeared first on International Blog.

Teradata ANZ

What is a “Data Lake” Anyway?

One of the consequences of the hype and exaggeration that surrounds Big Data is that the labels and definitions that we use to describe the field quickly become overloaded. One of the Big Data concepts that presently we risk over-loading to the point of complete abstraction is the “Data Lake”.

Data Lake discussions are everywhere right now; to read some of these commentaries, the Data Lake is almost the prototypical use-case for the Hadoop technology stack. But there are far fewer actual, reference-able Data Lake implementations than there are Hadoop deployments – and even less documented best-practice that will tell you how you might actually go about building one.

So if the Data Lake is more architectural concept than physical reality in most organisations today, now seems like a good time to ask: What is a Data Lake anyway? What do we want it to be? And what do we want it not to be?

When you cut through the hype, most proponents of the Data Lake concept are promoting three big ideas:

1) It should capture all data in a centralized, Hadoop-based repository (whatever all means)

2) It stores the data in a raw, un-modelled format

3) And that doing so will enable you to break down the barriers that still inhibit end-to-end, cross-functional Analytics in too many organisations

Now those are lofty and worthwhile ambitions, but at this point many of you could be forgiven a certain sense of déjà vu – because improving data accessibility and integration are what many of you thought you were building the Data Warehouse for.

In fact, many production Hadoop applications are built according to an application-specific design pattern, rather than an application-neutral one that allows multiple applications to be brought to a single copy of the data (in technical jargon, the former is a “star schema” design pattern). And whilst there is a legitimate place in most organisations for at least some application-specific data stores, far from breaking down barriers to Enterprise-wide Analytics, many of these solutions risk creating a new generation of data silos.

A few short years after starting their Hadoop journey, a leading Teradata customer has already deployed more than twenty sizeable application-specific Hadoop clusters. That is not a sustainable trajectory, and we’ve seen this movie before. In the 90s and the 00s, many organisations deployed multiple data mart solutions that ultimately had to be consolidated into data warehouses. These consolidation projects cut costs and added value, but they also sucked up resources – human, financial and organisational – which delayed the delivery of net new Analytic applications. That same scenario is likely to play out for the organisations deploying tens of Hadoop clusters today.

We can do better than this – and we should be much more ambitious for the Hadoop technology stack, too.

Put simply, we need to decide whether we are building data lakes to try and deliver existing functionality more cost effectively or whether we are building them to deliver net new analytics and insights for our organisations. While, of course, we should always try to optimise the cost of processing information in our organisations, the bigger prize is to leapfrog the competition.

I recently attended a big data conference where a leading European bank was discussing the applications it had built on its Hadoop-based data lake. Whilst some of these applications were clearly interesting and adding value to the organisation, I was left with the clear impression that they could easily have been delivered from infrastructure and solutions that the bank had already deployed.

The kind of advanced analytics that Hadoop and related technologies make possible are already enabling some leading banks to address some of the well-publicised difficulties that large European banks find themselves faced with. Text analytics can help the bank understand which customers are complaining and what they are complaining about. Graph analytics can pinpoint fraudulent trading patterns and collusion between rogue traders. Path analytics can highlight whether employees are correctly complying with regulatory processes. So I have to conclude that this organisation’s use of the technology to re-invent the wheel was a wasted opportunity.

 

The Data Lake isn’t quite yet the prototypical use-case for Hadoop that some of the hype would have you believe. But it will be. Application-specific star schema-based data stores be damned; this is what Google and Doug Cutting gave us Hadoop for.

This post first appeared on TeradataVoice on Forbes on 11 Dec 2014.

The post What is a “Data Lake” Anyway? appeared first on International Blog.


Revolution Analytics

R Consortium and User! 2016 News

by Joseph Rickert IBM Joins the R Consortium This past Monday at the Spark Summit in San Francisco IBM announced that it had joined the R Consortium as a "Platinum" member. This is very good news...

...

Revolution Analytics

Interactive maps and charts in R

Randy George, an expert in web map applications, has been fascinated with computer graphics (especially maps) since the early '80s. For much of that time, he says, the technology for mapping has been...

...

BrightPlanet

Website Upgrade – Same Data Harvesting Technology, New Updated Look

We’ve launched a completely new website. It’s been four years since our last website redesign, and we continue to make improvements to our data harvesting technology so it was time to make sure the website reflected those improvements. We’ve freshened up the pages and also updated our services to make them more understandable for our readers […] The post Website Upgrade – Same Data Harvesting Technology, New Updated Look appeared first on BrightPlanet.

Read more »
Ronald van Loon

Learn the Art of Data Science in Five Steps


The field of data science is one of the youngest and most exciting fields in the technology sector. In no other industry or field can you combine statistics, data analysis, research, and marketing to do jobs that help businesses make the digital transformation and come to full digital maturity.

In today’s business world, companies can no longer afford to look at their websites and social media presences as add-ons or afterthoughts. The success of a company’s technological operations is as crucial to its overall success as that of any other department. To achieve that kind of success online, businesses must embrace big data and analytics. They must learn about their customers’ behaviour online, and specifically on their own sites, and they must use data to drive their marketing and production strategies.

Data Science is dedicated to analyzing large data sets, showing trends in customer and market behaviours, predicting future trends, and finding algorithms to help improve the customer experience and increase sales for the future. To learn and dive into this intriguing new field, you’ll really only need to take five simple steps…

1. Get Passionate About Big Data

As you set out to learn this field and become a data scientist, you’ll find that entering an emerging field is both exciting and overwhelming. As you come across challenges and obstacles, and as the definition of data science seems to change depending on whom you’re reading or listening to, the whole endeavour can seem intimidating enough to want to quit. If you’re passionate about data and what you can do with it, though, you’ll have motivation to keep going.

2. Embrace Hands-On Learning

Some people will tell you to begin by learning about the fundamentals of machine learning, neural networking, and image recognition. All of this is fascinating, but if you start here, you’ll never get where you want to be. Instead, start simple. Begin by learning how to clean data to make it legible to the computers and programs you’ll be working with.

As you do this, focus on getting to know a few algorithms really well instead of trying to gain cursory familiarity with a dictionary’s worth of them. Besides, when you need an algorithm that you don’t know, you’ll most often be able to fetch it from a library. You do not have to have an encyclopaedic knowledge of them.  

              

3. Learn the Language and Communicate Your Insights

A large part of your job is going to be telling others (in layman’s terms) what your data means for them. You can practice this by talking to less tech-savvy friends, starting a blog, and finding any platform available to discuss your findings. The more you practice, the more you’ll be able to express yourself in ways that people outside the field can understand.

4. Learn From People in the Industry

Go online and find forums and meet-ups where you can network with others working in data science. If a university near you offers a data science program, find out when it is hosting a lecture, panel or conference in the field, and go get to know some people who are learning data science or who already work in the industry. You can also learn a lot from your data science peers by following industry thought leaders’ blogs.

5. Keep Challenging Yourself

Finally, as you learn the ropes of data collection and analysis, don’t stop challenging yourself. If your “homework” is getting too easy, continue to increase the difficulty until it feels challenging again. You can do this by working with larger data sets, tweaking algorithms to make them more efficient, scaling algorithms to work with multiple processors, learning the theories behind new algorithms, and more. The further you take your studies, the more options you’ll have to create new challenges for yourself.

Do you have experience in learning data science? What methods have worked best for you as you’ve delved into this exciting new field?

Connect with the authors
 

Wouter Weijens
Helps analysts and data scientists use their talent to the max.
Want to become the next Digital Hero? Start here!

Ronald van Loon
Helps data-driven companies generate success.
Want to learn more about Big Data? Follow Ronald on Twitter.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers for predictive analytics by DataConomy, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now a director at Adversitement, a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Learn the Art of Data Science in Five Steps appeared first on Ronald van Loons.
