Planet Big Data logo

Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


December 03, 2016

Simplified Analytics

Digital Transformation helping to reduce patient readmissions

Digital Transformation is helping all the corners of life and healthcare is no exception. Patients when discharged from the hospital are given verbal and written instructions regarding their...


December 02, 2016

Revolution Analytics

Because it's Friday: Border flyover

President-elect Trump has famously pledged to build a wall along the US-Mexico border, but what would such a wall actually look like? This short film directed by Josh Begley follows the path of the...


Revolution Analytics

Stylometry: Identifying authors of texts using R

Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which...

Data Digest

How the Evolution of Learning & Development Transformed the Role of Chief Learning Officers (CLOs)

After months of research with over 50 Chief Learning Officers (CLOs) and de facto Learning & Development (L&D) executives in preparation for Corinium’s Chief Learning Officer Forum USA, I found that the fundamental principles of L&D have not changed substantially since their implementation nearly 20 years ago. However, these exceptional leaders are constantly looking at ways to innovate and keep ahead of the curve, developing more creative, thorough and inclusive initiatives to fulfil those traditional ideas.

It is evident that corporations still seek to champion and support the professional development of their workforce in order to maintain their predominant position within a specific market. Nonetheless, highly competitive environments have propelled substantial changes in the fundamental tasks that every CLO faces at a modern corporation. For instance, corporations now expect these executives to: (1) effectively develop leadership programmes focussed on business delivery; (2) facilitate the transition from legacy methods to modern technologies to further promote employee training, engagement and retention; and (3) overcome the generational gap when implementing new tools and L&D programmes.

Key Principles of Learning & Development 

In order to understand how these new challenges have transformed the CLO role, let us look at two main and contemporary principles of any L&D initiative: diversity and inclusion. Due to some old cultural traditions, a few old-fashioned corporate values were determined by underlying notions of segregation and discrimination. These old social institutions, which shaped L&D programmes in the past, have gradually given way to more contemporary ideas of “people’s development” and “workforce inclusion”. The direct task of transforming those notions falls under the CLO’s responsibility. These experts understand that there is a direct correlation between better professional performance and the implementation of programmes that recognise and promote diversity and inclusion.

Certainly, the key is understanding that diversity is not a concept applicable only to race and gender; it also encompasses age, sexual orientation, faith, social background and language, amongst many other things. Likewise, inclusion does not refer only to minorities and/or outsiders integrating with established social groups, but also to fundamental ideas of recognition and respect.

The key is understanding that diversity is not only a concept applicable to race and gender; it also encompasses age, sexual orientation, faith, social background, language, amongst many other things.

The importance of the right technological solution

It should not be a surprise to discover that these corporations, which will join us next March in New York City, have embraced these two concepts thoroughly and have applied their principles as core values of their L&D programmes, promoted by the CLOs. The vast majority have publicly committed to improving the presence of minority groups within the leadership organisational structure, closing the gender gap amongst the executive office and promoting values of respect for difference amongst their local and international workforce. Similarly, they have acknowledged the vital role that generational exchange plays in furthering leadership initiatives, and the importance of openly discussing topics related to race, gender, background, faith and others.

However, all these initiatives have been supported by the right technological solutions to enable and grant access to these programmes for a wider audience. Whether technology offers faster, easier and cheaper access to information or promotes more contemporary principles that support inclusion and diversity, it has crucially transformed the way corporations interact with their employees.

Those L&D programmes that were once designed to target a select group of individuals have now become accessible to anyone. The wider opportunity to access information anywhere and anytime has democratised the way people learn and develop their personal and professional hard and soft skills. Nowadays the training, readiness and reskilling of personnel is carried out without segregation or segmentation, thanks to inclusive corporate policies that promote contemporary values of respect and social recognition, and technological platforms that allow anyone to access L&D programmes at any time.

Clearly, none of those initiatives or programmes could be considered perfect, nor the complete answer to the demands raised by global corporations. Nonetheless, they show us how CLOs in the US have embraced these concepts and tried to construct a more solid policy of inclusion for a vast and diverse workforce within flexible organisational structures.

Understanding the significance of these concerns and the importance of public initiatives for debating these key aspects and many others, Corinium Intelligence has proudly created the Chief Learning Officers Forum USA, taking place March 7-8, 2017 in New York City, for all CLOs and L&D experts in the US. The event promises to be the place in which more than 100 L&D experts will gather to discuss these and other issues regarding the challenges they face daily and the clever solutions they have produced to implement those diversity and inclusion values. Please get in contact and let us welcome you at the Convene 101 Park Avenue next March!

By Alejandro Becerra:

Alejandro Becerra is the Content Director for LATAM/USA for the CDAO and CCO Forum. Alejandro is an experienced Social Scientist who enjoys exploring and debating with senior executives about the opportunities and key challenges for enterprise data leadership, to create interactive discussion-led platforms to bring people together to address those issues and more. For enquiries email:
Data Digest

JUST RELEASED: Chief Data Scientist, USA - Speaker Presentations | #CDSUSA

On 16th - 17th November 2016, Corinium launched the inaugural Chief Data Scientist, USA; the premier conference for high-level data science practitioners to get a detailed roadmap for developing the leadership role for data science. The event set out to assist anyone looking to fully exploit the data science capability within their organization.

Using an interactive format, the forum brought together over 100 senior-level data science peers to share their latest innovations, best practices, challenges and use cases, as well as facilitate conversations and connections. Alongside keynote presentations from our senior speaker line-up, our informal discussion groups, in-depth masterclasses and networking sessions provided the opportunity to take away new ideas and information to deliver real benefits to attendees' companies.


December 01, 2016

Revolution Analytics

Using R to Gain Insights into the Emotional Journeys in War and Peace

by Wee Hyong Tok, Senior Data Scientist Manager at Microsoft How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials...

Silicon Valley Data Science

Big Data is About Agility

As a buzzword, the phrase “big data” summons many things to mind, but to understand its real potential, look to the businesses creating the technology. Google, Facebook, Microsoft, and Yahoo are driven by very large customer bases, a focus on experimentation, and a need to put data science into production. They need the ability to be agile, while still handling diverse and sizable data volumes.

The resulting set of technologies, centered around cloud and big data, has brought us a set of capabilities that can equip any business with the same flexibility, which is why the real benefit of big data is agility.

We can break this agility down into three categories, spread over purchasing and resource acquisition, architectural factors, and development.

Buying agility

Linear scale-out cost. A significant advantage of big data technologies such as Hadoop is that they are scalable. That is, when you add more data, the extra cost of compute and storage is approximately linear with the increase in capacity.

Why is this a big deal? Architectures that don’t have this capability will max out at a certain capacity, beyond which costs get prohibitive. For example, NetApp found that in order to implement telemetry and performance monitoring on their products, they needed to move to Hadoop and Cassandra, because their existing Oracle investment would have been too expensive to scale with the demand.

This scalability means that you can start small, but you won’t have to change the platform when you grow.
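The linear-cost claim is easy to see with a toy model; all prices below are made-up placeholders for illustration, not real vendor figures:

```python
# Toy cost model contrasting linear scale-out with superlinear scale-up.
# Prices are hypothetical placeholders, not vendor quotes.

def scale_out_cost(terabytes, cost_per_tb=200.0):
    """Linear: each extra TB adds roughly the same cost (commodity nodes)."""
    return terabytes * cost_per_tb

def scale_up_cost(terabytes, base=200.0, exponent=1.6):
    """Superlinear: ever-bigger single systems get disproportionately expensive."""
    return base * terabytes ** exponent

for tb in (1, 10, 100):
    print(tb, scale_out_cost(tb), round(scale_up_cost(tb)))
```

At small volumes the two curves are close; at 100 TB the superlinear platform in this sketch costs more than ten times the scale-out one, which is the "costs get prohibitive" effect described above.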

Opex vs capex. Many big data and data science applications use cloud services, which offer a different cost profile to owning dedicated hardware. Rather than getting lumbered with a large capital investment, using the cloud makes compute resources an operational cost. This opens up new flexibility. Many tasks, such as large, periodic Extract-Transform-Load (ETL) processes, just don't require compute power 24/7, so why pay for it? Additionally, data scientists now have the ability to leverage the elasticity of cloud resources: perhaps following up a hypothesis needs 1,000 compute nodes, but just for a day. That was never possible before the cloud without a huge investment: certainly not one anybody would have made for one experiment.
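The 1,000-nodes-for-a-day arithmetic is worth spelling out; the hourly rate and hardware price below are invented placeholders, so plug in your own provider's numbers:

```python
# Back-of-envelope: bursting 1,000 cloud nodes for one day (opex)
# versus buying the same fleet outright (capex). Rates are hypothetical.

NODES = 1_000
HOURS = 24
HOURLY_RATE = 0.10            # assumed $/node-hour on-demand price

burst_cost = NODES * HOURS * HOURLY_RATE      # one-off experiment
print(f"one-day burst: ${burst_cost:,.0f}")   # $2,400

CAPEX_PER_NODE = 3_000.0      # assumed purchase price per node
capex = NODES * CAPEX_PER_NODE
print(f"owning the fleet: ${capex:,.0f}")     # $3,000,000
```

Under these assumptions the experiment costs thousands, not millions, which is why elasticity changes what experiments are even considered.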

Ease of purchase. A little while ago I was speaking to a CIO of a US city, and we were discussing his use of Amazon’s cloud data warehouse, Redshift. Curious, I inquired which technical capability had attracted him. It wasn’t a technical reason: it turned out he could unblock a project he had by using cloud services, rather than wait three months for a cumbersome purchase process from his existing database company.

And it’s not just the ability to use cloud services that affects purchase either: most big data platforms are open source. This means you can get on immediately with prototyping and implementation, and make purchase decisions further down the line when you’re ready for production support.

Architectural agility

Schema on read. Hadoop turned the traditional way of using analytic databases on its head. When compute and storage are at a premium, the traditional Extract-Transform-Load way of importing data made sense. You optimized the data for its application—applied a schema—before importing it. The downside there is that you are stuck with those schema decisions, which are expensive to change.

The plentiful compute and storage characteristics of scale-out big data technology changed the game. Now you can pursue Extract-Load-Transform strategies, sometimes called “schema on read.” In other words, store data in its raw form, and optimize it for use just before it’s needed. This means that you’re not stuck with one set of schema decisions forever, and it’s easier to serve multiple applications with the same data set. It enables a more agile approach, where data can be refined iteratively for the purpose at hand.
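As a minimal sketch of what schema-on-read looks like in practice (the field names and records are invented for illustration): raw records land untyped, and each application projects and types only the fields it needs at read time.

```python
# Schema-on-read sketch: land records raw; projection and typing happen
# at query time, not at ingestion. Records here are illustrative.
import json
from datetime import datetime

raw_landing = [  # ingested as-is, no upfront schema decisions
    '{"user": "a1", "ts": "2016-12-01T10:00:00", "amount": "19.99", "extra": 1}',
    '{"user": "b2", "ts": "2016-12-01T11:30:00", "amount": "5.00"}',
]

def read_with_schema(lines):
    """Apply the schema *this* application needs, at read time."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec["user"],
            "ts": datetime.fromisoformat(rec["ts"]),
            "amount": float(rec["amount"]),  # typed only now
        }

rows = list(read_with_schema(raw_landing))
print(round(rows[0]["amount"] + rows[1]["amount"], 2))  # 24.99
```

A second application could read the same raw landing zone with a completely different projection, which is the "serve multiple applications with the same data set" point above.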

Rapid deployment. The emergence of highly distributed commodity software has also necessitated the creation of tools to deploy software to many nodes at once. Pioneered by early web companies such as Flickr, the DevOps movement has ensured that we have technologies to safely bring new versions of software into service many times a day, should we wish. No longer do we have to place a bet three months into the future with software releases; new models and ways of processing data can be introduced—and backed out—in a very flexible manner.

Faithful development environments. One vexing aspect of development, exacerbated by deploying to large server clusters, is the disparity between the production environment software runs in, and the environment in which a developer or data scientist creates it. It's a source of continuing deployment risk. Advances in container and virtualization technologies mean that it's now much easier for developers to use a faithful copy of the production environment, reducing bugs and deployment friction. Additionally, technologies such as notebooks make it easier for data scientists to operate on an entire data set, rather than just a subset that will fit on their laptop.

Developer agility

Fun. Human factors matter a lot. Arcane or cumbersome programming models take the fun out of developing. Who enjoys SQL queries that extend to over 300 lines? Or waiting an hour for a computation to return? One of the key advantages of the Spark analytical project is that it is an enjoyable environment to use. Its predecessor, Hadoop’s MapReduce, was a lot more tedious to use, despite the advances that it brought. The best developers gravitate to the best tools.

Concision. As big data technologies advance, the amount of code required to implement an algorithm has shrunk. Early big data programs needed a lot of boilerplate code, and their structures obscured the key transformations that the program implemented. Concise programming environments mean code is faster to write, easier to reason about, and easier to collaborate over.
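To make the concision point concrete, here is the same word count written twice, once in MapReduce-flavoured boilerplate with explicit map and reduce phases, and once as a single expression; the input is a toy example:

```python
# The same word count, boilerplate style vs. concise style.
from collections import Counter

lines = ["big data is about agility", "big data is big"]

# MapReduce-flavoured boilerplate: explicit map phase, explicit reduce phase.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one

# Concise equivalent: one expression, the transformation is immediately visible.
concise = Counter(word for line in lines for word in line.split())

print(counts["big"], concise["big"])  # 3 3
```

Both produce identical results; the difference is how much of the code is the algorithm versus scaffolding around it.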

Easier to test. When code moves from targeting a single machine to a scaled computing environment, testing becomes difficult. The focus on testing that has come from the last decade of agile software engineering is now catching up to big data, and Spark in particular incorporates testing capabilities. Better testing is vital as data science finds its way as part of production systems, not just as standalone analyses. Tests enable developers to move with the confidence that changes aren’t breaking things.
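One common pattern behind that testability, sketched here with an invented sessionization transformation, is to keep the core logic a pure function so it can be asserted against locally before it ever touches a cluster:

```python
# Testable-by-design: the transformation is a pure function over plain data,
# so the same logic can be unit-tested without any cluster. The function and
# its parameters are illustrative, not from any particular framework.

def sessionize(events, gap_seconds=1800):
    """Group time-ordered (user, timestamp) events into sessions,
    splitting when the user changes or the gap exceeds gap_seconds."""
    sessions, current = [], []
    last_ts = None
    for user, ts in events:
        if current and (ts - last_ts > gap_seconds or user != current[-1][0]):
            sessions.append(current)
            current = []
        current.append((user, ts))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

# A unit test that needs no cluster at all:
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 4100)]
assert len(sessionize(events)) == 3
```

In a Spark job the same function body would sit inside the distributed transformation, with the local assertions acting as the safety net for changes.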


Any technology is only as good as the way in which you use it. Successfully adopting big data isn’t just about large volumes of data, but also about learning from its heritage—those companies which are themselves data-driven.

The legacy of big data technologies is an unprecedented business agility: for creating value with data, managing costs, lowering risk, and being able to move quickly into new opportunities.

Editor’s note: Our CTO John Akred will be at Strata + Hadoop World Singapore next week, talking about the business opportunities inherent in the latest technologies, such as Spark, Docker, and Jupyter Notebooks.

The post Big Data is About Agility appeared first on Silicon Valley Data Science.

Data Digest

What the C-Suite Thought Leaders in Big Data & Analytics are saying about Corinium [VIDEO]

Corinium events are different. Our extensive experience has led us to develop a new format that gives you the chance to not only hear from the leading minds in the industry, but to contribute to the critical thinking being shared.

"We all learn differently...and this format [discussion groups] lets us learn by participating. And that is the thing we will remember. It's personal."

We connect C-Suite executives in the Data, Analytics and Digital space and focus them into Discussion Groups where genuine progress can be made. Each group is dedicated to the topic of greatest importance to you and includes the most knowledgeable people on the subject.

After our expert Co-Chairs initiate the discussion, you are invited to offer your own experiences or questions, sparking a free-flowing and open conversation in which over 65% of the participants typically contribute.

CEP America, Georgia Tech & McGraw Hill Education explain how Corinium Discussion Groups give senior-level executives a rare opportunity to participate with one another in such a personal format. Barriers are discussed and decisions are made on how to move forward and, more importantly, how to solve challenges together.

SAP, PwC, TimeXtender & Caserta Concepts are discussion group facilitators. PwC and SAP lead discussion groups all around the world at Corinium C-Suite events, and continue to do so because of the value of these interactions.

What our participants had to say:

"We all learn differently...and this format [discussion groups] lets us learn by participating. And that is the thing we will remember. It's personal."
Dipti Patel, Chief Data and Analytics Officer, CEP America

"It really takes the coming together of people like these [attendees of the Chief Analytics Officer Fall] to tell us: this is the problem we need to come together to solve."
John Sullivan, Director of Innovation, North America Center of Excellence, SAP

"I think the interactions here are much better than...the other conferences I have ever attended."
Oliver Halter, Partner, PwC

"The people at the Corinium events are the decision-makers and are the people that push the industry forward. If that benefits the industry, great. If that benefits us too, also great."
Romi Mahajan, Chief Commercial Officer, TimeXtender

To learn more, visit    
Data Digest

CCO Forum: What Customer Experience Leaders were talking about

What an amazing event these last two days at the Chief Customer Officer Sydney. I feel very privileged to have spent so much time speaking with some of the best minds in Customer Experience.

I noticed during my conversations that there are some common themes for what is expected for the future, what is working in the present and what is holding back progress.

1. Lead the pack

Mark Reinke of Suncorp set up his presentation by talking about the extinction of the mammoth: as a child you are taught about this and assume it was quite a sudden extinction. In reality, it happened very slowly, as the mammoth did not adapt to the changing environment.

There seemed to be some shared sentiment that it was not only important to adapt to avoid extinction, but that the first mover advantage was significant when it comes to customer experience opportunities.

2. Alignment with IT is a priority

The jealousy was almost unanimous upon finding out that Richard Burns of Aussie leads both the CX and IT teams, helping to increase the velocity of CX change projects. This brought into view an issue that many at the forum were struggling with: IT alignment.

I heard in many conversations that delays caused when the enterprise infrastructure cannot support the next phase of CX transformation were top of the agenda to avoid.

To many, the solution is ensuring that IT teams are included in forward planning and that contingencies are in place to resource projects that take advantage of a "moment in time opportunity" as one attendee put it, without having to compromise on projects essential to the "Roadmap".

Whilst there are certainly some strategies to solve this problem, internal agility and alignment continue to be a challenge for most companies.

3. Agility and expertise of partners/vendors

When it comes to transformation and digitisation, many companies do not have the skills in-house to be able to ensure the success of an implementation. There is an expectation that the Vendor they end up choosing will bring this to the table along with the product. The feeling was that the initial success and speed of delivery came down to choosing a partner that was able to provide this at a high level.

Whilst there were certainly examples where this was not the case, I heard some amazing feedback on this front about some of the Vendors in the room.

As well as the expertise, the ability to change course and assist quickly with one of those "moment in time" opportunities was what created the stickiness in some of their closest relationships.

4. Using Data to inform transformation and product development opportunities

This was perhaps the most interesting theme to me as a self-professed data enthusiast: whilst many companies are still struggling to harness their own internal data to understand customer behaviour, many are looking to use both internal and external data sources to understand what their customers will need predictively rather than historically.

The ability to use these insights to create better products and provide better service seems to be the goal for many.

In many cases, however, attendees were suggesting that (back to point 2) their IT infrastructure was not ready to support this and would be a few years away, unless they could find a cost-effective solution to skip a couple of steps.

These certainly were not all of the themes, but a good representation of the conversations I had across the two days.

Thank you to everyone who shared so willingly. It was an amazing learning experience and I can't wait to do it all again in Melbourne!

Thank you especially to our wonderful Sponsors and Presenters who supported the event and the attendees with their insights, knowledge and solutions. The feedback on how you assisted and educated across the event was spectacular.

By Ben Shipley: 

Ben Shipley is the Partnerships Director at Corinium Global Intelligence. You may reach him at   
Teradata ANZ

Turbo Charge Enterprise Analytics with Big Data

We have been showing off the amazing artworks drawn from numerous big data insight engagements we’ve had with Teradata, Aster and Hadoop clients. Most of these were new insights answering business questions never before considered.

While these engagements have demonstrated the power of the new analytics enabled by big data, they continue to have limited penetration into the wider enterprise analytics community. I have observed significant investment in big data training and the hiring of fresh data science talent, but the value of the new analytics remains a boutique capability, not yet leveraged across the enterprise.

Perhaps we need to take a different tack. Instead of changing the analytical culture to embrace big data, why not embed big data into existing analytical processes? Change the culture from within.

How exactly do you do that? Focus big data analytics on adding new data points for analytics, then make these data points available using the enterprise's current data deployment and access processes. Feed the existing machinery with the new data points to turbo-charge the existing analytical insight and execution processes.

A good starting area is the organisation's customer events library. This library is a database of customer behaviour-change indicators that provide a trigger for action and context for a marketing intervention. For banks, this would be significant deposit events (e.g. three standard deviations above the last five months' average deposit); for telcos, significant dropped-call events. Most organisations would have a version of this in place, with dozens of these pre-defined data intervention points together with customer demographics. These data points support over 80% of the actionable analytics currently performed to drive product development and customer marketing interventions.
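A deposit trigger of that shape could be sketched as follows; the function name, figures and threshold are illustrative, not a production scoring rule:

```python
# Hedged sketch of one events-library trigger: flag a deposit that sits more
# than three standard deviations above the customer's recent monthly average.
from statistics import mean, stdev

def significant_deposit(history, deposit, n_sigma=3.0):
    """history: the customer's recent monthly deposit totals."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and deposit > mu + n_sigma * sigma

last_5_months = [2100.0, 1950.0, 2200.0, 2050.0, 2000.0]
print(significant_deposit(last_5_months, 2150.0))  # False: within normal range
print(significant_deposit(last_5_months, 9000.0))  # True: raise an event
```

Each True result would land in the events library as a row ready for the existing marketing-intervention machinery to consume.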

What new data points can be added? Life events that provide context to the customer's product behaviour remain a significant blind spot for most organisations, e.g. closing a home loan due to divorce, or refinancing to a bigger house because of a new baby. The Australian Institute of Family Studies has identified a number of these life events.

Big data analytics applied to combined traditional, digital and social data sources can produce customer profile scores that become data points for the analytical community to consume. The scores can be recalculated periodically, and the changes themselves become events. With these initiatives, you have embedded big data into your existing enterprise analytical processes and moved closer to the deeper understanding needed for pro-active customer experience management.

We have had success with our clients in building some of these data points. Are you interested?

The post Turbo Charge Enterprise Analytics with Big Data appeared first on International Blog.


November 30, 2016

Data Digest

How to Put Data at the Heart of Everything We Do

Ahead of Corinium’s CDAO Sydney event, we spoke with keynote speaker Darren Abbruzzese, General Manager, Technology Data, at ANZ to find out more about their group data transformation program. We also gauged his views on data as a tool for competitive advantage, its importance in the financial services industry, and the role of the Chief Data Officer, including its longevity and evolution.

Meet Darren and our amazing line up of over 60 speakers at the Chief Data and Analytics Officer Forum, Sydney, taking place 6-8 March 2017.

For more information visit: 

Corinium: You will be speaking on “data at the heart of everything we do” – how have ANZ’s data needs changed in the last 18 months?

Darren Abbruzzese: Providing customers with a fantastic experience is absolutely critical in developing and maintaining deep and engaging relationships. The minimum expectation around what is a ‘great customer experience’ is being continually lifted by interactions our customers have every day with companies like Uber, Facebook, Google and the like. These companies are really raising the bar on what a great digital experience is all about. Customers then come to their bank and expect a similar level of experience. We are embracing that challenge and working to deliver a standout customer experience, both digital and through our branches. We can only achieve that if we make the best use of our data. A great customer experience, one that is tailored to the individual and their specific needs, can only be successful if we use the data we share between us and the customer to its full extent.  This helps us create a unique and engaging experience for our customers, and it will lead to a more engaging relationship than they’ve had in the past.

The minimum expectation around what is a ‘great customer experience’ is being continually lifted by interactions our customers have every day with companies like Uber, Facebook, Google and the like.

Corinium: Digital channels must play a huge part in that. What are your top technology investments in the coming year?

Darren Abbruzzese: Big data and fast data are major priorities. The amount of data we produce as a bank is exploding and we need to ensure we’ve got the tools to harness and make use of that data, which is where our Big Data capability comes in. But having a scaled infrastructure and Hadoop capability is not in itself enough. As a customer, I want real-time information and I want it relevant for my specific interaction. This is where fast data comes in. Moving away from earlier batch-based patterns and towards real-time capture and exposure of data into our internal and customer-facing channels is a key pillar to our strategy of developing a digital bank.

Corinium: How will ANZ’s group data transformation program cater to the diverse enterprise needs?

Darren Abbruzzese: As a large organisation servicing millions of customers across a multitude of countries and segments, we certainly have a diverse set of needs when it comes to data. Trying to deliver to the needs of the organisation via individual solutions won’t get us very far. Instead, our approach is to shape our delivery around solving common bank-wide problems, and being lean in our approach so we can learn, react and move at pace. One of those common problems is data sourcing: collecting millions of data points from hundreds of platforms on a daily basis and consolidating them into joined-up, usable models. If we solve that common problem we will make consumption of data via reporting easier and faster. Having a view about a medium-term architecture is also critical. While the technology in the data landscape is moving fast, the capability we need to help deliver our strategy is clear. Building common assets in line with that architecture will help us move at pace and solve for individual business or project needs.

Corinium: What are the leadership challenges educating and synchronising a global data team?

Darren Abbruzzese: It really comes down to being clear on our purpose and ensuring our staff understands what we are trying to achieve and why. If everyone is clear on what will make us successful, and how as a team we will add value to the organisation, then day-to-day decision making will be faster and more aligned to strategy. That’s easier said than done though and requires a lot of work.  Putting our purpose in a PowerPoint pack and emailing it around isn’t going to cut it.  We need to continually reinforce our purpose through concrete measures such as our organisational structure, operating model, architecture, delivery processes and KPIs. We can talk about purpose, but if we embed it in the way we work then it will become part of the culture. 

Corinium: What’s your opinion on the view that the CDO role is just a flash in the pan job title that will eventually become merged or lost amongst the web of new C-suite titles?

Darren Abbruzzese: Banks traditionally have seen their deposits and loans as strategic assets, and also their customers, staff and technology.  In each of these cases there’s been solid management structures established in recognition of the importance of these assets to the future of the organisation.  So we have a bunch of C-suite roles to lead these functions. Data has emerged as a core, strategic asset of any organisation that needs to be curated, managed, protected and leveraged just like any other strategic asset. Data needs a clear organisational strategy – what will we use data for and how will it help make us successful? It needs to be managed and protected. Whether it needs a Chief Data Officer really depends on each organisation and how they operate. That might be a role absolutely critical in raising the profile and importance of data, or it might be something the CEO themselves will define and drive. In other cases, generally where data is already more maturely managed, it has already been well-embedded into the organisation on many levels.

Data has emerged as a core, strategic asset of any organisation that needs to be curated, managed, protected and leveraged just like any other strategic asset. 

Corinium: What do you consider to be the key building blocks to establishing an effective data governance framework?

Darren Abbruzzese: Two key pillars: the first is the age-old problem of “garbage in / garbage out”. Start with the key data elements for your organisation and put in place processes to govern the collection, checking and cleaning of data right from the source and throughout its lifecycle. Someone needs to be accountable for this process, or it won’t happen, or it will happen for a few months and then fall away. The second pillar is ensuring you have really strong information security and user access management in place. Nothing will destroy the credibility of your data program more, and perhaps even your organisation itself, than suffering a data breach via internal or external means. It's an important asset, so protect it.
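That "govern at the source" idea can be sketched as a simple ingestion gate; the field names and rules below are hypothetical, not ANZ's actual checks:

```python
# Minimal data-quality gate: validate records at ingestion and quarantine
# failures, rather than letting them pollute downstream information assets.
# Field names and validation rules are hypothetical examples.

RULES = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "balance": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def ingest(records):
    """Split incoming records into accepted rows and quarantined failures."""
    accepted, quarantined = [], []
    for rec in records:
        failures = [f for f, ok in RULES.items() if not ok(rec.get(f))]
        if failures:
            quarantined.append((rec, failures))  # reviewable, with reasons
        else:
            accepted.append(rec)
    return accepted, quarantined

good, bad = ingest([
    {"customer_id": "c-42", "balance": 10.0},
    {"customer_id": "", "balance": -5},  # fails both rules
])
print(len(good), len(bad))  # 1 1
```

Making a named owner accountable for the quarantine queue is what keeps a gate like this running past the first few months.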

Corinium: What do you believe to be the most common form of ‘bad data’ and what effect can that have on an organization?

Darren Abbruzzese: Poor data quality management can have a really detrimental impact. If poor data capture and management processes allow inaccurate, wrong or misleading data to pollute your key information assets, then all the investments you’ve made into building a data capability will be for nothing. Your reporting won’t be trusted and you’ll spend countless hours trying to explain it. Your analytical efforts will return misleading signals, potentially leading to sub-optimal or downright disastrous decisions. Your data program will lose credibility, as will you, as you’ll be left to explain the bad outcomes even though you may not have controlled the input. So clearly, data quality right from the start is really important. Again, it comes down to data being a core asset of high value that needs to be treated as such.

Corinium: The financial services industry understands the value and power of data, why do you think that is? How do you see that developing in the next 3-5 years, with particular reference to the use of analytics?

Darren Abbruzzese: Banks have appreciated the value of data and what that means for their businesses for a long time. It's only been in the last few years, as the tools and capability have started to mature, that banks have begun to make better use of their data and do it at scale. I believe we are only at the start of this journey. Banks have been pretty good at developing digital channels for their customers, and these are the predominant ways that customers now interact with their bank for simple transactions, but looking forward, it is blending data into all of our channels that will drive the next great leap forward. Using analytics to really understand the needs of the individual customer, recognising what they need and are likely to need in the future, and building that into their mobile and desktop interface in an engaging way is where banks will go next, and it’s pretty exciting.

Join Darren and 200 attendees at the Chief Data and Analytics Officer Forum, Sydney taking place on 6-8 March 2017.

For more information visit:   


November 29, 2016

Revolution Analytics

Free online course: Analyzing big data with Microsoft R Server

If you're already familiar with R, but struggling with out-of-memory or performance problems when attempting to analyze large data sets, you might want to check out this new EdX course, Analyzing Big...

Big Data University

This Week in Data Science (November 29, 2016)

Here’s this week’s news in Data Science and Big Data. Machine Learning

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

  • Data Science Bootcamp – This is a beginner-friendly, hands-on bootcamp, where you will learn the fundamentals of data science from IBM Data Scientists Saeed Aghabozorgi, PhD and Polong Lin.

The post This Week in Data Science (November 29, 2016) appeared first on Big Data University.

Silicon Valley Data Science

When Decisions Are Driven by More Than Data

I have been thinking a lot lately about storytelling, especially in the wake of the election and the ensuing discussion of the role played by fake news sites, echo chambers, and filter bubbles. Stories can lead us astray. They can reinforce what we want to hear rather than tell us the truth. And many times, two different sets of people will listen to the same story and then leave with completely different takeaways.

Another thing I think a lot about is data visualization—which, as we know it, is only a couple of centuries old: it arguably started with William Playfair in the 1780s. But data visualization (and modern data science) are really just another phase in a tradition of human storytelling that goes all the way back to cave paintings.

We have always used stories to help each other make better decisions. Look at the parables in religious texts or the folk tales we tell to children. We use data in the same way, to help drive better decision-making.

The thing about stories is: they’re about people. They’re about explaining the things that happen to people, and they’re tools that we use to spread the history and shared values of groups of people (which is why they become so powerful in partisan election scenarios, for example).

In other words, stories are all about relationships, and so are data. It’s not the data points, or the nodes, that matter—it’s the edges between them that paint the trend lines and allow us to say something about our future. Or to put it more poetically, as Rebecca Solnit does:

The stars we are given. The constellations we make. That is to say, stars exist in the cosmos, but constellations are the imaginary lines we draw between them, the readings we give the sky, the stories we tell.

Why visual storytelling matters

A quartet of graphs, each with the same diagonal blue line reflecting identical statistical averages. Orange circles show that the data points on each graph actually fall in very different patterns: one forms an arc, one a mostly diagonal line with one extreme outlier, one a mostly vertical line with an even more extreme outlier, and another a very loosely diagonal line.

Graphs by Wikipedia user Schutz, CC BY-SA 3.0

You may already be familiar with Anscombe’s Quartet. It’s four sets of data points (x-y pairs), and all four sets have the same statistical properties. So if you’re only thinking about them mathematically, they appear to be identical in nature. But as soon as you graph them, you see immediately that they’re very different, and you can detect with your eye a relationship in each dataset that you wouldn’t see if you were looking at them with math alone.
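The quartet's “same statistics, different shapes” property is easy to check for yourself. Here is a minimal sketch in Python, using the standard published values of the four datasets, that computes the shared summary statistics with nothing beyond the standard library:

```python
# Anscombe's quartet: four x-y datasets with near-identical summary statistics.
datasets = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    # Pearson correlation: covariance divided by the product of std deviations.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

for name, (x, y) in datasets.items():
    # All four print mean_x=9.00, mean_y=7.50, r=0.82 despite very different shapes.
    print(f"{name}: mean_x={mean(x):.2f} mean_y={mean(y):.2f} r={pearson_r(x, y):.2f}")
```

Mathematically the four sets look interchangeable; only plotting them reveals the arc, the outliers, and the near-vertical line.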

Storytelling matters because it’s innate in us as human beings, and it’s how we learn about the world and think about the decisions we should make as people living in relationship to that world. Visual storytelling matters because it reveals relationships that we might not be able to understand any other way.

A black background with bluish lines tracing the routes of commercial airline traffic: there are so many routes that the shape of the USA is clearly visible. Major cities appear as bright spots of connecting lines.

An image from Aaron Koblin’s “Flight Patterns” project.

There’s another thing that happens when you begin to make stories visible—to draw constellations—which is that meta patterns and structures emerge. Take, for example, Aaron Koblin’s visualization of North American air traffic, “Flight Patterns.” Making those routes visible and layering them together in a composite reveals not only the locations of major cities, but also the shape of an entire continent, without the need for any geographic map as a base layer. The relationships between the data points end up describing more than just themselves, in part because of where the data isn’t: the emptiness describes the oceans.

Why you as a storyteller matter

We all tend to think of data as infallible, as black-and-white, but once you understand that it’s not the data you’re presenting, it’s the relationships among the data, then you can see that you as the designer and storyteller are bringing something important to the process.

You must be very deliberate about what of yourself you put into your visualizations. You are naturally going to have an opinion, and that will likely inform the story you tell. Stories are powerful things. And it’s not unusual for people to become so attached to a particular story that they insist on drawing their constellations in ways that ignore the position of the stars. Just think of the historical inaccuracies in almost every Hollywood war film ever made, especially Mel Gibson’s.

Steve Jobs standing on stage to the left of a giant screen, which shows a colorful 3D pie chart. The pie chart has a green slice center front, which is labeled 19.5%. It appears to be much larger than another, purple slice in center-back, which is labeled 21.2%.

3D pie charts use foreshortening to create the illusion of three dimensions, but that same effect also distorts the data.

Data, like stories, can also have the “ending changed,” so to speak. If I were to show you this image from a keynote that Steve Jobs gave at MacWorld in 2008 and asked you to tell me, without reading the numbers, which section is bigger, green or purple, you would surely answer “green.” But of course, once you do read the numbers, you can see that the visualization is misleading. From a storytelling perspective, this is the same thing as changing the outcome of a battle because you felt like it.

You, as the data scientist or the data visualization designer, have an incredible impact on how the story gets told, and therefore on how decisions get made. Draw your constellations carefully, and use well the power to expose relationships and meta structures. Our ability to make good decisions—and even our futures—depend on it.

Editor’s note: Julie will be presenting her “From Data to Decisions” tutorial at TDWI Austin next week, which will elaborate on how designers can influence decision-making with data visualizations. Find more information, and sign up for her slides, here

The post When Decisions Are Driven by More Than Data appeared first on Silicon Valley Data Science.

Revolution Analytics

Microsoft R Open 3.3.2 now available

Microsoft R Open 3.3.2, Microsoft's enhanced distribution of open source R, is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to version 3.3.2,...

The Data Lab

Online Learning Funding Call

Online Learning Call

We know that Scotland needs more people with the skills to manipulate, organise and analyse an ever increasing amount and variety of data. This demand is not confined to traditional computing industries. Data skills are required in agriculture, tourism, construction, energy and most industry sectors.

To achieve The Data Lab’s vision to showcase Scotland as an international leader in Data Science, and to train a new generation of data scientists, we are looking to fund the development of online courses which contribute to the requirement for more flexible high-quality data science training and education. These may take the form of MOOCs or closed application online courses.

The Data Lab will provide between £30,000 and £50,000 for the development of up to three online courses for this call. These must be delivered fully online, which means there should be no mandatory in-person requirement for the learner.  You can select one of the following three course types:

  1. MOOC (Massive Open Online Courses) – Typically delivered for free using a pre-existing popular MOOC platform (up to £30k)
  2. Online Training – Paid for learning content that is focused on equipping Scottish Industry with the data science training they need (up to £30k)
  3. MOOC and Online Training – A combination of 1 and 2. This could be a MOOC with additional paid-for advanced learning content that meets the data science training needs of Scottish Industry. Here is an example. (Up to £50k)

If you are interested, please complete and submit our online Expression of Interest form. You must submit your completed application form by 17:00 on Monday 13th February 2017.

Before submitting your Expression of Interest please read our application guidance.

Further online guidance on how to build a MOOC can be found through many different sources. Here are some examples that might be useful from The University of Glasgow, The University of Edinburgh and



November 28, 2016

The Data Lab

The Data Lab Hosts 2nd Executive Away Day

Executive Away Day November 2016

Over dinner, we heard some fascinating stories about people’s personal experiences and practical advice on driving value from data. Hannah Fry was our captivating after dinner speaker who got us all thinking about how data affects us every day. The conversation went on long into the night and it was an excellent way to warm our brains up for the next day.

The Away Day presenters continued the trend set by Hannah with a series of enlightening and entertaining insights. Mark Priestly, F1 commentator and guru, shared the journey he went on with McLaren as they embedded data scientists into the team and how that changed the way decisions were made. I can’t do justice to the way he presented the material, so clear your diaries for DataFest next March, when both Hannah and Mark will be back in Scotland.

Our next two speakers were Steve Coates from Brainnwave and Inez Hogarth from brightsolid. Steve and Inez shared their experiences, in a start-up and in an established company respectively, of the value that data scientists can bring to organisations. It was great to see the passion they had for investing in people and setting them up for success working in partnership with businesses. I highly recommend getting along to any event that you see Inez or Steve speaking at so that you can hear their experiences first-hand and, more importantly, get a chance to engage them in conversation.

Our last two speakers were two people I have had the privilege to work with in previous roles. Chris Martin from Waracle and Callum Morton from NCR both spoke about the role data plays in their companies. Chris is a fascinating speaker and raised many superb points including the fact that data is not a silver bullet, it is just one of the many things you need to master to create an environment for customer innovation. Callum brought to life the sweet spot between a compelling strategy and rapid execution. It was great to listen to Callum share his experiences and his views on the future. Especially powerful was the fact that only 9 months ago Callum was sitting in the audience at our last Executive Away Day, and now here he was leading the conversation based upon real world experiences.

I also gave a short talk on Data as a strategic differentiator as part of opening the second day. I was delighted and relieved to hear some of the points I raised brought to life by our excellent speakers. But best of all for me was the feedback we got on the day from attendees who found it engaging, practical and enjoyable. Most people I spoke to said that the speakers had stimulated their thought processes and that they were returning to work enthused, armed with practical advice and with a much richer network. As Gillian our CEO said, the attendees are all now part of The Data Lab gang and we will be there to support them as they continue on their own data journeys. 

Find out about our next Executive Education course: "Big Data Demystified", delivered in Partnership with the Institute of Directors (IoD) on 7 February 2017.


Jean Francois Puget

What’s in CPLEX Optimization Studio 12.7?

Here is a guest post by my colleague Paul Shaw on the latest release of CPLEX Optimization Studio. The release generated some buzz at the latest INFORMS conference because of its support for Benders decomposition. However, Benders decomposition isn't the only novelty in this release, as Paul explains. Paul's post originally appeared here. This release is also available to all academics and students for free, as my other colleague Xavier Nodet explains here. Xavier Nodet has also posted a detailed presentation of CPLEX Optimization Studio 12.7 on SlideShare.

CPLEX Optimization Studio 12.7 is here! It was officially announced on Tuesday 8th November on the IBM website and will be available from November 11th. I’ll go over the main features here.

First of all, performance has been improved under default settings in both CPLEX and CP Optimizer. For CPLEX, the most significant gains can be seen for MIP, MIQP, MIQCP and nonconvex QP and MIQP models. For CP Optimizer, the main performance improvement is for scheduling models, but combinatorial integer models should see some improvement too.

Benders’ decomposition
For certain MIPs with a decomposable structure, CPLEX can apply a Benders’ decomposition technique which may be able to solve the problem significantly faster than with regular Branch-and-Cut. This new Benders’ algorithm has a number of levels of control indicated through the new “benders strategy” parameter. This parameter specifies how the problem should be split up into a master problem and a number of subproblems. The levels work as follows:

  • User level: This level gives you full control over the specification of the master problem and the subproblems. To do this, you need to annotate your model. Annotations are a new concept added in CPLEX 12.7 to associate values with CPLEX objects like objectives, variables, and constraints. Models can be annotated through the APIs or specified in annotation files. To specify the master and subproblems, you give an annotation to each variable. A variable with annotation 0 belongs to the master and one with annotation k ≥ 1 belongs to subproblem k.
  • Workers level: Here, CPLEX will take the given annotation and try to further break up the subproblems into independent parts. In particular, this level lets you think about the separation of variables into “in the master” and “in a subproblem” without having to worry about how the subproblems are set out. For example, you can annotate either the variables in the master or those in the subproblems as you desire. CPLEX will then automatically break up the subproblem variables into independent subproblems if possible.
  • Full level: The fully automatic level has the following behavior. First, CPLEX will assume that all the integer variables of the problem will go into the master, with all the continuous variables being placed in the subproblems. Then, as for the Workers level, CPLEX will attempt to refine the subproblem decomposition by breaking it into independent parts if possible.

By default, “benders strategy” uses an automatic level which behaves as Workers if a decomposition is specified and runs regular branch-and-cut if no decomposition is specified.
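The User-level rule above (annotation 0 means master, annotation k ≥ 1 means subproblem k) can be illustrated with a small sketch. This is not CPLEX API code; the variable names and annotation values here are invented purely to show how a set of per-variable annotations partitions a model:

```python
# Hypothetical per-variable annotations: 0 -> master, k >= 1 -> subproblem k.
annotations = {"x1": 0, "x2": 0, "y1": 1, "y2": 1, "z1": 2}

# Variables annotated 0 form the master problem.
master = [v for v, a in annotations.items() if a == 0]

# Remaining variables are grouped into subproblems by their annotation value.
subproblems = {}
for v, a in annotations.items():
    if a >= 1:
        subproblems.setdefault(a, []).append(v)

print(master)        # ['x1', 'x2']
print(subproblems)   # {1: ['y1', 'y2'], 2: ['z1']}
```

At the Workers and Full levels, CPLEX itself refines such a grouping by splitting subproblems into independent parts where possible.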

Modeling assistance
With 12.7, CPLEX can be asked to issue warnings about modeling constructs which, although valid, may contribute to performance degradation or numerical instability. To turn on these warnings, set the “read datacheck” parameter to the value 2. Here is an example of the type of warning that CPLEX can issue:

CPLEX Warning 1042: Detected a variable bound constraint with large coefficients. Constraint c8101, links binary variable x934 with variable x2642 and the ratio between the two is 1e+06. Consider turning constraint into an indicator for better performance and numerical stability.
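For context, the pattern this warning flags is a “big-M” variable bound constraint. As a sketch (using generic variables x and b rather than the x934 and x2642 of the message), the two forms it is contrasting are:

```latex
% Big-M variable bound: the coefficient ratio of 10^6 hurts numerics.
x \le 10^{6}\, b, \qquad b \in \{0, 1\}
% Indicator reformulation suggested by the warning: no large coefficient.
b = 0 \;\Rightarrow\; x = 0
```

Both enforce “x can be nonzero only when b = 1”, but the indicator form avoids mixing coefficients of very different magnitudes in one constraint.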

Interactive improvements
CP Optimizer now includes an interactive interface similar to that of CPLEX. You can load and save models in CPO format, change parameters, run propagation, solve, refine conflicts and so on. You can type “help” at the prompt to get information on the facilities available.

Evaluating variability
Both the CPLEX and CP Optimizer interactive shells now provide a way of easily examining the performance and variability of the solver on a particular model instance. In both interactive shells, the command tools runseeds [n] will run an instance n times (default is 30) with the current parameter settings, varying only the random seed between runs. Information on each run is displayed. For example, here is the output of tools runseeds on a CPLEX model where the time limit has already been set to 8 seconds.

====== runseeds statistics of 30 runs

    exit  sol    objective     gap  iteration      node   runtime   dettime
run code stat        value     (%)      count     count   seconds     ticks
  1    0  108          ---     ---     419468     39808      8.00   4195.20
  2    0  108          ---     ---     514998     49440      8.00   4369.46
  3    0  101            0    0.00      69242      8453      1.55    981.88
  4    0  108          ---     ---     518059     37514      8.00   4460.28
  5    0  101            0    0.00     123420     16923      4.52   3692.25
  6    0  108          ---     ---     511910     46768      8.01   4550.99
  7    0  108          ---     ---     459329     41168      8.00   4260.86
  8    0  108          ---     ---     431280     32714      8.01   4215.08
  9    0  108          ---     ---     379432     36441      8.00   3992.18
 10    0  108          ---     ---     377233     27091      8.01   3710.86
 11    0  108          ---     ---     333495     20436      8.01   3620.59
 12    0  101            0    0.00      21160      2993      0.44    225.97
 13    0  101            0    0.00     124943     14762      4.55   3714.96
 14    0  101            0    0.00     113538     12581      4.43   3641.18
 15    0  108          ---     ---     549655     46617      8.00   4606.69
 16    0  108          ---     ---     447175     26129      8.00   4007.73
 17    0  101            0    0.00      38622      5525      1.42    843.57
 18    0  108          ---     ---     413188     43561      8.00   4158.66
 19    0  101            0    0.00     490580     41267      7.83   4324.97
 20    0  108          ---     ---     499872     38093      8.00   4394.23
 21    0  101            0    0.00        292         0      0.14     86.63
 22    0  108          ---     ---     450731     47616      8.00   4373.14
 23    0  101            0    0.00     101645      9091      1.57    982.50
 24    0  108          ---     ---     520977     50566      8.00   4412.67
 25    0  108          ---     ---     501371     45112      8.00   4371.53
 26    0  108          ---     ---     496808     39674      8.00   4352.48
 27    0  108          ---     ---     414917     44412      8.00   4345.97
 28    0  101            0    0.00     427330     31227      6.97   4120.19
 29    0  108          ---     ---     481120     44447      8.00   4547.70
 30    0  101            0    0.00     465275     40579      7.69   4222.06

Exit codes:
      0 : No error

Optimization status codes:
                 objective     gap  iteration      node   runtime   dettime
                     value     (%)      count     count   seconds     ticks
    101 : integer optimal solution (11 times)
     average:            0    0.00     179641     16673      3.74   2439.65
     std dev:            0    0.00     185898     14571      2.89   1772.79
    108 : time limit exceeded, no integer solution (19 times)
     average:         ---      ---     459001     39874      8.00   4260.33
     std dev:         ---      ---      58792      8317      0.00    267.13

This is an instance where either a solution is found and proved optimal, or none is found. Looking at the breakdown by status code shows that the optimal is found and proved in 11 out of the 30 runs and a timeout happens without a solution being found on the remainder of the runs. When the optimal is proved, it happens in 3.74 seconds on average.
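The per-status averages in the summary block are just the sample mean and sample standard deviation over the matching runs. As a quick check, feeding the runtimes of the 11 status-101 runs from the table above into a few lines of Python reproduces the reported 3.74 s average and 2.89 s standard deviation:

```python
# Runtimes (seconds) of the 11 runs that finished with status 101 in the table.
runtimes = [1.55, 4.52, 0.44, 4.55, 4.43, 1.42, 7.83, 0.14, 1.57, 6.97, 7.69]

n = len(runtimes)
avg = sum(runtimes) / n
# The summary appears to use the sample standard deviation (n - 1 denominator).
std = (sum((t - avg) ** 2 for t in runtimes) / (n - 1)) ** 0.5

print(f"average={avg:.2f} std dev={std:.2f}")  # average=3.74 std dev=2.89
```

The same computation over the 19 timed-out runs yields the status-108 row of the summary.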
Conflict refinement and tuning tools have also been moved to the “tools” sub-menu.

CP Optimizer warm start
CP Optimizer warm starts can now be represented in the CPO file format in the “startingPoint” section. This makes it possible to store starting points portably and persistently, and to share them with others or with IBM support. Moreover, this means you can solve problems with a warm start using IBM Decision Optimization on Cloud.

Here is an example of how to specify a starting point:

x = intVar(1..10);
y = intVar(1..10);
x + y < 12;
itv = intervalVar(optional, end=0..100, size=1..5);
startOf(itv, 10) == x;
endOf(itv, 1) == y;

startingPoint {
  x = 3;
  itv = intervalVar(present, size=4, start=7);
}

When you export a model on which a starting point has been set with the APIs, a "startingPoint" section is automatically generated containing the starting point.

Additionally, CP Optimizer's warning message mechanism has been evolved to include additional information on starting points, particularly when the starting point contains inconsistent information (such as values which are not possible for particular variables due to domain restrictions).

Piecewise linear functions
The support for piecewise linear functions in CPLEX has been extended and is now available both in the C API and in the file formats. Here's an example of how to specify a piecewise linear function in the LP file format.

Subject To
  f: y = x 0.5 (20, 240) (40, 400) 2.0

Here, we specify a piecewise linear function f and the constraint y = f(x). The function f consists of three segments. For x < 20, the slope of the function is 0.5; then there is a segment between the two points (20, 240) and (40, 400); finally, for x > 40, the slope is 2.0.
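To make the three segments concrete, here is a plain evaluation of this particular f in Python. It assumes, as the breakpoint notation implies, that the outer slopes extend continuously from the listed points, and that the middle segment's slope is (400 - 240) / (40 - 20) = 8:

```python
def f(x):
    """Piecewise linear function from the LP snippet:
    slope 0.5 up to (20, 240), a segment to (40, 400), slope 2.0 beyond."""
    if x <= 20:
        return 240 + 0.5 * (x - 20)   # slope 0.5, anchored at (20, 240)
    if x <= 40:
        return 240 + 8.0 * (x - 20)   # segment between the two breakpoints
    return 400 + 2.0 * (x - 40)       # slope 2.0, anchored at (40, 400)

print(f(20), f(30), f(40), f(50))  # 240.0 320.0 400.0 420.0
```

With the constraint y = f(x), CPLEX handles the nonconvex kinks of such a function itself rather than requiring a manual binary-variable reformulation.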

Additional Parameters
A new parameter has been introduced to control so-called RLT cuts, based on the Reformulation-Linearization Technique. These cuts apply when the optimality target parameter is set to 3, that is, when solving a nonconvex QP or MIQP instance to global optimality. The parameter is named CPXPARAM_MIP_Cuts_RLT in the C API and is accessible as “set mip cuts rlt” in the interactive optimizer. Possible values are -1 (off), 0 (let CPLEX decide, the default), 1 (moderate), 2 (aggressive) and 3 (very aggressive).

A new effort level for MIP starts is available. This effort level CPX_MIPSTART_NOCHECK specifies that you want to inject a complete solution that you know is valid for the instance. CPLEX will skip the usual checks that are done, which can save time on some models. If the solution is not actually valid for the model, the behavior of CPLEX is undefined, and may lead to incorrect answers.

Please enjoy using and exploring the new features of CPLEX Studio 12.7!


What type of Machine Learning is right for my business?

Machine Learning is by no means a new thing. Back in 1959, Arthur Samuel’s self-training checkers algorithm had already reached “amateur status” – no mean feat for that period in time. This article is intended to shed some light on the two different types of Machine Learning that one can encounter, which may be useful if you are thinking of entering into this space and are unsure as to which avenue is appropriate for your business.


November 26, 2016

Simplified Analytics

Product recommendations in Digital Age

By 1994 the web had come to our doors, bringing the power of the online world to our doorsteps. Suddenly there was a way to buy things directly and efficiently online. Then came eBay and Amazon in...


November 25, 2016

Simplified Analytics

What is Deep Learning?

Remember how you started recognizing fruits, animals, cars and, for that matter, any other object by looking at them in your childhood? Our brain gets trained over the years to recognize these...

Teradata ANZ

Is your business ready to learn from the record-breakers?

In sport, as in business, there is the constant interplay between marginal gains and game-changing innovations.

Take the 100m freestyle swim: records have been broken year on year, but every so often we see not just a record broken but an outstanding accomplishment, like Albert Vande Weghe's in 1934. In one stroke he changed the nature of competitive swimming with his underwater somersault ‘Flip Turn’. Further change followed in 1976 with the introduction of pool gutters at the Montreal Olympics, which captured excess water, resulting in less friction and faster times. The next big advance came in 2008, with the advent of low-friction swimwear that enabled athletes to move through the water with even greater speed.

The relevance of this story to business analytics is that just like athletes in training, data scientists make incremental improvements every day, and yet every so often comes one of those momentous, game-changing innovations.

Prescriptive analytics are the catalyst

As organisations increasingly seek to drive value from historical insights, they can start predicting the future, ensuring that positive predictions are fulfilled and that negative outcomes are avoided. This is how prescriptive analytics can influence the future.

Achieving this nonetheless requires a shift away from statistical and descriptive ways of looking at data, towards considering events and interactions. Applying contextual analytics to these events and interactions allows us to investigate modes of behaviour, intentions, situations, and influences.

Vast amounts of money are being spent by organisations on the creation of data ecosystems, enabling them to capture, store and archive large volumes of data at unprecedented scale, in a cost-efficient manner. They are responding to the headlines about Big Data, but unfortunately many end up with fragmented architectures and data silos that thwart their ability to interrogate data and create value.

Gartner predicts that by 2018, 70 per cent of Hadoop deployments will fail to meet cost savings and revenue-generation objectives due to skills and integration challenges.

Should data ecosystems be built?

The answer is firmly “yes”. The age of infrastructure opened the door to the use of analytics for extracting value from data. Essentially, the resulting insights make business decisions more accurate and intelligent and because of that, the focus has shifted. Now it is the business team and not the IT department that leads data and analytics initiatives, demanding more value from data-plus-insights capable of creating commercial opportunities and solving problems.

It must be recognised that the value of storing and organising data depends on what you do with it. Business teams want to ask questions that cross data silos; questions that account for customer, product, channel, and marketing in combination. This amounts to a fundamental realignment of priorities and means that in future, many of our data professionals will no longer be technical specialists. Instead, they will be business-focused individuals using data, analytics, and technology as key enablers.

The Olympic spirit – higher, stronger, faster

Unsurprisingly, the monetisation of data and analytics will be a big differentiator. Gartner’s strategic prediction states that by 2018, more than half of large, global organisations will compete using advanced analytics and proprietary algorithms, causing the disruption of entire industries.

Without an underlying strategic framework – the organisation, the people, the processes, and the execution – businesses will drown in data. Only a judicious mix of analytics can help business leaders make decisions with confidence and intelligence, and sharpen the competitive edge.

The fact is that in any given organisation, data analysts beaver away making incremental improvements to their analytics ‘personal best’. Yet, as in the Olympics, it is the “Fosbury Flop” moments, and the Bob Beamon breakthroughs that live in the memory.

It is only such record-shattering leaps forward – like prescriptive analytics – that are capable of changing corporate thinking. Or, more precisely, transform the whole data-driven nature of business competition.


The post Is your business ready to learn from the record-breakers? appeared first on International Blog.


November 24, 2016

Data Digest

Activewear brand evolves from sports clothing to wearable fitness tech | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Bev Terrell was originally posted on SiliconANGLE blog.  

While most people think of Under Armour as a supplier of sportswear and sports footwear, it also owns the digital apps MapMyFitness, MyFitnessPal and Endomondo. As such, the company has taken its first steps this year toward greater investment in wearable technologies, called “Connected Fitness,” so that people can track their exercise, sleep and nutrition throughout the day.

Chul Lee, head of Data Engineering and Data Science at Under Armour Inc., joined Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, during the Chief Data Scientist, USA event, held in San Francisco, CA, to discuss aspects of the Chief Data Scientist role and Under Armour’s move into wearable fitness technology.

Being your own advocate

During a panel discussion held earlier in the day, Frick noted that Lee had brought up the point that in addition to being a scientist, a CDS also has to be a salesperson to sell the role, engage the business units and help them understand what they’re doing, at the right level.

Lee expanded on that by saying, “I learned, through many experiences and many years of failing, that there was an ‘ah-ha’ moment where I had to start communicating and being a salesperson.”

He also explained that data scientists tend to think they have to unpack the ‘black box’ of whatever project they are working on and explain everything they are doing to everyone. Data scientists feel pressure to talk about the science behind their projects and how it is done, rather than focusing on the value they are trying to deliver to customers.

Lee has found that all that is needed is to explain your project at a high level to coworkers and make sure they understand and are supportive of that.

Data is in sports clothes, too

Frick asked about how Under Armour got started with its Connected Fitness services, the software services arm built around the Endomondo, MyFitnessPal and MapMyFitness apps.

“The way we start thinking about shoes and shirts is that, OK, you need to enter an experience around shoes and shirts,” he said, adding that because data is everywhere, in every sector, they asked why shouldn’t it be in fitness clothes, too.

Watch the complete video interview below:

Data Digest

The Gawker effect: Can deep learning go deep enough to write tomorrow’s headlines? | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by R. Danes was originally posted on SiliconANGLE blog.  

When now-defunct Gawker revealed its use of analytics in content decisions, many old media types shook their heads; algorithms must not replace human judgement in journalism, they warned. But some believe a happy medium is possible: Data can be sourced and analyzed to inform content writers while leaving them with the final say on what readers see.

Haile Owusu, chief data scientist at Mashable, said that this space where data meets human knowledge workers is fertile ground for innovation. He told Jeff Frick (@JeffFrick), host of theCUBE, from the SiliconANGLE Media team, during the recent Chief Data Scientist, USA event that data practitioners do their best work in tandem with “people who are not especially quantitative, who are expecting — and rightfully so — expecting to extract real, concrete, revenue based value, but are completely in the dark about the details.”

Digital research assistant

Owusu explained how Mashable assists writers with data without encroaching on their judgement. They utilize an accumulated history of viral hits, its Velocity Technology Suite and its CMS.

“What we found is that writers are able to distill from sort of a collection of greatest hits — filtered by topic, filtered by time window, filtered by language key words — they are able to incorporate that collected history into their writing,” he said, adding that it does not simply fetch more clicks, but actually improves the quality and depth of their writing.

Two heads are better than one

Owusu stated that deep learning neural networks are able to grok the nuances of data in an almost human manner.

“They’ve allowed us to do feature extraction on images and text in a way that we hadn’t been able to before, and there has been a significant improvement in our ability to do predictions along these lines,” he concluded.

Watch the complete video interview below:

Data Digest

Show me the money: selling inexact data science to tight-fisted investors | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by R. Danes was originally posted on SiliconANGLE blog.  

What industry takes risk management more seriously than finance? Corporate and personal investors want a gold-embossed sure thing when they sink their cash into a venture. So techies who come to them with an untested data analytics toy will likely find them tough customers. Some folks in the finance world are out to dispel anxieties by educating investors on why data sometimes picks a winner and why it may fail.

Jeffrey Bohn, chief science officer at State Street Global Exchange, said the confusion lay in the problem of data quality. He also believes that companies still do not have enough hands on deck to separate the wheat from the chaff.

“You still find 70 to 80 percent of the effort and resources focused on the data preparation step,” he told Jeff Frick (@JeffFrick), host of theCUBE*, from the SiliconANGLE Media team, during the recent Chief Data Scientist, USA event.

Data scientists spread too thin

According to Bohn, more data stewards are needed to select quality data and to free up analysts to innovate and find solutions.

“I’ve had problems where you have great models, but data quality produced some kind of strange answers,” he explained. “And then you have a senior executive who looks at a couple of anecdotal pieces of evidence that suggest there are data quality issues, and all of a sudden they want to trash the whole process and go back to more ad hoc, gut-based decision making.”

The best and the rest

Bohn argued that to increase data quality, companies need to start culling from a greater number of sources.

“We’ve recently been very focused these days on trying to take unstructured data — so this would be text data, it might be in forms, on PDFs or HTML documents or text files — and marry that with some of the more standard structured or quantitative data,” he said.

Watch the complete video interview below:

Data Digest

Data science: What does it mean and how is it best applied? | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Gabriel Pesek was originally posted on SiliconANGLE blog.  

While data science (along with its associated tools, utilities and applications) is a hot topic in the tech world across innumerable facets of industry, there’s still a large degree of uncertainty as to just what data science means, and how it can best be applied.

At the Chief Data Scientist, USA event in San Francisco, CA, Assad M. Shaik, head of Enterprise Data Science for Cuna Mutual, joined Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, to talk about data science as it’s commonly seen and as it really is.

Improving understanding

As Shaik explained, his focus at the event was to clarify what data science actually is. Noting “a lot of confusion” about whether it’s a new name for analytics, advanced analytics or something else entirely, his goal is to speak frankly and clearly enough for attendees to gain a better understanding of the challenges and goals encountered by data scientists on a daily basis.

As part of this examination, he’s looking at both the experiences encountered by customers and the revenue growth that can be gained by applying analytics to those customers. He’s also providing some insight into the new skill expectations being encountered by data scientists and the reasons behind those changes.

In Shaik’s estimation, sales skills and a grasp of marketing have become “essential” for a data science group to find success. He attributes this mainly to the development from “IT and the business, if you just go back a few years,” to a centralization of data teams by other organizations in the search for additional value within their data.

Solving real problems

One of the biggest questions savvy data science groups can ask, in Shaik’s mind, is: “How can we help you meet the corporate and the business area goals using data science?” By looking for concrete problems to which the team can apply their tools and research, more informative conclusions can be drawn from the experience.

In a similar development line, Shaik shared his thoughts on companies such as Uber and Airbnb, which are using data science to evolve from traditional models. To him, the most important part of these companies is the way that they’re applying their data to the problems of an existing industry standard and leading the rest of that industry along with the need to innovate and keep up with the times.

As the conversation came to a close, Shaik also shared how much he enjoys the conference. In his experience at the event, “The biggest thing is the networking. I get to meet the people from the different industry sectors, with a similar background in the data science, and understand how they are doing what they are doing in the data science field, [while] sharing my perspective with them. It’s a fabulous event.”

Watch the complete video interview below:

Data Digest

Crowd-sourcing online talent to win a million-dollar competition | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Nelson Williams was originally posted on SiliconANGLE blog.  

Talent shows have become a television favorite across the world. On the production side, they’re cheap and relatively easy to throw together. Viewers love seeing new acts and voting for their favorites. Still, only so many people can audition for these shows, and even fewer can make the trip to wherever filming takes place. The online world of user-generated media has a solution.

To learn more about an online talent show in the works, Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, visited the Chief Data Scientist, USA event in San Francisco, CA. There, he spoke with Roman Sharkey, chief data scientist at Megastar Millionaire.

The data science of talent

The conversation opened up with a look at Megastar Millionaire itself. Sharkey described the show as the world’s first online talent platform. Currently in beta testing, he expected Megastar Millionaire to go global sometime next year. He mentioned how the winner would be determined by votes and video shares, with a celebrity judging panel presiding over the finals.

Sharkey stated that as a data scientist, his role with the company was twofold. First, he was responsible for analytics, collecting data to extract information from users. Second, there was the machine learning part. A major project within the company involved obtaining new performers by detecting real talent in videos online outside the show and then inviting those people to join.

“The system is already really accurate, and its accuracy is improving,” Sharkey said.

Testing and business

At the moment, Megastar Millionaire is still testing its technology. Sharkey explained it’s keeping the number of performers low while testing out the platform, with about 200 to 250 people in the beta competition. On the business side, the company is working with funding from investors and is listed on the Australian stock exchange.

Sharkey pointed out the company is working on things no one has done in practice so far. His goal was to find new ways to accomplish tasks through data science. As for Megastar Millionaire itself, users can find its app in the Apple App Store and Google Play.

Watch the complete video interview below:

Data Digest

Data confluence: handling the scale of distributed computing | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Nelson Williams was originally posted on SiliconANGLE blog.  

We live inside an explosion of data. More information is being created now than ever before. More devices are networked than ever before. This trend is likely to continue into the future. While this makes data easy to collect for companies, it also presents the challenge of sheer scale. How does a business handle data from millions, possibly billions, of sources?

To gain some insight into the cutting edge of distributed data collection, Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, visited the Chief Data Scientist, USA event in San Francisco, CA. There, he met up with Sam Lightstone, distinguished engineer and chief architect for data warehousing at IBM.

The discussion opened with a look at a recently announced concept technology called “Data Confluence.” Lightstone explained that data confluence is a whole new idea they’re incubating at IBM. It came from a realization that vast amounts of data are about to come upon businesses from distributed sources such as cellphones, cars, smartglasses and others.

“It’s really a deluge of data,” Lightstone said.

The idea behind data confluence is to leave the data where it is. Lightstone described it as allowing the data sources to find each other and collaborate on data science problems in a computational mesh.

Using the power of processors at scale

Lightstone mentioned a great advantage of this concept: the ability to bring hundreds of thousands, even millions, of processors to bear on data where it lives. He called this a very powerful and necessary concept. Such a network must be automatic if it is to scale to hundreds of thousands of devices.

The complexities of such a system are too much for humans to deal with. Lightstone stated his goal was to make this automatic and resilient, adapting to the state of the devices connected to it. He related that with data confluence, they hoped to tap into data science for Internet of Things, enterprise and cloud use cases.

Watch the complete video interview below:

Jean Francois Puget

Using Python Subprocess To Drive Machine Learning Packages

A lot of state-of-the-art machine learning algorithms are available as open source software.  Much of that software is designed to be used via a command line interface.  I much prefer to use Python, as I can mix many packages together, and I can use a combination of Numpy, Pandas, and Scikit-Learn to orchestrate my machine learning pipelines.  I am not alone, and as a result, many open source machine learning packages provide a Python API. 

Most, but not all.  For instance, Vowpal Wabbit does not provide a Python API that works with Anaconda.  A more recent package, LightGBM, does not provide a Python API either. 

I'd like to be able to use these packages and other command line packages from within my favorite Python environment.  What can I do?

The answer is to use a very powerful Python package, namely subprocess.  Note that I am using Python 3.5 with Anaconda on a MacBook Pro.  What follows runs as well on Windows 7 if you use commands available in a Windows terminal, for instance using dir instead of ls.  Irv Lustig has checked that the same approach runs fine on Windows 10, see his comment at the end of the blog.

First thing to do is to import the package:

import subprocess

We can then try it, for instance by listing all the meta information we have on a given data file named Data/week_3_4.vw:

["ls", "-l", "Data/week_3_4.vw"], stdout=subprocess.PIPE).stdout

This yields

b'-rw-r--r--  1 JFPuget  staff  558779701 Aug 11 12:40 Data/week_3_4.vw\n'

Let's analyze the code we executed a bit. runs a command in a sub process, as its name suggests.  The command is passed as the first argument, here a list of strings.  I could have passed a single string such as "ls -l Data/week_3_4.vw".  The Python documentation says it is preferable to break the command into as many substrings as possible.

The command returns a CompletedProcess object that can be stored for later use.  We can also use it immediately to retrieve the output of our command.  For this we need to pipe the standard output of the command to the stdout property of the object returned by  This is done with the second argument, stdout=subprocess.PIPE.

A similar example from Windows 7 (I guess Windows 10 would be the same):

["dir"], stdout=subprocess.PIPE, shell=True).stdout

will output a string containing the content of the default directory for your Python script.  Note we must use the shell=True argument in this case.  It first launches a shell, then runs the command in that shell.

Let's now run Vowpal Wabbit.  We assume that the Vowpal Wabbit executable vw is accessible in our system path.  One way to check it is to just type vw in a terminal.  The following code snippet runs it with the above data file as input:

cp =['vw -d Data/week_4_5.vw'],
                    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                    shell=True, universal_newlines=True, check=True)
print(cp.stdout)

Let's look at the code we wrote.  We store the CompletedProcess object returned by the command in a variable for later use.  We then print the stdout property of that object.  We redirect the standard error as well as the standard output, with the argument stderr=subprocess.STDOUT. This redirects the standard error to the standard output, which in turn is captured in the stdout property of the cp object. 

We want to run Vowpal Wabbit in a shell, as this is what it expects.  This is done via the shell=True argument.

The universal_newlines=True argument tells Python to treat the output as text, decoding it and handling line endings.  If it is not set, the printed output will be jammed together. 

Last, the check=True argument is set to true in order to trigger a Python exception if the sub process command return code is different from 0.  This is the only way to make sure that the command executed properly.

This code prints:

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = Data/week_4_5.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.480453 0.480453            1            1.0   0.6931   0.0000        6
0.263166 0.045879            2            2.0   0.6931   0.4790        6
0.389123 0.515081            4            4.0   1.0986   1.1152        6
0.303324 0.217524            8            8.0   1.3863   1.5294        6
0.216799 0.130274           16           16.0   1.6094   1.4195        6
0.332574 0.448349           32           32.0   2.9444   1.4557        6
0.431952 0.531331           64           64.0   1.3863   1.5907        6
0.832383 1.232813          128          128.0   1.7918   2.7743        6
0.686496 0.540610          256          256.0   1.0986   1.6892        6
0.586707 0.486917          512          512.0   0.0000   1.8906        6
0.743715 0.900723         1024         1024.0   1.0986   1.9852        6
0.793468 0.843222         2048         2048.0   2.6391   2.2505        6
0.707151 0.620834         4096         4096.0   2.1972   2.0928        6
0.699104 0.691057         8192         8192.0   0.6931   1.4820        6
0.497394 0.295684        16384        16384.0   1.6094   1.4555        6
0.374966 0.252538        32768        32768.0   2.9444   2.5621        6
0.305165 0.235363        65536        65536.0   2.5649   2.0110        6
0.266821 0.228478       131072       131072.0   1.3863   0.8909        6
0.278243 0.289665       262144       262144.0   1.6094   1.2183        6
0.263753 0.249263       524288       524288.0   1.0986   1.4362        6
0.252903 0.242053      1048576      1048576.0   1.0986   1.0863        6
0.255260 0.257616      2097152      2097152.0   1.0986   0.9340        6
0.270311 0.285362      4194304      4194304.0   1.0986   1.6406        6
0.293451 0.316591      8388608      8388608.0   0.6931   0.5203        6

finished run
number of examples per pass = 11009593
passes used = 1
weighted example sum = 11009593.000000
weighted label sum = 17739622.688639
average loss = 0.301545
best constant = 1.611288
total feature number = 66057558

This is a typical Vowpal Wabbit output. 

The above code looks nice and handy, but it has a major drawback.  It prints the output when the sub process command completes.  It does not let you see the current output of the command.  This can be frustrating when the underlying command takes time to complete.  And as machine learning practitioners know, training a machine learning model can take a long long time.

Fortunately for us, the subprocess package provides ways to communicate with the sub process.  Let's see how we can harness this to print the output of the sub process command as it is generated.  Instead of using the run command, we use the Popen constructor with the same arguments, except for check=True. Indeed, check=True only makes sense if you run the sub process command to completion.

We then parse the output line by line and print it.  There is a catch, however: we need to stop at some point.  Looking at the above output, we see that Vowpal Wabbit terminates its output with a line that starts with total.  We check for this and stop reading from the Popen object, then print any remaining output.  We use rstrip to remove the trailing newline, as print already adds one.  An alternative would be to replace print(output.rstrip()) with print(output, end='') .

proc = subprocess.Popen(['vw -d Data/week_3_4.vw'],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        shell=True, universal_newlines=True)
while True:
    output = proc.stdout.readline()
    print(output.rstrip())
    if 'total' in output:

remainder = proc.communicate()[0]
print(remainder)

The output of this code is the same as above, except that each line is printed immediately, without waiting for completion of the sub process. 

You can use the same approach to run LightGBM or any other command line package from within Python. 

I hope this is useful to readers, and I welcome comments on issues or suggestions for improvement.


Update on November 24, 2016.  Added that code works fine on Windows 10 as pointed out by Irv Lustig.  Also tested on Windows 7.


Data Digest

Forrester: Empowered customers drive deeper business transformations in 2017

Businesses today are under attack, but it’s not by their competitors. They are under attack from their customers. Three years ago, Forrester identified a major shift in the market, ushering in the age of the customer. Power has shifted away from companies and towards digitally savvy, technology-empowered customers who now decide winners and losers.

Our Empowered Customer Segmentation shows that consumers in Asia Pacific are evolving — and becoming more empowered — along five key dimensions. These five key shifts explain changing consumption trends and lead to a sense of customer empowerment: Consumers are increasingly willing to experiment, reliant on technology, inclined to integrate digital and physical experiences, able to handle large volumes of information, and determined to create the best experiences for themselves.

At one end of the spectrum are Progressive Pioneers, who rapidly evolve and feel most empowered; at the other end we find Reserved Resisters, who are more wary of change and innovation. While the segments are globally consistent and apply across markets, we see significant differences when comparing countries. Our analysis of metropolitan online adults in Australia found that a third of them fall into the most empowered segments — Progressive Pioneers and Savvy Seekers. Highly empowered customers will switch companies to find new and exciting experiences. In this environment, being customer-obsessed and constantly innovating are the only ways to remain competitive.

Organisations in Australia understand this new environment and have started leveraging digital technologies to better engage and serve their B2C and/or B2B empowered customers. While important, most of these investments remain cosmetic in nature. Being customer obsessed requires much more than a refreshed user experience on a mobile app. It requires an operational reboot. To date, few organisations have started the hard transformation work of making their internal operations more agile in service of these customers. To win the business of these empowered customers, digital initiatives in 2017 will have to move from tactical, short term initiatives to broader and deeper functional transformation programs.

Being customer obsessed requires much more than a refreshed user experience on a mobile app. It requires an operational reboot. 

Customer obsession requires harnessing every employee, every customer data point, and every policy in the organisation. Eventually, companies will have to assess and address six key operational levers — technology, structure, culture, talent, metrics, and processes — derived from the four principles of customer obsession: customer-led, insights-driven, fast, and connected. Done well, customer obsession promises to help your organisation win, serve, and retain customers with exceptional and differentiated customer experiences.

Join Michael Barnes, VP and research director, Forrester at the Chief Customer Officer, Sydney on 29th November, Tuesday where he will speak on how to Transform Marketing Into A Customer-Driven Effort. To find out more, visit 


The Top Predictive Analytics Pitfalls to avoid

Predictive Analytics can yield amazing results.  The lift that can be achieved by basing future decisions from observed patterns in historical events can far outweigh anything that can be achieved by relying on gut-feel or being guided by anecdotal events.  There are numerous examples that demonstrate the possible lift that can be achieved across all possible industries, but a test we did recently in the retail sector showed that applying stable predictive models gave us a five-fold increase in the take-up of the product when compared against a random sample.  Let’s face it, there would not be so much focus on Predictive Analytics and in particular Machine Learning if it was not yielding impressive results.

Teradata ANZ

Service Recovery that Deepens Relationship & Brand Loyalty

Optimising customers’ revenue contribution depends heavily on a company’s ability to deepen and effectively maintain loyalty, along with emotional attachment to its brand. There has been plenty of rhetoric around Customer Experience Management as the strategy to achieve this competitive edge. Yet the fact remains that the vast majority of such initiatives either concentrate solely on cross-sell/up-sell marketing or are Voice-of-Customer (VOC) service-related surveys. These are laudable efforts, but they are unlikely to result in sustainable differentiation when they are not part of a coordinated customer-level dialogue.

What has been proven to deliver a superior business outcome is the ability to engage with “One Voice” when communicating with customers, especially after negative experiences. For example, only a handful of customers would do more business with a company when their complaints remain unresolved. Therefore, at a minimum, these customers should be excluded from promotional marketing until after satisfactory resolution. Ideally a system should be in place to automatically replace a cross-sell message with a service recovery one for the affected customers.

According to BCG[1] , regardless of the channel they started in, most consumers would seek human assistance (usually via telephone) if they do not get their problem resolved. There would already be a degree of annoyance right from the start of such service calls, especially if customers have to retell background from the beginning. Ideally the service rep should already know the service failures and breakpoints from an earlier interaction. The key capabilities differentiator here is data integration for just-in-time analytics specific to each customer’s context. Even better is to avoid the negative experience in the first place – i.e. develop the capabilities to detect and predict potential servicing and quality issues that erode customer satisfaction.

A company that only proactively contacts customers to sell will very likely condition more and more recipients to switch off and disengage from these communications. A different approach is needed to succeed with a Customer Experience Management strategy. In order to inculcate a customer-centric mindset and systematically deliver bespoke servicing across the entire customer base, an organisation will need to align processes and introduce new performance metrics (e.g. customer’s share of wallet) to drive the appropriate content of automated communication management capabilities. Successfully deploying service recovery into its broader marketing dialogue would put the company in a good position to take advantage of a sales opportunity as and when it emerges for each customer.



[1] BCG, “Digital Technologies Raise the Stakes in Customer Service”, May-16

The post Service Recovery that Deepens Relationship & Brand Loyalty appeared first on International Blog.


November 23, 2016

Curt Monash

DBAs of the future

After a July visit to DataStax, I wrote The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things. Handle the database...


Curt Monash

MongoDB 3.4 and “multimodel” query

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the...

Data Digest

Is Latin America really ready for Big Data and Analytics?

When one tries to understand the rapidly evolving market of Big Data and analytics globally, there is always a tendency to compare the most advanced markets with those that are currently at the beginning of their data and analytics journey, such as Latin America. In our research, though, it was clear that there is a much greater level of expertise and knowledge than is perceived from the outside.

We recently interviewed Giulianna Carranza (GC) of Yanbal, David Delgado (DD) of Banorte, and Victor Barrera (VB) of AXA Seguros, three of the most prominent CDOs in Latin America. During our conversations, we explored their role, competencies and future predictions about the Big Data and Analytics market in the region.

Corinium: What are the top qualities that any CDO in Latin America must have?

GC: The ideal CDO must have fundamental skillsets to drive forward any business strategy within an organisation in Latin America. For instance, he/she must possess innovation leadership, tactical acumen, technological knowledge and strategic thinking in order to promote the adoption of new ideas and processes around data. This executive must also have the ability to guide the members of his/her team and other corporate silos to build solid strategies based on past and present data stories as well as future goals.

DD: A CDO must have a complete and integral business acumen and vision. He/she also needs to be an integrator, business process analyst and a great strategist.

VB: Any CDO in Latin America must have and promote strategic values such as empathy, curiosity and the agility to drive forward any data strategy. This executive also needs to learn and understand how to build a solid data enterprise aligned with the organisation’s values and future objectives, to really lead a data-driven transformation from within; not only from an IT but also from a business perspective. Finally, it is imperative for a CDO to have exceptional leadership skills to engage all silos, including IT, and functions around a clear idea of leveraging data as a vital asset.


Corinium: What recommendations would you give to any professional seeking this career path?

GC: There is some hard stuff to be considered by any future CDO, in terms of a simple “professional self-assessment”: (1) complete knowledge of the organisation’s operative structure and behaviour; (2) sufficient theoretical and practical knowledge of the latest technological trends and developments in both digital and analytics tools; (3) enjoying working out of their “comfort zone” and seeing the professional path of a CDO as a journey of continuous development and improvement; and finally, (4) passion and commitment.

DD: The future CDO needs great analytical skills, the ability to break traditional paradigms about customer and brand interactions to recognise their evolution through data, and an innately inquisitive mind to understand the complex processes and problems around data and transform them into actionable, easy-to-apply business insights.

VB: As I mentioned in the previous point, any CDO requires a great deal of empathy, curiosity and adaptability. Any “data-driven transformation” must be understood as a fundamental change of paradigms at the basic corporate structure. What’s more, it is vital for any CDO in the region to have a strategic thought-leadership mentality and the knowledge of how to build a solid data strategy, so as to develop a consistent corporate transformation within the foreseeable future, instead of seeking a swift technological adaptation to a rapidly changing market. In short, the CDO needs to be a leader, able to encourage both the IT and BI teams to work together, with the same goal in mind, to properly deliver a consistent data-driven transformation.

Corinium: What are the biggest challenges that any CDO will face in the next few years in the region?

DD: Firstly, how to cope with the velocity at which data is produced locally, and all the administrative challenges this implies for generating business insight. Also, the complexity of adopting new data management platforms, such as Hadoop, and the workforce training needed to implement them.


GC: Culturally speaking, Latin American corporations should implement working models that allow organisations to properly follow what the data indicates. We also have significant gaps in professional skillsets, so we need to deploy a fully new organisational model around analytics in LATAM, including Learning & Development, to properly exploit existing and future data.

We also need to change our traditional “ad hoc technology” approach of a “unique-analysis-based platform” and adopt more tailored solutions that support different silos in performing concordantly with a data strategy (there are obvious related expenses, but they must be considered to achieve real business outcomes).

It is very important to consolidate the organisational positioning of CDOs in the region. When anyone asks me, “Will the CDO be the next CEO?”, my answer is always: Yes! It is the next natural step!

VB: Change management will be the biggest. It may sound simple, but if all related tasks are not carried out as required, any Data Strategy will fail.

Corinium: How do you see the current Big Data and Analytics market in Latin America and what predictions could you make for it in the next 4 years?

DD: It is a market that has developed rapidly over the last couple of years in terms of conceptualisation and initial implementation in the region. Clearly, it will grow exponentially now that local corporations have recognised how vital it is to embrace the digital era, and everything it represents, to understand customer behaviour, data collection, data monetisation and data administration. More organisations will deploy Big Data tools, and faster, to gain a competitive advantage.

GC: I believe there will be a substantial consolidation of Small Data, Data Intelligence and Digital Analytics, amongst many others, in the region, and that they will generate a fundamental transformation of the traditional management structure. Real-time analytics concepts will represent an important challenge for current and future CDOs; Gartner, for instance, is now talking about hybrid operational models for this, combining on-going detail with consolidated views.

From a vendor perspective, I believe they need to adopt a more consultancy-style approach, rather than simply selling products, as they have the opportunity to cement integral technological foundations to support CDOs’ performance and initiatives in Latin America.

VB: From my perspective, Latin America is not yet ready for Analytics and Big Data. Some fundamentals must be achieved first: (1) uniformity in data integration; (2) identification of real business questions; (3) tools to identify how data, both internal and external, will help answer those questions; and (4) development of statistical models to answer them. It is currently possible to apply some analytical tools to find specific solutions, but not from a wider organisational perspective, given that the data is not yet integrated in a single format.


To hear from these experts and many more, join us at Chief Data & Analytics Officer, Central America 2017. Taking place on January 24-25, 2017 in Mexico City, CDAO Central America is the most important gathering of CDOs and CAOs in the region, furthering and promoting the dialogue around data and analytics and their untapped potential. For more information, please visit:


November 22, 2016

Big Data University

This Week in Data Science (November 22, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

  • Apache Spark – Hands-on Session – Come join speakers Matt McInnis and Sepi Seifzadeh, Data Scientists from IBM Canada as they guide the group through three hands-on exercises using IBM’s new Data Science Experience to leverage Apache Spark!

The post This Week in Data Science (November 22, 2016) appeared first on Big Data University.

Silicon Valley Data Science

Embracing Experimentation at AstroHackWeek 2016

There is something very freeing about experimentation—the ability to fail without fear, and move on. At SVDS, we encourage experimenting as part of our agile practices.

The last week of August, I saw experimentation in action at AstroHackWeek 2016, which is billed as a week-long summer school, unconference, and hackathon. It is part of the Moore-Sloan Data Science initiative being sponsored by three universities: UC Berkeley, New York University, and the University of Washington.


While the first half of each day of AstroHackWeek was spent on more traditional lectures, I would like to focus on the “hacks.” Operating as a form of hackathon, the hacks were well organized and structured.

The first day after lunch, everyone participating in the unconference stood up, introduced themselves, and either proposed a hack idea or mentioned some skills they had that others might find useful for their own hacks. While the event and attendees were rooted in astrophysics, everyone was encouraged to explore any idea they liked, which helped contribute to that atmosphere of being open to the unexpected.

Then, everyone simply self-organized and worked on whatever projects they found interesting. Each hack group ended the day by giving a short recap of what was accomplished, what failed to work, and any calls for help. I want to note that second piece: the encouragement to speak to what failed, or to ask others for help, created a very positive environment for trying new and difficult projects.

Structuring experimentation

I found it striking how important having structure around the freeform process of proposing and working on “hacks” was to the success of the week. This rapid reassessment and evaluation of each of the hacks, and the deliberate calling out of what did not work and why, reminded me of the daily standups in our agile data science projects at SVDS. During a project, we meet in the morning to quickly mention what happened yesterday, what is planned for today, and whether anything is blocking progress.

Engagement from the whole group at the unconference was incredibly high: people stayed late every night, at locations ranging from bars to restaurants to nearby houses, to keep working on different hacks. Hacks ranged from creating a “Made At AstroHackWeek” GitHub badge in ten minutes to an analysis investigating exoplanet periods from sparse radial velocity measurements (the latter is currently being written up for submission to a journal).

You can find the full list of hacks here (and all the materials here), but I’ll link to a couple of my favorites:

Concluding thoughts

Experimentation is one of the engines that drives scientific inquiry. The rapid turnaround on hack projects throughout AstroHackWeek was of a different kind than is typical in academia, and felt more similar to an agile project. The freedom to fail, the ability to iterate quickly, and the cross-pollination among researchers from different astrophysics disciplines made for a powerful and productive week.

What have you experimented with lately? Let us know in the comments, or check out our agile build case study with

The post Embracing Experimentation at AstroHackWeek 2016 appeared first on Silicon Valley Data Science.

The Data Lab

Bizvento - Knowledge Extraction for Business Opportunities

Since 2013, Glasgow-based Bizvento has been developing innovative business software for the event management industry. The Scottish start-up has built a mobile software platform specifically for professional event organisers that lets them manage all aspects of an event in one place. It also provides real-time data analysis capability, useful to event organisers.

After establishing a successful business in providing mobile solutions, Bizvento realised the potential value of the data gathered by its apps at events in the higher and further education sectors, specifically at college and university open days. The app is designed to provide information to prospective students at these events around available courses and programmes. Based on this information, students then have the opportunity to attend talks and information sessions about what’s on offer. This information is recorded in the app and analysed by Bizvento.

The introduction of tuition fees in the UK has created a £20 billion market for the higher and further education sector. Bizvento saw the potential to offer universities and colleges across the country the ability to predict and forecast the number of prospective student applicants using reliable data.

To do this, Bizvento created project KEBOP (Knowledge Extraction for Business Opportunities) and approached The Data Lab for partnership, strategic support and grant funding to realise this opportunity. The Data Lab then facilitated the academic partnership between Bizvento and the University of Glasgow based on the data analysis requirements. 

KEBOP is made up of a suite of sophisticated analysis tools capable of extracting actionable information from two main sources, namely the usage logs of the Bizvento technologies and the registration data of the Bizvento users. 

In the case of the usage logs, the analysis tools adopt information theory to model the behaviour of the users and to identify classes of usage. In particular, the analysis tools manage to discriminate between average usage patterns (those adopted most frequently by the users), peculiar usage patterns (those that appear less frequently while being correct) and wrong usage patterns (those that correspond to incorrect usages of the app). Furthermore, the tools allow the analysis of user engagement as a function of time. This has shown that usage dynamics tend to change abruptly at specific points of time rather than continuously over long periods.
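Bizvento's actual models are not public, but the classification described above can be sketched in information-theoretic terms: score each distinct usage pattern by its surprisal (how many bits of information its occurrence carries) and bucket accordingly. Everything here is a hypothetical illustration under stated assumptions: the function name, the threshold, and the representation of a pattern as a tuple of screens are all invented, not KEBOP's real design.

```python
import math
from collections import Counter

def classify_patterns(logs, valid_patterns, rare_threshold_bits=4.0):
    """Bucket usage patterns by how surprising they are.

    logs: list of observed usage patterns (here, tuples of screen names).
    valid_patterns: set of patterns considered correct usage of the app.
    Returns a dict mapping each distinct pattern to
    'average', 'peculiar', or 'wrong'.
    """
    counts = Counter(logs)
    total = len(logs)
    labels = {}
    for pattern, n in counts.items():
        if pattern not in valid_patterns:
            # Incorrect usage of the app, regardless of frequency.
            labels[pattern] = "wrong"
            continue
        # Surprisal in bits: rare-but-correct patterns carry more information.
        surprisal = -math.log2(n / total)
        labels[pattern] = "peculiar" if surprisal > rare_threshold_bits else "average"
    return labels
```

A pattern seen in most sessions lands in "average", a correct but rare one in "peculiar", and anything outside the valid set in "wrong", mirroring the three classes described above.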

In the case of the registration data, the key point of the KEBOP approach is the integration of basic information provided by the users (name and postcode) with publicly available data about sociologically relevant information (gender, status, education level, etc.).
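The enrichment step amounts to a join between minimal registration records and area-level open data keyed on postcode. The sketch below is purely illustrative: the field names, district codes, and statistic values are invented, and the outward-code join is an assumption about granularity, not Bizvento's actual schema.

```python
# Illustrative area-level statistics keyed by outward postcode code,
# e.g. as could be derived from open census data (values invented).
area_stats = {
    "G12": {"education_level": "degree", "unemployment_rate": 0.04},
    "G21": {"education_level": "school", "unemployment_rate": 0.11},
}

def enrich(registrations, area_stats):
    """Merge each registration record with statistics for its postcode area."""
    enriched = []
    for rec in registrations:
        # Outward code: "G12" from "G12 8QQ"; unknown areas add nothing.
        district = rec["postcode"].split()[0]
        enriched.append({**rec, **area_stats.get(district, {})})
    return enriched
```

After enrichment, each record carries the area-level variables (education level, unemployment rate, and so on) that the analysis can then relate to event participation.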

The main outcome of KEBOP is a suite of data analysis technologies capable of making sense of the digital traces left by the users of Bizvento products at academic open days. 

The usage logs, and specifically the registration data, captured and analysed by KEBOP technologies identify the main factors that determine participation in a large-scale event, not only in terms of the chances of participating in the event but also in terms of preferences for different aspects of the event itself (e.g., the choice of specific sessions in a conference). Experiments performed over data collected at the Open Days of the University of Glasgow show that the most important factors underlying the participation of prospective students in the Open Days are education level, unemployment rate and average income in the area where an individual lives. Building on this, the analysis tools also show the interplay between gender, social status and the choice of subject of study.

Through the analysis of rich data sets, Bizvento can reliably predict the number of prospective student applicants to universities and colleges throughout the UK. This information can be used by higher and further education institutions to inform their student recruitment processes and forecast levels of applications and interest in specific courses and programmes.


November 21, 2016

Jean Francois Puget

The Machine Learning Workflow



I have been giving two talks recently on the machine learning workflow, discussing pain points within it and how we might address them. The first was at Spark Summit Europe in Brussels, the other at MLConf in San Francisco.

You can find videos and slides for each below. The main message is that the machine learning workflow is not that simple.


MLConf, San Francisco

That was a great event.  I was in very good company with top presenters from a number of prominent companies, as you can see from the speakers page.  One key takeaway (not a surprise for me) is that machine learning is not all about deep learning.  Sure, deep learning is used, but other techniques such as factorization machines and gradient boosted decision trees play a significant role in some very visible applications of machine learning as well. 

I encourage readers to take a look at the videos of MLConf presentations. Here is information about mine:

My Abstract:

Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding or hire a PhD who will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday’s data may not be the best option, and new data sources and types become available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to the local optimization methods favored by typical machine learning applications, and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.

Watch the presentation on YouTube

See the slides on SlideShare

Spark Summit Meetup, Brussels

At the recent sold-out Spark & Machine Learning Meetup in Brussels, I teamed up with Nick Pentreath of the Spark Technology Center  to deliver the main talk of the meetup: Creating an end-to-end Recommender System with Apache Spark and Elasticsearch.

Nick did most of the talk, presenting how to build a recommender system. I talked for the last 10-15 minutes, discussing the machine learning workflow and typical pain points within it.

Watch the presentation on YouTube

See the slides on SlideShare

The Data Lab

Scottish Knowledge Exchange Awards 2017


Interface, which matches businesses to world-leading Scottish universities and research institutes for research and development, is calling for entries in five categories. They are:

  1. Innovation of the Year: for the development of an innovative product, process or service;
  2. Sustained Partnership: for a collaboration that demonstrates a long term partnership taking an initial project from a transactional to strategic relationship;
  3. Outstanding Contribution to Knowledge Exchange: recognising an individual from business or Higher Education who has played a pivotal role in knowledge exchange within Scotland;
  4. Multi-Party Collaboration: for groups of three or more parties working in collaboration on an innovative research project to solve a common challenge;
  5. Building Skills Through Knowledge Exchange: recognising postgraduate students or Knowledge Transfer Partnership (KTP) Associates who have worked within a business to increase innovation.

Alastair Sim, Director of Universities Scotland, said: 

“Universities are committed to making their knowledge, expertise and facilities have an impact in the world and working with businesses and other organisations on innovation is one of the ways they can do that. Interface has been at the heart of creating many hundreds of innovative connections between business and academia and is a much-valued partner to the higher education sector. The outcomes from previous winners are impressive; new products and processes are created, businesses grow and expand into new markets, additional funding and investment is leveraged and universities learn so much from the process. I look forward to being inspired, once again, by this year’s winners.”

Siobhán Jordan, Director of Interface, said:

“After the success of last year’s inaugural awards, we are anticipating a high standard of entries from business-academic partnerships this year. Five Scottish universities are ranked among the world’s top 200, so there has never been a better time for Scottish businesses to tap into that academic expertise. Entries to the awards can also be a good foundation for academics to develop Research Excellence Framework impact case studies, for assessing the research excellence and impact of Higher Education Institutions.”

Dr Stuart Fancey, Director of Research and Innovation at the Scottish Funding Council, said:

“Innovation is vital to the future of Scotland and collaborations between academics and businesses are an essential part of that. The Scottish Knowledge Exchange Awards are a fantastic way to recognise and reward these partnerships and achievements.”

Susan Fouquier, Regional Managing Director for Royal Bank of Scotland in Scotland, said: 

“We are delighted to once again support the Scottish Knowledge Exchange Awards. It offers a fantastic opportunity to showcase the great work being carried out here by the country’s academic and business communities. It is a real boost and inspiration to the organisations operating here and brings to light the need for strong relationships between the public and private sectors and the need for creating an eco-system which allows companies to flourish. Thanks to our support of accelerator hubs such as Entrepreneurial Spark and the development of our own nationwide network of Business Growth Enablers, we understand the importance of such frameworks and why support at all levels is crucial for businesses to grow. We look forward to welcoming the applicants to the awards showcase and ceremony in February.”


The deadline for entries is 5pm on Friday 2 December 2016 and the winners will be announced at the Scottish Knowledge Exchange Awards at RBS Gogarburn Conference Centre, Edinburgh, on Tuesday 21 February, 2017.

For more information on the awards, please visit


Interface connects businesses from all sectors to Scotland’s 23 universities and research institutions.  It is a unique service designed to address the growing demand from organisations and businesses which want to engage with academia. Companies supported by Interface inject an estimated £70 million into the economy annually through their partnerships with academics.  Funded by the Scottish Funding Council, Scottish Enterprise and Highlands and Islands Enterprise, Interface is a free and impartial service which aims to stimulate demand for innovation and encourage companies to consider academic support to help solve their business challenges. Interface helps companies operating across a range of sectors, from food and drink to financial services, offering huge benefits for both businesses and academia.

For more information please contact:



November 19, 2016

Revolution Analytics

A heat map of Divvy bike riders in Chicago

Chicago's a great city for a bike-sharing service. It's pretty flat, and there are lots of wide roads with cycle lanes. I love Divvy and use it all the time. Not so much in the winter though: it gets...


November 18, 2016

Revolution Analytics

Tutorial: Build a live rental prediction service with SQL Server R Services

A great way to learn is by doing, so if you've been thinking about how to enable R-based computations within SQL Server, a new tutorial will take you through all the steps of building an intelligent...


Revolution Analytics

Happy Thanksgiving!

It's Thanksgiving day here in the US, so we're taking the rest of the week off to reflect on what we're thankful for. And even if you're not in the US, today is a great day to send thanks to the R...

Teradata ANZ

Managed Cloud – Offering Real Business Benefits for Organisations Globally


Cloud is rapidly becoming the standard way of doing business and organisations globally are utilising it as a tool for innovation and business transformation. Those who successfully use the cloud to achieve growth will have a mature, strategic view of how best to implement and integrate it across their organisations.

As cloud strategies mature and the business benefits of implementing cloud throughout the organisation become clear, hybrid cloud has emerged as the consensus choice to support business growth. Nearly half of enterprises globally already use some form of hybrid cloud and 72% of enterprises are expected to pursue a hybrid strategy (*1). Hybrid cloud solutions make it easy to deploy new business models and technologies like cognitive analytics, which have the power to transform businesses.

Managed cloud, as part of a wider hybrid cloud strategy, allows organisations to utilise cloud computing without having to employ an expert in every area. Companies that use managed cloud can focus on their core business rather than diverting their cash reserves to employing large teams of IT experts, technical engineers, system administrators and other specialists to manage their IT.

A managed cloud provider will offer its customers a range of expertise as well as large economies of scale as the provider’s engineers manage not only the customers’ computing, storage, networks, and operating systems, but also the complex tools and application stacks that run on top of that infrastructure. These can include the latest databases and ecommerce platforms, as well as automation tools. Managed cloud allows each individual customer to choose which IT functions it wishes to manage in-house, leaving all the rest to its chosen service provider.

By partnering with a managed cloud provider, specialists can work with organisations to design and tailor an architecture specific to a customer’s application needs. The provider will also update an organisational architecture on an on-going basis as its requirements evolve and as new features and cloud services become available.

The provider should be able to offer services across a broad range of technologies and deployment models — including dedicated hosting, private cloud platforms like OpenStack and leading public clouds like Amazon Web Services and Microsoft Azure. The combination of choice and expertise means that the provider will deliver an architecture designed to meet an organisation’s application’s specific performance, availability and scalability requirements while eliminating the need for them to retain costly architects in-house.

According to IDC, by 2018, cloud will become a preferred delivery mechanism for analytics, increasing public information consumption by 150% and paving the way for thousands of new industry applications (*1). New industry applications mean more data will be created and with this comes the challenges around the management of data. Data has little value if it is not available to be analysed and used to help grow the business.

From a tactical point of view, the challenge is to find the best way to ensure data is stored, managed and analysed without incurring expensive overheads, and in a scalable way that allows for rapid future growth. A managed cloud infrastructure combined with a powerful database that has the speed and flexibility to deploy complex analytics can address all these concerns and help an organisation quickly innovate and build new analytical applications.

The post Managed Cloud – Offering Real Business Benefits for Organisations Globally appeared first on International Blog.

Revolution Analytics

Because it's Friday: Non-transitive dice

I saw a fascinating talk from Christopher Bishop, author of Pattern Recognition and Machine Learning at MLADS (Microsoft's internal machine learning and data science conference) yesterday. He...


Revolution Analytics

Notable New and Updated R packages (to October 2016)

As we prepare for the upcoming release of Microsoft R Open, I've been preparing the list of new and updated packages for the spotlights page. This involves scanning the CRANberries feed (with...


November 17, 2016

Revolution Analytics

Calculating AUC: the area under a ROC Curve

by Bob Horton, Microsoft Senior Data Scientist Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier....
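The full tutorial works through the R code; as a quick, language-neutral sketch of the quantity itself, AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the Mann-Whitney formulation, with ties counting half). The function name and inputs below are illustrative, not from the post:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 1 for positive, 0 for negative; scores: classifier outputs.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a random one about 0.5; this pairwise count agrees with the area under the ROC curve computed by the trapezoidal rule.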

Silicon Valley Data Science

The Venn Diagram of Data Strategy

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

The business community and the technical community can sometimes seem like they live on the opposite sides of the planet—or at least opposite ends of the hallway. When it comes to data strategy, many people read the “data” part and automatically dump the topic in the “technical” bucket. It can be a struggle to adjust that thinking, but it’s a critical thing to do.

Ultimately, what is needed is not to reassign the topic, but to merge the buckets.


Data strategy matters to both business and tech. It’s a problem that sits in the center of a Venn diagram, and if we get stuck thinking of those two domains as existing solely in completely separate silos, we’ll lock ourselves out of that key middle ground where the really important problems get solved.

From tech to business and back again

My background means that the circles I run in have largely existed in the technical community. These are wonderfully smart, curious, and capable people. But the tech community often does things simply because they are possible and interesting. For both better and worse, the merit of a given pursuit is not always tied in their minds to the needs of the business.

At SVDS, we spend a lot of our time thinking about data strategy: how to start with a set of business objectives, distill those into a batch of use cases, and translate from there into the technical capabilities necessary to support the overall plan. The tools and techniques that techies love to geek out about only start to matter once you know which technical capabilities you need—an order of operations not always familiar in the land of debates over Python vs. R and Hadoop vs. Cassandra.

The emphasis on data serving the practical needs of a business is one of the reasons why I love working at SVDS. We see a bigger context than many other data science consulting shops, and I love being part of that: if there’s one thing I’ve learned, from data visualization to communications to just about anything else, it’s that context matters.

There are lots of other people who get it, too: in particular, Carl Anderson with his book “Creating a Data Driven Organization.” His central thesis is that a data-driven culture rests on a commitment to every step of the data value chain, from collection through analysis and, crucially, ending in action. I’ve been happy to see more and more data scientists take on the mission to preach the importance of business value to the tech community.

The zen of the Venn

At the CDO Vision summit at Enterprise Data World 2015, a panel discussion on data strategy echoed many of these thoughts. Daniel Paolini, CIO of the Philadelphia Department of Behavioral Health, addressed this by saying, “If you can’t speak the language of your business; if you don’t know what they mean when they say a term; if you don’t understand what their objectives are as a business: you cannot be successful at this.” I was busy nodding my head in agreement when the discussion portion started. But then one of the executives in the room expressed an opinion that caught my attention; he said, “the business has abdicated its role by throwing it over the fence to IT and saying, ‘Talk to my DBA.'”

I had almost forgotten that the onus rests on both sides. A woman sitting at the next table replied, “I’m really glad you pointed that out, because I’ve been critical of IT for having responsibility for something they can’t do well, but it’s not like the business side is stepping up.”

Another attendee added to this: “Increasingly, information strategy and business strategy are the same thing. Especially in customer-facing industries, that is increasingly the norm. We keep talking about data people needing to understand what the business is all about. I’m from the business side and I think businesses do not understand how data is changing their competitive landscapes and opportunities, so I think it’s the other way around: business needs to be more aware of data.”

Becoming bilingual

Part of the ability to inhabit any middle ground is learning to become bilingual. Think of it as a trip to Montreal: some people living there will be native speakers of one language, and some will be native speakers of the other—but everyone is surrounded by the words and sounds of both, to an extent that they can at least understand, if not speak, the language of their neighbors.

And it turns out, we all have a lot to learn. As a woman sitting in front of me at CDO Vision pointed out, “Whether you’re business or technology, I’m not sure people know what it takes to get the data where it needs to go. People are so used to opening their phones and getting access to this instant magical asset, that they don’t appreciate what’s involved in exposing data inside the enterprise.”

Those in both the business circle and the technology circle would do well to listen to each other, to consider each other’s priorities, and to embrace the middle of the Venn diagram. It can be uncomfortable or even jarring to live in a bilingual place; there can be misunderstandings. Ultimately, however, both business and technology bear the responsibility—and the potential rewards—of working together for the greater good.

We’ll be talking about data strategy next month at Strata + Hadoop World Singapore. Check out our talks, and sign up for slides, here

The post The Venn Diagram of Data Strategy appeared first on Silicon Valley Data Science.

The Data Lab

Call for projects - Cyber Security

Cybersecurity Call for Projects

IoT expansion and interconnectivity provide increasing opportunities for cyber-attack, with devices like drones, driverless cars and even military weaponry all subject to attack. Establishing cyber security policies and best practices provides organisations with a framework to manage cyber-attacks; however, to implement these frameworks, businesses and organisations require new and innovative developments in Cyber Security.

In a survey of 100 business managers by BAE Systems, 57 per cent said that their organisations had received a cyber-attack in the past 12 months, costing companies an average of £300,000 (for one in 10, up to £1m). In Scotland, cybercrime is estimated to cost the economy around £3 billion per year, affecting individuals, private sector and public sector organisations. (1)

Consequently, it’s no surprise that investment in Cyber Security is at an all-time high: in 2015 the global cybersecurity market was valued at $78 billion and is projected to be worth $120 billion by 2017. The UK market makes up a significant part of this and is currently worth well over $5 billion. (2)

Acknowledging this fact, The Data Lab, in conjunction with CENSIS, is launching a sector-specific call for projects targeting the Cyber security services community in Scotland. We are looking for innovative and collaborative projects between an industry and/or public sector lead organisation and one or more academic partners. Projects must demonstrate a clear economic or social benefit for Scotland, and focus on a Cyber security solution, product or service.

Specific areas of interest include:

  1. Intrusion detection - using data science to monitor networks
  2. Blockchain - particularly methods of using the technology in public sector and private business
  3. Insider threat - data science methods of mitigating
  4. Malware analysis - methods of speeding up analysis and categorisation
  5. Data Science applications in the digital domain for Police and security services.
  6. Internet of things security – methods to secure IoT products with respect to various aspects including (but not limited to), secure booting, access control, device authentication, firewalling and firmware updates.  

Martin Beaton, Cyber Security Network Integrator for Scotland notes “The information security industry is estimated to grow by £70 billion in the next 3 years yet due to the rate of evolution of the arms race between attackers and defenders many of the technologies for that market are yet to be invented. Scotland needs to innovate to take advantage of this massive market and data science will underpin many of the new products which is why it is fantastic that DataLab and Censis are running this call bringing together fundamental science and industry."

A variety of funding models are available for projects that meet the criteria, and applicants should note that a closing date of 1st February 2017 has been set for applications.

For more information on the Cyber security call for projects and how to apply, contact us at



Jean Francois Puget

What's New In Machine Learning?

What has changed in Machine Learning in the past 25 years?  You may not care about this question.  You may not even realize that Machine Learning as a technical and scientific field is older than 25 years.  But I do care about this question.  I care because I got a PhD in Machine Learning in 1990.  I then took a side track to work on constraint programming and mathematical optimization.  I have been back in machine learning for a couple of years, and I asked myself this: is my PhD still relevant, or has Machine Learning changed so much that I need to learn it again from scratch?


To cut a long story short: yes, my PhD isn't relevant anymore.  The topics I was working on, namely symbolic machine learning and inductive logic programming, are not very popular these days. 

Does it mean I needed to learn Machine Learning from scratch?  Fortunately for me, the answer to that question was no.


There are two reasons for this:

1. Machine Learning technology hasn't changed that much since my PhD

2. When it has changed, it has moved closer to Mathematical Optimization, something I am quite familiar with.

Let me expand a bit on these two items.

Machine Learning technology hasn't changed that much since 1990.

This may surprise you, because Machine Learning has only made headlines in the last few years.  Something must have changed recently. 

The most visible change is that Machine Learning is now used in many industries, whereas it wasn't used outside academia in 1990.  A great example is recommender systems, like the one Netflix uses to recommend movies to watch, or the ones used by e-commerce sites like Amazon to suggest products to buy. These recommender systems learn from how people decide to watch a movie or buy a product.  Recommendations aren't manually coded by an army of developers.
Is this recent adoption linked to a breakthrough in machine learning algorithms?  I don't think so. 

What about Deep Learning then?  This is probably the first thought that comes to mind when you read someone saying that machine learning technology hasn't changed that much in 25 years or so.

Truth is that most of the algorithms used in Deep Learning are as old as my PhD.  Stochastic Gradient Descent, which is at the heart of how Deep Learning models are trained, was developed in the 80s by Léon Bottou.  Convolutional networks, which led to breakthroughs in image recognition, were first used at about the same time.  These were not called deep learning; they were called connectionism.  But they were already about neural networks and how to train them. 

What has changed is the ability to scale, i.e. to train neural networks far more efficiently via the use of GPUs.  A second change is the availability of ImageNet, a very large collection of annotated images.  With faster neural networks and lots of examples to learn from, breakthroughs in image recognition became possible.  Deep learning is now a vibrant topic, with innovations coming out every year, but let's not forget that the basics are not recent.
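
As an aside, stochastic gradient descent itself is simple enough to sketch in a few lines.  The toy example below (synthetic data, learning rate and epoch count all illustrative, not from any particular paper) fits a one-variable linear model y ≈ w*x + b by stepping the parameters against the gradient of the squared error of one example at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0   # model parameters, started at zero
lr = 0.1          # learning rate
for epoch in range(100):
    for i in rng.permutation(len(X)):      # visit examples in random order
        err = (w * X[i, 0] + b) - y[i]     # prediction error on one example
        # gradient of the squared error for this single example
        w -= lr * err * X[i, 0]
        b -= lr * err

print(w, b)  # close to the true values 3.0 and 1.0
```

The same update rule, applied to millions of parameters instead of two, is what trains deep networks.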

There are other major changes in Machine Learning technology that aren't as visible as Deep Learning, but that are worth mentioning.  Let me pick two.

Factorization machines are the technology of choice for recommender systems.  They build customer profiles, product profiles, and a model of how these interact, all at the same time.  One can then use these profiles to recommend products.  I will blog soon about how this works, given how neat it is.  In the meantime, interested readers should read Steffen Rendle's paper and also the Field-Aware Factorization Machines paper
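
A full factorization machine is beyond a few lines, but the core idea — learning user and product profiles whose interaction predicts a rating — can be sketched with plain matrix factorization trained by stochastic gradient descent.  The ratings, latent dimension and learning rates below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy ratings matrix: rows are users, columns are products, 0 = unobserved
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 1, 5]], dtype=float)
observed = np.argwhere(R > 0)

k = 2                              # latent profile dimension
U = 0.1 * rng.normal(size=(3, k))  # user profiles
P = 0.1 * rng.normal(size=(3, k))  # product profiles

lr, reg = 0.05, 0.01
for epoch in range(2000):
    for u, p in observed:
        err = float(, P[p])) - R[u, p]
        # gradient of the squared error, with a small L2 penalty on profiles
        u_grad = err * P[p] + reg * U[u]
        p_grad = err * U[u] + reg * P[p]
        U[u] -= lr * u_grad
        P[p] -= lr * p_grad

# Predict the unobserved rating of user 0 for product 2
print(, P[2]))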

Another significant advance is boosting applied to decision trees.  Decision trees aren't a recent idea; algorithms like CART and C4.5 were already around when I got my PhD.  However, gradient boosting machines, or gradient boosted decision trees, are recent; see for instance the xgboost implementation for more details.  What is new here is the use of an optimization algorithm to construct an ensemble of decision trees with unprecedented accuracy.  Gradient boosted decision trees are now the tool of choice for winning machine learning competitions.


This leads me to the second reason why I didn't need to learn Machine Learning from scratch again.

Machine Learning algorithms are optimization algorithms

Gradient boosted decision trees use an optimization algorithm to select the best possible ensemble of trees. Factorization machines also use optimization algorithms to best approximate how users and products interact.   Deep Learning uses an optimization algorithm like stochastic gradient descent to train models.  As a matter of fact, most, if not all, state-of-the-art machine learning algorithms are solving an optimization problem. 

Basically, a machine learning algorithm selects, among a family of possible models, the model that minimizes the prediction error.  A model is better if its prediction accuracy is better.  Often we also insist on having a model that is as simple as possible; see Overfitting In Machine Learning for some examples. We therefore look for the model that minimizes a combination of prediction error and model simplicity.

A machine learning problem is then defined by selecting:

  • a family of models,
  • a definition of the prediction error,
  • a definition of model complexity.

Then you can use your favorite optimization algorithm to solve it. 

Let me give an example as this is a bit abstract.

Linear regression is probably the simplest and most widely used machine learning algorithm. By the way, this exemplifies another major trend of the past 25 years: Machine Learning and Statistics are far closer now than they were 25 years ago.  Indeed, Linear Regression is a statistical method.  It is now seen as a Machine Learning algorithm too.  Some even say that Machine Learning should be called Statistical Machine Learning.

Back to our topic: the machine learning problem is to predict a quantity y from a set of quantities x, given a series of examples of the form

y_i ≃ f(x_i1, x_i2, ..., x_in)

  where ≃ means approximately equal to.

Our family of models for f will be linear models, i.e. models where f is of the form:

f(x_1, x_2, ..., x_n) = a_0 + a_1 x_1 + a_2 x_2 + ... + a_n x_n

The model is described by the parameters a_i.


Our prediction error function will be the average of the squared errors, i.e.:

(1/m) sum_i [ y_i - (a_0 + a_1 x_i1 + a_2 x_i2 + ... + a_n x_in) ]^2

where m is the number of examples.


Our measure of model simplicity can be the sum of the absolute values of the parameters a_j:

sum_j | a_j |


When using this measure of model simplicity, we get a variant of the Lasso algorithm.  This algorithm finds the model that minimizes:

(1/m) sum_i [ y_i - (a_0 + a_1 x_i1 + ... + a_n x_in) ]^2 + sum_j | a_j |

This can be turned into an equivalent constrained quadratic minimization problem: minimize

(1/m) sum_i [ y_i - (a_0 + a_1 x_i1 + ... + a_n x_in) ]^2 + sum_j z_j

such that

   a_j ≤ z_j

  -a_j ≤ z_j

This problem is something readers familiar with mathematical optimization solvers can immediately relate to.  However, given that this optimization problem is mostly unconstrained and can be very large, the algorithms available in mathematical optimization solvers may not be the best fit.  Machine learners prefer local methods like Stochastic Gradient Descent, the conjugate gradient method, or L-BFGS, to name a few.
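
In practice, machine learners rarely hand this problem to a general-purpose quadratic solver.  scikit-learn's Lasso estimator, for instance, minimizes this kind of penalized squared error directly via coordinate descent; its alpha parameter weights the sum of |a_j| against the squared error term.  A minimal sketch on synthetic data (feature count, alpha and noise level are all illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# y depends on only two of the five features
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# alpha weights the sum of |a_j| against the squared error
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)
```

Because of the |a_j| penalty, the coefficients of the three irrelevant features are driven to (or very near) zero: the model-simplicity term in action.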

Many more machine learning algorithms can be described as optimization algorithms, see for instance this nice presentation from  Stephen Wright. 

Bottom Line

Well, I guess you now understand why I feel at home with Machine Learning as it is today: it is all about optimization.


Update on Nov 2, 2016. The above view is not just mine.  For instance, Prof. Diego Kuonen pointed me to how he covers the same topic in his "big data analytics" training:



This is really close to what I presented.


November 16, 2016

Revolution Analytics

How to call Cognitive Services APIs with R

Microsoft Cognitive Services is a set of cloud-based machine-intelligence APIs that you can use to extract structured data from complex sources (unstructured text, images, video and audio), and add...

Jean Francois Puget

Installing LightGBM on MacOSX with Python wrapper

There is a new kid in machine learning town: LightGBM.  It is an implementation of gradient boosted decision trees (GBDT) recently open-sourced by Microsoft.  GBDT is a family of machine learning algorithms that combine great predictive power and fast training times.  Interested readers can find a good introduction to how GBDT works here.  

Why does LightGBM matter?  It matters because it is much faster to train than the reference implementation for GBDT (XGBoost).  I learned about LightGBM a few days ago and decided to give it a try.  

Given that LightGBM has been developed at Microsoft, I don't expect any installation issues on Windows.  However, I found the instructions provided for installing LightGBM on MacOSX a bit too simplistic.  You will find below what I did to successfully install LightGBM on my MacBook Pro.  I am using the latest version of Mac OSX, namely Sierra.

Let's start.  We'll follow the Installation Guide provided in the LightGBM GitHub repository.  The first step is to launch a terminal.  We first open the Launchpad by clicking on this icon:


In the Launchpad window, we select Other, then Terminal. 

Once we have a terminal window, we type the following.

brew install cmake

A bad start, as this yields an error:

Error: The following formula:
cannot be installed as a binary package and must be built from source.
To continue, you must install Xcode from the App Store,
or the CLT by running:
  xcode-select --install


Fortunately, the error message explains how to fix it. We need to install some prerequisite software. I wish all error messages were that helpful ...

Let's do it.

xcode-select --install

A window pops up, indicating yet another dependency:



We click on Install, which brings up a licence agreement we agree to.  After some time another window pops up, and a download begins.  The download may take a few minutes, depending on your internet connection.  When it finishes, the software is installed:


After clicking the done button, we can resume our installation with this:

brew install cmake

This time the installation proceeds, with downloads of Python packages and then some compilation.  The overall build takes about 5 minutes on my machine.

That was the first step of the instructions.  The next step is this:

brew install gcc --without-multilib

It so happens that I had already installed gcc, as the output shows:

Warning: gcc-5.3.0 already installed
Warning: You are using OS X 10.12.
We do not provide support for this pre-release version.
You may encounter build failures or other breakages.

However, the warning is a bit suspicious.  It says my OS version isn't supported yet by gcc. 

We need to install a newer version of g++.  Another reason is that the LightGBM build assumes we are using g++-6, whereas we have g++-5 installed.  I found this tutorial useful for installing g++-6 on MacOSX Sierra. Let's follow its instructions.  Beware, this takes at least one hour. 

Once done, we add a soft link in /usr/local/bin to the newly built compiler so that it can be easily invoked:

cd /usr/local/bin

ln -s ../gcc-6.1.0/bin/g++-6.1.0 g++-6

The next step is to download the code for LightGBM.  Before that, let's go to where we want the code to live.  For instance, we can create a code directory in our home directory:

cd ~

mkdir code

cd code

Then we can download the code:

git clone --recursive ; cd LightGBM

The next commands are:

mkdir build ; cd build
cmake -DCMAKE_CXX_COMPILER=g++-6 ..

make -j

This completes fine.  LightGBM should be usable now.  The executable sits at the top of the directory we cloned into, i.e. ~/code/LightGBM in my case.  This command shows it:

cd .. ; ls -l


total 3136
-rw-r--r--   1 JFPuget  staff     636 Nov 15 13:58 CMakeLists.txt
-rw-r--r--   1 JFPuget  staff    1085 Nov 15 13:58 LICENSE
-rw-r--r--   1 JFPuget  staff    2243 Nov 15 13:58
drwxr-xr-x   7 JFPuget  staff     238 Nov 15 20:37 build
drwxr-xr-x   9 JFPuget  staff     306 Nov 15 13:58 examples
drwxr-xr-x   3 JFPuget  staff     102 Nov 15 13:58 include
-rwxr-xr-x   1 JFPuget  staff  833388 Nov 15 20:37
-rwxr-xr-x   1 JFPuget  staff  754996 Nov 15 20:37 lightgbm
drwxr-xr-x  12 JFPuget  staff     408 Nov 15 13:58 src
drwxr-xr-x   3 JFPuget  staff     102 Nov 15 13:58 tests
drwxr-xr-x   5 JFPuget  staff     170 Nov 15 13:58 windows


The executable is lightgbm.  Let's try it using one of the examples provided with the code:

cd examples/binary_classification/

../../lightgbm config=train.conf

This trains lightgbm using the train.conf configuration.  It seems everything worked fine, given the end of the output:

[LightGBM] [Info] 1.067640 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

Using LightGBM via the OS command line is fine, but I much prefer to use it from Python, as I can leverage other tools in that environment.  Fortunately, ArdalanM already provides a Python wrapper for LightGBM on GitHub:

Let's follow the installation instructions.  We just installed the latest LightGBM release.  Next is to install the Python package via the following. 

pip install git+

This will install it in the default Python environment.  I am using Anaconda with Python 3.5.  Your experience may differ with a different Python version or distribution.

Collecting git+
  Cloning to /var/folders/vm/xlswqk1j21l4gt_sm16rj2m80000gn/T/pip-gz3o8m7o-build
Installing collected packages: pyLightGBM
  Running install for pyLightGBM ... done
Successfully installed pyLightGBM-0.2.6

Installation went fine.  Let's try it in a Python script, adapted from Ardalan's example, as I am not using the same scikit-learn version as he is:

import numpy as np
from sklearn import datasets, metrics, cross_validation
from pylightgbm.models import GBMRegressor

# full path to lightgbm executable (on Windows include .exe)
exec_path = "~/code/LightGBM/lightgbm"

diabetes = datasets.load_diabetes()
X =
y =
clf = GBMRegressor(exec_path=exec_path,
                   num_iterations=100, early_stopping_round=10,
                   num_leaves=10, min_data_in_leaf=10)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2), y_train, test_data=[(x_test, y_test)])
print("Mean Square Error: ", metrics.mean_squared_error(y_test, clf.predict(x_test)))

Execution went fine, output ending with:

[LightGBM] [Info] Early stopping at iteration 51, the best iteration round is 41
[LightGBM] [Info] 0.009285 seconds elapsed, finished iteration 51
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading 41 models
[LightGBM] [Info] Finished initializing prediction
[LightGBM] [Info] Finished prediction
Mean Square Error:  2814.02352609

It seems we are good to go!

I hope the above helps readers.  I welcome comments about issues or about installation in other Python environments.





Data Digest

A Closer Look at Customer Experience (CX) in the Digital Age

Social media and digital channels have had a profound impact on the way humans interact with one another, so it's only natural that these mediums have also had significant implications for customer service and for the strategies today's companies use to engage their consumers. With rapid developments in technology, organisations are continually grappling with how to harness the intersection between technology, emotion and brand experience. The advent of the "savvy customer" and the platform provided by social media mean consumers are hyper-aware of the commercial landscape as it pertains to good (and bad) customer experience, as well as of the power they possess to influence brand perception. Organisations that want to continue to deliver on customer experience will now find that they are competing not only with their competitors but also with themselves.

We had the privilege of interviewing Justin Reilly, Head of Customer Experience Innovation at Verizon Fios, on his thoughts regarding the evolution of customer experience as a business asset. Given that Verizon's products and services sit within perhaps one of the most fiercely competitive industries when it comes to customer experience, Justin provides some interesting thoughts on the importance of reinventing yourself in order to remain at the cutting edge of customer experience excellence and innovation.

Corinium: How would you describe the evolution/progression of the Customer Experience over the last 12 to 18 months? What’s driving this? 

If you look at the evolution of CX as a practice over the last 18 months, it’s really about making experiences hyper relevant and real time for customers. They are naturally drawn to great brand experiences and have never been more aware of poor CX interactions. This sets the bar higher and the expectation for brands now is to develop experiences that fit into their lives in meaningful and contextual ways or they’ll move on to ones that will. Customers, now more than ever, are taste makers. They fundamentally expect more from us. Every moment we wake up, we have a new set of competitors. This is a trend driven primarily by mobile technology innovation, knowledge management automation, and scaled service design. We aren’t just competing against the companies in our industry anymore, we are competing with our customer’s last best experience.

Corinium: What are some of the challenges you still face and what can be done to overcome these challenges?

The single biggest challenge we face is balancing our stakeholders expectations, the immediacy of incremental improvements in service of short term KPIs, and the need to completely disrupt ourselves. We say in our credo: Our best was good for today, and tomorrow we’ll do better. Culture truly drives the ability to optimise your core business, while building the next one. It’s a mindset. So while this is our biggest challenge, we actually believe it’s our biggest strength. The critical mix of both existing and new talent, if managed through the right lens, can be your competitive advantage in an ever-evolving landscape.  

"Customers, now more than ever, are taste makers. They fundamentally expect more from us. Every moment we wake up, we have a new set of competitors. This is a trend driven primarily by mobile technology innovation, knowledge management automation, and scaled service design."

Corinium: What should customer experience executives prioritise in order to be successful? What are your main priorities for the next 2 years?

CX executives should prioritise changing every 3 months rather than every 3 years. It's crucial to ensure that we know the end-to-end experience, and not just individual touch points, in order to get the type of return that stakeholders expect and customers demand. Additionally, it means making sure consumer trends and behaviour are aligned and embedded into every touch point and interaction customers have with the brand. From an employee and culture perspective, it's being hyper-aware of your talent pool in order to lean into the unique challenge of exceeding our customers' expectations every single day.

Corinium: How do you believe Big Data and Analytics impact customer experience?

Every morning, Data becomes exponentially more relevant than it was the day before. Companies have traditionally only seen the value in data when they think about precision marketing or support. Now, brands are using data to inform product design, empower every customer touch point, and truly drive their strategic roadmap. It’s important, now more than ever, that we understand how our customers view the utility of sharing their data. We owe it to our customers to both protect their personal information and use it to provide a better brand experience. The future lies in our ability to consume a tremendous amount of information and immediately serve up the most delightful, contextual experience possible at every turn.

Corinium: Why do you think executives should care about customer experience? What do you believe to be the most critical factor in measuring its importance/return on investment? 

We should care because it is the single most important factor that customers judge businesses on today. It’s how they decide whether or not to trust a brand. It drives loyalty, time spent, and share of wallet. For Gen Z consumers, they place very little equity in a “name” brand anymore; it’s truly experience driven. No longer is customer satisfaction or NPS a true measure of customer loyalty or happiness. People pay you in time, attention and the money they spend with you and when experience is king, customer lifetime value is the biggest indicator of ROI. We succeed if they stick with us for the long run, they invest more time with us, recommend us and grow the depth of their relationship with our products and services.

If you would like to hear more from Justin Reilly as well as leading Chief Customer Officers and Customer Experience Executives from Mastercard, Travelers, New York Times, McKesson, CenterPoint Energy and more, join the inaugural Chief Customer Officer USA taking place from the 30th January to 1st February in Miami Florida. For more information visit our website:  


November 15, 2016


How Tagging and Data Harvesting Helps Keep You Updated on Life Events

Our lives are ever-changing. Many, if not all, of us will experience a monumental change somewhere down the road. In order for business owners to grow, they must maintain meaningful relationships with their clients. Therefore, it’s important for them to be aware of the important moments happening in their lives. Tagging and data harvesting is […] The post How Tagging and Data Harvesting Helps Keep You Updated on Life Events appeared first on BrightPlanet.

Read more »

Rob D Thomas

Everything Starts With Design — You Should, Too

“There is no such thing as a commodity product, just commodity thinking.” — David Townsend Anthony DiNatale was born in South Boston. He entered the flooring business with his father in 1921 and...