
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


January 19, 2017

Revolution Analytics

Microsoft R Server in the News

Since the release of Microsoft R Server 9 last month, there's been quite a bit of news in the tech press about the capabilities it provides for using R in production environments. Infoworld's...



Using BrightPlanet’s Compare Function to Analyze Data in Daily News

The fiery 2016 election cycle brought increased wariness and scrutiny of the media — and the news content we digest. Misleading facts and deceptive memes were disseminated through news articles, blogs, and social media at ever-increasing volumes. In this new era of post-trust politics, it is more critical than ever to research and verify the […] The post Using BrightPlanet’s Compare Function to Analyze Data in Daily News appeared first on BrightPlanet.

Silicon Valley Data Science

Avoiding Common Mistakes with Time Series Analysis

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

If you work with data, you’ll probably have to re-learn it several times throughout your career. The principle is often demonstrated with a graph like this:

DJ vs JL graph

One line is something like a stock market index, and the other is an (almost certainly) unrelated time series like “Number of times Jennifer Lawrence is mentioned in the media.” The lines look amusingly similar. There is usually a statement like: “Correlation = 0.86”.  Recall that a correlation coefficient is between +1 (a perfect linear relationship) and -1 (perfectly inversely related), with zero meaning no linear relationship at all.  0.86 is a high value, demonstrating that the statistical relationship of the two time series is strong.

The correlation passes a statistical test. This is a great example of mistaking correlation for causality, right? Well, no, not really: it’s actually a time series problem analyzed poorly, and a mistake that could have been avoided. You never should have seen this correlation in the first place.

The more basic problem is that the author is comparing two trended time series. The rest of this post will explain what that means, why it’s bad, and how you can avoid it fairly simply. If any of your data involves samples taken over time, and you’re exploring relationships between the series, you’ll want to read on.

Two random series

There are several ways of explaining what’s going wrong. Instead of going into the math right away, let’s look at an intuitive explanation.

To begin with, we’ll create two completely random time series. Each is simply a list of 100 random numbers between -1 and +1, treated as a time series. The first time is 0, then 1, etc., on up to 99. We’ll call one series Y1 (the Dow-Jones average over time) and the other Y2 (the number of Jennifer Lawrence mentions). Here they are graphed:

F_Y1 graph

F_Y2 graph

There is no point staring at these carefully. They are random. The graphs and your intuition should tell you they are unrelated and uncorrelated. But as a test, the correlation (Pearson’s R) between Y1 and Y2 is -0.02, which is very close to zero. There is no significant relationship between them. As a second test, we do a linear regression of Y1 on Y2 to see how well Y2 can predict Y1. We get a Coefficient of Determination (R² value) of 0.08 — also extremely low. Given these tests, anyone should conclude there is no relationship between them.
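As a sketch of this setup (the variable names and seed are ours, not the original post’s), the two noise series and their correlation can be generated in a few lines of Python:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n = 100
# Two independent "time series": 100 random values each in [-1, +1]
y1 = [random.uniform(-1, 1) for _ in range(n)]  # stand-in for the stock index
y2 = [random.uniform(-1, 1) for _ in range(n)]  # stand-in for media mentions

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

r = pearson_r(y1, y2)
print(f"correlation of independent noise: {r:+.3f}")  # near zero
# For simple linear regression, R-squared is just r squared
print(f"R-squared: {r ** 2:.4f}")
```

Your exact numbers will differ with a different seed, but the correlation of two independent noise series will always hover near zero.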

Adding trend

Now let’s tweak the time series by adding a slight rise to each. Specifically, to each series we simply add points from a slightly sloping line from (0,-3) to (99,+3). This is a rise of 6 across a span of 100. The sloping line looks like this:

Trend line graph

Now we’ll add each point of the sloping line to the corresponding point of Y1 to get a slightly sloping series like this:

Y1 with trend line

We’ll add the same sloping line to Y2:

Y2 with trend line

Now let’s repeat the same tests on these new series. We get surprising results: the correlation coefficient is 0.96 — a very strong, unmistakable correlation. If we regress Y1 on Y2 we get a very strong R² value of 0.92. The probability that this is due to chance is extremely low, about 1.3×10⁻⁵⁴. These results would be enough to convince anyone that Y1 and Y2 are very strongly correlated!

What’s going on? The two time series are no more related than before; we simply added a sloping line (what statisticians call trend). One trended time series regressed against another will often reveal a strong, but spurious, relationship.
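Continuing the sketch from above (again with our own variable names), adding the same linear ramp to both noise series is enough to manufacture a strong correlation:

```python
import random

random.seed(42)
n = 100
# Two independent noise series, as before
y1 = [random.uniform(-1, 1) for _ in range(n)]
y2 = [random.uniform(-1, 1) for _ in range(n)]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a) *
                  sum((y - mb) ** 2 for y in b)) ** 0.5

# The trend line from the post: from (0, -3) to (99, +3), a rise of 6
trend = [-3 + 6 * t / (n - 1) for t in range(n)]

# Add the same trend to both series
y1_trended = [y + s for y, s in zip(y1, trend)]
y2_trended = [y + s for y, s in zip(y2, trend)]

print(f"before trend: r = {pearson_r(y1, y2):+.3f}")                   # near zero
print(f"after trend:  r = {pearson_r(y1_trended, y2_trended):+.3f}")   # near +1
```

Nothing about the relationship between the two series changed; the shared trend alone drives the correlation toward 1.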

Put another way, we’ve introduced a mutual dependency. By introducing a trend, we’ve made Y1 dependent on X, and Y2 dependent on X as well. In a time series, X is time. Correlating Y1 and Y2 will uncover their mutual dependence — but the correlation is really just the fact that they’re both dependent on X. In many cases, as with Jennifer Lawrence’s popularity and the stock market index, what you’re really seeing is that they both increased over time in the period you’re looking at. This is sometimes called secular trend.

The amount of trend determines the effect on correlation. In the example above, we needed to add only a little trend (a slope of 6/100) to change the correlation result from insignificant to highly significant. But relative to the changes in the time series itself (-1 to +1), the trend was large.

A trended time series is not, of course, a bad thing. When dealing with a time series, you generally want to know whether it’s increasing or decreasing, exhibits significant periodicities or seasonalities, and so on. But in exploring time-independent relationships between two time series, you really want to know whether variations in one series are correlated with variations in another. Trend muddies these waters and should be removed.

Dealing with trend

There are many tests for detecting trend. What can you do about trend once you find it?

One approach is to model the trend in each time series and use that model to remove it. So if we expected Y1 had a linear trend, we could do linear regression on it and subtract the line (in other words, replace Y1 with its residuals). Then we’d do that for Y2, then regress them against each other.
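A rough illustration of that residual approach (the toy data and helper names are ours, not the post’s code):

```python
import random

random.seed(1)
n = 100
trend = [-3 + 6 * t / (n - 1) for t in range(n)]
# Two unrelated noise series, each contaminated with the same trend
y1 = [random.uniform(-1, 1) + s for s in trend]
y2 = [random.uniform(-1, 1) + s for s in trend]

def detrend(y):
    """Fit a least-squares line over time and return the residuals."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    slope = (sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) /
             sum((t - t_mean) ** 2 for t in range(n)))
    # Residual = observed value minus fitted line
    return [v - (y_mean + slope * (t - t_mean)) for t, v in enumerate(y)]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a) *
                  sum((y - mb) ** 2 for y in b)) ** 0.5

print(f"raw:       r = {pearson_r(y1, y2):+.3f}")                    # spuriously high
print(f"detrended: r = {pearson_r(detrend(y1), detrend(y2)):+.3f}")  # near zero
```

Once the fitted line is subtracted from each series, the apparent relationship collapses back to the noise-level correlation.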

There are alternative, non-parametric methods that do not require modeling. One such method for removing trend is called first differences. With first differences, you subtract from each point the point that came before it:

y'(t) = y(t) – y(t-1)

Another approach is called link relatives. Link relatives are similar, but they divide each point by the point that came before it:

y'(t) = y(t) / y(t-1)
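Both transforms are one-liners; a minimal sketch (helper names are ours):

```python
def first_differences(y):
    """y'(t) = y(t) - y(t-1); removes any constant linear trend."""
    return [b - a for a, b in zip(y, y[1:])]

def link_relatives(y):
    """y'(t) = y(t) / y(t-1); best suited to strictly positive series."""
    return [b / a for a, b in zip(y, y[1:])]

# A straight line differences to a constant...
print(first_differences([1, 3, 5, 7]))   # [2, 2, 2]
# ...and exponential growth has constant link relatives
print(link_relatives([10, 20, 40, 80]))  # [2.0, 2.0, 2.0]
```

Note that the transformed series is one point shorter than the original, and link relatives are undefined wherever the series touches zero.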

More examples

Once you’re aware of this effect, you’ll be surprised how often two trended time series are compared, either informally or statistically. Tyler Vigen created a web page devoted to spurious correlations, with over a dozen different graphs. Each graph shows two time series that have similar shapes but are unrelated (even comically irrelevant). The correlation coefficient is given at the bottom, and it’s usually high.

How many of these relationships survive de-trending? Fortunately, Vigen provides the raw data so we can perform the tests. Some of the correlations drop considerably after de-trending. For example, here is a graph of US Crude Oil Imports from Venezuela vs Consumption of High Fructose Corn Syrup:

Correlation graph

The correlation of these series is 0.88. Now here are the time series after first-differences de-trending:

Corrected correlation graph

These time series look much less related, and indeed the correlation drops to 0.24.

A blog post from Alex Jones, more tongue-in-cheek, attempts to link his company’s stock price with the number of days he worked at the company. Of course, the number of days worked is simply the time series: 1, 2, 3, 4, etc. It is a steadily rising line — pure trend! Since his company’s stock price decreased over time, of course he found correlation. In fact, every manipulation of the two variables he performed was simply another way of quantifying the trend in company price.

Final words

I was first introduced to this problem long ago in a job where I was investigating equipment failures as a function of weather. The data I had were taken over six months, winter into summer. The equipment failures rose over this period (that’s why I was investigating). Of course, the temperature rose as well. With two trended time series, I found strong correlation. I thought I was onto something until I started reading more about time series analysis.

Trends occur in many time series. Before exploring relationships between two series, you should attempt to measure and control for trend. But de-trending is not a panacea, because not all spurious correlations are caused by trends. Even after de-trending, two time series can be spuriously correlated; patterns such as seasonality, periodicity, and autocorrelation can remain. Also, you may not want to de-trend naively with a method such as first differences if you expect lagged effects.

Any good book on time series analysis should discuss these issues. My go-to text for statistical time series analysis is Quantitative Forecasting Methods by Farnum and Stanton (PWS-KENT, 1989). Chapter 4 of their book discusses regression over time series, including this issue.

The post Avoiding Common Mistakes with Time Series Analysis appeared first on Silicon Valley Data Science.

Revolution Analytics

Diversity in the R Community

In the follow-up to the useR! conference in Stanford last year, the Women in R Task force took the opportunity to survey the 900-or-so participants about their backgrounds, experiences and interests....


The 4 Types of Data Analytics

We've covered a few fundamentals and pitfalls of data analytics in our past blog posts. In this blog post, we focus on the four types of data analytics we encounter in data science: Descriptive, Diagnostic, Predictive and Prescriptive. 

Mario Meir-Huber

Application Localization Service

Alconost – Application Localization Service 

If you need quality, affordable translations for your website, Alconost – Software Localization Service may be the solution for your needs. The company provides fast translation by native speakers and localizes websites and products into approximately 40 languages, helping turn your website into a successful international platform. It also produces video and audio content.

Successful & Quality Work

Alconost - Localization

Alconost’s experienced translators work with a range of modern localization tools. They localize business websites, online stores, landing pages and other sites, covering subjects such as finance, cars, manufacturing and commerce, and also provide iOS localization services and translate games and computer programs into many different languages. During the translation process, each product is adapted to the needs and customs of the target market (encoding, currencies and so on).

Powerful Copy Editing

After translation, the team can recreate your text in the target language, rewriting and combining sources so the copy reads naturally and looks more appealing. They customize your product as you wish, correct language mistakes and choose the most accurate phrasing. If your project needs voice-overs in another language, or you wish to localize a video or redraw graphics, Alconost’s staff can deliver those as well.

High Quality Glossary & Language

They also develop terminology specific to your website or videos, so the translated content conveys exact meanings consistently across languages and resonates with your followers or potential buyers. Alconost’s translators work with specialized dictionaries and, for large projects, with the company’s existing glossaries, producing consistent, high-quality software application localization.

The post Application Localization Service appeared first on Techblogger.


January 18, 2017

Revolution Analytics

The fivethirtyeight R package

Andrew Flowers, data journalist and contributor to, announced at last week's RStudio conference the availability of a new R package containing data and analyses from some of their...

Data Digest

2017 SURVEY: Companies achieving measurable results from Big Data investments

“Corporations are achieving measurable results and business benefits from their Big Data investments.”

This is the principal finding of NewVantage Partners as it releases its Big Data Executive Survey 2017, entitled “Big Data Business Impact: Achieving Business Results through Innovation and Disruption.” The survey, whose respondents included CDOs (32.3%), CAOs (22.6%), CIOs (12.9%) and CEOs (8.1%) among others, revealed that corporations have achieved considerable value from their Big Data initiatives. More than 80% of them shared that their Big Data investments have been successful, with 21% of executives declaring Big Data to have been disruptive or transformational for their firm.

However, challenges still abound as the survey also found that cultural issues remain a hurdle to successful business adoption. This finding also coincides with Corinium’s earlier survey of Chief Analytics Officer where “57% of respondents found ‘Driving cultural change’ as the biggest barrier to advancing data and analytics strategy.” Indeed, culture is an important aspect that key executives must put more emphasis on as they continue to pursue business success through Big Data in 2017 and beyond.

Summary of key findings of NewVantage Partners Big Data Executive Survey 2017:

Executives report measurable results from Big Data investments.
Corporations are achieving measurable results and business benefits from their Big Data investments. That is the principal finding of the 2017 executive survey. A strong plurality of executives, 48.4%, report that their firms have realized measurable benefits as a result of Big Data initiatives. A remarkable 80.7% of executives characterize their Big Data investments as successful, with 21% of executives declaring Big Data to have been disruptive or transformational for their firm.

Cultural challenges remain an impediment to successful business adoption.
In spite of the successes, executives still see lingering cultural impediments as a barrier to realizing the full value and full business adoption of Big Data in the corporate world. 52.5% of executives report that organizational impediments prevent realization of broad business adoption of Big Data initiatives. These include lack of organizational alignment, business and/or technology resistance, and lack of middle management adoption as the most common factors. 18% cite lack of a coherent data strategy.

Firms are focusing on opportunities to innovate -- while reducing expense levels.
Firms are striving to establish data-driven cultures (69.4%), create new avenues for innovation and disruption (64.5%), accelerate the speed with which new capabilities and services are deployed (64.5%), launch new product and service offerings (62.9%), “monetize” Big Data through increased revenues and new revenue sources (54.8%), and transform and reposition their business for the future (51.6%). And, of course, 72.6% are seeking to decrease expenses through operational cost efficiencies -- with 49.2% reporting successful results from their cost reduction efforts as a result of Big Data investments.

Chief Data Officers will be expected to step up to lead the data innovation charge.
A majority of firms report having appointed a Chief Data Officer (55.9%). While 56% see the role as largely defensive and reactive in scope today -- driven by regulatory and compliance requirements-- 48.3% believe that the primary role of the Chief Data Officer should be to drive innovation and establish a data culture, and 41.4% indicate that the role of the CDO should be to manage and leverage data as an enterprise business asset. Only 6.9% suggest that regulatory compliance should be the focus of the CDO.

Big firms are bracing for a decade of disruptive change.
Executives fear that disruption is looming on the immediate horizon. A robust 46.6% of executives express the view that their firm may be at risk of major disruption in the coming decade. They envision a future where “change is coming fast” and it may be “transform or die”. In addition to Big Data, these firms see disruption coming from a range of emerging capabilities, including artificial intelligence and machine learning (88.5%), digital technologies (75.4%), cloud computing (65.6%), blockchain (62.3%), and FinTech solutions (57.4%). Prepare for the decade of disruption.

For updates on Data, Analytics, Customer, Digital Innovation, follow Corinium on Twitter @coriniumglobal and Instagram @coriniumglobal

January 17, 2017

Revolution Analytics

Git Gud with Git and R

If you're doing any kind of in-depth programming in the R language (say, creating a report in Rmarkdown, or developing a package) you might want to consider using a version-control system. And if you...

Big Data University

This Week in Data Science (January 17, 2017)

Here are some stories from this week in Data Science and Big Data. Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Cool Data Science Videos

Teradata ANZ

Talking about real-time analytics? Be clear about what’s on offer

The inexorable increase in competition around the globe has led to an explosion of interest in real-time and near real-time systems.

Yet despite all this understandable attention, many businesses still struggle to define what “real-time” actually means.

A merchandiser at a big box retailer, for example, may want a sales dashboard that is updated several times a day, whereas a marketing manager at a mobile Telco wants the capability to automatically send offers to customers within seconds of them tripping a geo-fence. Her friend in capital markets trading, meanwhile, may have expectations of “real-time” systems that are measured in microseconds.

Since appropriate solutions to these different problems typically require very different architectures, technologies and implementation patterns, knowing which “real-time” we are dealing with really matters.

Before you start, think about your goals

Real-time systems are usually about detecting an event – and then making a smart decision about how to react to it.  The Observe-Orient-Decide-Act or “OODA loop” gives us a useful model for the decision-making process.  Here are some tips for business leaders about how to minimise confusion when engaging with I.T. at the start of a real-time project:

  1. Understand how the event that we wish to respond to will be detected. Bear in mind that this can be tough – especially if the “event” we care about is something that should happen but does not, or one that represents the conjunction of multiple events from across the business.
  2. Clarify who will be making the decision – man, or machine? Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system, and only able to make decisions one-at-a-time, one-after-another. If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.
  3. Be clear about decision-latency. Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data. But sometimes you need the latest, greatest and most up-to-date information to make the right choices.
  4. Balance decision-sophistication with data-availability. Do you need to use more, potentially older, data to take a good decision, or can you make a “good enough” decision with less data? Think that through.

Can you win at both ends?

Let’s consider what is required if you want to send a customer an offer in near real-time when she is within half-a-mile of a particular store or outlet. The offer can be triggered solely by her tripping a geo-fence, in which case all that is required is information about where the customer is now.

But you will certainly need access to other data if you want to know whether the same offer has been made to her before and how she responded, or which offers customers with similar patterns of behaviour have responded to in the last six months. That additional data is likely to be stored outside the streaming system.

Providing a more sophisticated and personalised offer to this customer will cost the time it takes to fetch and process that data, so “good”, here, may be the enemy of “fast”. We might need to choose between “OK right now” or “great, a little later”.  That trade-off is normally very dependent on use-case, channel and application.

Rigging the game in your favour

Of course, I can try and manipulate the system – by working out beforehand the next-best actions in relation to a variety of different scenarios I can foresee. This is instead of retrieving the underlying data and crunching it in response to events that I have just detected. With this kind of preparation, I can at least try to be fast and good.
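As an illustration of that precomputation pattern (the segments, events, and offers below are entirely hypothetical), the real-time path becomes a constant-time lookup against decisions scored offline:

```python
# Hypothetical next-best-action table, refreshed offline (e.g. by a nightly
# batch job over historical response data), keyed by (segment, event type).
NEXT_BEST_ACTION = {
    ("frequent_buyer", "geofence_entry"): "10% off in-store today",
    ("lapsed_customer", "geofence_entry"): "welcome-back voucher",
}

def decide(segment: str, event: str, default: str = "no offer") -> str:
    """Fast path: an O(1) lookup instead of a query over historical data."""
    return NEXT_BEST_ACTION.get((segment, event), default)

print(decide("lapsed_customer", "geofence_entry"))  # welcome-back voucher
print(decide("new_customer", "geofence_entry"))     # no offer
```

The speed comes from doing the expensive analysis before the event arrives; the cost, as noted above, is that the decision reflects the data as of the last refresh.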

But then the price I pay is reduced flexibility and increased complexity. And the decision is based on the data from our previous interactions, not the latest data.

All these options come with different costs and benefits and there is no wrong answer – they are all more or less appropriate in different scenarios.  But make sure that you understand your requirements before IT starts evaluating streaming and in-memory technologies for a real-time system.

The post Talking about real-time analytics? Be clear about what’s on offer appeared first on International Blog.


January 16, 2017

Data Digest

Want an honest measure of your customer centricity? Try this.

When you ask Senior Executives "How customer-centric is your company?", there are typically two answers:

  1. We are very Customer-Centric
  2. We are way behind on this
The strange thing is that in both cases, many of them are wrong.

Those that think they are very Customer-Centric are often siloed, disconnected and slow, but refer to their own personal approach as Customer-Centric. Those that think they are way behind are often ahead of the curve, or at least at par with their peers in the industry; they are just being hard on themselves or don't really have a benchmark to compare against. So given this, what are some simple ways to gauge Customer Centricity?

Benchmark against Competitors

It can be hard to get a hold of competitor information. Not impossible and easier than you might think if you really go looking, but, at best, you can just get a hold of their metrics and see how you compare. After all, who wants to tell their competitors what they are doing or working on?

Having said that, there are things that affect CX that you can have access to. Website responsiveness and UX, enquiry responsiveness, converted churn, third party research, social analytics, online reviews and personally buying their product, to name a few. These are things you can easily glean to form an intelligent opinion as to how you compare relative to your peers/industry.

Gather Feedback

For those who have driven their CX towards Data, you likely have a more realistic view of what is being achieved. In my opinion (see what I did there?), CX and Analytics must go hand in hand. Feedback and a simple measure like Net Promoter Score (NPS) should be part of what you are capturing.

Whilst NPS is only a part of what you should be capturing, it can give great insight into potential risks and opportunities.
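For reference, NPS is computed from the standard 0-10 "how likely are you to recommend us?" responses: the percentage of promoters (scores of 9-10) minus the percentage of detractors (scores of 0-6). A minimal sketch (the function name is ours):

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)   # 9s and 10s
    detractors = sum(1 for s in scores if s <= 6)  # 0 through 6
    # Passives (7-8) count in the denominator but neither add nor subtract
    return 100 * (promoters - detractors) / len(scores)

# 3 promoters, 1 passive, 0 detractors out of 4 responses
print(nps([10, 10, 9, 8]))  # 75.0
```

The score ranges from -100 (all detractors) to +100 (all promoters), which is why it should be read alongside, not instead of, richer feedback.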

Pay Attention to Social Media

A former colleague of mine, Alistair Clemett of The Customer Experience Company summarises this here better than I could. Depending on the type of business you are in, there is likely more insight to be gained from Social Media than surveys. Both have their faults but can be some of your most powerful ways to measure CX.

Measure Responsiveness

For me, the major elements are:

  • Website - Page load, forms, UX, etc.
  • Enquiry - When someone asks a question or registers to find information, how quickly do you respond? Think in seconds rather than the hours or days many companies think in.
  • Service Recovery - How simply and quickly do you solve your customer's problems?

Try this at home!

There are many more options, but these can be some of the simplest areas where you can get started. With or without these options, there is still one more thing you should be doing (and never stop doing) to get an honest, unfiltered measure of your customer centricity: Experience being a customer for yourself.

Look for information on your site, fill in forms, inquire, buy (if you can), speak to agents, use the product and generally see things from your customer’s perspective by being a customer. You will often find some of the simplest fixes this way.

If you really are curious you will find answers you otherwise wouldn't have.

For updates on Data, Analytics, Customer, Digital Innovation, follow Corinium on Twitter @coriniumglobal and Instagram @coriniumglobal

By Ben Shipley:

Ben Shipley is the Partnerships Director at Corinium Global Intelligence. You may reach him at  Twitter: @benjaminshipley LinkedIn: 

Data Digest

5 Ways to Start Your Data Governance Framework Right

Chief Data Officer at UNSW, Kate Carruthers, shares her top tips for getting started with data and information governance.

Corinium: We are looking forward to hearing from you at the CDAO Sydney event, where you will be speaking on your data governance journey at UNSW. Establishing a data governance framework is inevitably challenging. What are the key cornerstones of any successful data governance framework?

Kate Carruthers: Clarify your mandate. Get your policies and procedures sorted out early. An official policy clarifies your mandate for running the data governance program and can assist in obtaining buy-in.  My starting point was a definition:

“Data governance is the organisation and implementation of policies, procedures, structure, roles, and responsibilities which outline and enforce rules of engagement, decision rights, and accountabilities for the effective management of information assets.”

Source: John Ladley, Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program, 2012

Set up an effective governance structure. This seems like an obvious thing, but many organisations struggle with it. Getting the right structure in place and the right people involved is critical to success. I have set up a Data Governance Steering Committee (DGSC), which has oversight of the entire program, with cross-organisational executive involvement, and it has been very important in obtaining credibility. The DGSC is supported by another committee, which takes a more hands-on, day-to-day role in deciding how we manage data across the organisation. We also work closely with IT, Privacy, Procurement, and Legal to ensure that they are involved in the data governance program.

Make a start. Typically, in a large organisation, it can be daunting to consider data governance and to know where to start. Find an area of the organisation that has some willing people and just get started. This lets you demonstrate success and leverage that success to get the next area of the organisation involved.

Take inspiration from other organisations. Don’t feel the need to invent data governance from scratch. Talk to other practitioners – they’re usually delighted to find a fellow traveler. Find groups where data and information governance folks hang out, like the Data Governance Institute (DGI), Information Governance ANZ or the Data Management Association Australia (DAMA). The kind folks at DG @ Stanford University were particularly helpful to me in the early days.

Ignore the vendors. There are a plethora of vendors who say they have solutions for data governance. Ignore them. It is not about the tools, it is about practice and culture.

Corinium: You’ve once said that when it comes to building a culture for data governance, “work with the willing and win hearts.” What are your top tips for achieving a truly data-driven culture?

Kate Carruthers: We are still very early on in the journey towards a data-driven culture at UNSW. However, a combination of a good policy and standards framework, together with the right tools to enable people to understand and manage their data are key foundations.

Corinium: Tell us your view on data ownership. Do you agree that territorial stance on data ownership is one of the greatest challenges in establishing the CDO role?

Kate Carruthers: Not at all. The Chief Data Officer role at UNSW is part of the business and I’m here to help data owners across the organisation to truly own and understand their data.  Another key relationship is with IT, and I work very closely with them to ensure that we build effective governance for the business.

Corinium: What do you see as the key benefits to an organisation in having a CDO role?

Kate Carruthers:
Increasingly, data is seen as a critical corporate asset that needs to be managed effectively. Having someone who is responsible for facilitating discussions about how the organisation determines its use of new, existing, and legacy information assets is critical. Additionally, CDOs can lead the debate on digital ethics, privacy and other regulatory issues relating to data.

Corinium: What do you believe are the top 3 qualities needed to be a successful CDO?

Kate Carruthers: Solid understanding of the business, ability to listen to and understand the needs of business and IT colleagues, and an understanding of data and related technologies.

Corinium: What are your key priorities for investment in the next year?

Kate Carruthers: At UNSW, we’re all about delivering on the UNSW 2025 Strategy, and there is a huge array of projects kicking off. In the business-as-usual space, however, I am particularly concerned with the Data & Information Governance program and with implementing the Cybersecurity Program, including the rollout of the Information Security Management System and the Data Classification process. These are the key foundations for our strategic data initiatives. We are also looking at our next-generation analytics to support the 2025 Strategy.

Corinium: What is the one piece of advice you would give to a CDO assuming the role for the very first time?

Kate Carruthers: Take your time to understand the context and get to know the business priorities.

Corinium: What are the biggest trends you envisage dominating the data analytics space over the coming year?

Kate Carruthers: The legacy data warehouses will linger, but Hadoop and predictive analytics will continue to grow. I’m expecting tools for managing real-time data streaming to go mainstream – we’ve already experimented with Amazon Kinesis during 2016. And of course, more cloud-based offerings as vendors try to catch up.

To hear more from Kate, reserve your seat at the Chief Data and Analytics Officer, Sydney taking place on the 6-8 March, 2017.

For more information visit:  

For updates on Data, Analytics, Customer, Digital Innovation, follow Corinium on Twitter @coriniumglobal and Instagram @coriniumglobal

January 14, 2017

Simplified Analytics

This is how Analytics is changing the game of Sports!!

Analytics and Big Data have disrupted many industries, and now they are on the verge of scoring major points in sports. Over the past few years, the world of sports has experienced an explosion in the...


January 13, 2017

Revolution Analytics

Because it's Friday: Code Burn

I was unaware of the work of Jenn Schiffer until recently. At the risk of giving away the joke, she writes satire for coders. Some of her best pieces include: A Call For Web Developers To Deprecate...


Revolution Analytics

Microsoft R Server tips from the Tiger Team

The Microsoft R Server Tiger Team assists customers around the world in implementing large-scale analytic solutions. Along the way, they discover useful tips and best practices, and share them on the...


Mario Meir-Huber

Banking System

Standfore – Banking system

The world of finance is one of the most sensitive sectors of the economy as it dictates how money is going to flow and trades. For that reason, banking systems have to be given the utmost priority as this ensures that the users are able to trust in the system.
The Standfore banking system is designed, prototyped, tested and built on the principle that secure information flows enhance daily business transactions and give businesses and governments confidence that their money will be handled with discretion. With a firm firewall securing and filtering the data that goes in or out of the system, you can rest assured that unauthorized intrusions will be thwarted and your system administrators alerted in the event of any suspicious or fraudulent-looking activity.
Security plays a huge role in most modern systems as the threat of attacks is always present and very ominous, going in the form of malware attacks, information theft while in transit, denial of service attacks and so many other varied attacks that could totally cripple a bank thus halting its operations. In case information gets stolen while in transit to the online banking system, there has to be a way of ensuring that this information will be useless when it falls into the wrong hands. Encrypting all the data that is moving using strong digital keys prevents snooping attacks hence ensuring that all communications with the banking system are verifiable and totally secure.
The banking platform also needs to be fast, giving customers confidence that no matter what time of day or night they log in, their account details will be secure and transactions will be easy, quick and efficient. Building the system on reliable infrastructure is one way to ensure that service delivery is quick and dependable, so that customers and other partners come to trust the system and transact more often.
To keep track of all the transactions the system handles, the banking system uses strong databases that encrypt all their content, rendering it unintelligible to outsiders while keeping the integrity of the information intact. This way, customers can be assured that their sensitive information is stored in strong digital vaults that will withstand all kinds of attacks. This is what the Standfore banking solutions software has been designed and built for in the modern digital age.

The post Banking System appeared first on Techblogger.


January 12, 2017

Revolution Analytics

Education Analytics with R and Cortana Intelligence Suite

By Fang Zhou, Microsoft Data Scientist; Hong Ooi, Microsoft Senior Data Scientist; and Graham Williams, Microsoft Director of Data Science Education is a relatively late adopter of predictive...


January 11, 2017

Revolution Analytics

In case you missed it: December 2016 roundup

In case you missed them, here are some articles from December of particular interest to R users. Power BI now has a gallery of custom visualizations built with R. Chicago's Department of Public...


January 10, 2017

Revolution Analytics

The anatomy of a useful chart: NOAA's flood forecasts

With thanks to NOAA's incredible data gathering and forecasting activities, I've been obsessed with this chart for the past few days: We used to live near the Napa river where this river gage is...

Jean Francois Puget

A Nice Optimization Problem From Santa Claus


Kaggle is a site best known for hosting machine learning competitions.  However, once a year, the Kaggle team runs an optimization competition on some problem Santa Claus could face. 

This year's competition is a stochastic optimization problem: we are asked to optimize some outcome when the data is known only with some uncertainty.  Many real-world problems are of this form. For instance, optimizing store replenishment and inventory levels takes sales forecasts as input.  By definition, future sales are only known up to some prediction uncertainty.  In a case like this, one can optimize for the worst case, for instance: find a replenishment plan that minimizes the likelihood of running out of stock.  I could go on and expand on this, but let's get back to the Kaggle competition for now.

The Problem

This year's competition description is the following:

♫ Bells are ringing, children singing, all is merry and bright. Santa's elves made a big mistake, now he needs your help tonight ♫

All was well in Santa's workshop. The gifts were made, the route was planned, the naughty and nice list complete. Santa thought this would finally be the year he didn't need Kaggle's help with his combinatorial conundrums. At last, the Claus family could take the elves and reindeer on that well deserved vacation to the South Pole.

Then, with just days until the big night, Santa received an email from a panicked database admin elf. Attached was a server log with the six least jolly words a jolly old St. Nick could read:


One of the North Pole elf interns had mistakenly deleted the weights for all of the inventory in the workshop! Santa didn't have a backup (remember, this is a guy who makes a list and checks it twice) and, without knowing each present's weight, he didn't know how he would safely pack his many gift bags. Gifts were already on their way to the sleigh packing facility and there wasn't time to re-weigh all the presents. It was once again necessary to summon the holiday talents of Kaggle's elite.

Can you help Santa fill his multiple bags with sets of uncertain gifts? Save the season by turning Santa's uncertain probabilities into presents for good little boys and girls.

The data section contains additional information:

Santa has 1000 bags to fill with 9 types of gifts. Due to regulations at the North Pole workshop, no bag can contain more than 50 pounds of gifts. If a bag is overweight, it is confiscated by regulators from the North Pole Department of Labor without warning! Even Santa has to worry about throwing out his bad back.

Each present has a fixed weight, but the individual weights are unknown. The weights for each present type are not identical because the elves make them in many types and sizes.

It then provides a way to compute the weight distribution of each gift type.  Details are available on the Kaggle site; just follow the above link. 

It is definitely a stochastic optimization problem: we are asked to optimize the weights of the gifts Santa can distribute, when these weights are only known via probability distributions.  I decided to approach this as a stochastic cutting stock problem.  All the code used here is available in a notebook on the Kaggle site.  The code can be run free of charge by anyone using the DOcplexcloud service, or with the freely available academic version of CPLEX if you are eligible for it.

The Model

One way to approach the competition is to look for a solution structure that has a good chance of yielding a good submission. A solution structure is defined by a number of bag types, plus the number of occurrences of each bag type. A bag type is defined by the number of gifts of each type it contains, for instance 3 blocks and 1 train.

We can focus on bag types because all bags have the same capacity (50 pounds).

There is a finite number of possible bag types. We define one random variable for each bag type.

All we need is an estimate of the expected value and the variance of each possible bag type. Then we use two properties to find a combination of bags that maximizes a combination of expected value and standard deviation:

  • the expected value of a sum of random variables is the sum of the expected values of the random variables
  • the variance of a sum of independent random variables is the sum of the variances of the random variables

We estimate the mean and variance of each bag type via the law of large numbers: we run a Monte Carlo simulation (with 1M samples) and compute the mean and variance of the simulated results. 
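As a sketch of this estimation step (with made-up gift-weight distributions and a smaller sample count; the real distributions are defined on the competition's data page), each trial draws a weight for every gift in the bag and scores the bag as its total weight if that total is at most 50 pounds, and zero otherwise:

```python
# Monte Carlo estimate of one bag type's expected score and variance.
# Assumption: each gift's weight comes from a known distribution, and an
# overweight bag (> 50 lb) is confiscated, i.e. it scores zero.
import random
import statistics

random.seed(42)
CAP = 50.0

def simulate_bag(weight_draws, n=100_000):
    """weight_draws: one zero-argument sampler per gift in the bag."""
    scores = []
    for _ in range(n):
        w = sum(max(0.0, draw()) for draw in weight_draws)
        scores.append(w if w <= CAP else 0.0)   # confiscated if overweight
    return statistics.fmean(scores), statistics.variance(scores)

# e.g. a bag of three "blocks" and one "train" (hypothetical distributions)
bag = [lambda: random.gauss(12, 3)] * 3 + [lambda: random.gauss(10, 5)]
bag_mean, bag_var = simulate_bag(bag)
```

With estimates like these in hand, the two summation properties above give the mean and variance of any solution structure without further simulation.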

While simple, this approach is way too expensive to run.  We improved running time in two ways:

  • limiting the number of bags to those that are inside the Pareto frontier (details below),
  • precomputing distributions for bags made of one gift type and reusing them for more complex bag types.

Let me expand a bit on the Pareto frontier idea.  Let's consider two bags for the sake of clarity:

  1. Three blocks, one train
  2. Three blocks, one bike, and one train

The second bag is obtained by adding one gift to the first bag.  We can compute the expected value for each of these bags. If the expected value is lower for the second bag than for the first, then the second bag can be ignored.  Why?  Because it uses more gifts for a lower value.  More generally, if the first bag cannot be extended with an additional gift without lowering the expected value, then it is Pareto optimal.

Given this, we start with an empty bag and add one gift at a time in every possible way. We do this until the expected value of the bag decreases. When that happens, we discard the newly created bag, as it uses more items and yields a lower expected value.  This results in about 40,000 bag types.
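The generation-and-pruning loop can be sketched as follows; the gift-weight distributions, the Monte Carlo sample count, and the scoring rule (total weight if at most 50 pounds, zero otherwise) are all stand-ins here, not the competition's actual data:

```python
# Grow bags one gift at a time; a bag survives only while adding a gift
# still increases its estimated expected score. Bags that no extension
# improves are Pareto optimal and go to the candidate list.
import random

random.seed(0)
CAP = 50.0
# hypothetical (mean, sd) of each gift type's weight
GIFTS = {'train': (10.0, 5.0), 'blocks': (12.0, 3.0), 'bike': (20.0, 10.0)}

def expected_score(bag, n=5000):
    total = 0.0
    for _ in range(n):
        w = sum(max(0.0, random.gauss(*GIFTS[g])) for g in bag)
        if w <= CAP:               # overweight bags score zero
            total += w
    return total / n

def generate_bags():
    frontier, pool = [], [()]      # start from the empty bag
    while pool:
        nxt = set()
        for bag in pool:
            base = expected_score(bag)
            extended = False
            for g in GIFTS:
                child = tuple(sorted(bag + (g,)))
                if expected_score(child) > base:
                    nxt.add(child)
                    extended = True
            if bag and not extended:
                frontier.append(bag)   # no extension improves it
        pool = sorted(nxt)
    return frontier
```

On the real problem the same idea is combined with precomputed single-gift-type distributions to keep the candidate count manageable.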

Optimizing Expected Value

The next step is to solve the optimization problem.  As said before, it is a cutting stock problem.

The mathematical formulation is as follows:

  maximize mean
  subject to:
    sum_i g_ij · x_i ≤ cap_j   for each gift type j
    sum_i x_i ≤ 1000
    mean = sum_i mean_i · x_i

where:

  • n is the number of bag types
  • m is the number of gift types
  • mean_i is the expected weight of bag type i
  • g_ij is the number of gifts of type j in bag type i
  • cap_j is the number of available gifts of type j
  • x_i is an integer decision variable that takes value v if bag type i is used v times
  • mean is a continuous variable that represents the expected value of the solution structure

This defines a mixed integer program (MIP) with linear constraints. 

Solving it is pretty straightforward with a state-of-the-art MIP solver like CPLEX.  I used the recent DOcplex package to call it from Python. The code stays close to the above mathematical formulation.

from docplex.mp.model import Model

def mip_solve(gift_types, bags, nbags=1000):
    # allgifts (defined earlier in the notebook) maps each gift type to the
    # number of gifts of that type available
    mdl = Model('Santa')
    rbags = range(bags.shape[0])
    x_names = ['x_%d' % i for i in rbags]
    x = mdl.integer_var_list(rbags, lb=0, name=x_names)
    mean = mdl.continuous_var(lb=0, ub=mdl.infinity, name='mean')
    mdl.maximize(mean)
    for gift in gift_types:
        mdl.add_constraint(mdl.sum(bags[gift][i] * x[i] for i in rbags) <= allgifts[gift])
    mdl.add_constraint(mdl.sum(x[i] for i in rbags) <= nbags)
    mdl.add_constraint(mdl.sum(bags['mean'][i] * x[i] for i in rbags) >= mean)
    mdl.parameters.mip.tolerances.mipgap = 0.00001
    s = mdl.solve(log_output=True)
    assert s is not None
    bags['used'] = s.get_values(x)
    print('mean:%.2f' % s.get_value(mean))
    return bags[bags['used'] > 0]

bags is a data frame containing all bag types.  We return the portion of it that contains only the bag types that are used.

Solving this MIP yields solution structures with expected value around 35,540 pounds.  Results depend on how we estimate the expected value of each bag type.  The more simulation runs, the more accurate it is, but the more time it takes to generate all bag types.

I thought finding the optimal expected value would be good enough to win the competition, but I was really wrong.  My first submission scored about 35,880 pounds, and as I write now, top score is close to 37,000 pounds.

How could that happen?  Isn't my solution the optimal one?  It is, but only in a probabilistic sense: it is the best one on average.  The issue is that the competition isn't about finding the best solution on average.  The goal is to find the best solution given the actual (hidden) weights of the gifts. 

Optimizing Mean and Variance

One way to improve results is to generate many solutions from the same solution structure, albeit using different gifts.  For instance, if the solution structure contains one bag made of one train and three blocks, a first solution could include this bag: [train_1, blocks_3, blocks_8, blocks_12].  In a second solution, the same bag could be [train_2, blocks_4, blocks_9, blocks_13].  The expected value for both bags is the same, but the actual values will be different, because the weights of individual gifts differ: the weight of train_1 is not the same as the weight of train_2.

Given we can generate many solutions from a given solution structure, how can we improve the value of the best possible one?  One way is to favor solution structures with larger variance.  If two solution structures have the same expected value, then the one with the larger variance is more likely to generate higher-value submissions.  (It is also more likely to generate lower-value submissions.)
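To make that intuition concrete, compare two hypothetical solution structures with equal mean but different spread, treating each structure's realized value as roughly normal (the numbers below are illustrative only, not from the competition):

```python
# Same expected value, different standard deviation: the wider structure is
# far more likely to produce a submission above an ambitious target.
from statistics import NormalDist

narrow = NormalDist(mu=35500, sigma=100)
wide = NormalDist(mu=35500, sigma=333)

target = 36000
p_narrow = 1 - narrow.cdf(target)   # around 3e-7
p_wide = 1 - wide.cdf(target)       # around 0.07
```

Each submission is another draw from the structure's distribution, so over many submissions it is the high-variance structure whose best draw clears the target.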

The question is: how do we do that with a solver like CPLEX?

Well, the standard deviation of the solution structure is the square root of its variance, and its variance is the sum of the variances of all bags in it.  The mathematical formulation is therefore a slight extension of the previous one:

  maximize mean + alpha · std
  subject to:
    sum_i g_ij · x_i ≤ cap_j   for each gift type j
    sum_i x_i ≤ 1000
    mean = sum_i mean_i · x_i
    var = sum_i var_i · x_i
    std² ≤ var

where:

  • alpha is the relative importance of standard deviation in the objective function
  • n is the number of bag types
  • m is the number of gift types
  • mean_i is the expected weight of bag type i
  • var_i is the variance of the weight of bag type i
  • g_ij is the number of gifts of type j in bag type i
  • cap_j is the number of available gifts of type j
  • x_i is an integer decision variable that takes value v if bag type i is used v times
  • mean is a continuous variable that represents the expected value of the solution structure
  • std is a continuous variable representing the standard deviation of the solution structure
  • var is a continuous variable representing the variance of the solution structure

This problem contains a quadratic constraint, so it is a quadratically constrained mixed integer program (QCMIP).  Again, solving it with CPLEX is rather easy; the code becomes:

def qcpmip_solve(gift_types, bags, alpha, nbags=1000):
    mdl = Model('Santa')
    rbags = range(bags.shape[0])
    x_names = ['x_%d' % i for i in rbags]
    x = mdl.integer_var_list(rbags, lb=0, name=x_names)
    var = mdl.continuous_var(lb=0, ub=mdl.infinity, name='var')
    std = mdl.continuous_var(lb=0, ub=mdl.infinity, name='std')
    mean = mdl.continuous_var(lb=0, ub=mdl.infinity, name='mean')
    mdl.maximize(mean + alpha * std)
    for gift in gift_types:
        mdl.add_constraint(mdl.sum(bags[gift][i] * x[i] for i in rbags) <= allgifts[gift])
    mdl.add_constraint(mdl.sum(x[i] for i in rbags) <= nbags)
    mdl.add_constraint(mdl.sum(bags['mean'][i] * x[i] for i in rbags) == mean)
    mdl.add_constraint(mdl.sum(bags['var'][i] * x[i] for i in rbags) == var)
    mdl.add_constraint(std**2 <= var)
    mdl.parameters.mip.tolerances.mipgap = 0.00001
    s = mdl.solve(log_output=True)
    assert s is not None
    bags['used'] = s.get_values(x)
    print('mean:%.2f' % s.get_value(mean), 'std:%.2f' % s.get_value(std))
    return bags[bags['used'] > 0]

Solving this QCMIP with alpha=2 yields solution structures with expected value around 35,525 pounds and standard deviation around 333 pounds.  The expected value is a bit lower, but the standard deviation is much larger.  With a solution structure like this, I had about a 1/4 chance of getting a submission above 36,400 pounds by the end of the competition (about 90 submissions left at that time).  This looked much better, but the truth is that people have found ways to generate far higher values, as shown by the current leaderboard of the competition.  I am also able to generate better solutions, but don't count on me to disclose how before the competition ends ;)

There is a caveat in the second approach.  Can you see it?

The caveat is in how we prune the generation of candidate bags.  We need to modify it to take the new objective function into account.  When we generate bag2 by adding one gift to bag1, we should compare mean(bag1) + alpha * std(bag1) with mean(bag2) + alpha * std(bag2).  If the former is higher than the latter, then we can safely drop bag2 from further consideration.


What I found interesting is using a QCMIP to optimize the maximum of the values we can get when generating solutions from one solution structure.  This is significantly different from usual stochastic optimization problems.  Indeed, in my experience, people are usually interested in finding solutions that are good on average (as in my first model), or that minimize the worst-case scenario.  Here we are asked to maximize the best-case scenario.  That is very unusual.





Techniques for Assigning Dates to Web Content: What Was the Publish Date?

When making sense of a web page’s raw text, one of the ideal pieces of metadata is the “publish date.” Assigning dates to web content attributes the documents, and all other pieces of intelligence found within that document, to a specific time period. This helps the data analyst quickly drill-down into the data by date […] The post Techniques for Assigning Dates to Web Content: What Was the Publish Date? appeared first on BrightPlanet.

Read more »
Big Data University

This Week in Data Science (January 10, 2017)

Hello all! My name is Janice Darling and I will be taking over this column from Cora.
Here is a roundup of the news this week in Data Science and Big Data.

Don’t forget to subscribe to keep up-to-date with developments in Big Data and Data Science!

Interesting Data Science Articles and News

Cool Data Science Videos

The post This Week in Data Science (January 10, 2017) appeared first on Big Data University.


January 09, 2017

Revolution Analytics

What can we learn from StackOverflow data?

StackOverflow, the popular Q&A site for programmers, provides useful information to nearly 5 million programmers worldwide with its database of questions and answers — not to mention the...


#PostTruth – what does it mean in the world of Data Science?

If I was to sum up our purpose at Principa, it would be “to help clients make informed decisions using data, analytics and software”.  As information grows, so the opportunity to make better decisions increases.  Data helps you understand your customer better.  That’s our mantra.  That’s our ethos.  That’s why we are.


January 07, 2017

Simplified Analytics

What are Microservices in Digital Transformation?

Today’s organizations are feeling the fear of becoming a dinosaur every day. New disrupters are coming into your industry and turning everything upside down. Customers are more demanding than ever and...


January 06, 2017

Revolution Analytics

Three reasons to learn R today

If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today. The blog post gives several detailed reasons, but the main...


Revolution Analytics

Because it's Friday: The camera might not lie, but sometimes it fibs

Photography is my favourite art form: it's more than just capturing a scene in the frame. A good photograph tells a story, chosen and delivered by the photographer. But sometimes that story isn't...


January 05, 2017

Revolution Analytics

Analyzing emotions in video with R

In the run-up to the election last year, Ben Heubl from The Economist used the Emotion API to chart the emotions portrayed by the candidates during the debates (note: auto-play video in that link)....

Silicon Valley Data Science

Imbalanced Classes FAQ

We previously published a post on imbalanced classes by Tom Fawcett. The response was impressive, and we’ve found a good deal of value in the discussion that took place in the comments. Below are some additional questions and resources offered by readers, with Tom’s responses where appropriate.

Questions and clarifications

Which technique would be best when working with the Predictive Maintenance (PdM) model?

This is kind of a vague question, so I'll have to make some assumptions about what you're asking. Usually the dominant problem with predictive maintenance is the FP rate, since faults happen so rarely. You have so many negatives that you need a very low (e.g. < 0.05) FP rate or you'll spend most of the effort dealing with false alarms.

My advice is:

  1. Try some of these techniques (especially the downsampled-bagged approach that I show) to learn the best classifier you can.
  2. Use a very conservative (high threshold) operating point to keep FPs down.
  3. If neither of those gets you close enough, you could see if it's possible to break the problem into two parts such that the “easy” false alarms can be disposed of cheaply, and you only need more expensive (human) intervention for the remaining ones.
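Point 2 can be sketched as a threshold search on held-out scores; the scores, labels, and the 0.2 budget below are hypothetical:

```python
# Pick the most permissive alert threshold whose false-positive rate on
# held-out data stays within a budget (we alert when score > threshold).
def threshold_for_fp_budget(scores, labels, fp_budget=0.05):
    # labels: 1 = fault, 0 = normal; higher score = more suspicious
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(fp_budget * len(negatives))   # false alarms we can tolerate
    if allowed >= len(negatives):
        return float('-inf')                    # budget never binds
    return negatives[allowed]

scores = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
t = threshold_for_fp_budget(scores, labels, fp_budget=0.2)   # -> 0.5 here
```

Alerting only when score > t leaves at most `allowed` negatives above the threshold, keeping the FP rate within budget on this sample.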

Can you suggest a modeling framework that targets high recall at low FP rates?

I’m not quite sure what “modeling framework” refers to here. High recall at low FP rates amounts to (near) perfect performance, so it sounds like you’re asking for a framework that will produce an ideal classifier on any given dataset. I’m afraid there’s no single framework (or regimen) that can do that.

If you’re asking something else, please clarify.

Why did over- and undersampling affect variance as they did in the post? Shouldn't the (biased) sample variance stay the same when duplicating the data set, while there'd be no asymptotic difference when using undersampling?

A fellow reader stepped in to help with this question:

You’re very close — it’s n-1 in the denominator. When duplicating points in a dataset the mean stays the same, and the numerator of the variance grows in proportion (exactly how depends on which observations are duplicated), but the variance itself decreases.

Mathematically, variance is defined as E( [X – E(X)]^2 ), where E() is the mean function (typically just sum everything up and divide by n), but when finding the variance of a sample, instead of taking the straight-up mean of the squared differences as the last step you need to sum everything up and divide by n-1. (It can be shown that dividing by n underestimates the variance, on average.)

Suppose some dataset consists of 5 points. Say the numerator is sum([X – E(X)]^2) = Y, so the variance is Y/4. Now duplicate the data, creating dataset Z, and you have 10 points and the numerator of the formula is sum([Z – E(Z)]^2) = 2Y. But now the variance is 2Y/9, which is smaller than Y/4.

With well-behaved data, this makes little practical difference.
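The arithmetic is easy to check directly (Python's `statistics.variance` uses the n-1 denominator):

```python
# Duplicating every point doubles the variance numerator, but the n-1
# denominator grows from 4 to 9, so the sample variance shrinks.
from statistics import mean, variance

data = [1, 2, 3, 4, 5]
doubled = data * 2

m1, m2 = mean(data), mean(doubled)          # both 3: the mean is unchanged
v1, v2 = variance(data), variance(doubled)  # 10/4 = 2.5 vs 20/9 ~ 2.22
```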

Additional tools and references

The post Imbalanced Classes FAQ appeared first on Silicon Valley Data Science.


Principa's Top 10 Data Analytics Blog posts for 2016

We take pride in our ability to predict - from the results of the 2015 Rugby World Cup and the 2016 Oscars to predicting profitable customers and customer churn. However, there is no denying that 2016 was a year full of shocking, unexpected events - from Brexit and the US election results to the acrimonious break-up of "Brangelina" (shocking!) and the sad loss of some very talented artists.


January 04, 2017

Revolution Analytics

The Flexibility of Remote and Local R Workspaces

by Sean Wells, Senior Software Engineer, Microsoft The mrsdeploy R package facilitates Remote Execution and Web Service interactions from your local R IDE command line against a remote Microsoft R...

Jean Francois Puget

Installing XGBoost on Mac OSX

OSX is much better than Windows, isn't it?  That's common wisdom, and it seemed to be confirmed once more when I installed XGBoost on both OSes.  Before I dive in, let me briefly describe XGBoost.  It is a machine learning algorithm that has yielded great results in recent Kaggle competitions.  I decided to install it on my laptops: an old PC running Windows 7, and a brand new MacBook Pro running OSX.  I thought the OSX installation would be a no-brainer compared to the Windows one, as explained in Installing XGBoost For Anaconda on Windows

Reality is a bit different, and the OSX installation isn't as smooth as it seems.  To be precise, the default OSX installation of XGBoost runs in single-threaded mode, as explained in these instructions.

Why is this a problem?  Because XGBoost is a machine learning algorithm, and running it can be time-consuming.  I decided to install it on my computers to give it a try.   I am currently working on a dataset with only about 100k rows (samples), and tuning XGBoost on my old Windows laptop (a Lenovo W520) takes about 2 hours.  What surprised me is that it takes 7 hours on my brand new MacBook Pro!  That is a bit weird, given they both have Intel i7 quad-core CPUs, and given that the Mac's clock speed is higher.  Add to this the premium price of the Mac, and you can see why I was really surprised.

I further observed that other CPU-intensive tasks are faster on the MacBook Pro.  Something is definitely wrong, but the culprit is easy to spot: XGBoost is single-threaded on OSX. 

Before I explain how to enable multi-threading for XGBoost, let me point you to this excellent Complete Guide to Parameter Tuning in XGBoost (with codes in Python).  I found it useful as I started using XGBoost.  And I assume you could be interested if you read this far ;) 

Back to XGBoost: the installation instructions do explain how to get the multi-threaded version of XGBoost.  Unfortunately, they did not work for me.  The following is what worked for me; I am sharing it in case it helps others.  I had to perform the following steps:

  • Get Homebrew if it is not installed yet.  It is a very useful open source installer for OSX.  Installing it is straightforward: open a terminal, then paste and execute the instruction available on the Homebrew home page. I reproduce it here for convenience:
    /usr/bin/ruby -e "$(curl -fsSL"
  • Get gcc with open mp.  Just paste and execute the following command in your terminal, once Homebrew installation is completed.
    brew install gcc --without-multilib    
    This automatically downloads and builds gcc.  It can take a while, it took about 30 minutes for me.  Be patient.
  • Get XGBoost.  Go to wherever you want in your filesystem, say <directory>.  Then type the git clone command and execute it:
    cd <directory>
    git clone --recursive 
    This downloads the XGBoost code into a new directory named xgboost.
  • The next step is to build XGBoost.  By default, the build process will use the default compilers, cc and c++, which do not support the OpenMP option used for XGBoost multi-threading. We need to tell the system to use the compiler we just installed.  That's the step that was missing from the installation instructions on the XGBoost site. 
    There are various ways to do it, here is the one I used. 
  • Go to where we downloaded XGBoost
    cd <directory>/xgboost
  • Then open make/ and uncomment these two lines

export CC = gcc
export CXX = g++

  • Depending on your g++ installation, you may need to change the above two lines into:
    export CC = gcc-6
    export CXX = g++-6
  • We then build with the following commands.
    cd <directory>/xgboost
    cp make/ .
    make -j4
  • Once the build is finished, we can use XGBoost with its command line.  I am using Python, hence I performed this final step.  You may need to enter the admin password to execute it.
    cd python-package; sudo python install

This concludes the installation. 

I tested it with my Anaconda distribution with Python 3.5.  It worked fine, and I could run XGBoost.  The speedup thanks to multi-threading is noticeable, and my MacBook Pro is now faster than my old PC.   

Updated on July 16, 2016.  Makefile changed in xgboost, making it easier to use gcc.

Updated on Jan 4, 2017. Updated the gcc and g++ declarations in the makefile.  The original way didn't work on some g++ installations.  Thanks to Brandon Mitchell, who spotted the issue.


January 03, 2017

Revolution Analytics

The biggest R stories from 2016

It's been another great year for the R project and the R community. Let's look at some of the highlights from 2016. The R 3.3 major release brought some significant performance improvements to R,...


January 01, 2017


December 30, 2016

Revolution Analytics

Because it's Friday: Goodbye, 2016

Between the deaths of beloved heroes and heroines, the civil unrest and political upheavals, and a slew of natural disasters, 2016 wasn't the greatest year. If you made a movie about it, this is what...


Revolution Analytics

Power BI custom visuals, based on R

You've been able to include user-defined charts using R in Power BI dashboards for a while now, but a recent update to Power BI includes seven new custom charts based on R in the customs visuals...


Simplified Analytics

Do you know what is powerful real-time analytics?

In the Digital age today, the world has become smaller and faster.  Global audio & video calls, once available only in corporate offices, are now available to the common man on the...

InData Labs

AI is changing the face and voice of customer service as we know it.

Deep learning as a game changer in modern customer service. Learn what is behind the DeepMind neural network that generates the most natural speech signals, which can be used to make communication with a customer care representative even more pleasant.

The post AI is changing the face and voice of customer service as we know it. appeared first on InData Labs.


December 29, 2016

Revolution Analytics

Using R to prevent food poisoning in Chicago

There are more than 15,000 restaurants in Chicago, but fewer than 40 inspectors tasked with making sure they comply with food-safety standards. To help prioritize the facilities targeted for...


Revolution Analytics

Combine choropleth data with raster maps using R

Switzerland is a country with lots of mountains, and several large lakes. While the political subdivisions (called municipalities) cover the high mountains and lakes, nothing much of economic...


December 27, 2016

Big Data University

This Week in Data Science (December 27, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

The post This Week in Data Science (December 27, 2016) appeared first on Big Data University.


December 26, 2016

InData Labs

Data Science Competition 2017

InData Labs welcomes all the participants of the Data Science Competition! Take the challenge, show you have the fire and join our R&D Data Science Lab!

The post Data Science Competition 2017 appeared first on InData Labs.


December 25, 2016

Revolution Analytics

Parallelizing Data Analytics on Azure with the R Interface Tool

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) In data science, to develop a model with optimal performance, exploratory experiments on different...


Revolution Analytics

The Basics of Bayesian Statistics

Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior...
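The height example can be sketched numerically: with a normal prior and normally distributed data, the posterior mean is a precision-weighted blend of prior belief and observations. (A minimal illustration with made-up numbers, not code from the original post.)

```python
# Conjugate normal-normal update: combine a prior on the mean with observed
# data of known variance. Posterior precision is the sum of the precisions.

def posterior_mean_var(prior_mean, prior_var, data, data_var):
    """Return the posterior mean and variance for a normal mean."""
    n = len(data)
    xbar = sum(data) / n
    post_precision = 1 / prior_var + n / data_var
    post_mean = (prior_mean / prior_var + n * xbar / data_var) / post_precision
    return post_mean, 1 / post_precision

# Prior belief: mean height around 170 cm, fairly uncertain (variance 100).
# Three observed heights pull the posterior strongly toward the data.
mean, var = posterior_mean_var(170.0, 100.0, [178.0, 182.0, 180.0], 25.0)
print(round(mean, 1), round(var, 1))  # 179.2 7.7
```

More data (or a tighter prior) shifts the weighting accordingly, which is the essence of the "combining information" the post describes.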


December 24, 2016

Simplified Analytics

Fail fast approach to Digital Transformation

Digital Transformation is changing the way customers think about and demand new products or services. Today, bank accounts are opened online, insurance claims are filed online, patients' health is...


December 23, 2016

Revolution Analytics

Because it's Friday: A Christmas Destiny

This video isn't CGI. It's more like a Machinima Christmas pantomime: carefully selected costumes, choreographed performances, and sharp editing of some long takes in the world of Destiny, one of my...


Revolution Analytics

Merry ChRistmas!

Christmas day is soon upon us, so here's a greeting made with R: Each frame is a Voronoi tessellation: about 1,000 points are chosen across the plane, each of which generates a polygon comprising the...
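The tessellation step behind the card can be sketched in a few lines (shown here with SciPy rather than the R code the post uses; the point count matches the description, everything else is illustrative):

```python
# Scatter ~1,000 random points across the unit plane and compute their
# Voronoi tessellation: one polygonal region per generating point.
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
points = rng.random((1000, 2))   # ~1,000 points in [0, 1) x [0, 1)
vor = Voronoi(points)

# Each input point maps to a region (a polygon made of Voronoi vertices).
print(len(vor.point_region))     # 1000
```

Animating frames, as the greeting does, amounts to repeating this with perturbed points and drawing the resulting polygons.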

VLDB Solutions

Xmas Wish List

Teradata MPP Setup on AWS

All We Want For Christmas

It’s that time of year…yes Christmas (Xmas for short), most definitely not <yuck>’holidays'</yuck>.

There’s far too much chocolate in the VLDB office as you might expect. Geeks & chocolate are a winning combo, so it won’t last long.

Moving on…office Xmas decorations – check. Office Xmas party – check. Silly jumpers – check. Secret Santa – check. Loud Xmas music – check.

As we draw ever nearer to *finally* going home for Xmas, our thoughts turn to what VLDB’s Xmas list to Santa might look like…so, here goes…

Dear Santa, can you please make sure clients understand at least some of the following:

  1. Data warehouse systems aren’t a side-project you can pay for with left over funding from another project. Real funding, sponsorship, requirements & commitment are required. Subject Matter Experts (SMEs) or Business Analysts (BAs) will need to provide guidance to get an analytic solution delivered. Technology & designers/developers on their own won’t get very far.
  2. High praise from the likes of Gartner doesn’t mean a particular technology is a good fit for your organisation. Figure out your needs/wants/desires afore ye go looking to buy shiny new tech. Thou shalt not believe all thou hears at conferences. It’s the job of tech companies, VCs, analysts & conference organisers to whip up excitement (see kool-aid). They’re not on the hook for delivery.
  3. Accurate estimates for design/build/test are only possible if analysis is carried out. Either you do it, pay us to do it, or accept estimates with wide tolerances.
  4. Quality Assurance (QA) is not the same as unit testing. It’s a real thing that folks do. Lots of them. No really!
  5. CSV files with headers and trailers are a perfectly acceptable way to build data interfaces. Lots of very large organisations are guilty of clinging on to this ‘unsexy’ approach. It ‘just works’.
  6. You’ll probably need a scheduler to run stuff. cron is not a scheduler. Nor is the DBA.
  7. If you could have ‘built that ourselves in SQL Server in 5 days’ you would have already done so.
  8. Don’t focus on our rate card. Focus on the project ROI. Oh, wait, you haven’t even thought about ROI. Doh!
  9. Yes, we can get deltas out of your upstream applications without crashing the system. It’s what we do. We’ll even prove it.
  10. If you want us to work on site we’ll need desks to sit at, preferably next to each other. We’re picky like that 😉

Have a great Xmas & New Year Santa,

Love from all at VLDB

Have a great Xmas & New Year, and here’s to 2017.


December 22, 2016

Revolution Analytics

Take a Test Drive of the Linux Data Science Virtual Machine

If you've been thinking about trying out the Data Science Virtual Machine on Linux, but don't yet have an Azure account, you can now take a free test drive -- no credit card required! Just visit the...

Silicon Valley Data Science

Techniques and Technologies: Topology and TensorFlow

On December 7, 2016, we hosted a meetup featuring Dr. Alli Gilmore (Senior Healthcare Data Scientist at One Medical), and Dr. Andrew Zaldivar (Senior Strategist in Trust & Safety at Google). Despite the drizzle and gloom outside, the atmosphere of the room was bright and buzzing. The lively audience engaged with both speakers throughout their talks, lending the event the feeling of an intimate small group discussion among peers.

Dr. Gilmore spoke about the user experiences that come with the application of machine learning algorithms. Carefully considering the experience of using a particular machine learning algorithm is what will make artificial intelligence more productive and useful to people. She walked through using the unsupervised Mapper topological data analysis algorithm to group similar types of medical claims, discussed the varied reactions of subject matter experts to its outputs, and envisioned a more interactive and satisfying version of the process.

Dr. Zaldivar illuminated the path to harnessing TensorFlow’s powerful capabilities without the complex configuration by using a set of high-level APIs called TFLearn. He showed us how to quickly prototype and experiment with various classification and regression models in TensorFlow with only a few lines of code, as well as how to access other useful functionality in the TF package.


Unsupervised Topological Data Analysis

Gilmore presentation

After an introduction to topological data analysis, Dr. Gilmore summarized the reactions of domain experts to unsupervised clustering algorithms: they find the results difficult to interpret and are underwhelmed by how little they can contribute to the grouping process. It may be unsatisfying if interpreting the clusters feels like a guessing game, if there are seemingly duplicate groups, or even if the groups are really obvious. Similarly, it’s frustrating when people want to contribute their expertise but can’t. They may also want to reinforce the model’s results when it does something well, but it’s not necessarily easy to tell the system to do more of a particular thing.

How can we overcome the drawbacks that accompany unsupervised methods? Put a human in the loop! Make using the algorithm a positive and fruitful experience by leveraging what people can do confidently while avoiding things that are hard. For example, users can likely explain which features are relevant (this is what they know and care about), but they may have a difficult time describing how many groups should exist in the data. Let them influence the algorithm on these kinds of terms, perhaps by providing labels for the grouping process via exemplar selection, as well as by propagating labels through a question–answer feedback loop from machine to human and back. I’m sure every data scientist has imagined the day when they can more colloquially interact with an algorithm to get better results, even if the majority of today’s feedback only involves cursing that falls on deaf ears.

Practical TensorFlow

Dr Zaldivar presenting

Dr. Zaldivar took the audience through the steps required to build a relatively simple convolutional neural network (CNN) using the low-level TensorFlow Python API. It took four slides of code to cover all of the setup, which involved a lot of expertise to implement but demonstrated how specific one can be if needed. He contrasted this with implementing a deep neural network in just four lines of code using functions from the TFLearn module. He recommended running models at the highest level of abstraction first and digging down into the details only if performance is suboptimal. After all, more lines of code means more to debug if something goes wrong.

Peeking under the hood at the underlying architecture, we got a brief overview of the graphical nature of TF networks. At the lowest level, functional operations like multiply and add are nodes in a graph, and tensors (the data) flow through the graph. Operations become larger as TF is abstracted up to TFLearn, which has a similar level of abstraction to Keras. In this high-level API, many TFLearn models should already be familiar to anyone who has used scikit-learn-style fit/predict methods.
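For anyone unfamiliar with that pattern, here is the fit/predict shape in question, sketched with scikit-learn itself on a made-up toy dataset (TFLearn estimators expose the same style of interface, not this exact code):

```python
# The scikit-learn-style fit/predict pattern that high-level TF APIs mirror:
# construct an estimator, fit it to data, then predict on new inputs.
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D toy data: inputs below ~1.5 belong to class 0, above to class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X, y)                       # train
preds = model.predict([[0.5], [2.5]]) # classify two new points
print(list(preds))                    # [0, 1]
```

Whether the estimator is a logistic regression or a deep network, the calling convention stays the same, which is exactly what makes the high-level APIs quick to prototype with.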

Falling somewhere between the core TF API and TFLearn is another module called TF-Slim, whose API can implement a CNN in far fewer lines of code than the initial approach. Slim focuses on larger operations but can intertwine with the low-level API to give greater control than TFLearn. With the extensible capabilities of this module, you can also fine-tune a pre-trained model to operate on your own dataset, thereby providing yet another way to get up and running quickly with state-of-the-art networks like Inception-ResNet-v2.

Next steps

You can find Dr. Gilmore’s slides here, and Dr. Zaldivar’s slides here. The decks contain a number of links to resources related to their talks—the interested reader is encouraged to peruse the slides to find gems related to the interactive machine learning field, topological data analysis, logging and monitoring capabilities in TensorFlow, additional built-in neural networks, Jupyter notebook examples, and tutorials. We’ve also put recordings of Dr. Gilmore’s and Dr. Zaldivar’s presentations on YouTube.

SVDS offers services in data science, data engineering, and data strategy. Check out our newsletter to learn more about the company and current projects, and to hear about future meetups hosted at our offices.

The post Techniques and Technologies: Topology and TensorFlow appeared first on Silicon Valley Data Science.


December 21, 2016

Revolution Analytics

Introducing the AzureSMR package: Manage Azure services from your R session

by Alan Weaver, Advanced Analytics Specialist at Microsoft Very often data scientists and analysts require access to back-end resources on Azure. For example, they may need to start a virtual machine...

Big Data University

This Week in Data Science (December 20, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

The post This Week in Data Science (December 20, 2016) appeared first on Big Data University.

Teradata ANZ

Is Collaboration killing Creativity?

If collaboration is good, is more collaboration better?

Project management methodologies that have been successful in production-centric environments (e.g., agile, DevOps, lean) are increasingly being deployed in big data projects. However, big data projects are a combination of production and creative work.

Software engineering and development is arguably production-centric and well-suited to optimisation workflows. On the other hand, Science (and research in general) explores how to reach a long-term goal. Outcomes of the scientific process are highly non-linear; significant results are obtained in a similar fashion to an artist’s creative breakthrough.

Data science is no exception: it is a creative endeavour, not a production one.

To maximise the potential of data science teams, one should provide an environment that is suitable for creativity. Fortunately, that’s a well-researched area with evidence-based answers; unfortunately, these findings are often ignored.

Adequate collaboration is the most critical enabler of creativity, but not all collaboration principles are equal: how many works of art, such as paintings or books, are the products of teamwork? In short, not that many [1].

Creative “outbursts” in research or artistic pursuits follow a common pattern of mentally reaching out to novel – and apparently unrelated – ideas to solve a problem or fulfill a vision, until they coalesce into what, to an outsider, appears as an epiphany.

Artists’ collaborative lives are well documented: from circles of philosophers to Andy Warhol’s Factory and the close connections between painters and writers in European capitals a few centuries ago. The most creative people experience a mixture of solitary work and external influences: the collaborative aspect has less to do with creating the work and more with the inspiration it provides.

A number of studies on aspects as diverse as the ideation process [2], the quality and success of Broadway shows [3], or communication vs. productivity in the workplace [4] all demonstrate the same two points: too much close collaboration is harmful (as it naturally leads to cliques, groupthink, and echo chambers), while too little contact with “the outside world” also hampers creativity.

In the realm of research (academic or otherwise), that form of collaboration had been ongoing for a long time: a personal space to create (the fast-disappearing office), and a collective space to make face-to-face contact [5] and exchange ideas (the fast-disappearing workplace cafeteria, external seminars, or conferences).

Instead of following these findings, which have long been established best practices, recent trends have almost obliterated them: open offices, small kitchens replicated across various floors, restricted travel budgets, and constant collaborative meetings with a core team are stifling innovation. Indeed, pretty much every department or company I have visited over the past few years has showcased environments with similar answers to similar problems. This is not about a skills shortage; this is about buzzword-driven project methodologies applied without understanding their context [7] or looking at the evidence.

On the optimistic side, the recent emergence of “people analytics” as an area of focus may offer solutions to re-ignite the true innovation that leads to significant competitive advantage in data science. Indeed, the research is already there, and the most promising answers involve collaborative network graphs.

Among the key features of interest: successful creative projects and companies are composed of people who have a low local clustering coefficient [8] and a short average path length [9], i.e., people whose collaborative and conversational networks are compact but not inter-related.

Left: A highly connected graph of short paths, forming almost a clique (LCC = 0.66, APL = 1.1). This type of collaboration, occurring when everyone works tightly with everyone else and no one outside, leads to unproductive “groupthink”.

Middle: A graph of long paths and limited inter-connectivity (LCC = 0, APL = 2.05). This type of collaboration, occurring when people only work with closely related trusted parties, can lead to “echo chambers”. Note that these types of paths are often in fact disconnected.

Right: A graph containing short path lengths and limited inter-connectivity (LCC = 0.05, APL = 1.5). The combination of short paths (easy access to diverse people) and close collaboration on a small scale is beneficial to the creative process.

You can’t manage what you can’t measure. Fortunately, you can measure and quantify the structure of internal collaboration within an organisation. With that information, you can manage teams, projects or departments to maximise the inventiveness and creativity of knowledge workers, which results in more significant findings and outcomes.
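Such measurements are straightforward to sketch (assuming a networkx-style graph of who works with whom; the three toy networks below are illustrative, not the ones in the figure):

```python
# Measure the two collaboration-structure metrics from the text:
# local clustering coefficient (LCC) and average path length (APL).
import networkx as nx

# Hypothetical collaboration networks: nodes are people, edges are working ties.
clique = nx.complete_graph(5)   # everyone works tightly with everyone: "groupthink"
chain = nx.path_graph(5)        # long chains of trusted parties: "echo chambers"
mixed = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4)])

for name, g in [("clique", clique), ("chain", chain), ("mixed", mixed)]:
    lcc = nx.average_clustering(g)            # 0..1: how much contacts know each other
    apl = nx.average_shortest_path_length(g)  # hops needed to reach anyone
    print(f"{name}: LCC={lcc:.2f}, APL={apl:.2f}")
```

Low LCC combined with low APL is the profile the cited research associates with creative teams: easy access to diverse people without everyone collapsing into one tightly knit group.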

It’s not that: more collaboration => better outcomes
But: better collaboration => more outcomes



[1] Music is an exception here as there are at least two distinct areas: songwriting and composing

[2] Andrew T. Stephen, Peter Pal Zubcsek, and Jacob Goldenberg (2016) Lower Connectivity Is Better: The Effects of Network Structure on Redundancy of Ideas and Customer Innovativeness in Interdependent Ideation Tasks. Journal of Marketing Research: April 2016, Vol. 53, No. 2, pp. 263-279.

[3] Brian Uzzi and Jarrett Spiro (2005) Collaboration and Creativity: The Small World Problem. American Journal of Sociology: September 2005

[4] Alex Pentland (2013) Beyond the Echo Chamber. Harvard Business Review: November 2013

[5] Unscripted face-to-face communication is overwhelmingly more conducive to engagement and idea sharing [6]

[6] Alex Pentland (2012) The new science of building great teams. Harvard Business Review: April 2012

[7] Consider the difference between the original agile manifesto and its current incarnation.

[8] A number between 0 and 1 that measures the proportion of a person’s contacts who also know each other. If everyone knows everyone else, the network is called a clique.

[9] The average number of people a person has to “go through” to contact everyone in the network

The post Is Collaboration killing Creativity? appeared first on International Blog.


December 20, 2016


Guest Post: Was Santa involved in WikiLeaks too?

We partnered with Basis Technology to show how their technology Rosette Text Analytics could be utilized with ours in a fun, Christmas-themed example that they published on their blog. We harvested data from WikiLeaks and curated the data to find Christmas-related mentions. Find out what we were able to uncover. You can read it here. The post Guest Post: Was Santa involved in WikiLeaks too? appeared first on BrightPlanet.

Read more »

Revolution Analytics

Mixed Integer Programming in R with the ompr package

Numerical optimization is an important tool in the data scientist's toolbox. Many classical statistical problems boil down to finding the highest (or lowest) point on a multi-dimensional surface: the...
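As a tiny taste of the idea, here is a mixed integer program solved in a few lines (sketched with SciPy's `milp` rather than the ompr R package the post covers; the objective and constraint are made up):

```python
# Hypothetical toy MIP: maximize x + 2y subject to x + y <= 3.5,
# with x, y non-negative integers. milp minimizes, so negate the objective.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

c = np.array([-1.0, -2.0])                                 # minimize -(x + 2y)
budget = LinearConstraint(np.array([[1.0, 1.0]]), -np.inf, 3.5)

res = milp(
    c=c,
    constraints=budget,
    integrality=np.ones(2),        # both variables must be integers
    bounds=Bounds(0, np.inf),      # non-negative
)
print(res.x, -res.fun)             # optimal point x=0, y=3; objective 6
```

Integrality is what makes this "mixed integer" rather than plain linear programming: the relaxed optimum (y = 3.5) is infeasible, so the solver must search integer points.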