
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


February 27, 2017

Ronald van Loon

How B2B Ecosystems & (Big) Data Can Transform Sales and Marketing Practices

Managing your relationships with customers, suppliers, and partners and constantly improving their experience is a proven way to build a sustainable and profitable business, and contrary to popular assumption, this doesn’t apply to B2C businesses only. With 89% of B2B researchers using the internet during the research process, improved collaboration to deepen existing customer relationships and build new ones is vital to running a successful B2B business as well.

This increasing need for collaboration has led to the development of digital ecosystems. Players like Apple and Google present an interesting case for how B2C ecosystems work. Consider Apple: it is primarily a B2C tech vendor, but it has built a smart new business model that pulls together technologies from multiple domains and combines them into a solution that wins buyers’ acceptance. Amazon, Facebook, and Google are working on a similar kind of business model as well.

So, considering the examples of these tech giants, one can suggest that in this era of personalized customer experiences, B2B ecosystems are no longer an add-on; instead, they have become a necessity for progressive B2B businesses.

Offering valuable insights into customer journey, B2B ecosystems work by segmenting and targeting your audience, allowing you to deliver personalized content to clients across all channels. What’s more, with B2B ecosystems, you can improve customer engagement and develop more profitable relationships by optimizing content on various touch points that combine to form your customer journey roadmap. So, the secret to success for B2B businesses lies in collaboration and sharing of (big) data to improve customer experience and engagement.

What is a B2B Ecosystem?

The core concept of B2B ecosystems comes from the natural ecosystem, a biological community comprising living and non-living things. The essential feature of any ecosystem is interdependence. For example, herbivorous animals like sheep and goats feed on plants, which in turn need water, sunlight, and minerals from soil to grow.

The same concept applies to a B2B ecosystem as well, which is a community of systems working together to serve the needs of customers. Just like a natural ecosystem, a B2B ecosystem has several different components, such as:

  • Enterprise Resource Planning System
  • Customer Relationship Management System
  • Product Information Management System
  • Order Management System
  • Marketing Automation System

The list is non-exhaustive and may contain various other types of systems based on the precise needs and scope of a B2B business.

The Role of B2B Ecosystems in Sales and Marketing

While B2B marketing practices have changed considerably over the past few years, marketing goals have remained the same: more leads, more sales, and more revenue. However, the explosion of new marketing channels and the changing demographics of B2B customers have posed challenges to B2B marketers that can only be overcome through the use of a unified marketing framework, one that connects the marketing and sales departments of a business with the other ‘systems’ in the ecosystem for improved collaboration and sharing of (big) data.

The customer-related data of a B2B business reside in multiple systems, such as ERP, CMS, POS, PIM, Order Management, Sales Enablement, etc. To best serve the customer with a cohesive experience across all marketing channels and touch points of the customer journey, these systems must be interconnected to form a B2B ecosystem.

Leveraging a well-connected and well-equipped B2B ecosystem, marketers can:

  • Use Customer Insights to Cross-Sell: Using previously gathered information, B2B businesses can improve their campaigns and use customer insights to pitch the products and services that customers are most likely to invest in.
  • Offer a Personalized Experience: For any B2B business, an ecommerce website is its most important marketing tool. Utilizing data made available through collaboration, businesses can deliver a more relevant and engaging experience to the customer from their very first visit.
  • Optimize the Order and Reorder Processes: Using the information available about the customer and their previous purchases, businesses can optimize order value and order frequency.
  • Better Manage Content: With access to the data residing in the content management system, marketers can reuse content for different devices and across multiple channels, which in turn can help optimize the content marketing strategy to generate more leads.
  • Facilitate Lead Nurturing: Using a marketing automation tool, businesses can track and analyze the behavioral data of customers to identify and work on the leads that are likely to convert into customers more quickly.

Role of Leadership in the Development of a B2B Ecosystem

The planning and development of a B2B ecosystem is a process that involves virtually all systems and departments of a B2B business. Therefore, to facilitate its implementation, the commitment of senior-level management is of utmost importance. Chief Information Officers in particular have an important role to play in the process.

Since a true collaborative ecosystem goes beyond the organizational boundaries, it requires an enterprise to invest in multiple technology solutions and collaborate with multiple partners, suppliers, and customers. As a result, the decision to plan and implement a B2B ecosystem requires long-term commitment of organizational leaders.

B2B Ecosystems as a Source of Innovations

Just like people networks, B2B ecosystems can also serve as a source of innovation. Consider Procter & Gamble. The company leverages its external networks to crowdsource new ideas and, as a result, can solve more problems in collaboration with the members of its business network than it could on its own.

The same applies to B2B ecosystems as well. Using B2B ecosystems, sales and marketing can collaborate with other functions of the business to have a real-time access to the latest customer information. Furthermore, collaboration with extended workforce, customers, and suppliers can offer B2B businesses access to information in areas where they cannot be physically present.

Internet of Things (IoT) by SAP and Cloud28+ are the two primary examples of B2B ecosystems being used by businesses today. In addition, collaboration solutions by Lithium and Jive also provide B2B businesses with a way to improve their collaboration with suppliers, business partners, and customers, and to develop a B2B ecosystem that offers them greater access to data and improved visibility and control over their customer journey and experiences.

To conclude, collaboration is the lifeblood of businesses and it can be achieved best with the use of technological solutions like Cloud28+ that are designed to facilitate and accelerate enterprise cloud adoption and organization-wide collaboration.

What is your opinion about Eco Systems and use of (big) data? Let us know what you think.


Andrea Monaci is Marketing Director Cloud at HPE and Cloud28+ the B2B Eco System for Digital Transformation. Connect with Andrea on Linkedin and Twitter to learn more about B2B Eco systems. Follow Cloud28+ on Twitter and Linkedin.


If you would like to read Ronald van Loon’s future posts then please click ‘Follow’ and feel free to also connect on LinkedIn and Twitter to learn more about the possibilities of Big Data.


Ronald helps data driven companies generating business value with best of breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining and is guest author on leading Big Data sites, is speaker/chairman/panel member on national and international webinars and events and runs a successful series of webinar on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a Data Consultancy company, leader in Big Data & data process management solutions. Broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post How B2B Ecosystems & (Big) Data Can Transform Sales and Marketing Practices appeared first on Ronald van Loons.

InData Labs

The Best Ways of Applying AI in Mobile

While mobile apps continue to be a prime focus for the enterprise, there is an increasing interest in artificial intelligence technologies. Gartner predicts that intelligent apps will be one of the top ten strategic trends for 2017. When an app claims to be powered by “artificial intelligence” it feels like you’re in the future. What...

The post The Best Ways of Applying AI in Mobile appeared first on InData Labs.

VLDB Solutions

MPP & Redshift Musings

What On Earth is MPP?

In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously). Source: Wikipedia.

Teradata & MPP Beginnings

Once upon a time, in a land far, far away…well, OK, California in the late 1970’s/early 1980’s to be precise…the MPP database world started to stir in earnest.

Following on from research at Caltech and discussions with Citibank’s technology group, Teradata was incorporated in a garage in Brentwood, CA in 1979. Teradata’s eponymous flagship product was, and still is, a massively parallel processing (MPP) database system.

Teradata had to write their own parallel DBMS & operating system. They also used weedy 32bit x86 chips to compete with IBM’s ‘Big Iron’ mainframes to perform database processing. Quite an achievement, to say the least.

The first Teradata beta system was shipped to Wells Fargo in 1983, with an IPO following in 1987.

It is interesting to note, for those of us in the UK, that there were Teradata systems up and running over here as early as 1986/87, partly down to the efforts of folks such as our very own ‘Jim the phone’.

Early Teradata adopters in the UK included BT (who later moved off Teradata when NCR/Teradata was acquired by AT&T), Royal Insurance, Great Universal Stores (GUS) and Littlewoods. GUS and Littlewoods combined to become ShopDirect, who still run Teradata 30 years later.

Yours truly ran his first Teradata query at Royal Insurance in Liverpool from ITEQ on an IBM 3090 back in ’89. It killed the split screen in MVS, but ho-hum, better than writing Cobol against IMS DB/DC to answer basic questions. We were taught SQL by none other than Brian Marshall who went on to write the first Teradata SQL & performance books. I still have the original dog-eared Teradata reference cards, or ‘cheat sheets’ as we called them:

Teradata ITEQ/BTEQ & Utilities Reference Cards.

Teradata ITEQ/BTEQ & Utilities Reference Cards, 1988 vintage.

From the 1980’s through to the early 2000’s Teradata had a pretty clear run at the high-end analytic DBMS market. There was no serious competition, no matter how hard the others tried. All those big name banks, telecoms companies and retailers couldn’t be wrong, surely?

MPP Upstarts – Netezza, Datallegro & Greenplum

Teradata’s first real competition in the commercial MPP database space came in the form of Netezza in the early 2000’s.

Like Teradata, Netezza consisted of dedicated hardware & software all designed to work in harmony as an engineered MPP database ‘appliance’. Unlike Teradata, Netezza was able to take advantage of open source DBMS software in the form of PostgreSQL, and open source OS software in the form of Linux.

We discovered Netezza by accident in 2002/03 after landing on a PDF on their web site following a Google search. “Netezza is Teradata V1” was our initial response. Apart from the FPGAs, we were pretty close.

A few phone calls, a trip to Boston and a training session later, and we’re up and running as Netezza partners.

Following a successful Netezza project at the fixed line telecoms division of John Caudwell’s Phones4U empire, yours truly was a guest speaker at the inaugural Netezza user conference in 2005.

Following an IPO in 2007, Netezza was bought by IBM in 2010 where it remains to this day, somewhere in IBM’s kitbag. Poor old Netezza.

Also in the early 2000’s, a bunch of mainly Brits at Datallegro were trying to build an MPP appliance out of Ingres. They were bought by Microsoft in 2008. Why Microsoft needed to buy a startup running Ingres on Linux is anyone’s guess.

The last of the new MPP players we had dealings with all those years ago are none other than Greenplum.

Through the Netezza partner ecosystem we actually knew the Greenplum team when they were still called ‘Metapa’, probably around 2002/03. Yours truly was at a dinner with Luke & Scott in Seattle as they wooed none other than Teradata luminary Charles ‘Chuck’ McDevitt (RIP) to head up the Greenplum architecture team. With Chuck on board they were virtually guaranteed to succeed.

Greenplum was always of particular interest to us because, unlike Teradata & Netezza, it was and still is a software only offering. You get to choose your own platform, which can be your favourite servers/storage or even that new fangled ‘cloud’ thing.

Not only have we got Greenplum to work happily on every single platform we ever tried, Greenplum is also the only open source MPP database.

Other Parallel PostgreSQL Players

In addition to early parallel PostgreSQL players such as Netezza and Greenplum (Datallegro used Ingres, not PostgreSQL), a whole raft of ‘me too’ players cropped up in 2005:

  • Vertica – acquired by HP in 2011
  • Aster – acquired by Teradata in 2011
  • Paraccel – acquired by Actian in 2013

It is interesting to note that, unlike Netezza, not one of the new parallel PostgreSQL players put even the smallest dent in Teradata’s core MPP data warehouse market. At least Netezza shook Teradata from their slumber which, in turn, gave the Teradata appliance to the world.

That said, perhaps Paraccel will have the biggest impact on the MPP database market which, for so long, has been dominated by Teradata, the main player in the space for 30 years. How so? Read on…


Amazon launched the Redshift ‘Data Warehouse Solution’ on AWS in early 2013.

Yours truly even wrote some initial thoughts at the time, as you do.

As has been well documented, Redshift is the AWS implementation of Paraccel, which was one of the crop of ‘me too’ parallel PostgreSQL players that appeared in 2005.

Redshift adoption on AWS has been rapid, and it is reported to be the fastest growing service in AWS.

So, despite Teradata having been around for 30 years, the awareness of MPP and adoption of MPP as a scalable database architecture has been ignited by Redshift, which started out as Paraccel, which is built out of PostgreSQL.

Redshift Observations

Our previous post on Redshift in 2013 made mention of the single column only distribution key, and the dependence on the leader node for final aggregation processing.

There is a workaround for the single column only distribution key restriction, for sure. MPP folks will no doubt be mystified as to why such a restriction exists. However, it neatly removes the possibility of a 14 column distribution key (primary index) that we encountered at a Teradata gig a few years ago (no names). Cloud, meet silver lining.
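
The post does not say which workaround it has in mind. One approach often used on MPP systems is to derive a single extra column from the columns you would have liked to distribute on, and distribute on that instead. The sketch below only illustrates the idea in plain Python: `composite_key`, `segment_for`, the `dist_key` column name and the CRC32 hash are all assumptions made for the example, not how Redshift itself places rows.

```python
import zlib


def composite_key(row, columns, sep="|"):
    """Join several column values into one derived string column."""
    return sep.join(str(row[col]) for col in columns)


def segment_for(key, num_segments):
    """Map a key to a segment with a stable hash (CRC32 is illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


# Hypothetical order rows; we would like to distribute on two columns.
orders = [
    {"customer_id": 17, "order_date": "2017-02-01", "amount": 120.0},
    {"customer_id": 17, "order_date": "2017-02-01", "amount": 35.5},
    {"customer_id": 42, "order_date": "2017-02-03", "amount": 80.0},
]

for order in orders:
    # The derived column is the single-column distribution key.
    order["dist_key"] = composite_key(order, ["customer_id", "order_date"])
    order["segment"] = segment_for(order["dist_key"], num_segments=4)
```

Rows that agree on every source column get the same derived key and therefore land on the same segment, which is the property a multi-column key would have given.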

The dependence on a leader node for final aggregation processing is much more of an issue. Aggregation queries that return a high number of groups will simply choke on the final aggregation step. Just ask Exadata users. There’s also not much you can do to remediate this issue if you can’t over-provision CPU/RAM at the leader node, which is the case with Redshift.
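
To see why the number of groups matters, here is a toy model of two-phase aggregation, assuming each compute node pre-aggregates its own rows and a single leader merges one partial result per node. It is a generic sketch, not Redshift's actual execution plan, and the function names and row layout are invented for the illustration.

```python
from collections import Counter


def local_aggregate(rows):
    """Phase 1: a compute node sums amounts per group over its own rows."""
    partial = Counter()
    for group, amount in rows:
        partial[group] += amount
    return partial


def leader_merge(partials):
    """Phase 2: the leader merges the partial results from every node."""
    final = Counter()
    rows_received = 0
    for partial in partials:
        rows_received += len(partial)  # one row per group per node
        for group, amount in partial.items():
            final[group] += amount
    return dict(final), rows_received


def simulate(num_nodes, num_groups, rows_per_node):
    """Build synthetic data, run both phases, return result and leader load."""
    nodes = [
        [((node + i) % num_groups, 1) for i in range(rows_per_node)]
        for node in range(num_nodes)
    ]
    partials = [local_aggregate(rows) for rows in nodes]
    return leader_merge(partials)


few_groups, few_rows = simulate(num_nodes=4, num_groups=10, rows_per_node=1000)
many_groups, many_rows = simulate(num_nodes=4, num_groups=1000, rows_per_node=1000)
print(few_rows, many_rows)  # 40 4000
```

The input is the same size in both runs, but the leader receives 100 times more rows when there are 1000 groups instead of 10. Adding compute nodes does not help with that last step, because the merge runs in one place.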

More recent Redshift observations from our own experiences, and from trusted contacts (you know who you are) include the following:

  • lack of node OS access – security auditing, hardening & OS customisation not possible.
  • non-persistent storage – if you stop or lose the EC2 cluster you lose the database data & will need to re-load from S3 or a database snapshot.
  • poor concurrency – default is 5 concurrent operations. Although this can be increased, performance drops off quickly as concurrency increases.
  • poor workload management – no short query ‘fast path’. Redshift is essentially a ‘free for all’.
  • 1MB block size & columnar only storage – very large table space overhead of 1 MB per column for every table x number of segments.
  • limited tuning options – no partitions or indexes. If the sort keys don’t help it’s a full table scan every time.
  • automatic database software updates – this might appeal to the ‘zero admin is good’ crowd, but enterprise customers will baulk at the notion of a zero-choice upgrade that could break existing applications.
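
The block size point in the list above is easy to put into numbers. The sketch below takes the rule of thumb at face value (one 1 MB block per column per segment as the minimum footprint) and ignores any hidden system columns or other engine details; the function name and the example table shapes are assumptions.

```python
def minimum_table_footprint_mb(columns, segments, block_mb=1):
    """Smallest possible on-disk size if every column needs one block per segment."""
    return columns * segments * block_mb


# A hypothetical 10-column lookup table on a cluster with 32 segments.
small_lookup = minimum_table_footprint_mb(columns=10, segments=32)
print(small_lookup)  # 320

# The same arithmetic for a wide 200-column table.
wide_table = minimum_table_footprint_mb(columns=200, segments=32)
print(wide_table)  # 6400
```

Under these assumptions even a lookup table holding a handful of rows occupies hundreds of megabytes, which is why schemas with many small tables feel the overhead most.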

Lack of OS access and non-persistent storage will no doubt be show-stoppers for some enterprise folks.

The SQL Server crowd will probably not care and just marvel at the performance offered by SQL queries that run against all CPU cores in the cluster. Unless, of course, they didn’t read the data distribution notes.

Meet The New Boss, Same as The Old Boss

No matter that Redshift could be improved, what we can applaud is that Amazon has opened many folks’ eyes to the benefits of a scalable ‘old-skool’ relational database management system (RDBMS), and one with an MPP architecture to boot. For this we can only be thankful.

The rate of Redshift uptake speaks to the usefulness of a scalable RDBMS. The architecture is being socialised by Amazon (high volume/low margin) in a way that it never was by Teradata (low volume/high margin).

All of the MPP database vendors, Amazon included, owe a debt of gratitude to Teradata who proved the architecture 30 years ago. That Teradata got so much so right so long ago never ceases to amaze yours truly.

To Redshift the old adage remains true, there really is ‘nowt new under the sun’.

The post MPP & Redshift Musings appeared first on VLDB Blog.


February 25, 2017

Simplified Analytics

Beginner's guide to Chatbots - a driver for Digital Transformation

We are living in a century where technology dominates lifestyle; Digital Transformation with Big Data, IoT, Artificial Intelligence (AI) are such examples. Over the past six months, Chatbots have...


February 24, 2017

Revolution Analytics

Because it's Friday: Looking down the Glory Hole

California, which has been in drought for the past five years, has found some respite in heavy rains this winter. But while those same rains have caused flooding, mudslides, and a...

Ronald van Loon

How IoT is Changing the World: Cases from Visa, Airbus, Bosch & SNCF

The Internet of Things (IoT) is changing our world. This may seem like a bold statement, but consider the impact this revolutionary technology has already had on communications, education, manufacturing, science, business, and many other fields of life. Clearly, the IoT is moving really fast from concept to reality and transforming how industries operate and create value.

As the IoT creeps towards mass adoption, IT giants experiment and innovate with the technology to explore new opportunities and create new revenue streams. I was invited to Genius of Things Summit as a Futurist by Watson IoT and WIRED Insider and attended the long-awaited grand opening of IBM’s headquarters for Watson Internet of Things in Munich. The two-day event provided me an insight into what IBM’s doing to constantly push the boundaries of what’s possible with the IoT.

In this article, I have discussed the major developments that caught my interest and that, in my opinion, will impact and improve customer experience substantially.

IoT capabilities become an integral part of our lifestyle

According to IBM, the number of connected devices is expected to rise as high as 30 billion in the next three years. This increasingly connected culture presents businesses with an opportunity to harness digital connections to improve their products and services and, ultimately, foster deeper human connections in order to improve customer experiences and relationships.

IBM, being one of the world’s top innovators in IoT, announced an exciting series of new offerings at The Genius of Things Summit alongside 22 clients and partners. These new IoT capabilities are likely to be the future of the IoT and become an integral part of our lifestyle in the near future.

Digital Twin 

Traditionally, industrial assets are designed, built, and operated using numerous data sources, with engineers working in specialized teams that conduct analysis for their specific tasks separately. As a result, the most current information may not be readily available for critical decisions. These silos, in turn, lead to increased costs and inefficiencies, create uncertainties, and require vast amounts of time and resources. Digital Twin is a more efficient way of working. It is a cloud-based virtual image of an asset, maintained throughout the lifecycle and easily accessible at any time. One platform brings all experts together, allowing them to work cost-effectively using a collaboration platform, which helps reduce errors and improve efficiency. Consequently, this enables more profitable, safe, and sustainable operation.

Case: Airbus Makes Digital Twin Come to Life

Airbus and Schaeffler are using digital twin engines and digital twin bearings, respectively, to transform their production processes, increasing operational productivity and improving design elements. IBM Watson is the IoT platform through which these two companies are reshaping their respective industries. Cognitive cloud-based insights augment predictive systems to enable improved safety and efficiency for these two manufacturing organizations. Watch the Digital Twin replay from Genius of Things.

Cognitive Commerce 

Cognitive commerce is a revolutionary phenomenon that involves the use of a spectrum of technologies, ranging from speech recognition to a recommendations system based on machine learning. A cognitive commerce journey is based on an in-depth understanding of customers’ behaviors and preferences, both at aggregate and individual level. The knowledge is then applied in a real-time manner to offer a truly personalized experience to the customers in order to improve their satisfaction and drive more revenue to the business.

Case: Visa Embraces the IoT

Visa partnered with IBM to leverage the cognitive capabilities of IBM’s Watson IoT platform. The collaboration allowed Visa to launch a technology that will let customers make payments from any IoT-connected device, from an application to a car or a watch. The new technology will not only eliminate the need to use sensitive financial information present on payment cards, but will also introduce a new level of simplicity and convenience to the customer journey. See more about Visa and IBM.

Predictive Maintenance 

Predictive maintenance is a valuable application of the Internet of Things that helps reduce maintenance costs, increase asset availability, and improve customer satisfaction by issuing an alert before a machine or equipment breaks down. The technology involves analysis of large volumes of sensor data, such as temperature, oil levels, vibration, and voltage to predict maintenance needs before equipment failures happen.
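
As a rough illustration of the idea, the sketch below flags a sensor reading when it deviates sharply from the readings just before it. This is a deliberately simple trailing-window check and not the method used by Watson IoT or any vendor named here; the window size, threshold, function name and vibration values are all made up for the example.

```python
from statistics import mean, pstdev


def find_alerts(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate from the trailing window
    mean by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            deviates = readings[i] != mu
        else:
            deviates = abs(readings[i] - mu) / sigma > threshold
        if deviates:
            alerts.append(i)
    return alerts


# Hypothetical vibration readings with one spike at index 7.
vibration = [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 0.50, 0.95, 0.52]
print(find_alerts(vibration))  # [7]
```

A production system would combine several signals (temperature, oil level, voltage) and learn what precedes a failure, but the principle of raising an alert before the breakdown is the same.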

Case: Watson IoT to Help SNCF Railway Run Smoothly

SNCF is a leader in passenger and freight transport services that has a network of over 15,000 trains covering more than 30,000 kilometers of track. The company recently announced its collaboration with IBM. The collaboration will help SNCF connect its entire rail system, including trains, train stations, and railroad tracks, to Watson IoT. Using real-time data collected from sensors, the company will be able to anticipate repair needs and improve the security and availability of its assets. Watch the CTO of SNCF explain more about their approach to better client outcomes with IoT.

Connected Devices

This involves the use of sensors to merge the real world and the digital world. These sensors are used in automobiles, smartphones, and other devices to make the devices web-compatible. These sensors measure humidity, light, temperature, magnetic fields, pressure, and sound. The information collected is programmed, processed, and transmitted using a radio network to the user, allowing them to control their smart devices from a remote location.

Case: Bosch Makes Industrial IoT a Reality

Bosch recently introduced its new ‘Bosch IoT Rollouts’ service for advanced device management and cloud-based software updates. Bosch will leverage its development and manufacturing expertise as well as IBM’s Watson IoT platform to update connected devices in a seamless manner and deliver personalized services and experiences to customers with connected devices. Watch how Bosch and IBM are working together on the glue between IoT and connected products and devices.

The impact of how digitizing the physical infrastructure around us affects customer experiences is an ongoing source of inspiration for me. I will appreciate your comments, insights, and feedback on this article, as well as invite you to follow me on Twitter and LinkedIn to learn more about Big Data and IoT.



The post How IoT is Changing the World: Cases from Visa, Airbus, Bosch & SNCF appeared first on Ronald van Loons.

Jean Francois Puget

Feature Engineering For Deep Learning

Feature engineering and feature extraction are key, and time consuming, parts of the machine learning workflow.  They are about transforming training data, augmenting it with additional features, in order to make machine learning algorithms more effective.  Deep learning is changing that according to its promoters.  With deep learning, one can start with raw data as features will be automatically created by the neural network when it learns.  For instance, see this excerpt from Deep Learning and Feature Engineering:

The feature engineering approach was the dominant approach till recently when deep learning techniques started demonstrating recognition performance better than the carefully crafted feature detectors. Deep learning shifts the burden of feature design also to the underlying learning system along with classification learning typical of earlier multiple layer neural network learning. From this perspective, a deep learning system is a fully trainable system beginning from raw input, for example image pixels, to the final output of recognized objects.

As usual with bold statements, this is both true and false.  In the case of image recognition, it is true that lots of feature extraction became obsolete with deep learning.  Same for natural language processing where the use of recurrent neural networks made a lot of feature engineering obsolete too.  No one can challenge that.

But this does not mean that data preprocessing, feature extraction, and feature engineering are totally irrelevant when one uses deep learning.

Let me take an example for the sake of clarity, taken from Recommending music on Spotify with deep learning. I recommend reading this article as it introduces deep learning, and how it is used for a particular case, pretty well.

I will not explain what deep learning is here in general. Suffice it to say that deep learning is most often implemented via a multi-layer neural network. An example of a neural network is given in the above article:


Data flows from left to right. The input layer, the leftmost layer, receives an encoding of songs. Then the next 3 layers are max pooling layers. The next layer computes the mean, max, and L2 norm of its input data. The next 3 layers are convolutional layers. The last layer is a temporal pooling layer.

Don't worry if you don't fully understand what all this means.  The key point is that learning only happens between the 3 convolutional layers.  All the other layers are hard coded feature extraction and hard coded feature engineering.  Let me explain why:

  • The input data is not raw sound data.  The input data is a spectrogram representation of the sound obtained via a Fourier transform.  That data transformation happens outside the neural network.  This is a drastic departure from the above quote that claims deep learning starts with raw data.
  • The next 3 levels are max pooling.  They rescale their input data to smaller dimension data.  This is hard coded, and it is not modified by learning.
  • The next level computes the mean, the max, and the L2 norm of time series.  This is a typical feature engineering step.   Again, this is hard coded and not modified by learning.
  • The next 3 levels are convolutional levels.  Learning happens on the connections between the first two convolutional levels and on the connections between the last two convolutional levels.
  • The last level computes statistics on the data output by the last convolutional level.  This is also a typical feature engineering.  It is also hard coded and not modified by learning.
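To make the hard coded steps listed above concrete, here is a minimal sketch of them in plain Python: a spectrogram computed with a naive short-time Fourier transform, max pooling along time, and per-frequency summary statistics (mean, max, L2 norm). This is not the code from the Spotify article. The function names, frame size, hop, and toy signal are all assumptions made for illustration, and a real pipeline would use an optimized FFT library. The point is only that none of these steps has a weight that training could adjust.

```python
import cmath
import math


def spectrogram(signal, frame_size, hop):
    """Magnitude spectrogram via a naive short-time Fourier transform.

    Returns a list of frames; each frame holds the magnitudes of
    frequency bins 0 .. frame_size // 2.
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        window = signal[start:start + frame_size]
        bins = []
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(window)
            )
            bins.append(abs(coeff))
        frames.append(bins)
    return frames


def max_pool_time(frames, size):
    """Hard coded max pooling along the time axis (no learned weights)."""
    pooled = []
    for start in range(0, len(frames) - size + 1, size):
        block = frames[start:start + size]
        pooled.append([max(column) for column in zip(*block)])
    return pooled


def summary_statistics(frames):
    """Mean, max, and L2 norm over time for each frequency bin."""
    stats = []
    for column in zip(*frames):
        mean = sum(column) / len(column)
        l2 = math.sqrt(sum(v * v for v in column))
        stats.append((mean, max(column), l2))
    return stats


# Toy input (an assumption): a pure tone with 4 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = spectrogram(signal, frame_size=16, hop=8)
pooled = max_pool_time(spec, size=2)
features = summary_statistics(pooled)
```

Every function here is a fixed transformation chosen by the designer. In a network like the one described, only the convolutional weights sitting between such steps would change during training.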

This example is a neural network where most of the network is some hard coded feature engineering or some hard coded feature extraction.  I write hard coded as these are not learned by the system, they are predefined by the network designer, the author of the article.  When that network learns, it adjusts weights between its convolutional layers, but it does not modify the other arcs in the network.  Learning only happens for 2 pairs of layers, while our neural network has 7 pairs of consecutive layers.

This example is not an unusual exception.  The need for data preprocessing and feature engineering to improve performance of deep learning is not uncommon.  Even for image recognition, where the first deep learning success happened, data preprocessing can be useful.  Finding the right color space to use can be very important for instance.  Max pooling is also used a lot in image recognition networks.

The conclusion is simple: many deep learning neural networks contain hard coded data processing, feature extraction, and feature engineering. They may require less of these than other machine learning algorithms, but they require some still. 

I am not the only one stating the above, see for instance In deep learning, architecture engineering is the new feature engineering.


Update on Feb 24, 2017.  The need for data preprocessing for deep learning is getting more attention.  Google just announced tf.Transform to address that need for TensorFlow users.

Jean Francois Puget

IBM Machine Learning

We announced IBM Machine Learning last week, see here and here  for event replays.  I was interviewed as part of the launch.  A good write up of what I said can be found on Silicon Angle. You can find the video of the interview at the end of this post. 

This video has been shared on social media under various titles, but the one that had the most impact is: The evolution of : fusing human thought with algorithmic insights.  It is probably because the interview contains a discussion of the potential danger of AI.  Our take at IBM, and my take, is that we do not care much about artificial intelligence if the goal is to reproduce human cognition in order to replace it.  Our take is to work on tools that help humans perform their tasks better.  We speak of augmented intelligence, or assisted intelligence.  Machine learning, as one of the prominent artificial intelligence capabilities, is no exception.

I also discuss other, more near-term topics around machine learning and the forthcoming IBM offering for it. Here it is if you want to watch the full interview (about 18 minutes):



Update on Feb 24, 2017. It seems my interview is quite popular: I was just named Guest of the Week on Silicon Angle, with a second interesting write-up.


February 23, 2017

Revolution Analytics

Preview: R Tools for Visual Studio 1.0

After more than a year in preview, R Tools for Visual Studio, the open-source extension to the Visual Studio IDE for R programming, is nearing its official release. RTVS Release Candidate 1 is now...

Silicon Valley Data Science

Open Source Toolkits for Speech Recognition

As members of the deep learning R&D team at SVDS, we are interested in comparing Recurrent Neural Network (RNN) and other approaches to speech recognition. Until a few years ago, the state-of-the-art for speech recognition was a phonetic-based approach including separate components for pronunciation, acoustic, and language models. Typically, this consists of n-gram language models combined with Hidden Markov models (HMM). We wanted to start with this as a baseline model, and then explore ways to combine it with newer approaches such as Baidu’s Deep Speech. While summaries exist explaining these baseline phonetic models, there do not appear to be any easily-digestible blog posts or papers that compare the tradeoffs of the different freely available tools.

This article reviews the main options for free speech recognition toolkits that use traditional HMM and n-gram language models. For operational, general, and customer-facing speech recognition it may be preferable to purchase a product such as Dragon or Cortana. But in an R&D context, a more flexible and focused solution is often required, and that is why we decided to develop our own speech recognition pipeline. Below we list the top contenders in the free or open source world, and rate them on several characteristics.

Comparison of open source and free speech recognition toolkits

This analysis is based on our subjective experience and the information available from the repositories and toolkit websites. This is also not an exhaustive list of speech recognition software, most of which are listed here (which goes beyond open source). A 2014 paper by Gaida evaluates the performance of CMU Sphinx, Kaldi, and HTK. Note that HTK is not strictly open source in its usual interpretation, as the code cannot be redistributed or re-purposed for commercial use.

Programming Languages: Depending on your familiarity with different languages, you may naturally prefer one toolkit over another. All of the listed options except for ISIP have Python wrappers available either on the main site or found quickly with a web search. Of course, the Python wrappers may not expose the full functionality of the core code available in the toolkit. CMU Sphinx also has wrappers in several other programming languages.

Development activity: All of the projects listed have their origins in academic research. CMU Sphinx, as may be obvious from its name, is a product of Carnegie Mellon University. It’s existed in some form for about 20 years, and is now available on both GitHub (with C and Java versions there) and SourceForge, with recent activity on both. Both the Java and C versions appear to have only one contributor on GitHub, but this doesn’t reflect the historical reality of the project (there are 9 administrators and more than a dozen developers on the SourceForge repo). Kaldi has its academic roots in a 2009 workshop, with its code now hosted on GitHub with 121 contributors. HTK started its life at Cambridge University in 1989, was commercial for some time, but is now licenced back to Cambridge and is not available as open source software. While its latest version was updated in December of 2015, the prior release was in 2009. Julius has been in development since 1997 and had its last major release in September of 2016, with a somewhat active GitHub repo consisting of three contributors, which again is unlikely to reflect reality. ISIP was the first state-of-the-art open source speech recognition system, and originated from Mississippi State. It was developed mostly from 1996 to 1999, with its last release in 2011, but the project was mostly defunct before the emergence of GitHub.1

Community: Here we looked at both mailing and discussion lists and the community of developers involved. CMU Sphinx has online discussion forums and active interest in its repos. However, we wonder if the duplication of repos across SourceForge and GitHub is blocking more widespread contribution. In comparison, Kaldi has both forums and mailing lists as well as an active GitHub repo. HTK has mailing lists but no open repository. The user forum link on the Julius web site is broken, but there may be more information on the Japanese site. ISIP is primarily targeted at educational purposes, and the mailing list archives are no longer functional.

Tutorials and Examples: CMU Sphinx has very readable, thorough, and easy to follow documentation; Kaldi’s documentation is also comprehensive but a bit harder to follow in my opinion. However, Kaldi does cover both the phonetic and deep learning approaches to speech recognition. If you are not familiar with speech recognition, HTK’s tutorial documentation (available to registered users) gives a good overview of the field, in addition to documentation on the actual design and use of the system. The Julius project is focused on Japanese, and the most current documentation is in Japanese2, but they are also actively translating to English and provide that documentation as well; there are some examples of running speech recognition here. Finally, the ISIP project has some documentation, but it is a little more difficult to navigate.

Trained models: Even though a main reason to use these open or free tools is that you want to train specialized recognition models, it is a big advantage when you can speak to the system out of the box. CMU Sphinx includes English and many other models ready to use, with the documentation for connecting to them with Python included right in the GitHub readme. Kaldi’s instructions for decoding with existing models are hidden deep in the documentation, but we eventually discovered a model trained on some part of an English VoxForge dataset in the egs/voxforge subdirectory of the repo, and recognition can be done by running the script in the online-data subdirectory. We didn’t dig as deeply into the other three packages, but they all come with at least simple models or appear to be compatible with the format provided on the VoxForge site, a fairly active crowdsourced repository of speech recognition data and trained models.

In the future, we will discuss how to get started using CMU Sphinx. We also plan to follow up on our earlier deep learning post with one that applies neural networks to speech, and will compare the neural net’s recognition performance to that of CMU Sphinx. In the meantime, we always love feedback and questions on our blog posts, so let us know if you have additional perspective on these toolkits or others.


  • Gaida, Christian, et al. “Comparing open-source speech recognition toolkits.” Tech. Rep., DHBW Stuttgart (2014).


1. After noticing many of the web site links are broken, we sent email to the mailing list to let them know about the broken links, and got a reply letting us know that the site mostly serves historical purposes now.
2. Thus, our rating of “++” applies to English only, as we do not read Japanese.

The post Open Source Toolkits for Speech Recognition appeared first on Silicon Valley Data Science.


How to avoid the Texas Sharpshooter Fallacy in data analysis

The rise of Big Data, data science and predictive analytics to help solve real world problems is just an extension of science marching on. Science is humanity’s tool for better understanding the world. The tools that we use to build models, test hypotheses, and look for trends to build value with our brand all derive directly from scientific principles.

With these principles comes a myriad of obstacles. The obstacles are known to philosophers as “logical fallacies”, which I outlined in my previous post "The 7 Logical Fallacies to avoid in Data Analysis."  In this blog post, we focus on the Texas Sharpshooter Fallacy and how to avoid it in your data analysis.

Teradata ANZ

Turbo Charge Enterprise Analytics with Big Data

We have been showing off the amazing art works drawn from numerous big data insight engagements we’ve had with Teradata, Aster and Hadoop clients. Most of these were new insights to answer business questions never before considered.

While these engagements have demonstrated the power of insights from the new analytics enabled by big data, these analytics continue to have limited penetration into the wider enterprise analytics community. I have observed significant investment in big data training and in hiring fresh data science talent, but the new analytics remains a boutique capability, not yet leveraged across the enterprise.

Perhaps we need to take a different tack. Instead of changing the analytical culture to embrace big data, why not embed big data into existing analytical processes? Change the culture from within.

How exactly do you do that? Focus big data analytics on adding new data points for analytics, then make these data points available using the enterprise’s current data deployment and access processes. Feed the existing machinery with the new data points to turbo-charge the existing analytical insight and execution processes.

A good starting area is the organisation’s customer events library. This library is a database of customer behavior change indicators that trigger action and provide context for marketing interventions. For banks, this would be significant deposit events (e.g. a deposit three standard deviations from the last five months’ average deposit), and for telcos, significant numbers of dropped calls. Most organisations have a version of this in place, with dozens of these pre-defined data intervention points together with customer demographics. These data points support over 80% of the actionable analytics currently performed to drive product development and customer marketing interventions.
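As a rough illustration of such an event trigger, the sketch below flags a deposit that lies more than three standard deviations away from the average of the last five months. The function name, the use of the sample standard deviation, and the figures are assumptions made for this example and do not describe any particular bank's rule.

```python
from statistics import mean, stdev


def is_significant_deposit(history, new_deposit, threshold=3.0):
    """Flag a deposit that deviates from the recent average by more than
    `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All past deposits identical: any different amount is a change.
        return new_deposit != average
    return abs(new_deposit - average) > threshold * spread


# Illustrative figures only: total deposits for the last five months.
last_five_months = [2000.0, 2100.0, 1900.0, 2050.0, 1950.0]

routine = is_significant_deposit(last_five_months, 2100.0)
windfall = is_significant_deposit(last_five_months, 5000.0)
```

With these made-up figures the average is 2000 and the sample standard deviation is about 79, so a 2100 deposit is routine while a 5000 deposit would be recorded as an event.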

What new data points can be added? For example, life events that can provide context for the customer’s product behavior remain a significant blind spot for most organisations, e.g. closing a home loan due to divorce, or refinancing to buy a bigger house because of a new baby. The Australian Institute of Family Studies has identified a number of these life events.

Big data analytics applied to combined traditional, digital and social data sources can produce customer profile scores that become data points for the analytical community to consume. The scores can be recalculated periodically, and the changes become events themselves. With these initiatives, you have embedded big data into your existing enterprise analytical processes and moved closer to the deeper understanding that enables pro-active customer experience management.

We have had success with our clients in building some of these data points. Are you interested?

The post Turbo Charge Enterprise Analytics with Big Data appeared first on International Blog.


February 22, 2017

Revolution Analytics

The difference between R and Excel

If you're an Excel user (or any other spreadsheet, really), adapting to learn R can be hard. As this blog post by Gordon Shotwell explains, one of the reasons is that simple things can be harder to...



Corey Hermanson and Perfecting Deep Web Technologies

How does someone go from being an aspiring sports writer to having the job title “Data Acquisition Engineer”? Although it seems unconventional, this path has paid dividends for one of our experts on Deep Web technologies, Corey Hermanson. Born and raised in Minnesota, Corey has always called the Upper Midwest his home. After graduating high school, Corey […] The post Corey Hermanson and Perfecting Deep Web Technologies appeared first on BrightPlanet.


Revolution Analytics

Finding Radiohead's most depressing song, with R

Radiohead is known for having some fairly maudlin songs, but of all of their tracks, which is the most depressing? Data scientist and R enthusiast Charlie Thompson ranked all of their tracks...

InData Labs

Reproducibility and Automation of Machine Learning Process

We’re very happy to keep engaging with professional communities on topics we’re passionate about. This time, our Data Science expert Denis Dus spoke at PyCon Belarus’17. At the event he covered the topic of Reproducibility and Automation of Machine Learning Process.

The post Reproducibility and Automation of Machine Learning Process appeared first on InData Labs.


February 21, 2017

Silicon Valley Data Science

Breaking Down Communication Barriers in Tech

In late 2016 I spoke with Travis Oliphant, co-founder of Continuum Analytics. We covered many topics, including building a community and balancing enterprise with open source. I’ve broken our conversation up into a series of posts, which will be published over the next several weeks. In this first part of our interview, we discuss breaking down silos, the importance of effectively communicating about cutting-edge technology, and where Anaconda is going next.

What are you most excited about right now? I’m looking for a gut reaction.

A gut reaction. I think that the future is going to be very bright because of the innovation engines that are exploding around open source. The opportunities for machine learning get me excited. I am an old-school, statistical signal processing guy who is also an applied mathematician. When I see machine learning, I see a specialized application of techniques that have been used for decades. Machine learning is re-energizing applied math in the world in a way that you can see. Business intelligence, which used to be dumbed-down math and statistics, can now be driving the education of the world into better solutions.

I’m excited about that trend and I am excited about the integrations that are occurring. Right now, currently, the way that the world is today people end up re-implementing everything in silos, and I see how that can be broken down. I see how we can actually reuse each other’s code and algorithms in a way that was never possible before. This could happen as early as next year. I mean, it is starting to happen a little bit now using the filesystem as an intermediate store—so we aren’t just competing in silos between the Scala, Python, and R worlds. Those worlds can actually start cooperating.

So when you say “we,” do you mean people in tech in general? Or are you thinking about specific initiatives that Continuum is taking part in?

I’m thinking about tech in general, like the world. Certainly Continuum is playing its role in all of that. Towards the end of my previous statement I was specific—there are some things that we are doing around breaking down silos that I’m very excited about. What I’m excited about with Continuum is just how we have a bunch of disparate products in the marketplace and those are becoming unified behind a single data science platform.

I think you are right. We are talking a lot on our side about moving away from silos and democratizing data. As you look at all these changes that are happening, what do you see as the biggest challenge to actually making it real as soon as possible?

Our biggest challenge is actually because of the previous success. Now we need to break down some of the communication barriers that have emerged. We also need to unlearn incorrect assumptions that stand in the way. People have made these assumptions about things that aren’t correct or don’t have to be true but will be made true by the assumptions being believed. Does that make sense?

It does. So it is a people issue?

It is a people issue. It is helping break down the mental models—the world views that don’t have to be the way they are, but are because of a lack of prior art. People have to see something, it is really hard to communicate abstractly. You have to kind of see a thing and then you can abstract around it. So the biggest challenge is that we see how the world could work, and have to communicate that to the world in a way that is consistent, or connected, or somehow brings them from where they are currently thinking into how the world could be.

For example, I was talking with some people from a large company yesterday and realized they are two innovation cycles behind. It was a little bit frustrating I have to admit, because I hadn’t been around somebody who felt like they were from the ’90s for so long. And I was just like, whoa, I don’t even know how to talk. I don’t know how to communicate because your world view and perspective is so foreign to me now. Now some of that—it is a bit arrogant for me to assume it is all that person. I recognize things take a while to change. So even though the ability to do it is there, you have got all kinds of work to do to connect it to the day-to-day of somebody, and the day-to-day to somebody is working in right now.

I was just in an auto parts store, and there was an old dot matrix line printer from the early to late ’70s. Still working and still connected to their point-of-sale device. This is life, you know? People don’t just immediately swap out everything they were doing that was working.

The IRS is still on an old mainframe for managing tax returns. They are currently in the middle of a 10-year project to modernize it, but the challenge of course is they are modernizing to yesterday’s technology. But that is always a challenge. I think education will always remain a big part of the challenge we face. There is a lot of stuff that could be if we can communicate about it effectively. So effective communication and getting the word out is key. We need to help break down barriers caused by other— not incompatible—but under-informed perspectives.

If anyone, what group do you think owns the responsibility to start that conversation in an organization? The business people, the tech people, or anyone who happens to get it?

First, anybody who happens to get it owns the responsibility to tell the story and recruit the people to help them do it better. There are documentation people who are really good at helping, and design and graphic people that can put pictures and videos together. There are obviously public relations and marketing efforts, and any organization can be a part of that.

Here at Continuum, a bunch of different people end up helping that happen over time. It has to be pursued from multiple angles, because you are talking about breaking down barriers, and you are talking about communicating and people are different. Some ways of communicating will resonate with certain people, and other ways will resonate with other groups.

Do you feel that the rate of change in tech right now is on par with the past several years, or do you think things are really starting to speed up?

I think it is accelerating. Certainly the demand for new and improved capability—there is lots of data interconnecting. There is demand to have that data be useful. So it is driving innovation, and at the same time you have got this open source ecosystem that is lit up—kind of an undercurrent of innovation that was always there, but was underutilized. It was sort of hidden behind the corporate structures of how work got done, and old academic structures.

Now there is enough momentum where corporations are funding it, and academics are funding it—of course, now the problem is integration. And the other problem is that there is this really powerful innovation across the board, but it is all scattered. Bringing it together so it can be applied usefully is still a significant effort.

That is what Anaconda is. That’s its whole purpose. The way it’s designed, it is all about recognizing a world full of disparate packages and projects that can be brought together to do amazing things.

I’ve been excited about bringing technology together since I was a grad student. I want to connect the ideas in libraries (optimization libraries, integration libraries, visualization libraries, and analytics libraries) and pull them together in a way that can be accessible and used. My first big project, the SciPy project, was really just a large distribution of software. You look at what SciPy was: it brought together a whole bunch of disparate ideas into a single library. In fact, it should have brought them together into a distribution of separate libraries.

Humans struggle to interact meaningfully on teams that are bigger than 7 to 11 maybe (and really I think the number is 5). Consider the cognitive load of understanding the team members sufficiently to make intimate progress, to find out all the different perspectives and concerns and really build those bonds of trust that produce viable results: those teams can’t be very big and have our brains keep up. So innovation is modular.

And SciPy—as the community grew, it needed to be around the modules and those needed to support thousands of people developing, but they couldn’t do it in one place, so really the distribution was the problem. That really was the impetus for Anaconda—recognizing that the problems of SciPy (some of which were carried into NumPy) were really problems of packaging distribution. Anaconda grew up out of that recognition and a desire to try to make things better. And in the process realizing that solves a whole slew of other problems between the parts of any significant software project. Software builds on other software, and these can be brought together by other people to create new artifacts and solutions. This creates a dependency tree—a tree of interconnections—that has to be managed so it can be updated, deployed, and all these things can be reproduced and governed. That is Anaconda; that is what it’s about.

So it starts with just getting all the stuff. The reason I created SciPy in the first place was to help people get the stuff. And a lot of work was spent doing that, and then Anaconda just helps people get more stuff more quickly and in a more repeatable way. But underneath that is the architecture that solves the fundamental problem of developer interconnection.

You talked a bit about the genesis of Anaconda—what do you see it evolving into in the future?

Yeah. It has become the foundation of our platform, and it started with Python but it has evolved to include R, Scala, Julia, Node: any package from any language. We call it an open data science platform, instead of just a distribution of Python. So it is evolving into an ecosystem which brings data science together.

What I see as its future is there is a place where free stuff is shared freely and then a place where people can sell to each other as well. You can sell modules because they are easily plugged into this system, and then of course we are trying to make it easy for people to sell—that is a separate problem. Now you are looking at, how do I make it easy for people to buy each other’s work on top of the free stuff? How do you help people interact with each other?

It all comes down to the people, doesn’t it?

It does. That is why markets are so hard to understand and predict, because ultimately markets evolve with groups of people. And you try to understand—you fundamentally solve somebody’s problem and help some people. But then how that interacts with other problems that are being solved at the same time by other people can be hard to predict.

For us, Anaconda solves the heterogeneity problem in an exploding world of innovation. I have written a blog post about this notion that Anaconda helps to normalize enterprise deployment. Or, how does enterprise consume open source? How do they do that? If you don’t use Anaconda, you have to create something like Anaconda. So Anaconda is a place where it can all come together. And for us, the free is free—it will always be free—and then on top of that we believe there is another enterprise layer that is necessary, that maybe open source people won’t create quickly, but enterprises will need to pay for and will pay for. Because that’s a decision a lot of enterprises incorrectly make—rather than amortize the cost of that shared layer across multiple customers, they each independently build it and then pay the cost to maintain it themselves, because they are not going to find an open source community to maintain it. They are going to have to do it, and they just do it rather than have a common layer.

Editor’s note: The above has been edited for length and clarity. In the next installment of this interview, we’ll cover more on how enterprise and open source goals can work together, and advice for building a community.


The post Breaking Down Communication Barriers in Tech appeared first on Silicon Valley Data Science.

Big Data University

This Week in Data Science (February 21, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I
    – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases
    – Learn how to unleash the power of R when working with relational databases in our newest free course.

Cool Data Science Videos

The post This Week in Data Science (February 21, 2017) appeared first on BDU.


February 19, 2017

Simplified Analytics

What is the difference between Consumer IoT and Industrial IoT (IIoT)?

The Internet of Things (IoT) began as an emerging trend and has now become one of the key elements of Digital Transformation that is driving the world in many respects. If your thermostat or refrigerator...


February 17, 2017

Revolution Analytics

Because it's Friday: Et tu?

I spent 6 years learning to speak French as a student in Australia, so naturally I was excited to try out my language skills on my first visit to France. Inevitably, I could understand no-one, and...


Revolution Analytics

Catterplots: Plots with cats

As a devotee of Tufte, I'm generally against chartjunk. Graphical elements that obscure interpretation of the data occasionally have a useful role to play, but more often than not that role is to...


February 16, 2017

Silicon Valley Data Science

Analyzing Caltrain Delays

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! In this case, Caltrain has updated some of their technology, but we hope you’ll still enjoy our analysis. The original version of this post can be found here, and to see our newer Caltrain work, check out Introduction to Trainspotting.

Many people who live and work in Silicon Valley depend on Caltrain for transportation. Here at SVDS we’ve long had an interest in Caltrain because our headquarters have always been close to its tracks, first in Sunnyvale and now in Mountain View. Caltrain is literally in our backyard. About half of our employees regularly commute on Caltrain.

Also, we just like trains.

But along with our fascination with Caltrain is some frustration. Caltrain provides a “real-time” API of expected train departure times by station. As with many systems we interact with in our daily life, when they work, they are almost invisible and we take them for granted. But when their performance fails or degrades, they suddenly seem very important. When Caltrain delivers you on time to your destination you may not give it a second thought, but if you’re ten minutes late to a critical meeting — or you miss a flight — you may develop very “strong” feelings about it.

Yet, uncertainty can be more aggravating than failure. You may not mind that a train is 10 minutes late as long as you can rely on that fact. For this reason, we’re ultimately interested in modeling Caltrain delays so we can make more reliable predictions.1

In Figure 1 below, we see the performance of three forecasts up to 20 minutes before train 138 is scheduled to depart Palo Alto. A perfect predictor (green) would have told us the train was running six minutes late over that entire time. That’s enough advance notice for me to know I can finish my coffee and Atlantic article before I have to head out the door. On the other hand, a poor “predictor” (red), which is not really a predictor at all, says the train is on time until it’s not, and then simply tells you how late the train currently is. The closer a predictor is to the horizontal green line, the better it is. The currently available Caltrain API (blue) is somewhere in between. At 10 minutes out, it still says the train is only a minute late, and then just increases from there.

Caltrain predicted delay

Figure 1: Predicted lateness vs scheduled arrival
Source: Train 138 from 9/25/2015.

Is that the best we can do? We think not. In this case, the train was already four minutes late at Hillsdale, 20 minutes before it arrived at Palo Alto. We can tell that the Caltrain API does not include this prior knowledge to create a better prediction because its “prediction” starts at 0 delay.
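To make the missed opportunity concrete, here is a minimal sketch comparing two toy forecasts using only the two numbers quoted above (four minutes late at Hillsdale, six minutes late at Palo Alto). The function names are hypothetical, and the "carry forward" rule is just one simple way to use upstream information, not a description of any particular model.

```python
# Illustration only: the two delays below are the ones quoted in the text
# for train 138; everything else (names, the carry-forward rule) is assumed.

def on_time_until_late(current_lateness_min):
    """The poor "predictor": report only lateness already observed at the target station."""
    return max(0, current_lateness_min)

def carry_forward(upstream_delay_min):
    """Persistence forecast: assume the delay at the last upstream station simply holds."""
    return upstream_delay_min

actual_delay = 6      # minutes late at Palo Alto
hillsdale_delay = 4   # minutes late at Hillsdale, 20 minutes earlier

# Twenty minutes out, no lateness has been observed at Palo Alto yet.
naive_forecast = on_time_until_late(0)
persistence_forecast = carry_forward(hillsdale_delay)

naive_error = abs(actual_delay - naive_forecast)
persistence_error = abs(actual_delay - persistence_forecast)
print(naive_error, persistence_error)
```

Even this crude rule cuts the error at the 20-minute mark from six minutes to two in this one example, which is why upstream delay looks like a promising feature.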

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model. Be prepared — we ask more questions than we answer! This preliminary exploratory analysis is often important for doing effective feature engineering and model evaluation (to be discussed in later posts). This understanding-first approach stands in contrast to some prediction competitions and machine learning tasks where participants may care little about the meaning of features they are given — if they are told at all! But as our name says, we do data science for business clients that want actionable insights more than one-off models.

So, as an R&D project, we have been playing with data science techniques to better understand and/or predict movement of trains within the Caltrain system. We want to apply data science to a variety of local, distributed, redundant, and possibly unreliable information sources in order to understand the state of the system. Data sources are never completely reliable and consistent across a large system, so this gives us experience producing predictions from messy and possibly erroneous data. For this reason in the past we’ve analyzed audio and visual signals from the trains.

Show me the data!

Our dataset was scraped from Caltrain’s API and includes 83 days from Sep 25, 2015 to Feb 16, 2016. The API attempts to indicate how much longer until a train will depart at each station (Table 1). As we’ve seen, Caltrain’s real-time prediction is not a reliable predictor of when a train will depart, but we believe that it’s reliable for detecting when a train has departed (most of the time…). This information, along with the schedule, allows us to see how Caltrain performs: its on-time performance, when and where it accrues delay, and so on. Here is a small sample of our data:

Caltrain arrival data

Table 1: Random sample of 5 rows from scraped API data. Keep in mind when interpreting figures and analysis that “Minutes until arrival” is only reported with minute resolution.

The Station column is a short code used by Caltrain: ctmi is the Millbrae station, ct22 is the 22nd Street station, ctsb is San Bruno, etc.

In general, we take a station-centric view of the system; we look at trains as they leave stations (we don’t have data on when they arrive). From this perspective, delays are accrued at stations.

There are three classes of train — local, bullet, and limited — which differ mostly in which stops they make. We will see that they have different delay characteristics, and so we will often examine them separately. For example, we can already see surprising features in the delays simply by plotting the delays at each station for a particular train.

Figure 2 below shows the cumulative delays of local train 138 as it goes southbound from San Francisco. Notice that there’s a prominent recurring upward spike in delay at the Burlingame station. But why? This occurs for limited trains as well, but Figure 3 shows that the effect is less pronounced for northbound trains and in the opposite direction (i.e. a downward bump).

Local train 138 delays

Figure 2: Local train 138 delays by station, from north to south.


Delay for weekday northbound and southbound trains

Figure 3: Delay for weekday northbound and southbound trains by station, scaled by geographical distance. The shaded area indicates the standard deviation of recorded delays. Credit: Daniel Margala

In both Figure 2 and Figure 3 above, we see that as trains move down the line, they accrue delays in accordance with our expectations. This also quickly sets our understanding of the magnitude of the problem. On average, accrued delays are on the order of 5 minutes, but as we see from the standard deviations, there is a huge variation for any particular train. The question becomes: can a model of train delays describe that variation sufficiently well to be useful for individual predictions? How much better would it do than a simple model which always predicts the average delay for a given train?
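The "simple model" in that last question can be written down in a few lines, which also gives a yardstick for any later model. The sketch below is a hypothetical baseline: the observations are made up for illustration (they are not Caltrain data), and mean absolute error is an assumed choice of metric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (train_id, delay_minutes) observations; not real Caltrain data.
history = [(138, 4), (138, 6), (138, 8), (142, 1), (142, 3)]
holdout = [(138, 9), (142, 2)]

def fit_mean_delay(observations):
    """Baseline model: the average historical delay for each train."""
    by_train = defaultdict(list)
    for train_id, delay in observations:
        by_train[train_id].append(delay)
    return {train_id: mean(delays) for train_id, delays in by_train.items()}

def mean_absolute_error(model, observations):
    """Average size of the miss, in minutes, on unseen observations."""
    return mean(abs(delay - model[train_id]) for train_id, delay in observations)

baseline = fit_mean_delay(history)
baseline_mae = mean_absolute_error(baseline, holdout)
print(baseline, baseline_mae)
```

Any more elaborate delay model would need to beat this number on held-out days to justify its complexity.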

What’s up with Burlingame?

At first glance of the delays for a single train (Figure 2), we saw hints that there may be systematic station effects. Colloquially, we’d say there’s a huge upward spike at Burlingame. An effective way to see this is to compare the delay added at stations. To illustrate this, we switch to a slightly different type of plot, in which the incoming (previous) delay is plotted on the x axis and the outgoing (departure) delay is plotted on the y axis. Figure 4 shows such a plot for Burlingame.

Departure delays vs incoming delay

Figure 4: Departure delays vs incoming (previous) delay. Color indicates number of trains with that departure and previous delay (data is reported only to the minute). The dashed red line is the line y=x, or perfect correlation between incoming and outgoing delay.

As you might expect, delay into a station is roughly equal to delay leaving. For a train that neither gains nor loses time at the station, they would be exactly equal, and the point would lie exactly on the red dashed line in Figure 4. In reality, trains sometimes make up a bit of time (points below the line), but more often acquire additional delay (points above the line).

These quantities are all related quite simply:

departure_delay = incoming_delay + added_delay

To probe for station-level effects, we study the distribution of these added delays at each station. Note that we have removed outliers here by dropping the most extreme 1% of added delays. We then produce “box and whiskers” plots of the delays (you may recall that a box and whiskers plot shows the median and the variability of a set of values). Immediately we notice Burlingame (‘ctbu’) and San Mateo (‘ctsmat’) have striking differences between north and south for local and limited trains.
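As a rough sketch of that preparation step, the snippet below derives added delay from the relation above, trims the most extreme 1%, and computes the quartiles a box plot would draw. The data are hypothetical, and reading "most extreme" as "largest in absolute value" is an assumption; the original analysis may have trimmed differently.

```python
from statistics import median, quantiles

def added_delays(records):
    """records: iterable of (incoming_delay, departure_delay) pairs in minutes."""
    return [departure - incoming for incoming, departure in records]

def drop_most_extreme(values, fraction=0.01):
    """Drop the given fraction of values with the largest absolute size (assumed rule)."""
    n_drop = int(len(values) * fraction)
    if n_drop == 0:
        return list(values)
    return sorted(values, key=abs)[:len(values) - n_drop]

# Hypothetical station data: 99 trains add one minute, one train adds 45.
records = [(2, 3)] * 99 + [(2, 47)]
added = added_delays(records)
trimmed = drop_most_extreme(added)

# The box in a box plot spans q1 to q3, with a line at the median (q2).
q1, q2, q3 = quantiles(trimmed, n=4)
print(len(trimmed), median(trimmed), (q1, q2, q3))
```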

Added delay of local trains

Figure 5: Added delay of local trains by direction for selected stations


Added delay of limited trains

Figure 6: Added delay of limited trains by direction for selected stations

The Burlingame effect is even more striking when comparing the departure (outgoing) delay vs the incoming delay, as in the following bivariate kernel density estimate (Figure 7; left). For all northbound trains (red), no matter the delay at the previous station, the train makes up time. Going south, the trains almost always get behind and in a nonlinear way.

We currently have a working hypothesis for why this is so. Caltrain reports that the API uses data from track sensors and GPS systems on locomotives, which we don’t have access to. Suppose that the track sensors form the bulk of the API data (which seems plausible given the API’s performance). If the track sensors that determine departures from Burlingame station are too far to the south of the station, then every train going north will look a little early, and every train going south will look a little later. In general, we think that there’s a systematic error in the measurement of train departures at that station. One way we can validate this hypothesis is to ride the train ourselves with GPS tracking and check for discrepancies. Watch for a follow-up that explains this mystery.

In San Mateo, the effect is flipped (Figure 7; right). The southbound trains tend to get slower, hence the blue (southerly) distribution lying below the ‘on time’ line that runs along the diagonal.

Outgoing vs ingoing delays for Burlingame and San Mateo

Figure 7

The bullet trains also had a north/south difference at San Mateo (data not shown), while the weekend bullet has its largest difference at Millbrae. It’s clear that any predictive model will need to account for these large systematic effects, or in a sense, model the distributions in the bivariate KDEs above.

Future work

There is clearly much more that can be explored, like correlations with day of week, time of year or holidays, special events, and ridership data. Stay tuned for more results, and very soon, a better train delay prediction!

Thanks to Daniel Margala and Jonathan Whitmore for lots of consultation on figures, analysis, and wording. Eric White also did some of the original processing, analysis, and visualization of the delay data.

1. Note that there are two notions of reliability here: reliability of the train and the reliability of predictions about train arrivals. We obviously can’t make Caltrain itself more reliable but we’re hoping we can make better predictions about when a train will depart.

The post Analyzing Caltrain Delays appeared first on Silicon Valley Data Science.

Revolution Analytics

Six Articles on using R with SQL Server

Tomaž Kaštrun is a developer and data analyst working for the IT group at SPAR (the ubiquitous European chain of convenience stores) in Austria. He blogs regularly about using Microsoft R and SQL...


Revolution Analytics

Performance improvements coming to R 3.4.0

R 3.3.3 (codename: "Another Canoe") is scheduled for release on March 6. This is the "wrap-up" release of the R 3.3 series, which means it will include minor bug fixes and improvements, but eschew...

Teradata ANZ

Spotting the pretenders in Data Science

The term “Data Scientist” is often over-used or even abused in our industry. Just the other morning I was watching TV and a news piece came on talking about the hottest careers in 2017 and data science was top of the list. Of course this is good for those who have been dealing with data for many years in one shape or another because your skills will be in demand. However the bad news is that the industry gets flooded with fakes all looking to get in on the action. It really is the wild west.

The problem with the industry is that there is no official certification program like, say, the Microsoft or Cisco certification programs. Therefore it is often difficult for an employer to tell whether candidates are really as good as they say they are. Some might have a background in data and may be able to punch out some lines of SQL, but that doesn’t make a data scientist.

You can rely on the old method of making contact with references but we all know that can be fraught with danger as you’ll often reach the prospective employee’s best friend or someone who has been coached on what to say when they are called.
And most of the time, the prospective employee will be unable to show you the types of projects they have previously worked on because it may be commercially sensitive or just plain difficult to demonstrate in an interview.

What makes the hiring process so much more difficult is that you are often under pressure to hire because data-based projects are considered a priority within your organisation and are being carefully watched by management. You must therefore hire quickly and hire quality to deliver. The pressure is on you to get it right from the start.

So what more can one do to weed out the fake data scientists?

I’ve listed some interview questions below that will reveal whether candidates are really as good a data scientist as they say they are:

Q: If you had a choice of a Machine Learning algorithm, which one would you choose and why?
This is a trick question. Everyone should have a “go-to” algorithm; that’s the easy part of the question. The devil lies in the second part of the question, the “why”. A good Data Scientist should be able to explain why they prefer the algorithm they mentioned and give an explicit answer as to its applicability or flexibility. If they went a step further and compared and contrasted their favourite algorithm with an alternative approach, it would demonstrate an intricate knowledge of the algorithm.

Q: You’ve just made changes to an algorithm. How can you prove those changes make an improvement?
Once again you’re not seeking the obvious answer, rather testing the data scientist’s ability to demonstrate reason. In a research degree you have to demonstrate components of your research such as:
• The results are repeatable
• The demonstration of the before and after test are performed within a controlled environment using the same data and same hardware on both occasions.
• Ensuring that the test data is of sufficient quantity and quality to test your algorithm accurately. For example don’t test it on a small dataset and then roll it into production against a huge dataset with a lot more variables.

The key with this answer is that you are seeking to see how scientific the applicant is. Such a question could also give you an insight into their background: do they come from an academic background?

Q: Give an example approach for root cause analysis.
Wikipedia states that root cause analysis is “a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event’s outcome, but is not a root cause”
This question seeks to understand if the applicant has ever performed these types of investigations in the past to troubleshoot an issue in their code. Once again we’re not looking for an explanation of what is root cause analysis, more so how root cause analysis may have been used in the past to solve something they were working on.


Q: Give examples of when you would use Spark versus MapReduce.
There are many answers to this question. For example, in-memory processing using RDDs on small datasets is faster than MapReduce, which has a higher IO overhead. You’re also looking for flexibility in a data scientist. There are many approaches a Data Scientist can take that lead to the same outcome. For example, MapReduce may get to the same answer as Spark, albeit just a bit slower. But knowing when to use which approach and why is a valuable skill for a data scientist to have.

Q: Explain the central limit theorem.
Many data scientists come from a background of statistics. This question is testing a basic knowledge of statistics that any statistician should know if they are applying for a role as a Data Scientist. There’s a whole blog that compares and contrasts the role of a Data Scientist and a Statistician, however you may be seeking to build a data science team with a wide range of skills including statistics.

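By the way, the CLT is a fundamental theorem of probability: when you average a large number of independent observations drawn from the same distribution (with finite variance), the distribution of that sample mean is approximately normal, centred on the population mean, whatever the shape of the underlying data. There are many other explanations of the CLT available online.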
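A candidate who really understands the theorem should be able to sketch a quick simulation of it. The following is one minimal illustration, assuming an exponential distribution (mean 1, standard deviation 1) as the skewed source data; the sample size, repetition count, and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Average n draws from an exponential distribution (strongly skewed, not normal)."""
    return mean(random.expovariate(1.0) for _ in range(n))

n = 100
means = [sample_mean(n) for _ in range(5000)]

center = mean(means)   # should sit close to the population mean, 1.0
spread = stdev(means)  # should sit close to sigma / sqrt(n) = 1 / 10 = 0.1

# For a normal distribution roughly 95% of values fall within two standard deviations.
within_two = sum(abs(m - center) <= 2 * spread for m in means) / len(means)
print(round(center, 3), round(spread, 3), round(within_two, 3))
```

Although the individual draws are heavily skewed, the sample means cluster symmetrically around 1 with a spread near 0.1, which is the behaviour the theorem describes.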

Q: What are your favourite Data Science websites?
This is attempting to find out how passionate they are about data science. A good Data Scientist would obviously bring up the usuals such as KDnuggets or Data Science Central. You want to hire Data Scientists who not only use these sites but also keep their finger on the pulse of what’s happening and engage online with other like-minded individuals. You never know, your next hire may come from one of these sites.

At the end of the day, you are not only assessing their knowledge but what skills and knowledge they would bring to your data science team.

In a previous blog on ‘Seven traits on successful Data Science teams‘ I discussed forming a team with varied skills. You don’t want a Data Science team of clones. Each member of your team should have specific sets of skills that they bring to the table that complement the team. Get your interview questions formed well before the interview and you’re well on the way to building that special team.

The post Spotting the pretenders in Data Science appeared first on International Blog.


February 15, 2017

Silicon Valley Data Science

Getting Started with Deep Learning

At SVDS, our R&D team has been investigating different deep learning technologies, from recognizing images of trains to speech recognition. We needed to build a pipeline for ingesting data, creating a model, and evaluating the model performance. However, when we researched what technologies were available, we could not find a concise summary document to reference for starting a new deep learning project.

One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer the chart below, along with explanations of the various criteria upon which we based our decisions.

These rankings are a combination of our subjective experiences with image and speech recognition applications for these technologies, as well as publicly available benchmarking studies. Please note that this is not an exhaustive list of available deep learning toolkits, more of which can be found here. In the coming months, our team is excited to check out DeepLearning4j, Paddle, Chainer, Apache SINGA, and DyNet. We explain our scoring of the reviewed tools below:

Languages: When getting started with deep learning, it is best to use a framework that supports a language you are familiar with. For instance, Caffe (C++) and Torch (Lua) have Python bindings for their codebases, but we would recommend that you be proficient with C++ or Lua respectively if you would like to use those technologies. In comparison, TensorFlow and MXNet have great multi-language support that makes it possible to utilize the technology even if you are not proficient with C++.

Note: We have not had an opportunity to test out the new Python wrapper for Torch, PyTorch, released by Facebook AI Research (FAIR) in January 2017. This framework was built for Python programmers to leverage Torch’s dynamic construction of neural networks.

Tutorials and Training Materials: Deep learning technologies vary dramatically in the quality and quantity of tutorials and getting started materials. Theano, TensorFlow, Torch, and MXNet have well documented tutorials that are easy to understand and implement. While Microsoft’s CNTK and Intel’s Nervana Neon are powerful tools, we struggled to find beginner-level materials. Additionally, we’ve found that the engagement of the GitHub community is a strong indicator not only of a tool’s future development but also of how quickly an issue or bug can be solved through searching StackOverflow or the repo’s Git Issues. It is important to note that TensorFlow is the 800-pound gorilla in the room with regard to the quantity of tutorials, training materials, and community of developers and users.

CNN Modeling Capability: Convolutional neural networks (CNNs) are used for image recognition, recommendation engines, and natural language processing. A CNN is composed of a set of distinct layers that transform the initial data volume into output scores for predefined classes (for more information, check out Eugenio Culurciello’s overview of Neural Network architectures). CNNs can also be used for regression analysis, such as models that output steering angles in autonomous vehicles. We consider a technology’s CNN modeling capability to include several features. These features include the opportunity space to define models, the availability of prebuilt layers, and the tools and functions available to connect these layers. We’ve seen that Theano, Caffe, and MXNet all have great CNN modeling capabilities. That said, TensorFlow’s easy ability to build upon its InceptionV3 model and Torch’s great CNN resources, including easy-to-use temporal convolution, set these two technologies apart for CNN modeling capability.

RNN Modeling Capability: Recurrent neural networks (RNNs) are used for speech recognition, time series prediction, image captioning, and other tasks that require processing sequential information. As prebuilt RNN models are not as numerous as CNNs, it is important, if you have an RNN deep learning project, to consider what RNN models have been previously implemented and open sourced for a specific technology. For instance, Caffe has minimal RNN resources, while Microsoft’s CNTK and Torch have ample RNN tutorials and prebuilt models. While vanilla TensorFlow has some RNN materials, TFLearn and Keras include many more RNN examples that utilize TensorFlow.

Architecture: In order to create and train new models in a particular framework, it is critical to have an easy-to-use and modular front end. TensorFlow, Torch, and MXNet have a modular architecture that makes development straightforward. In comparison, frameworks such as Caffe require a significant amount of work to create a new layer. We’ve found that TensorFlow in particular is easy to debug and monitor during and after training, as the TensorBoard web GUI application is included.

Speed: Torch and Nervana have the best documented performance for open source convolutional neural network benchmarking tests. TensorFlow performance was comparable for most tests, while Caffe and Theano lagged behind. Microsoft’s CNTK claims to have some of the fastest RNN training time. Another study comparing Theano, Torch, and TensorFlow directly for RNN showed that Theano performs the best of the three.

Multiple GPU Support: Most deep learning applications require an enormous number of floating point operations (FLOPs). For example, Baidu’s DeepSpeech recognition models take 10s of ExaFLOPs to train. That is >10e18 calculations! As leading Graphics Processing Units (GPUs) such as NVIDIA’s Pascal TitanX can execute about 11e12 FLOPs a second, it would take over a week to train a new model on a sufficiently large dataset. In order to decrease the time it takes to build a model, multiple GPUs over multiple machines are needed. Luckily, most of the technologies outlined above offer this support. In particular, MXNet is reported to have one of the most optimized multi-GPU engines.
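The back-of-the-envelope estimate above can be checked in a few lines. This sketch assumes the low end of "10s of ExaFLOPs" (10e18 operations) and a single GPU sustaining a peak of roughly 11 teraFLOPS (11e12 operations per second); real training runs rarely sustain peak throughput, so the true figure would be longer.

```python
# Assumed inputs: lower bound on training cost and idealised single-GPU peak rate.
total_flops = 10e18            # floating point operations needed to train
gpu_flops_per_second = 11e12   # roughly 11 teraFLOPS, assumed sustained

seconds = total_flops / gpu_flops_per_second
days = seconds / 86_400        # 86,400 seconds in a day

# With perfect scaling (an idealisation), n GPUs divide the time by n.
days_on_8_gpus = days / 8
print(round(days, 1), round(days_on_8_gpus, 1))
```

Under these assumptions a single GPU needs about ten and a half days, which is why scaling across multiple GPUs matters.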

Keras Compatible: Keras is a high level library for doing fast deep learning prototyping. We’ve found that it is a great tool for getting data scientists comfortable with deep learning. Keras currently supports two back ends, TensorFlow and Theano, and will be gaining official support in TensorFlow in the future. Keras is also a good choice for a high-level library when considering that its author recently expressed that Keras will continue to exist as a front end that can be used with multiple back ends.

If you are interested in getting started with deep learning, I would recommend evaluating your own team’s skills and your project needs first. For instance, for an image recognition application with a Python-centric team we would recommend TensorFlow given its ample documentation, decent performance, and great prototyping tools. For scaling up an RNN to production with a Lua competent client team, we would recommend Torch for its superior speed and RNN modeling capabilities.

In the future we will discuss some of our challenges in scaling up our models. These challenges include optimizing GPU usage over multiple machines and adapting open source libraries like CMU Sphinx and Kaldi for our deep learning pipeline.


The post Getting Started with Deep Learning appeared first on Silicon Valley Data Science.


The 7 Logical Fallacies to avoid in Data Analysis

“Lies, damned lies and statistics” is the frequently quoted adage attributed to former British Prime Minister Benjamin Disraeli. The manipulation of data to fit a narrative is a very common occurrence, from politics and economics to business and beyond.

In this blog post, we'll touch on the more common logical fallacies that can be encountered and should be avoided in data analysis.

Big Data University

Learn how to use R with Databases

R is a powerful language for data analysis, data visualization, machine learning, and statistics. Originally developed for statistical programming, it is now one of the most popular languages in data science.

If you are a database professional (Data Engineer, DBA, Database Developer, etc.) and looking to leverage the power of R to analyze and visualize data in relational databases (RDBMSes), Big Data U (BDU) has a free, self-paced online course to get you going quickly: Using R with Databases.

And if you are not yet familiar with R, there is also a free crash course in R to get you started: R 101.

If you are a Data Scientist or a Data Analyst, chances are that you are already familiar with the richness of the R programming language and are already leveraging it for modeling, classification and clustering analysis, creating great graphs and visualizations, etc. But you may be hitting the memory limits of R when utilizing it for very large data sets.

Utilizing R with databases and data warehouses that are known for scalability and performance with large amounts of data is one mechanism to overcome the memory constraints of R. And the free course, Using R with Databases, will show you how.

This course starts with a comparison of R and Databases and discusses the benefits of using R with databases. It teaches you how to set up R for accessing databases and demonstrates how to connect to databases from R, specifically using interfaces like RJDBC and RODBC.

The course then goes on to show you how to query data from databases, get the results and visualize the analysis. It also covers some advanced topics like modifying and saving data in databases from R, as well as using database stored procedures from R.

Some databases also support in-database analytics with R, so you can benefit from the large amounts of memory and parallel processing features of databases while employing R for analysis. This course also helps you to learn about using in-database analytics with R.

Like other courses in BDU, each module in Using R with Databases comes with hands-on labs so you can practice what you learn in the course and try out your own variations.

Moreover, the hands-on lab environment, called BDU Labs, is free, cloud-based, ready to use, and integrated within BDU so you don’t have to register for a new account or worry about installing software.

The course consists of 5 learning modules, and after each module there are review questions. At the end of the course there is a final exam.

Successfully passing the course (by proving your proficiency with the review questions and final exam) earns you a course completion certificate.

This course is part of the Data Science with R learning path on BDU, and when you complete all courses in this learning path, you also earn an IBM badge that can be shared on your social profiles.

Enroll now for free and start Using R with Databases!

The post Learn how to use R with Databases appeared first on BDU.

Mario Meir-Huber

Ecommerce Software Solutions For Growth in Business

In the past few years, e-commerce has changed the face of business. Today, the internet has become the buzzword for trade. From buying groceries to looking for designer apparel or even gadgets, e-commerce has opened many doors for success.

In fact, this has been possible due to the presence of e-commerce software solutions in the market. These tools have made the business safe and secure for the users, who completely rely on e-commerce these days.

One of the prominent features of ecommerce software solutions is that you can give a unique identity to your website, depending on what you plan to sell. These days, companies can even get a customized template and that too at an affordable price.

There are options to pick from an already existing template design. Having a unique identity is important to break through the clutter of myriad sites that are present in the market and also to have your own USP.

Another important aspect of e-commerce software solutions is the shopping cart. These shopping carts are available for businesses of all shapes and sizes. So, if you are planning a small web shop or a virtual mall experience, it all can be customized.

Services like a store front, multiple payment options, and full inventory control are readily available. Along with this, shipping options, upload of unlimited products, and promotional aspects can also be taken care of.

Ecommerce software solutions also come equipped with a built-in content management system. This allows companies to create, edit and publish all kinds of content on their websites.

This also aids in creating static pages, surveys, text and graphic banners plus newsletters. Also, the navigation is super simple, which gives a convenient experience to users too.

The most positive aspect about this software is that it aims for a better online promotion of goods and services. Through this software, it is possible to create bespoke shopping experiences that help a business attain great heights in a short span of time.

Moreover, with the use of this software, marketing strategies are also executed in the best possible manner.

Remember, the popularity and the success of your business depends entirely on how your online platform engages the customers. It would be interesting to add some features related to multimedia application development.

The post Ecommerce Software Solutions For Growth in Business appeared first on Techblogger.


February 14, 2017

Revolution Analytics

Galaxy classification with deep learning and SQL Server R Services

One of the major "wow!" moments in the keynote where SQL Server 2016 was first introduced was a demo that automated the process of classifying images of galaxies in a huge database of astronomical...



Open Data vs. Web Content: Why the distinction?

For those who are unfamiliar with our line of work, the difference between open data vs. web content may be confusing. In fact, it’s even a question that doesn’t have a clear answer for those of us who are familiar with Deep Web data extraction. One of the best practices we do as a company is […] The post Open Data vs. Web Content: Why the distinction? appeared first on BrightPlanet.

Big Data University

This Week in Data Science (February 14, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I
    – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases
    – Learn how to unleash the power of R when working with relational databases in our newest free course.

Upcoming Data Science Events

The post This Week in Data Science (February 14, 2017) appeared first on BDU.


February 13, 2017

Revolution Analytics

A comparison of deep learning packages for R

Oksana Kutina and Stefan Feuerriegel from the University of Freiburg recently published an in-depth comparison of four R packages for deep learning. The packages reviewed were: MXNet: The R interface to...

Ronald van Loon

Designing the Data Management Infrastructure of Tomorrow

Today, more than ever before, organisations realise the strategic importance of data and consider it to be a corporate asset that must be managed and protected just like any other asset. Considering the strategic importance of data, an increasing number of farsighted organisations are investing in the tools, skills, and infrastructure required to capture, store, manage, and analyse data.

More organisations are now viewing data management as a holistic activity that requires enterprise-wide collaboration and coordination to share data across the organisation, extract insights, and rapidly convert them into action before opportunities are lost. However, despite the increasing investment in data management infrastructure, there are not many organisations that spend time and effort on anticipating the future events that may impact their data management practices.

From upcoming rules and regulations to the need to create better customer experiences in order to discover hidden value in customer journeys, there are a number of factors that demand a more proactive approach from organisational leaders and decision makers when it comes to the planning and design of an enterprise’s data management infrastructure.

Breaking Down the Data Silos 

When it comes to efficient data management, the biggest challenge that enterprises need to overcome is the elimination of the silos that keep big data from coalescing to its full potential. What’s more, the ever-growing number of channels that retailers and other B2C businesses use to engage with customers is also contributing to the scope of the issue. From brick-and-mortar stores to ecommerce websites, mobile applications, and social media channels, each channel generates huge volumes of data that remains unshared by an organisation’s internal systems and departments. By breaking down these data silos, organisations can optimise productivity and enable a more holistic view of the customers they serve.

The easiest way a CIO or an organisation leader can determine if their data management strategy effectively addresses the problem of data silos is by answering this simple question: Do I have access to all the data I need without having to ask other departments to integrate it?

If the answer is no, you need to start tearing down the data silos in order to combine data from multiple sources and deliver improved experiences to your customers.

 Get access to all the customer journey data you need from all required departments

Creating a Culture of Collaboration

Enterprises struggle to create a data-driven culture in order to realise the true business value of data. While there is no one-size-fits-all approach towards creating a data-driven culture as this depends on the people and the precise work environment of an organisation, there is a unique business model that can be used as an inspiration to create an effective data management strategy and plan data management infrastructure. It is called Agile.

The Agile approach is being used by renowned firms like Google, Spotify, Zappos, and Netflix. The purpose of this approach is to empower people to collaborate in multidisciplinary teams and enable them to make the right decisions quickly and effectively.

Here is an overview of how Agile works at Spotify.

At Spotify, the entire enterprise comprises four types of units: squads, chapters, tribes, and guilds. The squad, the most basic unit of the organisation, is a multidisciplinary team whose members work together to achieve a shared goal. Chapters, on the other hand, are groups of people with similar expertise across various squads. Squads that work on related areas form a tribe, while guilds are loosely formed interest groups that any employee can join.

There are two primary characteristics of the agile approach used by Spotify. First, it creates alignment among all working groups, offering them greater flexibility and autonomy. Second, the agile approach supports a culture of innovation. As a result, this agile business model is best suited for organisations that serve customers with varying needs and must enable sharing of data among different departments to deliver an improved experience to their customers.

 How agile works at Spotify. Source:

Preparing for the EU General Data Protection Regulation 

The EU General Data Protection Regulation (GDPR) came into force in May 2016, and B2C businesses operating in the EU have until May 2018 to ensure their compliance with the requirements set out by this new customer data protection regulation. With fines of up to four percent of annual revenue, organisations are left with no option but to rethink their data management strategy.

In its broadest sense, GDPR implementation will offer customers improved control over their personal data. But what does it mean for enterprises, particularly B2C companies? The implementation of GDPR will transform the way businesses manage data and run analytics projects. What’s more, the regulation also gives the power back to the ‘owner of the data’, allowing customers to determine who may store and use their data. This will be a huge headache for companies as they will serve the role of ‘data custodian’ and will be required to ensure that the data is accurate and up to date.

Considering the anticipated outcomes of GDPR implementation, one can easily deduce that ensuring compliance will require extensive changes in the data management strategies and infrastructure of organisations; however, not many organisations are taking the necessary steps to prepare for the upcoming change.

 Example of compliant data management & analytics setup of ‘tomorrow’

A whitepaper published jointly by AvePoint and the Centre for Information Policy Leadership reports that nearly 50 percent of organisations have not yet decided how to optimise their data management policies to ensure compliance with GDPR, and almost 30 percent do not have additional resources available to embrace the change.

What this all boils down to is that executives, particularly the Chief Information Officer, need to take responsibility for optimising the data management infrastructure and policies of their organisations in order to ensure compliance and avoid hefty fines. Apart from improved compliance, breaking down silos and optimising data management policies and procedures will also lead to improved customer experiences and help businesses extract more value from their customer journeys.

While this change in data management infrastructure requires extensive investment in new tools and technologies, developing the right mind-set is equally important. Businesses also need to create a data-driven culture that supports and facilitates data protection not just because it is a regulatory requirement, but also because it is the right thing to do to acknowledge and appreciate your customers’ trust in your products/services.

What is your opinion about Data Management Infrastructures of the Future? Let us know what you think.

Patrick Aarbodem is Managing Director of Adversitement. Connect with Patrick on Linkedin and Twitter to learn more about Data Management.


If you would like to read Ronald van Loon’s future posts, please click ‘Follow’ and feel free to also connect on LinkedIn and Twitter to learn more about the possibilities of Big Data.


Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in Big Data and data process management solutions. His broad interests include big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.

The post Designing the Data Management Infrastructure of Tomorrow appeared first on Ronald van Loons.


February 10, 2017

Revolution Analytics

Because it's Friday: Remembering Hans Rosling

Some sad news to share this week: Hans Rosling, the renowned statistician and pioneer in data visualization best remembered for Gapminder, died on February 7. If you're not familiar with his work, do...


Revolution Analytics

Update on R Consortium Projects

On January 31, the R Consortium presented a webinar with updates on various projects that have been funded (thanks to the R Consortium member dues) and are underway. Each project was presented by the...

Ronald van Loon

How to Boost Your Career in Big Data and Analytics

The world is increasingly digital, and this means big data is here to stay. In fact, the importance of big data and data analytics is only going to continue growing in the coming years. Moving into this field is a fantastic career move, and it could be just the type of career you have been trying to find.

Professionals who are working in this field can expect an impressive salary, with the median salary for data scientists being $116,000. Even those who are at the entry level will find high salaries, with average earnings of $92,000. As more and more companies realize the need for specialists in big data and analytics, the number of these jobs will continue to grow. Close to 80% of data scientists say there is currently a shortage of professionals working in the field.

What Type of Education Is Needed?

Most data scientists – 92% – have an advanced degree. Only eight percent have a bachelor’s degree; 44% have a master’s degree and 48% have a Ph.D. Therefore, it stands to reason that those who want to boost their career and have the best chance for a long and fruitful career with great compensation will work toward getting higher education.

Some of the most common certifications for those in the field include Certified Analytics Professional (CAP), EMC: Data Science Associate (EMCDSA), SAS Certified Predictive Modeler and Cloudera Certified Professional: Data Scientist (CCP-DS). The various certifications are for specific competencies in the field.

Now is a good time to enter the field, as many of the scientists working have only been doing so for less than four years. This is simply because the field is so new. Getting into big data and analytics now is getting in on the ground floor of a vibrant and growing area of technology.

Multiple Job Roles

Many who are working in the field today have more than one role in their job. They may act as researchers, who mine company data for information. They may also be involved with business management. Around 40% work in this capacity. Others work in creative and development roles. Being versatile and being able to take on various roles can make a person more valuable to the team.

Being willing to work in a variety of fields can help, too. While the technology field accounts for 41% of the jobs in data science currently, it is important to other areas too. This includes marketing, corporate, consulting, healthcare, financial services, government, and gaming.

Add More Skills

To become more attractive to companies, those who are in the big data and analytics fields can work to add more skills by taking additional courses. Some of the options to consider include:

  • Hadoop and MapReduce
  • Real Time Processing
  • NoSQL Databases
  • GTA Support
  • Excel
  • Data Science with R
  • Data Science with SAS
  • Data Science with Python
  • Data Visualization – Tableau
  • Machine Learning
  • Cloudlabs for R and Python

Continuing to take classes will provide you with the edge needed to become a valuable member of any team. It shows initiative and drive, and it makes you more of an asset to companies.

Keep Up With the Changes

The field of big data and analytics is not static. As technology changes and advances, so will the field. It is vital that those who want to remain in the field take the initiative to stay up to date with any changes that could affect it.

If you would like to learn more from Ronald van Loon on “How to Prepare Your Analytics Team for Digital Transformation in 2017” join the webinar.


The post How to Boost Your Career in Big Data and Analytics appeared first on Ronald van Loons.

Simplified Analytics

Digital Transformation and high-tech Robo-Advisor - do you need one?

How many times have you listened to the advice of a friend, a colleague, or someone you know to invest in the stock market? Many people have gained and lost their fortune with this guess work and now...


February 09, 2017

Silicon Valley Data Science

The ROI of a Modern Data Strategy

You finally have top management on board to get started on your company’s data transformation, but now comes the hard part. You know that you need a solid data strategy to guide your efforts, but other managers at your company would rather focus investment on building things, investigating questions in the data, you know—things that have a much more visible ROI. You know that there’s value in having the right strategy before you begin your big data work, but how do you convince the rest of your stakeholders and team on the ROI of beginning with the strategy instead?

The main output of your data strategy will be a project roadmap that shows you how to use technology to meet your business goals. This roadmap identifies the projects that have the biggest impact for your business and likely the biggest return. These projects, and the strategy itself, often also affect not only your technology strategy, but your overall business strategy and any future initiatives as well. When thinking about the specific value of a data strategy, you can start by looking at the following three components (I will walk through examples of each in detail):

  1. The ROIs of all of the projects in your data strategy roadmap.
  2. The additional “speed to value” that the strategy helps you bring to future projects.
  3. The additional value, enabled by the data strategy, to your company’s overall business strategy.

Value of the roadmap projects

A modern data strategy will identify the optimal projects and corresponding implementation order so you can get the fastest ROI. In terms of measuring ROI, one of the best places to start is by looking at decreased costs or increased revenue that result from each project. Figure 1 provides a quick list of some basic examples before we dive into a more detailed look at projects that might be on your roadmap in one form or another.

Figure 1: Components of roadmap project ROI

Let’s look at some examples of project ROI by walking through two common types of data and analytics projects that could appear on a data strategy roadmap. We’ve seen many companies use these initial projects to build out new analytics capabilities or to better understand their customer markets.

  1. Common data and analytics platform: A project where you would design and build a common technology platform to support the data storage, data processing, and analytical reporting needs of your company. This is often the foundation or starting point for many other capabilities.
  2. Micro-targeted marketing: Using analytical techniques to identify niche population groups of potential customers—like VIPs—and design marketing campaigns for them. Even if it’s not explicitly one of your business objectives, you’re probably interested one way or another in better understanding your customers.

Ideally, the new capabilities and processes from these projects will decrease costs or increase revenue. For example, the common platform can reduce duplicate hardware and data across your company, reducing your hardware and storage costs. It might even reduce the number of software licenses your company is using, if you can move to a single platform license.

A common platform will bring you many efficiencies—if all of the data is in the same place, everyone has access to up-to-date data. There’s no wasted time looking at data that’s out of date, and you’ll spend less time searching for data that may be stored in different environments (and merging it later). Think of all the time your employees currently spend integrating data or trying to sift through mounds of data files to find the right data points. It adds up pretty quickly.

What about opportunities for increased revenue? With the analytical outcomes of micro-targeted marketing, you can segment your customers into different groups and then market to each more effectively with new campaigns. The upside of this can be huge, realized through increased sales after the new campaigns are introduced. You may also be able to market high margin products to the right customers more effectively, again helping to increase revenue.

There are some common themes here. For decreased costs, when thinking about ROI, you should look at decreased time or decreased amount of resources required to complete a specific task. For increased revenue, you need to think about increased output (with the same amount of time or resources) and freeing up employees for higher value, higher margin tasks.
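To make these themes concrete, here is a minimal sketch of how you might compare roadmap projects, assuming the standard definition of ROI as (gain − investment) / investment. The `ProjectEstimate` class, its field names, and every figure in the usage note are hypothetical placeholders for illustration, not numbers from a real engagement.

```python
from dataclasses import dataclass


@dataclass
class ProjectEstimate:
    """Hypothetical estimates for one roadmap project (all values assumed)."""
    name: str
    investment: float      # up-front cost of the project
    cost_savings: float    # e.g. retired hardware, fewer software licenses
    added_revenue: float   # e.g. uplift from better-targeted campaigns


def roi(project: ProjectEstimate) -> float:
    """Return ROI as (gain - investment) / investment."""
    gain = project.cost_savings + project.added_revenue
    return (gain - project.investment) / project.investment


def rank_by_roi(projects):
    """Order projects from highest to lowest estimated ROI."""
    return sorted(projects, key=roi, reverse=True)
```

For example, a platform project with a 200,000 investment and 250,000 in savings yields an ROI of 0.25, while a marketing project with a 100,000 investment and 180,000 in added revenue yields 0.8, so `rank_by_roi` would list the marketing project first. In practice the hard part is estimating the inputs, not the arithmetic.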

Speed to business value

The projects in the roadmap aren’t the only piece of the puzzle when it comes to measuring your data strategy’s ROI. With those projects come more capabilities and efficiency gains that will help your teams complete tasks more quickly, bringing you early returns that compound over time. Going back to the consolidated data and analytics platform example, the efficiency gains from this project may speed up product development for one of your teams, helping them to release products sooner than was previously possible. Because of this early release, your company now has the opportunity to earn revenue sooner. All of the resulting revenue from the early release measures up as part of the strategy ROI (since the strategy identified the potential for the efficiencies in the first place). Perhaps even more importantly, speeding time to market may have allowed you to capture greater market share or beat a competitor’s offering.

To enable micro-targeted marketing, you could look at a framework for formalizing the development and production of consumer analytics models. Clients who develop micro-targeted marketing can now rapidly deploy and test new models, learning more about their consumers than ever before. If the client sees a change in behavior, they have the opportunity to react more quickly and reap the monetary gains.

Overall impact to business strategy

The final piece of a data strategy’s ROI is its effect on your company’s overall business strategy. The capabilities that result from your strategy roadmap projects will allow your company to perform completely new types of tasks and functions. You’ll be able to develop products, capabilities, or strategic initiatives that were nowhere near feasible in the past.

Going back to the micro-targeted marketing example, if it identifies new groups of customers, you can now create business initiatives that focus on going after those markets and adding additional revenue. New capabilities resulting from your data strategy roadmap might also help you to identify new patterns of customer behavior, or analyze each individual customer’s behavior to create truly personalized marketing campaigns. These are customer classifications you might have never even known existed before. This is vastly different from the past, where you were forced to analyze each customer as part of a larger market segment or demographic.

Simply having more of these new capabilities at your disposal will allow you to tackle more ambitious problems. Previously in the healthcare and biotech spaces, computations to study different drugs or genetics were extremely time consuming and expensive. With recent advances, such companies can tackle tasks that were simply too cost prohibitive before. Technologies like Apache Spark are vastly reducing the time (and thus cost) it takes to complete certain genetic calculations, as mentioned in David Patterson’s talk on using Spark for genetics processing. With advances like this, capabilities and tasks that were nearly impossible before are now attainable realities. A data strategy can help an R&D company identify these opportunities and then build a plan to realize them, resulting in many new opportunities for the client.

How do you identify the ROI from examples like this in your data strategy? Think about the outcome of the new business strategy initiatives. For the initiatives that go after new markets, one way to look at the ROI is the new revenue from those market segments. For the initiatives that create more personalized customer experiences, think of the revenue resulting from the uplift in customer loyalty. For R&D companies, the ROI can be the cost savings in decreased R&D time and decreased product development time and costs.


A data strategy will help you identify the optimal projects to tackle: those that will have the highest return to your business and its overall strategy. Looking at the resulting projects, along with accelerated business value and effects on the company’s overall business strategy, will give you a comprehensive view of the specific value a data strategy will bring. Knowing this, you can begin to look at a data strategy not as a project that delivers abstract, unquantifiable results, but as an impetus for identifying key sources of ROI for your business. To learn more about creating a data strategy, download our related position paper.

The post The ROI of a Modern Data Strategy appeared first on Silicon Valley Data Science.

Revolution Analytics

ModernDive: A free introduction to statistics and data science with R

If you're thinking about teaching a course on statistics and data science using R, Chester Ismay and Albert Kim have created an online, open-source textbook for just that purpose. ModernDive is a...


Infographic: How to create a Data-Driven Customer Loyalty Strategy

Here's a great visual overview of what you need to get started with a data-driven customer loyalty programme: the questions to ask before getting started and an overview of all the possible data sources to consider.


February 08, 2017

Teradata ANZ

The IoT: Redefining industry, again.

The Internet of Things (IoT) is much more than fitness tracking, AI toothbrushes and targeted discounts when you walk by a coffee shop. Put simply: the opportunities arising from ubiquitous connectivity and access to data will bring about the end of many of today’s industries as we know them.

For some, this new IoT-enabled world might mean a slow and painful end as they fight to remain relevant. But many others will grasp the opportunity. They’ll innovate, reinvent, redefine. And succeed. This is not speculation. It’s happening today. Let’s consider some examples from travel & transportation, as we watch the development of the nascent Mobility Industry.

Ford’s 2017 Super Bowl advert presents them not as a car company, but a mobility business. It’s no longer about engine performance to get your heart racing or TV screens in the headrests to keep the kids quiet. Now, Ford can get you where you need to be, when you need to be there and in the way that suits you best. It’s not about features and functions. It’s about removing obstacles. And serving your needs.

While the Ford example is mainly about business-to-customer, the story is little different in the business-to-business world. Today, Siemens Mobility don’t present themselves as a personal mobility provider in quite the same way as Ford. But they do present themselves as the intelligent infrastructure provider that can guarantee trains run on time; city traffic flows efficiently; and metro services become more flexible to meet changing passenger needs. Siemens aren’t selling rolling stock any more. They’re selling mobility. Hmm…not so different to Ford after all.

It is true that some will see this redefinition of today’s industries as a bad thing. Especially those that can’t adapt quickly enough to ride the wave. But remember, disruption in industry is nothing new. In manufacturing, The Industrial Revolution may have started it all (at least from a European perspective) but today, we’re already on the way to Industry 4.0. Change is inevitable. And it is constant.

At the top of this blog, I cited the IoT as the key enabler for this massive disruption in how 20th Century industries will serve their customers in the 21st. And that’s true. Things that can connect to each other and exchange information are critical to all the services I’ve discussed so far. Without all those remote devices being able to communicate with each other and with the mothership, everything kinda falls apart. But it’s also clear that just communicating is not enough.

David Socha_IoT

How can Ford create a seamless travel experience for you, from home to office to social event to leisure weekend to…wherever? Just having a fleet of connected cars, buses, bicycles and…eh… gyrocopters isn’t going to cut it. Though the gyrocopters might be quite cool. I might bring that up with them. Anyway…Ford also needs to know about you and your needs; your preferences; your habits; your budget and so much more. They need to know where their fleets are; their schedule of availability; their state of repair and time to next maintenance. And that’s just on the operations side of the business.

Thinking more strategically, Ford, Siemens and others also need to learn from the equipment that’s out there so that they can remain competitive, relevant and profitable next year and next decade too. They must understand which design features have been successful and improve upon them in the next product release. They need to learn what mix of their services works in particular market conditions and predict what those conditions might be in the future. And most importantly, they need to be prepared for further disruption in what is very obviously an immature industry.

To deliver on all these promises needs more than the Internet of Things. It needs the Analytics of Things. Data and analytics are fundamental to everything that a modern mobility provider must achieve. From knowing where a customer is likely to be, who they are with and what services they are likely to call upon, to predicting the remaining useful lifespan of a braking system and scheduling repairs at a time that best suits all parties, providers must be able to analyse all their data, at a speed and cost that meets a very broad range of requirements.

Sometimes, this will mean making real-time, data-driven operational decisions. But not always. At other times, it will mean storing, then analysing massive data sets over significant time periods to identify strategic indicators that lead to policy changes. It will also mean exchanging data – and insights – efficiently with other parties. That might be to influence a connected supply chain, based on new analysis of equipment reliability. Or it might be to provide analysis to Governments and Local Authorities with interests as diverse as city congestion and, say… national security.

It’s clear: getting the data and analytics right is at least as important to the future success of companies like Ford and Siemens as any other aspect of their business. It’s not just about new gadgets. It’s not just about marketing. It’s not even just about fundamentally redefining operating models. And it’s certainly not just about the Internet of Things.

21st Century change is data driven. Embrace it, or fail.

The post The IoT: Redefining industry, again. appeared first on International Blog.

Silicon Valley Data Science

TensorFlow Image Recognition on a Raspberry Pi

Editor’s note: This post is part of our Trainspotting series, a deep dive into the visual and audio detection components of our Caltrain project. You can find the introduction to the series here.

SVDS has previously used real-time, publicly available data to improve Caltrain arrival predictions. However, the station-arrival time data from Caltrain was not reliable enough to make accurate predictions. Using a Raspberry PiCamera and USB microphone, we were able to detect trains, their speed, and their direction. When we set up a new Raspberry Pi in our Mountain View office, we ran into a big problem: the Pi was not only detecting Caltrains (true positive), but also detecting Union Pacific freight trains and the VTA light rail (false positive). In order to reliably detect Caltrain delays, we would have to reliably classify the different trains.

Traditional contextual image classification techniques would not suffice, as the Raspberry Pis were placed throughout the Caltrain system at different distances, heights, and orientations from the train tracks. We were also working on a short deadline, and did not have enough time to manually select patterns and features for every Raspberry Pi in our system.

TensorFlow to the rescue

2016 was a good year to encounter this image classification problem, as several deep learning image recognition technologies had just been open sourced to the public. We chose to use Google’s TensorFlow convolutional neural networks because of its handy Python libraries and ample online documentation. I had read TensorFlow for Poets by Pete Warden, which walked through how to create a custom image classifier on top of the high performing Inception V3 model. Moreover, I could use my laptop to train an augmented version of this new model overnight. It was useful to not need expensive GPU hardware, and to know that I could fine tune the model in the future.

I started with the Flowers tutorial on the TensorFlow tutorials page. I used the command line interface to classify images in the dataset, as well as custom images like Van Gogh’s Vase With Twelve Sunflowers.

Now that I had experience creating an image classifier using TensorFlow, I wanted to create a robust, unbiased image recognition model for trains. While I could have used previous images captured by our Raspberry Pis, I decided to train on a larger, more varied dataset. I also included cars and trucks, as these could also pass by the Raspberry Pi detectors at some locations. To get a training data set, I utilized Google Images to find 1000 images for the Vehicle classifier:

  • Caltrains
  • Freight Trains
  • Light Rail
  • Trucks
  • Cars

Testing and deploying the model

After letting the model train overnight, I returned to my desk the next morning to see how the model performed. I first tested against images not included in the training set, and was surprised to see that the classifier always seemed to pick the correct category. This included images withheld from the training set that were obtained from Google Images, and also included images taken from the Raspberry Pi.
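
A held-out check like the one described above can be made measurable by computing accuracy per category. The following is a minimal sketch, not the evaluation code used for this post: the function name `per_class_accuracy` is hypothetical, and the labels in the usage note are made-up illustrations rather than results from the Vehicle classifier.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of held-out images of that label classified correctly}."""
    correct = Counter()
    total = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    # Only labels that actually occur in the held-out set get an entry.
    return {label: correct[label] / total[label] for label in total}
```

For example, calling it with the true labels `["caltrain", "caltrain", "freight"]` and the predictions `["caltrain", "freight", "freight"]` reports how often each category was recognized, which makes a weak category visible even when overall accuracy looks high.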

I performed the image classification on the Raspberry Pi to keep the devices affordable. Additionally, as I couldn’t guarantee a speedy internet connection, I needed to perform the classification on the device to avoid delays in sending images to a central server.

The Raspberry Pi 3 has enough horsepower to do on-device stream processing so that we could send smaller, processed data streams over internet connections, and the parts are cheap. The total cost of the hardware for this sensor is $130, and the code relies only on open source libraries. I used JupyterHub for the testing, allowing me to control Raspberry Pis in multiple locations. With a working Vehicle classifier set, I next loaded the model onto a Raspberry Pi and implemented it in the audiovisual streaming architecture.

In order to compile TensorFlow on the Raspberry Pi’s 32-bit ARM chip, I followed directions from Sam Abraham’s small community of Pi-TensorFlow enthusiasts, and chatted with Pete Warden and the TensorFlow team at Google.

Troubleshooting TensorFlow on the Raspberry Pi

While it is well documented how to install TensorFlow on Android or other small computing devices, most existing examples are for single images or batch processes, not for streaming image recognition use cases. Single images could be easily and robustly scored on the Pi, as a successful classification shows below.

However, loading the 85 MB model from disk for every image took too long, so I needed to load the classifier graph into memory once and reuse it. With the graph held in memory, and the Raspberry Pi having a total of 1 GB of memory, enough computational resources remain to continuously run a camera and microphone in our custom train detection Python application.

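import tensorflow as tf

# modelFullPath: path to the retrained model file, defined elsewhere
def create_and_persist_graph():
    with tf.Session() as persisted_sess:
        # Read the serialized graph definition from disk
        with tf.gfile.FastGFile(modelFullPath, 'rb') as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
            # Import it into the default graph so it stays in memory
            tf.import_graph_def(graph_def, name='')
        return persisted_sess.graph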

That said, it was not feasible to analyze every image captured from the PiCamera using TensorFlow, because the Raspberry Pi overheated when 100% of the CPU was being utilized. In the end, only images of moving objects were fed to the image classification pipeline on the Pi, and TensorFlow was used to reliably discern between different types of vehicles.
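
One common way to pass only frames with movement to a classifier is frame differencing. The post does not say how motion was detected, so the following is a sketch under that assumption: the function names and the threshold of 10.0 are hypothetical, and frames are modeled as nested lists of grayscale pixel values.

```python
def mean_abs_diff(prev, curr):
    """Mean absolute per-pixel difference between two equal-sized grayscale frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def frames_with_motion(frames, threshold=10.0):
    """Yield (index, frame) for frames that differ enough from the previous frame."""
    prev = None
    for index, frame in enumerate(frames):
        if prev is not None and mean_abs_diff(prev, frame) > threshold:
            yield index, frame
        prev = frame
```

Only the frames yielded by `frames_with_motion` would then be handed to the classifier, so the CPU stays mostly idle while the scene is still.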


If you are interested in classifying images in realtime using an IoT device, this is what you will need to start:

In the next post in this series, we’ll look at connecting an IoT device to the cloud. Keep an eye out, and let us know in the comments if you have any questions or if there’s something in particular you’d like to learn more about. To stay in touch, sign up for our newsletter.

The post TensorFlow Image Recognition on a Raspberry Pi appeared first on Silicon Valley Data Science.

Revolution Analytics

Retail customer analytics with SQL Server R Services

In the hyper-competitive retail industry, intelligence about your customers is key. You need to be able to find the right customers, understand what types of customers you have, and know how to keep...


February 07, 2017


Tyson Johnson: Selling Deep Web Data-as-a-Service in 2017

Tyson Johnson, our Vice President of Business Development, begins his mornings reading news sources and blogs to keep his eye on data industry trends. He then catches up on our new and potential leads and reviews everything from existing projects to potential engagements. His daily goal is straight-forward: develop Deep Web Data-as-a-Service strategies to make sure […] The post Tyson Johnson: Selling Deep Web Data-as-a-Service in 2017 appeared first on BrightPlanet.

Big Data University

This Week in Data Science (February 7, 2017)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Featured Courses From BDU

  • Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
  • Predictive Modeling Fundamentals I
    – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
  • Using R with Databases
    – Learn how to unleash the power of R when working with relational databases in our newest free course.

Upcoming Data Science Events

Cool Data Science Videos

The post This Week in Data Science (February 7, 2017) appeared first on BDU.

Mario Meir-Huber

SEO Company

SEO Company in New York City

Do you reside in New York City, and are you searching for professional SEO experts that convert? There are many reasons to seek the service of a search engine optimization specialist in NYC. If you are looking for an SEO company in NYC, read through the rest of this post for some helpful information.

Rank Higher On Search Engines:

One of the benefits of hiring NYC SEO experts is help ranking your website high on search engines or other marketplaces. The main purpose of SEO is to rank a website, product or service high on search engines. With New York City specialists, your website will rank quickly on the first page of Google because these experts understand the best techniques to use.

Reputation Management:

It is important to know that people will not do business with you unless they trust your product, brand or service. New York City search engine optimization companies are experts in helping to boost your reputation quickly. When this occurs, you will discover that both veteran customers and prospective visitors place their trust in your brand or online business.

Quick Visibility:

To successfully run an online business, your brand or product needs to be visible to prospective visitors. One of the most effective ways to make your product or presence visible online is by hiring an SEO company in NYC. These companies understand the latest trends that can help put your website, brand or product in front of prospective visitors.


Traffic is the lifeblood of any business. Without traffic, no business will be able to thrive or last for a long time. New York City SEO experts know the right techniques that can help bring targeted traffic to your website in a short time. If you reside in NYC and want your online business to gain traffic quickly, then contact an SEO company.

High Conversion Rates:

If your website conversion rate is low, it implies that visitors leave soon after they arrive. When your website conversion rate is low, the ROI will also be reduced or may drop to nothing. NYC search engine optimization companies are the experts to contact if you want to bolster your business conversion rate. This will also help to increase the return on investment of your online business accordingly.


Apart from experiencing increased ROI, NYC SEO experts will help your online business to stand out from the crowd. You can contact a professional search engine optimization company in New York City today for more information.

The post SEO Company appeared first on Techblogger.


February 06, 2017

Revolution Analytics

In case you missed it: January 2017 roundup

In case you missed them, here are some articles from January of particular interest to R users. The Data Science Virtual Machine on Azure has been updated with the latest Microsoft R Server, and adds...

Ronald van Loon

Corporate Self Service Analytics: 4 Questions You Should Ask Yourself Before You Start

Today’s customers are socially driven and more value conscious than ever before. Believe it or not, everyday customer interactions create a whopping 2.5 exabytes of data, which is equal to 2,500,000 terabytes, and this figure has been predicted to grow by 40 percent with every passing year. As organisations face the mounting challenges of coping with the surge in the amount of data and number of customer interactions, it has become extremely difficult to manage the huge quantities of information whilst providing a satisfying customer experience. It is imperative for businesses and corporations to create a customer-centric experience by adopting a data-driven approach, based on predictive analytics.
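
The figures above can be checked with a little arithmetic. Using decimal (SI) units, where one exabyte is 1,000,000 terabytes, 2.5 exabytes works out to 2,500,000 terabytes. The sketch below is illustrative only: the function names are hypothetical, and the projection simply assumes that the 40 percent annual growth rate compounds unchanged.

```python
EXABYTE = 10**18   # bytes, decimal (SI) definition
TERABYTE = 10**12  # bytes, decimal (SI) definition


def exabytes_to_terabytes(exabytes):
    """Convert exabytes to terabytes using decimal units."""
    return exabytes * EXABYTE / TERABYTE


def projected_volume(start, annual_growth, years):
    """Volume after compounding a fixed annual growth rate for a number of years."""
    return start * (1 + annual_growth) ** years


print(exabytes_to_terabytes(2.5))  # 2500000.0
```

Under the same assumption, `projected_volume(2.5, 0.40, 2)` gives roughly 4.9 exabytes, so the volume nearly doubles in two years.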

Integrating an advanced self-service analytics (SSA) environment to strengthen your analytics and data handling strategy can prove beneficial for your business, regardless of the type and size of your enterprise. A corporate SSA environment can assist in dramatically improving your operations capabilities, as it provides an in-depth understanding of consumer data. This, in turn, helps your workforce take a more responsive, nimble approach to analyzing data, and fosters fact-based decision making rather than decisions based on predictions and guesswork. Self-service analytics offers a wealth of intelligence and insights into how to make sense out of data and build more intimate relationships for better customer experience.

Why Businesses Need Self Service Analytics

With the rising cost of effectively managing Big Data a growing cause of concern, businesses need a platform that can help them scale without breaking the bank. In addition, data security is a major concern. Most businesses lack the talent and knowledge regarding different business intelligence and analytics (BI&A) models, and often end up choosing one that does not fit the size and operations of their business. This results in inaccurate data insights, leading to IT bottlenecks, disconnected analytics experiences, security and governance risks, and additional expenses.

What businesses need is a comprehensive IT solution offering a broader range of data sources and self-service analytics capabilities. In addition, the analytics platform must be uncomplicated and easy-to-use, while at the same time it should be able to meticulously handle complex analytics functions.

To ensure that the self-service analytics platform you are considering is the right one for your business, you need to ask yourself these four questions before you start:


1. How do I Select the Right BI&A Architecture for My Business?

You need to choose a platform that offers deeper insights, complete autonomy, and trustworthy analytics to help your workforce develop a better understanding of data and extract crucial information, whilst reducing the amount of work and costs. To select the right BI&A architecture for your business, you need to determine the relative importance of these three attributes:

  1. Insight: Advanced, agile BI&A platforms offer quick insights and analytics in different areas of your organisation. They allow you to improve your performance by offering innovative solutions. In addition, they accurately identify data patterns and present them in an easy-to-understand way, enabling businesses to make decisions based on solid facts and with more confidence. These insights enable businesses to predict and test potential outcomes, greatly reducing the risk of failure and loss.
  2. Autonomy: Analytics should be more widespread and easily accessible at different levels of your organisation. This will allow you to explore critical information and devise insights with the help of self-service data discovery and data prep tools. Doing so will allow you to promote an internal, information-driven culture, making your business more responsive, assertive, and nimble, while the decisions will be more fact-based.
  3. Analytics Trust: The analytics platform should be capable of providing trustworthy, reliable, consistent insights. Businesses need to keep in mind that transitioning to an advanced BI&A platform shouldn’t come at the expense of accurate, trustworthy insights and information. In any case, ensuring the credibility of the analytics platform’s outputs is essential before you go for an organisation-wide implementation.

2. How Do I Choose the Right Analytics Platform?

There are a few things you need to keep in mind for choosing the right analytics platform:

  • Approach: Based on the type and magnitude of your business operations, you must decide whether you should keep your data on premise, host services in the public cloud, or opt for a hybrid approach.
  • Cost: Another important aspect to consider is that your BI&A platform must be capable of catering to the needs of multiple users without incurring additional costs related to customization. The platform should natively support data prep and migration. Moreover, the expenses should only cover the costs of what you use.
  • Scalability: Make sure you evaluate the capability of the analytics platform to support any number of users, ranging from a few hundred to thousands. Enterprise-level businesses require a complete set of features to fulfil their different business intelligence needs.

3. How Can I be Sure that My Data is Secure?

Most organizations face problems in coping with two key needs: IT needs for ensuring secure operations and business user needs where they have to interact in real-time with their own data. Businesses shouldn’t let BI restrict their functionality; they need to figure out ways to bridge the gap between legacy BI systems and desktop tools. One practical way is to implement a single complete BI&A platform. This will ensure that all your business users and data are centralized in a managed and self-service secure environment.

4. What Operations Capabilities Are Recommended?

This is probably the most important question to understand clearly. To ensure the successful implementation of your BI&A initiative, the platform must be easy to use while capable of handling complex analysis and generating accurate results in a simplified manner. It is important that your workforce, even without formal knowledge or a technical background, is able to use the BI&A platform, which will save the time and energy spent regularly engaging tech support for trivial issues.

When dealing with complex combinations of data, your BI&A platform should apply a range of analytics techniques and come up with better, more impactful insights. Broader sharing of data insights and quick responses to user queries for data will make it relatively easy to achieve business benefits. Moreover, it should offer high product support, top-notch product quality, and ease of upgrade and migration.

The breadth of analytical computations, along with the number of data sources and volume of data, is growing at an exceptional pace. Businesses and enterprises require flexibility in order to manage the analytical life cycle, from beginning to the implementation of huge numbers of existing and new analytical models that address industry-specific and functional issues of your business in a scalable, secure manner. For this, data scientists need SSA environments instead of simple BI solutions to conduct predictive analytics in an effective manner.


For more insights and information on Self Service Analytics, subscribe to an informative webinar with Ronald van Loon and Ian MacDonald, hosted by Brighttalk.






Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman, and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company, a leader in Big Data and data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Corporate Self Service Analytics: 4 Questions You Should Ask Yourself Before You Start appeared first on Ronald van Loons.


February 05, 2017

Mario Meir-Huber

ZGAMES – Multi Platform Developers

A Closer Look At zGames Multi Platform Developers

Are you trying to find the time to market your next game? Then you should consider zGames’ multi-platform developers. zGames provides high-quality, compelling exposure when creating three-dimensional games.
It is worth noting that all of their games are powered by one of the best and most productive developer ecosystems.
All zGames titles are packed with compelling three-dimensional graphics. The game mechanics are also powerful and, for this reason, perform well on most global platforms.
Background Of Developing Various Games
zGames has a game development team united by a high level of experience. Each field expert has approximately three years of experience as a Unity 3D developer and five years of experience in game development and design.
They also have highly experienced graphic and game designers, sound engineers, and three-dimensional modelers. The game development crew at zGames are enthusiastic and capable individuals who are able to develop a gripping and highly engaging three-dimensional game.
The team is also very strong and has the skills described below.
– The team has a lot of experience in programming the physics of the game.
– They also have a lot of skills in the implementation of some game mechanics that are quite complex.
– The team also has some technology background when it comes to creating some special effects.
– The team has a high level of knowledge in developing complex features such as Facebook integration, virtual reality, and gambling, so these shouldn’t pose any problem.
– The team is also experienced in the integration of ad frameworks and in the development of various in-app purchase models.
zGames is always ready to handle any of your projects, regardless of how complex they may seem. They’re able to conceptualize your project, release it, and submit it to any app store of your choice.
The crew at zGames is able to take on any work that you give them. If you’re looking to launch a particular gaming app, the web development team has approximately 6 to 12 years of experience. This means that they’re able to work with it, boost the power of the app, and ensure that it has a highly advanced web back end.

Make sure that you visit zGames so that you can get high quality service from highly experienced game developers.

The post ZGAMES – Multi Platform Developers appeared first on Techblogger.


February 04, 2017

Simplified Analytics

All You Need To Know About Business Models in Digital Transformation

In very simple terms, a business model is how you plan to make money from your business. A refined version is how you create and deliver value to customers. Your strategy tells you where you want...


February 03, 2017

Revolution Analytics

Because it's Friday: Jump!

In this modern social-media world of quick takes, 7-second animated GIFs and 30-second viral videos, it's unusual to find a longer video that holds your attention. Well, that's the case for me anyway,...


Revolution Analytics

Superheat: supercharged heatmaps for R

The heatmap is a useful graphical tool in any data scientist's arsenal. It's a useful way of representing data that naturally aligns to numeric data in a 2-dimensional grid, where the value of each...


Forrester Blogs

The Data Economy Is Going To Be Huge. Believe Me.

Are they serious? I've just finished reading the recent Communication on Building a European Data Economy published by the European Commission. And, it's a good thing they're seeking advice....


February 02, 2017

Revolution Analytics

fst: Fast serialization of R data frames

If you want to get data out of R and into another application or system, simply copying the data as it resides in memory generally isn't an option. Instead you have to serialize the data (into a...