 

Planet Big Data logo

Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

November 21, 2017


Forrester Blogs

Ready To Get Serious About Customer Obsession? Start By Benchmarking Against Peers

Firms everywhere are struggling to translate a ‘customer first’ vision into an actionable customer obsessed strategy. Local firms are no exception. As the marketing director for an Australian...

...
Silicon Valley Data Science

Analyzing Sentiment in Caltrain Tweets

Many of us here at SVDS rely on Caltrain, Silicon Valley’s commuter rail line, to commute to and from the office every day. As part of an ongoing R&D effort, we have been collecting and analyzing various sources of Caltrain-related data with the goal of determining where each train is, and how far behind schedule it’s currently running. In addition to video cameras and microphones, we use social networking platforms as sources of signal to detect service disruptions or delays as they’re reported on by the affected passengers.

Twitter is a popular place for people to vent their frustrations or update their fellow passengers on the current state of public transportation, and as such it’s a potentially valuable source of real-time data on Caltrain service. As a first step to using Twitter activity as one of the data sources for train prediction, we start with a simple question: How do Twitter users currently feel about Caltrain?

This type of problem is well-studied in natural language processing (NLP) and machine learning, and can be cast as a classification task: given the content of a document (e.g., a tweet), classify its overall sentiment as either positive or negative (aka polarity classification); we give a few examples of this at the end of the post.

The applications of sentiment analysis range from monitoring customer reactions to products and services, to providing a signal to predict future purchase behavior. In this post, we outline the different types of sentiment analysis, explore typical approaches, and walk you through an introductory example of training and evaluating a classifier using Python’s scikit-learn.

Introduction to Sentiment Analysis

The earliest sentiment analysis research dates to 2003 (Nasukawa and Yi), but the field has its roots in academic work on subjectivity analysis as early as the 1990s. Its growth in more recent years parallels the growth of social media and other forms of user-generated content. Sentiment analysis uses computational techniques to study people’s opinions, emotions, and attitudes as expressed in their language, including emojis and emoticons.

Like the broader field of natural language processing (NLP), sentiment analysis is challenging due to many factors, such as the ambiguity of language and the need for world knowledge to understand the context of many statements. Sentiment analysis adds the further challenge that two people can read the same text and differ in their interpretation of its sentiment. In this post we face yet another challenge: the short snippets of text, slang, hashtags, and emoticons that characterize tweets.

We focus here on polarity classification at the tweet level, but note that there are other levels of granularity that could be studied:

  • At the least granular, document-level sentiment classification evaluates the polarity of entire documents such as a product review or news article.
  • The next level is to decide whether a given sentence within a document is positive, negative, or neutral. In addition to polarity classification, other types of classification include assigning the number of “stars” to the text and assigning emotions such as fear or anger.
  • Finally, aspect-based sentiment analysis (also called targeted or granular sentiment analysis) reveals deeper information about the content of the likes or dislikes expressed in text. For example, “The pizza was delicious but the service was shoddy” includes positive sentiment about pizza but negative sentiment about the service.

Approaches to sentiment analysis include those from NLP, statistics, and machine learning (ML). We focus here on a mix of NLP and ML approaches. There are two main aspects to this: feature engineering and model creation. We discuss the latter further below. Feature engineering in NLP is always tricky, and this is the case with social media-based sentiment analysis as well. Our example will skim the surface with bag-of-words (BOW), but other possibilities include TF-IDF, negation handling, n-grams, part-of-speech tagging, and the use of sentiment lexicons. More recently, there has been an explosion of approaches using deep learning and related techniques such as word2vec and doc2vec.

Finally, though we don’t touch on neural network techniques such as word2vec in this post, we note that they have made an impact in sentiment analysis as with other parts of NLP.

Gathering the Training Data

In order to train a classification model, we first need a large set of tweets that have been pre-labeled with their sentiment. Although it’s possible to generate such a dataset manually, we’ll save time and effort by using the Twitter US Airline Sentiment dataset, which consists of around 15,000 tweets about six major US airlines, hand-labeled as expressing either positive, negative, or neutral sentiment. Even though this dataset is about air travel, its tweets often express ideas relevant to Caltrain—complaints about delays, status updates, and (sometimes) expressions of gratitude for a smooth trip.

Once we’ve downloaded the dataset, reading it in and checking out the data is straightforward:

import pandas as pd

# read the data
tweets = pd.read_csv('../data/Tweets.csv.zip')
# show the first few rows
tweets[['tweet_id','airline_sentiment','text']].head()

A table showing tweet_id, airline_sentiment, and text of the tweet

Before training a classifier in any domain, it’s important to understand the distribution of classes present in the training data. The total height of each bar in the figure below represents the count of tweets in each class. Unsurprisingly for this domain, most of the tweets are negative. Because we want to keep this introductory post simple, we will balance the classes. To read more about considerations for imbalanced data, check out this post. To address the class imbalance, we randomly sample 3,000 tweets from the “negative” class, and use this balanced set in the subsequent analysis. The blue region in the figure represents the distribution of classes after sampling.

Number of Tweets vs. Sentiment
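
The snippets that follow use a tweets_balanced data frame, which the original code builds before vectorizing. A minimal sketch (not the authors' exact code) of one way to construct it, assuming the column names shown in the table above; the random_state values are arbitrary:

# Sketch: keep all positive and neutral tweets, randomly sample 3,000 of the
# negative ones, then shuffle the rows.
negative_sample = tweets[tweets['airline_sentiment'] == 'negative'].sample(n=3000, random_state=42)
others = tweets[tweets['airline_sentiment'] != 'negative']
tweets_balanced = pd.concat([negative_sample, others]).sample(frac=1, random_state=42)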

Generating Feature Representations

Classification algorithms generally require their input to be in the form of a fixed-length feature vector. Thus, our first task is to ‘vectorize’ each tweet in a way that preserves as much of the meaning of the text as possible. As mentioned in the introduction, there are a number of ways of generating such feature vectors, and we will keep things simple for this introductory post.

For our purposes, we’ll start with a simple yet effective bag-of-words approach: represent each tweet as an N-dimensional vector, where N is the number of unique words in our vocabulary. In the simplest version of this scheme, the vector is binary: each component is 1 if that specific word is present in the tweet, and 0 otherwise. As discussed in Chapter 6 of Jurafsky and Martin (2017), binary vectors typically outperform counts for sentiment analysis. Scikit-learn’s CountVectorizer is a handy tool for generating this representation from an input set of strings:

from sklearn.feature_extraction.text import CountVectorizer

# extract the training data and sentiment labels
txt_train, sentiment = tweets_balanced['text'], tweets_balanced['airline_sentiment']
# instantiate the count vectorizer
cv = CountVectorizer(binary=True)
# transform text
txt_train_bow = cv.fit_transform(txt_train)

Training a Classifier

With our feature vectors in hand, we now have everything we need to train a classifier. But which one? The Naive Bayes classifier is a good starting point: not only is it conceptually simple and quick to train on a large data set, but it has also been shown to perform particularly well on text classification tasks like these.1 This model assumes that each word in a tweet independently contributes to the probability that the tweet belongs to a particular class c (i.e. ‘negative’, ‘neutral’, or ‘positive’). In particular, the conditional probability of a tweet d belonging to class c is modeled as:

P(c | d) ∝ P(c) · ∏_{k=1}^{n_d} P(t_k | c)

where P(c) is the overall prior probability of the class, t_k is the kth word in the tweet, and n_d is the number of words in the tweet. To train the model, the probabilities are computed by simply counting the relative frequencies of words and tweets in the training set, which is fast even on large training sets. Below, we instantiate scikit-learn’s out-of-the-box implementation of this model:

from sklearn.naive_bayes import MultinomialNB

# our naive bayes model
nb_sent_multi = MultinomialNB()

Evaluation

In order to evaluate the performance of this model on new tweets, we’ll look at two diagnostics, both computed using a five-fold cross validation procedure. It’s beyond the scope of this post to discuss cross validation; if you’re interested, the scikit-learn site has a basic overview. The first is a confusion matrix, shown below. The columns in the matrix represent the true classes, the rows represent the classes predicted by the model, and each entry is the number of tweets in the test set(s) that correspond to those class assignments.

# Run confusion matrix on transformed text and cross-validation predictions
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

sentiment_predict = cross_val_predict(nb_sent_multi, txt_train_bow, sentiment, cv=5)
confusion_mat = confusion_matrix(sentiment, sentiment_predict)

Confusion matrix

The first thing to notice from the confusion matrix is that the counts are mostly concentrated along the diagonal, indicating that the model is doing a reasonable job of assigning the true sentiment to tweets. Nevertheless, a sizeable portion are incorrectly classified in one way or another. In particular, the model falsely classifies a large portion of neutral tweets as positive (bottom row, middle column).

As a second diagnostic we generate a set of ROC curves, shown in the figure above. As before, a thorough explanation of ROC analysis is beyond the scope of this post (see Ch 8 in Data Science for Business for details). Panels A, B, and C show the performance of the classifier for each class in a one-vs-all fashion, indicating the model’s ability to distinguish that class from all others. Panel D overlays the mean ROC curves for each class. In comparing performance between classes, notice that the model exhibits a lower AUC value for the neutral class, compared with the others. This agrees with the observations from the confusion matrix—the model struggles to distinguish neutral tweets from the rest.
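
For reference, here is a rough sketch (not the plotting code used for the figure) of how per-class, one-vs-all ROC curves and AUC values can be computed from cross-validated probability estimates, assuming the objects defined in the snippets above:

# Per-class (one-vs-all) ROC curves from 5-fold cross-validated probabilities.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = ['negative', 'neutral', 'positive']   # scikit-learn orders string labels alphabetically
proba = cross_val_predict(nb_sent_multi, txt_train_bow, sentiment,
                          cv=5, method='predict_proba')
y_bin = label_binarize(sentiment, classes=classes)
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    print(cls, 'AUC =', auc(fpr, tpr))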

Considering the structure of our model, the observations above may not be too surprising: the distribution of words in neutral tweets may not distinguish it well from the other classes. For instance, a purely informational (neutral) tweet about a delay in service may share many of the same words as a (negative) complaint about someone’s delayed flight.
As a final qualitative evaluation of our model’s performance in the Caltrain domain, we used it to classify some recent Caltrain-related tweets.

Below are some tweets it classified as positive:

…some tweets it classified as negative:

…and some tweets it classified as neutral:
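
For completeness, a minimal sketch of how the trained objects above can be used to score new tweets; the example strings below are invented, not the actual tweets shown here:

# Fit on the full balanced set, then classify new (hypothetical) tweets.
nb_sent_multi.fit(txt_train_bow, sentiment)
new_tweets = ["Caltrain running 20 minutes late again",
              "Smooth ride into the city this morning, thanks Caltrain"]
new_bow = cv.transform(new_tweets)          # reuse the fitted vectorizer
print(nb_sent_multi.predict(new_bow))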

This is a reasonable baseline model, but there are several avenues for improvement. For example, as mentioned in the introduction, we could enrich the feature representation. Examining the types of errors the model makes can help us prioritize those efforts.

Conclusion and Resources

Social media is a commonly consulted resource for commuters, and Caltrain riders are no different. As we’ve seen in this post, this means that a platform like Twitter can act as a valuable source of signal for a sentiment classifier. In this context, a properly deployed model might help keep riders informed, and may even help Caltrain operations understand passenger reactions to delays.

Introductions to sentiment analysis include:

Aggarwal, C. C. and Zhai, C. “A Survey of Text Classification Algorithms.” In Aggarwal, Charu C. and Zhai, ChengXiang (Eds.), Mining Text Data, pp. 163–222. Springer: 2012.

The post Analyzing Sentiment in Caltrain Tweets appeared first on Silicon Valley Data Science.


Forrester Blogs

AI Will Change Organizations From Within

Over the past year I’ve spoken formally and informally with hundreds of companies about their AI initiatives. The biggest AH-HA moment comes when these companies realize the difference between...

...

BrightPlanet

Monitor the Future with Open Source Intelligence Tools Using Example Data from the Auto Industry

In recent years, BrightPlanet has worked with AMPLYFI to harness leading Artificial Intelligence (AI) through web harvest activities that interpret the wealth of untapped open source content within the Internet. On October 4, 2017, BrightPlanet’s Vice President of Development, Will Bushee, and Mark Woods, CTO of AMPLYFI, held a webinar to demonstrate how AMPLYFI’s leading […] The post Monitor the Future with Open Source Intelligence Tools Using Example Data from the Auto Industry...

Read more »

Forrester Blogs

Private Cloud Platforms And Hosting Grow Faster Than Expected

In the past year, we’ve seen an interesting change in the private cloud space: Cloud platforms and private cloud hosting are gaining share at the expense of private cloud software suites. In...

...
 

November 20, 2017


Revolution Analytics

How to make Python easier for the R user: revoscalepy

by Siddarth Ramesh, Data Scientist, Microsoft I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge...

...

Revolution Analytics

R charts in a Tweet

Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet...

...

Forrester Blogs

Royal Caribbean and EY Embark On A Digital Transformation To Put Cruising Back On The Grid

All companies face a new digital reality, where the accelerated pace of change today is the slowest it’s ever going to be. Partnering with EY, Royal Caribbean is taking on this challenge by...

...

Forrester Blogs

Which Digital Agency Should You Use?

I believe most firms have an existential need for a great digital experience service provider — an agency or consultancy — that can help them design, build, and sometimes operate the...

...

Forrester Blogs

Another Serious Player Is Entering The Data Governance 2.0 Market

erwin, the well-known data modeling vendor separated from CA in 2016, is announcing a data governance module in addition to their existing suite of data modeling, enterprise architecture and...

...
 

November 19, 2017

Principa

Why a learning culture is so critical in the workplace

As a company passionate about innovation, we are regularly evaluating and re-evaluating knowledge – whether it’s our own collective knowledge, an employee’s, or a client’s. Knowledge is a funny thing. Common sense might suggest that the more one learns about a subject, the more confident one becomes. However, this is not entirely true, at least not in the beginning.

The Dunning-Kruger Effect

The Dunning-Kruger effect (DKE) is a cognitive bias that has been known for some time, but was only formalised in 1999 by two Cornell psychologists. It involves the seemingly contradictory idea that those with little knowledge of a subject often come across as exceedingly confident about it. Conversely, those with more knowledge may be less confident.


Simplified Analytics

The role of Analytics in Digital Transformation

Today every industry is talking about Digital Transformation and affected by technologies like the Internet of Things, Blockchain, Microservices and Cloud. Every company like Apple, Nike, and Nestle,...

...
 

November 18, 2017


Forrester Blogs

Executives Overestimate DevOps Maturity

2017 is the year of DevOps! The team here at Forrester is seeing this momentum grow every day with increased inquiries, solution proliferation and a growing number of calls from executives on scaling...

...
 

November 17, 2017


Revolution Analytics

Because it's Friday: Better living through chemistry

This video is a compilation of some spectacular chemical reactions, with a few physics demonstrations thrown in for good measure. (But hey, chemistry is just applied physics, right?). That's all from...

...

Revolution Analytics

Highlights from the Connect(); conference

Connect();, the annual Microsoft developer conference, is wrapping up now in New York. The conference was the venue for a number of major announcements and talks. Here are some highlights related to...

...

Revolution Analytics

In case you missed it: October 2017 roundup

In case you missed them, here are some articles from October of particular interest to R users. A recent survey of competitors on the Kaggle platform reveals that Python (76%) and R (59%) are the...

...

Revolution Analytics

Scale up your parallel R workloads with containers and doAzureParallel

by JS Tan (Program Manager, Microsoft) The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric...

...

Forrester Blogs

Help Forrester Benchmark The Privacy/Personalization Paradox

Modern day marketing and advertising has created a paradox: customers want to be recognized and rewarded for their loyalty, but they also want their privacy to be respected and their data used...

...
Jean Francois Puget

2nd Prize Winning Solution to Web Traffic Forecasting competition on Kaggle


 

I'm very proud to have finished 2nd in the latest Kaggle competition, organized by Google Research.  Pardon my team name, but the joke was too tempting given this was a Web Traffic Forecasting competition ;)

 

The competition was about predicting the number of visits to Wikipedia pages. Here is a short description of the competition, from the Kaggle site.

The training dataset consists of approximately 145k time series. Each of these time series represents the number of daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016. The leaderboard during the training stage is based on traffic from January 1st, 2017 up until March 1st, 2017.

The second stage will use training data up until September 1st, 2017. The final ranking of the competition will be based on predictions of daily views between September 13th, 2017 and November 13th, 2017 for each article in the dataset. You will submit your forecasts for these dates by September 12th.

For each time series, you are provided the name of the article as well as the type of traffic that this time series represents (all, mobile, desktop, spider). You may use this metadata and any other publicly available data to make predictions. Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.

To reduce the submission file size, each page and date combination has been given a shorter Id. The mapping between page names and the submission Id is given in the key files.

 

My solution is a combination of deep learning and xgboost. I described it here. What made this competition quite challenging is that some participants quickly found that the median of the visits for the previous 7 weeks was a very good predictor of future visits. Doing better than that was quite hard, and only 50 or so of the more than 1,000 participants managed to beat the median benchmark.
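
For context, the median benchmark is essentially a one-liner. A sketch with pandas, using a toy stand-in for the competition files (which have one row per page and one column per day):

import numpy as np
import pandas as pd

# Toy stand-in for the training data: one row per page, one column per day.
train = pd.DataFrame(np.random.poisson(20, size=(3, 100)))
# Predict the median of the last 7 weeks (49 days) for every future day.
median_prediction = train.iloc[:, -49:].median(axis=1)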

 

Another difficulty was the metric used to evaluate submissions. It is the SMAPE metric, which is a rather weird metric. What makes it weird is that it is discontinuous at 0, and it is non-convex. I discuss its weirdness at length in this notebook.
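
For reference, here is one common formulation of SMAPE, with the convention (used, as far as I know, in this competition) that a term counts as zero when both the actual and forecast values are zero:

import numpy as np

def smape(actual, forecast):
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    ratio = np.zeros_like(denom)
    nonzero = denom != 0
    ratio[nonzero] = 2.0 * np.abs(forecast - actual)[nonzero] / denom[nonzero]
    return 100.0 * ratio.mean()   # ranges from 0 (perfect) to 200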

 

A third difficulty was that we had only about 24 hours between the release of the final training data and the competition deadline. It meant that we had to do all our data processing, model training, and prediction computation within a short span of time, and had to share that time with mundane activities like sleeping, eating, and the day job. I carefully planned for that, and only had to run predefined code. My fear was introducing a last-minute bug during that final 24-hour period. Thankfully I didn't, but other strong competitors lost their chance to win a prize because of last-minute bugs.

 

I learned a lot during this competition. In particular, it was the first time I used deep learning seriously. I will share my code on GitHub very soon, but don't expect very sophisticated deep learning models in it. My model is rather clumsy and full of beginner shortcomings. I compensated with feature engineering and a carefully designed cross-validation setup.
 

Edited on Nov 17, 2017.  The code for this solution is available on GitHub.

 

November 16, 2017


Revolution Analytics

The City of Chicago uses R to issue beach safety alerts

Among the many interesting talks I saw at the Domino Data Science Pop-Up in Chicago earlier this week was the presentation by Gene Lynes and Nick Lucius from the City of Chicago. The City of Chicago...

...

Forrester Blogs

Inside Infosec Teams

Announcing Our New Security & Risk Staffing Survey! Information security is one of the hottest fields around. Data abounds about how awesome it is to work in infosec, how many jobs are available,...

...
Silicon Valley Data Science

Learning from Imbalanced Classes

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! You can find the original post here.

If you’re fresh from a machine learning course, chances are most of the datasets you used were fairly easy. Among other things, when you built classifiers, the example classes were balanced, meaning there were approximately the same number of examples of each class. Instructors usually employ cleaned up datasets so as to concentrate on teaching specific algorithms or techniques without getting distracted by other issues. Usually you’re shown examples like the figure below in two dimensions, with points representing examples and different colors (or shapes) of the points representing the class:

cleaned dataset

The goal of a classification algorithm is to attempt to learn a separator (classifier) that can distinguish the two. There are many ways of doing this, based on various mathematical, statistical, or geometric assumptions:

mathematical, statistical, and geometric assumptions

But when you start looking at real, uncleaned data one of the first things you notice is that it’s a lot noisier and imbalanced. Scatterplots of real data often look more like this:

scatterplot of imbalanced classes

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

  1. About 2% of credit card accounts are defrauded per year1. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately 1% per year.
  4. The conversion rate of online ads has been estimated to lie between 10⁻³ and 10⁻⁶.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

When you encounter such problems, you’re bound to have difficulties solving them with standard algorithms. Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration2. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

This might seem like pathological behavior but it really isn’t. Indeed, if your goal is to maximize simple accuracy (or, equivalently, minimize error rate), this is a perfectly acceptable solution. But if we assume that the rare class examples are much more important to classify, then we have to be more careful and more sophisticated about attacking the problem.

If you deal with such problems and want practical advice on how to address them, read on.

Note: The point of this blog post is to give insight and concrete advice on how to tackle such problems. However, this is not a coding tutorial that takes you line by line through code. I have Jupyter Notebooks (also linked at the end of the post) useful for experimenting with these ideas, but this blog post will explain some of the fundamental ideas and principles.

Handling imbalanced data

Learning from imbalanced data has been studied actively for about two decades in machine learning. It’s been the subject of many papers, workshops, special sessions, and dissertations (a recent survey has about 220 references). A vast number of techniques have been tried, with varying results and few clear answers. Data scientists facing this problem for the first time often ask What should I do when my data is imbalanced? This has no definite answer for the same reason that the general question Which learning algorithm is best? has no definite answer: it depends on the data.

That said, here is a rough outline of useful approaches. These are listed approximately in order of effort:

  • Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution and sometimes it works without need for modification.
  • Balance the training set in some way:
    • Oversample the minority class.
    • Undersample the majority class.
    • Synthesize new minority class examples.
  • Throw away minority examples and switch to an anomaly detection framework.
  • At the algorithm level, or after it:
    • Adjust the class weight (misclassification costs).
    • Adjust the decision threshold.
    • Modify an existing algorithm to be more sensitive to rare classes.
  • Construct an entirely new algorithm to perform well on imbalanced data.

Digression: evaluation dos and don’ts

First, a quick detour. Before talking about how to train a classifier well with imbalanced data, we have to discuss how to evaluate one properly. This cannot be overemphasized. You can only make progress if you’re measuring the right thing.

  1. Don’t use accuracy (or error rate) to evaluate your classifier! There are two significant problems with it. First, accuracy applies a naive 0.50 threshold to decide between classes, and this is usually wrong when the classes are imbalanced. Second, classification accuracy is based on a simple count of the errors, and you should know more than this. You should know which classes are being confused and where (top end of scores, bottom end, throughout?). If you don’t understand these points, it might be helpful to read The Basics of Classifier Evaluation, Part 2. You should be visualizing classifier performance using a ROC curve, a precision-recall curve, a lift curve, or a profit (gain) curve.
    [Figures: example ROC and precision-recall curves for imbalanced classes]

  2. Don’t get hard classifications (labels) from your classifier (via score3 or predict). Instead, get probability estimates via proba or predict_proba.
  3. When you get probability estimates, don’t blindly use a 0.50 decision threshold to separate classes. Look at performance curves and decide for yourself what threshold to use (see next section for more on this). Many errors were made in early papers because researchers naively used 0.5 as a cut-off.
  4. No matter what you do for training, always test on the natural (stratified) distribution your classifier is going to operate upon. See sklearn.model_selection.StratifiedKFold.
  5. You can get by without probability estimates, but if you need them, use calibration (see sklearn.calibration.CalibratedClassifierCV).

The two-dimensional graphs in the first item above are always more informative than a single number, but if you need a single-number metric, one of these is preferable to accuracy:

  1. The Area Under the ROC curve (AUC) is a good general statistic. It is equal to the probability that a random positive example will be ranked above a random negative example.
  2. The F1 Score is the harmonic mean of precision and recall. It is commonly used in text processing when an aggregate measure is sought.
  3. Cohen’s Kappa is an evaluation statistic that takes into account how much agreement would be expected by chance.
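
As a minimal illustration (not from the original post), all three can be computed with scikit-learn; the data below is synthetic and only meant to show the calls:

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, cohen_kappa_score

# Toy imbalanced problem: about 1% positives (hypothetical data).
rng = np.random.RandomState(0)
y_true = (rng.rand(10000) < 0.01).astype(int)
y_score = np.clip(0.3 * y_true + rng.normal(0.15, 0.1, 10000), 0, 1)  # fake scores
y_pred = (y_score >= 0.3).astype(int)   # a hand-picked threshold, not 0.5

print('AUC  :', roc_auc_score(y_true, y_score))    # ranking quality, threshold-free
print('F1   :', f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print('Kappa:', cohen_kappa_score(y_true, y_pred)) # agreement corrected for chance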

Oversampling and undersampling

The easiest approaches require little change to the processing steps, and simply involve adjusting the example sets until they are balanced. Oversampling randomly replicates minority instances to increase their population. Undersampling randomly downsamples the majority class. Some data scientists (naively) think that oversampling is superior because it results in more data, whereas undersampling throws away data. But keep in mind that replicating data is not without consequence: since it results in duplicate data, it makes variables appear to have lower variance than they do. A related consequence is that it multiplies the number of errors: if a classifier makes a false negative error on the original minority data set, and that data set is replicated five times, the classifier will make six errors on the new set. Conversely, undersampling can make the independent variables look like they have a higher variance than they do.

Because of all this, the machine learning literature shows mixed results with oversampling, undersampling, and using the natural distributions.

Oversampling Undersampling imbalanced classes

Most machine learning packages can perform simple sampling adjustment. The R package unbalanced implements a number of sampling techniques specific to imbalanced datasets, and scikit-learn offers basic random resampling via sklearn.utils.resample.
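
As a quick sketch (toy data, not from the original post), random over- and undersampling with sklearn.utils.resample looks like this:

import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 95 majority rows (label 0) and 5 minority rows (label 1).
df = pd.DataFrame({'x': np.random.rand(100), 'label': [0] * 95 + [1] * 5})
minority, majority = df[df.label == 1], df[df.label == 0]

# Oversample the minority up to the majority size (with replacement)...
oversampled = pd.concat([majority,
                         resample(minority, replace=True,
                                  n_samples=len(majority), random_state=0)])
# ...or undersample the majority down to the minority size.
undersampled = pd.concat([minority,
                          resample(majority, replace=False,
                                   n_samples=len(minority), random_state=0)])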

Bayesian argument of Wallace et al.

Possibly the best theoretical argument about, and practical advice for, class imbalance was put forth in the paper Class Imbalance, Redux, by Wallace, Small, Brodley and Trikalinos4. They argue for undersampling the majority class. Their argument is mathematical and thorough, but here I’ll only present an example they use to make their point.

They argue that the two classes must be distinguishable in the tail of some distribution of some explanatory variable. Assume you have two classes with a single explanatory variable, x. Each class is generated by a Gaussian with a standard deviation of 1. The mean of class 1 is 1 and the mean of class 2 is 2. We shall arbitrarily call class 2 the majority class. They look like this:

Standard deviation imbalanced classes

Given an x value, what threshold would you use to determine which class it came from? It should be clear that the best separation line between the two is at their midpoint, x=1.5, shown as the vertical line: if a new example x falls under 1.5 it is probably Class 1, else it is Class 2. When learning from examples, we would hope that a discrimination cutoff at 1.5 is what we would get, and if the classes are evenly balanced this is approximately what we should get. The dots on the x axis show the samples generated from each distribution.

But we’ve said that Class 1 is the minority class, so assume that we have 10 samples from it and 50 samples from Class 2. It is likely we will learn a shifted separation line, like this:

shifted separation line imbalanced classes

We can do better by down-sampling the majority class to match that of the minority class. The problem is that the separating lines we learn will have high variability (because the samples are smaller), as shown here (ten samples are shown, resulting in ten vertical lines):

high variability imbalanced classes

So a final step is to use bagging to combine these classifiers. The entire process looks like this:

bagging imbalanced classes

This technique has not been implemented in Scikit-learn, though a file called blagging.py (balanced bagging) is available that implements a BlaggingClassifier, which balances bootstrapped samples prior to aggregation.
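
To make the idea concrete, here is a rough sketch of undersampling plus bagging (train one classifier per balanced subsample and average their probability estimates). This is an illustration under the assumption of integer 0/1 labels, not the blagging.py implementation itself:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def balanced_bagging_proba(X, y, X_new, n_estimators=10, random_state=0):
    """Average predict_proba over classifiers trained on balanced subsamples."""
    rng = np.random.RandomState(random_state)
    X, y = np.asarray(X), np.asarray(y)
    minority_label = np.argmin(np.bincount(y))          # assumes labels are 0/1
    X_min, y_min = X[y == minority_label], y[y == minority_label]
    X_maj, y_maj = X[y != minority_label], y[y != minority_label]
    probas = []
    for _ in range(n_estimators):
        # undersample the majority class down to the minority size
        X_s, y_s = resample(X_maj, y_maj, replace=False,
                            n_samples=len(X_min),
                            random_state=rng.randint(10**6))
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(np.vstack([X_s, X_min]), np.concatenate([y_s, y_min]))
        probas.append(clf.predict_proba(X_new))
    return np.mean(probas, axis=0)                       # bagged probabilities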

Neighbor-based approaches

Over- and undersampling select examples randomly to adjust their proportions. Other approaches examine the instance space carefully and decide what to do based on their neighborhoods.

For example, Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. In other words, they are pairs of opposing instances that are very close together.

Tomek links imbalanced classes

Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. The diagram above shows a simple example of Tomek link removal. The R package unbalanced implements Tomek link removal, along with a number of other sampling techniques specific to imbalanced datasets. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink).
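
A bare-bones sketch of the idea (an illustration only, not the unbalanced package's implementation): find mutual nearest neighbors with opposite labels and drop the majority member of each pair.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y, majority_label=0):
    X, y = np.asarray(X), np.asarray(y)
    # index of each point's nearest neighbor other than itself
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    to_drop = set()
    for i, j in enumerate(nearest):
        # mutual nearest neighbors with opposite labels form a Tomek link
        if nearest[j] == i and y[i] != y[j]:
            to_drop.add(i if y[i] == majority_label else j)
    keep = np.array([i for i in range(len(y)) if i not in to_drop])
    return X[keep], y[keep]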

Synthesizing new examples: SMOTE and descendants

Another direction of research has involved not resampling of examples, but synthesis of new ones. The best known example of this approach is Chawla’s SMOTE (Synthetic Minority Oversampling TEchnique) system. The idea is to create new minority examples by interpolating between existing ones: take a minority example, find one of its nearest minority-class neighbors, and create a new synthetic example at a random point on the line segment between them. Assume we have a set of majority and minority examples, as before:

SMOTE approach imbalanced classes

SMOTE was generally successful and led to many variants, extensions, and adaptations to different concept learning algorithms. SMOTE and variants are available in R in the unbalanced package and in Python in the UnbalancedDataset package.

It is important to note a substantial limitation of SMOTE. Because it operates by interpolating between rare examples, it can only generate examples within the body of available examples—never outside. Formally, SMOTE can only fill in the convex hull of existing minority examples, but not create new exterior regions of minority examples.
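
To make the interpolation explicit, here is a bare-bones sketch of SMOTE-style synthesis (illustrative only; in practice use the packages mentioned above):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, random_state=0):
    """Create n_new synthetic minority points by interpolating between neighbors."""
    rng = np.random.RandomState(random_state)
    X_minority = np.asarray(X_minority, dtype=float)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]  # drop self
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))       # pick a minority example...
        j = neighbors[i, rng.randint(k)]       # ...and one of its k nearest minority neighbors
        gap = rng.rand()                       # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)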

Adjusting class weights

Many machine learning toolkits have ways to adjust the “importance” of classes. Scikit-learn, for example, has many classifiers that take an optional class_weight parameter whose values can be set higher than one. Here is an example, taken straight from the scikit-learn documentation, showing the effect of increasing the minority class’s weight by ten. The solid black line shows the separating border when using the default settings (both classes weighted equally), and the dashed line after the class_weight parameter for the minority (red) class is changed to ten.

weighing imbalanced classes

As you can see, the minority class gains in importance (its errors are considered more costly than those of the other class) and the separating hyperplane is adjusted to reduce the loss.

It should be noted that adjusting class importance usually only affects the cost of errors on that class (False Negatives, if the minority class is positive). The classifier will adjust its separating surface to decrease these accordingly. Of course, if the classifier makes no errors on the training set then no adjustment may occur, so altering class weights may have no effect.
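
As a concrete sketch (synthetic data, in the spirit of the scikit-learn example rather than a copy of it), a weighted linear SVM can be fit like this:

import numpy as np
from sklearn.svm import SVC

# Synthetic imbalanced data: 100 majority points (label 0), 10 minority points (label 1).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(1.5, 1.0, (10, 2))])
y = np.array([0] * 100 + [1] * 10)

unweighted = SVC(kernel='linear').fit(X, y)
weighted = SVC(kernel='linear', class_weight={1: 10}).fit(X, y)
# The weighted model shifts its boundary to make fewer errors on the minority
# class, at the cost of more errors on the majority class.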

And beyond

This post has concentrated on relatively simple, accessible ways to learn classifiers from imbalanced data. Most of them involve adjusting data before or after applying standard learning algorithms. It’s worth briefly mentioning some other approaches.

New algorithms

Learning from imbalanced classes continues to be an ongoing area of research in machine learning with new algorithms introduced every year. Before concluding I’ll mention a few recent algorithmic advances that are promising.

In 2014 Goh and Rudin published a paper Box Drawings for Learning with Imbalanced Data5 which introduced two algorithms for learning from data with skewed examples. These algorithms attempt to construct “boxes” (actually axis-parallel hyper-rectangles) around clusters of minority class examples:

minority classes examples imbalanced classes

Their goal is to develop a concise, intelligible representation of the minority class. Their equations penalize the number of boxes and the penalties serve as a form of regularization.

They introduce two algorithms, one of which (Exact Boxes) uses mixed-integer programming to provide an exact but fairly expensive solution; the other (Fast Boxes) uses a faster clustering method to generate the initial boxes, which are subsequently refined. Experimental results show that both algorithms perform very well among a large set of test datasets.

Earlier I mentioned that one approach to solving the imbalance problem is to discard the minority examples and treat it as a single-class (or anomaly detection) problem. One recent anomaly detection technique has worked surprisingly well for just that purpose. Liu, Ting and Zhou introduced a technique called Isolation Forests6 that identifies anomalies in data by building an ensemble of randomly split trees and then measuring the average number of decision splits required to isolate each particular data point. The resulting number can be used to calculate each data point’s anomaly score, which can also be interpreted as the likelihood that the example belongs to the minority class. Indeed, the authors tested their system using highly imbalanced data and reported very good results. A follow-up paper by Bandaragoda, Ting, Albrecht, Liu and Wells7 introduced Nearest Neighbor Ensembles as a similar idea that was able to overcome several shortcomings of Isolation Forests.
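
Scikit-learn does ship an IsolationForest estimator in sklearn.ensemble; a minimal sketch of using its anomaly scores as a proxy for the rare class (toy data, for illustration only):

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: 1,000 "normal" points and 10 rare points far from the bulk.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (1000, 2)),
               rng.normal(4.0, 0.5, (10, 2))])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
anomaly_score = -iso.decision_function(X)   # higher means more anomalous
# The rare points should receive clearly higher scores than the bulk.
print(anomaly_score[-10:].mean(), anomaly_score[:-10].mean())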

Buying or creating more data

As a final note, this blog post has focused on situations of imbalanced classes under the tacit assumption that you’ve been given imbalanced data and you just have to tackle the imbalance. In some cases, as in a Kaggle competition, you’re given a fixed set of data and you can’t ask for more.

But you may face a related, harder problem: you simply don’t have enough examples of the rare class. None of the techniques above are likely to work. What do you do?

In some real world domains you may be able to buy or construct examples of the rare class. This is an area of ongoing research in machine learning. If rare data simply needs to be labeled reliably by people, a common approach is to crowdsource it via a service like Mechanical Turk. Reliability of human labels may be an issue, but work has been done in machine learning to combine human labels to optimize reliability. Finally, Claudia Perlich in her Strata talk All The Data and Still Not Enough gives examples of how problems with rare or non-existent data can be finessed by using surrogate variables or problems, essentially using proxies and latent variables to make seemingly impossible problems possible. Related to this is the strategy of using transfer learning to learn one problem and transfer the results to another problem with rare examples, as described here.

Comments or questions?

Here, I have attempted to distill most of my practical knowledge into a single post. I know it was a lot, and I would value your feedback. Did I miss anything important? Any comments or questions on this blog post are welcome.

Resources and further reading

  1. Several Jupyter notebooks are available illustrating aspects of imbalanced learning.
    • A notebook illustrating sampled Gaussians, above, is at Gaussians.ipynb.
    • A simple implementation of Wallace’s method is available at blagging.py. It is a simple fork of the existing bagging implementation of sklearn, specifically ./sklearn/ensemble/bagging.py.
    • A notebook using this method is available at ImbalancedClasses.ipynb. It loads up several domains and compares blagging with other methods under different distributions.
  2. Source code for Box Drawings in MATLAB is available from: http://web.mit.edu/rudin/www/code/BoxDrawingsCode.zip
  3. Source code for Isolation Forests in R is available at: https://sourceforge.net/projects/iforest/

Thanks to Chloe Mawer for her Jupyter Notebook design work.

1. Natalie Hockham makes this point in her talk Machine learning with imbalanced data sets, which focuses on imbalance in the context of credit card fraud detection.
2. By definition there are fewer instances of the rare class, but the problem comes about because the cost of missing them (a false negative) is much higher.
3. The details in courier are specific to Python’s Scikit-learn.
4. “Class Imbalance, Redux”. Wallace, Small, Brodley and Trikalinos. IEEE Conf on Data Mining. 2011.
5. “Box Drawings for Learning with Imbalanced Data.” Siong Thye Goh and Cynthia Rudin. KDD-2014, August 24–27, 2014, New York, NY, USA.
6. “Isolation-Based Anomaly Detection”. Liu, Ting and Zhou. ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 1. 2012.
7. “Efficient Anomaly Detection by Isolation Using Nearest Neighbour Ensemble.” Bandaragoda, Ting, Albrecht, Liu and Wells. ICDM-2014.

The post Learning from Imbalanced Classes appeared first on Silicon Valley Data Science.


Forrester Blogs

Predictions 2018: IoT Will Move From Experimentation To Business Scale

The term IoT can be confusing, and depending on who you talk to they might choose to focus on one element like connectivity or another. At Forrester we believe IoT extends beyond devices and...

...

Forrester Blogs

Dear Indian Brand, Heads Up! Your Customer Experience Score Is En Route

Dear Brand, We know you love your customers. You do your best – to catch their eye, hold their attention, serve and pamper them, and help them in need. And then hope they keep coming back for more....

...

Forrester Blogs

The New W3C Payments Request API Standard Means Better Web Checkout Experiences

All major browser makers are implementing a new World Wide Web Consortium (W3C) Payment Request API standard, which means faster, more secure web checkouts are right around the corner as early as...

...
 

November 15, 2017


Forrester Blogs

2017 Saw Record-Breaking Breaches – And There’s More Where That Came From In 2018

In late 2016, the security and risk team at Forrester made its annual predictions for 2017. Let’s take a quick look at how we did. Prediction No. 1: The incoming Trump administration would face a...

...
 

November 14, 2017


Forrester Blogs

Predictions 2018: Customer-Obsessed, Data-Driven Retailers Will Thrive In 2018

In a world of hyper-adoption – and hyper-abandonment – successful retailing in 2018 comes down to obsessing about your customer’s experience. It’s a tall order: Digital and physical touchpoints now...

...

Revolution Analytics

An update for MRAN

MRAN, the Microsoft R Application Network has been migrated to a new high-performance, high-availability server, and we've taken the opportunity to make a few upgrades along the way. You shouldn't...

...
Jean Francois Puget

Fast Computation of AUC-ROC score

Area under the ROC curve (AUC-ROC) is one of the most common evaluation metrics for binary classification problems. We show here a simple and very efficient way to compute it with Python. Before showing the code, let's briefly describe what an evaluation metric is, and what AUC-ROC is in particular.

An evaluation metric is a way to assess how good a machine learning model is. It is used to compute one or more numbers that summarize how the machine learning model's predictions compare to reality. In order to use an evaluation metric, one has to go through these steps:

  1. Start with a set of labelled examples: each example is described by a set of features, and a target value.  The goal is to learn how to compute from the features a value as close as possible to the target.
  2. Split the available examples into a training set and a test set
  3. Build a model using the training set
  4. Use the model to predict the values for the test set
  5. Use an evaluation metric to summarize the difference between the predictions on the test set and the target for the test set.
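
A minimal sketch of these five steps with scikit-learn, on a toy dataset and with accuracy as the metric for now:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)           # 1. labelled examples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)      # 2. train/test split
model = LogisticRegression().fit(X_tr, y_tr)                         # 3. build a model
predictions = model.predict(X_te)                                    # 4. predict on the test set
print(accuracy_score(y_te, predictions))                             # 5. evaluate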

When the target only takes two values we have a binary classification problem at hand. Examples of binary classification are very common. For instance, in fraud detection the examples are credit card transactions, the features are time, location, amount, merchant id, etc., and the target is fraud or not fraud. Spam detection is also a binary classification problem, where the examples are emails, the features are the email content as a string of words, and the target is spam or not spam. Without loss of generality we can assume that the target values are 0 and 1; for instance 0 means no fraud or no spam, while 1 means fraud or spam.

For binary classification, predictions are also binary. Therefore, a prediction is either equal to the target, or off the mark. A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct? Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model. At least this is what people think when they are exposed for the first time to binary classification problems. The issue is that accuracy can be extremely misleading.

Let's see why. Assume I have a binary classification problem, for instance fraud detection, and that I have a model with 99% accuracy: my model predicts the correct target for 99 of the 100 examples in the test set. It looks like I have a near perfect model, doesn't it?

Well, what if reality is the following?

  • There is about 1% fraud in general, and in my test set there is exactly one fraudulent transaction.
  • My model predicts that no transaction is a fraud, ever.

If you look at it, my model is correct 99% of the time.  Yet it is absolutely useless.

In order to cope with this issue, several alternative metrics have been proposed to replace accuracy, like precision, recall, and the F1 score. But these metrics, as well as accuracy, do not apply to many interesting and effective algorithms, namely algorithms that output a probability rather than a binary value. A probability close to 0 means that the algorithm thinks the target is 0, while a probability close to 1 means that the algorithm thinks the target is 1. Algorithms in this class include logistic regression, gradient boosted trees with log loss, and neural networks with cross entropy loss. One way to use these algorithms is to threshold their output: a probability under 0.5 is transformed into a 0, and a value above 0.5 is transformed into a 1. After thresholding, any of the above metrics can be used.

We used 0.5 as the threshold, but we could have used any other value between 0 and 1. A conservative choice would be a threshold close to 0, for instance 0.1. This amounts to classifying as non-fraud or non-spam only the examples that the algorithm is very confident about. And of course, depending on the threshold you use, the evaluation metric will yield different values.

It would be nice to be able to evaluate the performance of a model without the need to select an arbitrary threshold.  This is precisely what AUC-ROC is providing.  I'll refer to wikipedia for the classical way of defining that metric.  I will use a much simpler way here.

Let's first define some entities.

  • pos is the set of examples with target 1. These are the positive examples.
  • neg is the set of examples with target 0. These are the negative examples.
  • p(i) is the prediction for example i. p(i) is a number between 0 and 1.
  • A pair of examples (i, j) is labelled the right way if i is a positive example, j is a negative example, and the prediction for i is higher than the prediction for j.
  • | s | is the number of elements in set s.

Then AUC-ROC is the count of pairs labelled the right way divided by the number of pairs:

  • AUC-ROC = | {(i,j), i in pos, j in neg, p(i) > p(j)} | / (| pos | x | neg |)

A naive code to compute this would be to consider each possible pair and count those labelled the right way.  A much better way is to sort the predictions first, then visit the examples in increasing order of predictions.  Each time we see a positive example we add the number of negative examples we've seen so far.  We use the numba compiler to make it run fast:

import numpy as np 
from numba import jit

@jit
def fast_auc(y_true, y_prob):
    # order the true labels by increasing predicted probability
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    nfalse = 0   # negative examples seen so far
    auc = 0      # correctly ordered (negative, positive) pairs seen so far
    n = len(y_true)
    for i in range(n):
        y_i = y_true[i]
        nfalse += (1 - y_i)
        auc += y_i * nfalse   # each positive adds the negatives ranked below it
    # normalize by the total number of (positive, negative) pairs
    auc /= (nfalse * (n - nfalse))
    return auc

On my MacBook Pro it runs about twice as fast as the corresponding scikit-learn function. A notebook with the code and a benchmark is available on GitHub.
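
For reference, a small sanity-check sketch comparing the result against scikit-learn on random data (timings will of course vary by machine):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.random.randint(0, 2, 10**5)
y_prob = np.random.rand(10**5)
print(fast_auc(y_true, y_prob))        # should match...
print(roc_auc_score(y_true, y_prob))   # ...this value, up to floating point error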

 

Edited on Nov 14, 2017.  If you are looking for a fast AUC-ROC code in R then have a look at Ben Gorman's code.

 

November 13, 2017


Revolution Analytics

Updated curl package provides additional security for R on Windows

There are many R packages that connect to the internet, whether it's to import data (readr), install packages from Github (devtools), connect with cloud services (AzureML), or many other...

...
Ronald van Loon

AI In Telecom: Intelligent Operations is the New Norm

The move towards an intelligent world is happening faster than ever before. This transition has been driven by several key stakeholders who have redefined the way we look at technology. One of the key players in this transition is Huawei.

Huawei’s recent UBBF conference, held in Hangzhou on October 18-19, was a step towards awareness in this regard. Being personally present at the conference, I noted down numerous takeaways that I would like to present to my readers.

An Intelligent World

One cannot stay oblivious to the fact that we are indeed moving towards an intelligent world. The world of the future is being propelled by the use of Artificial Intelligence (AI) and Machine Learning (ML). These two technologies combine to form the basis of numerous intelligent technologies that help make life easier for us. A few of the developments that substantially affect the usage of telecom networks are:

  • Smart Homes: Through the use of AI and the Internet of Things (IoT), the concept of Smart Homes is spreading at a rapid pace. We have homes where we can monitor and control things as minute as the air conditioning and as important as the security of our assets. The role of video in this revolution cannot be overstated, as video underpins much of the data being used here.
  • Smart Buildings: Smart Buildings are another concept that makes the future more secure and peaceful for the average user. They, too, are shaped by analytics derived from video data.
  • Self-Driving Cars: Self-driving cars are already on the roads and are the pinnacle of how AI can augment us in routine tasks. They make life easier for drivers and suggest that accidents due to neglect on our part may decrease in the future.
  • Smart Cities: All the effectiveness and efficiency gained through AI and ML would eventually lead to the creation of Smart Cities, a concept already envisioned by thought leaders such as Huawei. An example of life within a Smart City would be an efficient traffic management system: the current traffic systems of metropolitan cities leave much to be desired, and with intelligent traffic control we will experience far better traffic management.

With this wave towards a smarter society, everything becomes interconnected and generates data. The data generated through this interconnection would then form the foundation of the systems a smart society will run on.

Growth in the Use of Video

Another point discussed in great detail during the conference was the increasing role of video as the major driver of Internet traffic. Internet traffic is increasing by the minute, and video in particular is experiencing rapid growth. Such is the scope of this growth that it has been called the Smart Video Revolution, and experts have predicted that video will be 80 percent of all Internet traffic in 2019.

While this rapid growth indicates numerous possibilities, it also presents challenges for telco operators. Can all telco operators cater to this high demand for video? Among many other things, this growth will accompany smart homes and smart buildings, and for that, video quality needs to improve drastically, through technologies such as Virtual Reality and 4K and 8K video. Telco operators will have to meet this demand by providing better-quality data services. Beyond quality, most Internet traffic will be directed towards video. Customers have shown a fondness for video, and telco operators need to realize that it is now or never.

To fit into this wave of technology, telcos will have to change the way they operate. Sticking with traditional methods means they would be unable to feed users’ hunger for data and better quality. The call of the day is to innovate with the times. These innovations were discussed at the conference, and I delve into them later in this article.

How Could Efficient Internet Solutions Save Costs?

The Industrial Internet of Things has been called the future of technology, but the truth behind claims of cost savings is still doubted. One major question in users’ minds is what these solutions are doing, and what they could do, to provide such dramatic cost savings in major industries. Notable speakers at the Huawei UBBF conference delved into these savings and presented their impact across industries. One percent of savings from the efficient use of Internet solutions in numerous industries can translate into the following cost savings:

Source: Industrial Internet: Pushing the Boundaries, Evans & Annunziata.

There is no doubt in my mind, or in the minds of many others, that this move towards Internet solutions should not be hindered or stopped for any reason, especially given the monetary gains that are expected. What this means is that telco operators are under added pressure to deliver, or risk becoming a hindrance to what can be called the smart revolution.

Why Telco Operators Should Use AI and ML to Optimize Their Work

The only reasonable course of action is for telco operators to implement AI and ML to get the most out of the smart revolution. A famous quote going around is that data is the new oil. But how could something so common be given such a high monetary value? It is indeed AI that is the oil for the future, and the use of AI will give data the identity it needs. For telco operators, the solution lies in grasping AI and ML and meeting the needs of the future through unparalleled technology.

David Wang of Huawei emphasized this during the UBBF conference, noting that it is imperative for telecom operators to become intelligent, and that the responsibility of following the road to intelligence falls upon all of them. He stressed that the road to network intelligence has three steps that should be adhered to if success is to be achieved. Telecom operators should follow a model that is:

  1. Automatic: They should be able to transition from an order-driven approach to a model-driven approach.
  2. Adaptive: Deep analysis is the key to moving from open-loop to closed-loop operations.
  3. Autonomous: Self-learning takes operations from static policies toward a phase of enhanced self-learning and implementation.

The world is progressing toward the video revolution, and the intelligence of telecom operators will feed it. That intelligence should be present at every stage of operations, service provisioning, and maintenance. The presence of intelligence, through AI and ML, will ensure that intelligent operations become the new norm in telecom.

About The Author

Ronald van Loon is an Advisory Board Member and Big Data & Analytics course advisor for Simplilearn. He contributes his expertise towards the rapid growth of Simplilearn’s popular Big Data & Analytics category.

If you would like to read more from Ronald van Loon on the possibilities of Big Data and Artificial Intelligence please click “Follow” and connect on LinkedIn and Twitter.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker/chairman/panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies, and is now director at a data consultancy company that is a leader in Big Data & data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post AI In Telecom: Intelligent Operations is the New Norm appeared first on Ronald van Loons.


Forrester Blogs

Data Engineers Will Be More Important Than Data Scientists

Does it seem like the ability to find, hire and retain data scientists is a losing battle? Is spending $500K+ per year for a Data Scientist worth it? What is a data scientist anyway? Those are real...

...

Forrester Blogs

Allocate Your Marketing Dollars More Efficiently

Marketers are increasingly under pressure to show that their digital marketing investments are generating returns. Large firms like Unilever have moved to more regularly reevaluate their marketing...

...
 

November 12, 2017

Principa

Automation and machine learning: how much is too much?

At Principa we’ve become quite passionate about Artificial Intelligence and Machine Learning.  Recently quite a bit has been published in the press about how automated machines should be allowed to get.  Most famously perhaps there have been the warnings from the likes of South African born Elon Musk and theoretical physicist Professor Stephen Hawking.


Simplified Analytics

User Experience Design - a key to Digital Transformation

Digital Transformation is moving any company into the future. Well, this statement is true for Disney, Nestle, Apple, Amazon and other leaders as they focused mostly on the customer/user...

...
 

November 10, 2017


Revolution Analytics

Because it's Friday: Drone impact

On July 2 this year, a remotely-piloted drone wandered into active airspace ... twice. Not only does the video below illustrate the significant impact of this event, but it's also an interesting...

...

Revolution Analytics

Developing AI applications on Azure: learning plans at three levels

If you're looking to expand your skills as an AI developer, or just getting started, these learning plans for AI Developers on Azure provide a wealth of information to get you up to speed. The...

...

Forrester Blogs

Why The Right Web Analytics Platform Is (Even More) Business Critical

Why The Right Web Analytics Platform Is (Even More) Business Critical Forrester just recently published The Forrester Wave™: Web Analytics, Q4 2017. In it we rank the leading web analytics platform...

...

Forrester Blogs

Predictions 2018: The Crisis Of Trust And How Smart Brands Will Shape CX In Response

Around the world, CX quality has largely stalled. Why? Multiple data sources show that customer confidence is up, spending is up, and expectations are rising as consumers interact with brands more...

...

Forrester Blogs

Predictions 2018: Digital Strategy Reaches The Heartwood Of The Business

Digital business leaders know that customer obsession is the winning mantra, however meeting the needs of today’s empowered customer requires both laser focus and responsiveness. Experimentation,...

...

Forrester Blogs

Web Summit 2017 Points Toward A More Reflective Tech Crowd

Web Summit 2017 took place from November 7 to 9 in Lisbon. It can come across as a chaotic event without any clear direction, and, as at last year’s Web Summit, there was no single overarching theme....

...
Sumendar Karupakala

The most common errors of R programmers


In [1]:
#load package
library(SDSFoundations)

Error occurred when using the concatenate, or c(), function for vectors

In [2]:
fivenum(C(2, 2, 3, 5, 6, 7, 9, 13,15,18))
Error in C(2, 2, 3, 5, 6, 7, 9, 13, 15, 18): object not interpretable as a factor
Traceback:

1. fivenum(C(2, 2, 3, 5, 6, 7, 9, 13, 15, 18))
2. C(2, 2, 3, 5, 6, 7, 9, 13, 15, 18)
3. stop("object not interpretable as a factor")
In [3]:
#fixed the above error
fivenum(c(2, 2, 3, 5, 6, 7, 9, 13,15,18)) # by simply changing the upper case letter C to lower case c
  1. 2
  2. 3
  3. 6.5
  4. 13
  5. 18
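The error message mentions factors because base R's C() is an unrelated function that sets contrasts on a factor, while lower-case c() concatenates values. A minimal sketch (my addition, not from the original notebook) illustrating the difference:

# c() builds a vector; fivenum() then returns min, lower hinge, median, upper hinge, max
x <- c(2, 2, 3, 5, 6, 7, 9, 13, 15, 18)
fivenum(x)
# C() only makes sense for factors, where it attaches a contrast specification
C(factor(c("a", "b", "b")), contr.treatment)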

Error when viewing the data

In [4]:
#load the data into workspace
animaldata <- AnimalData
In [5]:
view(mtcars) #throws an error
Error in view(mtcars): could not find function "view"
Traceback:
In [6]:
#fixed the above error
View(mtcars) # the letter V must be in upper case in the View() function (View() is not
# supported in the Jupyter R kernel, but works in RStudio and base R)
Error in View(mtcars): 'View()' not yet supported in the Jupyter R kernel
Traceback:

1. View(mtcars)
2. stop(sQuote("View()"), " not yet supported in the Jupyter R kernel")
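Since View() is not supported by the Jupyter R kernel, a practical workaround (my suggestion, not part of the original notebook) is to print a preview of the data instead:

head(mtcars)   # first six rows of the data frame
str(mtcars)    # column names, types, and a sample of values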

Error while indexing

In [7]:
#pull out the animals which are dogs
animaldata[animaldata$Animal.Type == "Dog" ] # throws an error
Error in `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog"): undefined columns selected
Traceback:

1. animaldata[animaldata$Animal.Type == "Dog"]
2. `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog")
3. stop("undefined columns selected")
In [8]:
#fixed error
animaldata[animaldata$Animal.Type == "Dog", ] # the comma inside the brackets was missing: rows go before the comma, columns after

[Output: the rows of animaldata where Animal.Type is "Dog", with columns Impound.No, Intake.Date, Intake.Type, Animal.Type, Neutered.Status, Sex, Age.Intake, Condition, Breed, ..., Lab.Test, Outcome.Date, Outcome.Type and Days.Shelter; the display is truncated to the first and last thirty or so rows.]

In [9]:
#pull out the animals which are dogs
animaldata[animaldata$Animal.Type == " Dog", ] # throws a warning message
Warning message in cbind(parts$left, ellip_h, parts$right, deparse.level = 0L):
"number of rows of result is not a multiple of vector length (arg 2)"
(the same warning is repeated four times)

[Output: column headers only — the comparison with " Dog" (note the leading space) matched no rows.]


In [10]:
#fixed above warning message
animaldata[animaldata$Animal.Type == "Dog", ] # remove the space before the value Dog

[Output: the same dog subset of animaldata as shown for In [8] above; display truncated.]

In [11]:
#pull out the animals which are dogs
animaldata[animaldata$Animal.Type == " Dog" ] # throws a warning message
Warning message in rbind(parts$upper, ellip_v, parts$lower, deparse.level = 0L):
"number of columns of result is not a multiple of vector length (arg 2)"
In [12]:
#fixed above warning message
animaldata[animaldata$Animal.Type == "Dog", ] # remove the space and add comma after the value Dog

[Output: again the dog subset of animaldata, identical to the output of In [8] and In [10] above; display truncated.]
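A slightly more defensive way to pull out the dogs, sketched here as a suggestion rather than taken from the notebook, sidesteps both the missing comma and stray whitespace: subset() selects rows by a condition, and trimws() strips leading or trailing spaces before comparing.

dogs <- subset(animaldata, trimws(Animal.Type) == "Dog")   # rows only; all columns kept
nrow(dogs)   # how many dog records matched
head(dogs)   # preview the first few rows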

Error when the appropriate object isn't passed into the function

In [13]:
fivenum(animaldata)
Error in x[floor(d)] + x[ceiling(d)]: non-numeric argument to binary operator
Traceback:

1. fivenum(animaldata)
In [14]:
#fix the above error by giving a numeric vector to the function
fivenum(animaldata$Weight)
  1. 0.25
  2. 4
  3. 10.25
  4. 35.5
  5. 131
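If the goal is a five-number summary for every numeric column rather than just Weight, one possible sketch (not in the original notebook) is:

num_cols <- animaldata[sapply(animaldata, is.numeric)]   # keep only the numeric columns
sapply(num_cols, fivenum)                                # one column of summaries per variable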

Error while visualising a categorical variable with barplot

In [15]:
barplot(animaldata$Intake.Type)
Error in barplot.default(animaldata$Intake.Type): 'height' must be a vector or a matrix
Traceback:

1. barplot(animaldata$Intake.Type)
2. barplot.default(animaldata$Intake.Type)
3. stop("'height' must be a vector or a matrix")
In [16]:
# fix the above error by applying table function on top of the categorical variable
par(mar=c(1, 5, 6, 4))
barplot(table(animaldata$Intake.Type))
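As an optional refinement (my addition, not from the notebook), the counts can be sorted and labelled before plotting:

counts <- sort(table(animaldata$Intake.Type), decreasing = TRUE)
barplot(counts, main = "Intake types", las = 2)   # las = 2 turns the category labels sideways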

Error in prop.table when a factor is passed instead of a table

In [17]:
acl <- AustinCityLimits
prop.table(acl$Grammy)
Error in Summary.factor(structure(c(2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, : 'sum' not meaningful for factors
Traceback:

1. prop.table(acl$Grammy)
2. Summary.factor(structure(c(2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
. 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
. 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
. 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L,
. 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L,
. 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
. 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
. 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("N",
. "Y"), class = "factor"), na.rm = FALSE)
3. stop(gettextf("%s not meaningful for factors", sQuote(.Generic)))
In [18]:
#fix the above error by passing a table built from the factor into prop.table
prop.table(table(acl$Grammy))
N Y
0.5775862 0.4224138
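To present the same proportions as percentages, a small follow-up (my sketch, not from the notebook) is:

round(prop.table(table(acl$Grammy)) * 100, 1)   # e.g. N 57.8, Y 42.2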

Error in plot.new() : figure margins too large

In [19]:
bull<-BullRiders
plot(bull$YearsPro, bull$BuckOuts12) # in RStudio you will get an error like the one below
Error in plot.new() : figure margins too large
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) : display list redraw incomplete
2: In doTryCatch(return(expr), name, parentenv, handler) : invalid graphics state
3: In doTryCatch(return(expr), name, parentenv, handler) : invalid graphics state
In [20]:
# some data
x <- 1:100
y <- runif(100, -2, 2)
# a usual plot with the default settings
plot(x, y)
# changing the axis titles is pretty straightforward
plot(x, y, xlab = "Index", ylab = "Uniform draws")
In [21]:
# change the sizes of the axis labels and axis title
op <- par(no.readonly = TRUE)   # save the default settings so they can be restored later
par(cex.lab = 1.5, cex.axis = 1.3)
plot(x, y, xlab = "Index", ylab = "Uniform draws")
# if we want big axis titles and labels we need to set aside more margin space for them
par(mar = c(6, 6, 3, 3), cex.axis = 1.5, cex.lab = 2)
plot(x, y, xlab = "Index", ylab = "Uniform draws")
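One detail the cell above sets up but never uses: the defaults saved in op can be restored, which is also the quickest way out of a "figure margins too large" state (alternatively, close the device with dev.off() or enlarge the RStudio plot pane). A small sketch:

par(op)        # restore the saved default graphics settings, including the margins
plot(x, y)     # plotted with the default margins and text sizes again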
In [22]:
par(mfcol = c(5, 3))
hist(RtBio, main = "Histogram of Bio Pappel")
boxplot(RtBio, main = "Box plot of Bio Pappel")
stem(RtBio)
plot(RtBio, main = "Scatter plot of Bio Pappel")

hist(RtAlsea, main = "Histogram of Alsea")
boxplot(RtAlsea, main = "Box plot of Alsea")
stem(RtAlsea)
plot(RtAlsea, main = "Scatter plot of Alsea")

hist(RtTelev, main = "Histogram of Televisa")
boxplot(RtTelev, main = "Box plot of Televisa")
stem(RtTelev)
plot(RtTelev, main = "Scatter plot of Televisa")

hist(RtWalmex, main = "Histogram of Walmex")
boxplot(RtWalmex, main = "Box plot of Walmex")
stem(RtWalmex)
plot(RtWalmex, main = "Scatter plot of Walmex")

hist(RtIca, main = "Histogram of Ica")
boxplot(RtIca, main = "Box plot of Ica")
stem(RtIca)
plot(RtIca, main = "Scatter plot of Ica")
Error in hist(RtBio, main = "Histogram of Bio Pappel"): object 'RtBio' not found
Traceback:

1. hist(RtBio, main = "Histogram of Bio Pappel")
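When an "object not found" error appears, two quick checks (my suggestion, not part of the original notebook) show whether the data was ever loaded into the workspace:

exists("RtBio")   # FALSE here: the object was never created or imported
ls()              # list the objects that actually exist in the current workspace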

Errors when copying and pasting code from other sources

In [23]:
if (length(LETTERS) > 0) {
sample_df <- LETTERS[1:4]
sample_df
Error in parse(text = x, srcfile = src): <text>:5:0: unexpected end of input
3: sample_df
4:
^
Traceback:
In [24]:
#we get 'unexpected end of input' because the closing brace after sample_df was not included

if (length(LETTERS) > 0) {
sample_df <- LETTERS[1:4]
sample_df
} # now we will get the right output
  1. 'A'
  2. 'B'
  3. 'C'
  4. 'D'
In [25]:
## Errors when copying and pasting code (even when it looks fine)
In [26]:
vignette(“quickstart", package = “data.world") # error: the copied curly double quotes are not recognised by the R parser
Error in parse(text = x, srcfile = src): <text>:1:10: unexpected input
1: vignette(“
^
Traceback:
In [27]:
vignette("quickstart", package = "data.world") # it will work when we replace the copied double quotes into actual double quotes with writen hand
starting httpd help server ... done
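If a pasted script is littered with curly quotes, they can be straightened in one pass before evaluating the code; a small sketch using a hypothetical code string:

code <- 'vignette(\u201cquickstart\u201d, package = \u201cdata.world\u201d)'   # pasted text containing curly quotes
gsub("[\u201c\u201d]", '"', code)   # replace left and right curly quotes with straight double quotes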

Error / weird results from a for loop

In [28]:
i <- c(1,2,3,4)
for(i in i:length(i))
{
print(i)
}
Warning message in i:length(i):
"numerical expression has 4 elements: only the first used"
[1] 1
[1] 2
[1] 3
[1] 4
In [29]:
i
4
In [30]:
for(i in i:length(i))
{
print(i)
}
[1] 4
[1] 3
[1] 2
[1] 1
In [31]:
for(i in i:length(i))
{
print(i)
}
[1] 1

Don't forget, when running loops, to select and run the top initialisation line as well (here i <- c(1,2,3,4)); otherwise i keeps whatever value it had after the previous run, so i:length(i) counts down or collapses to a single value and produces the odd results above.
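A safer pattern, offered here as a sketch rather than part of the original notebook, is to keep the loop index separate from the vector it walks over and to use seq_along(), which behaves sensibly even for empty vectors:

v <- c(1, 2, 3, 4)
for (k in seq_along(v)) {   # k runs over the positions 1, 2, 3, 4
  print(v[k])
}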


 

November 09, 2017


Forrester Blogs

Go Ahead, Freak Out About Snap’s Earnings- For The Right Reasons

The sky is falling, right on top of Snap Inc, and the sky’s name is CPM. Since they launched their self-service, auction-based buying environment, CPMs plunged 60%.  60%!?!?!?  Cue the panic, right?...

...

Revolution Analytics

Recap: EARL Boston 2017

By Emmanuel Awa, Francesca Lazzeri and Jaya Mathew, data scientists at Microsoft A few of us got to attend EARL conference in Boston last week which brought together a group of talented users of R...

...

Forrester Blogs

TIP of the Iceberg: Research Announcement on Threat Intel Platforms

A common feature in the threat intelligence platform (TIP) space is aggregation of data and providing an interface for managing threat intelligence — this seems to be where the product visions...

...
Silicon Valley Data Science

Exploring the Possibilities of Artificial Intelligence

I recently spoke with Paco Nathan, Director of the Learning Group at O’Reilly Media. In the interview below, we discuss making life more livable, AI fears, and more.

I want to start by just asking what are you up to and what are you working on in the data space right now?

My job kind of changed a bit back in February. We were out at our own AI conference about a year ago in New York and talking to people and recognizing that some of what we were showcasing had a lot of applicability for our own products and services.

So we started up an effort to do AI applications in media. Mostly for Safari, but partly for editorial also, partly for conferences. And so I have been building out a different kind of business unit, if you will, that not only leverages the machine learning, but really is applying some of the available AI technology to make life more livable in a world where we are swimming in data and media.

What does making it more livable look like?

One example is when we do a conference—we just got done with AI SF, and are going to go do Strata New York next week. We’ll come out of a conference with maybe a couple hundred hours of video after the professional video editors have worked with it. So the notion is that someone who is an acquisitions editor would be reviewing the product. The talks given at our conferences are very leading edge, very notable people quite often saying things that are surprises, that are announcements, hot off the press kind of stuff. When those videos come out, they are very popular for our products on Safari—for our enterprise customers, and B2C customers to use on Safari.

One of the issues, though, is that we do 20 conferences a year and growing. So if I do the math of how many hours per conference, post-edit, and how many conferences per year, it comes out that a development editor would have to have a finger on the fast forward button for 10 months out of the year to be able to review those videos.

Instead, we have been working with Google and talking to some other cloud services that have AI APIs—we can do speech to text and then parsing that allows us to index video for search and content recommendation. What we are going towards is rather than having to watch 40 minutes of video, we can give you a one-page summary where some of the key parts are time coded and you can click to go directly to that portion of the video. And hopefully that makes life more livable as an editor because now we can put what’s important up front.

Do you think that using detection, using data, using technology in that way is either the future or needs to be the future for content producers? I know O’Reilly tends to be a little bit ahead of the curve with that kind of stuff, but do you think in general content producers are going to need to start to become more comfortable with using technology in that way?

Definitely. I have been going out and giving talks in general about what we are doing in Safari, or maybe how we are using Jupyter in a few different ways that are novel. But lately I have been talking more about how we approach AI, and particularly the management—there is a design pattern called human in the loop you are probably familiar with, the idea of active learning.

The gist of it is that rather than try to automate everything, if there is a complex thing that needs to be done a lot, like annotating categories on content, we can build up machine learning models. In particular we build up ensembles. So we have a machine learning pipeline that will take new content and then start to say what topics we think are most important in it. If the ensembles can agree and have confidence in what they are predicting, then great, that’s all automated. It just goes right through the pipeline, but when the ensembles disagree, or they have low confidence, then we feed that back to a human and somebody makes a decision.

They exercise judgment and the example—the decision that they make—gets fed back in as an example to train the machine learning pipeline. So we have built up these content annotation pipelines purely based off of examples. The people that are making those decisions never really touch anything having to do with machine learning model parameters. It is just like this chapter is about XYZ, and about ABC, and about QPW. And so by building up examples pro and con then we can say, okay, you are searching for iOS operating system. Are you talking about Apple smartphones or Cisco switches? Because we have content about a lot of both on Safari. And those kinds of ambiguities are really hard to solve with deep learning because they involve a lot of expertise and context.

We do the human-in-the-loop part and that allows us to have the machines do the heavy lifting, but people are able to interject for the exceptions. So it is not an overwhelming amount of people time, and at the same time we don’t have to automate 100% of what we are doing. You know, we can get by automating 90 to 95%. And as we are showing this, it is really resonating. We are finding enterprise customers who are running into similar conditions and needs and taking similar approaches. We are finding in academia this is a pain point particularly for digital humanities and so there is some pretty good innovation there. I can’t really point to any other content publisher, per se.

Definitely people working with media are running into this. I think the publishers may not be quite up to deploying the latest in deep reinforcement learning and I wouldn’t expect them to. We are kind of enjoying the privilege of being in touch with a lot of leading researchers and developers. But I do think it is the way things are shaping up. And we have definitely seen several talks like that at AI SF this last week. We had several talks about human-in-the-loop, some of them from large consulting firms like Deloitte, others from smaller consulting firms that are really hot properties now like CrowdFlower.

Part of it is if you want to get in the game of using deep learning, you must have labeled data sets. To get labeled data sets, active learning is a really good technique. We are at that stumbling block, and a lot of other companies are too. So I do think this is a trend that’s picking up.

When you are out there talking to people, what kinds of questions or fears do they come up with? I know it is natural for people to occasionally be resistant when technology starts in new areas, or takes on jobs that humans were doing.

Definitely. I think the two that really come up—one of them is that AI has hit an inflection point. The majors all use it, anybody who is producing smartphones and doing search or all of that, there are a lot of AI use cases. And we are seeing more now start to percolate out into big use cases in manufacturing, transportation, energy, etc.

I think there are more people coming in who haven’t traditionally had as much exposure to machine learning or data science or any of the computer science side of things. They are coming out of mechanical engineering and how to run a factory or build a factory—Peter Norvig has a great way of calling it uncertain domains—working with uncertainty. Because they are leveraging uncertainty to be able to do this kind of work.

I think people coming out of more traditional engineering fields start to see the uncertainty aspects to it, and they kind of balk. So how can they really begin to trust the systems and also change out the tools? I mean, when we talk about testing, we are not talking about running a match-all unit test anymore, we are talking about doing statistical testing. So I think in some ways even software engineering is struggling with AI. Again, Peter has given some great talks about that.

The other thing that I think is a real problem that people stumble on right now is transparency. If you have these machine learning pipelines that are doing fantastic work, and even if you have got humans-in-the-loop, how much can we explain the decisions that the automated part is making? How much model transparency do we have? There are some good resources—datascience.com has a GitHub project called Skater that does model interpretation, kind of working toward the general case. I think we’ll be seeing more efforts like that.

I’m always curious about how people predict how new ideas will take hold. Not to put you on the spot and hold you to anything, but do you feel that this is something that is going to move pretty rapidly, or do you think five years from now we are going to be kind of getting people comfortable? I know AI in general—who knows what is going to happen—but just with this particular usage of it.

This kind of usage, where it is mostly automated but people can jump in where they need to, that speaks to what customer service departments do.

I have had this conversation repeatedly. We have AI experts going in and making the judgment calls on the edge cases right now, but realistically our customer service department does that all day every day when they are talking to customers. So maybe we just need to tilt our user interface so we have our customer service organization going in and clicking the buttons in the right places to train the AI because that’s basically what we have built. I’m talking with other large organizations that are coming to a very similar conclusion.

Because of that I think, okay, this is kind of a lab thing right now, but it is being used in production and I think it will be rolled out for us—looking ahead a year we will have essentially customer service people training AI and I don’t think it will be much longer before we see a lot of other cases like that. So I would put that definitely in the two to five year horizon where experts training by example will be rolled out even in pretty mainstream enterprise.

As a side note, enterprise customers could see significant tax benefits from this. Customer service organizations are usually cost centers, but could claim R&D tax credits because they’re helping train AIs and accumulate knowledge for use in automation, which is arguably R&D.

The idea of customer service training AI, how do you think that relates to the chatbot craze right now?

Yeah, it is interesting. I think they have some natural overlap because the kind of dialogue that customer service people have over and over can start to be portioned, segmented, and categorized in ways that start to fit with bot development. And certainly I think some people who are in particular customer support roles sometimes feel like they are bots. I have definitely heard that from friends before. So I think there is some natural overlap there.

But doing bots is difficult, and parsing language is different. Natural language generation is still pretty early but there are a lot of good advances. I was involved in a project in 1995 doing bots for customer service, and we did have a way to fail over to a human, like a retail clerk, if something got interesting. That team ended up competing internationally in something called the Loebner Prize. So I’ll give props to Robby Garner, who is the real competitor there and who actually took first in the Loebner Prize a few times. I guess the reason I’m saying this is that it is not entirely a new field. We saw inklings of it at least 20 years ago.

Making convincing bots is kind of an art form, but it is not impossible. I have definitely seen great examples of it. I do worry now that some of the bot toolkits are kind of over-promising, because again it is kind of an art form, and unless you can really get in and do the artistry the results are going to be iffy. I am pretty hopeful in that area.

We are doing chatbots in production on Safari. Oddly enough, when we do live online training one of the hard things is having hundreds of people in a course break up into groups of four to do group exercises. If you are a professor and you are standing in front of 300 people and you say break into groups of four, you can pretty much stare them down until they do. Been there, done that. But when you are online you can’t. You get this dead space. And people are like “What do I do now?”

We made a chatbot that would basically put people into private DMs, in groups of four and then rebalance it if some people didn’t show up. It works great, because otherwise that is a huge amount of overhead for online training.

So we are doing some work with bots, and some information retrieval work with chatbots also. And I’m certain that Alexa, Google Assistant, and others are going to continue to grow and push demand for transforming it from an art to a practice.

A couple of my colleagues wrote a post on chatbots and banking. I am intrigued but, especially with the recent Equifax situation, do I trust any of these people messing with my personal data? But it is definitely seeping into everything. I have had very pleasant interactions with chatbots and I always speak nicely to them because of Skynet.

[laughing] That’s great, you never know when it will come back.

I have a friend at Concur Labs down here for AI SF, and we are talking about that because they are partnering with Slack and doing some really interesting things on getting enterprise services into chatbots and I hope to be participating in some of that soon too.

One thing I will predict is that there has been this real emphasis on deep learning, and when we did the first AI conference in New York last year, I think I counted—over 80% of the talks were about deep learning. And a lot of it was just people bringing up their NIPS talks and just recycling it for industry.

What we are seeing this year is it’s really branching out in a much more balanced approach. There are the deep learning talks, but then there are also people doing really important work in evolutionary software, and the people doing graph algorithms, and the people doing ontology, and the people doing reinforcement learning and on and on. If I were to roll the clock back and hear a lot of industry buzz about chatbots and deep learning I wouldn’t believe half of it.

Because the thing is, to really do chatbots well you do need to have a kind of knowledge graph. You have to have context to focus the conversation in smart ways. And that context is not something that neural networks are going to be very good at giving you. But if you are working with an ontology, it is much easier to focus a conversation.

We are already seeing it—Amazon, Microsoft, and Google, on their chat APIs and speech-to-text APIs, are taking in controlled vocabularies. They are taking in bits of ontology to help refine the quality. And that is my main prediction: you will see a lot more of the ontology work driving that direction.

We are almost at time so I want to ask my favorite question because it results in such random answers. This can be related to what we have been talking about, or something else in data, or nothing to do with data at all, but is there anything you are looking forward to in the future that you are going to jump on as a project?

One thing that I would put into my crystal ball—I think we have been able to do a lot of work with machine learning, and particularly with deep learning AI techniques, lately to do metadata cleanup. It is always a problem when you have a business that has been going on for decades and you have made acquisitions and a lot of that has come forward in the last couple of years. So cleaning up metadata is a lot harder than it sounds, and also extremely valuable.

Lukas Biewald, founder of CrowdFlower, gave a great talk about active learning at AI SF, using the timeline of each algorithm’s introduction versus its first “killer data set” as an example. He stressed the need for metadata cleanup too. And it was this repeated theme: you may have the best algorithm in the world, but until your data sets can really be leveraged it is not going to have a lot of impact.

Right now I’m really hopeful about doing a lot more metadata cleanup, it has big value because then we can deploy the AI side of it. But the thing I see on the horizon right beyond that is, for instance, topological data analysis (TDA). There has been a lot of interesting work—not just AI but others as well—a lot of interesting work. And frankly, the math behind that is so compute intensive that until you could run GPU clusters in the cloud, it didn’t make sense. So I think we are on the horizon that once you can clean up your metadata then you can get into stuff like TDA and get to some really interesting complex insights.

Editor’s note: The above has been edited for length and clarity.

The post Exploring the Possibilities of Artificial Intelligence appeared first on Silicon Valley Data Science.


Forrester Blogs

Predictions 2018: Blended AI Will Disrupt Your Customer Service And Sales Strategy

The promise of artificial intelligence (AI) has permeated across the enterprise giving hopes of amping up automation, enriching insights, streamlining processes, augmenting workers, and in many ways...

...

Forrester Blogs

Predictions 2018: The Blockchain Revolution Will Have To Wait A Little Longer

The visionaries will forge ahead, those hoping for immediate industry and process transformation will give up. This is the answer I usually give when asked for a one-sentence summary of how I see...

...

Forrester Blogs

Predictions 2018: Automation Will Alter The Global Workforce

Until now, automation has gotten a bad rap. In 2013, Oxford professors Carl Frey and Michael Osborne analyzed 702 occupations and declared 69 million US jobs, or 47% of the workforce, would be lost....

...

Revolution Analytics

Calculating the house edge of a slot machine, with R

Modern slot machines (fruit machine, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated,...

...

Forrester Blogs

Give Your Customers A Present This Year: The Gift Of Customer Acknowledgement

Have you taken note of the holiday emails frequenting your inbox? I have, and as much as I love a good deal, I’m starting to get a case of “oh boy the holidays are here.” My recognition of the...

...
 

November 08, 2017


Forrester Blogs

Predictions 2018: Who Will Win And Who Will Lose Across The Media Landscape

Media, so vital to the wellbeing of the economy, has endured a turbulent year. Marketers questioned the quality of the supply chain. Agencies reorganized to reinforce their value proposition. Ad tech...

...

Forrester Blogs

Predictions 2018: The Year Of Breakout Marketers, And Those They Leave Behind

Forrester’s been writing about customer obsession for nearly half a decade, but until recently, most companies merely paid lip service to it. What’s changing? Marketers’ jobs...

...