Machine learning is getting everywhere. This article introduces some
basic concepts in machine learning.
Introduction
What is machine learning?
Well-posed learning problem: A computer program is said to learn
from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with
experience E.
Machine learning algorithms:
Supervised learning
Unsupervised learning
Others
Reinforcement learning
Recommender systems
...
Supervised learning
Regression problem: a continuous output
Classification problem: a discrete output
Unsupervised learning
There is no feedback based on the prediction results
For example, clustering the data
Linear regression with one
variable
Model representation
Training set notation:
$m$ = number of training examples
$x$ = "input" variable / feature
$y$ = "output" variable / "target" variable
The goal:
Given a training set, to learn a function $h$ (hypothesis) so that $h(x)$ is a "good" predictor for the corresponding value of $y$.
Cost function
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost function: $$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$ This function is called the "squared error function" or "mean squared error". The mean is halved as a convenience for the computation of the gradient descent.
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Gradient descent algorithm
Have some function $J(\theta_0, \theta_1)$.
Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
The algorithm:
Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j = 0$ and $j = 1$) }
$\theta_0$ and $\theta_1$ need to be updated simultaneously instead of in turn.
Gradient descent for
linear regression
So the algorithm is to repeat the following until convergence:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
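As a concrete illustration, here is a minimal Octave sketch of this update (the data vectors `x`, `y`, the learning rate `alpha`, and `num_iters` are assumed to be defined):

```octave
% Batch gradient descent for h(x) = theta0 + theta1 * x
% x, y: column vectors of m training examples; alpha: learning rate
theta0 = 0; theta1 = 0;
m = length(y);
for iter = 1:num_iters
  h = theta0 + theta1 * x;              % predictions for all examples
  temp0 = theta0 - alpha * (1/m) * sum(h - y);
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);
  theta0 = temp0; theta1 = temp1;       % simultaneous update
end
```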
Linear regression
with multiple variables
Multiple features
We introduce the following notations:
$x_j^{(i)}$ = value of the $j$-th feature in the $i$-th training example
$x^{(i)}$ = the input (features) of the $i$-th training example
$m$ = the number of training examples
$n$ = the number of features
The multivariable form of the hypothesis function accommodating these multiple features is as follows: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$, with the convention $x_0 = 1$.
The gradient descent algorithm works for multiple features as well: repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }, updating simultaneously for $j = 0, \dots, n$.
Logistic regression
Classification
First, the binary classification problem: classify examples into a positive class and a negative class, $y \in \{0, 1\}$.
Linear regression does not suit this problem well.
Logistic regression is better: its output always satisfies $0 \le h_\theta(x) \le 1$.
Hypothesis representation
Logistic regression model
Want $0 \le h_\theta(x) \le 1$.
For linear regression, we have $h_\theta(x) = \theta^T x$, which can be less than 0 and greater than 1. This is not desired.
Thus we use the sigmoid (logistic) function: $h_\theta(x) = g(\theta^T x)$ where $g(z) = \frac{1}{1 + e^{-z}}$, which gives $0 < h_\theta(x) < 1$.
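In Octave, a minimal vectorized sigmoid looks like:

```octave
function g = sigmoid(z)
  % Elementwise logistic function; works for scalars, vectors, and matrices
  g = 1 ./ (1 + exp(-z));
end
```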
Interpretation of hypothesis output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$: $h_\theta(x) = P(y = 1 \mid x; \theta)$, the "probability that $y = 1$, given $x$, parameterized by $\theta$".
Decision boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: predict $y = 1$ when $h_\theta(x) \ge 0.5$ and $y = 0$ when $h_\theta(x) < 0.5$. The way our logistic function behaves is that when its input is greater than or equal to zero, its output is greater than or equal to $0.5$: $g(z) \ge 0.5$ when $z \ge 0$. So if our input to $g$ is $\theta^T x$, then that means: $h_\theta(x) = g(\theta^T x) \ge 0.5$ when $\theta^T x \ge 0$. The decision boundary is the line that separates the area where $y = 0$ and where $y = 1$. It is created by our hypothesis function.
Cost function
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$, with $x \in \mathbb{R}^{n+1}$ ($x_0 = 1$) and $y \in \{0, 1\}$.
How to choose parameters $\theta$?
For linear regression, the cost function is: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. However, this cost function does not work for logistic regression because its $h_\theta(x)$ is highly non-linear. The resulting $J(\theta)$ will have many local minima (it is non-convex), which makes the optimization difficult. It might never converge to the global minimum.
Logistic regression cost
function
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
Its properties: $\mathrm{Cost} = 0$ if $y = 1$ and $h_\theta(x) = 1$. But as $h_\theta(x) \to 0$, $\mathrm{Cost} \to \infty$. This captures the intuition that if $h_\theta(x) = 0$ (predict $P(y = 1 \mid x; \theta) = 0$) but actually $y = 1$, we'll penalize the learning algorithm by a very large cost.
Simplified cost function
This function is equivalent to the case-split one above: $\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$.
The logistic regression cost function will then be: $$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$ To fit parameters $\theta$: $\min_\theta J(\theta)$.
To make a prediction given a new $x$: output $h_\theta(x)$.
Gradient
descent on the simplified cost function
Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
The algorithm looks identical to linear regression! But since the hypothesis function changed, these two algorithms are actually not the same.
A vectorized implementation is: $\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - \vec{y} \right)$
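A minimal Octave sketch of one such vectorized step (assuming `X` is the $m \times (n+1)$ design matrix, `y` the label vector, and `sigmoid` as defined above):

```octave
% One vectorized gradient descent step for logistic regression
h = sigmoid(X * theta);                    % m x 1 vector of predictions
theta = theta - (alpha / m) * (X' * (h - y));
```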
Advanced
optimization on the simplified cost function
Optimization algorithms:
Gradient descent
Conjugate gradient
BFGS
L-BFGS
Advantages
No need to manually pick $\alpha$
Often faster than gradient descent
Disadvantages
More complex
Example
We want to minimize $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$. Here: $\frac{\partial}{\partial \theta_1} J(\theta) = 2(\theta_1 - 5)$ and $\frac{\partial}{\partial \theta_2} J(\theta) = 2(\theta_2 - 5)$.
Then this optimization problem can be solved with the following
Octave code:
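(A minimal sketch of the standard `fminunc` pattern; the `MaxIter` value is illustrative.)

```octave
% costFunction.m -- returns both the cost and its gradient
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end
```

```octave
% Tell fminunc that we supply the gradient, then run it
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```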
Multiclass classification: one-vs-all
We divide our problem into $n + 1$ binary classification problems. In each one, we predict the probability that $y$ is a member of one of our classes. We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
The problem of overfitting
Overfitting: If we have too many features, the learned hypothesis may
fit the training set very well, but fail to generalize to new
examples.
Addressing overfitting
Reduce number of features
Manually select which features to keep.
Model selection algorithm.
Regularization
Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Cost function
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$
Here $\lambda$ is the regularization parameter. It determines how much the costs of our $\theta$ parameters are inflated. Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If $\lambda$ is chosen to be too large, it may smooth out the function too much and cause underfitting.
Regularized linear
regression
The modified algorithm becomes:
Repeat {
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \quad j \in \{1, 2, \dots, n\}$$
}
The term $\frac{\lambda}{m} \theta_j$ performs our regularization. With some manipulation our update rule can also be represented as: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$. The factor $\left( 1 - \alpha \frac{\lambda}{m} \right)$ in the first term will always be less than 1.
Normal equation
Now let's approach regularization using the alternative, non-iterative normal equation. To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses: $$\theta = \left( X^T X + \lambda L \right)^{-1} X^T y$$ where $L$ is the matrix with a 0 in the top-left entry and 1s down the rest of the diagonal. $L$ should have dimension $(n+1) \times (n+1)$. Recall that if $m \le n$, then $X^T X$ is non-invertible. However, when we add the term $\lambda L$, then $X^T X + \lambda L$ becomes invertible.
Regularized logistic
regression
The algorithm looks the same as the linear regression one, with a different hypothesis function $h_\theta(x)$.
Neural Networks
Motivations
Some problems are fundamentally non-linear, and solving them with linear regression would require a gigantic feature space, which is often not feasible.
Origins: algorithms that try to mimic the brain. Was very widely
used in 80s and early 90s. Popularity diminished in late 90s.
Recent resurgence: state-of-the-art technique for many
applications.
The one-learning-algorithm hypothesis.
Model representation
Neuron model: Logistic unit
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals that are channeled to outputs. In our model, our dendrites are like the input features $x_1, \dots, x_n$, and the output is the result of our hypothesis function. In this model our $x_0$ input node is sometimes called the "bias unit". It is always equal to 1. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, yet we sometimes call it a sigmoid activation function. In this situation, our $\theta$ parameters are sometimes called weights.
Our input nodes, also known as the "input layer", go into another node, which finally outputs the hypothesis function, known as the "output layer". We can have intermediate layers of nodes between the input and output layers called the "hidden layers". We label these intermediate or "hidden" layer nodes $a_1^{(2)}, \dots, a_n^{(2)}$ and call them "activation units". If we had one hidden layer, the values for each of the activation nodes would be obtained as follows:
$$a_1^{(2)} = g\left( \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3 \right)$$
$$a_2^{(2)} = g\left( \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3 \right)$$
$$a_3^{(2)} = g\left( \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3 \right)$$
$$h_\Theta(x) = a_1^{(3)} = g\left( \Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)} \right)$$
This is saying that we compute our activation nodes by using a $3 \times 4$ matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix $\Theta^{(2)}$ containing the weights for our second layer of nodes. Each layer gets its own matrix of weights, $\Theta^{(j)}$. The dimensions of these matrices of weights are determined as follows: if the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$. The $+1$ comes from the addition in $\Theta^{(j)}$ of the "bias nodes" $x_0$ and $\Theta_0^{(j)}$. In other words the output nodes will not include the bias nodes while the inputs will.
Vectorized implementation
So in general: $z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$ and $a^{(j)} = g(z^{(j)})$, with $a^{(1)} = x$ and a bias unit $a_0^{(j)} = 1$ appended at each layer; the output is $h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$.
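As a sketch, forward propagation for a three-layer network in Octave (assuming weight matrices `Theta1` and `Theta2`, the `sigmoid` function from earlier, and one example `x` as a column vector):

```octave
% Forward propagation for one example through a 3-layer network
a1 = [1; x];                 % input layer with bias unit
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];       % hidden layer with bias unit
z3 = Theta2 * a2;
h  = sigmoid(z3);            % output layer: h_Theta(x)
```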
Applications
Simple example: AND
Our neural network can be, for example, $\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$, i.e. $h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)$, which gives the following table:

| $x_1$ | $x_2$ | $h_\Theta(x)$ |
|---|---|---|
| 0 | 0 | $g(-30) \approx 0$ |
| 0 | 1 | $g(-10) \approx 0$ |
| 1 | 0 | $g(-10) \approx 0$ |
| 1 | 1 | $g(10) \approx 1$ |

FYI: $g(4.6) \approx 0.99$ and $g(-4.6) \approx 0.01$.
Example: OR
Our neural network will then be: $\Theta^{(1)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$, i.e. $h_\Theta(x) = g(-10 + 20 x_1 + 20 x_2)$.
Example: Negation
Our neural network for NOT $x_1$ can be: $\Theta^{(1)} = \begin{bmatrix} 10 & -20 \end{bmatrix}$, i.e. $h_\Theta(x) = g(10 - 20 x_1)$.
Putting it together: with an AND unit and a (NOT $x_1$) AND (NOT $x_2$) unit in the hidden layer, and an OR unit in the output layer, the network computes XNOR.
Multiclass classification
Multiple output units: one-vs-all
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four categories. We can define our set of resulting classes as $y^{(i)} \in \{[1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T, [0,0,0,1]^T\}$. Each $y^{(i)}$ represents a different image corresponding to one of the four categories. The inner layers each provide us with some new information which leads to our final hypothesis function. The setup looks like: $x \to a^{(2)} \to a^{(3)} \to \cdots \to h_\Theta(x) \in \mathbb{R}^4$.
Learning
Cost function
For a neural network for a classification problem, let's denote:
Inputs
$L$ = total number of layers in the network
$s_l$ = number of units (not counting bias unit) in layer $l$
Two types of classification problems
Binary classification: 1 output unit, $y \in \{0, 1\}$, $K = 1$.
Multi-class classification ($K$ classes): $K$ output units, $y \in \mathbb{R}^K$.
Cost function for neural
network
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left( h_\Theta(x^{(i)}) \right)_k + \left( 1 - y_k^{(i)} \right) \log\left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2$$
In the regularization part, after the square brackets, we must account for multiple $\Theta$ matrices. The number of columns in our current $\Theta$ matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current $\Theta$ matrix is equal to the number of nodes in the next layer (excluding the bias unit).
Note:
The double sum simply adds up the logistic regression costs calculated for each cell in the output layer.
The triple sum simply adds up the squares of all the individual $\Theta$s in the entire network.
The $i$ in the triple sum does not refer to training example $i$.
Backpropagation algorithm
We need to compute $J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$. The backpropagation algorithm is then:
Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$:
Set $\Delta_{ij}^{(l)} = 0$ for all $l, i, j$.
For $t = 1$ to $m$:
Set $a^{(1)} = x^{(t)}$.
Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$.
Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$.
Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \left( (\Theta^{(l)})^T \delta^{(l+1)} \right) .\!* \, g'(z^{(l)})$, where $g'(z^{(l)}) = a^{(l)} .\!* \, (1 - a^{(l)})$.
Accumulate $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$. The vectorized implementation looks like: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left( a^{(l)} \right)^T$.
Then $D_{ij}^{(l)} := \frac{1}{m} \left( \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)} \right)$ if $j \ne 0$, and $D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}$ if $j = 0$.
It can be shown that, when these computations are finished, we have $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.
Backpropagation intuition
"error" of cost
for (unit in layer ). Formally, for , where .
Gradient checking
Implement a numerical gradient check to compute gradApprox: $$\frac{\partial}{\partial \theta_j} J(\theta) \approx \frac{J(\theta_1, \dots, \theta_j + \epsilon, \dots, \theta_n) - J(\theta_1, \dots, \theta_j - \epsilon, \dots, \theta_n)}{2\epsilon}$$ with e.g. $\epsilon = 10^{-4}$.
Make sure gradApprox and the backpropagation gradients give similar values.
Turn off gradient checking, and use the backprop code for learning.
Be sure to disable your gradient checking code before training your classifier. The numerical estimation of gradients is much slower than the backpropagation algorithm.
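A minimal Octave sketch of the two-sided check (assuming `J` is a function handle mapping an unrolled parameter vector to its cost):

```octave
function gradApprox = numericalGradient(J, theta)
  % Two-sided finite-difference estimate of the gradient of J at theta
  EPSILON = 1e-4;
  n = length(theta);
  gradApprox = zeros(n, 1);
  for j = 1:n
    thetaPlus = theta;   thetaPlus(j)  = thetaPlus(j)  + EPSILON;
    thetaMinus = theta;  thetaMinus(j) = thetaMinus(j) - EPSILON;
    gradApprox(j) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
  end
end
```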
Random initialization
Initializing all elements of $\Theta$ to 0 does not work for neural networks: all hidden units compute the same function and the updates stay identical. We need random initialization to perform symmetry breaking. So we initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$:
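In Octave (the layer sizes 10 and 11 are only an example):

```octave
% Initialize weights to random values in [-INIT_EPSILON, INIT_EPSILON]
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```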
The skeleton of a two-layer network's cost function then looks like:

```octave
function [J, grad] = nnCostFunction(nn_params, ...
                                    input_layer_size, ...
                                    hidden_layer_size, ...
                                    num_labels, ...
                                    X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be an "unrolled" vector of the
%   partial derivatives of the neural network.

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight
% matrices for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% ... (forward propagation, cost, and backpropagation go here) ...

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end
```
Machine Learning System
Design
Evaluating a learning
algorithm
Debugging a learning
algorithm
Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$, etc.)
Try decreasing $\lambda$
Try increasing $\lambda$
Machine learning diagnostic
Diagnostic: A test that you can run to gain insight what is/isn't
working with a learning algorithm, and gain guidance as to how best to
improve its performance.
Diagnostics can take time to implement, but doing so can be a very
good use of your time.
Evaluating a hypothesis
Training/testing procedure for linear regression
Learn parameter $\theta$ from the training data (minimizing training error $J_{train}(\theta)$).
Compute the test set error for linear regression: $$J_{test}(\theta) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$$
Training/testing procedure for logistic regression
Learn parameter $\theta$ from the training data.
Compute the test set error: $$J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log h_\theta(x_{test}^{(i)}) + \left( 1 - y_{test}^{(i)} \right) \log\left( 1 - h_\theta(x_{test}^{(i)}) \right) \right]$$
Or the misclassification error (aka 0/1 misclassification error): $err(h_\theta(x), y) = 1$ if ($h_\theta(x) \ge 0.5$ and $y = 0$) or ($h_\theta(x) < 0.5$ and $y = 1$), and $0$ otherwise. This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is: $\text{Test Error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err\left( h_\theta(x_{test}^{(i)}), y_{test}^{(i)} \right)$. This gives us the proportion of the test data that was misclassified.
Model selection
and train/validation/test sets
Once parameters $\theta$ were fit to some set of data (the training set), the error of the parameters as measured on that data (the training error $J_{train}(\theta)$) is likely to be lower than the actual generalization error.
We thus divide the data into three sets:
Training set: 60% of the data, with training error $J_{train}(\theta)$
Cross validation set: 20% of the data, with cross validation error $J_{cv}(\theta)$
Test set: 20% of the data, with test error $J_{test}(\theta)$
We can now calculate three separate error values for the three different sets using the following method:
Optimize the parameters $\theta$ using the training set for each polynomial degree.
Find the polynomial degree $d$ with the least error using the cross validation set.
Estimate the generalization error using the test set with $J_{test}(\theta^{(d)})$.
Diagnosing bias vs. variance
Suppose your learning algorithm is performing less well than you were hoping. Is it a bias problem or a variance problem?
High bias (underfit):
$J_{train}(\theta)$ will be high, and $J_{cv}(\theta) \approx J_{train}(\theta)$.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Low training set size: causes $J_{train}(\theta)$ to be low and $J_{cv}(\theta)$ to be high.
Large training set size: causes both $J_{train}(\theta)$ and $J_{cv}(\theta)$ to be high, with $J_{train}(\theta) \approx J_{cv}(\theta)$.
High variance (overfit):
$J_{train}(\theta)$ will be low, and $J_{cv}(\theta) \gg J_{train}(\theta)$.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
Low training set size: $J_{train}(\theta)$ will be low and $J_{cv}(\theta)$ will be high.
Large training set size: $J_{train}(\theta)$ increases with training set size and $J_{cv}(\theta)$ continues to decrease without leveling off. Also $J_{train}(\theta) < J_{cv}(\theta)$, but the difference between them remains significant.
Choosing the
regularization parameter
In order to choose the model and the regularization term $\lambda$, we need to:
Create a list of lambdas (i.e. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, \dots, 10.24\}$)
Create a set of models with different degrees or any other variants
Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\theta$.
Compute the cross validation error using the learned $\theta$ (computed with $\lambda$) on $J_{cv}(\theta)$ without regularization, i.e. with $\lambda = 0$.
Select the best combo that produces the lowest error on the cross validation set.
Using the best combo $\theta$ and $\lambda$, apply it on $J_{test}(\theta)$ to see if it has a good generalization of the problem.
Deciding what to do next
revisited
Get more training examples: fixes high variance
Try smaller sets of features: fixes high variance
Try getting additional features: fixes high bias
Try adding polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$, etc.): fixes high bias
Try decreasing $\lambda$: fixes high bias
Try increasing $\lambda$: fixes high variance
Neural networks and
overfitting
"Small" neural network (fewer parameters; more prone to
underfitting)
Computationally cheaper.
"Large" neural network (more parameters; more prone to overfitting)
Computationally more expensive.
Use regularization ($\lambda$) to address overfitting.
Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers using your cross validation set, and then select the one that performs best.
Machine
learning system design: Spam classification example
Prioritizing what to work on
Supervised learning. $x$ = features of email. $y$ = spam (1) or not spam (0). Features $x$: choose 100 words indicative of spam / not spam. In practice, take the most frequently occurring $n$ words (10,000 to 50,000) in the training set, rather than manually picking 100 words.
Building a spam classifier
How to spend your time to make it have low error?
Collect lots of data
E.g. "honeypot" project
Develop sophisticated features based on email routing information
(from email header).
Develop sophisticated features for message body, e.g. should
"discount" and "discounts" be treated as the same word? How about "deal"
and "Dealer"? Features about punctuation?
Develop sophisticated algorithm to detect misspellings (e.g.
m0rtgage, med1cine, w4tches.)
Error analysis
Recommended approach
Start with a simple algorithm that you can implement quickly.
Implement it and test it on your cross-validation data.
Plot learning curves to decide if more data, more features, etc. are
likely to help.
Error analysis: Manually examine the examples (in cross validation
set) that your algorithm made errors on. See if you spot any systematic
trend in what type of examples it is making errors on.
Error analysis example
$m_{cv} = 500$ examples in the cross validation set. The algorithm misclassifies 100 emails. Manually examine the 100 errors, and categorize them based on:
What type of email it is.
What cues (features) you think would have helped the algorithm
classify them correctly.
The importance of
numerical evaluation
Should discount/discounts/discounted/discounting be treated as the
same word? Can use "stemming" software (e.g. "porter stemmer"). It can
hurt as well (universe/university). Error analysis may not be helpful
for deciding if this is likely to improve performance. Only solution is
to try it and see if it works. Need numerical evaluation (e.g., cross
validation error) of algorithm's performance with and without
stemming.
Handling skewed data
Cancer classification
example
Train a logistic regression model $h_\theta(x)$ ($y = 1$ if cancer, $y = 0$ otherwise). Find that you got 1% error on the test set. However, only 0.50% of patients have cancer. The number of positive examples is so small that it creates skewed classes. (You can predict $y = 0$ all the time to achieve an error rate of 0.5%.)
Precision/recall
$y = 1$ in presence of the rare class that we want to detect.
Precision: of all patients where we predicted $y = 1$, what fraction actually has cancer? $$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
Recall: of all patients that actually have cancer, what fraction did we correctly detect as having cancer? $$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
| | Actual class: 1 | Actual class: 0 |
|---|---|---|
| Predicted class: 1 | True positive | False positive |
| Predicted class: 0 | False negative | True negative |
A good classifier should have high precision and high recall. Even for a skewed class, it is very difficult to cheat your way to both a high precision and a high recall.
Trading off precision and
recall
Logistic regression: $0 \le h_\theta(x) \le 1$. Predict 1 if $h_\theta(x) \ge 0.5$, predict 0 if $h_\theta(x) < 0.5$.
Suppose we want to predict $y = 1$ (cancer) only if we are very confident. Then we could only predict 1 if $h_\theta(x) \ge 0.7$ (or 0.9). This gives a higher precision, lower recall.
Suppose we want to avoid missing too many cases of cancer (avoid false negatives). Then we could predict 1 if $h_\theta(x) \ge 0.3$. This gives a higher recall, lower precision.
$F_1$ score (F score)
How to compare precision/recall numbers? The $F_1$ score is: $$F_1 = 2 \, \frac{PR}{P + R}$$ There are more ways to compare precision and recall values other than the $F_1$ score.
Data for machine learning
Assume the features $x$ have sufficient information to predict $y$ accurately. Useful test: given the input $x$, can a human expert confidently predict $y$?
Use a learning algorithm with many parameters (e.g. logistic regression / linear regression with many features; neural network with many hidden units), thus a low bias algorithm. Use a very large training set (unlikely to overfit).
Large Margin Classification
Support vector machine
Logistic regression: $$\min_\theta \frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \left( -\log h_\theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \left( -\log\left( 1 - h_\theta(x^{(i)}) \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Support vector machine: $$\min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + \left( 1 - y^{(i)} \right) \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
Hypothesis: $h_\theta(x) = 1$ if $\theta^T x \ge 0$, and $0$ otherwise.
If $y = 1$, we want $\theta^T x \ge 1$ (not just $\ge 0$)
If $y = 0$, we want $\theta^T x \le -1$ (not just $< 0$)
If $C$ is very large, then the first term must be $\approx 0$, which gives the SVM decision boundary shown below. The SVM is sometimes also called a large margin classifier.
SVM decision boundary
Kernels
Given $x$, compute new features $f_i$ depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$. We let: $$f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\left( -\frac{\| x - l^{(i)} \|^2}{2\sigma^2} \right)$$ This similarity function is called a Gaussian kernel.
If $x \approx l^{(i)}$: $f_i \approx 1$. If $x$ is far from $l^{(i)}$: $f_i \approx 0$. We predict "1" when $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. The decision boundary will then be close to the landmarks, where the $f_i$ are large.
Choosing the landmarks
We can choose to put the landmarks exactly where the training examples are. Concretely:
Given $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, \dots, l^{(m)} = x^{(m)}$. Given an example $x$: compute $f_1 = \mathrm{similarity}(x, l^{(1)}), \dots, f_m = \mathrm{similarity}(x, l^{(m)})$ and let $f = [f_0, f_1, \dots, f_m]^T$ with $f_0 = 1$. For a training example $(x^{(i)}, y^{(i)})$, the feature vector is $f^{(i)}$.
Hypothesis: given $x$, compute the features $f \in \mathbb{R}^{m+1}$, predict $y = 1$ if $\theta^T f \ge 0$.
Training: $$\min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T f^{(i)}) + \left( 1 - y^{(i)} \right) \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$
SVM parameters
$C$ ($= \frac{1}{\lambda}$):
Large $C$: lower bias, higher variance.
Small $C$: higher bias, lower variance.
$\sigma^2$:
Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
SVM in practice
Use an SVM software package (e.g. liblinear, libsvm, ...) to solve for the parameters $\theta$. Need to specify:
Choice of parameter $C$.
Choice of kernel (similarity function).
E.g. no kernel ("linear kernel"): predict $y = 1$ if $\theta^T x \ge 0$. You can do this when you have a large number of features but a small training set, and you want to avoid overfitting.
E.g. Gaussian kernel: need to choose $\sigma^2$.
Kernel functions in code:

```octave
function f = kernel(x1, x2)
  % Gaussian (RBF) kernel; sigma must be in scope
  f = exp(-norm(x1 - x2)^2 / (2 * sigma^2));
end
```

Note: do perform feature scaling before using the Gaussian kernel.
Other choices of kernel
Not all similarity functions make valid kernels. Need to satisfy
technical condition called "Mercer's theorem" to make sure SVM packages'
optimizations run correctly, and do not diverge.
Many off-the-shelf kernel available:
Polynomial kernel: $k(x, l) = (x^T l + \text{constant})^{\text{degree}}$.
More esoteric: String kernel, chi-square kernel, histogram
intersection kernel, ...
Multi-class classification
Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, for $i = 1, 2, \dots, K$, and get $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^T x$.
Logistic regression vs. SVMs
$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples.
If $n$ is large (relative to $m$):
E.g. $n = 10{,}000$, $m = 10$ to $1000$.
Use logistic regression, or SVM without a kernel ("linear kernel").
If $n$ is small, $m$ is intermediate:
E.g. $n = 1$ to $1000$, $m = 10$ to $10{,}000$.
Use SVM with a Gaussian kernel.
If $n$ is small, $m$ is large:
E.g. $n = 1$ to $1000$, $m = 50{,}000+$.
SVM will be slow to run.
Create/add more features, then use logistic regression or SVM without a kernel.
A neural network is likely to work well for most of these settings, but may be slower to train.
The SVM gives a convex optimization problem, so the numerical algorithm will always find the global minimum. This is an advantage of SVMs over neural networks.
Unsupervised Learning
Introduction
Clustering
Clustering is an example of unsupervised learning. In unsupervised
learning, you are given an unlabeled dataset and are asked to find
"structure" in the data.
Applications of clustering
Market segmentation
Social network analysis
Organize computing clusters
Astronomical data analysis
K-means algorithm
Input:
$K$ (number of clusters).
Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention).
The algorithm (a sketch of one iteration follows below):
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$.
Repeat {
for i = 1 to m
$c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$. This means: $c^{(i)} := \arg\min_k \| x^{(i)} - \mu_k \|^2$.
for k = 1 to $K$
$\mu_k$ := average of the points assigned to cluster $k$.
}
If there is a cluster with no points assigned to it, commonly we eliminate it. Or sometimes we can recreate a new one randomly.
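A minimal Octave sketch of one K-means iteration (assuming `X` is an $m \times n$ data matrix and `centroids` a $K \times n$ matrix):

```octave
% One K-means iteration: assign points, then recompute centroids
[m, n] = size(X);
K = size(centroids, 1);
idx = zeros(m, 1);
for i = 1:m
  dists = sum((centroids - X(i, :)).^2, 2);   % squared distance to each centroid
  [~, idx(i)] = min(dists);
end
for k = 1:K
  members = X(idx == k, :);
  if ~isempty(members)
    centroids(k, :) = mean(members, 1);       % move centroid to mean of its points
  end
end
```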
Optimization
objective for K-means algorithm
$c^{(i)}$ = index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
The optimization objective: $$J\left( c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K \right) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
Random
initialization for K-means algorithm
Should have $K < m$. Randomly pick $K$ training examples. Set $\mu_1, \dots, \mu_K$ equal to these examples.
Local optima
There is a possibility that K-means gets stuck in a local optimum. To avoid this, we can run K-means with different random initializations, 1 to 100 times, and then pick the clustering that gave the lowest cost $J$. This method works pretty well when the number of clusters is small (2 to 10). With a very large number of clusters (> 1000), running multiple times often does not help.
Choosing the number of
clusters
The most common thing to do is actually to choose the number of
clusters manually by looking at the data graphically or
statistically.
Sometimes, you're running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for how
well it performs for that later purpose.
Dimensionality reduction
Motivation
Data compression
If you have highly correlated features, you may want to reduce the dimension. Suppose we apply dimensionality reduction to a dataset of $m$ examples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$. As a result, we will get a lower dimensional dataset $\{z^{(1)}, z^{(2)}, \dots, z^{(m)}\}$ of $m$ examples, where $z^{(i)} \in \mathbb{R}^k$ for some value of $k$ with $k \le n$.
Visualization
To visualize data, we need to apply dimensionality reduction so that the plotted data has dimension 2 or 3, since we can plot 2D or 3D data but have no way to visualize higher dimensional data.
Principal component analysis
(PCA)
Find a direction onto which to project the data so as to minimize the projection error. Reduce from $n$ dimensions to $k$ dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
PCA is not linear regression
For example, in 2D, we have a problem $(x, y)$. Linear regression minimizes the vertical distance $\sum_i \left( y^{(i)} - f(x^{(i)}) \right)^2$ for some linear function $f$. PCA instead finds a vector pointing along a line that minimizes the orthogonal projection distance between the points and that line.
PCA algorithm
Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
Preprocessing (feature scaling/mean normalization): compute $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$ and replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$. If different features are on different scales (e.g., size of house, number of bedrooms), scale the features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$. Here $s_j$ is some measure of the range of values of feature $j$; it can be the maximum value or the standard deviation of $x_j$.
Then we compute the covariance matrix: $$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \right) \left( x^{(i)} \right)^T$$ Then compute the eigenvectors of the matrix $\Sigma$, for example with Octave:

```octave
[U, S, V] = svd(Sigma);
```

Here $U \in \mathbb{R}^{n \times n}$. We take the first $k$ columns of the $U$ matrix and call it $U_{reduce}$. Then we have $z = U_{reduce}^T x \in \mathbb{R}^k$.
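In Octave, the projection step is then (a sketch; `k` is assumed chosen as described below):

```octave
Ureduce = U(:, 1:k);      % first k principal components (n x k)
z = Ureduce' * x;         % project one example x (n x 1) down to k dimensions
```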
Reconstruction from compressed representation
$x_{approx} = U_{reduce} \, z \approx x$, which maps the compressed representation back into (an approximation of) the original $n$-dimensional space.
Applying PCA
Choosing the
number of principal components
Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2$.
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2$.
Typically, choose $k$ to be the smallest value so that $$\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01$$ i.e. "99% of variance is retained". It can be 95% or 90% as well.
Algorithm:
Try PCA with $k = 1, 2, 3, \dots$, compute $U_{reduce}, z, x_{approx}$, then check if the inequality holds. This algorithm is slow.
With the Octave code `[U, S, V] = svd(Sigma)`, we get the diagonal matrix `S`, and the inequality above can be expressed as: $$1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01 \quad \Leftrightarrow \quad \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$$
Advice for applying PCA
Supervised learning speedup
The mapping $x^{(i)} \mapsto z^{(i)}$ should be defined by running PCA only on the training set. This mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross validation and test sets.
Bad use of PCA: To
prevent overfitting
Use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features from $n$ to $k < n$. Thus, fewer features, less likely to overfit.
This might work OK, but isn't a good way to address overfitting. Use
regularization instead.
PCA is sometimes
used where it shouldn't be
Design of ML system:
Get a training set $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
Run PCA to reduce $x^{(i)}$ in dimension to get $z^{(i)}$
Train logistic regression on the $\{(z^{(i)}, y^{(i)})\}$ data
Test on the test set (mapping $x_{test}^{(i)}$ to $z_{test}^{(i)}$ first).
How about doing the whole thing without using PCA? Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn't do what you want should you implement PCA and consider using $z^{(i)}$.
Anomaly Detection
Density Estimation
Problem motivation
Anomaly detection example
Aircraft engine features:
$x_1$ = heat generated
$x_2$ = vibration intensity
...
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. We get a new engine $x_{test}$; is it anomalous?
Fraud detection
$x^{(i)}$ = features of user $i$'s activities. Model $p(x)$ from the data. Identify unusual users by checking which have $p(x) < \epsilon$.
Monitoring computers in a
data center
$x^{(i)}$ = features of machine $i$:
$x_1$ = memory use
$x_2$ = number of disk accesses/sec
$x_3$ = CPU load
$x_4$ = CPU load / network traffic
...
Gaussian distribution
Say $x \in \mathbb{R}$. If $x$ is a Gaussian-distributed variable with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal{N}(\mu, \sigma^2)$, with density $$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
Parameter estimation
Dataset: $\{x^{(1)}, \dots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}$. Estimates: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right)^2$.
Algorithm
Density estimation
Training set: $\{x^{(1)}, \dots, x^{(m)}\}$, $x \in \mathbb{R}^n$. Assume the features are independent: $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$. Even if this independence assumption doesn't hold, this algorithm works fine.
Anomaly detection algorithm
Choose features $x_j$ that you think might be indicative of anomalous examples.
Fit the parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$.
Given a new example $x$, compute $p(x)$. Anomaly if $p(x) < \epsilon$.
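A minimal Octave sketch of this algorithm (assuming `X` is an $m \times n$ matrix of normal training examples, `xnew` a new example as a $1 \times n$ row, and `epsilon` chosen on a cross validation set):

```octave
% Fit per-feature Gaussians on the training set
mu     = mean(X);                          % 1 x n vector of means
sigma2 = mean((X - mu).^2);                % 1 x n vector of variances

% Evaluate p(xnew) as a product of univariate Gaussian densities
p = prod(exp(-(xnew - mu).^2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2));
isAnomaly = (p < epsilon);
```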
Building an anomaly
detection system
Developing
and evaluating an anomaly detection system
The importance of real-number evaluation: When developing a learning
algorithm (choosing features, etc.), making decisions is much easier if
we have a way of evaluating our learning algorithm.
Assume we have some labeled data of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).
Training set: $x^{(1)}, \dots, x^{(m)}$ (assume these are normal examples / not anomalous)
Cross validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$.
Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$.
Aircraft engines motivating
example
10000 good (normal) engines.
20 flawed engines (anomalous).
Training set: 6000 good engines.
CV: 2000 good engines, 10 anomalous.
Test: 2000 good engines, 10 anomalous.
Algorithm evaluation
Fit the model $p(x)$ on the training set $\{x^{(1)}, \dots, x^{(m)}\}$.
On a cross validation / test example $x$, predict $y = 1$ (anomaly) if $p(x) < \epsilon$, and $y = 0$ (normal) if $p(x) \ge \epsilon$. Possible evaluation metrics:
True positive, false positive, false negative, true negative counts
Precision/recall
$F_1$ score
Accuracy is not a good metric, since the dataset is very skewed.
Can also use the cross validation set to choose the parameter $\epsilon$.
Anomaly detection vs
Supervised learning
Use anomaly detection when
Very small number of positive examples (0 - 20 is common).
Large number of negative examples.
Many different "types" of anomalies. Hard for any algorithm to learn
from positive examples what the anomalies look like; future anomalies
may look nothing like any of the anomalous examples we've seen so
far.
Anomaly detection is used
for
Fraud detection
Manufacturing (e.g. aircraft engines)
Monitoring machines in a data center
...
Use supervised learning when
Large number of positive and negative examples.
Enough positive examples for algorithm to get a sense of what
positive examples are like, future positive examples likely to be
similar to ones in training set.
Supervised learning is used
for
Email spam classification
Weather prediction (sunny / rainy / etc.)
Cancer classification
...
Choosing what features to
use
If the features are not Gaussian-like, then we can transform the data to make it Gaussian-like, for example with: $\log(x)$, $\log(x + c)$, $\sqrt{x}$, $x^{1/3}$, ...
Error analysis for anomaly
detection
Want $p(x)$ large for normal examples $x$, and $p(x)$ small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples.
Try coming up with more features to distinguish between the normal and the anomalous examples. They can sometimes be found by examining the anomalous examples.
For example: you have two features, $x_1$ = vibration intensity and $x_2$ = heat generated. Both $x_1$ and $x_2$ take on values between 0 and 1 (and are strictly greater than 0), and for most "normal" engines you expect that $x_1 \approx x_2$. One of the suspected anomalies is that a flawed engine may vibrate very intensely even without generating much heat (large $x_1$, small $x_2$), even though the particular values of $x_1$ and $x_2$ may not fall outside their typical ranges of values. An additional feature can be $x_3 = \frac{x_1}{x_2}$.
When choosing features for an anomaly detection system, it is a good
idea to look for features that take on unusually large or small values
for (mainly the) anomalous examples.
Multivariate Gaussian
distribution
Multivariate Gaussian
distribution
$x \in \mathbb{R}^n$. Don't model $p(x_1), p(x_2), \dots$ separately. Model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix): $$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$ Here $|\Sigma|$ is the determinant of $\Sigma$.
Anomaly
detection using the Multivariate Gaussian distribution
Parameter fitting: given the training set $\{x^{(1)}, \dots, x^{(m)}\}$, set $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ and $\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right) \left( x^{(i)} - \mu \right)^T$.
The algorithm
Fit the model $p(x)$ by setting $\mu$ and $\Sigma$ to the values above.
Given a new example $x$, compute $p(x; \mu, \Sigma)$ with the $\mu$ and $\Sigma$ above.
Flag an anomaly if $p(x) < \epsilon$.
Relationship to original
model
The original model is a special case of the multivariate Gaussian distribution with the constraint that $\Sigma$ is diagonal: $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)$. This means that the contours of $p(x; \mu, \Sigma)$ are axis-aligned.
Original model vs.
Multivariate Gaussian
Use the original model when:
You manually create features to capture anomalies where $x_1, x_2, \dots$ take unusual combinations of values.
It is computationally cheaper and scales better to large $n$ (e.g. $n = 10{,}000$ to $100{,}000$).
It works OK even if $m$ (the training set size) is small.
Use the multivariate Gaussian model when:
You want to automatically capture correlations between features.
It is computationally more expensive.
Must have $m > n$, or else $\Sigma$ is non-invertible. Better with $m \ge 10n$.
If you have redundant features, $\Sigma$ will be non-invertible.
Recommender Systems
Predicting movie ratings
Problem formulation
A user rates movies using zero to five stars.
$n_u$ = number of users
$n_m$ = number of movies
$r(i, j) = 1$ if user $j$ has rated movie $i$
$y^{(i,j)}$ = rating given by user $j$ to movie $i$ (defined only if $r(i,j) = 1$)
Content based
recommendations
For example:

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
| Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
| Swords vs karate | 0 | 0 | 5 | ? | 0 | 0.9 |
For each user $j$, learn a parameter vector $\theta^{(j)}$. Predict user $j$ as rating movie $i$ with $(\theta^{(j)})^T x^{(i)}$ stars, where $x^{(i)}$ is the feature vector of movie $i$ (with $x_0 = 1$). Here we are basically doing linear regression for each user separately.
Problem formulation
$\theta^{(j)}$ = parameter vector for user $j$.
$x^{(i)}$ = feature vector for movie $i$.
For user $j$, movie $i$, predicted rating: $(\theta^{(j)})^T x^{(i)}$.
$m^{(j)}$ = number of movies rated by user $j$.
To learn $\theta^{(j)}$ (the parameters for user $j$): $$\min_{\theta^{(j)}} \frac{1}{2} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$
To learn all $\theta^{(1)}, \dots, \theta^{(n_u)}$: $$\min_{\theta^{(1)}, \dots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$
Optimization algorithm
Gradient descent update: $$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right) \quad (k \ne 0)$$ with no regularization term for $k = 0$.
Collaborative filtering
We might not have all the features at hand. We can learn the features by letting each user give their preferences $\theta^{(j)}$.
Optimization algorithm
Given $x^{(1)}, \dots, x^{(n_m)}$, to learn $\theta^{(j)}$: minimize the cost above over $\theta^{(j)}$. Given $\theta^{(1)}, \dots, \theta^{(n_u)}$, to learn $x^{(i)}$: $$\min_{x^{(i)}} \frac{1}{2} \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering
Given $x^{(1)}, \dots, x^{(n_m)}$ (and the movie ratings), we can estimate $\theta^{(1)}, \dots, \theta^{(n_u)}$.
Given $\theta^{(1)}, \dots, \theta^{(n_u)}$ (and the movie ratings), we can estimate $x^{(1)}, \dots, x^{(n_m)}$.
If we have neither, we can begin with a random $\theta$, then compute $x$, then compute a new $\theta$ again, and again, and again, until we have a good result.
Collaborative filtering
algorithm
Minimizing over $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously: $$J\left( x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)} \right) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$ Then we want $\min J$. This is similar to the iterative process above, but optimizes both sets of variables at once.
The algorithm
Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values. This serves as symmetry breaking (similar to the random initialization of a neural network's parameters) and ensures the algorithm learns features $x^{(1)}, \dots, x^{(n_m)}$ that are different from each other.
Minimize $J\left( x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)} \right)$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j = 1, \dots, n_u$ and $i = 1, \dots, n_m$: $$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$ $$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$
For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^T x$.
Low rank matrix
factorization
Vectorization
Predicted ratings: given the matrix whose $(i, j)$ entry is $(\theta^{(j)})^T x^{(i)}$, let the rows of $X$ be the movie feature vectors $(x^{(i)})^T$ and the rows of $\Theta$ the user parameter vectors $(\theta^{(j)})^T$. Then the full matrix of predicted ratings is $X \Theta^T$. This is called low rank matrix factorization.
Finding related movies
For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$. How to find movies $j$ related to movie $i$? When $\| x^{(i)} - x^{(j)} \|$ is small, movies $i$ and $j$ are "similar".
Implementational
detail: mean normalization
We normalize each movie's ratings to make them have an average of zero. We call the vector of per-movie averages $\mu$. Then for user $j$, on movie $i$, predict: $(\theta^{(j)})^T x^{(i)} + \mu_i$. Then for a user who hasn't rated any movies, we predict the average rating of each movie instead of zero. This is more reasonable.
Large Scale Machine Learning
Gradient descent with
large datasets
Learning with large datasets
"It's not who has the best algorithm that wins. It's who has the most
data."
Let's say that your dataset has training set size $m = 100{,}000{,}000$. This is realistic nowadays. How can you tell if using all of the data is likely to perform much better than using a small subset of the data (say $m = 1000$)? We can plot a learning curve for a range of values of $m$ and verify that the algorithm has high variance when $m$ is small.
Stochastic gradient descent
Batch gradient descent: you need to look at all the training data for each gradient descent step.
Cost function for stochastic gradient descent: $\mathrm{cost}\left( \theta, (x^{(i)}, y^{(i)}) \right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, with $J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}\left( \theta, (x^{(i)}, y^{(i)}) \right)$.
Randomly shuffle the dataset.
Repeat {
for i = 1, ..., m { $\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ for every $j$ }
}
The changing part is the first-order derivative of the cost function of stochastic gradient descent, which now uses a single example at a time.
Stochastic gradient descent gives a result that is not exactly at the global minimum but close to it. The cost function might not even go down with every iteration. However, in practice, this works fine. Stochastic gradient descent might take 1 to 10 passes over the data, depending on your training set size.
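A minimal Octave sketch of one stochastic pass (assuming the design matrix `X` already includes the bias column and its rows have been shuffled):

```octave
% One stochastic gradient descent pass over the (shuffled) training set
m = size(X, 1);
for i = 1:m
  xi = X(i, :)';                       % one example as a column vector
  err = xi' * theta - y(i);            % h_theta(x_i) - y_i
  theta = theta - alpha * err * xi;    % update using this single example
end
```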
Mini-batch gradient descent
Batch gradient descent: use all $m$ examples in each iteration.
Stochastic gradient descent: use 1 example in each iteration.
Mini-batch gradient descent: use $b$ examples in each iteration.
$b$ = mini-batch size
Typical mini-batch size: 2 to 100
For $b = 10$, we take 10 examples $(x^{(i)}, y^{(i)}), \dots, (x^{(i+9)}, y^{(i+9)})$. Then $\theta_j := \theta_j - \alpha \frac{1}{10} \sum_{k=i}^{i+9} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$.
This algorithm can sometimes be even faster than stochastic gradient descent, given good vectorization and a good choice of $b$.
Stochastic gradient
descent convergence
For batch gradient descent, we plot $J_{train}(\theta)$ as a function of the number of iterations of gradient descent. If it is decreasing, then it is converging. But you don't want to pause your computation regularly if you have a very large training set.
For stochastic gradient descent: during learning, compute $\mathrm{cost}\left( \theta, (x^{(i)}, y^{(i)}) \right)$ before updating $\theta$ using $(x^{(i)}, y^{(i)})$. Say, for every 1000 iterations, plot $\mathrm{cost}\left( \theta, (x^{(i)}, y^{(i)}) \right)$ averaged over the last 1000 examples processed by the algorithm.
If you want stochastic gradient descent to converge to the global minimum, you can slowly decrease $\alpha$ over time (e.g. $\alpha = \frac{\mathrm{const1}}{\mathrm{iterationNumber} + \mathrm{const2}}$).
Online learning
For example, a shipping service website where a user comes, specifies origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service ($y = 1$), sometimes not ($y = 0$).
If you have a large stream of data, then this online learning algorithm is pretty effective. If you do not have enough data, save the data and perform regular (offline) learning instead. The online learning algorithm can adapt to changing user preferences.
Features $x$ capture properties of the user, of the origin / destination, and the asking price. We want to learn $p(y = 1 \mid x; \theta)$ to optimize the price.
The algorithm:
Repeat forever {
Get $(x, y)$ corresponding to a user: $x$ is the features, $y$ is whether the user used our shipping or not.
Update $\theta$ using $(x, y)$: $\theta_j := \theta_j - \alpha \left( h_\theta(x) - y \right) x_j$ for $j = 0, \dots, n$.
Then we discard the data $(x, y)$.
}
Other online learning example: Product search (learning to
search).
User searches for "Android phone 1080p camera".
Have 100 phones in store. Will return 10 results.
$x$ = features of a phone: how many words in the user query match the name of the phone, how many words in the query match the description of the phone, etc.
$y = 1$ if the user clicks on the link, $y = 0$ otherwise.
Learn $p(y = 1 \mid x; \theta)$.
Use this to show the user the 10 phones they're most likely to click on.
Map reduce and data
parallelism
Map-reduce
Batch gradient descent example with $m = 400$, split across 4 machines:
Machine 1: use $(x^{(1)}, y^{(1)}), \dots, (x^{(100)}, y^{(100)})$. Compute $\mathrm{temp}_j^{(1)} = \sum_{i=1}^{100} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$.
Machine 2: use $(x^{(101)}, y^{(101)}), \dots, (x^{(200)}, y^{(200)})$. Compute $\mathrm{temp}_j^{(2)}$ similarly.
Machine 3: use $(x^{(201)}, y^{(201)}), \dots, (x^{(300)}, y^{(300)})$. Compute $\mathrm{temp}_j^{(3)}$.
Machine 4: use $(x^{(301)}, y^{(301)}), \dots, (x^{(400)}, y^{(400)})$. Compute $\mathrm{temp}_j^{(4)}$.
The master server combines these partial sums: $$\theta_j := \theta_j - \alpha \frac{1}{400} \left( \mathrm{temp}_j^{(1)} + \mathrm{temp}_j^{(2)} + \mathrm{temp}_j^{(3)} + \mathrm{temp}_j^{(4)} \right)$$
Map-reduce and
summation over the training set
Many learning algorithms can be expressed as computing sums of functions over the training set. E.g. for advanced optimization with logistic regression, we need $J_{train}(\theta)$ and $\frac{\partial}{\partial \theta_j} J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$; both sums can be split across machines.
If you are applying the map-reduce method to train a neural network on ten machines, then in each iteration each of the machines will compute forward propagation and back propagation on 1/10 of the data, producing the derivative with respect to that 1/10 of the data.
Application example: Photo
OCR
Problem description and
pipeline
The photo OCR pipeline
Image
Text detection
Character segmentation
Character classification
A machine learning pipeline: A system with many stages / components,
several of which may use machine learning.
Sliding windows
For example, supervised learning for pedestrian detection: $x$ = pixels in fixed-size image patches, $y$ = whether a pedestrian appears in the patch or not.
Suppose you are running a text detector using $20 \times 20$ image patches. You run the classifier on a $200 \times 200$ image and, when using a sliding window, you "step" the detector by 4 pixels each time. In total you will end up running your classifier about 2500 times on a single image.
Getting lots of data
and artificial data
Synthesizing data by creating new data from available fonts
Synthesizing data by introducing distortions
The distortions introduced should be representative of the types of noise/distortions in the test set.
Audio: background noise, bad cellphone connection.
It usually does not help to add purely random / meaningless noise to your data. For example, with $x_i$ = intensity (brightness) of pixel $i$, the distortion $x_i \leftarrow x_i + \text{random noise}$ is typically useless.
Discussion on getting more data
Make sure you have a low bias classifier before expending the
effort. Plot the learning curves. E.g. keep increasing the number of
features / number of hidden units in neural network until you have a low
bias classifier.
"How much work would it be to get 10x as much data as we currently
have?"
Artificial data synthesis
Collect / label it yourself
"Crowd source" (E.g. Amazon Mechanical Turk)
Ceiling
analysis: What part of the pipeline to work on next
Estimating
the errors due to each component (ceiling analysis)
What part of the pipeline should you spend the most time trying to improve? We can manually provide each pipeline stage with perfect (ground-truth) output, one stage at a time, and check how much the end-to-end result improves. The stage whose perfect output yields the biggest improvement is the most promising one to work on.