Basic Concepts in Machine Learning

Machine learning is everywhere these days. This article introduces some basic concepts in machine learning.

Introduction

What is machine learning?

  • Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine learning algorithms:

  • Supervised learning
  • Unsupervised learning
  • Others
    • Reinforcement learning
    • Recommender systems
    • ...

Supervised learning

  • Regression problem: a continuous output
  • Classification problem: a discrete output

Unsupervised learning

  • There is no feedback based on the prediction results
  • For example, clustering the data

Linear regression with one variable

Model representation

  • Training set
    • m = number of training examples
    • x = "input" variable / feature
    • y = "output" variable / "target" variable
  • The goal:
    • Given a training set, to learn a function (hypothesis) h so that h(x) is a "good" predictor for the corresponding value of y.

Cost function

Hypothesis: h_θ(x) = θ_0 + θ_1·x

Parameters: θ_0, θ_1

Cost function: J(θ_0, θ_1) = (1/2m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))². This function is called the "squared error function" or "mean squared error". The mean is halved as a convenience for the computation of the gradient descent.

Goal: min_{θ_0, θ_1} J(θ_0, θ_1)

Gradient descent algorithm

Have some function J(θ_0, θ_1).

Want min_{θ_0, θ_1} J(θ_0, θ_1).

The algorithm:

Repeat until convergence { θ_j := θ_j − α·∂J(θ_0, θ_1)/∂θ_j (for j = 0 and j = 1) }

Here α is the learning rate. θ_0 and θ_1 need to be simultaneously updated instead of updated in turn.

Gradient descent for linear regression

So the algorithm is to repeat the following until convergence (updating θ_0 and θ_1 simultaneously):

θ_0 := θ_0 − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
θ_1 := θ_1 − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x^(i)
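As a concrete illustration, the following Octave sketch runs batch gradient descent for one-variable linear regression. The toy data, learning rate, and iteration count are assumptions for the example, not values from the text:

% Batch gradient descent for h(x) = theta0 + theta1*x (minimal sketch).
X = [1; 2; 3; 4]; y = [2; 4; 6; 8];   % toy data, assumed for illustration
m = length(y);
Xb = [ones(m, 1) X];                  % prepend the bias column x_0 = 1
theta = zeros(2, 1);                  % [theta0; theta1]
alpha = 0.01;                         % learning rate (assumed)
for iter = 1:1500
    h = Xb * theta;                               % hypothesis on all examples
    theta = theta - alpha / m * Xb' * (h - y);    % simultaneous update
end

Note that the single vectorized update line changes both parameters at once, which is exactly the simultaneous-update requirement stated above.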

Linear regression with multiple variables

Multiple features

  • We introduce the following notation:

    • x_j^(i) = value of feature j in the i-th training example
    • x^(i) = the input (features) of the i-th training example
    • m = the number of training examples
    • n = the number of features
  • The multivariable form of the hypothesis function accommodating these multiple features is as follows:

    h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n = θ^T·x (with x_0 = 1)

  • The gradient descent algorithm works for multiple features as well

Logistic regression

Classification

  • First, the binary classification problem: classify y ∈ {0, 1} into a positive class and a negative class
    • Linear regression does not suit this problem well
    • Logistic regression is better: its output always satisfies 0 ≤ h_θ(x) ≤ 1

Hypothesis representation

  • Logistic regression model

    • Want 0 ≤ h_θ(x) ≤ 1
      • For linear regression, we have h_θ(x) = θ^T·x, which can be less than 0 and greater than 1. This is not desired
    • Thus we use the sigmoid (logistic) function:

    h_θ(x) = g(θ^T·x), where g(z) = 1 / (1 + e^(−z)), which gives 0 < h_θ(x) < 1

  • Interpretation of hypothesis output

    • h_θ(x) = estimated probability that y = 1 on input x
    • h_θ(x) = P(y = 1 | x; θ), the "probability that y = 1, given x, parameterized by θ"

Decision boundary

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: predict y = 1 when h_θ(x) ≥ 0.5, and y = 0 when h_θ(x) < 0.5. The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5. So if our input to g is θ^T·x, then h_θ(x) = g(θ^T·x) ≥ 0.5 exactly when θ^T·x ≥ 0. The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

Cost function

Training set: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, with y ∈ {0, 1}.

How to choose parameters θ?

For linear regression, the cost function is: J(θ) = (1/m)·Σ_{i=1}^{m} (1/2)·(h_θ(x^(i)) − y^(i))². However, this cost function does not work for logistic regression because the sigmoid hypothesis makes it highly non-linear. J(θ) would have many local minima (it is non-convex), which makes the optimization difficult: it might never converge to the global minimum.

Logistic regression cost function

Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and Cost(h_θ(x), y) = −log(1 − h_θ(x)) if y = 0.

Its properties:

Cost = 0 if y = 1 and h_θ(x) = 1. But as h_θ(x) → 0, Cost → ∞. This captures the intuition that if h_θ(x) = 0 (predicting P(y = 1 | x; θ) = 0) but y = 1, we'll penalize the learning algorithm by a very large cost.

Simplified cost function

Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))

This function is equivalent to the case-based one above.

The logistic regression cost function will then be: J(θ) = −(1/m)·Σ_{i=1}^{m} [y^(i)·log(h_θ(x^(i))) + (1 − y^(i))·log(1 − h_θ(x^(i)))]. To fit parameters θ: min_θ J(θ). To make a prediction given a new x: output h_θ(x) = 1 / (1 + e^(−θ^T·x)).

Gradient descent on the simplified cost function

Repeat { θ_j := θ_j − (α/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i) (simultaneously for all j) }

The algorithm looks identical to linear regression! But since the hypothesis function changed, these two algorithms are actually not the same.

A vectorized implementation is: θ := θ − (α/m)·X^T·(g(X·θ) − y)
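In Octave, one plausible rendering of this vectorized update is the following sketch (X is assumed to be the m × (n+1) design matrix with a leading column of ones, y the m × 1 label vector, and sigmoid defined as above):

% One vectorized gradient descent step for logistic regression.
m = length(y);
alpha = 0.1;                                  % learning rate (assumed)
grad = X' * (sigmoid(X * theta) - y) / m;     % (1/m) * X' * (g(X*theta) - y)
theta = theta - alpha * grad;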

Advanced optimization on the simplified cost function

Optimization algorithms:
  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS
Advantages
  • No need to manually pick the learning rate α
  • Often faster than gradient descent
Disadvantages
  • More complex
Example

We want to minimize J(θ) = (θ_1 − 5)² + (θ_2 − 5)². Here:

∂J/∂θ_1 = 2·(θ_1 − 5)
∂J/∂θ_2 = 2·(θ_2 − 5)

Then this optimization problem can be solved with the following Octave code:

function [jVal, gradient] = costFunction(theta)
    jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
    gradient = zeros(2, 1);
    gradient(1) = 2 * (theta(1) - 5);
    gradient(2) = 2 * (theta(2) - 5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

Multiclass classification: one-vs-all

Multiclass classification examples:

  • Email foldering / tagging: Work, Friends, Family, Hobby
  • Medical diagnosis: Not ill, Cold, Flu
  • Weather: Sunny, Cloudy, Rainy, Snowy

One-vs-all (one-vs-rest):

We divide our problem into K binary classification problems, one per class. In each one, we predict the probability that y is a member of one of our classes. We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x).
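A sketch of the prediction step in Octave, assuming all_theta is a hypothetical K × (n+1) matrix stacking one trained parameter row per class and X carries a bias column:

% One-vs-all prediction: pick the class with the highest probability.
probs = sigmoid(X * all_theta');        % m x K matrix of h_theta(x) values
[~, prediction] = max(probs, [], 2);    % index of the winning class per example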

The problem of overfitting

Overfitting: If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.

Addressing overfitting

  1. Reduce number of features
    • Manually select which features to keep.
    • Model selection algorithm.
  2. Regularization
    • Keep all the features, but reduce the magnitude/values of the parameters θ_j.
    • Works well when we have a lot of features, each of which contributes a bit to predicting y.

Cost function

J(θ) = (1/2m)·[Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ·Σ_{j=1}^{n} θ_j²]

Here λ is the regularization parameter. It determines how much the costs of our parameters are inflated. Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If λ is chosen to be too large, it may smooth out the function too much and cause underfitting.

Regularized linear regression

The modified algorithm becomes:

Repeat {
    θ_0 := θ_0 − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_0^(i)
    θ_j := θ_j − α·[(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i) + (λ/m)·θ_j]    (j = 1, 2, ..., n)
}

The term (λ/m)·θ_j performs our regularization. With some manipulation, our update rule can also be represented as: θ_j := θ_j·(1 − α·λ/m) − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i). The factor (1 − α·λ/m) in the first term will always be less than 1, so every update also shrinks θ_j a little.

Normal equation

Now let's approach regularization using the alternate method of the non-iterative normal equation. To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

θ = (X^T·X + λ·L)^(−1)·X^T·y

where L is the identity matrix with its top-left entry zeroed (0 at position (1,1), 1's down the rest of the diagonal); it should have dimension (n+1) × (n+1). Recall that if m ≤ n, then X^T·X is non-invertible. However, when we add the term λ·L, then X^T·X + λ·L becomes invertible.
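A minimal Octave sketch of this regularized normal equation (X, y, and the value of lambda are assumed given):

% Regularized normal equation: theta = pinv(X'X + lambda*L) X'y, where L is
% the (n+1) x (n+1) identity with its (1,1) entry zeroed so that the bias
% term theta_0 is not regularized.
lambda = 1;                    % regularization strength (assumed)
np1 = size(X, 2);              % n + 1, since X includes the bias column
L = eye(np1);
L(1, 1) = 0;
theta = pinv(X' * X + lambda * L) * X' * y;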

Regularized logistic regression

The algorithm looks the same as the linear regression one, with a different hypothesis function h_θ(x) = 1 / (1 + e^(−θ^T·x)).

Neural Networks

Motivations

  • Some problems are fundamentally non-linear, and solving them by linear regression would require a gigantic feature space, which is often not feasible.
  • Origins: algorithms that try to mimic the brain. Was very widely used in 80s and early 90s. Popularity diminished in late 90s.
  • Recent resurgence: state-of-the-art technique for many applications.
  • The one-learning-algorithm hypothesis.

Model representation

Neuron model: Logistic unit

At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs that are channeled to outputs (axons). In our model, the dendrites are like the input features x_1, ..., x_n, and the output is the result of our hypothesis function. In this model our x_0 input node is sometimes called the "bias unit". It is always equal to 1. In neural networks, we use the same logistic function as in classification, g(z) = 1 / (1 + e^(−z)), yet we sometimes call it a sigmoid activation function. In this situation, our θ parameters are sometimes called weights.

Our input nodes, also known as the "input layer", go into another node, which finally outputs the hypothesis function, known as the "output layer". We can have intermediate layers of nodes between the input and output layers called the "hidden layers". In this example, we label these intermediate or "hidden" layer nodes a_1^(2), ..., a_n^(2) and call them "activation units".

The values for each of the activation nodes are obtained as follows: we compute our activation nodes by using a matrix of parameters Θ^(1). We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix Θ^(2) containing the weights for our second layer of nodes.

Each layer gets its own matrix of weights, Θ^(j). The dimensions of these matrices of weights are determined as follows: if the network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^(j) will be of dimension s_{j+1} × (s_j + 1). The +1 comes from the addition of the "bias nodes" x_0 and Θ_0^(j). In other words, the output nodes will not include the bias nodes while the inputs will.

Vectorized implementation

So in general, setting a^(1) = x and adding a_0^(j) = 1 for each layer: z^(j+1) = Θ^(j)·a^(j) and a^(j+1) = g(z^(j+1)); the output is h_Θ(x) = a^(L) = g(z^(L)).
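As a sketch, forward propagation for a 3-layer network looks as follows in Octave (x is an n × 1 input, Theta1 is s_2 × (n+1), Theta2 is 1 × (s_2+1), and sigmoid is as defined earlier):

% Forward propagation through one hidden layer.
a1 = [1; x];               % input layer plus bias unit
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];     % hidden layer activations plus bias unit
z3 = Theta2 * a2;
h  = sigmoid(z3);          % h_Theta(x) = a3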

Applications

Simple example: AND

Our neural network can be, for example, h_Θ(x) = g(−30 + 20·x_1 + 20·x_2), which gives the following table:

x_1  x_2  h_Θ(x)
0    0    g(−30) ≈ 0
0    1    g(−10) ≈ 0
1    0    g(−10) ≈ 0
1    1    g(10) ≈ 1

FYI: g(4.6) ≈ 0.99 and g(−4.6) ≈ 0.01.

Example: OR

Our neural network will then be: h_Θ(x) = g(−10 + 20·x_1 + 20·x_2).

Example: Negation

For NOT x_1: h_Θ(x) = g(10 − 20·x_1).

Putting it together (XNOR): use a hidden layer that computes x_1 AND x_2 and (NOT x_1) AND (NOT x_2) = g(10 − 20·x_1 − 20·x_2), then OR the two hidden units in the output layer.
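These commonly quoted weight choices can be checked directly in Octave; a sketch using anonymous functions and rounding the sigmoid output to 0/1:

% Single sigmoid units computing logical gates on binary inputs.
g    = @(z) 1 ./ (1 + exp(-z));
AND  = @(x1, x2) round(g(-30 + 20*x1 + 20*x2));
OR   = @(x1, x2) round(g(-10 + 20*x1 + 20*x2));
NOT  = @(x1)     round(g( 10 - 20*x1));
% XNOR from the pieces: (x1 AND x2) OR ((NOT x1) AND (NOT x2)).
XNOR = @(x1, x2) OR(AND(x1, x2), AND(NOT(x1), NOT(x2)));
% XNOR(0, 0) and XNOR(1, 1) return 1; XNOR(0, 1) and XNOR(1, 0) return 0.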

Multiclass classification

Multiple output units: one-vs-all

To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four categories. We can define our set of resulting classes as y ∈ {[1;0;0;0], [0;1;0;0], [0;0;1;0], [0;0;0;1]}. Each y^(i) represents a different image corresponding to one of the four categories. The inner layers each provide us with some new information which leads to our final hypothesis function.

Learning

Cost function

For a neural network for a classification problem, let's denote:

  • Inputs: training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}
  • L = total number of layers in the network
  • s_l = number of units (not counting the bias unit) in layer l

Two types of classification problems

  • Binary classification: y ∈ {0, 1}
    • 1 output unit: s_L = 1
  • Multi-class classification (K classes): y ∈ R^K
    • K output units: s_L = K

Cost function for neural network

J(Θ) = −(1/m)·Σ_{i=1}^{m} Σ_{k=1}^{K} [y_k^(i)·log((h_Θ(x^(i)))_k) + (1 − y_k^(i))·log(1 − (h_Θ(x^(i)))_k)] + (λ/2m)·Σ_{l=1}^{L−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{ji}^(l))²

In the regularization part, after the square brackets, we must account for multiple Θ matrices. The number of columns in our current Θ matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current Θ matrix is equal to the number of nodes in the next layer (excluding the bias unit).

Note:

  • The double sum simply adds up the logistic regression costs calculated for each cell in the output layer.
  • The triple sum simply adds up the squares of all the individual Θs in the entire network.
  • The i in the triple sum does not refer to training example i.

Backpropagation algorithm

We need to compute J(Θ) and the partial derivatives ∂J(Θ)/∂Θ_ij^(l). Given a training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}, the backpropagation algorithm is then:

{

So we set Δ_ij^(l) = 0 for all l, i, j.

For t = 1 to m:

  1. Set a^(1) = x^(t)

  2. Perform forward propagation to compute a^(l) for l = 2, 3, ..., L.

  3. Using y^(t), compute δ^(L) = a^(L) − y^(t)

    Compute δ^(L−1), δ^(L−2), ..., δ^(2) using δ^(l) = (Θ^(l))^T·δ^(l+1) .* g′(z^(l)), where g′(z^(l)) = a^(l) .* (1 − a^(l))

    Accumulate Δ_ij^(l) := Δ_ij^(l) + a_j^(l)·δ_i^(l+1). The vectorized implementation looks like: Δ^(l) := Δ^(l) + δ^(l+1)·(a^(l))^T

  4. Then D_ij^(l) := (1/m)·(Δ_ij^(l) + λ·Θ_ij^(l)) if j ≠ 0, and D_ij^(l) := (1/m)·Δ_ij^(l) if j = 0.

}

It can be shown that, when these computations are finished, we have ∂J(Θ)/∂Θ_ij^(l) = D_ij^(l).

Backpropagation intuition

"error" of cost for (unit in layer ). Formally, for , where .

Backpropagation in practice

Unrolling parameters

In Matlab/Octave, for example with Θ^(1) ∈ R^{10×11}, Θ^(2) ∈ R^{10×11}, Θ^(3) ∈ R^{1×11}:

% Gather vectors
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);

Gradient checking with numerical estimation of gradients

The two-sided difference approximates each partial derivative: ∂J(θ)/∂θ_i ≈ (J(θ + ε·e_i) − J(θ − ε·e_i)) / (2ε), with e.g. ε = 10^(−4):

for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;

Check that gradApprox ≈ DVec.

Implementation note:

  • Implement backprop to compute DVec.
  • Implement numerical gradient check to compute gradApprox.
  • Make sure they give similar values.
  • Turn off gradient checking. Use the backprop code for learning.
    • Be sure to disable your gradient checking code before training your classifier. The numerical estimation of gradients is much slower than the backpropagation algorithm.

Random initialization

Initializing all elements of Θ to 0 does not work for a neural network: the hidden units all compute the same function, and each update keeps them identical, so the iterations stay stuck in this symmetric state. We need random initialization to perform symmetry breaking. So we initialize each Θ_ij^(l) to a random value in [−ε, ε]:

Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand( 1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Putting it together

To train a neural network:

  1. Pick a network architecture (connectivity pattern between neurons)
    • Number of input units: dimension of features x^(i)
    • Number of output units: number of classes
    • Reasonable default: 1 hidden layer, or if > 1 hidden layer, have the same number of hidden units in every layer (usually the more the better)
  2. Randomly initialize the weights.
  3. Implement forward propagation to get h_Θ(x^(i)) for any x^(i).
  4. Implement code to compute the cost function J(Θ).
  5. Implement backprop to compute the partial derivatives ∂J(Θ)/∂Θ_jk^(l).
    • Often with a for loop from 1 to m.
  6. Use gradient checking to compare ∂J(Θ)/∂Θ_jk^(l) computed using backpropagation vs. using the numerical estimate of the gradient of J(Θ). Then disable the gradient checking code.
  7. Use gradient descent or an advanced optimization method with backpropagation to try to minimize J(Θ) as a function of the parameters Θ.
    • For a neural network, the cost function J(Θ) is non-convex, so theoretically the algorithm can get stuck in a local minimum.

An example code for the cost function can be:

function [J, grad] = nnCostFunction(nn_params, ...
                                    input_layer_size, ...
                                    hidden_layer_size, ...
                                    num_labels, ...
                                    X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J, grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be an "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight
% matrices for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% Forward propagation, vectorized over all m examples
X = [ones(m, 1) X];                  % add bias units to the input layer
Z = sigmoid(X * Theta1');            % hidden layer activations
Z = [ones(size(Z, 1), 1) Z];         % add bias units to the hidden layer
h = sigmoid(Z * Theta2');            % network outputs, m x num_labels

% Recode the labels as one-hot vectors
yi = zeros(size(y, 1), num_labels);
for i = 1:size(y, 1)
    yi(i, y(i)) = 1;
end

% Unregularized cost: sum the logistic costs over all output units
for i = 1:num_labels
    J = J + sum(- yi(:, i) .* log(h(:, i)) - (1 - yi(:, i)) .* log(1 - h(:, i))) / m;
end

% Regularization term (bias columns excluded)
J = J + (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2))) * lambda / 2 / m;

% Backpropagation, one example at a time
for t = 1:m
    a1 = X(t, :)';
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;
    a3 = sigmoid(z3);
    d3 = a3 - yi(t, :)';                              % output layer error
    Theta2_grad = Theta2_grad + d3 * a2' / m;
    d2 = (Theta2' * d3) .* sigmoidGradient([1; z2]);  % propagate the error back
    d2 = d2(2:end);                                   % drop the bias error term
    Theta1_grad = Theta1_grad + d2 * a1' / m;
end

% Add the regularization gradient (bias columns excluded)
Theta2_grad = Theta2_grad + lambda / m * Theta2;
Theta2_grad(:, 1) = Theta2_grad(:, 1) - lambda / m * Theta2(:, 1);
Theta1_grad = Theta1_grad + lambda / m * Theta1;
Theta1_grad(:, 1) = Theta1_grad(:, 1) - lambda / m * Theta1(:, 1);

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end

Machine Learning System Design

Evaluating a learning algorithm

Debugging a learning algorithm

Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?

  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features (x_1², x_2², x_1·x_2, etc.)
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ

Machine learning diagnostic

Diagnostic: A test that you can run to gain insight into what is or isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.

Diagnostics can take time to implement, but doing so can be a very good use of your time.

Evaluating a hypothesis

Training/testing procedure for linear regression

  • Learn parameter θ from the training data (minimizing the training error J(θ))

  • Compute the test set error for linear regression:

    J_test(θ) = (1/2m_test)·Σ_{i=1}^{m_test} (h_θ(x_test^(i)) − y_test^(i))²

Training/testing procedure for logistic regression

  • Learn parameter θ from the training data

  • Compute the test set error:

    J_test(θ) = −(1/m_test)·Σ_{i=1}^{m_test} [y_test^(i)·log(h_θ(x_test^(i))) + (1 − y_test^(i))·log(1 − h_θ(x_test^(i)))]

  • Or the misclassification error (aka 0/1 misclassification error): err(h_θ(x), y) = 1 if h_θ(x) ≥ 0.5 and y = 0, or if h_θ(x) < 0.5 and y = 1; and 0 otherwise. This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is Test error = (1/m_test)·Σ_{i=1}^{m_test} err(h_θ(x_test^(i)), y_test^(i)). This gives us the proportion of the test data that was misclassified.

Model selection and train/validation/test sets

Once parameters have been fit to some set of data (the training set), the error of the parameters as measured on that data (the training error J_train(θ)) is likely to be lower than the actual generalization error.

We thus divide the data into three sets:

  • Training set

    • 60% of the data
    • with training error: J_train(θ) = (1/2m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

  • Cross validation set

    • 20% of the data
    • with cross validation error: J_cv(θ) = (1/2m_cv)·Σ_{i=1}^{m_cv} (h_θ(x_cv^(i)) − y_cv^(i))²

  • Testing set

    • 20% of the data
    • with test error: J_test(θ) = (1/2m_test)·Σ_{i=1}^{m_test} (h_θ(x_test^(i)) − y_test^(i))²

We can now calculate three separate error values for the three different sets using the following method:

  1. Optimize the parameters Θ using the training set for each polynomial degree d.
  2. Find the polynomial degree d with the least error J_cv(Θ) using the cross validation set.
  3. Estimate the generalization error J_test(Θ^(d)) using the test set with the chosen d.

Diagnosing bias vs. variance

Suppose your learning algorithm is performing less well than you were hoping. Is it a bias problem or variance problem?

  • High bias (underfit):
    • Both J_train(θ) and J_cv(θ) will be high, with J_cv(θ) ≈ J_train(θ)
    • If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
      • Low training set size: causes J_train(θ) to be low and J_cv(θ) to be high.
      • Large training set size: causes both J_train(θ) and J_cv(θ) to be high, with J_train(θ) ≈ J_cv(θ).
  • High variance (overfit):
    • J_train(θ) will be low and J_cv(θ) will be much greater than J_train(θ)
    • If a learning algorithm is suffering from high variance, getting more training data is likely to help.
      • Low training set size: J_train(θ) will be low and J_cv(θ) will be high.
      • Large training set size: J_train(θ) increases with training set size and J_cv(θ) continues to decrease without leveling off. Also J_train(θ) < J_cv(θ), but the difference between them remains significant.

Choosing the regularization parameter

In order to choose the model and the regularization term λ, we need to:

  1. Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, ..., 10.24})
  2. Create a set of models with different degrees or any other variants
  3. Iterate through the λs, and for each λ go through all the models to learn some Θ.
  4. Compute the cross validation error J_cv(Θ) using the learned Θ (computed with λ) without regularization, i.e. with λ = 0.
  5. Select the best combo of model and λ that produces the lowest error on the cross validation set.
  6. Using the best combo Θ and λ, apply it on J_test(Θ) to see if it generalizes well to the problem.

Deciding what to do next revisited

  • Get more training examples: fixes high variance
  • Try smaller sets of features: fixes high variance
  • Try getting additional features: fixes high bias
  • Try adding polynomial features (x_1², x_2², x_1·x_2, etc.): fixes high bias
  • Try decreasing λ: fixes high bias
  • Try increasing λ: fixes high variance

Neural networks and overfitting

  • "Small" neural network (fewer parameters; more prone to underfitting)
    • Computationally cheaper.
  • "Large" neural network (more parameters; more prone to overfitting)
    • Computationally more expensive.
    • Use regularization (λ) to address overfitting.
  • Using a single hidden layer is a good starting default. You can train your neural network on a number of hidden layers using your cross validation set. You can then select the one that performs best.

Machine learning system design: Spam classification example

Prioritizing what to work on

Supervised learning. x = features of email, y = spam (1) or not spam (0). Features x: choose 100 words indicative of spam / not spam. In practice, take the most frequently occurring n words (10,000 to 50,000) in the training set, rather than manually picking 100 words.

Building a spam classifier

How to spend your time to make it have low error?

  • Collect lots of data
    • E.g. "honeypot" project
  • Develop sophisticated features based on email routing information (from email header).
  • Develop sophisticated features for message body, e.g. should "discount" and "discounts" be treated as the same word? How about "deal" and "Dealer"? Features about punctuation?
  • Develop sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches.)

Error analysis

  • Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.

Error analysis example

m_cv = 500 examples in the cross validation set. The algorithm misclassifies 100 emails. Manually examine the 100 errors, and categorize them based on:

  1. What type of email it is.
  2. What cues (features) you think would have helped the algorithm classify them correctly.

The importance of numerical evaluation

Should discount/discounts/discounted/discounting be treated as the same word? Can use "stemming" software (e.g. "porter stemmer"). It can hurt as well (universe/university). Error analysis may not be helpful for deciding if this is likely to improve performance. Only solution is to try it and see if it works. Need numerical evaluation (e.g., cross validation error) of algorithm's performance with and without stemming.

Handling skewed data

Cancer classification example

Train a logistic regression model h_θ(x) (y = 1 if cancer, y = 0 otherwise). Suppose you find that you got 1% error on the test set. However, only 0.50% of patients actually have cancer. The positive class is so small that it creates skewed classes. (You could predict y = 0 all the time and achieve an error rate of 0.5%.)

Precision/recall

Precision and recall matter in the presence of a rare class that we want to detect.

  • Precision: of all patients where we predicted y = 1, what fraction actually has cancer?

    Precision = true positives / # predicted positive = TP / (TP + FP)

  • Recall: of all patients that actually have cancer, what fraction did we correctly detect as having cancer?

    Recall = true positives / # actual positives = TP / (TP + FN)

                       Actual class 1    Actual class 0
Predicted class 1      True positive     False positive
Predicted class 0      False negative    True negative

A good classifier should have both high precision and high recall. Even for a skewed class, it is very difficult to cheat your way to both a high precision and a high recall.

Trading off precision and recall

Logistic regression: 0 ≤ h_θ(x) ≤ 1. Predict 1 if h_θ(x) ≥ 0.5, predict 0 if h_θ(x) < 0.5.

  • Suppose we want to predict y = 1 (cancer) only if we are very confident. Then we could predict 1 only if h_θ(x) ≥ 0.7 (or 0.9). This gives a higher precision, lower recall.
  • Suppose we want to avoid missing too many cases of cancer (avoid false negatives). Then we could predict 1 whenever h_θ(x) ≥ 0.3. This gives a higher recall, lower precision.

F₁ score (F score)

How to compare precision / recall numbers? The F₁ score is: F₁ = 2·P·R / (P + R). There are more ways to combine precision and recall values than the F₁ score.
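A small Octave sketch, assuming pred and y are 0/1 vectors of predictions and true labels:

% Precision, recall, and F1 score from binary predictions.
tp = sum((pred == 1) & (y == 1));    % true positives
fp = sum((pred == 1) & (y == 0));    % false positives
fn = sum((pred == 0) & (y == 1));    % false negatives
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);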

Data for machine learning

Assume the features x have sufficient information to predict y accurately. A useful test: given the input x, can a human expert confidently predict y?

Use a learning algorithm with many parameters (e.g. logistic regression / linear regression with many features; a neural network with many hidden units), thus a low bias algorithm. Use a very large training set (unlikely to overfit).

Large Margin Classification

Support vector machine

Logistic regression minimizes a smooth log-loss; the support vector machine replaces the two log-loss terms with hinge-like costs cost_1(θ^T·x) and cost_0(θ^T·x), giving the objective:

min_θ C·Σ_{i=1}^{m} [y^(i)·cost_1(θ^T·x^(i)) + (1 − y^(i))·cost_0(θ^T·x^(i))] + (1/2)·Σ_{j=1}^{n} θ_j²

Hypothesis: h_θ(x) = 1 if θ^T·x ≥ 0, and 0 otherwise.

  • If y = 1, we want θ^T·x ≥ 1 (not just ≥ 0)
  • If y = 0, we want θ^T·x ≤ −1 (not just < 0)

If C is very large, then the first term must be ≈ 0, which gives the SVM decision boundary a large margin around the training data. The SVM is therefore sometimes also called a large margin classifier.

SVM decision boundary

Kernels

Given x, compute new features f_i depending on proximity to landmarks l^(1), l^(2), l^(3). We let f_i = similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²)). This similarity function is called a Gaussian kernel.

If x ≈ l^(i): f_i ≈ 1. If x is far from l^(i): f_i ≈ 0. We predict "1" when θ_0 + θ_1·f_1 + θ_2·f_2 + θ_3·f_3 ≥ 0. The decision boundary will then enclose the regions near the landmarks whose θ_i is large.

Choosing the landmarks

We can choose to put the landmarks exactly where the training examples are. Concretely:

Given (x^(1), y^(1)), ..., (x^(m), y^(m)), choose l^(1) = x^(1), ..., l^(m) = x^(m). Given an example x: compute f_i = similarity(x, l^(i)) and let f = [f_0; f_1; ...; f_m] with f_0 = 1. For a training example (x^(i), y^(i)), f^(i) is built the same way from x^(i).

Hypothesis: given x, compute features f ∈ R^{m+1}; predict y = 1 if θ^T·f ≥ 0.

Training: min_θ C·Σ_{i=1}^{m} [y^(i)·cost_1(θ^T·f^(i)) + (1 − y^(i))·cost_0(θ^T·f^(i))] + (1/2)·Σ_{j=1}^{m} θ_j²

SVM parameters

C (plays a role like 1/λ):

  • Large C: lower bias, higher variance.
  • Small C: higher bias, lower variance.

σ²:

  • Large σ²: the features f_i vary more smoothly.
    • Higher bias, lower variance.
  • Smaller σ²: the features f_i vary less smoothly.
    • Lower bias, higher variance.

SVM in practice

Use an SVM software package (e.g. liblinear, libsvm, ...) to solve for the parameters θ. Need to specify:

  • Choice of parameter C.
  • Choice of kernel (similarity function).
    • E.g. no kernel ("linear kernel"): predict y = 1 if θ^T·x ≥ 0.
      • You can do this when you have a large number of features but a small training set, and you want to avoid over-fitting.
    • E.g. Gaussian kernel
      • Need to choose σ².

Kernel functions in code

function f = kernel(x1, x2)
f = exp(-norm(x1-x2)^2 / 2 / sigma^2);

Note: Do perform feature scaling before using the Gaussian kernel.

Other choices of kernel

Not all similarity functions make valid kernels. Need to satisfy technical condition called "Mercer's theorem" to make sure SVM packages' optimizations run correctly, and do not diverge.

Many off-the-shelf kernel available:

  • Polynomial kernel: k(x, l) = (x^T·l + constant)^degree.
  • More esoteric: String kernel, chi-square kernel, histogram intersection kernel, ...

Multi-class classification

Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs-all method: train K SVMs, one to distinguish y = i from the rest, for i = 1, 2, ..., K, and get θ^(1), θ^(2), ..., θ^(K). Pick the class i with the largest (θ^(i))^T·x.

Logistic regression vs. SVMs

n = number of features (x ∈ R^{n+1}), m = number of training examples.

  • If n is large (relative to m):
    • E.g. n ≥ 10,000, m = 10 - 1,000.
    • Use logistic regression, or SVM without a kernel ("linear kernel").
  • If n is small, m is intermediate:
    • E.g. n = 1 - 1,000, m = 10 - 10,000.
    • Use SVM with a Gaussian kernel.
  • If n is small, m is large:
    • E.g. n = 1 - 1,000, m = 50,000+.
    • The SVM will be slow to run.
    • Create/add more features, then use logistic regression or SVM without a kernel.
  • A neural network is likely to work well for most of these settings, but may be slower to train.
    • The SVM objective is a convex optimization problem, so the numerical algorithm always finds the global minimum. This is an advantage of SVMs over neural networks.

Unsupervised Learning Introduction

Clustering

Clustering is an example of unsupervised learning. In unsupervised learning, you are given an unlabeled dataset and are asked to find "structure" in the data.

Applications of clustering

  • Market segmentation
  • Social network analysis
  • Organize computing clusters
  • Astronomical data analysis

K-means algorithm

Input:

  • K (number of clusters).
  • Training set {x^(1), x^(2), ..., x^(m)}, with x^(i) ∈ R^n (we drop the x_0 = 1 convention).

The algorithm:

Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K ∈ R^n.

Repeat {

    for i = 1 to m

        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i).

        This means: c^(i) := argmin_k ‖x^(i) − μ_k‖²

    for k = 1 to K

        μ_k := average (mean) of the points assigned to cluster k.

}

If there is a cluster with no points assigned to it, we commonly eliminate it. Or sometimes we recreate a new centroid randomly. A sketch of the two alternating steps follows.
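A compact Octave sketch (X is an m × n data matrix; K, the iteration count, and the initialization scheme are assumptions):

% Minimal K-means: alternate cluster assignment and centroid update.
K = 3;                              % assumed number of clusters
m = size(X, 1);
mu = X(randperm(m, K), :);          % initialize centroids to K random examples
c = zeros(m, 1);
for iter = 1:10
    for i = 1:m                     % assignment step: closest centroid to x^(i)
        [~, c(i)] = min(sum((mu - X(i, :)).^2, 2));
    end
    for k = 1:K                     % update step: mean of assigned points
        if any(c == k)
            mu(k, :) = mean(X(c == k, :), 1);
        end
    end
end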

Optimization objective for K-means algorithm

c^(i) = index of the cluster (1, 2, ..., K) to which example x^(i) is currently assigned

μ_k = cluster centroid k (μ_k ∈ R^n)

μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned

The optimization objective: min_{c, μ} J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m)·Σ_{i=1}^{m} ‖x^(i) − μ_{c^(i)}‖²

Random initialization for K-means algorithm

Should have K < m. Randomly pick K training examples. Set μ_1, ..., μ_K equal to these K examples.

Local optima

There is a possibility that K-means gets stuck at a local optimum. To avoid this, we can run K-means many times (say 100) with different random initializations, then pick the clustering that gave the lowest cost J. This method works pretty well when the number of clusters is small (2 - 10). With a very large number of clusters (> 1000), running multiple times often does not help.

Choosing the number of clusters

The most common thing to do is actually to choose the number of clusters manually by looking at the data graphically or statistically.

Sometimes, you're running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.

Dimensionality reduction

Motivation

Data compression

If you have highly correlated features, you may want to reduce the dimension. Suppose we apply dimensionality reduction to a dataset of m examples {x^(1), x^(2), ..., x^(m)}, where x^(i) ∈ R^n. As a result, we get out a lower dimensional dataset {z^(1), z^(2), ..., z^(m)} of m examples, where z^(i) ∈ R^k for some value of k with k ≤ n.

Visualization

To visualize data, we need to apply dimensionality reduction so that the plotted data has dimension 2 or 3, since we can plot 2D or 3D data but have no way to visualize higher dimensional data.

Principal component analysis (PCA)

Find a direction onto which to project the data so as to minimize the projection error. Reduce from n-dimensions to k-dimensions: find k vectors u^(1), u^(2), ..., u^(k) onto which to project the data, so as to minimize the projection error.

PCA is not linear regression

For example, in 2D, suppose we have features x_1 and x_2. Linear regression minimizes the vertical distances between the points and the line predicting x_2 from x_1 (the squared prediction error). PCA instead finds a direction vector u whose line minimizes the orthogonal projection distances between the points and the line, with no variable singled out as the one being predicted.

PCA algorithm

Training set: x^(1), x^(2), ..., x^(m)

Preprocessing (feature scaling / mean normalization): Replace each x_j^(i) with x_j^(i) − μ_j, where μ_j = (1/m)·Σ_{i=1}^{m} x_j^(i). If different features are on different scales (e.g., size of house, number of bedrooms), scale the features to have a comparable range of values by x_j^(i) := (x_j^(i) − μ_j) / s_j. Here s_j is some measure of the range of values of feature j; it can be the maximum value or the standard deviation of x_j.

Then we compute the covariance matrix Σ = (1/m)·Σ_{i=1}^{m} (x^(i))·(x^(i))^T, and compute the eigenvectors of the matrix Σ, for example with Octave:

[U, S, V] = svd(Sigma);

Here U ∈ R^{n×n}, and its columns are the vectors u^(1), ..., u^(n). We take the first k columns of U and call the result U_reduce ∈ R^{n×k}. Then we have z = U_reduce^T·x ∈ R^k.

Reconstruction from compressed representation

x_approx = U_reduce·z ≈ x (the approximation is exact only if no variance was discarded).
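Putting the projection and reconstruction steps together, a minimal Octave sketch (X is the m × n mean-normalized data matrix; the target dimension k is assumed):

% PCA: covariance matrix, SVD, projection, and reconstruction.
m = size(X, 1);
Sigma = X' * X / m;            % n x n covariance matrix
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);           % first k principal directions
Z = X * Ureduce;               % m x k compressed representation
Xapprox = Z * Ureduce';        % m x n reconstruction, approximately X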

Applying PCA

Choosing the number of principal components

Average squared projection error: (1/m)·Σ_{i=1}^{m} ‖x^(i) − x_approx^(i)‖².

Total variation in the data: (1/m)·Σ_{i=1}^{m} ‖x^(i)‖².

Typically, choose k to be the smallest value so that

(average squared projection error) / (total variation) ≤ 0.01

i.e. "99% of the variance is retained". It can be 95% or 90% as well.

Algorithm:

  1. Try PCA with k = 1, 2, 3, ..., compute U_reduce, z^(1), ..., z^(m), x_approx^(1), ..., x_approx^(m), then check if the inequality holds. This algorithm is slow.

  2. With the Octave code [U, S, V] = svd(Sigma), we get the diagonal matrix S, and the inequality above can be expressed as: Σ_{i=1}^{k} S_ii / Σ_{i=1}^{n} S_ii ≥ 0.99.

Advice for applying PCA

Supervised learning speedup

The mapping x^(i) → z^(i) should be defined by running PCA only on the training set. This mapping can then be applied to the examples x_cv^(i) and x_test^(i) in the cross validation and test sets.

Bad use of PCA: To prevent overfitting

Use z^(i) instead of x^(i) to reduce the number of features to k < n. Thus, fewer features, less likely to overfit.

This might work OK, but isn't a good way to address overfitting. Use regularization instead.

PCA is sometimes used where it shouldn't be

Design of ML system:

  • Get the training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}
  • Run PCA to reduce x^(i) in dimension to get z^(i)
  • Train logistic regression on {(z^(1), y^(1)), ..., (z^(m), y^(m))}
  • Test on the test set: map x_test^(i) to z_test^(i) and run the hypothesis on the result.

How about doing the whole thing without using PCA? Before implementing PCA, first try running whatever you want to do with the original/raw data x^(i). Only if that doesn't do what you want should you implement PCA and consider using z^(i).

Anomaly Detection

Density Estimation

Problem motivation

Anomaly detection example

Aircraft engine features:

  • x_1 = heat generated
  • x_2 = vibration intensity
  • ...

Dataset: {x^(1), x^(2), ..., x^(m)}. We get a new engine x_test: is it anomalous?

Fraud detection

x^(i) = features of user i's activities. Model p(x) from the data. Identify unusual users by checking which have p(x) < ε.

Monitoring computers in a data center

  • x^(i) = features of machine i
  • x_1 = memory use
  • x_2 = number of disk accesses/sec
  • x_3 = CPU load
  • x_4 = CPU load / network traffic
  • ...

Gaussian distribution

Say x ∈ R. If x is a Gaussian-distributed variable with mean μ and variance σ², we write x ~ N(μ, σ²), with density p(x; μ, σ²) = (1 / (√(2π)·σ))·exp(−(x − μ)² / (2σ²)).

Parameter estimation

Dataset: {x^(1), ..., x^(m)}. Estimate μ = (1/m)·Σ_{i=1}^{m} x^(i) and σ² = (1/m)·Σ_{i=1}^{m} (x^(i) − μ)².

Algorithm

Density estimation

Training set: {x^(1), ..., x^(m)}, with each x ∈ R^n. We model p(x) = p(x_1; μ_1, σ_1²)·p(x_2; μ_2, σ_2²)·...·p(x_n; μ_n, σ_n²), treating the features as independent. Even if this independence assumption doesn't hold, the algorithm works fine.

Anomaly detection algorithm

  1. Choose features x_i that you think might be indicative of anomalous examples.

  2. Fit the parameters μ_1, ..., μ_n, σ_1², ..., σ_n².

  3. Given a new example x, compute p(x) = Π_{j=1}^{n} p(x_j; μ_j, σ_j²). Anomaly if p(x) < ε.
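One way this might look in Octave under the independence assumption (X is the m × n training data, Xnew holds rows to score, and epsilon is an assumed threshold):

% Fit per-feature Gaussians, then score new examples by p(x).
mu     = mean(X);                           % 1 x n feature means
sigma2 = mean((X - mu).^2);                 % 1 x n variances (1/m convention)
p = prod(exp(-(Xnew - mu).^2 ./ (2 * sigma2)) ...
         ./ sqrt(2 * pi * sigma2), 2);      % p(x) for each row of Xnew
anomalies = p < epsilon;                    % flag low-density examples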

Building an anomaly detection system

Developing and evaluating an anomaly detection system

The importance of real-number evaluation: When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

Assume we have some labeled data, of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).

  • Training set: x^(1), x^(2), ..., x^(m) (assume these are normal examples / not anomalous)

  • Cross validation set: (x_cv^(1), y_cv^(1)), ..., (x_cv^(m_cv), y_cv^(m_cv)).

  • Test set: (x_test^(1), y_test^(1)), ..., (x_test^(m_test), y_test^(m_test)).

Aircraft engines motivating example

  • 10000 good (normal) engines.
  • 20 flawed engines (anomalous).
  • Training set: 6000 good engines.
  • CV: 2000 good engines, 10 anomalous.
  • Test: 2000 good engines, 10 anomalous.

Algorithm evaluation

Fit the model p(x) on the training set {x^(1), ..., x^(m)}.

On a cross validation / test example x, predict y = 1 if p(x) < ε (anomaly) and y = 0 if p(x) ≥ ε (normal). Possible evaluation metrics:

  • True positive, false positive, false negative, true negative
  • Precision / Recall
  • F1-score
  • Accuracy is not a good metric, since the dataset is very skewed

Can also use the cross validation set to choose the parameter ε.

Anomaly detection vs Supervised learning

Use anomaly detection when

  • Very small number of positive examples (0 - 20 is common).
  • Large number of negative examples.
  • Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.

Anomaly detection is used for

  • Fraud detection
  • Manufacturing (e.g. aircraft engines)
  • Monitoring machines in a data center
  • ...

Use supervised learning when

  • Large number of positive and negative examples.
  • Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.

Supervised learning is used for

  • Email spam classification
  • Weather prediction (sunny / rainy / etc.)
  • Cancer classification
  • ...

Choosing what features to use

If the features are not Gaussian-like, we can transform the data to make it more Gaussian-like, for example with:

  • log(x), log(x + c), √x, x^(1/3), ...

Error analysis for anomaly detection

Want p(x) large for normal examples x, and p(x) small for anomalous examples x.

Most common problem:

  • p(x) is comparable (say, both large) for normal and anomalous examples.

Try coming up with more features to distinguish between the normal and the anomalous examples. Such a feature can sometimes be found by examining an anomalous example.

For example: you have two features, x_1 = vibration intensity and x_2 = heat generated. Both x_1 and x_2 take on values between 0 and 1 (and are strictly greater than 0), and for most "normal" engines you expect that x_1 ≈ x_2. One of the suspected anomalies is that a flawed engine may vibrate very intensely even without generating much heat (large x_1, small x_2), even though the particular values of x_1 and x_2 may not fall outside their typical ranges of values. An additional feature can then be the ratio x_3 = x_1 / x_2.

When choosing features for an anomaly detection system, it is a good idea to look for features that take on unusually large or small values for (mainly the) anomalous examples.

Multivariate Gaussian distribution

Multivariate Gaussian distribution

x ∈ R^n. Don't model p(x_1), p(x_2), etc. separately. Model p(x) all in one go.

Parameters: μ ∈ R^n and Σ ∈ R^{n×n} (the covariance matrix).

p(x; μ, Σ) = (1 / ((2π)^{n/2}·|Σ|^{1/2}))·exp(−(1/2)·(x − μ)^T·Σ^{−1}·(x − μ))

Here |Σ| is the determinant of Σ.

Anomaly detection using the Multivariate Gaussian distribution

Parameter fitting: given the training set {x^(1), ..., x^(m)}, set μ = (1/m)·Σ_{i=1}^{m} x^(i) and Σ = (1/m)·Σ_{i=1}^{m} (x^(i) − μ)·(x^(i) − μ)^T.

The algorithm

  1. Fit the model p(x) by setting μ and Σ to the values above.
  2. Given a new example x, compute p(x) with the μ and Σ above.
  3. Flag an anomaly if p(x) < ε.
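A sketch of the multivariate version in Octave (X is the m × n training data, Xnew holds rows to score, epsilon is an assumed threshold):

% Multivariate Gaussian anomaly detection.
[m, n] = size(X);
mu    = mean(X);                               % 1 x n mean vector
Sigma = (X - mu)' * (X - mu) / m;              % n x n covariance matrix
D = Xnew - mu;                                 % deviations of new examples
p = exp(-0.5 * sum((D / Sigma) .* D, 2)) ...
    / ((2 * pi)^(n / 2) * sqrt(det(Sigma)));   % p(x; mu, Sigma) per row
anomalies = p < epsilon;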

Relationship to original model

The original model is a special case of the multivariate Gaussian distribution with the constraint that Σ is diagonal (Σ = diag(σ_1², ..., σ_n²)). This means that the contours of p(x) are axis-aligned.

Original model vs. Multivariate Gaussian

Use the original model when
  • You manually create features to capture anomalies where x takes unusual combinations of values.
  • It is computationally cheaper and scales better to large n (n = 10,000 to 100,000).
  • It works OK even if m (the training set size) is small.
Use the multivariate Gaussian model when
  • You want to automatically capture correlations between features.
  • You can afford the extra computation.
  • You have m > n, or else Σ is non-invertible. Better with m ≥ 10·n.
  • Note: if you have redundant features, Σ will be non-invertible.

Recommender Systems

Predicting movie ratings

Problem formulation

Users rate movies using zero to five stars.

  • n_u = number of users
  • n_m = number of movies
  • r(i, j) = 1 if user j has rated movie i
  • y^(i,j) = rating given by user j to movie i (defined only if r(i, j) = 1)

Content based recommendations

For example:

Movie                 Alice (1)  Bob (2)  Carol (3)  Dave (4)  x_1 (romance)  x_2 (action)
Love at last          5          5        0          0         0.9            0
Romance forever       5          ?        ?          0         1.0            0.01
Cute puppies of love  ?          4        0          ?         0.99           0
Nonstop car chases    0          0        5          4         0.1            1.0
Swords vs karate      0          0        5          ?         0              0.9

For each user j, learn a parameter vector θ^(j) ∈ R^{n+1}. Predict user j as rating movie i with (θ^(j))^T·x^(i) stars. Here x^(i) includes an intercept feature x_0 = 1. We are basically doing linear regression for each user separately.

Problem formulation

  • θ^(j) = parameter vector for user j.

  • x^(i) = feature vector for movie i.

  • For user j, movie i, the predicted rating is (θ^(j))^T·(x^(i)).

  • m^(j) = number of movies rated by user j.

  • To learn θ^(j) (the parameters for user j):

    min_{θ^(j)} (1/2)·Σ_{i: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))² + (λ/2)·Σ_{k=1}^{n} (θ_k^(j))²

  • To learn all θ^(1), ..., θ^(n_u): sum the objective above over j = 1, ..., n_u.

Optimization algorithm

Gradient descent update:

θ_k^(j) := θ_k^(j) − α·Σ_{i: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))·x_k^(i)    (for k = 0)
θ_k^(j) := θ_k^(j) − α·(Σ_{i: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))·x_k^(i) + λ·θ_k^(j))    (for k ≠ 0)

Collaborative filtering

We might not have all the features x^(i) at hand. We can learn the features by letting each user give us their preferences θ^(j).

Optimization algorithm

Given x^(1), ..., x^(n_m), learn θ^(1), ..., θ^(n_u) as above. Given θ^(1), ..., θ^(n_u), learn x^(1), ..., x^(n_m) by minimizing the same squared error over the movie features instead:

min_{x^(i)} (1/2)·Σ_{j: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))² + (λ/2)·Σ_{k=1}^{n} (x_k^(i))²

Collaborative filtering

Given x^(1), ..., x^(n_m) (and the movie ratings), we can estimate θ^(1), ..., θ^(n_u).

Given θ^(1), ..., θ^(n_u) (and the movie ratings), we can estimate x^(1), ..., x^(n_m).

If we have neither, we can begin with a random θ, then compute x, then compute a new θ again, and again, and again, until we have a good result.

Collaborative filtering algorithm

Minimizing x^(1), ..., x^(n_m) and θ^(1), ..., θ^(n_u) simultaneously: J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u)) = (1/2)·Σ_{(i,j): r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))² + (λ/2)·Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))² + (λ/2)·Σ_{j=1}^{n_u} Σ_{k=1}^{n} (θ_k^(j))². Then we want min_{x, θ} J. This is similar to the iterative process above, but minimizes over both sets of parameters at once.

The algorithm

  1. Initialize x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u) to small random values.

    • This serves as symmetry breaking (similar to the random initialization of a neural network's parameters) and ensures the algorithm learns features x^(1), ..., x^(n_m) that are different from each other.
  2. Minimize J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u)) using gradient descent (or an advanced optimization algorithm). E.g. for every j = 1, ..., n_u and i = 1, ..., n_m:

    x_k^(i) := x_k^(i) − α·(Σ_{j: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))·θ_k^(j) + λ·x_k^(i))
    θ_k^(j) := θ_k^(j) − α·(Σ_{i: r(i,j)=1} ((θ^(j))^T·x^(i) − y^(i,j))·x_k^(i) + λ·θ_k^(j))

  3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of θ^T·x.

Low rank matrix factorization

Vectorization

Predicted ratings: the matrix whose (i, j) entry is (θ^(j))^T·x^(i). Given X = [(x^(1))^T; ...; (x^(n_m))^T] and Θ = [(θ^(1))^T; ...; (θ^(n_u))^T], the full matrix of predicted ratings is X·Θ^T. This is called low rank matrix factorization.

For each product i, we learn a feature vector x^(i) ∈ R^n. How to find movies j related to movie i? A small distance ‖x^(i) − x^(j)‖ means movies i and j are "similar".

Implementational detail: mean normalization

We normalize each movie's ratings to have an average of zero: call the vector of per-movie average ratings μ and subtract it from the ratings before training. Then for user j, on movie i, predict (θ^(j))^T·x^(i) + μ_i. For a user who hasn't rated any movies, we then predict each movie's average rating instead of zero, which is more reasonable.

Large Scale Machine Learning

Gradient descent with large datasets

Learning with large datasets

"It's not who has the best algorithm that wins. It's who has the most data."

Let's say that your dataset has a training set size of m = 100,000,000. This is realistic nowadays. How can you tell if using all of the data is likely to perform much better than using a small subset of the data (say m = 1,000)? We can plot a learning curve for a range of values of m and verify that the algorithm has high variance when m is small.

Stochastic gradient descent

  • Batch gradient descent: you need to look at all the training data for each gradient descent step.

Cost function for stochastic gradient descent:

cost(θ, (x^(i), y^(i))) = (1/2)·(h_θ(x^(i)) − y^(i))², with J_train(θ) = (1/m)·Σ_{i=1}^{m} cost(θ, (x^(i), y^(i))).

  1. Randomly shuffle the dataset.

  2. Repeat {

    for i = 1, ..., m { θ_j := θ_j − α·(h_θ(x^(i)) − y^(i))·x_j^(i) (for every j = 0, ..., n) }

    }

    The changing part is the first-order derivative of the cost of a single example, cost(θ, (x^(i), y^(i))).

Stochastic gradient descent gives a result that is not exactly at the global minimum but close to it. The cost function might not even go down with every iteration. However, in practice, this works fine.

Stochastic gradient descent might take 1 - 10 passes over the data, depending on your training set size.
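A sketch of stochastic gradient descent for linear regression in Octave (X with a bias column and y are assumed given; the learning rate and number of passes are assumptions):

% Stochastic gradient descent: update theta on one example at a time.
m = size(X, 1);
theta = zeros(size(X, 2), 1);
alpha = 0.01;                        % learning rate (assumed)
for pass = 1:3                       % typically 1 - 10 passes over the data
    idx = randperm(m);               % step 1: randomly shuffle the dataset
    for i = idx
        h = X(i, :) * theta;         % hypothesis on a single example
        theta = theta - alpha * (h - y(i)) * X(i, :)';
    end
end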

Mini-batch gradient descent

  • Batch gradient descent: use all m examples in each iteration.

  • Stochastic gradient descent: use 1 example in each iteration.

  • Mini-batch gradient descent: use b examples in each iteration.

    • b = mini-batch size

    • Typical mini-batch size: 2 - 100

    • For b = 10, we take the next 10 examples (x^(i), y^(i)), ..., (x^(i+9), y^(i+9)). Then θ_j := θ_j − α·(1/10)·Σ_{k=i}^{i+9} (h_θ(x^(k)) − y^(k))·x_j^(k).

    • This algorithm can sometimes be even faster than stochastic gradient descent, given good vectorization and a good choice of b.

Stochastic gradient descent convergence

  • For batch gradient descent, we plot J_train(θ) as a function of the number of iterations of gradient descent. If it is decreasing, then it is converging.
    • But you don't want to pause your computation regularly if you have a very large training set.
  • For stochastic gradient descent: during learning, compute cost(θ, (x^(i), y^(i))) before updating θ using (x^(i), y^(i)).
    • Say for every 1000 iterations, plot cost(θ, (x^(i), y^(i))) averaged over the last 1000 examples processed by the algorithm.
    • If you want stochastic gradient descent to converge to the global minimum, you can slowly decrease α over time (e.g. α = const1 / (iterationNumber + const2)).

Online learning

For example, consider a shipping service website where a user comes and specifies an origin and destination; you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1), sometimes not (y = 0).

If you have a large stream of data, this online learning algorithm is pretty effective. If you do not have enough data, save the examples and perform regular (offline) learning instead. The online learning algorithm can adapt to changing user preferences.

Features x capture the properties of the user, of the origin / destination, and the asking price. We want to learn p(y = 1 | x; θ) to optimize the price.

The algorithm:

Repeat forever {

    Get (x, y) corresponding to a user. x is the features; y is whether the user used our shipping service or not.

    Update θ using (x, y):

        θ_j := θ_j − α·(h_θ(x) − y)·x_j    for j = 0, ..., n

    Then we discard the example (x, y).

}

Other online learning example: Product search (learning to search).

  • User searches for "Android phone 1080p camera"
  • Have 100 phones in store. Will return 10 results.
  • x = features of the phone: how many words in the user query match the name of the phone, how many words in the query match the description of the phone, etc.
  • y = 1 if the user clicks on the link, y = 0 otherwise.
  • Learn p(y = 1 | x; θ).
  • Use this to show the user the 10 phones they're most likely to click on.

Map reduce and data parallelism

Map-reduce

Batch gradient descent example with m = 400, split across 4 machines (a single-machine simulation of this split is sketched below):

  • Machine 1: use (x^(1), y^(1)), ..., (x^(100), y^(100)).

    • Get temp_j^(1) = Σ_{i=1}^{100} (h_θ(x^(i)) − y^(i))·x_j^(i)
  • Machine 2: use (x^(101), y^(101)), ..., (x^(200), y^(200)).

    • Get temp_j^(2).
  • Machine 3: use (x^(201), y^(201)), ..., (x^(300), y^(300)).

    • Get temp_j^(3).
  • Machine 4: use (x^(301), y^(301)), ..., (x^(400), y^(400)).

    • Get temp_j^(4).
  • The master server combines these: θ_j := θ_j − α·(1/400)·(temp_j^(1) + temp_j^(2) + temp_j^(3) + temp_j^(4)).
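The same split can be simulated on one machine; a sketch for linear regression with m divisible by 4 (a real map-reduce framework would dispatch the loop bodies to separate machines):

% Simulated map-reduce for one batch gradient descent step over 4 "machines".
m = size(X, 1);
chunk = m / 4;
temp = zeros(size(theta, 1), 4);
for machine = 1:4                    % each iteration could run on its own machine
    r = (machine - 1) * chunk + 1 : machine * chunk;
    temp(:, machine) = X(r, :)' * (X(r, :) * theta - y(r));
end
theta = theta - alpha / m * sum(temp, 2);   % master combines the partial sums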

Map-reduce and summation over the training set

Many learning algorithms can be expressed as computing sums of functions over the training set. E.g. for advanced optimization with logistic regression, we need the cost J_train(θ) and its partial derivatives ∂J_train(θ)/∂θ_j, both of which are sums over the training examples and can therefore be split across machines.

If you are applying the map-reduce method to train a neural network on ten machines, then in each iteration each of the machines computes forward propagation and back propagation on 1/10 of the data, producing the derivative with respect to that 1/10 of the data.

Application example: Photo OCR

Problem description and pipeline

The photo OCR pipeline

  1. Image

  2. Text detection

  3. Character segmentation

  4. Character classification

A machine learning pipeline: A system with many stages / components, several of which may use machine learning.

Sliding windows

For example, supervised learning for pedestrian detection.

x = pixels in image patches; y = 1 or 0 according to whether a pedestrian appears in the image patch or not.

Suppose you are running a text detector using 20 × 20 image patches. You run the classifier on a 200 × 200 image and, when using the sliding window, you "step" the detector by 4 pixels each time. In total you will end up running your classifier about 2500 times on a single image.

Getting lots of data and artificial data

  • Synthesizing data by creating data from fonts available
  • Synthesizing data by introducing distortions
    • The distortion introduced should be representative of the type of noise/distortions in the test set
      • Audio: background noise, bad cellphone connection
    • It usually does not help to add purely random / meaningless noise to your data
      • For example, adding random noise to x_i, the intensity (brightness) of pixel i.
  • Discussion on getting more data
    1. Make sure you have a low bias classifier before expending the effort. Plot the learning curves. E.g. keep increasing the number of features / number of hidden units in neural network until you have a low bias classifier.
    2. "How much work would it be to get 10x as much data as we currently have?"
      • Artificial data synthesis
      • Collect / label it yourself
      • "Crowd source" (E.g. Amazon Mechanical Turk)

Ceiling analysis: What part of the pipeline to work on next

Estimating the errors due to each component (ceiling analysis)

What part of the pipeline should you spend the most time trying to improve? In a ceiling analysis, we go through the pipeline one stage at a time, manually providing that stage with perfect (ground-truth) output, and check how much the end result improves. The stage whose perfect output yields the largest improvement is the most promising one to work on.