Mathematics for Machine Learning
Part 1: Understanding Regression: A Mathematical Journey

CONTENTS

Why Math is Essential for Machine Learning

Linear Regression Basics

Linear Regression With One Variable

Linear Regression With One Variable: Cost Function

Linear Regression With One Variable: Gradient Descent

Gradient Descent Implementation in Python

Conclusion: Mastering Math for Machine Learning

Why Math is Essential for Machine Learning

Mathematics is the backbone of machine learning. Without a strong foundation in mathematical concepts, it is impossible to understand the complex algorithms that power machine learning models. One of the most important mathematical concepts used in machine learning is regression analysis. Regression analysis is used to make predictions based on historical data. It is used to identify patterns and relationships between variables, which can then be used to make predictions about future outcomes.

Regression analysis involves a lot of complex mathematical formulas and calculations. But don't let that scare you! With a little bit of practice, anyone can master the basics of regression analysis. And once you do, you'll be able to build powerful machine learning models that can make accurate predictions about everything from stock prices to customer behavior.

Linear Regression Basics

Linear regression is a statistical method used to establish a relationship between a dependent variable and one or more independent variables. It is commonly used to predict numerical values, such as sales figures or stock prices, based on historical data. In essence, linear regression helps us understand how changes in one variable affect another.

To perform linear regression, we first need to identify the independent and dependent variables. The independent variable is the variable we believe has an impact on the dependent variable. For example, if we are trying to predict sales figures, the independent variable might be advertising spend. The dependent variable would then be the sales figures themselves. Once we have identified these variables, we can use a mathematical formula to calculate the best-fit line that represents the relationship between them.

Linear Regression With One Variable

The goal of linear regression with one variable is to find the line of best fit that describes the relationship between the independent variable (x) and the dependent variable (y). This line can then be used to make predictions about future values of y based on given values of x.

Supervised Learning

Imagine you're assisting a friend who wants to sell their house. You also have another friend whose house is valued at $200,000. To guide your first friend in setting the right selling price, you decide to analyze the available data and build a useful model: a straight line that fits that information.

In this situation, you're using a type of learning that involves having the correct answers for each case in your dataset. Specifically, you know the actual prices at which similar houses were sold. This type of problem is known as "regression." It's all about predicting values that are continuous, like house prices.

So, the goal is to create a model that helps estimate the appropriate selling price for your friend's house based on the data. The model uses a straight line to represent the relationship between the house's size and its price. With this line, you can predict a reasonable price for a house with a specific size, like 750 square feet. This method proved successful, as you were able to help your friend sell their house for about $150,000 using the available training data.

In supervised learning, we are presented with a training set comprising various houses, aiming to learn how to predict their prices. To clarify the notations:

  • n refers to the number of training examples.

  • x represents the input feature (the house size).

  • y corresponds to the target output (the house price).

  • (x, y) denotes a single training example.

  • (x_i, y_i) is the i-th training example, indexed by i.

The training set is fed into the learning algorithm, which yields a hypothesis function denoted as h, standing for "hypothesis." The purpose of this hypothesis is to map inputs (house sizes) to estimated outputs (the corresponding house prices). In simpler terms, h is a function that takes x as input and produces an estimate of y as output. The hypothesis can be represented as h(x) = θ0 + θ1·x, where:

  • θ0 is the intercept of the linear function.

  • θ1 is the slope, determining the relationship between input size and output price.

This function h(x) = θ0 + θ1·x is analogous to the standard linear equation y = mx + b, where b is the intercept and m is the slope.
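As a quick illustration, here is a minimal Python sketch of this hypothesis; the particular θ0 and θ1 values are hypothetical and chosen only to show the mechanics:

    def hypothesis(x, theta0, theta1):
        """Predict a house price from its size using h(x) = theta0 + theta1 * x."""
        return theta0 + theta1 * x

    # Illustrative parameters only: an intercept of 50 and a slope of 0.13
    # (prices in thousands of dollars, sizes in square feet).
    print(hypothesis(750, 50, 0.13))  # 147.5, i.e. roughly $150,000 for 750 sq ft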

Linear Regression With One Variable: Cost Function

Cost Function

The cost function in Linear Regression with One Variable is used to measure the accuracy of the hypothesis function. It is calculated by taking the average of the squared differences between the predicted values and the actual values. The goal is to minimize the cost function to obtain the best possible fit for the data.

In linear regression, we start with a dataset, perhaps the one we mentioned earlier. The goal is to find values for the parameters θ0 and θ1 that define a straight line fitting the data as closely as possible. The main idea is to determine θ0 and θ1 so that the hypothesis function h(x) comes close to the actual target values y for each training example (x, y).

To achieve this, we set up a minimization problem. Specifically, we aim to choose θ0 and θ1 so that the differences between the predictions h(x) and the actual values y are as small as possible. One effective way to do this is to minimize the squared differences between the output of our hypothesis and the true prices of the houses (y). This is accomplished by summing the squared differences over all training examples from i = 1 to n. The objective is to minimize this sum divided by 2n, which is half the mean squared error.

Minimizing over θ0 and θ1 amounts to finding the optimal configuration for the hypothesis h(x) = θ0 + θ1·x. The cost function J(θ0, θ1), also known as the squared error function, encapsulates this minimization goal. By tuning θ0 and θ1, we aim to find the values that make h(x) closely match the actual y values.
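Written out, the cost is J(θ0, θ1) = (1 / (2n)) · Σ from i = 1 to n of (h(x_i) − y_i)². A minimal Python sketch of this computation, using a tiny made-up dataset purely for illustration, might look like this:

    import numpy as np

    def cost(theta0, theta1, x, y):
        """Squared error cost: J = (1 / (2n)) * sum((h(x_i) - y_i)^2)."""
        n = len(x)
        predictions = theta0 + theta1 * x       # h(x) for every training example
        return np.sum((predictions - y) ** 2) / (2 * n)

    # Made-up toy data where the target equals the input exactly.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    print(cost(0.0, 1.0, x, y))  # 0.0   -- a perfect fit
    print(cost(0.0, 0.5, x, y))  # 0.583 -- a worse fit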

For a clearer visualization, consider focusing on a single parameter. For instance, you can set θ0 = 0 and then concentrate on minimizing over θ1 alone. This simplifies the cost function and facilitates a better understanding of the optimization process.

As you can see in the picture, there are three graphs: two showing h(x) and the other representing J(θ1).

Our first step is to select a value of θ1, so that h(x) = 0 + θ1·x. This choice determines a specific point on the J(θ1) graph and, at the same time, the straight line you observe in the first image.

Next, we proceed to calculate the cost function by plugging in the chosen value of θ1. This is done to assess how well the chosen line fits the data. We then create a plot of our cost function, J(θ1).

Since θ1 can take both positive and negative values, let's consider a scenario where θ1 = 0.5. For a training example with x = 1, the prediction h(x) is now 0.5 while the actual price remains 1. The cost function is then the (scaled) sum of squared differences between the height of each prediction and the corresponding actual price.

Continuing this process, we calculate the cost function for various other values of θ1, plotting each result as another point on the J(θ1) graph. This accumulation of computations traces out a bowl-shaped curve.

Upon analyzing this curve, it becomes evident that the value of θ1 that minimizes J(θ1) aligns with the point where θ1 = 1. This insight suggests that a value of θ1 = 1 corresponds to the best-fitting line for our data.
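To see the bowl shape numerically, the sketch below sweeps θ1 over a range of values for a hypothetical toy dataset in which y equals x exactly, so θ1 = 1 is the perfect fit; plotting J(θ1) against θ1 would trace out the bowl:

    import numpy as np

    # Hypothetical toy dataset where price equals size, so theta1 = 1 fits perfectly.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    n = len(x)

    # Fix theta0 = 0 and evaluate J(theta1) for several values of theta1.
    for theta1 in np.linspace(-0.5, 2.5, 7):
        J = np.sum((theta1 * x - y) ** 2) / (2 * n)
        print(f"theta1 = {theta1:4.1f}  ->  J(theta1) = {J:.3f}")
    # The printed values decrease toward theta1 = 1 and rise again beyond it.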

Linear Regression With One Variable: Gradient Descent

Gradient descent is an optimization algorithm used to minimize the cost function in linear regression with one variable. It works by iteratively adjusting the parameters of the hypothesis function to find the values that result in the lowest possible cost. This is done by computing the gradient of the cost function with respect to each parameter and updating the parameters accordingly.

In the context of gradient descent, the symbol α represents the learning rate. This parameter governs the size of the steps we take when updating our parameters θj. If α is too small, gradient descent can be slow; if α is too large, gradient descent can overshoot the minimum and may fail to converge. The second term in the update rule is the partial derivative ∂/∂θj of the cost function J(θ0, θ1), which guides the adjustments we make to each parameter in a coordinated manner.
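For reference, the update rule those two terms belong to is the standard gradient descent step, applied to j = 0 and j = 1 with both parameters updated simultaneously:

    θj := θj − α · (∂/∂θj) J(θ0, θ1)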

We apply the gradient descent method to our cost function. To do this, we need to determine the partial derivative (∂/∂θj) of the cost function J(θ0, θ1). This involves inserting the definition of the cost function into the partial derivative notation and then calculating the partial derivative, which necessitates employing various calculus techniques.

After computing this partial derivative, we substitute it into the gradient descent equations. This step aligns the calculated derivative with the updates made during gradient descent. The gradient descent algorithm uses these derivatives to adjust both θ0 and θ1 simultaneously in each iteration. This simultaneous update ensures that the parameters are modified together, enabling the algorithm to find the optimal values more efficiently.
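Carrying out those partial derivatives for the squared error cost defined earlier (with its 1/(2n) scaling) gives the concrete simultaneous updates:

    θ0 := θ0 − α · (1/n) · Σ from i = 1 to n of (h(x_i) − y_i)
    θ1 := θ1 − α · (1/n) · Σ from i = 1 to n of (h(x_i) − y_i) · x_i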

As a result of these iterative updates, we formulate our gradient descent algorithm, which continues to iterate until convergence is achieved. This entire process outlines the algorithm for linear regression, allowing us to fit a linear model to the data effectively.

Gradient Descent Implementation in Python

Before diving into the implementation of Gradient Descent in Python, let's take a quick look at how this fundamental optimization algorithm works. Gradient Descent is like finding the steepest downhill path on a hilly landscape. In the context of machine learning, it helps us adjust model parameters to minimize a cost function, improving our model's fit to the data. Now, let's explore how to implement Gradient Descent using Python code.

Let's import some libraries. We'll import the linear regression model to compare it with the model we'll soon build.

Here, x is the data and y is the target.

We used this approach to adjust our linear regression model for the best fit and then assessed its accuracy by calculating the mean squared error. Now, let's create our own linear model using the algorithm we've learned!
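The original code listing is not reproduced here, so the following is only a sketch of what those steps might look like, using scikit-learn's LinearRegression and mean_squared_error with a small synthetic dataset invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Synthetic one-feature dataset, invented purely for illustration.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=(100, 1))                # input feature
    y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 1, 100)    # target with some noise

    # Fit scikit-learn's linear regression as a baseline and measure its error.
    baseline = LinearRegression()
    baseline.fit(x, y)
    baseline_mse = mean_squared_error(y, baseline.predict(x))
    print("Baseline MSE:", baseline_mse)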

This code introduces a class named Gradient, which applies Gradient Descent. Its constructor takes the initial slope, intercept, iteration count, dataset size, learning rate, x, and y. The gradient_Algo method executes Gradient Descent iteratively (a sketch consistent with this description follows the list):

  • It predicts y using slope and intercept.

  • Calculates the error between predicted and actual y.

  • Computes partial derivatives of the cost function for slope and intercept.

  • Updates slope and intercept with gradients and learning rate.

  • Prints the current loss in each iteration.

The code showcases how Gradient Descent optimizes linear regression's slope and intercept to minimize error and enhance model fit.
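The original class listing is not shown above, so here is a sketch consistent with that description; the constructor signature and the internal variable names (beyond Gradient and gradient_Algo) are assumptions:

    import numpy as np

    class Gradient:
        """Gradient Descent for a one-variable linear model (sketch, not the original code)."""

        def __init__(self, slope, intercept, iterations, n, learning_rate, x, y):
            self.slope = slope              # initial theta1
            self.intercept = intercept      # initial theta0
            self.iterations = iterations    # number of gradient descent steps
            self.n = n                      # dataset size
            self.learning_rate = learning_rate
            self.x = x
            self.y = y

        def gradient_Algo(self):
            for i in range(self.iterations):
                # Predict y using the current slope and intercept.
                y_pred = self.slope * self.x + self.intercept
                # Error between predicted and actual y.
                error = y_pred - self.y
                # Partial derivatives of the cost with respect to slope and intercept.
                d_slope = np.sum(error * self.x) / self.n
                d_intercept = np.sum(error) / self.n
                # Update slope and intercept using the gradients and the learning rate.
                self.slope -= self.learning_rate * d_slope
                self.intercept -= self.learning_rate * d_intercept
                # Print the current loss in this iteration.
                loss = np.sum(error ** 2) / (2 * self.n)
                print(f"iteration {i}: loss = {loss:.4f}")
            return self.intercept, self.slope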

When I run the code, the loss appears to decrease.

The model we built from scratch seems to achieve higher accuracy than the library's linear model for this dataset. You can confirm this by assessing the mean squared error of both models and making your own observations.

The parameters theta0 and theta1 represent our model, expressed as y = theta1·x + theta0 (the familiar y = mx + b, with m = theta1 and b = theta0). Through optimization, we determine the best-fit values of theta0 and theta1 and use them to compute predictions.
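Putting the pieces together, a possible usage sketch, continuing the synthetic data, baseline, and Gradient class from the earlier snippets (so the names here are the ones assumed there), could look like this:

    # Fit our own model on the same synthetic data used for the baseline.
    model = Gradient(slope=0.0, intercept=0.0, iterations=5000, n=len(y),
                     learning_rate=0.02, x=x.ravel(), y=y)
    theta0, theta1 = model.gradient_Algo()

    # Use the fitted theta0 (intercept) and theta1 (slope) to compute predictions,
    # then compare mean squared error against the scikit-learn baseline.
    our_predictions = theta0 + theta1 * x.ravel()
    print("Our MSE:     ", mean_squared_error(y, our_predictions))
    print("Baseline MSE:", mean_squared_error(y, baseline.predict(x)))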

Conclusion: Mastering Math for Machine Learning

In conclusion, mastering math is essential for anyone looking to excel in machine learning. The concepts of linear regression are just the tip of the iceberg when it comes to the mathematical foundations of this field. Understanding the underlying principles of calculus, linear algebra, and statistics will enable you to build more accurate models and make better predictions.

While it may seem daunting at first, investing time and effort into learning these mathematical concepts will pay off in the long run. Whether you are a beginner or an experienced data scientist, there is always room for improvement and growth. So don't be afraid to dive deeper into the world of math and machine learning!