Machine Learning Basics: Building Your First Predictive Model in R

Shittu Olumide Ayodeji
5 min read · Dec 11, 2024

Predictive machine learning models have become one of the most widely adopted technologies across organizations. They forecast future outcomes, giving decision-makers insights that can support growth and productivity.

As a result, demand for professionals who can build high-performing predictive models keeps growing. Many people with transferable skills are moving into machine learning engineering: data analysts, statisticians, and other researchers who already use R, which is a powerful language for building machine learning models.

In this article, you will learn how to build a predictive model using the R programming language.

Prerequisites

This tutorial is suitable for readers who are familiar with the R programming language, use a code editor such as Visual Studio Code, and already know a little about machine learning.

Objectives

By the end of this article, you should be able to:

  • Build a simple machine learning model in R using the linear regression algorithm.
  • Evaluate the performance of your model.

Machine learning (ML) can be likened to teaching a computer to learn from experience instead of being explicitly programmed for every task. Imagine showing a kid hundreds of pictures of cats and dogs and asking them to identify which is which; they’ll eventually figure it out. That’s essentially how ML works, but for computers.

It all started with the idea of creating systems that could mimic human intelligence. Over the years, it evolved from simple rule-based systems to sophisticated algorithms that can handle vast amounts of data. Today, ML powers everything from voice assistants like Siri to personalized recommendations on Netflix.

At its core, ML involves feeding data into algorithms, training them to recognize patterns, and making predictions or decisions without constant human guidance.

Building a Predictive Model in R

1. Install and load the necessary packages

I assume you already have R set up on your computer. If you don’t, feel free to check out this resource. Install and load the necessary packages you will be using for training the model, as shown below.

install.packages("tidyverse")  # For data manipulation and visualization
install.packages("caret") # For model training and evaluation
library(tidyverse) # loading the tidyverse library
library(caret) # loading the caret library
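
If you rerun the script often, reinstalling both packages every time is slow. The following is a minimal sketch of one way to install only what is missing. The helper names `missing_packages` and `install_if_missing` are hypothetical and not part of tidyverse or caret.

```r
# Hypothetical helper: which of these packages are not installed yet?
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Hypothetical helper: install only the missing ones (needs internet access).
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# "stats" ships with R, so nothing is reported as missing here.
print(missing_packages("stats"))
```

Calling `install_if_missing(c("tidyverse", "caret"))` before the `library()` calls would then skip packages that are already present.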

2. Load your preferred dataset

In this step, you choose the dataset you will use to train your model. It can be an externally prepared dataset in CSV format or a built-in dataset. For this article, we will be using the built-in mtcars dataset.

data <- mtcars
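
If your data lives in a CSV file instead, base R's `read.csv()` can load it. The sketch below is only an illustration: it first writes mtcars to a temporary file so there is something to read, and `csv_path` and `data_from_csv` are placeholder names you would replace with your own.

```r
# Write mtcars to a temporary CSV so the example has a file to read.
# In a real project, set csv_path to the location of your own file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Read it back; row.names = 1 restores the car names from the first column.
data_from_csv <- read.csv(csv_path, row.names = 1)

# Quick sanity checks on what was loaded.
print(dim(data_from_csv))
print(head(rownames(data_from_csv), 3))
```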

3. Exploratory data analysis

This step involves understanding the dataset you are working with and checking that it is suitable for use. It covers tasks such as checking for missing values in the dataset and filling them appropriately if any are found.

View the dataset:

glimpse(data)

Check for missing values in the mtcars built-in dataset.

# Check for missing values
sum(is.na(mtcars))

View the statistical summary of the dataset.

# Summary statistics
summary(mtcars)
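
The checks above can be taken one step further. As a rough sketch using only base R, the snippet below counts missing values per column and looks at how strongly each variable correlates with mpg, which is one possible way to motivate a choice of predictors. The variable names `missing_per_column` and `cor_with_mpg` are illustrative.

```r
data <- mtcars

# Missing values per column (sum(is.na(data)) only gives the grand total).
missing_per_column <- colSums(is.na(data))
print(missing_per_column)

# Correlation of every other variable with mpg, strongest first.
cor_with_mpg <- cor(data)[, "mpg"]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]
print(round(sort(abs(cor_with_mpg), decreasing = TRUE), 2))
```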

4. Split the dataset for testing and training the model

This involves splitting your dataset into training and testing sets in a chosen proportion. In this example, we use an 80:20 split: 80% for training and the remaining 20% for testing the model.

Set a seed for reproducibility, then split the dataset:

set.seed(123)  # For reproducibility

# Split the data into 80% training and 20% testing
train_index <- createDataPartition(data$mpg, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
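
For comparison, here is a sketch of a plain random 80:20 split in base R. It is a swapped-in technique, not what `createDataPartition()` does: caret samples within groups of the outcome, while `sample()` ignores the outcome entirely. The names `train_base` and `test_base` are illustrative.

```r
set.seed(123)  # For reproducibility

data <- mtcars
n <- nrow(data)

# Simple random 80:20 split with base R (no stratification on mpg).
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
train_base <- data[train_rows, ]
test_base <- data[-train_rows, ]

# The two parts should not overlap and together should cover every row.
print(c(train = nrow(train_base), test = nrow(test_base)))
```

Whichever method you use, it is worth confirming the sizes of both parts before training.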

5. Build the Linear Regression model

Use the train() function from the caret package to build your model.

# Train the model
model <- train(
  mpg ~ wt + hp, # Formula: mpg as a function of wt and hp
  data = train_data,
  method = "lm", # Linear regression method
  trControl = trainControl(method = "cv", number = 5) # 5-fold cross-validation
)

# View model summary
print(model)
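
To see what the linear model has actually learned, you can inspect its coefficients. The sketch below fits the same formula with base R's `lm()` on the full mtcars data, which is an assumption made to keep the example self-contained; with caret, the underlying fitted object is available as `model$finalModel`. The `new_car` values are made up for illustration.

```r
# Fit the same formula with base R's lm() on the full mtcars data
# (an assumption for illustration; the tutorial fits on train_data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated intercept and slopes: the fitted equation is
# mpg = intercept + b_wt * wt + b_hp * hp
print(coef(fit))

# Predict mpg for a hypothetical car: wt = 3 (3,000 lbs) and 110 hp.
new_car <- data.frame(wt = 3, hp = 110)
print(predict(fit, newdata = new_car))
```

Negative slopes for wt and hp mean that heavier or more powerful cars are predicted to have lower mpg.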

In this step, we have successfully built our linear regression model. The next step is to evaluate its performance using a suitable metric. For this article, we are going to use the RMSE metric.

6. Model evaluation

Use the model you have built in the previous step to make predictions on the test dataset and evaluate the predicted values using the RMSE metric.

# Make predictions
predictions <- predict(model, newdata = test_data)

# Evaluate performance
results <- data.frame(
  Actual = test_data$mpg,
  Predicted = predictions
)

# Calculate RMSE
rmse <- sqrt(mean((results$Actual - results$Predicted)^2))

print(paste("RMSE:", rmse))
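
RMSE is only one way to summarize prediction error. As a small sketch, the functions below compute RMSE alongside MAE and R-squared from their usual definitions. The function names and the four data points are made up purely for illustration. caret's `postResample()` reports similar metrics, though its R-squared is based on a squared correlation and can differ from the formula used here.

```r
# Root mean squared error: penalizes large errors more heavily.
rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the errors.
mae_fn <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared: share of the variance in `actual` explained by the predictions.
r2_fn <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up numbers, purely for illustration.
actual <- c(20, 22, 18, 30)
predicted <- c(21, 21, 19, 28)

print(c(
  RMSE = rmse_fn(actual, predicted),
  MAE = mae_fn(actual, predicted),
  R2 = r2_fn(actual, predicted)
))
```

In the tutorial's workflow, you would pass `results$Actual` and `results$Predicted` to these functions instead.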

7. Model visualization

This step is optional. It plots the predicted values against the actual values, giving a visual sense of how accurate the model is.

# Plot actual vs predicted
plot(test_data$mpg, predictions, main = "Actual vs Predicted",
     xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, col = "red") # Add diagonal line for reference
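
A residual plot is a common companion to the actual-vs-predicted plot. The sketch below assumes a base R `lm()` fit on the full mtcars data so that it runs on its own; in the tutorial you would use `test_data$mpg - predictions` instead. The name `residuals_mpg` is illustrative.

```r
# Illustrative fit on the full mtcars data (an assumption; the tutorial
# predicts on test_data with the caret model instead).
fit <- lm(mpg ~ wt + hp, data = mtcars)
predicted <- predict(fit, newdata = mtcars)
residuals_mpg <- mtcars$mpg - predicted

# Residuals vs predicted values; points should scatter around zero.
plot(predicted, residuals_mpg, main = "Residuals vs Predicted",
     xlab = "Predicted mpg", ylab = "Residual (actual - predicted)")
abline(h = 0, col = "red") # Zero-error reference line
```

A visible curve or funnel shape in this plot can hint that a straight-line model is missing something.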

Congrats on making it this far. We have successfully built a linear regression machine learning model using the R programming language.

Let’s combine all the code snippets into one script.

install.packages("tidyverse")  # For data manipulation and visualization
install.packages("caret") # For model training and evaluation
library(tidyverse) # loading the tidyverse library
library(caret)

data <- mtcars

glimpse(data)

# Check for missing values
sum(is.na(mtcars))

# Summary statistics
summary(mtcars)

set.seed(123) # For reproducibility

# Split the data into 80% training and 20% testing
train_index <- createDataPartition(data$mpg, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Train the model
model <- train(
  mpg ~ wt + hp, # Formula: mpg as a function of wt and hp
  data = train_data,
  method = "lm", # Linear regression method
  trControl = trainControl(method = "cv", number = 5) # 5-fold cross-validation
)

# View model summary
print(model)

# Make predictions
predictions <- predict(model, newdata = test_data)

# Evaluate performance
results <- data.frame(
  Actual = test_data$mpg,
  Predicted = predictions
)

# Calculate RMSE
rmse <- sqrt(mean((results$Actual - results$Predicted)^2))

print(paste("RMSE:", rmse))

# Plot actual vs predicted
plot(test_data$mpg, predictions, main = "Actual vs Predicted",
     xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, col = "red") # Add diagonal line for reference

Conclusion

In this article, we built a machine learning model using the built-in mtcars dataset and the linear regression algorithm. In your own projects, you can use any external dataset and any algorithm that fits the task.

It is recommended that you conduct proper exploratory data analysis on your dataset and experiment with different algorithms before selecting the one that best suits the task.
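
One way to experiment is to compare candidate models by their cross-validated error. The sketch below does this by hand in base R for three linear formulas. `cv_rmse` is a hypothetical helper, and the candidate formulas are arbitrary choices for illustration, not recommendations.

```r
set.seed(123)  # For reproducibility

# Assign each row of mtcars to one of 5 folds, shared by all candidates.
folds <- sample(rep(seq_len(5), length.out = nrow(mtcars)))

# Hypothetical helper: k-fold cross-validated RMSE for a linear model.
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]  # Name of the outcome column
  fold_rmse <- vapply(sort(unique(folds)), function(i) {
    fit <- lm(formula, data = data[folds != i, ])  # Train on other folds
    held_out <- data[folds == i, ]                 # Evaluate on fold i
    pred <- predict(fit, newdata = held_out)
    sqrt(mean((held_out[[response]] - pred)^2))
  }, numeric(1))
  mean(fold_rmse)
}

# Arbitrary candidate formulas, for illustration only.
candidates <- list(
  wt_only = mpg ~ wt,
  wt_hp = mpg ~ wt + hp,
  wt_hp_qsec = mpg ~ wt + hp + qsec
)

cv_results <- vapply(candidates, cv_rmse, numeric(1),
                     data = mtcars, folds = folds)
print(round(cv_results, 2))
```

A lower cross-validated RMSE suggests a better fit on unseen data, although with only 32 rows the differences can be noisy.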
