In this post, I am going to build an econometric regression model in R from scratch. As an example, I use the monthly unemployment rate in the U.S. as my dependent variable and 7 different macroeconomic indicators as independent variables. First, I will explain how to extract the relevant data from external sources and load it into R as a properly formatted data frame.
Then, I will explain the details of all important data preprocessing steps as well as basic statistical analyses such as outlier detection and correlation analysis. After that, I will apply various machine learning methods to understand the relationship between the variables and to predict the future unemployment rate in the U.S. for a specific period.
Before we start, there are three important things to mention. First, no matter what, try to use data from reliable sources! Imagine you are a chef in a reputable restaurant and you buy food from the market to prepare dinner for tonight. If you buy low-quality food, then no matter how skilled a chef you are, you will be unable to prepare a high-quality dinner! In our case, the data you collect is your raw food. If you do not care about the quality and reliability of its source, or fail to preprocess it well, then no matter how complex the methods you apply, your analysis is highly likely to be misleading.
Second, make sure that you follow all the necessary preprocessing steps that I am going to mention below. If you miss an important step and then conduct your analysis, it may be too late to go back and change your dataset (or at least it will require a lot of work and effort). So, make sure everything looks perfectly clear before employing any algorithms!
Third, keep in mind that this is an example of Time Series Analysis. For that reason, every variable I have used has the same frequency (monthly), and the temporal order of the observations is extremely important. Now, let's start with introducing the variables:
Dependent Variable:
1- UNEMPLOYMENTRATE (Unemployment Rate in the U.S., Percent, Monthly)
Independent Variables:
1- MOODYSAAA (Moody's Seasoned AAA Corporate Bond Yield in the U.S., Percent, Monthly)
2- LABORFORCEPARTICIPATION (Labor Force Participation Rate in the U.S., Percent, Monthly)
3- CPI (Consumer Price Index, All Items in the U.S., Growth Rate Previous Period, Monthly)
4- CADUSDEXCRATE (Canada / U.S. Foreign Exchange Rate, Canadian Dollars to One U.S. Dollar, Monthly)
5- MONTHLYHOUSESUPPLY (Monthly Supply of Houses in the U.S., Monthly)
6- COMMERCIALBANKSASSETS (Total Assets of All Commercial Banks in the U.S., Billions of USD, Monthly)
7- TOTALVEHICLESALES (Total Vehicle Sales in the U.S., Millions of Units, Monthly)
Step 1: Preparing the Dataset and Importing it in R
I recommend downloading all data to an Excel spreadsheet first, as it is very easy to do. Make sure you give a relevant title to each column and start from the very first (A1) cell. Paste the data of your independent variables first, and do not leave any empty columns between them. The data of your dependent variable should be in the last column. Also, make sure that your titles contain no spaces, as column names with spaces are awkward to reference in R.
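To illustrate the point about spaces in titles, here is a minimal sketch. The titles below are hypothetical examples that I made up for illustration; they are not read from the actual spreadsheet. Base R's make.names() shows what R considers a syntactically valid name, and gsub() shows the space-free style used in this post.

```r
# Hypothetical column titles as they might be typed in Excel (illustrative only).
titles <- c("DATE", "MOODYS AAA", "UNEMPLOYMENT RATE")

# make.names() converts each title into a syntactically valid R name;
# spaces are replaced with dots.
valid <- make.names(titles)
print(valid)

# Removing the spaces instead gives titles in the style used in this post.
compact <- gsub(" ", "", titles, fixed = TRUE)
print(compact)
```

Either form can then be used directly after the $ operator, without any extra quoting.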
The very first thing you need to do is to set the working directory in R, that is, the folder that contains the Excel file you are going to work with. To do that, right-click on your Excel file and click on "Properties". Then, copy the "Location" of your file and paste it into setwd() as below:
setwd("C:/Users/Administrator/Desktop/blog/Example")
IMPORTANT: Do not forget to write the location in quotation marks ("") and make sure to change every \ into /.
Then type getwd() in your R console to confirm that your working directory has been set successfully!
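If you prefer not to replace the backslashes by hand, the conversion can be sketched in R itself. This is only an illustration: the variable names windows_path and r_path are my own, and the setwd() call is commented out because the folder exists only on my machine.

```r
# A Windows path as copied from "Properties". Inside an ordinary R string,
# each backslash must be doubled, otherwise R treats it as an escape character.
windows_path <- "C:\\Users\\Administrator\\Desktop\\blog\\Example"

# Replace every backslash with a forward slash (fixed = TRUE: no regex).
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)
print(r_path)

# setwd(r_path)   # uncomment on a machine where this folder exists
# getwd()         # prints the current working directory so you can confirm it
```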
Once you are ready, it is time to import your file into R. To do that, you should first install the relevant packages and load their libraries. Type the code below, then select all of it and press Ctrl+Enter to execute.
install.packages("readxl")
install.packages("writexl")
install.packages("xlsx")
install.packages("zipR")
install.packages("openxlsx")
library(readxl)
library(writexl)
library(xlsx)
library(zipR)
library(openxlsx)
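Running install.packages() every time you execute the script downloads the packages again. A possible alternative, sketched here under my own assumptions, is to install only what is missing. The list contains only readxl because that is the package providing read_excel(), and the interactive() guard is my own addition so that a non-interactive run never triggers a download.

```r
# Packages to check; extend this vector with the other packages if you need them.
needed <- c("readxl")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Install only the packages that are missing, and only in an interactive session.
to_install <- needed[!is_installed]
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```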
Now, it is time to read our Excel file into R and store it under a name. In this example, I have used the name "ANALYSIS" but you can choose any name that you want. To do that, type the code below:
ANALYSIS<-read_excel("ANALYSIS.xlsx")
R reports that our dataset contains 546 observations of 9 variables. Everything looks good so far. Next, we check the dataset for missing values and outliers.
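Since the temporal order and the monthly frequency matter for a time series, it can be worth verifying both right after the import. The sketch below uses a small toy data frame with made-up values as a stand-in for ANALYSIS. It assumes the DATE column is stored as a Date; if read_excel() returns a date-time column instead, as.Date() can convert it first.

```r
# Toy stand-in for the imported data: 6 monthly rows, made-up values.
toy <- data.frame(
  DATE = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  MOODYSAAA = c(3.0, 3.1, 2.9, 3.2, 3.0, 3.1),
  UNEMPLOYMENTRATE = c(5.0, 5.1, 5.3, 5.2, 5.0, 4.9)
)

# Number of rows (observations) and columns (variables).
print(dim(toy))

# Column types at a glance.
str(toy)

# Dates should be strictly increasing.
is_ordered <- all(diff(toy$DATE) > 0)

# Every month between the first and last date should appear exactly once.
expected <- seq(min(toy$DATE), max(toy$DATE), by = "month")
is_monthly <- length(expected) == nrow(toy) && all(toy$DATE == expected)

print(c(ordered = is_ordered, monthly = is_monthly))
```

If either check fails on your own data, sort or repair the spreadsheet before moving on.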
Step 2: Identifying Missing Values in the Dataset & Getting the Summary Statistics
Make sure that your dataset has no missing values. Checking this in R is quick, even for large datasets. Type the following code in your R console:
sum(is.na(ANALYSIS))
This code returns the total number of missing values (NA cells) in your dataset.
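When the total is not zero, it helps to know where the gaps are. Here is a minimal sketch on a toy data frame with made-up values and deliberately inserted NAs (it is not the real dataset), using base R's colSums() and complete.cases().

```r
# Toy data with deliberately missing cells (made-up values).
toy <- data.frame(
  CPI = c(0.2, NA, 0.1, 0.3),
  TOTALVEHICLESALES = c(17.1, 16.8, NA, NA)
)

# Total number of missing cells in the whole data frame.
total_missing <- sum(is.na(toy))

# Number of missing cells in each column.
missing_by_column <- colSums(is.na(toy))

# Row numbers that contain at least one missing cell.
rows_with_missing <- which(!complete.cases(toy))

print(total_missing)
print(missing_by_column)
print(rows_with_missing)
```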
In our example, the result shows that there are no missing values. That's great, as missing values can be pretty harmful to our analysis. Now, let's check the summary statistics of our dataset using the summary(ANALYSIS) function:
Summary statistics are important for understanding the basic characteristics of our dataset. They describe each variable used (minimum and maximum, 1st and 3rd quartiles, and mean and median values). They can also signal possible outliers in our dataset.
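One common way to turn the quartiles into an outlier signal is the 1.5 x IQR rule. Note that this rule is my own illustration and not a step taken from the analysis in this post. The sketch below applies it to a made-up vector with one deliberately extreme value.

```r
# Made-up monthly values with one deliberately extreme observation.
x <- c(4.1, 4.3, 4.0, 4.2, 4.4, 9.8, 4.1, 4.2)

# 1st and 3rd quartiles, as also reported by summary().
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Values beyond 1.5 * IQR from the quartiles are flagged as possible outliers.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
flagged <- x[x < lower | x > upper]

print(summary(x))
print(flagged)
```

A flagged value is only a candidate. Whether it is a data error or a genuine extreme month still has to be judged from the context.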
Step 3: Identifying Possible Outliers
1- Eyeballing the Graphs
Now, let's plot a graph of each variable and look for outliers by eye. Below, I provide the code for plotting MOODYSAAA against DATE as an example. However, you should do this for every variable by replacing MOODYSAAA (in the column reference, the ylab argument, and the abline() call) with the name of the variable you want to plot.
plot(ANALYSIS$DATE,ANALYSIS$MOODYSAAA,
xlab = "DATE",
ylab = "MOODYSAAA",
col = "red",
type = "l",
main = "GRAPH")
abline(h=mean(ANALYSIS$MOODYSAAA), col="blue")
This is the full R code to plot the graph of MOODYSAAA vs. DATE. The xlab and ylab arguments set the titles of the x and y axes. col specifies the color of the line, type specifies the graph type (in this case, "l" for a line graph), and main is the title of the graph. Finally, the abline() function adds a straight line to the graph. In this example, I have added the mean value of MOODYSAAA as a horizontal blue line.
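Instead of editing the code by hand for each variable, the same plot can be produced in a loop. This is a minimal sketch on a toy data frame with made-up values; the names toy and variables are my own. With the real data, you would replace toy with ANALYSIS.

```r
# Toy stand-in for ANALYSIS with made-up values (not the real dataset).
toy <- data.frame(
  DATE = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  MOODYSAAA = c(3.0, 3.1, 2.9, 3.2, 3.0, 3.1),
  CPI = c(0.2, 0.1, 0.3, 0.2, 0.4, 0.1)
)

# Every column except DATE gets its own plot.
variables <- setdiff(names(toy), "DATE")

for (v in variables) {
  # toy[[v]] selects the column whose name is stored in v.
  plot(toy$DATE, toy[[v]],
       xlab = "DATE",
       ylab = v,
       col = "red",
       type = "l",
       main = v)
  # Horizontal blue line at the mean of the current variable.
  abline(h = mean(toy[[v]]), col = "blue")
}
```

Here I use the variable name as the plot title so that the graphs can be told apart.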
Below, I have added all the graphs of our 7 independent variables along with their mean values.