Quantized Void: How to Build an Econometric Forecasting Model in R? A Step-by-Step Guide-1

In this post, I am going to build an econometric regression model in R from scratch. As an example, I use the monthly unemployment rate in the U.S. as my dependent variable and 7 macroeconomic indicators as independent variables. First, I will explain how to extract the relevant data from external sources and load it into R as a properly formatted data frame.

Then, I will walk through the important data preprocessing steps as well as basic statistical analyses such as outlier detection and correlation analysis. After that, I will apply various machine learning methods to understand the relationship between the variables and to predict the future unemployment rate in the U.S. for a specific period.

Before we start, there are three important things to mention. First, no matter what, try to use data from reliable sources! Imagine you are a chef in a reputable restaurant and you buy food from the market to prepare dinner for tonight. If you buy food of low quality, then no matter how skilled a chef you are, you will be unable to prepare a dinner of high quality! In our case, the data you collect is your raw food. If you do not care about the quality and reliability of its source, or you fail to preprocess it well, your analysis is highly likely to be misleading, however complex the methods you apply.

Second, make sure that you follow all the necessary preprocessing steps that I am going to mention below. If you miss an important step and conduct your analysis anyway, it may be too late to go back and change your dataset (or at least it will require a lot of work and effort). So, make sure everything looks perfectly clear before employing any algorithms!

Third, keep in mind that this is an example of Time Series Analysis. So, every variable has the same frequency (monthly), and the temporal order of observations is extremely important. Now, let's start with introducing the variables:

Dependent Variable:

1- UNEMPLOYMENTRATE (Unemployment Rate in the U.S., Percent, Monthly)

Independent Variables:

1- MOODYSAAA (Moody's Seasoned AAA Corporate Bond Yield in the U.S., in Percentage, Monthly)

2- LABORFORCEPARTICIPATION (Labor Force Participation Rate in the U.S., in Percentage, Monthly)

3- CPI (Consumer Price Index, All Items in the U.S., Growth Rate Previous Period, Monthly)

4- CADUSDEXCRATE (Canada / U.S. Foreign Exchange Rate, Canadian Dollars to One U.S. Dollar, Monthly)

5- MONTHLYHOUSESUPPLY (Monthly Supply of Houses in the U.S., Monthly)

6- COMMERCIALBANKSASSETS (Total Assets of All Commercial Banks in the U.S., Billions of USD, Monthly)

7- TOTALVEHICLESALES (Total Vehicle Sales in the U.S., Millions of Units, Monthly)

Step 1: Preparing the Dataset and Importing it in R

I recommend downloading all data to an Excel spreadsheet first, as it is very easy to do. Give each column a relevant title and start from the very first (A1) cell. Paste the data of your independent variables first, and do not leave any empty columns between them. The data of your dependent variable should be in the last column. Also, make sure that the column titles contain no spaces, as names with spaces are awkward to reference in R.

The very first thing you need to do is set the working directory in R, in other words, the folder that contains the Excel file you are going to work with. To do that, right-click on your Excel file and click on "Properties". Then, copy the "Location" of your file and paste it into setwd() as below:

setwd("C:/Users/Administrator/Desktop/blog/Example")

IMPORTANT: Do not forget to put the location in double quotes ("") and make sure to change every \ into /.

Then, type getwd() in your R console to confirm. If it prints the folder you entered, your working directory is successfully set!
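As a small illustration of the path advice above, here is a minimal sketch in base R. The path is the example one from this post and will not exist on most machines, so the sketch only calls setwd() when the folder is actually there.

```r
# Example Windows-style path, as copied from the file's Properties dialog.
# Inside an R string literal every backslash has to be doubled.
raw_path <- "C:\\Users\\Administrator\\Desktop\\blog\\Example"

# Replace each backslash with a forward slash so that R accepts the path.
clean_path <- gsub("\\", "/", raw_path, fixed = TRUE)
print(clean_path)

# Change the working directory only if the folder exists on this machine.
if (dir.exists(clean_path)) {
  setwd(clean_path)
}

# getwd() reports the current working directory; it does not change it.
print(getwd())
```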

Once you are ready, it is time to import your file into R. To do that, you should first install the relevant packages and load their libraries. Type the code below, then select all of it and press Ctrl+ENTER to execute.

install.packages("readxl")

install.packages("writexl")

install.packages("xlsx")

install.packages("zipR")

install.packages("openxlsx")

library(readxl)

library(writexl)

library(xlsx)

library(zipR)

library(openxlsx)

Now, it is time to read the Excel file into R and store it in an object. In this example, I have used the name "ANALYSIS" for both the file and the object, but you can choose any name that you want. To do that, type the code below:

ANALYSIS<-read_excel("ANALYSIS.xlsx")

R reports that our dataset contains 546 observations of 9 variables. Everything goes perfectly so far. Now, let's check if there are any missing values and outliers in our dataset:
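A quick way to double-check an imported dataset is sketched below in base R. The toy data frame and its values are made up for illustration; with the real data you would run the same calls on ANALYSIS.

```r
# Toy stand-in for the imported data; all values are invented.
toy <- data.frame(
  DATE = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  MOODYSAAA = c(3.0, 2.9, 3.1, 2.8, 2.7, 2.6),
  UNEMPLOYMENTRATE = c(4.0, 4.1, 4.5, 6.0, 5.5, 5.0)
)

# Number of observations (rows) and variables (columns).
print(dim(toy))

# Type of each column; everything except DATE should be numeric.
str(toy)

# Column titles that are not syntactically valid R names (for example,
# titles containing spaces) are collected here; ideally this is empty.
bad_names <- names(toy)[names(toy) != make.names(names(toy))]
print(bad_names)

# For time series, the dates should be in ascending order.
dates_sorted <- !is.unsorted(toy$DATE)
print(dates_sorted)
```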

Step 2: Identifying Missing Values in the Dataset & Getting the Summary Statistics

Make sure that your dataset has no missing values. Checking this is very easy, even if you are working with large datasets. Type sum(is.na(ANALYSIS)) in your R console. This code tells you how many values are missing across the whole dataset.

In our example, it says that there are no missing values. That's great, as missing values can be pretty harmful to our analysis. Now, let's check the summary statistics of our dataset using the summary(ANALYSIS) function:

Summary statistics are important for understanding the basic properties of our dataset. They tell us about the characteristics of each variable (minimum and maximum, 1st and 3rd quartiles, mean and median). They also hint at possible outliers in our dataset.
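The checks in this step can be sketched on a small example. The data frame below is invented and deliberately contains missing values so that the counts are not all zero; colSums() and complete.cases() are base R additions that show where the missing values sit.

```r
# Toy data with a few missing values planted on purpose.
toy <- data.frame(
  CPI = c(0.2, NA, 0.1, 0.3, -0.1),
  TOTALVEHICLESALES = c(17.1, 16.8, NA, NA, 17.4)
)

# Total number of missing values in the whole data frame.
total_missing <- sum(is.na(toy))
print(total_missing)

# Missing values per column, to see which variable is affected.
missing_by_column <- colSums(is.na(toy))
print(missing_by_column)

# Row numbers that contain at least one missing value.
rows_with_missing <- which(!complete.cases(toy))
print(rows_with_missing)

# Summary statistics; summary() also reports the NA count per column.
print(summary(toy))
```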

Step 3: Identifying Possible Outliers

1- Eyeballing the Graphs

Now, let's plot each variable to see if any outliers stand out by eyeballing. Below, I provide the code for plotting MOODYSAAA against DATE as an example. However, you should do that for every variable by replacing MOODYSAAA with the relevant variable name.

plot(ANALYSIS$DATE,ANALYSIS$MOODYSAAA,

xlab = "DATE",

ylab = "MOODYSAAA",

col = "red",

type = "l",

main = "GRAPH")

abline(h=mean(ANALYSIS$MOODYSAAA), col="blue")

This is the full R code to plot the graph of MOODYSAAA vs. DATE. The xlab and ylab arguments set the titles of the x and y axes, col specifies the color of the line, type specifies the graph type (in this case, "l" for a line graph), and main is the title of the graph. Finally, the abline() function adds a straight line to the graph. In this example, I have added the mean value of MOODYSAAA as a horizontal blue line.

Below, I have added all the graphs of our 7 independent variables along with their mean values.
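Instead of editing the plotting code by hand for each variable, the same plot can be produced in a loop. This is only a sketch on simulated data with two made-up series; with the real data you would loop over the column names of ANALYSIS.

```r
# Simulated monthly data; the numbers are random and purely illustrative.
set.seed(1)
toy <- data.frame(
  DATE = seq(as.Date("2015-01-01"), by = "month", length.out = 24),
  MOODYSAAA = 3.5 + cumsum(rnorm(24, sd = 0.1)),
  CPI = rnorm(24, mean = 0.2, sd = 0.2)
)

# Every column except DATE is a series to plot.
series_names <- setdiff(names(toy), "DATE")

for (v in series_names) {
  # Line graph of the series against time, titled with the variable name.
  plot(toy$DATE, toy[[v]],
       xlab = "DATE",
       ylab = v,
       col = "red",
       type = "l",
       main = v)
  # Horizontal blue line at the mean of the series.
  abline(h = mean(toy[[v]]), col = "blue")
}
```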

2- Rosner's Test

Obviously, eyeballing alone is not enough to detect possible outliers. There are various statistical tests for detecting outliers such as Grubbs's Test, Dixon's Test, Rosner's Test, etc. They all work pretty well and are easy to use. In our example, I will use Rosner's Test. Let's see the code first:

As in the picture above, we are using the "EnvStats" package and its rosnerTest() function. I pass two arguments to this function: the data that we plan to examine for outliers, and k, which denotes the number of possible outliers that we suspect (I set it to 10 in this example. However, if you suspect that there are more or fewer outliers, make sure to change this parameter value). Finally, I have extracted the statistics of all "suspected" outliers from the all.stats component of the result. The results for all independent variables are shown below:
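For readers who cannot see the picture, a call of this kind can be sketched as follows. This is not the exact code from the post: the data are simulated, an outlier is planted on purpose, and k = 3 is chosen only to keep the output short. The sketch runs only if the EnvStats package is installed.

```r
if (requireNamespace("EnvStats", quietly = TRUE)) {
  # Simulated series: 100 standard normal draws plus one planted outlier
  # at position 101.
  set.seed(42)
  x <- c(rnorm(100), 8)

  # Rosner's test for up to k suspected outliers.
  res <- EnvStats::rosnerTest(x, k = 3)

  # One row per suspected outlier; the Outlier column holds the verdict.
  print(res$all.stats)

  # Observation numbers and values of the points flagged as outliers.
  flagged <- res$all.stats[res$all.stats$Outlier, c("Obs.Num", "Value")]
  print(flagged)
}
```

With the real data, x would be a single column such as ANALYSIS$CPI, and the call would be repeated for each variable.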

The results indicate that only the 395th observation in the CPI series seems to be an outlier. Normally, if we had used cross-sectional or panel data instead of time series, I would strongly recommend removing outliers, because outliers are far more likely to harm our results than to improve them. However, this does not work with time-series data, since deleting an observation leaves a gap in the series and breaks its regular monthly spacing. So, the only thing you should check is whether the outlier value is a true observation or not. In our case, the observation was real. So we will continue with our data preprocessing.

3- Grubbs's Test

Using Grubbs's Test, we can check whether a single extreme value in our dataset is an outlier or not. The null hypothesis for this test states that the tested value is not an outlier; the alternative is that it is. So, if our p-value is less than 0.05, we reject the null and say that the given observation is an outlier. Note that the grubbs.test() function from the "outliers" package tests the value farthest from the mean by default, which may be either the highest or the lowest value. So, to check the value at the opposite end, we are going to set opposite = TRUE. The full R code with results is below:

Results show that the lowest value of CPI is flagged as an outlier by Grubbs's Test, as the p-value is less than 0.05. This is exactly the same observation that Rosner's Test also identified as an outlier! Now, let's do the same for the values at the opposite end:

For these, none of the p-values are less than 0.05, so there are no further outliers.
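The two calls described in this section can be sketched like this. Again, this is not the post's original code: the series is simulated with one low outlier planted on purpose, and the sketch runs only if the outliers package is installed.

```r
if (requireNamespace("outliers", quietly = TRUE)) {
  # Simulated series: 60 standard normal draws plus one planted low value.
  set.seed(7)
  x <- c(rnorm(60), -7)

  # First call: in this toy sample it tests the planted low value.
  first <- outliers::grubbs.test(x)
  print(first$alternative)
  print(first$p.value)

  # Second call: opposite = TRUE tests the value at the other end.
  second <- outliers::grubbs.test(x, opposite = TRUE)
  print(second$alternative)
  print(second$p.value)
}
```

The alternative element of the result states in words which value was tested, which is a convenient way to confirm that each call looked at the end of the sample you intended.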

Step 4: Checking the Correlation Matrix

Correlation is one of the simplest ways to examine the statistical relationship between variables. If the correlation coefficient is positive, the variables tend to move in the same direction. If it is negative, they tend to move in opposite directions. If the correlation coefficient is 0, there is no linear relationship between the variables.

To compute the correlations, I am using the cor() function. In the indexing brackets, the comma (,) with nothing before it indicates that I am going to use all rows, and -1 indicates that I do not want to include the first column (DATE). Keep in mind that the cor() function only works with numeric values!
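The indexing described above can be tried on a small simulated dataset. The numbers below are random, and the negative relationship is built in on purpose; the last lines are an addition that shows the coefficients stay the same when a variable is expressed in different units.

```r
# Simulated monthly data; the values are random and purely illustrative.
set.seed(3)
n <- 48
toy <- data.frame(
  DATE = seq(as.Date("2018-01-01"), by = "month", length.out = n),
  UNEMPLOYMENTRATE = rnorm(n, mean = 5, sd = 1)
)
# Vehicle sales are constructed to fall when unemployment rises.
toy$TOTALVEHICLESALES <- 20 - 0.8 * toy$UNEMPLOYMENTRATE + rnorm(n, sd = 0.5)

# All rows (nothing before the comma), every column except the first (DATE).
corr <- cor(toy[, -1])
print(round(corr, 2))

# Express vehicle sales in units instead of millions of units.
rescaled <- toy
rescaled$TOTALVEHICLESALES <- rescaled$TOTALVEHICLESALES * 1e6

# The correlation matrix is unchanged by the change of units.
same_after_rescaling <- isTRUE(all.equal(cor(rescaled[, -1]), corr))
print(same_after_rescaling)
```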

As specified in the picture above, the correlation coefficient of UNEMPLOYMENTRATE vs. TOTALVEHICLESALES is -0.74. This indicates a strong negative linear relationship: months with a higher unemployment rate tend to have lower total vehicle sales. Note that a correlation coefficient is unit-free, so it does not tell us by how many units vehicle sales change for a 1% change in the unemployment rate; that would require a regression slope. The direction makes sense. Unemployed people definitely have more important things to worry about than buying a new car! Similarly, the correlation coefficient of COMMERCIALBANKSASSETS vs. MOODYSAAA is -0.85. This means that higher Moody's Seasoned AAA Corporate Bond Yields tend to coincide with lower total assets of commercial banks in the U.S. Let's note them down.

In the second part of this post, I will discuss the remaining data preprocessing steps, namely standardization, normalization, and data splitting, and build the first OLS (Ordinary Least Squares) model. Thank you very much for reading, and comments are more than welcome!

Sources:

https://statsandr.com/blog/outliers-detection-in-r/

https://fred.stlouisfed.org/
