A Shallow Dive into Time Series: Analysis of Local Restaurant Data Using R



Introduction and Motivation:

The purpose of this article is to highlight time series analysis in R. It is motivated by personal curiosity and a desire to show that small businesses can benefit from open source statistical computing software. Currently, few restaurants use R or any other readily available statistical software to analyze their data. Instead, many subscribe to services that perform basic analysis, such as averaging, and then display that information in simple, though visually appealing, charts. Deeper and more meaningful conclusions about the health of a small business can be drawn when we take the data into our own hands and begin to explore it ourselves. This article provides a transparent how-to example for anyone, with or without a statistics background, who is interested in jumping straight in.


Background:

According to the Department of Statistics at Stockholm University, the beginnings of time series analysis date back to the 1920s and 1930s, when George Yule, Gilbert Walker, Karl Pearson, and others developed and greatly influenced autoregressive modelling. Later, in the 1970s, the book “Time Series Analysis” by G.E.P. Box and G.M. Jenkins was published; in it, the authors gave a more complete method for time series modelling than had been published before. Many different techniques have since been developed, including the ARCH and GARCH procedures for parameterizing non-constant variance; however, my focus here will exclude such methods.

A time series can be understood as a set of data recorded at regular intervals; good examples are tide height measurements or the values recorded at the close of the stock market each day. There is plenty to know about time series, and I do not hope to cover even a fraction of the available material, but rather to provide a taste of this type of analysis.

Let’s start with the restaurant’s data for seated covers. Consider the number of covers a restaurant serves to be the number of meals it serves: one cover is one entree plus everything ordered to accompany it. Suppose a guest orders two glasses of white wine, a salad to start, an entree of salmon, and a piece of chocolate cake for dessert. Everything the guest ordered is counted as part of one cover. The number of seated covers a restaurant records is a good measure of its popularity because it reflects how many guests the restaurant received.


R Code Examples:

Throughout this article I will provide the code used to arrive at any of my plots and charts.

In the following code I read in the data from a CSV file using the read.csv() function. I took data from the restaurant and put it in a tabular format to prepare it for this analysis. Depending on the format of your data, you may need to clean it up so that each point in time has a corresponding entry.

Here is an example portion of the data:

# Read in the data (seated_covers stands for the number of "seated covers" or
# you can consider this "number of guests" the restaurant took in every month
# over the course of nearly 10 years)
seated_covers <- read.csv(
  "~/Dropbox/stat133/stat133-hws-fall17/post02/seated_cvrs.csv")

tail(seated_covers)
##        date seated_cvrs
## 115  7/1/17         328
## 116  8/1/17         386
## 117  9/1/17         467
## 118 10/1/17         215
## 119 11/1/17          NA
## 120 12/1/17          NA

Although my data consists of monthly entries, it is important to note that a time series can be any set of data recorded at a regular interval, even down to milliseconds.
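As a toy illustration (made-up numbers, not the restaurant data), the same vector of values can be declared as a monthly or a quarterly series simply by changing the frequency and start arguments of the ts() function used in the next section:

# Toy illustration of how ts() interprets frequency and start
x <- c(120, 135, 150, 160, 142, 138)
monthly   <- ts(x, frequency = 12, start = c(2017, 1))  # Jan 2017 through Jun 2017
quarterly <- ts(x, frequency = 4, start = c(2016, 1))   # 2016 Q1 through 2017 Q2
monthly
quarterly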

Let’s begin the analysis:

Here I install and load the “tseries” package in R, which provides some of the time series functions used in this article (base R and the “forecast” package, introduced later, supply the others).

#install.packages("tseries")
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.1

# To continue, make seated_covers a time series object
sc_ts <- ts(seated_covers$seated_cvrs, frequency = 12, start = 2008)
sc_ts_na_rm <- na.remove(sc_ts)

# Plot seated_covers to get an idea of how the trend looks
plot(sc_ts_na_rm)

# Add a regression line to observe the overall trend in the mean
abline(reg = lm(sc_ts_na_rm ~ time(sc_ts_na_rm)))

# Look at the summary info for the data
summary(sc_ts_na_rm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    53.0   199.0   301.0   284.3   367.0   573.0

# Now use a box plot to check seasonality of the data
boxplot(sc_ts ~ cycle(sc_ts))

This first plot gives us a sense of how the business trend looks in general and the box plot gives us an idea for the seasonal fluctuations the restaurant experiences. 1-12 represent the months of the year and we can see that the month of May (5) is consistently a month that sees a high number of guests to the restaurant. This box plot also shows us outlier months where the number of guests visiting the restaurant happen to be exceptionally high or exceptionally low. We continue by adjusting the time series data in order to prepare it for use in the forecasting phase. This requires making the mean and the variance constant, known as making the data “stationary”.

Stationarity is important because when a sequence is stationary, many properties that hold for independent random variables (e.g., the Law of Large Numbers and the Central Limit Theorem) also hold, in suitable forms, for the time series. Stationarity is a yes-or-no property: a sequence is either stationary or it is not, much as independence either holds or it does not. We want stationarity so that the model we fit does not vary in accuracy at different points in time, as it would if we had not differenced the data.
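If you would like a numerical check to go along with the visual one, the tseries package also provides the augmented Dickey-Fuller test; it was not part of the original workflow, but a quick sketch looks like this:

# Optional: augmented Dickey-Fuller test from the tseries package.
# A small p-value is evidence in favor of stationarity.
adf.test(sc_ts_na_rm)              # the raw series
adf.test(diff(log(sc_ts_na_rm)))   # the log-differenced series used below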

So here is the code for this process:

# Do this by taking the log of the data within sc_ts, and plotting what you find
# to see if the plot becomes stationary
plot(log(sc_ts_na_rm))

# We can see that this plot is not stationary (does not have a constant mean
# with equal variance) so we change that!

# Now try to difference the plot to give it these qualities.
sc_ts_diff <- diff(log(sc_ts_na_rm))

# Here we can see that we seemingly have stationarity
plot(sc_ts_diff)

We continue by testing out different known time series methods, beginning with ARIMA because, under its assumptions, it tends to give a small mean-squared forecast error, but an AR model (of the same class as ARIMA) or another linear, univariate, fixed-parameter model may also work. It is important to notice that by taking the log of the data, the time series becomes additive rather than multiplicative. Keep this in mind when deciding how to proceed with modelling.
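To see the additive-versus-multiplicative point concretely (an aside, not a step in the main workflow), a multiplicative decomposition of the raw series should resemble an additive decomposition of the logged series:

# Compare a multiplicative decomposition of the raw series with an additive
# decomposition of the logged series
plot(decompose(sc_ts_na_rm, type = "multiplicative"))
plot(decompose(log(sc_ts_na_rm), type = "additive"))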

As we explore an ARIMA (Auto-Regressive Integrated Moving Average) method, we keep in mind that there are many ways to begin this type of analysis, and ARIMA is only one of them.

It is important to know that we need to determine the AR, I, and MA orders separately.

Note: AR = p, I = d, and MA = q

  • Where (p) is the number of auto-regressive terms, i.e., the order of the auto-regressive component.
  • Where (d) is the number of nonseasonal differences needed for stationarity, i.e., the number of differencing operators.
  • Where (q) is the number of lagged forecast errors used in the prediction equation, i.e., the order of the moving average component.

We already know that d = 1 because it only took differencing of order 1 to make the time series data stationary. We continue by finding the values of q and p using R’s autocorrelation function acf() and partial autocorrelation function pacf().

For this step, make sure the tseries package is installed and loaded (acf() itself comes with base R’s stats package):

# install.packages("tseries")
library(tseries)

# Now first look at the autocorrelation of the data without any transformations
# made to it
# If the data you're working with contains NA's or missing data you can pass
# the NA's so that the function doesn't fail
plot(acf(sc_ts, na.action = na.pass))

# We can see that none of the lines are inverted, so let's look at the
# stationarized data
plot(acf(sc_ts_diff, na.action = na.pass))

# Looking at this plot we can see that the moving-average order q is equal to 0.

# Without getting too far into the "pro-level" details, we read this off as the
# lag of the last positive spike before the first inverted (negative) line;
# here that is the spike at lag 0.

So we have that q = 0

  • When using acf() in R to find q, take the lag of the line just before the first inverted line; we are looking for the lag with the most notable spike.

Now for p:

# Instead of acf() we use pacf() to find the partial autocorrelation.
plot(pacf(sc_ts_diff, na.action = na.pass))
Here, the partial autocorrelation function returns a negative tail, which suggests that the time series may need extra differencing. You could continue by differencing again, using a Box-Cox transformation, or applying any number of other methods, but here I choose to use the auto.arima() function to aid in the forecasting. This function comes from the forecast package, so we need to load its library.
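For readers who do want to pursue those alternatives before reaching for auto.arima(), the forecast package has helpers for both ideas; this is only a sketch of the options, not something used in the rest of the article:

library(forecast)

# Estimate how many differences are needed for stationarity
ndiffs(sc_ts_na_rm)

# Pick and apply a Box-Cox transformation to stabilize the variance
lambda <- BoxCox.lambda(sc_ts_na_rm)
plot(BoxCox(sc_ts_na_rm, lambda))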

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.2

auto.arima(sc_ts_na_rm)
## Series: sc_ts_na_rm
## ARIMA(1,1,1) with drift
##
## Coefficients:
##          ar1      ma1   drift
##       0.2388  -0.9275  2.4009
## s.e.  0.1099   0.0496  0.6545
##
## sigma^2 estimated as 4170:  log likelihood=-624.96
## AIC=1257.93   AICc=1258.3   BIC=1268.8

Now we can see how R estimated the AR, I, and MA values. This is a handy function, and using the information it returns we can attempt to plot a forecast of restaurant popularity.

sc_ts_fitted <- arima(
  log(sc_ts_na_rm), c(1, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))

# How far to predict into the future
pred <- predict(sc_ts_fitted, n.ahead = 7*12)

# Back-transform with exp() to account for the log taken earlier
ts.plot(sc_ts_na_rm, exp(pred$pred), log = "y", lty = c(1, 3))

This is good and seems to make sense, but I want to try another procedure. At this point I decide to move on to a decomposition of the time series data to get a better overall idea of what is happening. Use the decompose() function to view the different features of the time series. This will show us the original observed data plotted over time, the trend line, the seasonality, and the random component (i.e., things out of our control, like road work that might limit the number of guests during a particular month).

# Attain the decomposition of the additive time series by using the decompose()
# function and then plotting the result
dec_sc_ts <- decompose(sc_ts_na_rm)
plot(dec_sc_ts)

Now that we can see the different components of our time series data, we can also start to consider forecasting with the decomposition. This uses the stl() function (from base R’s stats package) together with forecasting functions from the ‘forecast’ package.

I will plot each method option so a comparison can be made:

library(forecast)

sc_ts_fit <- stl(
  sc_ts_na_rm, t.window = 5, s.window = "periodic", robust = TRUE)
# t.window is optional but gives us control over how much "wiggle" the plot
# shows

sc_ts_adj <- seasadj(sc_ts_fit)
# make a seasonal adjustment to the stl fit by using function seasadj()

plot(
  naive(sc_ts_adj), xlab = "Seated Covers/Popularity Index",
  main = "Naive Forecasts of Seasonally Adjusted Data")

sc_ts_forecast <- forecast(sc_ts_fit, method = "arima")
# Continuing with the idea to use an ARIMA method:
# method "arima" forecasts the seasonally adjusted series with an ARIMA model
# and then adds the seasonal component back in

plot(sc_ts_forecast, ylab = "Seated Covers Index")

This forecast is similar to the one before it, but here we can adjust the t.window parameter to get a clearer picture of when the business may experience future peaks in popularity.
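For example (a sketch of the kind of adjustment meant here, with an arbitrarily chosen window width), widening t.window smooths the trend component before forecasting:

# Refit with a wider trend window (t.window must be odd) and forecast again
sc_ts_fit_smooth <- stl(
  sc_ts_na_rm, t.window = 15, s.window = "periodic", robust = TRUE)
plot(forecast(sc_ts_fit_smooth, method = "arima"), ylab = "Seated Covers Index")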

sc_ts_forecast <- forecast(sc_ts_fit, method = "ets")
# ETS uses exponential smoothing for error, trend, and seasonal components
# (the AAA version of the exponential smoothing algorithm)

plot(sc_ts_forecast, ylab = "Seated Covers Index")

sc_ts_forecast <- forecast(sc_ts_fit, method = "naive")
# Typically useful for stock market time series
# (the last period's actual value is used as the current forecast)

plot(sc_ts_forecast, ylab = "Seated Covers Index")

sc_ts_forecast <- forecast(sc_ts_fit, method = "rwdrift")
# Similar to extrapolating a line between the first and last observations
# (a variation of the naive method that allows forecasts to
# increase or decrease over time). The amount of change or "drift" over time is
# set to the historical average.

plot(sc_ts_forecast, ylab = "Seated Covers Index")

I now go a bit further and take some extra ratings data, covering a sixty-day span, from the restaurant and simply visualize it in bar chart form. This may lead to some other insights.

Below is the kind of code used for the following plots:
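(A minimal sketch only: the ratings export itself is not shown here, so the file name ratings.csv and the columns day and food_rating below are placeholders for whatever the real data contains.)

# Read the hypothetical ratings export and order the days of the week
ratings <- read.csv("ratings.csv")
ratings$day <- factor(
  ratings$day,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"))

# Average food rating per day of the week, shown as a bar chart
food_by_day <- tapply(ratings$food_rating, ratings$day, mean, na.rm = TRUE)
barplot(food_by_day, ylab = "Average Food Rating", main = "Food Ratings")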

Food Ratings


Results:

Given the forecast data, we see a potential spike in business during the early months of 2018, and 2019 looks set to enjoy a nice increase in the expected number of guests. There appears to be an overall upward trend, suggestive of steady growth over the next several years.

Given the ratings data, we can see from the bar chart visualizations that Friday is one of the best-rated days of the week, closely followed by Saturday. Sunday, however, seems to be the least favorite day among patrons and is consistently given poor reviews in all four areas.

Lastly, the final plot for recommendations confirms the findings from the earlier bar charts. Sunday receives the fewest “yes” recommendations while Friday receives the most. Friday is also the day most likely to receive a “No Response.” There are only a small number of “no” recommendations for Friday and Saturday, whereas the other days of the week have no recorded “no” recommendations at all.


Discussion:

These time series forecast plots are useful visualizations that provide a deeper look into business projections that would otherwise go unseen, and they convey information that a simple average or listing of the data cannot necessarily show without extra manipulation. It is up to the R user to determine what is important and why. Your analysis depends on the data you have and on what you are interested in knowing.

Though this all might seem a little complex, there are a few general steps to follow when attempting a time series analysis (a compact sketch of the whole workflow follows the list):

  • Step 1: Read in the data, make it a time series object, and do some exploratory analysis with plot(), abline(), boxplot(), and summary().
  • Step 2: Work on stationarizing the data. This process can be a little time consuming, and there is no single fix-all way to do it, so experiment with transformations, differencing, and so on as needed.
  • Step 3: Choose which type of model fitting works best for your data. This is open ended and really depends on the data set (and how much time you have), but ultimately it is important to play around and look at examples to get a feel for which steps to take and which functions to use.
  • Step 4: Make your forecast! Use a combination of techniques if necessary and see what the future holds. If what you get seems wrong, revise until you are able to make a few meaningful inferences.
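Put together, a compact sketch of that workflow on a generic monthly series (the file and column names here are placeholders, not the restaurant data) might look like this:

library(tseries)
library(forecast)

# Step 1: read the data and explore it
raw <- read.csv("my_series.csv")                     # placeholder file name
y <- na.remove(ts(raw$value, frequency = 12, start = 2008))
plot(y)
summary(y)
boxplot(y ~ cycle(y))

# Step 2: work toward stationarity and inspect the correlograms
y_diff <- diff(log(y))
plot(y_diff)
acf(y_diff)
pacf(y_diff)

# Step 3: fit a model (auto.arima chooses p, d, and q automatically)
fit <- auto.arima(y)

# Step 4: forecast two years ahead and plot the result
plot(forecast(fit, h = 24))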

Conclusions:

Time series forecasting is an interesting way to delve into the analysis of univariate data, and often we don’t have much more to go off of except the passing of time. It is good to begin to understand how to use the many different methods for such investigation into a dataset, and hopefully this overview has piqued the reader’s interest in further study.

For the restaurant: The data suggests that there is growth to look forward to, and the restaurant may be popular enough in coming years to think about either moving to a larger building or possibly opening a second location, if other factors outside of this analysis confirm it. There seems to be something amiss on Sundays, and it would benefit the business to investigate the problem. Perhaps a very brief survey or comment card could be handed out to guests on that day to better understand the issue. It could be any number of problems, and fixing whatever guests experience on Sundays would prevent both negative reviews and word of mouth from affecting the further success of the business.

For the reader: Time series analysis is a complicated subject when you are just beginning, so keep in mind that the more familiar you become with the underlying concepts of time series forecasting, the better your intuition will be for making accurate predictions. This was a quick survey of some of the things you can do in R to begin analyzing a time-dependent sequence of data, and by no means did I truly scratch the surface.

Rob Hyndman of Monash University seems to be the go-to guy for all things time series analysis. He is one of the creators of the forecast package used in this article!


Semester

Fall 2017

Researcher

Kenna Schoeler