# Exploratory Data Analysis and Prediction: Productivity of Garment Industry Employees This is a rather long post because it is originally a Jupyter notebook file, and so I included all my thought process, as well as code, when doing Exploratory Data Analysis (EDA) and building simple prediction models. With this post, I would like to showcase the typical routes I take when doing EDA, and building simple prediction models. ## Idea behind the project I was browsing Kaggle and came across a competition to predict productivity level of garment industry employees. Given the labour intensive nature of the industry with many manual processes, I thought it would be interesting to give it a go. The original dataset used can be found here. ## Housekeeping Programming Language: R 3.8.11 in Jupyter Notebook R Libraries used: - ggplot2 - tidyverse - reshape2 - glmnet - caret - car - grid - gridExtra - corrplot ## 1. Introduction This notebook contains the results of data analysis performed on a dataset containing important attributes of the garment manufacturing process and the productivity of the employees which had been collected manually and also been validated by the industry experts in Bangladesh. The aim of the data analysis is to build several models to predict employee productivity given some other factors in a factory. The models should be lightweight and lean, in order to maintain real-time capability, and satisfy the requirements of best-cost hardware of traction drives in an automative environment. The first section of this notebook will be about the exploratory data analysis (EDA) performed to explore and understand the data, taking a look at each attribute (variable) in the data to understand the nature and distribution of the attribute values. It also examines the correlation between the variables through visualisations. A brief summary is provided at the end to highlight the key findings of the EDA. This includes: section 2 [Exploratory Data Analysis](#sec_3). The second section shows the development of the linear regression models. It details the process used to build the models and shows the models at key points in the development process. Each model will be briefly analysed and its RMSE and summary statistics on the training set will be provided. This includes: section 3 [Methodology](#sec_4), and 4 [Model Development](#sec_5). Then, in the final section, we will deploy all the models on the testing set and create a comparison table of RMSEs and R-Squared for both training and testing set to compare the performance of the models and pick the best one. The final section provides the details of the model to ensure reproducibility. The final model is then presented along with an analysis and interpretation of its summary statistics. This includes: section 5: [Results and discussion](#sec_6). Two datasets were provided for the assignment - train.csv and test.csv. The exploratory data analysis and the model building were done using the train.csv dataset; the test.csv dataset was used to test the generated models. Let's now load the required libraries used in the notebook. ```R library(ggplot2) library(tidyverse) library(reshape2) library(glmnet) library(caret) library(car) library(grid) library(gridExtra) library(corrplot) ``` ## 2. Exploratory Data Analysis ### Splitting training and testing sets ```R # Load the dataset garment_data <- read_csv("garments_employee_productivity.csv") # Set random seed set.seed(2) # Create train and test sets, shuffling and spliting it 80:20 sample_size <- floor(0.8*nrow(garment_data)) train_index <- sample(seq_len(nrow(garment_data)),size = sample_size) #https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function train <- garment_data[train_index,] test <- garment_data[-train_index,] dim(train) # 957 records dim(test) # 240 records ```

### Quick overview of the train dataset ```R # Display the dimensions cat("The garment productivity dataset has", dim(train)[1], "records, each with", dim(train)[2], "attributes. The structure is:\n\n") # Display the structure str(train) cat("\nThe first few and last few records in the dataset are:") # Inspect the first few records head(train) # And the last few tail(train) cat("\nBasic statistics for each attribute are:") # Statistical summary summary(train) cat("The numbers of unique values for each attribute are:") apply(train, 2, function(x) length(unique(x))) ``` The garment productivity dataset has 957 records, each with 15 attributes. The structure is: Classes 'tbl_df', 'tbl' and 'data.frame': 957 obs. of 15 variables: $ date : Date, format: "2015-02-28" "2015-02-10" ... $ quarter : chr "Quarter4" "Quarter2" "Quarter3" "Quarter4" ... $ department : chr "sweing" "sweing" "finishing" "finishing" ... $ day : chr "Saturday" "Tuesday" "Sunday" "Saturday" ... $ team : num 10 7 11 9 8 3 7 1 4 4 ... $ targeted_productivity: num 0.7 0.5 0.6 0.6 0.8 0.5 0.35 0.8 0.8 0.8 ... $ smv : num 21.82 30.1 2.9 3.94 2.9 ... $ wip : num 1448 1025 1688 1448 970 ... $ over_time : num 6120 6960 1200 1440 4800 ... $ incentive : num 40 23 0 0 0 30 0 0 88 60 ... $ idle_time : num 0 0 0 0 0 0 0 0 0 0 ... $ idle_men : num 0 0 0 0 0 0 0 0 0 0 ... $ no_of_style_change : num 1 1 0 0 0 0 1 0 0 0 ... $ no_of_workers : num 51 58 10 8 8 56.5 58 10 57.5 56.5 ... $ actual_productivity : num 0.7 0.501 0.715 0.261 0.398 ... The first few and last few records in the dataset are:

date	quarter	department	day	team	targeted_productivity	smv	wip	over_time	incentive	no_of_style_change	no_of_workers	actual_productivity
2015-02-28	Quarter4	sweing	Saturday	10	0.7	21.82	1448	6120	40	1	51.0	0.7002366
2015-02-10	Quarter2	sweing	Tuesday	7	0.5	30.10	1025	6960	23	1	58.0	0.5008017
2015-02-15	Quarter3	finishing	Sunday	11	0.6	2.90	1688	1200	0	0	10.0	0.7153333
2015-01-24	Quarter4	finishing	Saturday	9	0.6	3.94	1448	1440	0	0	8.0	0.2611742
2015-01-22	Quarter4	finishing	Thursday	8	0.8	2.90	970	4800	0	0	8.0	0.3977431
2015-01-15	Quarter3	sweing	Thursday	3	0.5	22.52	1102	10170	30	0	56.5	0.6606833

date	quarter	department	day	team	targeted_productivity	smv	wip	over_time	incentive	no_of_workers	actual_productivity
2015-02-01	Quarter1	sweing	Sunday	7	0.70	24.26	1400	6720	0	56	0.4115536
2015-03-02	Quarter1	sweing	Monday	1	0.65	26.66	1527	6840	65	57	0.8005795
2015-01-10	Quarter2	finishing	Saturday	9	0.80	3.94	966	1440	0	8	0.8581439
2015-01-06	Quarter1	finishing	Tuesday	3	0.75	4.15	808	1800	0	10	0.8991667
2015-02-23	Quarter4	finishing	Monday	8	0.70	5.13	1583	960	0	8	0.9304167
2015-02-02	Quarter1	finishing	Monday	12	0.75	4.08	1435	1080	0	9	0.7404444

Basic statistics for each attribute are: date quarter department day Min. :2015-01-01 Length:957 Length:957 Length:957 1st Qu.:2015-01-17 Class :character Class :character Class :character Median :2015-02-03 Mode :character Mode :character Mode :character Mean :2015-02-03 3rd Qu.:2015-02-23 Max. :2015-03-11 team targeted_productivity smv wip Min. : 1.000 Min. :0.0700 Min. : 2.90 Min. : 7 1st Qu.: 3.000 1st Qu.:0.7000 1st Qu.: 3.94 1st Qu.: 749 Median : 6.000 Median :0.7500 Median :15.26 Median : 1017 Mean : 6.373 Mean :0.7283 Mean :15.44 Mean : 1093 3rd Qu.: 9.000 3rd Qu.:0.8000 3rd Qu.:25.90 3rd Qu.: 1231 Max. :12.000 Max. :0.8000 Max. :54.56 Max. :23122 over_time incentive idle_time idle_men Min. : 0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000 1st Qu.: 1440 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000 Median : 4080 Median : 23.0 Median : 0.0000 Median : 0.0000 Mean : 4649 Mean : 41.2 Mean : 0.7978 Mean : 0.3312 3rd Qu.: 6960 3rd Qu.: 50.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000 Max. :25920 Max. :3600.0 Max. :300.0000 Max. :45.0000 no_of_style_change no_of_workers actual_productivity Min. :0.0000 Min. : 2.00 Min. :0.2337 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:0.6402 Median :0.0000 Median :34.00 Median :0.7537 Mean :0.1546 Mean :35.14 Mean :0.7293 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:0.8501 Max. :2.0000 Max. :89.00 Max. :1.1204 The numbers of unique values for each attribute are as follow:

date: 59
quarter: 5
department: 2
day: 6
team: 12
targeted_productivity: 9
smv: 68
wip: 471
over_time: 134
incentive: 48
idle_time: 10
idle_men: 9
no_of_style_change: 3
no_of_workers: 58
actual_productivity: 714

```R # Exploring the number of values of each individual attribute (targeted_productivity in this case) length(garment_data$targeted_productivity) length(unique(garment_data$targeted_productivity)) table(garment_data$targeted_productivity) ``` 1197 9 0.07 0.35 0.4 0.5 0.6 0.65 0.7 0.75 0.8 1 27 2 49 57 63 242 216 540 ## Summary of attributes The following 2 tables contain information on each attribute for the train dataset. The first table is obtained from the original UCI Repository, which states the description for each attribute. The second table is obtained through individual examination of each attribute, which identifies which attributes are numerical and whether they are continuous or discrete, and which are categorical and whether they are nominal or ordinal. It includes some initial observations about the ranges and common values of the attributes. **Table 1: Attribute description** |ID|Attribute|Description| |----|-----------|-------------| |1|date|Date in MM-DD-YYYY| |2|quarter|A portion of the month. A month was divided into four quarters| |3|department|Associated department with the instance| |4|day|Day of the Week| |5|team|Associated team number with the instance| |6|targeted_productivity|Targeted productivity set by the Authority for each team for each day.| |7|smv|Standard Minute Value, it is the allocated time for a task| |8|wip|Work in progress. Includes the number of unfinished items for products| |9|over_time|Represents the amount of overtime by each team in minutes| |10|incentive|Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action| |11|idle_time|The amount of time when the production was interrupted due to several reasons| |12|idle_men|The number of workers who were idle due to production interruption| |13|no_of_style_change|Number of changes in the style of a particular product| |14|no_of_workers|Number of workers in each team| |15|actual_productivity|The actual % of productivity that was delivered by the workers. It ranges from 0-1| **Table 2: Attribute type** |ID|Attribute|Type|Sub-type|Comments| |--|---------|----|--------|--------| |1|date|Categorical|Ordinal|Contains duplicates. Range from 2015-01-01 to 2015-03-11, which is 59 days. Can drop| |2|quarter|Categorical|Ordinal|Has 5 values. Quarter 1-5| |3|department|Categorical|Nominal|Has 2 values. 'sweing' and 'finishing'| |4|day|Categorical|Ordinal|Has 6 values. Friday does not appear. Dayoff?| |5|team|Categorical|Nominal|Has 12 values, for 12 teams| |6|targeted_productivity|Numerical|Continuous|Has 9 unique values - range is 0.07 - 0.8. Very likely to have outliers| |7|smv|Numerical|Continuous|Values range from 2.9 to 54.56| |8|wip|Numerical|Discrete|Values range from 7 to 23122. Mistake? Extreme outliers| |9|over_time|Numerical|Discrete|Values range from 0 to 25920. Extreme outliers. The majority are 0, 960, 1440| |10|incentive|Numerical|Continuous|Values range from 0 to 3600. Extreme outliers. The majority is 0| |11|idle_time|Numerical|Continuous|Values range from 0 to 300. Extreme outliers. The majority is 0| |12|idle_men|Numerical|Discrete|Values range from 0 to 45. The majority is 0| |13|no_of_style_change|Numerical|Discrete|Only has 3 values - 0,1,2. The majority is 0| |14|no_of_workers|Numerical|Discrete|Ranges from 2 to 89. The majority is 8. There are some numbers that include decimals i.e. 51.5. Mistake?| |15|actual_productivity|Numerical|Continuous|Values range from 0.2337 to 1.1204| The 'date' attribute is unlikely to be helpful for our analysis, so we will remove it. Also, we need to change the 3 attributes with character type into factor and then numeric for EDA. ### Investigate distribution for each variable #### View the variable distributions using boxplots ```R # Generate box plots of all variables except the first 5 categorical ones ggplot(gather(train[,-c(1:5)]), aes(key,value)) + geom_boxplot() + facet_wrap(~key, scales="free") + labs(title = "Box plots showing the distribution of all variables", subtitle ="except the first 3 categorical variables") ``` ![png](output_16_0.png) #### View the variable distributions using histograms and bar charts. ```R # Plot a histogram or bar chart of each variable except the first 5 categorical ones ggplot(gather(train[,-c(1:5)]), aes(value)) + geom_histogram(bins=20,color='black', fill='white') + facet_wrap(~key, scales = 'free') + labs(title = "Histograms showing the distribution of all variables", subtitle ="except the first 3 categorical variables") ``` ![png](output_18_0.png) The distribution graphs above show that: - actual_productivity has multiple lower outliers, potentially leading to a negative skew as we can observe the longer tail to the left. - Most of our attributes are heavily skewed, both positive and negative, except for perhaps actual_productivity, and team, which appears to be approximately normal and uniform respectively. - There is 0 idle time and idle workers for the majority of the time. - There is 0 incentive offered for the majority of the time. - Very few to none style changes, except for a few products. - 12 teams working with quite uniform distribution of activities. Team 1 and 2 have the highest number of activities. - Number of workers in a team can vary greatly, with most having around 8 or 58 workers. - Targeted productivity is set at 0.8 for the majority of the time. - The allocated time for a task is usually within 20 minutes, most being around 3 minutes. ```R # Plot a bar chart for each of the first 5 categorical variables except the date variable ggplot(gather(train[,c(2:5)]), aes(value)) + geom_bar(color='black', fill='white') + facet_wrap(~key, scales ='free') + labs(title = "Bar charts showing the distribution of the first 3 categorical variables") ``` ![png](output_20_0.png) The bar graphs above show that: - Production activities are spread quite evenly throughout weekdays and weekends, except for Friday, which seems to be a day off. - There are close to 200 more sewing tasks than finishing in total. - More than half production activities happen during the first two quarters of the month. ### Correlations #### Correlation table and graphs We can obtain the correlation table below by excluding the first 5 categorical variables: date, day, quarter, department and team. ```R cor(train[,-c(1:5)]) ```

	targeted_productivity	smv	wip	over_time	incentive	idle_time	idle_men	no_of_style_change	no_of_workers	actual_productivity
targeted_productivity	1.00000000	-0.07144559	0.04422815	-0.08170874	0.03008378	-0.05445067	-0.03714038	-0.22411472	-0.08979954	0.43233088
smv	-0.07144559	1.00000000	0.04596410	0.66646288	0.01351429	0.05217029	0.10327150	0.29267799	0.90685936	-0.10155147
wip	0.04422815	0.04596410	1.00000000	0.05920016	0.03390728	-0.02107011	-0.04142333	-0.04353742	0.08284851	0.08858794
over_time	-0.08170874	0.66646288	0.05920016	1.00000000	-0.02237368	0.03087565	-0.02145995	0.04684807	0.72603916	-0.03850287
incentive	0.03008378	0.01351429	0.03390728	-0.02237368	1.00000000	-0.01172284	-0.01910988	-0.03115262	0.03064095	0.08048132
idle_time	-0.05445067	0.05217029	-0.02107011	0.03087565	-0.01172284	1.00000000	0.64762633	-0.01342704	0.05674108	-0.08274623
idle_men	-0.03714038	0.10327150	-0.04142333	-0.02145995	-0.01910988	0.64762633	1.00000000	0.12920875	0.10682784	-0.15130357
no_of_style_change	-0.22411472	0.29267799	-0.04353742	0.04684807	-0.03115262	-0.01342704	0.12920875	1.00000000	0.32023164	-0.20079279
no_of_workers	-0.08979954	0.90685936	0.08284851	0.72603916	0.03064095	0.05674108	0.10682784	0.32023164	1.00000000	-0.02474232
actual_productivity	0.43233088	-0.10155147	0.08858794	-0.03850287	0.08048132	-0.08274623	-0.15130357	-0.20079279	-0.02474232	1.00000000

Let's visualise this table with the following graph: ```R corrplot.mixed(cor(train[,-c(1:5)]), lower="ellipse", upper="number") ``` ![png](output_26_0.png) Let's try another graph to see the data points: ```R colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60') ## colorRamp() returns a function which takes as an argument a number ## on [0,1] and returns a color in the gradient in colorRange myColorRampFunc <- colorRamp(colorRange) panel.cor <- function(w, z, ...) { correlation <- cor(w, z) ## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 ) ## square it to avoid visual bias due to "area vs diameter" radius <- sqrt(abs(correlation)) radians <- seq(0, 2*pi, len = 50) # 50 is arbitrary x <- radius * cos(radians) y <- radius * sin(radians) ## make them full loops x <- c(x, tail(x,n=1)) y <- c(y, tail(y,n=1)) ## trick: "don't create a new plot" thing by following the ## advice here: http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/ ## This allows par(new=TRUE) plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1) polygon(x, y, border=col, col=col) } # usage e.g.: # pairs(mtcars, upper.panel = panel.cor) ``` ```R pairs(train[,-c(1:5)], lower.panel=panel.cor) ``` ![png](output_29_0.png) The correlation graphs and table above show that: - Most correlations between attributes are weak and insignificant. There is only a few moderate and strong positive correlations. - smv and no_of_workers have a strong positive correlation. This makes sense because we would expect the longer the task, the more workers are allocated to it. - overtime and no_of_workers have a moderate positive correlation, as well as overtime and smv. - idle_time and idle_men also have a moderate positive correlation, most likely because they are mostly 0s. #### Investigate the connections of the 3 categorical variables day, quarter, department to others, especially actual_productivity ```R ggplot(melt(as.data.frame(train[,-1])),aes(x=day,y=value)) + facet_wrap(~variable, scales="free") + geom_boxplot() ``` Using quarter, department, day as id variables ![png](output_32_1.png) Key point: - We can notice there is virtually no differences between the days of the week for all variables ```R ggplot(melt(as.data.frame(train[,-1])),aes(x=quarter,y=value)) + facet_wrap(~variable, scales="free") + geom_boxplot() ``` Using quarter, department, day as id variables ![png](output_34_1.png) Key points: - Quarter 5 appears to have higher actual_productivity than all the other quarters. - Quarter 1 has the most outliers for wip and idle_time, which makes sense as it is the beginning of working period. - Quarter 2 has the most outliers for incentive. - Quarter 3 and 4 seems to be the low point for actual_productivity. Also they have the most outliers in idle_men. ```R ggplot(melt(as.data.frame(train[,-1])),aes(x=department,y=value)) + facet_wrap(~variable, scales="free") + geom_boxplot() ``` Using quarter, department, day as id variables ![png](output_36_1.png) Key points: - The smv is generally longer for the sewing department, compared to the finishing department. - The sewing department has more outliers when it comes to wip, idle_time, and idle_men. - The sewing department works over_time for much longer on average. - The sewing department has more no_of_workers than the finishing department, with an average of around 60 workers, compared to 12 for finishing. - On average, the 2 departments seem to be equal in terms of actual_productivity, but the sewing department has more consistent average actual_productivity with many more unproductive outliers. #### Investigate if date attribute has an effect on actual_productivity Let's plot actual_productivity for all the dates in our training dataset, which is over a month and 5 quarters. ```R ggplot(train,aes(x = reorder(date,actual_productivity,median),y = actual_productivity)) + geom_boxplot(outlier.shape=NA, mapping=aes(fill=quarter)) + theme(legend.position = "bottom", legend.box = "horizontal") + labs(x="Date (ordered by median actual_productivity)", title="Boxplots showing actual_productivity distribution for all dates") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6)) ``` ![png](output_39_0.png) This again confirms our findings above that Quarter 5 is much more productive on average than other quarters. Also Quarter 3 and 4 seems to be where actual_productivity dips the most. #### Investigate the relationship between actual_productivity and targeted_productivity Let's first plot actual_productivity against targeted_productivity to check for correlations. I also distinguished between the 2 departments. ```R ggplot(train, aes(x=targeted_productivity,y=actual_productivity, color=department)) + geom_point() + geom_smooth(method=lm,se=FALSE) + labs(title = "Scatterplot showing the relationship between actual productivity and targeted_productivity") ``` `geom_smooth()` using formula 'y ~ x' ![png](output_42_1.png) There is a bit of a heteroskedastic pattern between these 2 variables. Targeted_productivity is often set to be at 0.8, which is also the highest value. However, we find that as the targeted_productivity increases greatly, the variance in actual_productivity also seems to increase. This suggests overly ambitious target can backfire. The sewing department's actual_productivity seems to be more positively correlated with its targeted_productivity than the finishing department. Also, we can notice that, for targeted_productivity, the number is usually a well-rounded value compared to the widely varied value of actual_productivity. Hence, I believe it might be interesting to take the median (to account for outliers) of actual_productivity across the different levels of targeted_variable in order to visualise the relationships better, which is what the graph below is about. ```R summarize(group_by(train, targeted_productivity), median=median(actual_productivity)) %>% ggplot(aes(x=targeted_productivity,y=median)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Scatterplot showing the relationship between median of actual productivity and targeted_productivity") ``` `geom_smooth()` using formula 'y ~ x' ![png](output_44_1.png) We can notice a relatively strong correlation between these 2 variables now. It is also interesting that when the median of targeted_productivity is above or equal to 0.5, it seems that there is a perfect positive linear relationship with target_productivity. It might point to the self-fulfilling prophecy phenomenon. Let's check their correlation score in this case. ```R # Correlation table of actual_productivity and targeted_productivity, with mean of actual_productivity across different levels of targeted_productivity a <- summarize(group_by(train, targeted_productivity), median_actual_productivity=median(actual_productivity)) cor(a) ```

	targeted_productivity	median_actual_productivity
targeted_productivity	1.0000000	0.7742251
median_actual_productivity	0.7742251	1.0000000

#### Investigate the relationship between actual_productivity and incentive We would expect incentive to be the most correlated attribute to actual_productivity based on intuition. However, the correlation graphs we made above did not support this claim, with the score being around 0.08, which is very close to no relationship at all. Let's dig deeper to see what that is about. I will start with a scatterplot showing the relationship between actual_productivity and incentive. ```R train %>% ggplot(aes(x=incentive,y=actual_productivity)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Box plots showing actual_productivity distribution for different team") ``` `geom_smooth()` using formula 'y ~ x' ![png](output_49_1.png) We can notice immediately that the relationship is extremely skewed by many 0 values and a number of large outliers, which are all above or equal to 960. I will explore this point later in a bit. For now, let's try and remove the 0 values and outliers to see if there is a pattern. ```R train %>% filter(incentive < 960 & incentive != 0) %>% ggplot(aes(x=incentive,y=actual_productivity)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Scatterplot showing relationship between actual_productivity and incentive with 0s and outliers removed from incentive") ``` `geom_smooth()` using formula 'y ~ x' ![png](output_51_1.png) We can indeed see that there is now a relatively strong positive correlation between actual_productivity and incentive after we remove outliers (above 960 BDT) and zero-value incentives. Let's check the correlation score. ```R # Correlation table of actual_productivity and incentive with 0 values and outliers removed a <- train %>% filter(incentive < 960 & incentive != 0) %>% select(incentive, actual_productivity) cor(a) ```

	incentive	actual_productivity
incentive	1.0000000	0.8147412
actual_productivity	0.8147412	1.0000000

Next, each team also received different total amount of incentive. Let's see if this affects actual_productivity in any way. I will start with a bar chart showing the total amount of incentive received by each team. ```R train %>% group_by(team) %>% summarise(incentive=sum(incentive)) %>% ggplot(aes(team,incentive)) + geom_bar(stat='identity', color='black', fill='white') + scale_x_discrete(limits = factor(1:12)) + labs(title = "Bar chart showing the total amount of incentive received by each team") ``` ![png](output_55_0.png) From the above bar graph, we can clearly see that Team 9 received the most amount of incentive, with most other teams receive half this amount. We could reasonably expect that this might influence productivity of each team at least a little bit, particularly making Team 9 if not the most productive, at least one of the top performing teams. Let's see if it is the case. ```R grid.arrange( ggplot(train, aes(x=team,y=actual_productivity,group=team)) + geom_boxplot() + scale_x_discrete(limits = factor(1:12)) + labs(title = "Box plots showing actual_productivity distribution for all teams"), train %>% group_by(team) %>% summarise(actual_productivity=median(actual_productivity)) %>% ggplot(aes(team,actual_productivity)) + geom_bar(stat='identity', color='black', fill='white') + scale_x_discrete(limits = factor(1:12)) + labs(title = "Bar chart showing the median actual productivity for all teams"),ncol=2) ``` ![png](output_57_0.png) As we can see from the two graphs above, Team 9 is not the most productive, nor is it in the top 5 most productive teams on average. The most productive teams are Team 1, 12, and 3 in that order. All of them only received an average amount of incentive around 3500 BDT. On the other hand, the teams that received the least amount of incentive, which are Team 6, 7, 8, have the lowest actual productivity on average. It seems that after reaching a certain threshold, the amount of extra incentive stop mattering for actual productivity. Next, let's check the distribution of these incentives by different teams to see if we can find a pattern. ```R ggplot(train, aes(x=team,y=incentive,group=team)) + geom_boxplot() + scale_x_discrete(limits = factor(1:12)) + labs(title = "Box plots showing incentive distribution for different team") ``` ![png](output_59_0.png) It seems that on average most teams received the same amount of incentive. However, there is a number of extreme outliers that completely skewed the incentive distributions for most teams. The majority of these outliers lies around 960 BDT and above. This relates back to our scatterplot between incentive and actual_productivity above. Let's take a closer look at the records of these outliers. ```R # Checking the outliers that show up in the box plots above train %>% filter(incentive >= 960) ```

date	quarter	department	day	team	targeted_productivity	smv	wip	incentive	no_of_workers	actual_productivity
2015-03-09	Quarter2	finishing	Monday	9	0.75	2.90	1161	3600	15	0.8410000
2015-03-09	Quarter2	finishing	Monday	10	0.70	2.90	1161	960	8	0.4772917
2015-03-09	Quarter2	finishing	Monday	2	0.70	3.90	1161	1200	10	0.6825000
2015-03-09	Quarter2	finishing	Monday	8	0.65	3.90	1161	960	8	0.2640625
2015-03-09	Quarter2	finishing	Monday	5	0.60	3.94	1161	2880	12	0.8643426
2015-03-09	Quarter2	finishing	Monday	12	0.80	4.60	1161	1080	9	0.9029630
2015-03-09	Quarter2	finishing	Monday	4	0.75	3.94	1161	960	8	0.7953875
2015-03-09	Quarter2	finishing	Monday	3	0.80	4.60	1161	1440	12	0.7954167
2015-03-09	Quarter2	finishing	Monday	11	0.80	2.90	1161	960	8	0.9606250

We can see that all of these outliers occur on the same day, which is 2015-03-09 in Quarter 2. This might mean a number of things. It could be that this day is especially challenging and important to management so they want to incentivise workers to be more productive, or it is just a set date for this type of incentive to be paid out. Whatever the reason for this may be, I believe it is safe to assume that this type of incentive does not happen often and its effects on actual_productivity can vary greatly. Thus, these outliers can be safely removed from our data as it will not help us in building a predictive model. Lastly, I will try to replot incentive using a log scale to see if it has a log-normal distribution, since its distribution is highly skewed. ```R # Replot incentive using a log_scale to see if this variables have a log-normal distribution mutate(train, log_inc = log(incentive + 0.01)) %>% ggplot(aes(x=log_inc)) + geom_histogram(bins=20, color='black', fill='white') ``` ![png](output_64_0.png) It does not seem to have a log-normal distribution, so I believe adding another incentive attribute as a categorical variable where it indicates the amount of incentive is above 0 and below 960 might be a good predictor for our model. #### Investigate the relationship between actual_productivity and no_of_workers Let's see if the number of workers has an effect on actual_productivity. I will start with a scatterplot between these two variables. ```R scatterplot(actual_productivity ~ no_of_workers, data=train, main="Scatterplot showing relationship between actual_productivity and no_of_workers", grid=FALSE) ``` ![png](output_68_0.png) Again, there is lots of data points and outliers for actual_productivity at several levels of no_of_workers. Next, similar to targeted_productivity, I will attempt to take the median of actual_productivity and plot them across the levels of no_of_workers to see if the pattern is clearer. ```R ap <- train %>% group_by(no_of_workers) %>% summarise(median_actual_productivity=median(actual_productivity)) scatterplot(median_actual_productivity ~ no_of_workers, main="Scatterplot showing relationship between median of actual_productivity and no_of_workers", data=ap, grid=FALSE) ``` ![png](output_70_0.png) We can notice an interesting pattern where actual_productivity on average rises to a certain number of no_of_workers and then decreases and rises again. Could it be a cubic relationship? This might be explained according to the Law of Diminishing Marginal Returns. As more workers are added initially, the productivity rate increases, until the number of workers becomes too cumbersome, which leads to a decrease in the rate. Eventually, if we keep increasing the number of workers, at a certain level the productivity rate might rise again as certain tasks are more optimised with a larger number of workers, and so on. #### Investigate the relationship between actual_productivity and wip Let's start with a scatterplot between these two variables. ```R scatterplot(actual_productivity ~ wip, main="Scatterplot showing relationship between actual_productivity and wip", data=train, grid=FALSE) ``` ![png](output_73_0.png) We notice several values for wip that greatly skewed the distribution. Let's remove these outliers and plot them again. ```R ap <- train %>% filter(wip<5000) scatterplot(actual_productivity ~ wip, main="Scatterplot showing relationship between actual_productivity and wip with outliers removed", data=ap, grid=FALSE) ``` ![png](output_75_0.png) The values of actual_productivity seem to be quite consistent across the number of wip, and there is no discernable pattern. #### Investigate the relationship between actual_productivity and over_time Let's start with a scatterplot between these two variables. I also added a scatterplot with median values of actual_productivity since it seems to be concentrated around a few values of over_time. ```R scatterplot(actual_productivity ~ over_time, main="Scatterplot showing relationship between actual_productivity and over_time", data=train, pch=c(0,1,2), grid=FALSE) ap <- train %>% group_by(over_time) %>% summarise(median_actual_productivity=median(actual_productivity)) scatterplot(median_actual_productivity ~ over_time, main="Scatterplot showing relationship between median of actual_productivity and over_time", data=ap, pch=c(0,1,2), grid=FALSE) ``` ![png](output_78_0.png) ![png](output_78_1.png) The values of actual_productivity seem to slightly decrease as over_time increases, which makes sense, but the relationship is weak. Overall, the variance of actual_productivity is quite high and there is no other clear pattern. #### Investigate the relationship between actual_productivity and smv Let's start with a scatterplot between these two variables. I also added a scatterplot with median values of actual_productivity since it seems to be concentrated around a few values of smv. ```R scatterplot(actual_productivity ~ smv, main="Scatterplot showing relationship between actual_productivity and smv", data=train, grid=FALSE) ap <- train %>% group_by(smv) %>% summarise(median_actual_productivity=median(actual_productivity)) scatterplot(median_actual_productivity ~ smv, main="Scatterplot showing relationship between median of actual_productivity and smv", data=ap, grid=FALSE) ``` ![png](output_81_0.png) ![png](output_81_1.png) It seems for longer standard minute value set for a task, the actual productivity decreases slightly. But overall, the variance of actual_productivity is quite high and there is no clear pattern. ### Summary of EDA After completing EDA on our train dataset, the following main points and relationships were discovered: - Most correlations between attributes are weak and insignificant. There is only a few moderate and strong positive correlations. - No differences between the days of the week for all variables. - Quarter 5 appears to have higher actual_productivity than all the other quarters. - Quarter 3 and 4 seems to be the low point for actual_productivity. Also they have the most outliers in idle_men. - On average, the 2 departments seem to be equal in terms of actual_productivity, but the sewing department has more consistent average actual_productivity, with many more unproductive outliers. - smv and no_of_workers have a strong positive correlation. This makes sense because we would expect the longer the task, the more workers are allocated to it. - There is a relatively strong positive correlation between median of actual_productivity and targeted_productivity. - There is strong positive correlation between actual_productivity and incentive after we remove outliers (above 960 BDT) and zero-value incentives. - Team 9 received the most amount of incentive, with most other teams receive around half this amount. - Team 9 is not the most productive, nor is it in the top 5 most productive teams on average. The most productive teams are Team 1, 12, and 3 in that order. All of them only received an average amount of incentive around 3500 BDT. - On the other hand, the teams that received the least amount of incentive, which are Team 6, 7, 8, have the lowest actual productivity on average. - It seems that after reaching a certain threshold, the amount of extra incentive stop mattering for actual productivity. - All incentive outliers occur on the same day, which is 2015-03-09 in Quarter 2. - There is an interesting pattern where actual_productivity on average rises to a certain number of no_of_workers and then decreases and rises again. - The values of actual_productivity seem to be quite consistent across the number of wip, and there is no discernable pattern. - The values of actual_productivity seem to slightly decrease as over_time increases, but the relationship is weak. - The actual productivity decreases slightly as smv increases, but the relationship is weak. ## 3. Methodology With EDA done, I will now describe the methodology I used for building models before building them. ### Thought process #### Simple linear regression model containing all variables At the start of model building, I started with a simple linear regression that contains all the independent variables. This is to achieve two things: - To see which variables are significant or not. - To have sort of a reference model to make improvements upon. #### Second model with transformations, incorporating knowledge obtained from EDA Next, I moved on to build the second model, also using linear regression, but this time transformations of initial variables were added, which hopefully would result in a noticeable improvement. At this stage, I tried to incorporate as much knowledge obtained from the EDA process as much as possible. Specifically, I did the following things: - Add a categorical variable called good_incentive, that is 1 when incentive < 960 and not 0. This is to capture the possible positive relationship between actual_productivity and incentive that is not outliers or 0. - Add a categorical variable called good_team, that is 1 when it is team 1, 2, 3, 4 or 12, because these teams have actual_productivity above the median of all teams. This is to simplify the team categorical variable. - Add a categorical variable called good_quarter, that is 1 when it is quarter 5, 1, or 2, because these quarters are when actual_productivity is above the median of all quarters. This is to simplify the quarter categorical variable. - Add no_of_workers^3 because scatterplot shows potential cubic relationship with actual_productivity. - Add no_of_workers^2 and no_of_workers^5 through trial and error. - Add interaction term between smv and no_of_workers due to high positive correlation. - Remove all the other insignificant variables. There should be significant improvement over the first model. #### Third model with Stepwise regression After obtaining our second model with transformations based on EDA, I decided to try Stepwise regression to see if this automatic feature selection can offer any viable alternatives, compared to the model I built with EDA. I tried the brute force approach with both AIC and BIC criteria, and the direction is set to both forward and backward, to see if there is a different in model generated. The brute force approach involved passing all the basic transformations as predictors (i.e. interaction terms between all variables, squared variable terms, and log...) The results were: - AIC resulted in a very complex models with around 50 variables. - BIC resulted in a little bit less complex model with 21 variables. This model has a bit lower R-squared and RMSE on the training set than the AIC model, but far less complex. Hence, the BIC model is preferrable to AIC. #### Fourth model with Lasso regression Lastly, since our data is riddled with outliers, I wanted to try out Lasso Regression as it is more robust to outliers than other method such as Ridge (Rodenburg, 2019). Also because Lasso Regression has built-in feature selection where the coeficients can be shrunk to zero, I applied the brute force model similar to Stepwise regression above. We can then see the comparison between Lasso regression and Stepwise regression with this approach. The results were: - Lasso regression model with brute force approach is a very large model with 55 variables in total. This is significantly more complex than our model using BIC as a criteria. - Its RMSE and R-squared on the training set is slightly better than that of the BIC model. Hence, it is safe to say that we can disregard this model due its complexity and mediocre performance. ### Bringing it all together After finishing building our 4 models as described, I trained them all on our testing set and report the RMSE and R-Squared for each in a table for easy comparison. For the R-Squareds, in our case they are all relatively low (range from around 0.25 to 0.5). However, this is not necessarily bad if the dependent variable is a properly stationarized series (Duke University, n.d.). Next for RMSEs, we look for consistency between the training and testing set. The most consistent model is our second linear model with transformations informed by our EDA. The Lasso Regression and Stepwise Regression models all demonstrate overfitting potential with much lower RMSE for training set than testing set and a large number of predictors. ### Summary of model choice Based on all the information above, I think it is preferable to go with our second Linear Model, due to its relative simplicity and consistent performance. Then, further analysis into our model of choice can be done, such as influence plot and influential outliers test etc. ## 4. Model Development ### Final tweaks to our train dataset before we build our models. I will drop the date attribute since it is unlikely to be helpful for us. I will also change the categorical variables to factors. ```R # Dropping date attribute since it will not be used for building our models train <- train[,-1] # Change character attributes to factor before building our models train[,1:3] <- as.data.frame(sapply(train[,1:3],as.factor)) # Order the day attribute train$day <- factor(train$day, levels = c("Monday", "Tuesday", "Wednesday","Thursday","Saturday","Sunday")) str(train) ## Do the same for testing set # Dropping date attribute since it will not be used for building our models test <- test[,-1] # Change character attributes to factor before building our models test[,1:3] <- as.data.frame(sapply(test[,1:3],as.factor)) # Order the day attribute test$day <- factor(test$day, levels = c("Monday", "Tuesday", "Wednesday","Thursday","Saturday","Sunday")) str(test) ``` Classes 'tbl_df', 'tbl' and 'data.frame': 957 obs. of 14 variables: $ quarter : Factor w/ 5 levels "Quarter1","Quarter2",..: 4 2 3 4 4 3 2 1 5 4 ... $ department : Factor w/ 2 levels "finishing","sweing": 2 2 1 1 1 2 2 1 2 2 ... $ day : Factor w/ 6 levels "Monday","Tuesday",..: 5 2 6 5 4 4 1 5 4 2 ... $ team : num 10 7 11 9 8 3 7 1 4 4 ... $ targeted_productivity: num 0.7 0.5 0.6 0.6 0.8 0.5 0.35 0.8 0.8 0.8 ... $ smv : num 21.82 30.1 2.9 3.94 2.9 ... $ wip : num 1448 1025 1688 1448 970 ... $ over_time : num 6120 6960 1200 1440 4800 ... $ incentive : num 40 23 0 0 0 30 0 0 88 60 ... $ idle_time : num 0 0 0 0 0 0 0 0 0 0 ... $ idle_men : num 0 0 0 0 0 0 0 0 0 0 ... $ no_of_style_change : num 1 1 0 0 0 0 1 0 0 0 ... $ no_of_workers : num 51 58 10 8 8 56.5 58 10 57.5 56.5 ... $ actual_productivity : num 0.7 0.501 0.715 0.261 0.398 ... Classes 'tbl_df', 'tbl' and 'data.frame': 240 obs. of 14 variables: $ quarter : Factor w/ 5 levels "Quarter1","Quarter2",..: 1 1 1 1 1 1 1 1 1 1 ... $ department : Factor w/ 2 levels "finishing","sweing": 2 2 1 2 1 2 2 1 2 1 ... $ day : Factor w/ 6 levels "Monday","Tuesday",..: 4 4 4 4 4 5 5 6 6 1 ... $ team : num 6 7 10 4 11 5 10 3 2 11 ... $ targeted_productivity: num 0.8 0.8 0.65 0.65 0.7 0.8 0.75 0.75 0.8 0.8 ... $ smv : num 25.9 25.9 3.94 23.69 4.15 ... $ wip : num 1170 984 861 861 861 659 610 884 782 666 ... $ over_time : num 1920 6720 960 7200 1440 7080 6480 1560 6660 2400 ... $ incentive : num 50 38 0 0 0 50 56 0 50 0 ... $ idle_time : num 0 0 0 0 0 0 0 0 0 0 ... $ idle_men : num 0 0 0 0 0 0 0 0 0 0 ... $ no_of_style_change : num 0 0 0 0 0 0 0 0 0 0 ... $ no_of_workers : num 56 56 8 60 12 31.5 54 8 55.5 10 ... $ actual_productivity : num 0.8 0.8 0.706 0.521 0.436 ... ### Define some Functions to measure the model accuracy These functions are used during the model building to evaluate the model accuracy. #### Function to Calculate Model Accuracy Statistics Name: Model.Accuracy Input parameters: - predicted - a vector of predictions - target - a vector containing the target values for the predictions - df - the degrees of freedom - p - the number of parameters excluding the coefficient Return Value: A list containing: - rsquared - the R-Squared value calculated from the predicted and target values - rse - the residual standard error - f.stat - the F-statistic Description: Calculate the TSS and RSS as: - TSS: $\sum_{i=1}^n (y_i - \bar y)^2$ - RSS: $\sum_{i=1}^n (\hat y_i - y_i)^2$ Calculate the statistics according to the following formulae: - R-Squared value: $R^2 = 1 - \frac{RSS}{TSS}$ - Residual standard error - $\sqrt{\frac{1}{df}RSS}$ - F-statistics - $\frac{(TSS - RSS)/p}{RSS / df}$ ```R Model.Accuracy <- function(predicted, target, df, p) { rss <- 0 tss <- 0 target.mean <- mean(target) for (i in 1:length(predicted)) { rss <- rss + (predicted[i]-target[i])^2 tss <- tss + (target[i]-target.mean)^2 } rsquared <- 1 - rss/tss rse <- sqrt(rss/df) f.stat <- ((tss-rss)/p) / (rss/df) return(list(rsquared=rsquared,rse=rse,f.stat=f.stat)) } ``` #### Function to Calculate RMSE Name: RMSE Input parameters: - predicted - a vector of predictions - target - a vector containing the target values for the predictions Return Value: The RMSE value calculated from the predicted and target values Description: Calculate the RMSE value: $RMSE = \sqrt {\sum_{i=1}^n (\hat y_i - y_i)^2 / N}$ ```R RMSE <- function(true, predicted) { se <- 0 for (i in 1:length(predicted)) { se <- se + (predicted[i]-true[i])^2 } return (sqrt(se/length(predicted))) } ``` ### Model 1: Linear model 1 with all variables and no transformations Let's try a basic linear regression model with all of our original variables to get a feel for what is there. ```R lm_model_14 <- lm(actual_productivity ~ ., data=train) summary(lm_model_14) ``` Call: lm(formula = actual_productivity ~ ., data = train) Residuals: Min 1Q Median 3Q Max -0.56885 -0.06844 0.01807 0.07708 0.53302 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.279e-01 4.256e-02 5.354 1.08e-07 *** quarterQuarter2 8.568e-03 1.292e-02 0.663 0.507346 quarterQuarter3 -1.167e-02 1.499e-02 -0.778 0.436544 quarterQuarter4 -1.357e-02 1.446e-02 -0.938 0.348256 quarterQuarter5 9.153e-02 2.831e-02 3.234 0.001265 ** departmentsweing -4.650e-02 2.983e-02 -1.559 0.119359 dayTuesday 9.323e-03 1.727e-02 0.540 0.589408 dayWednesday 2.896e-03 1.709e-02 0.169 0.865447 dayThursday -2.738e-03 1.752e-02 -0.156 0.875834 daySaturday -8.564e-04 1.769e-02 -0.048 0.961403 daySunday -3.558e-04 1.719e-02 -0.021 0.983486 team -7.364e-03 1.498e-03 -4.917 1.04e-06 *** targeted_productivity 7.239e-01 4.961e-02 14.593 < 2e-16 *** smv -7.424e-03 1.056e-03 -7.033 3.90e-12 *** wip 5.264e-06 3.551e-06 1.483 0.138511 over_time -4.660e-06 2.252e-06 -2.070 0.038761 * incentive 4.559e-05 2.833e-05 1.609 0.107938 idle_time 3.767e-04 4.669e-04 0.807 0.419945 idle_men -8.208e-03 2.161e-03 -3.797 0.000156 *** no_of_style_change -4.113e-02 1.311e-02 -3.137 0.001761 ** no_of_workers 5.234e-03 8.217e-04 6.369 2.97e-10 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1493 on 936 degrees of freedom Multiple R-squared: 0.3024, Adjusted R-squared: 0.2875 F-statistic: 20.29 on 20 and 936 DF, p-value: < 2.2e-16 ```R par(mfrow=c(2,2)) plot(lm_model_14) ``` ![png](output_97_0.png) The R-squared values seem to be quite bad, being around 0.3. This means the model can only explain about 30% of the variation in our dependent variable. The Residuals vs Fitted plot shows quite concentrated residuals around a horizontal line without distinct patterns, which means it does not appear to show any non-linear patterns between the predictor variables and the outcome variable. The Q-Q plot confirms our EDA analysis that our data distribution is left skewed compared to a normal distribution, with heavy tails. The Scale-Location shows that residuals appear to be randomly spread and quite concentrated, meaning that there is little to no heteroscedasticity. Lastly, the Residuals vs Leverage shows us that there are no influential cases of outliers for our model. Let's calculate the actual accuracy metrics using the functions we defined above: ```R lm_model_14_predict <- lm_model_14$fitted.values # Calculate the accuracy statistics lm_model_14_accuracy <- Model.Accuracy(lm_model_14_predict,train$actual_productivity,936,20) # Print the accuracy statistics cat("\nModel accuracy:\n") cat("\nDegrees of freedom: 936\nModel parameters: 20 plus intercept") cat("\nResidual standard error:",lm_model_14_accuracy$rse) cat("\nPercentage error:",lm_model_14_accuracy$rse*100/mean(train$actual_productivity),"%") cat("\nR-Squared:",lm_model_14_accuracy$rsquared) cat("\nF-statistic:",lm_model_14_accuracy$f.stat,"; p-value:",pf(20.29,20,936,lower.tail=FALSE)) # Print the RMSE cat("\n\nRMSE:", RMSE(lm_model_14_predict,train$actual_productivity)) ``` Model accuracy: Degrees of freedom: 936 Model parameters: 20 plus intercept Residual standard error: 0.149298 Percentage error: 20.47107 % R-Squared: 0.3024447 F-statistic: 20.29145 ; p-value: 4.553204e-60 RMSE: 0.1476508 This first intial model is quite bad, but it did shows us a few things. As shown with EDA, the day attribute does not matter at all with no significant p-values. The department attribute is also not significant. Quarter5 is significant at 1% level of confidence, as well as team, targeted_productivity, and no_of_workers, which matches our finding above. A few significant variables I find surprising are smv, idle_men, and no_of_style_change. This is mainly because they contain mostly 0 values and only a few outliers, and are not very correlated to actual_productivity even with these removed. Let's build a different model with variables that are significant, and try to incorporate the information we gained from EDA by adding some variabels. ### Model 2: Linear model 2 with transformations based on EDA First I will generate some new variables based on the criteria explored in EDA. Then I will leave out the insignificant variables from the first model and tweak a few more. ```R # Incentive below 960 BDT seems to matter a lot more to actual_productivity train$good_incentive[train$incentive != 0 & train$incentive < 960] <- 1 train$good_incentive[train$incentive == 0 | train$incentive >= 960] <- 0 test$good_incentive[test$incentive != 0 & test$incentive < 960] <- 1 test$good_incentive[test$incentive == 0 | test$incentive >= 960] <- 0 ``` ```R # Team 1,2,3,4,12 consistently achieve higher actual_productivity then the other teams on average despite receiving average incentive # ap <- summarize(group_by(train, team), median = median(actual_productivity)) %>% arrange(desc(median)) train$good_team[train$team %in% c(1,2,3,4,12)] <- 1 train$good_team[!train$team %in% c(1,2,3,4,12)] <- 0 test$good_team[test$team %in% c(1,2,3,4,12)] <- 1 test$good_team[!test$team %in% c(1,2,3,4,12)] <- 0 ``` ```R # Productivity in Quarter 5, 1, 2 are higher than other quarters train$good_quarter[train$quarter %in% c("Quarter5","Quarter1","Quarter2")] <- 1 train$good_quarter[!train$quarter %in% c("Quarter5","Quarter1","Quarter2")] <- 0 test$good_quarter[test$quarter %in% c("Quarter5","Quarter1","Quarter2")] <- 1 test$good_quarter[!test$quarter %in% c("Quarter5","Quarter1","Quarter2")] <- 0 ``` ```R # Add no_of_workers^3 because scatterplot shows potential cubic relationship with actual_productivity # Add no_of_workers^2 and no_of_workers^5 through trial and error # Add interaction term between smv and no_of_workers due to high positive correlation lm_model_alt <- lm(actual_productivity ~ good_incentive + good_team + good_quarter + targeted_productivity + smv + no_of_workers + smv*no_of_workers + I(no_of_workers^2) + I(no_of_workers^3) +I(no_of_workers^5), data=train) summary(lm_model_alt) ``` Call: lm(formula = actual_productivity ~ good_incentive + good_team + good_quarter + targeted_productivity + smv + no_of_workers + smv * no_of_workers + I(no_of_workers^2) + I(no_of_workers^3) + I(no_of_workers^5), data = train) Residuals: Min 1Q Median 3Q Max -0.55377 -0.06780 0.00583 0.07997 0.51320 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.616e-01 6.820e-02 -2.369 0.018 * good_incentive 1.363e-01 1.621e-02 8.408 < 2e-16 *** good_team 6.193e-02 9.480e-03 6.534 1.05e-10 *** good_quarter 4.045e-02 9.460e-03 4.276 2.10e-05 *** targeted_productivity 5.927e-01 4.724e-02 12.545 < 2e-16 *** smv 2.107e-03 6.782e-03 0.311 0.756 no_of_workers 7.022e-02 8.273e-03 8.488 < 2e-16 *** I(no_of_workers^2) -3.127e-03 3.890e-04 -8.040 2.67e-15 *** I(no_of_workers^3) 4.195e-05 5.573e-06 7.527 1.20e-13 *** I(no_of_workers^5) -1.882e-09 2.734e-10 -6.883 1.06e-11 *** smv:no_of_workers -1.645e-04 1.253e-04 -1.313 0.190 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1408 on 946 degrees of freedom Multiple R-squared: 0.3728, Adjusted R-squared: 0.3662 F-statistic: 56.23 on 10 and 946 DF, p-value: < 2.2e-16 ```R par(mfrow=c(2,2)) plot(lm_model_alt) ``` ![png](output_106_0.png) ```R lm_model_alt_predict <- lm_model_alt$fitted.values # Calculate the accuracy statistics lm_model_alt_accuracy <- Model.Accuracy(lm_model_alt_predict,train$actual_productivity,946,10) # Print the accuracy statistics cat("\nModel accuracy:\n") cat("\nDegrees of freedom: 945\nModel parameters: 11 plus intercept") cat("\nResidual standard error:",lm_model_alt_accuracy$rse) cat("\nPercentage error:",lm_model_alt_accuracy$rse*100/mean(train$actual_productivity),"%") cat("\nR-Squared:",lm_model_alt_accuracy$rsquared) cat("\nF-statistic:",lm_model_alt_accuracy$f.stat,"; p-value:",pf(56.23,10,946,lower.tail=FALSE)) # Print the RMSE cat("\n\nRMSE:", RMSE(lm_model_alt_predict,train$actual_productivity)) ``` Model accuracy: Degrees of freedom: 945 Model parameters: 11 plus intercept Residual standard error: 0.1408161 Percentage error: 19.30808 % R-Squared: 0.3728219 F-statistic: 56.23436 ; p-value: 6.198772e-89 RMSE: 0.1400045 As we can see from the model accuracy metrics, this model is somewhat of an improvement from the first model, in terms of RMSE and R-squared. Let's see next if we can automate some of the features selection using R, specifically using step function and Lasso regression. ```R # Drop the transformed variables created for Model 2 before training a new model train <- train[,1:(length(train)-3)] str(train) ``` Classes 'tbl_df', 'tbl' and 'data.frame': 957 obs. of 14 variables: $ quarter : Factor w/ 5 levels "Quarter1","Quarter2",..: 4 2 3 4 4 3 2 1 5 4 ... $ department : Factor w/ 2 levels "finishing","sweing": 2 2 1 1 1 2 2 1 2 2 ... $ day : Factor w/ 6 levels "Monday","Tuesday",..: 5 2 6 5 4 4 1 5 4 2 ... $ team : num 10 7 11 9 8 3 7 1 4 4 ... $ targeted_productivity: num 0.7 0.5 0.6 0.6 0.8 0.5 0.35 0.8 0.8 0.8 ... $ smv : num 21.82 30.1 2.9 3.94 2.9 ... $ wip : num 1448 1025 1688 1448 970 ... $ over_time : num 6120 6960 1200 1440 4800 ... $ incentive : num 40 23 0 0 0 30 0 0 88 60 ... $ idle_time : num 0 0 0 0 0 0 0 0 0 0 ... $ idle_men : num 0 0 0 0 0 0 0 0 0 0 ... $ no_of_style_change : num 1 1 0 0 0 0 1 0 0 0 ... $ no_of_workers : num 51 58 10 8 8 56.5 58 10 57.5 56.5 ... $ actual_productivity : num 0.7 0.501 0.715 0.261 0.398 ... ### Model 3: Stepwise regression I will start with step function, using a brute force approach. Since we are only dealing with a number of independent variables, I will pass all basic transformations into the model and hope step() can do a good job of sorting out what is important and what is not. ```R fullmod = lm(actual_productivity ~ . + .*. + log(targeted_productivity) + log(smv) + log(wip) + log(over_time + 0.01) +log(incentive + 0.01) + log(idle_time + 0.01) + log(idle_men + 0.01) + log(no_of_style_change + 0.01) + log(no_of_workers) + I(targeted_productivity^2) + I(smv^2) + I(wip^2) + I(over_time^2) + I(incentive^2) + I(idle_time^2) + I(idle_men^2) + I(no_of_style_change^2) + I(no_of_workers^2), data=train) summary(fullmod) ``` Call: lm(formula = actual_productivity ~ . + . * . + log(targeted_productivity) + log(smv) + log(wip) + log(over_time + 0.01) + log(incentive + 0.01) + log(idle_time + 0.01) + log(idle_men + 0.01) + log(no_of_style_change + 0.01) + log(no_of_workers) + I(targeted_productivity^2) + I(smv^2) + I(wip^2) + I(over_time^2) + I(incentive^2) + I(idle_time^2) + I(idle_men^2) + I(no_of_style_change^2) + I(no_of_workers^2), data = train) Residuals: Min 1Q Median 3Q Max -0.60989 -0.03973 0.00254 0.05295 0.32422 Coefficients: (37 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 3.414e-01 2.594e+00 0.132 0.895329 quarterQuarter2 1.306e-01 1.374e-01 0.951 0.341952 quarterQuarter3 8.512e-02 1.673e-01 0.509 0.610974 quarterQuarter4 4.260e-02 1.519e-01 0.281 0.779163 quarterQuarter5 3.290e-01 3.091e-01 1.064 0.287506 departmentsweing -1.688e+00 5.591e-01 -3.019 0.002618 dayTuesday -1.906e-01 1.557e-01 -1.224 0.221273 dayWednesday -2.273e-01 1.590e-01 -1.429 0.153313 dayThursday 7.956e-03 1.586e-01 0.050 0.959997 daySaturday -2.371e-01 1.723e-01 -1.376 0.169139 daySunday -1.736e-01 1.785e-01 -0.973 0.331070 team -3.529e-02 1.496e-02 -2.358 0.018616 targeted_productivity -1.776e+00 1.386e+00 -1.281 0.200432 smv -2.860e-01 9.553e-02 -2.993 0.002845 wip -2.118e-05 1.151e-04 -0.184 0.854093 over_time -6.067e-05 2.887e-05 -2.101 0.035925 incentive -1.468e-03 8.769e-04 -1.674 0.094540 idle_time 4.197e-02 1.607e-01 0.261 0.794076 idle_men 4.532e-02 3.860e-01 0.117 0.906559 no_of_style_change -3.246e-01 2.300e-01 -1.411 0.158618 no_of_workers 1.891e-02 1.130e-02 1.673 0.094655 log(targeted_productivity) 1.077e-01 2.999e-01 0.359 0.719690 log(smv) 1.281e+00 3.438e-01 3.727 0.000208 log(wip) 7.387e-03 1.812e-02 0.408 0.683691 log(over_time + 0.01) -4.330e-03 5.704e-03 -0.759 0.448020 log(incentive + 0.01) 5.178e-03 4.026e-03 1.286 0.198829 log(idle_time + 0.01) -6.717e-02 1.629e+00 -0.041 0.967123 log(idle_men + 0.01) -1.332e-02 1.581e+00 -0.008 0.993278 log(no_of_style_change + 0.01) -3.649e-03 1.650e-02 -0.221 0.825050 log(no_of_workers) 1.088e-02 6.531e-02 0.167 0.867716 I(targeted_productivity^2) 1.262e+00 7.273e-01 1.735 0.083175 I(smv^2) 7.613e-04 2.570e-04 2.962 0.003151 I(wip^2) 1.514e-09 1.354e-09 1.118 0.263782 I(over_time^2) -4.632e-10 7.175e-10 -0.646 0.518737 I(incentive^2) 2.022e-07 1.133e-07 1.784 0.074769 I(idle_time^2) -1.283e-04 5.109e-04 -0.251 0.801830 I(idle_men^2) -1.547e-03 1.029e-02 -0.150 0.880492 I(no_of_style_change^2) NA NA NA NA I(no_of_workers^2) -2.028e-05 1.028e-04 -0.197 0.843694 quarterQuarter2:departmentsweing -2.061e-02 7.588e-02 -0.272 0.785980 quarterQuarter3:departmentsweing 1.426e-01 9.109e-02 1.565 0.117897 quarterQuarter4:departmentsweing -6.089e-02 8.958e-02 -0.680 0.496900 quarterQuarter5:departmentsweing -3.286e-01 2.318e-01 -1.418 0.156734 quarterQuarter2:dayTuesday -4.410e-02 4.213e-02 -1.047 0.295628 quarterQuarter3:dayTuesday -8.512e-02 4.898e-02 -1.738 0.082652 quarterQuarter4:dayTuesday -6.023e-02 4.895e-02 -1.230 0.218935 quarterQuarter5:dayTuesday NA NA NA NA quarterQuarter2:dayWednesday -2.963e-02 4.295e-02 -0.690 0.490530 quarterQuarter3:dayWednesday -8.080e-02 4.823e-02 -1.675 0.094296 quarterQuarter4:dayWednesday -6.705e-02 4.902e-02 -1.368 0.171743 quarterQuarter5:dayWednesday NA NA NA NA quarterQuarter2:dayThursday -5.507e-02 4.485e-02 -1.228 0.219932 quarterQuarter3:dayThursday -5.287e-02 4.961e-02 -1.066 0.286807 quarterQuarter4:dayThursday -8.155e-02 5.109e-02 -1.596 0.110860 quarterQuarter5:dayThursday -1.600e-02 6.053e-02 -0.264 0.791658 quarterQuarter2:daySaturday -7.813e-02 4.278e-02 -1.826 0.068196 quarterQuarter3:daySaturday -7.736e-02 5.343e-02 -1.448 0.148054 quarterQuarter4:daySaturday -9.540e-02 4.750e-02 -2.009 0.044924 quarterQuarter5:daySaturday NA NA NA NA quarterQuarter2:daySunday -1.407e-03 3.972e-02 -0.035 0.971753 quarterQuarter3:daySunday -3.007e-02 4.628e-02 -0.650 0.516088 quarterQuarter4:daySunday -7.200e-03 4.517e-02 -0.159 0.873390 quarterQuarter5:daySunday NA NA NA NA quarterQuarter2:team -6.207e-04 3.580e-03 -0.173 0.862403 quarterQuarter3:team -6.404e-03 4.335e-03 -1.477 0.139997 quarterQuarter4:team 5.744e-04 4.054e-03 0.142 0.887359 quarterQuarter5:team 1.145e-02 8.011e-03 1.429 0.153303 quarterQuarter2:targeted_productivity -2.461e-02 1.661e-01 -0.148 0.882299 quarterQuarter3:targeted_productivity 1.033e-01 2.039e-01 0.506 0.612663 quarterQuarter4:targeted_productivity -1.188e-01 1.862e-01 -0.638 0.523719 quarterQuarter5:targeted_productivity -5.397e-01 3.598e-01 -1.500 0.134066 quarterQuarter2:smv -4.000e-03 3.526e-03 -1.134 0.256956 quarterQuarter3:smv -4.374e-03 3.960e-03 -1.105 0.269689 quarterQuarter4:smv 3.162e-04 3.906e-03 0.081 0.935504 quarterQuarter5:smv 1.187e-02 9.562e-03 1.241 0.214914 quarterQuarter2:wip -8.300e-05 3.841e-05 -2.161 0.031009 quarterQuarter3:wip -9.647e-05 4.094e-05 -2.357 0.018684 quarterQuarter4:wip -2.417e-05 3.621e-05 -0.667 0.504674 quarterQuarter5:wip 1.457e-04 1.572e-04 0.927 0.354046 quarterQuarter2:over_time 1.488e-05 6.920e-06 2.151 0.031801 quarterQuarter3:over_time 5.240e-06 8.950e-06 0.585 0.558400 quarterQuarter4:over_time 5.673e-06 8.479e-06 0.669 0.503676 quarterQuarter5:over_time 1.102e-04 8.328e-05 1.324 0.185986 quarterQuarter2:incentive 3.018e-04 7.185e-04 0.420 0.674608 quarterQuarter3:incentive -5.349e-04 8.528e-04 -0.627 0.530721 quarterQuarter4:incentive 2.622e-04 9.109e-04 0.288 0.773550 quarterQuarter5:incentive 2.376e-03 1.238e-03 1.919 0.055395 quarterQuarter2:idle_time NA NA NA NA quarterQuarter3:idle_time -1.076e-01 4.230e-01 -0.254 0.799215 quarterQuarter4:idle_time 6.824e-01 2.904e+00 0.235 0.814265 quarterQuarter5:idle_time NA NA NA NA quarterQuarter2:idle_men NA NA NA NA quarterQuarter3:idle_men 1.901e-02 5.139e-02 0.370 0.711511 quarterQuarter4:idle_men -1.306e-01 5.095e-01 -0.256 0.797845 quarterQuarter5:idle_men NA NA NA NA quarterQuarter2:no_of_style_change 1.248e-01 7.157e-02 1.743 0.081673 quarterQuarter3:no_of_style_change 4.204e-02 8.029e-02 0.524 0.600758 quarterQuarter4:no_of_style_change 5.211e-02 7.837e-02 0.665 0.506277 quarterQuarter5:no_of_style_change NA NA NA NA quarterQuarter2:no_of_workers -1.216e-04 2.306e-03 -0.053 0.957961 quarterQuarter3:no_of_workers -1.064e-04 2.912e-03 -0.037 0.970865 quarterQuarter4:no_of_workers 2.391e-03 2.660e-03 0.899 0.369085 quarterQuarter5:no_of_workers -1.876e-02 1.310e-02 -1.433 0.152284 departmentsweing:dayTuesday 5.081e-02 1.043e-01 0.487 0.626293 departmentsweing:dayWednesday -2.802e-02 1.070e-01 -0.262 0.793431 departmentsweing:dayThursday 4.484e-02 1.046e-01 0.429 0.668175 departmentsweing:daySaturday 6.734e-02 1.095e-01 0.615 0.538781 departmentsweing:daySunday 5.477e-02 1.040e-01 0.527 0.598477 departmentsweing:team -1.514e-02 9.227e-03 -1.641 0.101152 departmentsweing:targeted_productivity 7.389e-01 3.994e-01 1.850 0.064692 departmentsweing:smv 1.773e-01 6.556e-02 2.704 0.006998 departmentsweing:wip -2.422e-05 4.472e-05 -0.542 0.588313 departmentsweing:over_time 4.111e-06 1.540e-05 0.267 0.789627 departmentsweing:incentive 4.078e-03 2.003e-03 2.036 0.042100 departmentsweing:idle_time NA NA NA NA departmentsweing:idle_men NA NA NA NA departmentsweing:no_of_style_change NA NA NA NA departmentsweing:no_of_workers -2.125e-02 7.619e-03 -2.789 0.005417 dayTuesday:team 5.332e-03 4.710e-03 1.132 0.257932 dayWednesday:team 9.747e-03 4.654e-03 2.094 0.036539 dayThursday:team 6.690e-03 4.904e-03 1.364 0.172940 daySaturday:team 2.633e-03 4.804e-03 0.548 0.583779 daySunday:team 2.930e-03 4.657e-03 0.629 0.529443 dayTuesday:targeted_productivity 1.328e-01 1.796e-01 0.740 0.459729 dayWednesday:targeted_productivity 1.374e-01 1.851e-01 0.742 0.458261 dayThursday:targeted_productivity -1.865e-01 1.820e-01 -1.025 0.305857 daySaturday:targeted_productivity 1.601e-01 2.028e-01 0.789 0.430067 daySunday:targeted_productivity 1.023e-01 2.127e-01 0.481 0.630801 dayTuesday:smv -1.185e-03 3.647e-03 -0.325 0.745284 dayWednesday:smv -7.020e-03 3.520e-03 -1.994 0.046449 dayThursday:smv -2.334e-03 4.135e-03 -0.564 0.572673 daySaturday:smv -2.528e-03 3.949e-03 -0.640 0.522375 daySunday:smv 3.900e-04 3.457e-03 0.113 0.910230 dayTuesday:wip 9.196e-05 4.319e-05 2.129 0.033568 dayWednesday:wip 6.311e-05 4.153e-05 1.519 0.129054 dayThursday:wip 8.637e-05 5.006e-05 1.725 0.084866 daySaturday:wip 9.883e-05 4.922e-05 2.008 0.044988 daySunday:wip 6.089e-05 4.005e-05 1.520 0.128831 dayTuesday:over_time 2.129e-06 7.823e-06 0.272 0.785554 dayWednesday:over_time 5.272e-06 7.827e-06 0.674 0.500794 dayThursday:over_time 5.630e-06 8.886e-06 0.634 0.526530 daySaturday:over_time 1.710e-05 9.634e-06 1.775 0.076257 daySunday:over_time 3.607e-06 8.372e-06 0.431 0.666714 dayTuesday:incentive -8.128e-04 9.839e-04 -0.826 0.409014 dayWednesday:incentive -8.344e-04 9.732e-04 -0.857 0.391499 dayThursday:incentive -1.947e-04 9.867e-04 -0.197 0.843646 daySaturday:incentive -1.196e-03 1.048e-03 -1.141 0.254118 daySunday:incentive -6.920e-04 9.810e-04 -0.705 0.480748 dayTuesday:idle_time 4.377e-02 1.550e-01 0.282 0.777703 dayWednesday:idle_time -2.062e-02 9.781e-02 -0.211 0.833061 dayThursday:idle_time 7.063e-03 4.717e-01 0.015 0.988057 daySaturday:idle_time NA NA NA NA daySunday:idle_time NA NA NA NA dayTuesday:idle_men NA NA NA NA dayWednesday:idle_men NA NA NA NA dayThursday:idle_men NA NA NA NA daySaturday:idle_men NA NA NA NA daySunday:idle_men NA NA NA NA dayTuesday:no_of_style_change 1.701e-02 4.241e-02 0.401 0.688379 dayWednesday:no_of_style_change 4.662e-02 4.191e-02 1.112 0.266316 dayThursday:no_of_style_change 2.405e-02 4.667e-02 0.515 0.606475 daySaturday:no_of_style_change 8.210e-02 4.963e-02 1.654 0.098484 daySunday:no_of_style_change 2.943e-02 4.615e-02 0.638 0.523844 dayTuesday:no_of_workers 3.568e-05 2.822e-03 0.013 0.989917 dayWednesday:no_of_workers 4.209e-03 2.791e-03 1.508 0.132012 dayThursday:no_of_workers 1.344e-04 3.151e-03 0.043 0.965991 daySaturday:no_of_workers -6.929e-04 3.239e-03 -0.214 0.830667 daySunday:no_of_workers -7.635e-04 2.785e-03 -0.274 0.784016 team:targeted_productivity 3.183e-02 1.845e-02 1.726 0.084807 team:smv -4.084e-04 3.884e-04 -1.051 0.293415 team:wip -5.536e-07 1.106e-06 -0.501 0.616768 team:over_time 1.019e-06 7.337e-07 1.388 0.165456 team:incentive 5.713e-06 1.383e-05 0.413 0.679750 team:idle_time NA NA NA NA team:idle_men NA NA NA NA team:no_of_style_change 6.323e-03 5.004e-03 1.264 0.206689 team:no_of_workers 4.134e-04 2.871e-04 1.440 0.150364 targeted_productivity:smv -1.072e-02 1.285e-02 -0.835 0.404216 targeted_productivity:wip 5.583e-05 1.485e-04 0.376 0.707099 targeted_productivity:over_time 3.432e-05 2.694e-05 1.274 0.203107 targeted_productivity:incentive 8.910e-04 6.867e-04 1.297 0.194889 targeted_productivity:idle_time NA NA NA NA targeted_productivity:idle_men NA NA NA NA targeted_productivity:no_of_style_change 3.460e-01 1.553e-01 2.227 0.026201 targeted_productivity:no_of_workers -5.972e-03 1.043e-02 -0.573 0.567129 smv:wip 3.050e-06 2.765e-06 1.103 0.270288 smv:over_time -4.967e-07 5.305e-07 -0.936 0.349455 smv:incentive 9.525e-05 4.885e-05 1.950 0.051535 smv:idle_time NA NA NA NA smv:idle_men NA NA NA NA smv:no_of_style_change 1.275e-03 3.184e-03 0.400 0.688934 smv:no_of_workers 4.057e-04 2.000e-04 2.029 0.042823 wip:over_time 3.919e-09 7.067e-09 0.555 0.579321 wip:incentive 1.255e-07 2.113e-07 0.594 0.552824 wip:idle_time NA NA NA NA wip:idle_men NA NA NA NA wip:no_of_style_change -8.311e-06 4.075e-05 -0.204 0.838439 wip:no_of_workers -2.401e-06 1.886e-06 -1.273 0.203531 over_time:incentive -9.073e-09 1.276e-07 -0.071 0.943326 over_time:idle_time NA NA NA NA over_time:idle_men NA NA NA NA over_time:no_of_style_change -5.125e-06 8.833e-06 -0.580 0.561965 over_time:no_of_workers 4.423e-07 4.462e-07 0.991 0.321811 incentive:idle_time NA NA NA NA incentive:idle_men NA NA NA NA incentive:no_of_style_change -2.088e-03 9.361e-04 -2.230 0.026019 incentive:no_of_workers -4.690e-05 4.190e-05 -1.119 0.263394 idle_time:idle_men NA NA NA NA idle_time:no_of_style_change NA NA NA NA idle_time:no_of_workers NA NA NA NA idle_men:no_of_style_change NA NA NA NA idle_men:no_of_workers NA NA NA NA no_of_style_change:no_of_workers 7.849e-04 3.369e-03 0.233 0.815840 (Intercept) quarterQuarter2 quarterQuarter3 quarterQuarter4 quarterQuarter5 departmentsweing ** dayTuesday dayWednesday dayThursday daySaturday daySunday team * targeted_productivity smv ** wip over_time * incentive . idle_time idle_men no_of_style_change no_of_workers . log(targeted_productivity) log(smv) *** log(wip) log(over_time + 0.01) log(incentive + 0.01) log(idle_time + 0.01) log(idle_men + 0.01) log(no_of_style_change + 0.01) log(no_of_workers) I(targeted_productivity^2) . I(smv^2) ** I(wip^2) I(over_time^2) I(incentive^2) . I(idle_time^2) I(idle_men^2) I(no_of_style_change^2) I(no_of_workers^2) quarterQuarter2:departmentsweing quarterQuarter3:departmentsweing quarterQuarter4:departmentsweing quarterQuarter5:departmentsweing quarterQuarter2:dayTuesday quarterQuarter3:dayTuesday . quarterQuarter4:dayTuesday quarterQuarter5:dayTuesday quarterQuarter2:dayWednesday quarterQuarter3:dayWednesday . quarterQuarter4:dayWednesday quarterQuarter5:dayWednesday quarterQuarter2:dayThursday quarterQuarter3:dayThursday quarterQuarter4:dayThursday quarterQuarter5:dayThursday quarterQuarter2:daySaturday . quarterQuarter3:daySaturday quarterQuarter4:daySaturday * quarterQuarter5:daySaturday quarterQuarter2:daySunday quarterQuarter3:daySunday quarterQuarter4:daySunday quarterQuarter5:daySunday quarterQuarter2:team quarterQuarter3:team quarterQuarter4:team quarterQuarter5:team quarterQuarter2:targeted_productivity quarterQuarter3:targeted_productivity quarterQuarter4:targeted_productivity quarterQuarter5:targeted_productivity quarterQuarter2:smv quarterQuarter3:smv quarterQuarter4:smv quarterQuarter5:smv quarterQuarter2:wip * quarterQuarter3:wip * quarterQuarter4:wip quarterQuarter5:wip quarterQuarter2:over_time * quarterQuarter3:over_time quarterQuarter4:over_time quarterQuarter5:over_time quarterQuarter2:incentive quarterQuarter3:incentive quarterQuarter4:incentive quarterQuarter5:incentive . quarterQuarter2:idle_time quarterQuarter3:idle_time quarterQuarter4:idle_time quarterQuarter5:idle_time quarterQuarter2:idle_men quarterQuarter3:idle_men quarterQuarter4:idle_men quarterQuarter5:idle_men quarterQuarter2:no_of_style_change . quarterQuarter3:no_of_style_change quarterQuarter4:no_of_style_change quarterQuarter5:no_of_style_change quarterQuarter2:no_of_workers quarterQuarter3:no_of_workers quarterQuarter4:no_of_workers quarterQuarter5:no_of_workers departmentsweing:dayTuesday departmentsweing:dayWednesday departmentsweing:dayThursday departmentsweing:daySaturday departmentsweing:daySunday departmentsweing:team departmentsweing:targeted_productivity . departmentsweing:smv ** departmentsweing:wip departmentsweing:over_time departmentsweing:incentive * departmentsweing:idle_time departmentsweing:idle_men departmentsweing:no_of_style_change departmentsweing:no_of_workers ** dayTuesday:team dayWednesday:team * dayThursday:team daySaturday:team daySunday:team dayTuesday:targeted_productivity dayWednesday:targeted_productivity dayThursday:targeted_productivity daySaturday:targeted_productivity daySunday:targeted_productivity dayTuesday:smv dayWednesday:smv * dayThursday:smv daySaturday:smv daySunday:smv dayTuesday:wip * dayWednesday:wip dayThursday:wip . daySaturday:wip * daySunday:wip dayTuesday:over_time dayWednesday:over_time dayThursday:over_time daySaturday:over_time . daySunday:over_time dayTuesday:incentive dayWednesday:incentive dayThursday:incentive daySaturday:incentive daySunday:incentive dayTuesday:idle_time dayWednesday:idle_time dayThursday:idle_time daySaturday:idle_time daySunday:idle_time dayTuesday:idle_men dayWednesday:idle_men dayThursday:idle_men daySaturday:idle_men daySunday:idle_men dayTuesday:no_of_style_change dayWednesday:no_of_style_change dayThursday:no_of_style_change daySaturday:no_of_style_change . daySunday:no_of_style_change dayTuesday:no_of_workers dayWednesday:no_of_workers dayThursday:no_of_workers daySaturday:no_of_workers daySunday:no_of_workers team:targeted_productivity . team:smv team:wip team:over_time team:incentive team:idle_time team:idle_men team:no_of_style_change team:no_of_workers targeted_productivity:smv targeted_productivity:wip targeted_productivity:over_time targeted_productivity:incentive targeted_productivity:idle_time targeted_productivity:idle_men targeted_productivity:no_of_style_change * targeted_productivity:no_of_workers smv:wip smv:over_time smv:incentive . smv:idle_time smv:idle_men smv:no_of_style_change smv:no_of_workers * wip:over_time wip:incentive wip:idle_time wip:idle_men wip:no_of_style_change wip:no_of_workers over_time:incentive over_time:idle_time over_time:idle_men over_time:no_of_style_change over_time:no_of_workers incentive:idle_time incentive:idle_men incentive:no_of_style_change * incentive:no_of_workers idle_time:idle_men idle_time:no_of_style_change idle_time:no_of_workers idle_men:no_of_style_change idle_men:no_of_workers no_of_style_change:no_of_workers --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1252 on 781 degrees of freedom Multiple R-squared: 0.591, Adjusted R-squared: 0.4993 F-statistic: 6.447 on 175 and 781 DF, p-value: < 2.2e-16 I will now use step function to perform features selection, starting with the BIC criteria and then AIC. By specifying k = log(nrow(train)), we can use the BIC criteria instead of the AIC. The direction = "both" option will tell R to try a combination of forwards and backwards selection. ```R bic_fit = step(fullmod, trace=0, k = log(nrow(train)), direction="both") summary(bic_fit) ``` Call: lm(formula = actual_productivity ~ quarter + department + targeted_productivity + smv + over_time + incentive + idle_men + no_of_workers + log(smv) + I(targeted_productivity^2) + quarter:department + department:targeted_productivity + department:over_time + department:incentive + department:no_of_workers, data = train) Residuals: Min 1Q Median 3Q Max -0.63392 -0.03617 0.00623 0.05451 0.40143 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.761e-01 1.188e-01 4.007 6.64e-05 quarterQuarter2 1.955e-02 1.696e-02 1.153 0.249219 quarterQuarter3 -3.417e-02 1.974e-02 -1.731 0.083779 quarterQuarter4 -6.174e-02 1.845e-02 -3.347 0.000850 quarterQuarter5 1.597e-01 3.364e-02 4.748 2.38e-06 departmentsweing -5.898e-01 8.700e-02 -6.779 2.14e-11 targeted_productivity -9.833e-01 3.262e-01 -3.014 0.002645 smv -1.479e-02 1.957e-03 -7.556 9.90e-14 over_time -2.347e-05 4.686e-06 -5.009 6.53e-07 incentive -1.732e-05 2.426e-05 -0.714 0.475470 idle_men -4.621e-03 1.384e-03 -3.339 0.000874 no_of_workers 2.069e-02 2.113e-03 9.789 < 2e-16 log(smv) 2.752e-01 4.031e-02 6.828 1.55e-11 I(targeted_productivity^2) 9.400e-01 2.522e-01 3.728 0.000205 quarterQuarter2:departmentsweing -2.315e-02 2.192e-02 -1.056 0.291119 quarterQuarter3:departmentsweing 4.177e-02 2.552e-02 1.637 0.102068 quarterQuarter4:departmentsweing 5.946e-02 2.386e-02 2.492 0.012869 quarterQuarter5:departmentsweing -1.798e-01 4.620e-02 -3.893 0.000106 departmentsweing:targeted_productivity 5.159e-01 9.298e-02 5.548 3.77e-08 departmentsweing:over_time 1.955e-05 5.084e-06 3.844 0.000129 departmentsweing:incentive 2.896e-03 2.468e-04 11.737 < 2e-16 departmentsweing:no_of_workers -1.947e-02 2.218e-03 -8.780 < 2e-16 (Intercept) *** quarterQuarter2 quarterQuarter3 . quarterQuarter4 *** quarterQuarter5 *** departmentsweing *** targeted_productivity ** smv *** over_time *** incentive idle_men *** no_of_workers *** log(smv) *** I(targeted_productivity^2) *** quarterQuarter2:departmentsweing quarterQuarter3:departmentsweing quarterQuarter4:departmentsweing * quarterQuarter5:departmentsweing *** departmentsweing:targeted_productivity *** departmentsweing:over_time *** departmentsweing:incentive *** departmentsweing:no_of_workers *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1252 on 935 degrees of freedom Multiple R-squared: 0.5102, Adjusted R-squared: 0.4992 F-statistic: 46.37 on 21 and 935 DF, p-value: < 2.2e-16 Now we try the AIC criteria. ```R aic_fit = step(fullmod, trace=0, k = 2, direction="both") summary(aic_fit) ``` Call: lm(formula = actual_productivity ~ quarter + department + team + targeted_productivity + smv + wip + over_time + incentive + idle_men + no_of_style_change + no_of_workers + log(smv) + log(wip) + log(incentive + 0.01) + log(idle_men + 0.01) + I(targeted_productivity^2) + I(smv^2) + I(incentive^2) + I(idle_men^2) + quarter:smv + quarter:wip + quarter:over_time + department:team + department:targeted_productivity + department:smv + department:incentive + department:no_of_workers + team:over_time + team:no_of_style_change + team:no_of_workers + targeted_productivity:smv + targeted_productivity:incentive + smv:incentive + smv:no_of_workers + wip:over_time + wip:no_of_workers + over_time:no_of_workers + incentive:no_of_style_change + incentive:no_of_workers, data = train) Residuals: Min 1Q Median 3Q Max -0.61788 -0.03943 0.00369 0.05919 0.37455 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.230e-01 2.303e-01 2.705 0.006963 quarterQuarter2 6.665e-02 3.103e-02 2.148 0.031986 quarterQuarter3 3.746e-02 4.419e-02 0.848 0.396798 quarterQuarter4 -1.019e-01 3.392e-02 -3.003 0.002749 quarterQuarter5 3.879e-02 1.083e-01 0.358 0.720240 departmentsweing -1.333e+00 4.160e-01 -3.203 0.001408 team -7.799e-03 2.624e-03 -2.973 0.003031 targeted_productivity -1.212e+00 3.411e-01 -3.553 0.000400 smv -2.079e-01 7.650e-02 -2.718 0.006697 wip 1.362e-05 2.055e-05 0.663 0.507650 over_time -4.113e-05 8.663e-06 -4.748 2.39e-06 incentive -7.197e-04 3.596e-04 -2.001 0.045666 idle_men -6.119e-02 2.314e-02 -2.644 0.008331 no_of_style_change 1.169e-02 3.144e-02 0.372 0.710145 no_of_workers 1.656e-02 2.404e-03 6.888 1.06e-11 log(smv) 9.955e-01 2.754e-01 3.614 0.000318 log(wip) 2.591e-02 1.053e-02 2.462 0.014021 log(incentive + 0.01) 6.445e-03 3.341e-03 1.929 0.054017 log(idle_men + 0.01) 7.961e-02 3.949e-02 2.016 0.044106 I(targeted_productivity^2) 1.167e+00 2.644e-01 4.412 1.15e-05 I(smv^2) 5.087e-04 2.135e-04 2.383 0.017367 I(incentive^2) 1.812e-07 7.928e-08 2.286 0.022508 I(idle_men^2) 1.053e-03 3.909e-04 2.695 0.007173 quarterQuarter2:smv -2.572e-03 1.475e-03 -1.743 0.081596 quarterQuarter3:smv 2.065e-04 1.772e-03 0.117 0.907248 quarterQuarter4:smv 2.340e-03 1.575e-03 1.485 0.137794 quarterQuarter5:smv 4.209e-03 5.093e-03 0.826 0.408746 quarterQuarter2:wip -7.030e-05 2.971e-05 -2.366 0.018195 quarterQuarter3:wip -7.352e-05 3.008e-05 -2.444 0.014703 quarterQuarter4:wip 1.230e-05 2.576e-05 0.477 0.633173 quarterQuarter5:wip 1.322e-04 1.033e-04 1.279 0.201052 quarterQuarter2:over_time 9.336e-06 4.898e-06 1.906 0.056949 quarterQuarter3:over_time 5.709e-06 5.550e-06 1.029 0.303911 quarterQuarter4:over_time 4.457e-06 5.252e-06 0.849 0.396303 quarterQuarter5:over_time -4.809e-05 2.155e-05 -2.232 0.025874 departmentsweing:team -1.715e-02 8.172e-03 -2.099 0.036095 departmentsweing:targeted_productivity 8.544e-01 1.939e-01 4.405 1.18e-05 departmentsweing:smv 1.205e-01 5.216e-02 2.310 0.021094 departmentsweing:incentive 3.489e-03 1.361e-03 2.564 0.010514 departmentsweing:no_of_workers -2.215e-02 3.647e-03 -6.073 1.84e-09 team:over_time 8.998e-07 5.435e-07 1.656 0.098147 team:no_of_style_change 4.699e-03 3.285e-03 1.430 0.152984 team:no_of_workers 2.805e-04 2.002e-04 1.401 0.161552 targeted_productivity:smv -1.466e-02 8.504e-03 -1.724 0.084966 targeted_productivity:incentive 8.155e-04 4.766e-04 1.711 0.087376 smv:incentive 6.347e-05 4.014e-05 1.581 0.114195 smv:no_of_workers 4.148e-04 1.615e-04 2.568 0.010394 wip:over_time 7.905e-09 4.086e-09 1.934 0.053371 wip:no_of_workers -1.265e-06 6.145e-07 -2.058 0.039879 over_time:no_of_workers 2.994e-07 1.350e-07 2.218 0.026787 incentive:no_of_style_change -9.931e-04 5.959e-04 -1.666 0.095971 incentive:no_of_workers -4.815e-05 3.005e-05 -1.603 0.109391 (Intercept) ** quarterQuarter2 * quarterQuarter3 quarterQuarter4 ** quarterQuarter5 departmentsweing ** team ** targeted_productivity *** smv ** wip over_time *** incentive * idle_men ** no_of_style_change no_of_workers *** log(smv) *** log(wip) * log(incentive + 0.01) . log(idle_men + 0.01) * I(targeted_productivity^2) *** I(smv^2) * I(incentive^2) * I(idle_men^2) ** quarterQuarter2:smv . quarterQuarter3:smv quarterQuarter4:smv quarterQuarter5:smv quarterQuarter2:wip * quarterQuarter3:wip * quarterQuarter4:wip quarterQuarter5:wip quarterQuarter2:over_time . quarterQuarter3:over_time quarterQuarter4:over_time quarterQuarter5:over_time * departmentsweing:team * departmentsweing:targeted_productivity *** departmentsweing:smv * departmentsweing:incentive * departmentsweing:no_of_workers *** team:over_time . team:no_of_style_change team:no_of_workers targeted_productivity:smv . targeted_productivity:incentive . smv:incentive smv:no_of_workers * wip:over_time . wip:no_of_workers * over_time:no_of_workers * incentive:no_of_style_change . incentive:no_of_workers --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1227 on 905 degrees of freedom Multiple R-squared: 0.5445, Adjusted R-squared: 0.5188 F-statistic: 21.21 on 51 and 905 DF, p-value: < 2.2e-16 The AIC criteria model yields a slightly higher R-squared, but the BIC criteria model is much simpler. Hence, I will explore the BIC model from here. ```R par(mfrow=c(2,2)) plot(bic_fit) ``` ![png](output_117_0.png) ```R bic_fit_predict <- bic_fit$fitted.values # Calculate the accuracy statistics bic_fit_accuracy <- Model.Accuracy(bic_fit_predict,train$actual_productivity,935,21) # Print the accuracy statistics cat("\nModel accuracy of backfit:\n") cat("\nDegrees of freedom: 935\nModel parameters: 21 plus intercept") cat("\nResidual standard error:",bic_fit_accuracy$rse) cat("\nPercentage error:",bic_fit_accuracy$rse*100/mean(train$actual_productivity),"%") cat("\nR-Squared:",bic_fit_accuracy$rsquared) cat("\nF-statistic:",bic_fit_accuracy$f.stat,"; p-value:",pf(46.37,21,935,lower.tail=FALSE)) # Print the RMSE cat("\n\nRMSE:", RMSE(bic_fit_predict,train$actual_productivity)) ``` Model accuracy of backfit: Degrees of freedom: 935 Model parameters: 21 plus intercept Residual standard error: 0.1251765 Percentage error: 17.16364 % R-Squared: 0.5101624 F-statistic: 46.37124 ; p-value: 4.879907e-129 RMSE: 0.1237293 It seems we got a quite noticeable increase in R-squared using the stepwise regression with BIC compared to the first 2 models. Our RMSE also improves, but whether these improvements are simply due to overfitting we can't be sure until we test it on our testing dataset. Next we come to our Lasso regression. using the same brute force approach. In this way, we can compare it with our stepwise regression. ### Model 4: Lasso regression We first need to create a model matrix using the dummyVars function from the caret package. The predict function is then used to create numeric model matrices for training and test. (Code inspired by an article writtern by Deepika Singh (2019) on pluralsight.com). ```R # Initialise our variable names to be passed on to create the model matrix cols_reg = c('quarter', 'department', 'day', 'team', 'targeted_productivity','smv','wip','over_time','incentive','idle_time','idle_men','no_of_style_change','no_of_workers','actual_productivity') # Create a model matrix using dummyVars function from caret package dummies <- dummyVars(actual_productivity ~ . + .*. + log(team) + log(targeted_productivity) + log(smv) + log(wip) + log(over_time + 0.01) +log(incentive + 0.01) + log(idle_time + 0.01) + log(idle_men + 0.01) + log(no_of_style_change + 0.01) + log(no_of_workers) + I(team^2) + I(targeted_productivity^2) + I(smv^2) + I(wip^2) + I(over_time^2) + I(incentive^2) + I(idle_time^2) + I(idle_men^2) + I(no_of_style_change^2) + I(no_of_workers^2), data = garment_data[,cols_reg]) # Apply predict function to create numeric model matrics for training and testing set train_dummies = predict(dummies, newdata = train[,cols_reg]) test_dummies = predict(dummies, newdata = test[,cols_reg]) print(dim(train_dummies)); print(dim(test_dummies)) ``` [1] 957 270 [1] 240 270 ```R x_train = as.matrix(train_dummies) y_train = train$actual_productivity x_test = as.matrix(test_dummies) y_test = test$actual_productivity ``` The next step is to find the optimal lambda value, by implementing the code below to find the best cross-validated lambda. ```R lambdas <- 10^seq(2, -3, by = -.1) # Setting alpha = 1 implements lasso regression lasso_reg <- cv.glmnet(x_train, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5) # Best lambda_best <- lasso_reg$lambda.min lambda_best ``` 0.00199526231496888 We can then train our final Lasso regression model using the optimal lambda. ```R lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE) ``` ```R # Extract coefficients estimates from Lasso model. Code inspired by Mehrad Mahmoudian's answer on Stackoverflow: # https://stackoverflow.com/questions/27801130/extracting-coefficient-variable-names-from-glmnet-into-a-data-frame. tmp_coeffs <- coef(lasso_model) data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x) ```

name	coefficient
(Intercept)	2.868860e-01
over_time	-1.407300e-06
log(team)	-1.257325e-02
log(wip)	1.198079e-02
I(targeted_productivity^2)	2.785709e-01
I(smv^2)	-5.445470e-05
I(over_time^2)	-1.435899e-10
quarterQuarter2:departmentfinishing	2.079728e-02
quarterQuarter3:departmentfinishing	-9.049354e-03
quarterQuarter4:departmentfinishing	-4.567480e-02
quarterQuarter5:departmentfinishing	1.504109e-01
quarterQuarter3:departmentsweing	2.908346e-03
quarterQuarter5:departmentsweing	-1.814669e-02
quarterQuarter1:dayMonday	-6.218259e-04
quarterQuarter3:dayMonday	1.952829e-03
quarterQuarter4:dayMonday	7.424978e-03
quarterQuarter3:dayTuesday	-4.100141e-03
quarterQuarter2:dayWednesday	7.061277e-03
quarterQuarter4:dayWednesday	-6.459491e-03
quarterQuarter1:dayThursday	2.081965e-02
quarterQuarter3:dayThursday	2.324878e-02
quarterQuarter4:dayThursday	-2.574333e-02
quarterQuarter5:dayThursday	-1.744914e-02
quarterQuarter1:daySaturday	2.570991e-03
quarterQuarter2:daySaturday	-9.362130e-03
quarterQuarter4:daySaturday	-2.247711e-02
quarterQuarter3:team	-1.130799e-03
quarterQuarter5:team	2.548928e-03
quarterQuarter3:wip	-6.259030e-06
quarterQuarter4:idle_time	-3.136193e-02
quarterQuarter1:no_of_style_change	-2.964139e-02
departmentfinishing:dayMonday	7.075863e-03
departmentfinishing:dayThursday	-2.938065e-03
departmentfinishing:team	-1.187349e-03
departmentsweing:targeted_productivity	1.321328e-01
departmentfinishing:smv	2.730181e-02
departmentfinishing:over_time	-1.003696e-05
departmentsweing:incentive	2.873826e-03
departmentfinishing:no_of_workers	1.553586e-02
dayMonday:team	-1.159163e-03
dayWednesday:team	1.684257e-04
daySunday:team	-1.694266e-04
dayMonday:targeted_productivity	6.134080e-03
dayThursday:targeted_productivity	-1.261045e-02
dayWednesday:smv	-2.505062e-04
dayMonday:wip	-1.587032e-06
dayTuesday:wip	6.652671e-07
dayTuesday:over_time	-1.002034e-07
dayTuesday:idle_time	-5.983147e-03
dayWednesday:idle_men	-4.743654e-03
daySunday:idle_men	-2.627898e-03
dayWednesday:no_of_style_change	8.021185e-03
targeted_productivity:no_of_style_change	1.719814e-03
targeted_productivity:no_of_workers	1.382017e-03
smv:over_time	-1.699031e-09
wip:idle_men	-2.878491e-06
over_time:incentive	2.026976e-08

This resulted in a quite large model with 55 variables in total. This is significantly more complex than our model using BIC as a criteria. Let's check the model accuracy metrics. ```R lasso_predict <- predict(lasso_model, s = lambda_best, newx = x_train) # Calculate the accuracy statistics lasso_accuracy <- Model.Accuracy(lasso_predict,train$actual_productivity,901,55) # Print the accuracy statistics cat("\nModel accuracy of Lasso:\n") cat("\nDegrees of freedom: 901\nModel parameters: 55 plus intercept") cat("\nResidual standard error:",lasso_accuracy$rse) cat("\nPercentage error:",lasso_accuracy$rse*100/mean(train$actual_productivity),"%") cat("\nR-Squared:",lasso_accuracy$rsquared) cat("\nF-statistic:",lasso_accuracy$f.stat) # Print the RMSE cat("\n\nRMSE:", RMSE(lasso_predict,train$actual_productivity)) ``` Model accuracy of Lasso: Degrees of freedom: 901 Model parameters: 55 plus intercept Residual standard error: 0.1276014 Percentage error: 17.49614 % R-Squared: 0.509509 F-statistic: 17.017 RMSE: 0.1238118 It is quite interesting to see that even though this Lasso model uses many more variables than our BIC model, its R-squared and F-statistic are still slightly below the BIC model, at least for the training dataset. This suggests that the BIC model is the better one. ## 5. Results and discussion Let's test and compare all the models we have built thus far on the testing set. I will write another function that is quite similar to the Model.Accuracy and RMSE functions above to put everything into a neat table. ```R # Compute R^2 and RMSE from true and predicted values eval_results <- function(true_test, predicted_test, true_train, predicted_train, model_name) { SSE_test <- sum((predicted_test - true_test)^2) SST_test <- sum((true_test - mean(true_test))^2) R_square_test <- 1 - SSE_test / SST_test RMSE_test = sqrt(SSE_test/length(predicted_test)) SSE_train <- sum((predicted_train - true_train)^2) SST_train <- sum((true_train - mean(true_train))^2) R_square_train <- 1 - SSE_train / SST_train RMSE_train = sqrt(SSE_train/length(predicted_train)) # Model performance metrics data.frame(Model = model_name, RMSE_test = RMSE_test, R_square_test = R_square_test, RMSE_train = RMSE_train, R_square_train = R_square_train) } # Test Linear model 1 predictions_test <- predict(lm_model_14, newdata = test) lm1_results <- eval_results(y_test, predictions_test, y_train, lm_model_14_predict, "Linear Model 1") # Test Linear model 2 predictions_test <- predict(lm_model_alt, newdata = test) lmalt_results <- eval_results(y_test, predictions_test, y_train, lm_model_alt_predict, "Linear Model 2") # Test BIC stepwise model predictions_test <- predict(bic_fit, newdata = test) bic_results <- eval_results(y_test, predictions_test, y_train, bic_fit_predict, "BIC Stepwise Model") # Test Lasso model predictions_test <- predict(lasso_model, s = lambda_best, newx = x_test) lasso_results <- eval_results(y_test, predictions_test, y_train, lasso_predict, "Lasso Model") # Merge all result dataframes. Code inspired by Charles' answer on Stackoverflow: # https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list. Reduce(function(df1, df2) merge(df1, df2, all = TRUE), list(lm1_results,lmalt_results,bic_results,lasso_results)) ```

Model	RMSE_test	R_square_test	RMSE_train	R_square_train
Linear Model 1	0.1431625	0.2247345	0.1476508	0.3024447
Linear Model 2	0.1404324	0.2540210	0.1400045	0.3728219
BIC Stepwise Model	0.1374985	0.2848652	0.1237293	0.5101624
Lasso Model	0.1337447	0.3233793	0.1238118	0.5095090

### Summary of model choice From the comparison between our 4 models, we can tell that the BIC Stepwise Model is the best if we go solely by RMSE and R-squared. However, its R-squared for the traning set is much higher than the testing set, which suggests it might be overfitting. Furthermore, there is also the concern of complexity with 21 variables. The most consistent model is Linear Model 2, where we transformed a few variables according to our EDA. Its RMSE is about the same for both the testing set and training set. This model is relatively simple, compared to the BIC Stepwise Model. Therefore, I believe in the end we should favor Linear Model 2 over the BIC Stepwise Model, simply due to the fact the performance of the 2 models on the testing set is not that far off from one another, and that Linear Model 2 is much simpler and easy to implement. Linear Model 2 also has quite consistent RMSE for both the traning and testing set, which suggests that it does not overfit. Hence our model of choice is below: ```R coef(summary(lm_model_alt)) ```

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-1.615576e-01	6.820024e-02	-2.3688710	1.804251e-02
good_incentive	1.362920e-01	1.620976e-02	8.4080184	1.521287e-16
good_team	6.193486e-02	9.479558e-03	6.5335179	1.047988e-10
good_quarter	4.045000e-02	9.460305e-03	4.2757602	2.098043e-05
targeted_productivity	5.926710e-01	4.724174e-02	12.5454939	1.658912e-33
smv	2.107342e-03	6.782437e-03	0.3107058	7.560928e-01
no_of_workers	7.021756e-02	8.272553e-03	8.4880158	8.046002e-17
I(no_of_workers^2)	-3.127325e-03	3.889656e-04	-8.0401077	2.667761e-15
I(no_of_workers^3)	4.195135e-05	5.573110e-06	7.5274583	1.204326e-13
I(no_of_workers^5)	-1.881858e-09	2.734006e-10	-6.8831535	1.064906e-11
smv:no_of_workers	-1.645268e-04	1.253084e-04	-1.3129749	1.895098e-01

### Model analysis summary Let's summarise our description and analysis of the model we just picked. ```R summary(lm_model_alt) ``` Call: lm(formula = actual_productivity ~ good_incentive + good_team + good_quarter + targeted_productivity + smv + no_of_workers + smv * no_of_workers + I(no_of_workers^2) + I(no_of_workers^3) + I(no_of_workers^5), data = train) Residuals: Min 1Q Median 3Q Max -0.55377 -0.06780 0.00583 0.07997 0.51320 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.616e-01 6.820e-02 -2.369 0.018 * good_incentive 1.363e-01 1.621e-02 8.408 < 2e-16 *** good_team 6.193e-02 9.480e-03 6.534 1.05e-10 *** good_quarter 4.045e-02 9.460e-03 4.276 2.10e-05 *** targeted_productivity 5.927e-01 4.724e-02 12.545 < 2e-16 *** smv 2.107e-03 6.782e-03 0.311 0.756 no_of_workers 7.022e-02 8.273e-03 8.488 < 2e-16 *** I(no_of_workers^2) -3.127e-03 3.890e-04 -8.040 2.67e-15 *** I(no_of_workers^3) 4.195e-05 5.573e-06 7.527 1.20e-13 *** I(no_of_workers^5) -1.882e-09 2.734e-10 -6.883 1.06e-11 *** smv:no_of_workers -1.645e-04 1.253e-04 -1.313 0.190 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1408 on 946 degrees of freedom Multiple R-squared: 0.3728, Adjusted R-squared: 0.3662 F-statistic: 56.23 on 10 and 946 DF, p-value: < 2.2e-16 ```R par(mfrow=c(2,2)) plot(lm_model_alt) ``` ![png](output_138_0.png) The residual plots: - Residual vs Fitted plot shows the residuals shows quite concentrated residuals around a horizontal line without distinct patterns, which means that the model is meeting the assumption of homoscedasticity - the error terms are constant along the regression line. - The Q-Q plot confirms that our data distribution is left skewed compared to a normal distribution, with heavy tails, but it is approximately normal. - Scale-Location - The chart shows the variance of the residuals is reasonably consistent. - Residuals vs Leverage - The chart shows there is one possibly influential outlier, that is very close to the Cook's line. ```R lm_model_alt_predict <- lm_model_alt$fitted.values # Calculate the accuracy statistics lm_model_alt_accuracy <- Model.Accuracy(lm_model_alt_predict,train$actual_productivity,946,10) # Print the accuracy statistics cat("\nModel accuracy:\n") cat("\nDegrees of freedom: 945\nModel parameters: 11 plus intercept") cat("\nResidual standard error:",lm_model_alt_accuracy$rse) cat("\nPercentage error:",lm_model_alt_accuracy$rse*100/mean(train$actual_productivity),"%") cat("\nR-Squared:",lm_model_alt_accuracy$rsquared) cat("\nF-statistic:",lm_model_alt_accuracy$f.stat,"; p-value:",pf(56.23,10,946,lower.tail=FALSE)) # Print the RMSE cat("\n\nRMSE:", RMSE(lm_model_alt_predict,train$actual_productivity)) ``` Model accuracy: Degrees of freedom: 945 Model parameters: 11 plus intercept Residual standard error: 0.1408161 Percentage error: 19.30808 % R-Squared: 0.3728219 F-statistic: 56.23436 ; p-value: 6.198772e-89 RMSE: 0.1400045 **Residuals**: These seems to be centred around zero, with a small difference in the minimum and maximum distance. This tells us that there is an even distribution of the residuals. **Residual Standard Error**: The residual standard error, or the estimated standard deviation of the residuals is 0.1392, which is around 18% of the median actual productivity, which is 0.7733. $R^2$: This indicates the model only explains around 38% of the variation of our dependent variable, actual_productivity, in our training set. For our testing set, R-squared is 0.2631. After converting the regression results to the house price, the $R^2$ value increases and shows the model explains 26.31% of the variation in the house price. It is worth noting that a small value of R-squared does not necessarily mean that it is a bad model. In fact, "if the dependent variable is a properly stationarized series (e.g., differences or percentage differences rather than levels), then an R-squared of 25% may be quite good." (Duke University, n.d.). **F-statistic**: The F-statistic calculated using the house prices is 52.67, with a very small p-value, meaning the null hypothesis (the model is not useful) can be rejected. There are 10 variables in the model, leaving 946 degrees of freedom from the sample size of 957. **Coefficients**: All the numerical variable coefficients are significant because of small p-values, except for the interaction term between smv and no_of_workers. I choose to leave this interaction term in mainly because of the strong positive correlation I found in EDA, and because the p-value is not too large, around 0.1. #### Check for Influential Outliers I will use the code from the tutorial to check for influential outlier points. ```R outlierTest(lm_model_alt, cutoff=0.05, digits = 1) ``` No Studentized residuals with Bonferroni p < 0.05 Largest |rstudent|: rstudent unadjusted p-value Bonferroni p 169 -3.975064 7.5722e-05 0.072466 The outlier test has not reported any outliers. Let's generate an influence plot to see if there are any influential data points. ```R influencePlot(lm_model_alt, scale=5, id.method="noteworthy", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" ) ```

	StudRes	Hat	CookD
169	-3.9750637	0.005934751	0.008443855
461	-3.8637500	0.015411160	0.020934277
615	3.7656327	0.050254853	0.067273680
872	-0.6553112	0.128890210	0.005779784
906	0.2149649	0.988381529	0.357730566

![png](output_145_1.png) The influence plot has reported a number of influential points. The most noticeable one is 906, with a very large Cook's distance. However, it is not classified as an outlier with our test from before. This is no doubt the most influential point for our model. The other 4 points (169, 461, 615, 627) all have quite large studentised residual, meaning that the model has produced a poor prediction for these sample. ### Summary of key factors that have strong affect on the actual productivity After doing thorough EDA and building several models, including some overfitted ones, I believe I have identified several important factors that have a strong effect on actual productivity. Based on EDA, it appears that **targerted_productivity** can have a strong positive impact on actual_productivity, some sort of a self-fulfilling prophecy effect. **quarter** is another factor that can influence actual productivity, with Quarter 5, 1, and 2 being the most productive respectively. **team** can also affect actual productivity, with team 1, 12, and 3 being the most productive respectively. **no_of_workers** can also play a role in influencing actual_productivity, even though its effect is more nuanced and not necessarily linear, as demonstrated through EDA and model development. **incentive** is also another important factor that can be a meaningful predictor of actual_productivity, however only after we remove some of the outliers and zero values, which makes sense since zero incentive should not have an effect on actual_productivity. I don't consider **smv** to be a key factor but I still included it in the main model because it is highly correlated with **no_of_workers**. This relationship can be captured with an interaction term. ## 6. Conclusion ### Summaries of each section Detailed summaries of each section are provided at the end of that respective section. ### What I've learned Through the process of EDA, I believe many interesting insights have been gained on this particular garment factory operation, even though this dataset is quite limited. I also learned that by tweaking the way we look at a variable, perhaps looking at it alongside several other variables, new useful patterns can emerge. Furthermore, I learned that building models is very much an art as well as a science. If you don't have a good sense for what's there, no matter how solid your statistics background is, it is very hard to get close to the truth. I don't think I got close to the truth here, but I hope this is a good enough first attempt so that I can build on it further in the future. ### Future work These models can definitely be refined further, maybe more data will benefit them as well. Other more advanced machine learning models such as Random Forest, or Support Vector Machines can also be implemented with potentially much better results. ## 7. References Charlse (2011, Nov 11). *Simultaneously merge multiple data.frames in a list*. [Response to] Retrieved from https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list dickoa (2013, Jun 19). *How to split data into training/testing sets using sample function?*. [Response to] Retrieved from https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function Duke University (n.d). *What’s a good value for R-squared?*. [Response to] Retrieved from https://people.duke.edu/~rnau/rsquared.htm#punchline Frans Rodenburg (2019, Apr 23). *Why is lasso more robust to outliers compared to ridge?*. [Response to] Retrieved from https://stats.stackexchange.com/questions/404495/why-is-lasso-more-robust-to-outliers-compared-to-ridge Mehrad Mahmoudian (2015, Jan 15). *Extracting coefficient variable names from glmnet into a data.frame*. [Response to] Retrieved from https://stackoverflow.com/questions/27801130/extracting-coefficient-variable-names-from-glmnet-into-a-data-frame Singh, D. (2019, Nov 12). *Linear, Lasso, and Ridge Regression with R*. Retrieved from https://www.pluralsight.com/guides/linear-lasso-and-ridge-regression-with-r --- *December 2021 - Melbourne, Victoria.* *Thanks for reading! Check out my other projects and blog posts as well if you would like.*