Matthew Wysocki
12/7/2013
Application of statistical machine learning to accurately predict the heating and cooling loads exerted by a building during operation. This method will determine what parameters have the greatest effects on building energy usage so engineers and architects may make more informed decisions while designing buildings.
Introduction
A great deal of research has gone into accurately predicting building efficiency. These efforts cover a wide range of aspects, from human behavior while the building is in operation to improving the accuracy of simulations in engineering software. This comes as a result of a steady increase in building energy usage and growing concern about environmental impact. A majority of a building’s energy usage can be attributed to heating, ventilation, and air conditioning; making buildings more efficient in these properties will drastically increase
Simulation tools are commonly employed to analyze and predict the energy usage of buildings; with the aid of these tools, engineers are able to accurately predict the energy use of a building before construction has started. These tools can compare two buildings that are identical in every respect except for a single parameter; this direct comparison can yield good insight into how a given parameter affects the overall energy usage of a building. In practice, running a large number of simulations can be very time consuming, especially for large projects; further, these simulation tools require a high level of proficiency from the user.
Recent experiments have focused on using machine learning to accurately predict the results of these simulations. With an adequately trained decision tree, the impact of changes made to certain parameters can be quantified, and engineers and architects will be able to make more informed decisions relating to energy usage during the design process. For this study, I will investigate the effects of eight building parameters on overall building energy consumption. These eight parameters (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, and glazing area distribution) have all been studied for their effects on building energy usage. These inputs will be used to predict the heating load and cooling load of particular buildings.
Data
The dataset that will be used was obtained from the University of California-Irvine Machine Learning Repository. Each building form in the set has an equal volume of 771.75 m³ and is simulated using the same building materials: materials that are considered common in the construction industry. Further, each building is simulated as if it were located in Athens, Greece. Internal climate conditions such as temperature, humidity, and lighting were also held constant.
The glazing area is given as a percentage: 10%, 25%, or 40%. The glazing distribution is broken down into five categories: a uniform distribution with 25% of the total glazing on each side of the building, and four distributions that place 55% of the glazing on one side of the building and 15% on each of the other sides, one for each of the four cardinal directions: north, south, east, and west. The dataset also includes buildings that do not have any glazing area or glazing distribution at all. The orientation of the building takes one of four values, one for each cardinal direction. There are also twelve different building forms that vary in surface area, wall area, roof area, relative compactness, and building height. This leads to the 768 different samples that are in the data set. Figure 1 and Figure 2 show the distributions for each of the inputs and each of the outputs respectively.
Figure 1
Figure 2
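The figure of 768 samples can be checked with a short calculation. The sketch below assumes that every building form is combined with every orientation, glazing area, and glazing distribution, and that the buildings without glazing appear once per form and orientation; the variable names are illustrative and are not taken from the dataset.

```python
# Counts taken from the dataset description; the combination rule is an assumption.
building_forms = 12
orientations = 4
glazing_areas = 3          # 10%, 25%, 40%
glazing_distributions = 5  # uniform, north, south, east, west

# Glazed buildings: every form x orientation x glazing area x distribution.
with_glazing = building_forms * orientations * glazing_areas * glazing_distributions

# Unglazed buildings: assumed to appear once per form x orientation.
without_glazing = building_forms * orientations

total_samples = with_glazing + without_glazing
print(with_glazing, without_glazing, total_samples)  # 720 48 768
```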
The building simulations with the specifications given in the dataset were generated using Ecotect building simulation software. Both heating load and cooling load were measured in the simulation. While the simulation results are not guaranteed to be perfectly accurate, they should give a good indication of how each feature can directly affect the energy usage of a building.
Method and Results
One of the first steps I took with the dataset was to map each input variable against each of the output variables. This is useful from a very basic standpoint for visually determining which variables may be more important than others. Figure 3 shows each of the variables mapped to heating load and Figure 4 shows each of the variables mapped to cooling load.
Figure 3
Figure 4
Next, I used three different correlation measures to relate the inputs to the outputs. Each measure characterizes the relationship between an input and an output on a scale from -1 to 1. A value of 1 indicates a perfect positive correlation and a value of -1 indicates a perfect inverse correlation (linear for Pearson, monotonic for the two rank-based coefficients); a value of 0 means that there is no correlation at all between the input and the output. The three measures used were the Pearson product-moment coefficient, Spearman’s rank correlation coefficient, and Kendall’s rank correlation coefficient. They yielded different but overall fairly consistent values. Table 1 shows the correlations of the inputs with the heating load and Table 2 shows the correlations of the inputs with the cooling load.
Input Value / Pearson product-moment coefficient / Spearman’s rank correlation coefficient / Kendall’s rank correlation coefficient
Relative Compactness / 0.6223 / 0.6221 / 0.3541
Surface Area / -0.6581 / -0.6221 / -0.3541
Wall Area / 0.4557 / 0.4715 / 0.3424
Roof Area / -0.8618 / -0.8040 / -0.6102
Overall Height / 0.8894 / 0.8613 / 0.7040
Orientation / -0.0026 / -0.0042 / -0.0031
Glazing Area / 0.2698 / 0.3229 / 0.2632
Glazing Area Distribution / 0.0874 / 0.0683 / 0.0487
Table 1
Input Value / Pearson product-moment coefficient / Spearman’s rank correlation coefficient / Kendall’s rank correlation coefficient
Relative Compactness / 0.6343 / 0.6510 / 0.3871
Surface Area / -0.6730 / -0.6510 / -0.3871
Wall Area / 0.4271 / 0.4160 / 0.3035
Roof Area / -0.8625 / -0.8032 / -0.6056
Overall Height / 0.8958 / 0.8649 / 0.7063
Orientation / 0.0143 / 0.0176 / 0.0130
Glazing Area / 0.2075 / 0.2889 / 0.2398
Glazing Area Distribution / 0.050 / 0.0465 / 0.0331
Table 2
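For readers who want to see how the three coefficients differ, the following is a minimal pure-Python sketch. It is not the code used to produce Tables 1 and 2; the tie-corrected tau-b variant of Kendall's coefficient is an assumption, since the report does not say which variant was used, and the sample values are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment coefficient: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted nowhere
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + tied_x_only) * (pairs + tied_y_only))

# Made-up values: the output rises with the input, but not linearly.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
load = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(feature, load), spearman(feature, load), kendall_tau_b(feature, load))
```

On this toy input the two rank-based coefficients reach 1 because the relationship is perfectly monotonic, while Pearson stays slightly below 1 because the relationship is not linear.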
I chose to approach this problem using a regression tree to make decisions and produce accurate outputs. Each node of the decision tree represents a conditional decision made by the tree; from a programming language standpoint this is an if/else statement. In this case, each node checks whether the value of a specific input feature is above or below some threshold and then goes down the left branch or the right branch of the tree accordingly. Each leaf of the decision tree represents an estimated output value that is assigned to the input vector. Conceptually, a regression tree is simple to understand and easy to visualize; it is also able to get surprisingly accurate results. Figure 5 and Figure 6 show the regression trees that were generated for heating load and cooling load respectively.
Figure 5
Figure 6
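The node and leaf behavior described above can be sketched in a few lines of Python. This is a minimal illustration, assuming splits are chosen to minimize the squared error of the two branches; it is not the implementation that produced Figures 5 and 6, and the four training rows (overall height, glazing area) and their loads are hypothetical values.

```python
def mean(values):
    return sum(values) / len(values)

def squared_error(values):
    """Sum of squared deviations from the mean of the group."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature index, threshold) that most reduces squared error, or None."""
    best, best_err = None, squared_error(targets)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            err = squared_error(left) + squared_error(right)
            if err < best_err:
                best, best_err = (f, t), err
    return best

def build_tree(rows, targets, max_depth=3, min_samples=2):
    """Recursively grow a tree; a leaf stores the mean target of its samples."""
    split = None
    if max_depth > 0 and len(rows) >= min_samples:
        split = best_split(rows, targets)
    if split is None:
        return {"value": mean(targets)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           max_depth - 1, min_samples),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            max_depth - 1, min_samples),
    }

def predict(tree, row):
    """Follow the if/else decisions from the root down to a leaf."""
    while "value" not in tree:
        if row[tree["feature"]] <= tree["threshold"]:
            tree = tree["left"]
        else:
            tree = tree["right"]
    return tree["value"]

# Hypothetical rows: [overall height, glazing area] with made-up loads.
rows = [[3.5, 0.0], [3.5, 0.25], [7.0, 0.0], [7.0, 0.25]]
loads = [10.0, 12.0, 30.0, 34.0]
tree = build_tree(rows, loads)
shallow_tree = build_tree(rows, loads, max_depth=1)
print(predict(tree, [7.0, 0.25]), predict(shallow_tree, [7.0, 0.25]))
```

Limiting the depth, as in `shallow_tree`, makes each leaf average over more samples, which trades accuracy on the training data for a simpler tree.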
Once the learner is trained, the next step is to test the performance of the decision tree; this can be done by predicting the outputs for data that was not used in training and measuring the error. Cross validation is a common statistical resampling technique that can be used for testing performance. The dataset is divided into a training subset and a testing subset, where the training subset is used to generate the learner and the testing subset is used to check the generalized performance. In this case I use 10-fold cross validation: the dataset is split into ten parts, and each part serves once as the testing subset while the other nine form the training subset. Finally, the mean absolute error, mean squared error, and mean relative error are recorded. Table 3 shows these three errors for heating load and cooling load.
Output Variable / Mean Absolute Error / Mean Squared Error / Mean Relative Error
Heating Load / 0.52 ± 0.16 / 1.10 ± 0.05 / 2.18 ± 0.61
Cooling Load / 1.46 ± 0.21 / 6.59 ± 1.57 / 4.61 ± 0.68
Table 3
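The evaluation procedure can be sketched as follows. This is an illustration under stated assumptions, not the code behind Table 3: the mean relative error is assumed to be a percentage of the true value, the ± spread is assumed to be the standard deviation across folds, and the learner is a hypothetical stand-in that always predicts the mean training target so that the example stays self-contained.

```python
import random
from statistics import mean, stdev

def error_metrics(actual, predicted):
    """Return (mean absolute error, mean squared error, mean relative error in %)."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mre = 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n
    return mae, mse, mre

def k_fold_indices(n, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, targets, fit, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, and collect the errors."""
    per_fold = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
        predicted = [predict(model, rows[i]) for i in test_idx]
        per_fold.append(error_metrics([targets[i] for i in test_idx], predicted))
    return per_fold

# Hypothetical stand-in learner: always predicts the mean training target.
def fit_mean(rows, targets):
    return mean(targets)

def predict_mean(model, row):
    return model

# Made-up data; a real run would use the 768 samples and a regression tree.
rows = [[float(i)] for i in range(20)]
targets = [2.0 * i + 10.0 for i in range(20)]
fold_errors = cross_validate(rows, targets, fit_mean, predict_mean, k=10)

for name, values in zip(("MAE", "MSE", "MRE (%)"), zip(*fold_errors)):
    print(f"{name}: {mean(values):.2f} ± {stdev(values):.2f}")
```

Passing `fit` and `predict` as arguments keeps the validation loop independent of the learner, so the same loop could evaluate a regression tree or any other model.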
Conclusion
I have built a regression tree that takes in eight different variables and uses them to accurately predict heating and cooling load. I have analyzed the input data and correlated each feature with the outputs. These findings are significant as they may allow engineers to understand how a change in each input affects both heating and cooling, and to make informed decisions while designing buildings without the need to run many lengthy simulations. It is worth noting that these results are limited by the accuracy of the simulation run in Ecotect; investigating the accuracy of these simulations is beyond the scope of this project. The results may enable engineers and designers to save large amounts of time, design buildings much faster, and produce buildings that are more energy efficient.