Volume 8, Issue 6, e1265
Focus Article
Open Access

Performance evaluation in non-intrusive load monitoring: Datasets, metrics, and tools—A review

Lucas Pereira (corresponding author)
M-ITI/LARSYS, Madeira Interactive Technologies Institute, Funchal, Portugal
Email: [email protected]

Nuno Nunes
M-ITI/LARSYS, Técnico, University of Lisbon, Lisbon, Portugal

First published: 22 May 2018
Funding information: Fundação para a Ciência e a Tecnologia, Grant Numbers: SFRH/DB/77856/2011, UID/EEA/50009/2013

Non-intrusive load monitoring (also known as NILM or energy disaggregation) is the process of estimating the energy consumption of individual appliances from electric power measurements taken at a limited number of locations in the electric distribution of a building. This approach reduces sensing infrastructure costs by relying on machine learning techniques to monitor electric loads. However, the ability to evaluate and benchmark the proposed approaches across different datasets is key to enabling the generalization of research findings and, consequently, to the large-scale adoption of this technology. Still, only recently have researchers focused on creating and standardizing the existing datasets in order to deliver a single interface for running NILM evaluations. Furthermore, there is still no consensus regarding which performance metrics should be used to measure and report the performance of NILM systems and their underlying algorithms. This paper provides a review of the main datasets, metrics, and tools for evaluating the performance of NILM systems and technologies. Specifically, we review three main topics: (a) publicly available datasets, (b) performance metrics, and (c) frameworks and toolkits. The review suggests future research directions in NILM systems and technologies, including cross-dataset evaluation, performance metrics, and generalizable frameworks for benchmarking NILM technology.

This article is categorized under:

  • Application Areas > Science and Technology
  • Application Areas > Data Mining Software Tools
  • Technologies > Computational Intelligence
  • Technologies > Machine Learning

Graphical Abstract

Toward systematic performance evaluation of non-intrusive load monitoring algorithms: a survey of the existing public datasets, performance metrics, tools, and frameworks.


In the past three decades, a substantial body of research has been devoted to the development of non-intrusive load monitoring (NILM) approaches that are able to sense and disaggregate energy consumption from measurements taken at a limited number of locations in the electric distribution infrastructure. NILM technology, also known as energy disaggregation, is key to reducing the sensing infrastructure costs in buildings and even electrical grids. It contrasts with approaches that rely on deploying and connecting multiple sensors to each appliance to monitor their consumption.

Early research on this topic dates back to 1985, when George Hart from the Massachusetts Institute of Technology (MIT) coined the term Non-intrusive (Appliance) Load Monitoring (NIALM) (Hart, 1985). In very simple terms, NILM is defined as the set of signal processing and machine-learning techniques used to estimate the aggregate and individual appliance electricity consumption from electric power measurements taken at a limited number of locations in the electric distribution infrastructure of a house or building (optimally the mains, hence covering the demand of the entire space).

Still, it is only recently that NILM gained renewed attention from the research community. This was motivated by the availability of advanced metering technologies (e.g., smart-grids) and also by energy efficiency concerns driven by the need to reduce the carbon footprint of buildings and households (Carrie Armel, Gupta, Shrimali, & Albert, 2013). Furthermore, leveraging the advances in machine learning, NILM is expected to serve as the backbone technology enabling innovative smart-grid services that go beyond helping individuals save energy (Townson, 2016).

The combination of the need for technologies promoting low-carbon emissions and the advances in machine learning and statistical techniques is generating a substantial number of energy disaggregation review papers, such as Nalmpantis and Vrakas (2018), Esa, Abdullah, and Hassan (2016), Abubakar, Khalid, Mustafa, Shareef, and Mustapha (2016), Wong, Ahmet Sekercioglu, Drummond, and Wong (2013), Butner, Reid, Hoffman, Sullivan, and Blanchard (2013), Makonin (2012), Zoha, Gluhak, Imran, and Rajasegarar (2012), Jiang, Li, Luo, Jin, and West (2011), and Zeifman and Roth (2011). In general, NILM approaches are grouped into two categories: (a) event-based approaches and (b) event-less approaches (Bergés & Kolter, 2012).

Event-based approaches are intrinsically related to the early days of NILM. They seek to disaggregate the total consumption by means of detecting and labeling every appliance transition in the aggregated signal (see Figure 1). Event-based approaches rely on previously trained supervised or semi-supervised learning algorithms to label the electric load power events. Consequently, approaches under this category require a data collection step where a number of transitions (i.e., power events) from the appliances of interest are collected, labeled, and stored, to be used later as training data.

Figure 1. Example of event-based energy disaggregation (Reprinted with permission from Hart (1992). Copyright 1992 IEEE)
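To make the event-based pipeline concrete, the following minimal Python sketch (not from the paper; the appliance signatures, threshold, and power trace are all hypothetical) detects power events by thresholding sample-to-sample differences and labels each event with the nearest known appliance step size:

```python
# Illustrative event-based NILM sketch: detect step changes in an aggregate
# power signal, then label each event by the closest known appliance step
# size. AGG, SIGNATURES, and THRESHOLD are assumed values, not real data.

AGG = [100, 100, 100, 1600, 1600, 1600, 100, 100, 300, 300, 100]  # watts, 1 Hz
SIGNATURES = {"heater": 1500, "lamp": 200}  # nominal step sizes (assumed)
THRESHOLD = 50  # minimum |dP| in watts to count as an event

def detect_events(samples, threshold=THRESHOLD):
    """Return (index, delta) pairs where power changes by more than threshold."""
    return [(i, samples[i] - samples[i - 1])
            for i in range(1, len(samples))
            if abs(samples[i] - samples[i - 1]) > threshold]

def label_event(delta, signatures=SIGNATURES):
    """Assign the event to the appliance whose step size is closest to |delta|."""
    name = min(signatures, key=lambda a: abs(abs(delta) - signatures[a]))
    return name, ("on" if delta > 0 else "off")

for i, delta in detect_events(AGG):
    name, state = label_event(delta)
    print(f"t={i}s  dP={delta:+d} W  ->  {name} {state}")
```

A real event detector must also cope with noise, ramps, and multi-state appliances, which is precisely why the detection and classification metrics reviewed later are needed.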

Event-less approaches, on the other hand, do not rely on event detection (ED) and classification. Instead, these approaches attempt to match each sample of the aggregated power with the consumption of one specific appliance or a combination of different appliances (see Figure 2), by means of statistical (e.g., Bayesian methods) and probabilistic (e.g., hidden Markov models) machine-learning methods. Therefore, the training data does not require any labeled transitions; only the aggregated consumption of the loads of interest is required, which makes the process of collecting training data for event-less approaches more straightforward than for event-based approaches.

Figure 2. Example of event-less energy disaggregation (Reprinted with permission from Hart (1992). Copyright 1992 IEEE)
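The per-sample matching idea can be illustrated with a brute-force sketch in the spirit of Hart's combinatorial optimization (the appliance ratings below are hypothetical; practical systems use the statistical models mentioned above rather than exhaustive search):

```python
# Illustrative event-less sketch: for each aggregate sample, pick the on/off
# combination of known appliance power draws whose sum best matches the
# measurement. RATINGS is assumed data, not from any real dataset.
from itertools import product

RATINGS = {"fridge": 120, "kettle": 2000, "tv": 80}  # watts (assumed)

def disaggregate_sample(aggregate_w):
    """Return the on/off state map whose total power is closest to the sample."""
    names = list(RATINGS)
    best = min(product([0, 1], repeat=len(names)),
               key=lambda states: abs(aggregate_w -
                                      sum(s * RATINGS[n]
                                          for s, n in zip(states, names))))
    return {n: bool(s) for n, s in zip(names, best)}

print(disaggregate_sample(2120))  # fridge and kettle on, tv off
```

Exhaustive search scales exponentially with the number of appliances, which is one reason the field moved to probabilistic models such as hidden Markov models.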

Despite the growing body of work in this field, there are still many challenges that need to be addressed before NILM technology becomes practical and reliable. One such challenge is the set of issues related to the replication and generalization of research findings (e.g., the lack of proper test and training data and the absence of a formal agreement on how to report disaggregation results). This important research challenge has only recently become the focus of a group of NILM researchers, such as Butner et al. (2013), Batra, Kelly, et al. (2014), Batra, Parson, et al. (2014), Makonin and Popowich (2014), Mayhorn, Sullivan, Petersen, Butner, and Johnson (2016), and Pereira and Nunes (2017).

In this paper, we present the first review of the research efforts toward the performance evaluation of NILM algorithms and systems. These efforts are summarized in the next three sections. More specifically, we first present, describe, and compare the currently publicly available NILM datasets. Second, we review the performance metrics reported to assess the accuracy of the proposed NILM algorithms and systems. Third, we present a number of tools and frameworks developed to leverage the potential of NILM datasets and performance metrics.


An energy disaggregation dataset is a collection of electrical energy measurements taken from real-world scenarios, without disrupting the everyday routines in the monitored space, that is, trying to keep the data as close to reality as possible.

These datasets usually contain measurements of the aggregate consumption (taken from the mains) and of the individual loads (i.e., ground-truth data), which are obtained either by measuring each load at the plug level or by measuring the individual circuit to which the load is connected. In a real-world scenario, multiple loads are typically connected to the same circuit. Therefore, circuit-level measuring does not always ensure the availability of individual consumption data for each load.

Similarly to the classification of NILM techniques, the currently available datasets can also be categorized as event-based or event-less datasets. The major difference between the two is that event-less approaches do not require the identification of individual power changes. Consequently, collecting datasets for event-less approaches is more straightforward and less time consuming. This partly explains the higher availability of event-less datasets, as we will see next.

Currently, to the best of our knowledge, there are 26 publicly available NILM datasets. Of these, the vast majority (21) are suitable for evaluating event-less approaches, and only five can be used to evaluate event-based approaches.

In order to facilitate comparisons between the existing solutions, in Table 1 we summarize the 26 datasets, sorted by year of initial release. Of the 26 datasets, 23 are from individual households, 1 is from a living-lab household (the energy smart home lab, ESHL), and 2 are from university/office buildings (the commercial building dataset, COMBED, and the building-level office eNvironment dataset, BLOND).

Table 1. Overview of publicly available energy monitoring and disaggregation datasets
Dataset (year) | Country (sites) | Duration | Continuity | Features and resolutionh
REDD (2011) | United States (6) | 2–4 W | NC | I, V: 15 kHz; P: 1 Hz; IC, IA: 3–4 s
Smart*b (2012) | United States (8a) | 3–4 M / 3 Y | NC | 2012: 1 Hz; 2017: 30 m
BLUED (2012) | United States (1) | 1 W | C | I, V: 12 kHz; P, Q: 60 Hz; IA: 1 Hz
HESc (2012) | United Kingdom (251) | 1–12 M | C | EE; 2–5 m
Tracebasec (2012) | Germany (N/A) | 1,883 d | NC | 1–10 s
Dataport (2013) | United States (1,400+) | 4 Y | C | 1 m
AMPds (2013) | Canada (1) | 2 Y | C | F, PF, EE; 1 m
iAWE (2013) | India (1) | 74 D | C | F, PA, EE; 1 Hz
IHEPCDS (2013) | France (1) | 4 Y | C | 1 m
ACS-Fxc (2013) | Switzerland (N/A) | N/A | — | PA; 10 s
UK-DALE (2014) | United Kingdom (5) | 655 D | NC | I, V: 16 kHz; P, Q: 1 Hz; A, IA: 6 s
ECO (2014) | Switzerland (6) | 8 M | NC | PA; 1 Hz
REFIT (2014) | United Kingdom (20) | 2 Y | C | EP; 6–8 s
GREENDc (2014) | Austria, Italy (9) | 3–6 M | C | 1 Hz
PLAID I and IId (2014) | United States (55) | N/A | — | 30 kHz
RBSAc (2014) | United States (101) | 27 M | — | EE; 15 m
COMBEDg (2014) | India (1) | 1 M | — | EE; 30 s
DRED (2015) | Netherlands (1) | 6 M | C | 1 Hz, 1 m
HFEDd (2015) | India (N/A) | N/A | — | EMI: 10 kHz–5 MHz
WHITEDd (2016) | Germany, Austria, Indonesia (N/A) | N/A | — | 44.1 kHz
COOLLd (2016) | France (N/A) | N/A | — | 100 kHz
SustDataED (2016) | Portugal | 10 D | NC | I, V: 12.8 kHz; P, Q: 50 Hz; IA: 0.5 Hz
EEUD (2017) | Canada (23) | 1 Y | — | 1 m
ESHLe (2017) | Germany (1f) | 4–5 Y | NC | 0.5–1 Hz
RAE (2018) | Canada (1) | 72 D | C | F, PF; 1 Hz
BLONDg (2018) | Germany (1) | 50–230 D | C | A: 50–250 kHz; IA: 6.4–50 kHz
REDD (Kolter & Matthew, 2011)
Smart* (Barker, Mishra, Irwin, Cecchet, & Shenoy, 2012)
BLUED (Anderson et al., 2012)
HES (Household electricity survey: a study of domestic electrical product usage, 2012)
Tracebase (Reinhardt et al., 2012)
Dataport (Holcomb, 2012)
AMPds (Makonin, Ellert, Bajić, & Popowich, 2016)
iAWE (Batra, Gulati, Singh, & Srivastava, 2013)
IHEPCDS (Bache & Lichman, 2013)
ACS-Fx (Gisler, Ridi, Zufferey, Khaled, & Hennebert, 2013; Ridi, Gisler, & Hennebert, 2014)
UK-DALE (Kelly & Knottenbelt, 2015)
REFIT (Murray et al., 2015)
GREEND (Monacchi, Egarter, Elmenreich, D'Alessandro, & Tonello, 2014)
PLAID I and II (Baets et al., 2017; Gao, Giri, Kara, & Bergés, 2014)
RBSA (Ecotope Inc, 2014)
COMBED (Batra, Parson, et al., 2014)
DRED (Uttama Nambi, Reyes Lua, & Prasad, 2015)
HFED (Gulati, Ram, & Singh, 2014)
WHITED (Kahl, Ui Haq, Kriechbaumer, & Hans-Arno, 2016)
COOLL (Picon et al., 2016)
SustDataED (Ribeiro, Pereira, Quintal, & Nunes, 2016)
EEUD (Johnson & Beausoleil-Morrison, 2017)
RAE (Makonin, Wang, & Tumpach, 2018)
ESHL (Kaibin Bao, 2016)
BLOND (Kriechbaumer & Jacobsen, 2018)
  • a In the original version (2012) there is data for three houses between 3 and 4 weeks, but only one of them contains submetered data. In the 2017 edition there is data for seven houses during 3 years.
  • b There is a list of light switch events that can be used to label the events of lighting appliances.
  • c These datasets can only be used as training data. Evaluation must happen in datasets where aggregate consumption data is available.
  • d Only for event classification using either cross-validation or power events from other datasets.
  • e Should have been released in 2017 according to the authors.
  • f The home is a living lab.
  • g University/office building.
  • h EE: electric energy; EMI: electromagnetic interference; F: frequency; EP: energy price; PA: phase angle; PF: power factor.

The following characteristics are provided for each dataset: year of release; country; number of monitored sites; whether the data is continuous (C) or not continuous (NC), that is, whether it was collected in consecutive time periods; the approaches enabled by the dataset (event-based, EB, or event-less, EL); the types of metering used in the collection (aggregate, A; individual circuit, IC; individual appliance, IA; and whether a list of power event labels, LE, is available); the available electric features (current, I; voltage, V; active power, P; reactive power, Q; apparent power, S; others, O); and the time resolution of the available data.

In the event-less category, there are 21 datasets. From these, 17 (REDD, Smart*, BLUED, Dataport, AMPds, iAWE, IHEPCDS, UK-DALE, ECO, REFIT, COMBED, DRED, SustDataED, EEUD, RAE, ESHL, and BLOND) contain aggregated and individual appliance/circuit consumption, thus making them suitable to be used simultaneously as training and testing data.

The remaining five datasets in this category (HES, Tracebase, ACS-Fx, GREEND, and RBSA), only provide individual appliance consumption information. Therefore, they can only serve as training data. One possibility to use them for training and testing would be to calculate the aggregate data by summing the power demand of each appliance. Still, this approach presents some caveats that would affect the final results. For example, this completely excludes the effects of the appliances that were not submetered, hence resulting in very simplistic and unrealistic datasets that may lead to overoptimistic disaggregation results.
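The synthetic-aggregate workaround above can be sketched as follows; the per-appliance traces are hypothetical, and with real datasets the series would first need timestamp alignment and resampling:

```python
# Sketch of fabricating a mains signal by summing submetered appliance series
# sample-by-sample. The 1 Hz readings below are assumed data; note that the
# result still omits every load that was not submetered.

appliance_power = {                      # watts per 1 s sample (assumed)
    "fridge": [120, 120, 0, 0],
    "washer": [0, 500, 500, 500],
    "lamp":   [60, 60, 60, 0],
}

def synthetic_aggregate(per_appliance):
    """Element-wise sum of the individual appliance traces."""
    n = len(next(iter(per_appliance.values())))
    return [sum(series[t] for series in per_appliance.values())
            for t in range(n)]

print(synthetic_aggregate(appliance_power))  # [180, 680, 560, 500]
```

Because the fabricated mains contains no unmetered loads, measurement noise, or line losses, disaggregation results on such data tend to be overoptimistic, as noted above.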

In the event-based category, only BLUED and SustDataED contain aggregate consumption information and a list of appliance labels for the identified power changes. Therefore, these are the only two alternatives to evaluate event-based approaches. Still, since the two datasets contain only 7 and 10 days of data, respectively, they are not very suitable for evaluating classification and energy estimation (EE) algorithms. One possibility is to use a resampling technique (e.g., bootstrapping or jackknifing) to generate new labels from the existing ones, which can later be used as training data for event classification.
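The bootstrap idea can be sketched as follows; the labeled event pool is hypothetical:

```python
# Sketch of bootstrap resampling: draw events with replacement from a small
# pool of labeled power events to build a larger training set for event
# classification. The event pool below is assumed, not from a real dataset.
import random

labeled_events = [  # (delta P in watts, appliance label) — assumed examples
    (1500, "heater"), (-1500, "heater"), (200, "lamp"), (-200, "lamp"),
    (120, "fridge"), (-120, "fridge"),
]

def bootstrap_events(pool, n, seed=0):
    """Draw n events from the pool with replacement (one bootstrap sample)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.choice(pool) for _ in range(n)]

training_set = bootstrap_events(labeled_events, n=100)
print(len(training_set))  # 100
```

Resampling inflates the sample count but not the diversity of the underlying events, so results obtained this way should be reported with care.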

Finally, the PLAID, HFED, WHITED, and COOLL datasets only contain data from the startup transients and spectral traces of several individual appliances. Consequently, they are only suitable to evaluate feature extraction and classification algorithms using cross-validation (Barsim, Mauch, & Yang, 2016; Gao, Kara, Giri, & Bergés, 2015). Likewise, it should also be possible to use PLAID, WHITED, and COOLL to classify power events from other datasets. Still, since they only contain startup transients, it will not be possible to classify OFF transitions.


As previously mentioned in the introduction of this paper, most of the early efforts in the NILM research were devoted to event-based approaches. Consequently, several performance metrics have been proposed to evaluate such systems. For example, in his seminal work, Hart (1985) used both the fraction of correctly classified power events and the fraction of total energy explained as accuracy metrics. The former evaluates the performance of the event classification step, whereas the latter evaluates the EE step.

Many other performance metrics have been defined in the following years. For instance, in Berges (2010) the author used different metrics for ED (e.g., failed detections—FD, and the detection error rate—DER), event classification (e.g., F1Score), and EE (e.g., the energy identification rate—EIR). A similar approach, of considering the different steps in the NILM pipeline, was presented in Liang, Ng, Kendall, and Cheng (2010). In this work, the authors propose three different metrics to evaluate the event-based NILM pipeline, namely, detection accuracy—DeA, disaggregation accuracy—DiA, and overall accuracy—OA.

All these metrics take into consideration the event detector's type I errors (an event is detected when no appliance changed its state) and type II errors (an appliance is operated but no event is detected). However, they assume that all events are equally important, which is far from plausible since not all appliances have the same energy demand. Therefore, in an attempt to quickly understand the interactions between ED errors and the actual energy demand, in Anderson, Ocneanu, et al. (2012) the authors proposed two new performance metrics: the total power change (TPC) and the average power change (APC), which are the sum (average, in the second case) of the amount of power change for all the type I and type II errors.
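Computed from the verbal definition above (the exact formulation in Anderson, Ocneanu, et al. (2012) may differ), TPC and APC reduce to sums and means of the error deltas; the deltas below are hypothetical:

```python
# Sketch of TPC/APC: sum (respectively, average) the absolute power change of
# every detection error, computed separately for type II (missed) and type I
# (wrongly detected) errors. The error deltas are assumed values.

missed_deltas = [1500, 120]   # |dP| of type II errors (missed events), watts
wrong_deltas = [200]          # |dP| of type I errors (spurious events), watts

def tpc(deltas):
    """Total power change: sum of |dP| over the given errors."""
    return sum(abs(d) for d in deltas)

def apc(deltas):
    """Average power change: mean |dP| over the given errors."""
    return tpc(deltas) / len(deltas) if deltas else 0.0

print(tpc(missed_deltas), apc(missed_deltas))  # 1620 810.0
print(tpc(wrong_deltas), apc(wrong_deltas))    # 200 200.0
```

Weighting errors by their power change keeps a detector from looking good simply because its mistakes happen to involve low-power appliances.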

Event-less approaches, on the other hand, rarely rely on a separate ED process. Instead event-less approaches attempt to disaggregate the total load in separate time slices. Consequently, event-less algorithms only require metrics to evaluate the final EE, which is the last stage in the case of event-based approaches.

The first event-less metrics formulations took into consideration the estimated and the ground-truth energy to capture the energy disaggregation error. For example, in Kolter and Matthew (2011) the authors propose a performance metric that captures the total error in the assigned energy (TEAE), normalized by the real energy consumption in each time slice averaged over all appliances. An equivalent metric was proposed in Kolter and Jaakkola (2012), but this time considering the individual appliance error (IATEAE) rather than the average between all the appliances, which reduces the chance of reporting large errors in certain time slices due to single appliances performing very poorly.

Furthermore, the authors working on event-less approaches “reinvented” the notions of false positives, false negatives, true positives, and true negatives in terms of time slice results. This enables common confusion-matrix based metrics like accuracy, precision, recall, sensitivity, F1-score, and receiver operating characteristic (ROC) to be considered when evaluating event-less approaches (Batra, Kelly, et al., 2014; Batra, Parson, et al. 2014; Beckel, Kleiminger, Cicchetti, Staake, & Santini, 2014; Liang et al., 2010; Makonin & Popowich, 2014).
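A minimal sketch of this time-slice formulation, using hypothetical on/off sequences for a single appliance, is:

```python
# Per-time-slice confusion matrix for one appliance: compare predicted versus
# actual on/off state in each slice, then derive precision, recall, and F1.
# The state sequences below are assumed, not real data.

actual    = [1, 1, 0, 0, 1, 0, 1, 1]  # ground-truth on/off per time slice
predicted = [1, 0, 0, 1, 1, 0, 1, 0]  # disaggregator's on/off per time slice

def confusion(actual, predicted):
    """Return (TP, FP, FN, TN) counts over the time slices."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(actual, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn)  # 3 1 2 2
print(precision, recall, f1)
```

From the same four counts one can also derive FPR, specificity, and the other confusion-matrix metrics listed in Table 2.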

More recently, there was an effort to better understand how to report the performance of NILM algorithms (Butner et al., 2013; Makonin & Popowich, 2014; Mayhorn, Butner, Baechler, Sullivan, & Hao, 2015). These efforts culminated in a proposal to group the performance metrics into two main categories (Mayhorn et al., 2016): (a) ED metrics, designed to evaluate the NILM's ability to track the consumption over time, and (b) EE metrics, designed to characterize and evaluate the NILM disaggregated data against the actual ground-truth.

ED performance includes metrics derived from the confusion matrix (i.e., true positives, false positives, true negatives, and false negatives). EE performance includes metrics based on basic and advanced statistics (e.g., root mean squared error—RMSE, average error—AE, and standard deviation of error—SDE) (Holmes, 2014), as well as a number of metrics specifically designed for NILM, including the previously mentioned EIR (Berges, 2010), TEAE (Kolter & Matthew, 2011), and IATEAE (Kolter & Jaakkola, 2012).

In Tables 2 and 3, we summarize the event-detection and EE metrics, respectively. For each metric we present a brief description, and a list of papers where these metrics are referenced. Finally, if not otherwise mentioned, the metric can be used to evaluate both event-based and event-less approaches.

Table 2. Overview of performance metrics under the event detection category
Metric Description Equation
TP A true positive (Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) is considered whenever the system detects/classifies something as being true and the actual output is true, for example, a power event is labeled as being triggered by appliance A and it actually was (event-based), or a time slice consumption is attributed to appliance A which is actually responsible for it (event-less).
TN A true negative (Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) is considered whenever the system detects/classifies something as being false and the actual output is false, for example, no power event is detected in a given instant and actually no appliance changed its state in that instant (event-based), or for a given time slice no consumption is attributed to an appliance when that appliance is actually not consuming.
FP A false positive (Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) is considered whenever the system detects/classifies something as being true and the actual output is false, for example, a power event is labeled as being triggered by appliance A when it was triggered by appliance B (event-based), or a time slice consumption being attributed to appliance A when that appliance is not working (event-less).
FN A false negative (Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) is considered whenever the system detects/classifies something as being false and the actual output is true, for example, no power event is detected at a given instant but an appliance changed its state in that instant (event-based), or for a given time slice no consumption is attributed to an appliance when that appliance is actually consuming (event-less).
P Precision (Anderson, Bergés, Ocneanu, Benitez, & Moura, 2012; Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) (also called positive predictive value—PPV) is the proportion of relevant instances that were reported as being relevant against all the instances that were reported as relevant. P = TP ÷ (TP + FP)
R Recall (Anderson, Bergés, et al., 2012; Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) (also called sensitivity or true positive rate—TPR) is the proportion of relevant instances that were reported as being relevant against all the truly relevant instances. R = TP ÷ (TP + FN)
F β The F β-measure (Anderson, Bergés, et al., 2012; Batra, Kelly, et al., 2014; Beckel et al., 2014; Berges, 2010) trades off precision and recall. Mathematically, it is the harmonic mean between the two metrics. β is a weighting factor that is used to attach β times as much importance to recall as to precision. For example, if β = 2 (F2-measure), recall is twice as important as precision, whereas if β = 0.5 (F0.5-measure), precision is twice as important as recall. Finally, if β = 1 (F1-measure), recall and precision have the same weight. Fβ = (1 + β²) × P × R ÷ (β² × P + R)
FPR False positive rate (Anderson, Bergés, et al., 2012; Batra, Kelly, et al., 2014) is the proportion of false positives against the actual negative results. FPR = FP ÷ (FP + TN)
ROCAUC The receiver operating characteristic—area under curve (Berges, 2010; Liang et al., 2010; Parson, Mark, & Rodgers, 2014) metric finds the algorithm/parameter configurations that have the best trade-off between TPR and FPR. The area under the ROC curve measures accuracy: an area of 1 represents a perfect test, whereas an area of 0.5 represents a random (therefore worthless) test. a
FD Failed detections (Berges, 2010) is the sum of missed (FN) and wrongfully detected events (FP). FD = FP + FN
DER Detection error rate (Berges, 2010) is the ratio between failed detections and the number of positive cases. DER = FD ÷ (TP + FN)
DeA Detection accuracy (Liang et al., 2010) measures the accuracy of the algorithm, including the effects of the wrongfully detected events.
DiA Disaggregation accuracy (Liang et al., 2010) is the detection accuracy excluding the effects of false positives.
OA Overall accuracy (Liang et al., 2010) is the disaggregation accuracy including the effects of false positives and false negatives.
TPP True positive percentage (Anderson, Bergés, et al., 2012) is the percentage of the ratio between true positives and actual true results. TPP is equivalent to recall. TPP = TP ÷ (TP + FN) × 100%
FPP False positive percentage (Anderson, Bergés, et al., 2012) is the percentage of the ratio between false positives and actual true results. Note that since the number of FP can be larger than the number of real events, FPP can be larger than 100%. FPP = FP ÷ (TP + FN) × 100%
TPC Total power change (Anderson, Bergés, et al., 2012) is the sum of the power deltas of all the false positives or false negatives. Event-based. TPC = Σ ΔP f (type I errors) or Σ ΔP m (type II errors)
APC Average power change (Anderson, Bergés, et al., 2012) is the average of the power deltas of all the false positives or false negatives. Event-based. APC = TPC ÷ N wro (type I errors) or TPC ÷ N miss (type II errors)
HL The hamming loss (Batra, Kelly, et al., 2014) measures the total information loss when appliances are incorrectly classified over the entire dataset. HL = (1 ÷ (T × A)) × Σt Σa XOR(xa(t), x̂a(t)), where xa(t) and x̂a(t) are the metered and estimated states of appliance a at instant t
MFScore The modified F-score (Kim, Marwah, Arlitt, Lyon, & Han, 2011) combines the performance of classification and energy estimation algorithms. It works by introducing the notion of accurate and inaccurate true positives (ATP and ITP): ATP are the correct classifications with low energy estimation errors, whereas ITP are the correct classifications with high estimation errors.
FSFScore The finite-state F-score (Makonin & Popowich, 2014) is a discrete version of the F β-score. A partial penalization (inacc) is applied to the correct classifications (TP), based on the distance between the estimated and the ground-truth states.
BM Informedness (Barsim & Yang, 2018; Powers, 2011) quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition (versus chance). BM = TPR + TNR − 1
MK Markedness (Barsim & Yang, 2018; Powers, 2011) quantifies how marked a condition is for the specified predictor and specifies the probability that a condition is marked by the predictor (versus chance). MK = PPV + NPV − 1
MCC The Matthews correlation coefficient (Barsim & Yang, 2018; Matthews, 1975) indicates the central tendency between informedness and markedness. Mathematically, it is the geometric mean between the two metrics. MCC = √(BM × MK)
N dis Number of events that is accurately recognized by a disaggregation algorithm, that is, TP
N det Total number of detected events (N det = N true + N wro − N miss)
N true True number of events that actually occurred, that is, TP + FN
N wro Number of wrongfully detected events, that is, FP
N miss Number of missed events, that is, FN
ΔP m Power change of a missed event, that is, FN
ΔP f Power change of a wrongfully detected event, that is, FP
T Number of observations or events based on each time step
A Number of appliances being considered
xa(t) Metered state of appliance a at instant t
x̂a(t) Estimated state of appliance a at instant t
ya(t) Metered energy of appliance a at instant t
ŷa(t) Estimated energy of appliance a at instant t
ρ Threshold used to define Accurate and Inaccurate true positives
TNR True negative rate (or specificity) is the inverse recall TNR = TN ÷ (TN + FP)
NPV Negative predictive value is the inverse precision NPV = TN ÷ (TN + FN)
  • a Trapezoidal rule for scoring algorithms, the non-parametric Wilcoxon statistic for discrete algorithms. (Iba, Hasegawa, & Paul, 2009)
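A small numeric illustration of BM, MK, and MCC as defined in the table above, using an assumed confusion matrix:

```python
# Informedness (BM), markedness (MK), and the Matthews correlation
# coefficient (MCC) computed from a hypothetical confusion matrix.
import math

TP, FP, FN, TN = 40, 10, 5, 45  # assumed counts

TPR = TP / (TP + FN)            # recall / sensitivity
TNR = TN / (TN + FP)            # specificity
PPV = TP / (TP + FP)            # precision
NPV = TN / (TN + FN)            # negative predictive value

BM = TPR + TNR - 1              # informedness
MK = PPV + NPV - 1              # markedness
# Geometric mean of BM and MK, keeping the sign of the association:
MCC = math.copysign(math.sqrt(abs(BM * MK)), BM)

print(round(BM, 4), round(MK, 4), round(MCC, 4))
```

As a sanity check, this geometric-mean form agrees with the usual direct MCC formula computed from the same four counts.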
Table 3. Overview of performance metrics under the energy estimation category
Metric Description Equation
RE The relative error (Mayhorn et al., 2016) gives an indication of how good the energy estimation is relative to the ground-truth data. urn:x-wiley:19424787:media:widm1265:widm1265-math-0023
RMSE The root mean square error (Chris Holmes, 2014;Batra, Kelly, et al., 2014 ; Mayhorn et al., 2016) is the standard deviation of the energy estimation errors. The RMSE reports based on how spread-out these errors are. In other words, it tells you how concentrated the estimations are around the true values. The RMSE reports on the same unit as the data, thus making it an intuitive metric. urn:x-wiley:19424787:media:widm1265:widm1265-math-0024
AE The average error (Chris Holmes, 2014; Mayhorn et al., 2016) indicates if the estimated energy is on average overestimated or underestimated. A positive AE implies an overall higher proportion of overestimation, a negative AE implies a higher proportion of underestimation. urn:x-wiley:19424787:media:widm1265:widm1265-math-0025
SDE The standard deviation of error (Chris Holmes, 2014; Mayhorn et al., 2016) indicates the extent of spread of the differences around the AE estimation. A larger SDE implies a wider dispersion of the individual estimated values, a lower SDE implies tighter distributions. urn:x-wiley:19424787:media:widm1265:widm1265-math-0026
r2 The R-squared (Mayhorm et al., 2015; Mayhorn et al., 2016; Mayhorn, Sullivan, Fu, & Petersen, 2017) is a statistical measure of how close the estimations are from the ground-truth data. The higher the coefficient, the higher percentage of estimations is inline with the ground-truth. Values of 1 or 0 indicate that the estimation represents all or none of the data, respectively. urn:x-wiley:19424787:media:widm1265:widm1265-math-0027
%SDx The percent (%) standard deviation eXplained (Mayhorm et al., 2015; Nau, 2017) is the percent by which the standard deviation of the errors is less than the standard deviation of the measured data. It is believed to be more intuitive than the r2 since it reports in the same units as the actual data. urn:x-wiley:19424787:media:widm1265:widm1265-math-0028
EE The energy error (Batra, Kelly, et al., 2014; Mayhorn et al., 2016) is the ratio between the absolute difference of the estimated and true energy, and the total amount of true energy. In Mayhorn et al. (2016), the authors have shown that relatively large errors yield values greater than 1, making this metric less intuitive and explainable: $EE = \frac{\left|\sum_{t=1}^{T}\hat{E}_t - \sum_{t=1}^{T}E_t\right|}{\sum_{t=1}^{T}E_t}$
EAv1 The energy accuracy (Mayhorn et al., 2016) was proposed as an attempt to bound the energy error between 0 and 1. It was reported in Mayhorn et al. (2016) as a stable metric; however, it requires tuning of the α parameter, which may not be desirable: $EA_{v1} = e^{-\alpha \cdot EE}$
MR The match rate (Mayhorn et al., 2016, 2017) evaluates the overlap between the true and estimated energy. It varies between 0 and 1: values tending to 1 indicate a strong match between the estimated and the true energy, whereas values tending to 0 indicate a poor match; a value of exactly zero requires that the true and estimated energy never overlap. This was the metric that demonstrated the best overall performance in Mayhorn et al. (2016): $MR = \frac{\sum_{t=1}^{T}\min\left(E_t, \hat{E}_t\right)}{\sum_{t=1}^{T}\max\left(E_t, \hat{E}_t\right)}$
SEM The standard error of the mean (Kalla, 2017; Mayhorn et al., 2016) reports on how the sample mean varies across different experiments measuring the same quantity. In the case of energy estimation, significant errors will produce a higher SEM; with few to no significant errors, the SEM will tend to zero: $SEM = \frac{\sigma}{\sqrt{T}}$
FEE The fraction of energy explained (Berges, 2010; Hart, 1985) is the ratio between the total estimated energy and the actual energy used. In Berges (2010) this metric is referred to as the energy identification rate (EIR): $FEE = \frac{\sum_{t=1}^{T}\hat{E}_t}{\sum_{t=1}^{T}E_t}$
TECA The total energy correctly assigned (Johnson & Willsky, 2013; Kolter & Johnson, 2011; Makonin & Popowich, 2014) measures the total error in assigned energy, normalized by the aggregate energy consumption in each time slice and averaged over all appliances: $TECA = 1 - \frac{\sum_{t=1}^{T}\sum_{a=1}^{A}\left|\hat{E}_t^{(a)} - E_t^{(a)}\right|}{2\sum_{t=1}^{T}E_t}$
ETEA The error in total energy assigned (Batra, Kelly, et al., 2014) is the difference between the total energy assigned to a given appliance and the energy it actually consumed over the dataset: $ETEA^{(a)} = \left|\sum_{t=1}^{T}\hat{E}_t^{(a)} - \sum_{t=1}^{T}E_t^{(a)}\right|$
Dev The deviation (Beckel et al., 2014) determines the deviation of the inferred electricity consumption from the actual electricity consumption over the dataset. It is the ratio between the ETEA and the actual energy consumed: $Dev^{(a)} = \frac{ETEA^{(a)}}{\sum_{t=1}^{T}E_t^{(a)}}$
FTEAC The fraction of total energy assigned correctly (Batra, Kelly, et al., 2014) is the overlap between the fraction of energy assigned to each appliance and the actual fraction of energy consumed by each appliance over the dataset: $FTEAC = \sum_{a=1}^{A}\min\left(\frac{\sum_{t=1}^{T}\hat{E}_t^{(a)}}{\sum_{t=1}^{T}\hat{E}_t},\ \frac{\sum_{t=1}^{T}E_t^{(a)}}{\sum_{t=1}^{T}E_t}\right)$
T Number of observations or events, one per time step
A Number of appliances being considered
$E_t$ Metered energy at time interval t
$E_t^{(a)}$ Metered energy for appliance a at interval t
$\bar{E}$ Average metered energy over the dataset
$\hat{E}_t$ Estimated energy at instant t
$\hat{E}_t^{(a)}$ Estimated energy for appliance a at instant t
$\Delta E_t$ Error between the NILM and metered data at instant t
$\overline{\Delta E}$ Average error between the NILM and metered data over the entire dataset
α Mapping factor, defined to be 1.4
σ Standard deviation of the error over the dataset
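To make the above definitions concrete, the simpler aggregate-level metrics can be sketched in a few lines of Python. The function below is an illustrative implementation following the notation table (not code from any of the reviewed works), and assumes the metered and estimated energy series are given as equal-length arrays:

```python
import numpy as np

def energy_metrics(e_true, e_est):
    """Illustrative sketch of several energy estimation metrics
    (RMSE, AE, SDE, r2, EE, MR, FEE); the exact formulations used in
    the cited works may differ slightly."""
    e_true = np.asarray(e_true, dtype=float)
    e_est = np.asarray(e_est, dtype=float)
    err = e_est - e_true                        # ΔE_t
    rmse = np.sqrt(np.mean(err ** 2))           # root mean square error
    ae = np.mean(err)                           # average (signed) error
    sde = np.std(err)                           # standard deviation of error
    ss_res = np.sum((e_true - e_est) ** 2)
    ss_tot = np.sum((e_true - e_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                  # coefficient of determination
    ee = abs(e_est.sum() - e_true.sum()) / e_true.sum()  # energy error
    mr = np.minimum(e_true, e_est).sum() / np.maximum(e_true, e_est).sum()  # match rate
    fee = e_est.sum() / e_true.sum()            # fraction of energy explained
    return dict(rmse=rmse, ae=ae, sde=sde, r2=r2, ee=ee, mr=mr, fee=fee)
```

For a perfect estimation the sketch yields RMSE = AE = EE = 0 and r2 = MR = FEE = 1, matching the interpretations given above.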

Regarding the ED metrics, we can observe that TPC and APC can only be used for ED and classification algorithms. This is because these metrics rely on the amount of power change in the vicinity of the power events, which only makes sense when the ED problem is considered. As for the remaining metrics, they can be generalized to event-less approaches by defining the confusion matrix for each individual appliance in terms of time slices.

It is also important to note that event classification and EE algorithms are multiclass problems. Consequently, when using metrics under this category, it is necessary to take into consideration the different strategies to calculate the confusion matrix and the metrics themselves.

Regarding event classification, the most well-known strategy is to transform the multiclass problem into separate binary problems using either the one-versus-all or the one-versus-one strategy (Galar, Fernández, Barrenechea, Bustince, & Herrera, 2011). The one-versus-all (or one-vs.-rest, OvR or OvA, one-against-all, OAA) strategy involves evaluating the classifier as many times as the number of classes (i.e., N binary problems). For each class, the samples of that class are considered positive examples, and all the other samples negative examples. The one-versus-one (OvO) strategy involves evaluating all the possible pairs of classes (i.e., N(N − 1)/2 binary problems), where one class is the positive and the other the negative example. In both cases, the final confusion matrix is given by summing up the individual binary matrices.
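As a sketch of the one-versus-all reduction described above (the function name and return structure are our own, for illustration only), each class is in turn treated as the positive label, a 2 × 2 binary confusion matrix is computed, and the per-class matrices are summed into the final matrix:

```python
import numpy as np

def ova_confusion(y_true, y_pred, classes):
    """One-versus-all reduction sketch: for each class c, build a 2x2
    binary confusion matrix [[TP, FN], [FP, TN]] treating c as the
    positive class, then sum the per-class matrices."""
    per_class = {}
    for c in classes:
        t = np.asarray(y_true) == c   # true membership in class c
        p = np.asarray(y_pred) == c   # predicted membership in class c
        tp = int(np.sum(t & p))
        fn = int(np.sum(t & ~p))
        fp = int(np.sum(~t & p))
        tn = int(np.sum(~t & ~p))
        per_class[c] = np.array([[tp, fn], [fp, tn]])
    total = sum(per_class.values())   # aggregate binary matrix
    return per_class, total
```

The one-versus-one variant would instead iterate over the N(N − 1)/2 class pairs, restricting the samples to the two classes of each pair.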

Regarding EE, multiclass metrics are calculated from the resulting OvA or OvO confusion matrices. They can be calculated over the entire class collection, which is called microaveraging, or by averaging the performance of each individual class, which is called macroaveraging (Van Asch, 2013). In microaveraging, the counts of all classes are pooled before computing the metric; as such, larger classes dominate the measure. In macroaveraging, the average for each class is computed first, and then each class counts the same for the final average. Finally, it is also common practice to weight the individual class metrics by the respective number of instances, thus making the final average less sensitive to smaller classes. This is known as weighted macroaveraging.
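The difference between micro- and macroaveraging can be illustrated with precision (the helper below is a hypothetical sketch; per-class results are given as (TP, FP) count pairs):

```python
import numpy as np

def micro_macro_precision(binary_matrices):
    """Micro- vs. macro-averaged precision from per-class (TP, FP)
    pairs. Micro pools counts first, so larger classes dominate;
    macro averages per-class precisions, so every class counts
    equally."""
    tps = np.array([tp for tp, _ in binary_matrices], dtype=float)
    fps = np.array([fp for _, fp in binary_matrices], dtype=float)
    micro = tps.sum() / (tps.sum() + fps.sum())
    macro = np.mean(tps / (tps + fps))
    return micro, macro
```

With one large, well-classified class and one small, poorly classified class, say (TP, FP) pairs of (90, 10) and (1, 1), the microaverage stays close to the large class (about 0.89) while the macroaverage drops to 0.7, illustrating how the two strategies weight classes differently.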

With respect to EE metrics, it is important to remark that in the cases of TECA, ETEA, Dev(iation), and FTEAC, individual appliance consumption data is required. Furthermore, despite the fact that these metrics give an overview of the performance with just one number, it should be noted that we lose track of which appliances are pushing the performance up or down. For example, poor estimation for high consuming appliances can lead to poor overall results, even if the EE of the remaining appliances is accurate.

As for the remaining metrics, they do not require individual appliance data, meaning that they can be used to evaluate EE algorithms in event-based datasets, even if the individual appliance consumption data is not available. Furthermore, when individual data is available, it is possible to evaluate the performance of each individual appliance, which will immediately give an idea of which appliances affect the performance the most.


There is a general consensus in the NILM research community on the importance of public datasets in furthering energy disaggregation research. Nevertheless, despite the tremendous efforts in releasing public data, there are many barriers to making the datasets easily and efficiently available to the research community. In fact, the most common way of releasing publicly accessible NILM datasets is through structured text files that are passed to users in a disparate array of formats, from CSV to plain text. This is particularly inefficient given the requirements of NILM datasets, which involve storing and recording information related to power signals, event labeling, and properties of appliances and the household, among many others (e.g., Lai, Trayer, Ramakrishna, & Li, 2012). Consequently, before any evaluation step, researchers have to understand the underlying structure of the datasets and produce code to interface with them, as well as to accommodate the different performance metrics.

Against this background, several efforts emerged to homogenize the datasets and provide a single interface to run evaluations. In this section we introduce some of these projects, namely the NILM Metadata framework (Kelly & Knottenbelt, 2014), the Energy Monitoring and Disaggregation Data Format (EMD-DF) (Pereira, 2016, 2017), the open-source NILM Toolkit (NILMTK) (Batra, Kelly, et al., 2014; Batra, Parson, et al. 2014; Kelly et al., 2014), and the NILM-Eval framework (Beckel et al., 2014; Cicchetti, 2013).

4.1 NILM metadata

The NILM Metadata is a framework created with the goal of homogenizing the definition of the many elements that can be found in a NILM dataset (e.g., information about the monitored appliances, the monitoring hardware, and the buildings where the collection occurred). The proposed schema is divided into two main parts: (a) a central metadata with general information about how appliances are represented in NILM metadata; and (b) a schema describing the datasets.

The central metadata provides a base for all the appliances that will appear represented in any of the datasets. This includes, for example, (a) categories for each appliance type; (b) prior knowledge about the distribution of variables such as: ON power and ON duration; (c) appliance correlations (e.g., that the TV is usually ON if the games console is ON); and (d) additional properties for each appliance.

The dataset metadata schema is used to model the individual components that comprise the dataset. The modeled objects include: (a) electricity meters (whole-home and individual appliance meters); (b) domestic appliances; (c) mapping between meters and appliances; (d) buildings (e.g., appliances, number of rooms, and a description of the occupants); and (e) datasets (e.g., name, authors, and geographical location, etc.).

The NILM metadata framework is currently part of the NILMTK project, where it is used to provide structured information about the available datasets, and to define the rules to create datasets that are compatible with NILMTK. This is an open-source project, and the source code is available on Github.1 As of this writing, the last update was in April 2017.

4.2 Energy monitoring and disaggregation data format

The energy monitoring and disaggregation data format (EMD-DF) is a common data model and file format to support the creation and manipulation of energy disaggregation datasets.

The EMD-DF data model defines three main data entities that should be present in a dataset for energy disaggregation: (a) consumption data, (b) ground-truth data, and (c) data annotations.

The consumption data entity represents all the data elements that refer to energy consumption. Consumption data can be of two different types: (a) raw waveforms, that is, current and voltage; or (b) processed waveforms, that is, different power metrics like real and reactive power.

The ground-truth data entity refers to all ground-truth elements and can be of four different types: (a) individual appliance consumption; (b) individual circuit consumption; (c) appliance activity; and (d) user activity. Individual appliance and individual circuit consumption data are mostly used in event-less approaches. Appliance activities provide information about power events, and are required in event-based approaches. Finally, user activities refer to actions that people perform involving the use of electric appliances, for example, doing the laundry (washer, dryer, and iron).

Lastly, there is the data annotations entity. These can be either metadata or general comments. EMD-DF defines three different types of metadata annotations, namely: (a) local metadata, which refers to specific samples in the consumption data; (b) custom metadata that are defined by the dataset creator and can serve multiple purposes (e.g., include NILM Metadata schemas); and (c) RIFF Metadata, which is composed by the metadata chunks defined by the resource interchange file format (RIFF2).

The current implementation of EMD-DF is an extension of the well-known waveform audio file format (WAVE3), originally created to store audio data. WAVE is an application of the RIFF standard, in which the file contents are grouped and stored in separate chunks, each following a predefined format. As an example of the application of EMD-DF, the authors converted the BLUED dataset to their format. The original BLUED distribution comprises over 6,500 files and takes about 320 GB of disk space; the converted version contains only 34 files and takes less than 70 GB, a decrease in storage space of about 80%.
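The core idea of reusing the WAVE/RIFF container for consumption data can be illustrated with Python's standard wave module. The snippet below is a toy sketch under our own assumptions (a fixed scale factor and a single 16-bit channel); it does not reproduce the actual EMD-DF chunk definitions:

```python
import os
import tempfile
import wave

import numpy as np

def write_power_wave(path, power_watts, rate, scale=10):
    # scale power readings to 16-bit integers and store them as a
    # mono sample stream inside a standard WAVE/RIFF container
    samples = np.round(np.asarray(power_watts) * scale).astype("<i2")
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # single aggregate-power channel
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)   # sampling rate in Hz
        w.writeframes(samples.tobytes())

def read_power_wave(path, scale=10):
    # read the frames back and undo the integer scaling
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
        rate = w.getframerate()
    return np.frombuffer(raw, dtype="<i2") / scale, rate

# round-trip a tiny "power signal" sampled at 50 Hz
path = os.path.join(tempfile.mkdtemp(), "power.wav")
write_power_wave(path, [100.0, 250.5, 0.0], rate=50)
values, rate = read_power_wave(path)
```

A real implementation would also need custom chunks for ground-truth labels and metadata, which is precisely what EMD-DF adds on top of the base format.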

The current version of EMD-DF is implemented in Java, and according to the author, the current version of the code is available on Gitlab.4


The NILMTK is an open source toolkit, released in April 2014, to enable the analysis of existing datasets and algorithms. It also provides a unified interface for the addition of new datasets and algorithms.

In order to represent the datasets in NILMTK, the authors defined a common data format (NILMTK-DF). The data is stored using the hierarchical data format (HDF5) and is loaded into working memory in small batches. In addition to storing electricity data, NILMTK-DF also stores relevant metadata according to the NILM metadata framework, as well as other sensor modalities such as gas, water, and temperature.

In terms of features, the toolkit is composed of several software components written in Python, including: (a) dataset converters and parsers; (b) functions for dataset diagnosis (e.g., gap and dropout rate detection); (c) statistical analysis (e.g., proportion of submetered energy); and (d) data preprocessing (e.g., down-sampling and voltage normalization). Additionally, NILMTK implements two reference benchmark disaggregation algorithms (combinatorial optimization—CO, proposed by Hart (1985), and factorial hidden Markov models—FHMM (Kim et al., 2011; Kolter & Johnson, 2011)), as well as a number of performance metrics (e.g., EE, ETEA, and FTEAC).
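The batch-wise loading strategy mentioned above can be illustrated with a small out-of-core sketch. To keep the example dependency-free, we use a NumPy memmap over a raw binary file rather than NILMTK's actual HDF5 backend; the file name and batch size are arbitrary:

```python
import os
import tempfile

import numpy as np

def batched_mean(path, dtype=np.float32, batch=256):
    """Out-of-core mean of a meter-reading file, processed batch by
    batch in the spirit of NILMTK's chunked HDF5 loading (this sketch
    uses a plain memmap instead of HDF5)."""
    data = np.memmap(path, dtype=dtype, mode="r")
    total, n = 0.0, 0
    for start in range(0, data.shape[0], batch):
        chunk = np.asarray(data[start:start + batch])  # only this batch in RAM
        total += chunk.sum(dtype=np.float64)
        n += chunk.size
    return total / n

# write a sample "meter reading" file and compute its mean in batches
tmp = os.path.join(tempfile.mkdtemp(), "power.dat")
np.arange(10000, dtype=np.float32).tofile(tmp)
mean = batched_mean(tmp)
```

The same pattern generalizes to any per-chunk statistic (e.g., dropout rate or submetered-energy proportion), which is what makes chunked storage practical for datasets larger than memory.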

In order to assess the feasibility of their toolkit, the authors performed several evaluations, including statistical analyses of the datasets and energy disaggregation benchmarks. Regarding the benchmarks, the two default benchmark algorithms were tested against six datasets (REDD, Smart*, Pecan Street (now Dataport), AMPds, iAWE, and UK-DALE) at 1-min resolution. The obtained results indicated that FHMM outperformed CO in three datasets (REDD, Smart*, and AMPds), while for the remaining three datasets CO and FHMM performed similarly.

The source code of NILMTK is available on Github.5 As of this writing, the project includes contributions from 17 individuals, and the last update to the source code was in March 2018. This suggests that the project is still attracting substantial attention and interest from the community.

5.1 NILM-Eval

The NILM-Eval is a Matlab-based open source framework for running comprehensive performance evaluations of NILM algorithms across multiple datasets. NILM-Eval is very similar in scope to NILMTK, in the sense that it allows evaluations across multiple datasets with common performance metrics. Yet, it was designed to facilitate the design and execution of large experiments that consider several different parameter settings for the different algorithms in repeated experiments, thereby enabling the quick evaluation and benchmarking of such algorithms under different settings.

The authors of NILM-Eval have also thoroughly tested their system's ability to evaluate and benchmark disaggregation algorithms. To this end, they used their own dataset (ECO) to evaluate four different algorithms, two event-based (Baranski & Voss, 2004; Weiss, Helfenstein, Mattern, & Staake, 2012) and two event-less (Kolter & Jaakkola, 2012; Parson, Ghosh, Weal, & Rogers, 2012). These algorithms were tested under different parameter configurations, and the results were reported using the system's default performance metrics (P, R, RMSE, and Dev). An extensive discussion of the evaluation results is out of the scope of this paper, yet they showed that the event-based approaches performed better than their event-less counterparts.

The source code of NILM-Eval is available on Github6 but it naturally requires the commercial and expensive MATLAB package in order to be used. This may prevent the adoption of this framework by other researchers. In fact, this project has only one contributor, and the last update dates back to June 2015. This suggests that the project is no longer being supported by the author.


So far in this paper, we reviewed the public datasets, performance metrics, and tools that are commonly used to assess the performance of NILM algorithms and systems. We now discuss the shortcomings and main challenges of the current state-of-the-art.

6.1 Datasets

Regarding datasets, there are a number of limitations that make performance evaluation challenging: (a) missing data; (b) limited labeling; and (c) substantial differences in the available data.

By missing data, we do not refer only to the gaps that are very common in many of the existing datasets (Batra, Kelly, et al., 2014; Batra, Parson, et al., 2014), but also to the loads for which there is no submetered data. This happens for several reasons, including the impracticality of installing plug-level meters on loads that do not have a fixed plug (e.g., vacuum cleaners), and the fact that some loads cannot be monitored using plugs at all (e.g., ceiling lights) (Pereira, Ribeiro, & Nunes, 2017). This challenge is particularly relevant when monitoring commercial/industrial spaces, where the number of individual loads can easily reach 100 or more (Jahromi, 2018).

One of the alternatives to increase the level of submetered data is by using circuit-level submetering as suggested in Jahromi (2018). This was already accomplished in datasets like AMPds (Makonin et al., 2016) and RAE (Makonin et al., 2018). However, since many loads can be attached to the same circuit (e.g., the kitchen circuit contains many different appliances), this solution alone will not guarantee that individual consumption is available for every load. Consequently, if the goal is to monitor individual loads, plug-level submetering is still required.

Alternatively, indirect sensing could be used to infer individual appliance activity that could then complement the circuit-level data. A similar approach was taken for BLUED (Anderson, Bergés, et al., 2012), where environmental sensors were used to monitor ceiling lights. This solution, however, implies a considerable amount of postprocessing work to combine the different sensor streams, and is subject to several challenges in ensuring the correct synchronization of the internal clocks of each monitoring system (Anderson, 2014).

This brings us to the second limitation, the limited amount of labeled data. This limitation is particularly relevant for event-based approaches that require labeled transitions. In fact, the difficulty of labeling NILM datasets explains the reduced number of event-based datasets. Producing fully labeled NILM datasets requires either the deployment of plug-level hardware and/or a lengthy, error-prone manual inspection of the whole dataset. This process of labeling/annotating sensor data is transversal to many domains of machine learning, motivating dedicated research venues such as the International Workshop on Annotation of useR Data for UbiquitOUs Systems (ARDUOUS).7

In the context of NILM systems, on the one hand, the process cannot be fully automated because of the aforementioned issues with plug-level metering. On the other hand, a fully manual system would require an outstanding level of resources and still be prone to errors. Against this background, we believe that future work should look at ways of leveraging the potential of the existing data, for example by means of collaborative and semi-automatic annotation of datasets (Cao, Wijaya, Aberer, & Nunes, 2015; Pereira et al., 2017; Pereira & Nunes, 2015), or even the creation of synthetic datasets by means of data synthesizing (Buneeva & Reinhardt, 2017; Henriet, Simsekli, Fuentes, & Richard, 2018).

A third limitation that ultimately prevents fair cross-dataset NILM benchmarks is the wide differences among the available datasets. These differences occur because of two main aspects: (a) the type and granularity of the available measurements; and (b) the different formats (e.g., plain-text, CSV, audio files, etc.) in which the data is made available to the research community.

While some research efforts provide common interfaces to access datasets, very little work has been carried out so far toward understanding the real implications of such differences. For example, it is not possible to quickly identify which approaches can be tested in a particular dataset, or which metrics can be calculated from the underlying data. Thus, in the near future it would be of crucial importance to find a scalable and easy-to-understand method to describe this dataset-algorithm-metric nexus. We argue that one way of doing so is by extending the NILM metadata project with meta-information about the algorithms and metrics that are supported by each dataset. Something along those lines is already done in the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017), where it is possible to select datasets for a particular task (i.e., classification, regression, clustering, and other).

Finally, the wide differences in the existing datasets give rise to one of the most interesting challenges of NILM performance evaluation—the need to define the complexity of the NILM problem in each dataset (and respective subsets). For example, in Egarter, Pöchacker, and Elmenreich (2015), the authors propose two complexity measures, Appliance Set Complexity, and Time Series Disaggregation Complexity. Early results show considerable differences between datasets, which makes direct comparisons very difficult even when using the same performance metric (Egarter et al., 2015; Nalmpantis & Vrakas, 2018).

Consequently, while defining additional complexity metrics may be an appealing research direction, we argue that at present the research community would greatly benefit from the ability to generate sets of data with comparable complexities using the measures proposed in Egarter et al. (2015). For example, since the proposed measures are independent of the disaggregation algorithms (Egarter et al., 2015), this could be achieved by subsetting existing datasets until the same complexity scores are reached. Still, while such an approach may be feasible, the resulting datasets should always remain representative of real-world conditions; otherwise, the disaggregation results will be directly comparable, but not meaningful.

6.2 Performance metrics

In terms of performance metrics, this review highlighted that the research community is still far from a consensus regarding which metrics should be used to assess and report the performance of NILM technology. We propose that in order to establish a consistent set of performance metrics, two main limitations must be addressed by the NILM research community: (a) the lack of a deep understanding on the behavior of traditional metrics when applied to NILM; and (b) the almost complete absence of domain/application specific metrics.

Regarding the former, most of the currently existing metrics have been inherited or extended from other application domains in machine-learning. For example, the precision and recall metrics that are widely used in ED problems have their origins in the information retrieval domain (Kagolovsky & Moehr, 2003). Consequently, future research should aim at fully understanding the behavior of the existing performance metrics when applied to the energy disaggregation problem. For example, in Pereira and Nunes (2017, 2018), the authors experimentally analyzed the behavior of several performance metrics when applied to classification and ED algorithms. Their results show very high correlations between the metrics in the classification problem, contrasting with the much lower correlation values in the ED problem.

While the high correlations in classification problems appear to be in line with the relevant literature (e.g., Ferri, Hernández-Orallo, & Modroiu, 2009), ED proves to be a very distinct problem, in part due to the highly unbalanced nature of the underlying data toward true negatives.

As such, future research should look at understanding how the characteristics of NILM data affect the mathematical properties of the different performance metrics (e.g., Sokolova & Lapalme, 2009). This knowledge combined with empirical evidence will be useful to fully understand the relationships between performance metrics, and ascertain to what extent the results and conclusions obtained using one particular metric can be extended to others.

Likewise, future work should also look at other metrics, in particular metrics that balance precision and recall (e.g., mean average precision, and break-even point; Manning, Raghavan, & Schütze, 2008), handle multiclass classification with unbalanced data (e.g., multiclass performance score (Kautz, Eskofier, & Pasluosta, 2017), and Matthews correlation coefficient (Matthews, 1975; Powers, 2011)), and the Jaccard index (Labatut & Cherifi, 2012; Powers, 2011; Real & Vargas, 1996), which can be used as a metric for ED/classification and EE.

Regarding the second limitation, the work of Pereira and Nunes (2018) briefly touched on the topic of domain-specific metrics, highlighting their potential to unveil important characteristics of the algorithms and the underlying datasets.

Consequently, while future work should aim at further understanding these metrics, there is also an opportunity to define new ones. This is particularly relevant in the case of metrics that take into consideration concepts from cost-sensitive learning (Elkan, 2001), since it is well known that not all appliances contribute equally to the final EE (Anderson, Ocneanu, et al., 2012). For instance, missing many power events from a small appliance may be less costly than missing just one from a large appliance, and should therefore carry a smaller associated cost. Conversely, an appliance that is rarely used should carry a higher cost, since failing to correctly identify its few occurrences will result in significant underestimation of its consumption.
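As a minimal sketch of what such a cost-sensitive metric could look like (the function, cost values, and aggregation below are hypothetical, not taken from the reviewed literature), one could weight each appliance's missed and spurious event counts by an application-defined cost:

```python
def cost_weighted_event_error(missed, spurious, costs):
    """Hypothetical cost-sensitive event-detection score: each
    appliance contributes its missed and spurious event counts
    weighted by an application-defined cost. Lower is better."""
    return sum(costs[a] * (missed.get(a, 0) + spurious.get(a, 0))
               for a in costs)
```

Under this toy scoring, five errors on an appliance with cost 1.0 weigh the same as one error on an appliance with cost 5.0, capturing the intuition that errors on some appliances matter more than on others.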

Nevertheless, while computing the value of cost-sensitive metrics can be easily achieved by means of techniques such as ceiling analysis (e.g., Roncancio, Hernandes, & Becker, 2013), defining the correct cost values is not straightforward, since they are highly dependent on the application domain. For example, if NILM is used to detect malfunctioning loads by studying the periodicity of the power events, there should be a high penalty for missed and erroneously detected events. Consequently, in order to define cost-sensitive metrics (and any other meaningful domain-specific metrics), the different stakeholders must first clearly identify their NILM use-cases. Only then can the metrics that make sense in each case be identified, along with the performance thresholds that should be achieved in order to validate each use-case.

6.3 Tools and frameworks

With respect to frameworks and toolkits, we argue that important research breakthroughs can only be achieved if the research community manages to continuously and incrementally integrate contributions into a common framework. For instance, integrating EMD-DF in NILMTK would add to the latter the support to evaluate the performance of high-frequency NILM approaches. Likewise, integrating EMD-DF or NILMTK-DF into the existing dataset simulators (Buneeva & Reinhardt, 2017; Henriet et al., 2018) would ensure that the generated datasets are immediately ready to integrate with NILMTK or any other platform that interfaces with these data formats.

Additionally, as the smart-grid becomes a reality, one should expect more datasets to emerge, some with data granularities in the order of several kHz. As such, framework and toolkit developers should look at new techniques to compress, store, and represent NILM data. For example, in Reinhardt (2017), the author proposes the creation of hybrid load signatures by representing data at different sampling rates (e.g., 1 Hz for steady-state, and several kHz for transients). Although this was originally proposed to reduce the bandwidth of communication channels, if integrated with the existing data formats it could work as a custom data compression tool for NILM datasets.
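A hybrid signature in the spirit of Reinhardt (2017) could be sketched as follows; the event list, window length, and return structure are our own illustrative assumptions, not the original proposal:

```python
import numpy as np

def hybrid_signature(power, rate, events, window=0.1, coarse=1.0):
    """Sketch of a hybrid representation: a low-rate mean of the
    steady-state signal (one sample per `coarse` seconds) plus
    full-rate excerpts of `window` seconds around each event time."""
    power = np.asarray(power, dtype=float)
    step = int(rate * coarse)
    n = (len(power) // step) * step
    lowrate = power[:n].reshape(-1, step).mean(axis=1)
    half = int(rate * window / 2)
    transients = {t: power[max(0, int(t * rate) - half):
                           int(t * rate) + half].copy()
                  for t in events}
    return lowrate, transients

# 5 s of a flat signal sampled at 1 kHz, with one event at t = 2 s
signal = np.ones(5000)
lowrate, transients = hybrid_signature(signal, rate=1000, events=[2.0])
```

Here, five seconds of 1 kHz data collapse to five steady-state samples plus a single 100-sample excerpt around the event, which is where the compression gain comes from.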

Lastly, NILM research can greatly benefit from the development of an online platform for performance evaluation. Among other benefits, this would guarantee that the proposed approaches are evaluated under the same conditions, and enable the community to keep track of research progress. The idea of an online NILM platform was briefly mentioned as future work in Kelly (2016), and at present emuNILM (Makonin, 2018) is the most recent effort toward creating such a platform. The main goal of emuNILM is to emulate smart-meters and communication infrastructures so that NILM algorithms can be tested in real-world-like situations. For example, by emulating the consumption of broken appliances, it will be possible to understand how a particular algorithm responds in the presence of malfunctioning loads.


Energy disaggregation is a key cost-effective technology to monitor energy consumption and contribute to the many challenges faced by the transition to a reliable, sustainable, and competitive energy system. NILM is a maturing field showing promising results in measuring appliance-specific energy consumption, while keeping installation cost and hardware complexity low. Even if IoT technologies increase the availability of disaggregated data, legacy infrastructures will take decades to catch up, and NILM seems the only viable alternative for practical and cost-effective energy disaggregation. Consequently, performance evaluation will continue to be a fundamental research direction in NILM, as this is the only means to validate the proposed solutions and leverage the potential adoption of this technology by utilities worldwide at building and city scale.

In this paper, we reviewed the main challenges and advances in this area of NILM research: datasets, performance metrics, and frameworks/toolkits. The goals of this paper are to: (1) motivate the importance of proper performance evaluation; (2) provide a solid background and comprehensive knowledge of the current state-of-the-art; and (3) outline the main challenges and future research opportunities.

To conclude, we should mention that in this paper we have only looked at performance evaluation from an algorithmic perspective. As such, it is important to point out the need to conduct research toward evaluating the value proposition of NILM technology, as this is only seldom reported in the literature. For example, it was only recently that we saw the first publications assessing the value proposition of disaggregated data as a tool to reduce energy consumption (Batra, Singh, & Whitehouse, 2015; Kelly & Knottenbelt, 2016), or trying to educate the research community about the practical issues of deploying such systems in real-world scenarios (Kosonen & Kim, 2016; Pereira, 2016). This, we believe, is of crucial importance to the large-scale adoption of NILM technology in the years to come, and presents a perfect opportunity to conduct multidisciplinary research in the field of NILM.


The authors have declared no conflicts of interest for this article.




  • 1 NILM Metadata Source Code, https://github.com/nilmtk/nilm_metadata
  • 2 RIFF file format, http://fileformats.archiveteam.org/wiki/RIFF
  • 3 WAVE file format: http://fileformats.archiveteam.org/wiki/WAV
  • 4 EMD-DF Source Code, https://gitlab.com/alspereira/EMD-DF
  • 5 NILMTK Source Code, https://github.com/nilmtk/nilmtk
  • 6 NILM-Eval Source Code, https://github.com/beckel/nilm-eval
  • 7 ARDUOUS Workshop, https://text2hbm.org/arduous