06 April 2016

Missing Data & Defaults: ES6 Destructuring as a code solution


Missing Data

Missing data is a significant issue in most organizations.  Missing data occurs when no value is stored for an observable factor/variable.


Causes

The causes for missing data can be as simple as participant dropout (see Analysis of clinical trials for some techniques for dealing with this), nonresponsiveness, or “informative missingness” (see Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. New York: Springer, 2013. Print.).


Problems Caused by Missing Data

The problem caused by missing data is that it makes descriptive, predictive and prescriptive models unreliable.  If a model is unreliable, it will be abandoned by users and will produce negative business outcomes.  Models should be useful and reasonably reliable.

CRM data

Often, when analyzing the CRM data of an organization to build a lead scoring model, we run into this issue.  Terrifyingly, many organizations use “yes/no” checkboxes in the qualification process -- does “unchecked” mean “no” or “unanswered”?  Most sales executives will tell you that their teams reliably fill in the values.  [To date, this has never been a true statement].  The effect of treating all “unchecked” values as meaning “no” can so badly skew a model that oftentimes we are forced either to disregard the variable(s) in the model or to impute values based on the characteristics of other prospects who are similar based on other factors (essentially, this is a model within a model -- a Maalox moment).

An example

One account on which we worked made matters even worse: all but one of the “sales qualification” questions when left unchecked implied a missing purchase prerequisite; one question implied purchase readiness…  The data indicated that 90+% of the prospects who had been through a sales interview had no checkboxes marked -- frightening. In the end, Stand Sure could only model and score prospects where at least one box was checked.  

CRM Guidance -- Monitor, Measure and Manage

Our advice to all sales executives and CRM administrators is to ask the same question two ways on the qualification screen (one positive and one negative answer), create a report that includes all prospects who have been through a sales interview [MONITOR], report the percentage of those prospects for whom a sales professional has ticked either checkbox [MEASURE], and compel complete entry of the qualification interview answers [MANAGE].
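
The MEASURE step could be sketched roughly as follows. The field names (budgetYes, budgetNo), the sample records, and the helper functions are hypothetical and are not taken from any particular CRM; the sketch only shows how asking the question two ways lets "unanswered" be told apart from "no".

```javascript
// Hypothetical prospect records: the question is asked two ways,
// so "neither box ticked" can be distinguished from an explicit "no".
const prospects = [
  { id: 1, budgetYes: true,  budgetNo: false },
  { id: 2, budgetYes: false, budgetNo: true  },
  { id: 3, budgetYes: false, budgetNo: false }, // unanswered
  { id: 4, budgetYes: false, budgetNo: false }  // unanswered
];

// Translate the checkbox pair into true / false / undefined (missing).
function readAnswer(yesBox, noBox) {
  if (yesBox && !noBox) return true;
  if (noBox && !yesBox) return false;
  return undefined; // neither or both ticked: treat as missing
}

// MEASURE: percentage of interviewed prospects with a usable answer.
function completionRate(records) {
  if (records.length === 0) return 0;
  const answered = records.filter(
    (p) => readAnswer(p.budgetYes, p.budgetNo) !== undefined
  );
  return (answered.length / records.length) * 100;
}

console.log(completionRate(prospects)); // 50
```

Treating "both ticked" as missing is a design choice made for this sketch; an organization might prefer to flag it as a data-entry error instead.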

Bugs -- Developer Errors

One of the most common developer errors we see when machine crawling and analyzing customer websites is the absence of guard clauses. One company with whom we have worked varied prices shown online by geography as certain resources were not fungible between regions.  When we crawled the website as part of a search engine optimization (SEO) and web performance optimization audit, a significant number of HTTP 500 errors were encountered.
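
A guard clause for a regional-pricing lookup might look like the sketch below. It assumes the server errors arose because the crawler supplied no geography; the priceForRegion function, the region codes, and the price table are all hypothetical. The point is only that a missing or unknown region is handled explicitly instead of surfacing as a server error.

```javascript
// Hypothetical regional price table, for illustration only.
const pricesByRegion = { east: 100, west: 120 };

function priceForRegion(region) {
  // Guard clause: a crawler or first-time visitor may send no region at all.
  if (region === undefined || region === null) {
    return { status: 200, price: null, note: "region required for pricing" };
  }
  // Guard clause: a region was sent, but it is not one we serve.
  if (!Object.prototype.hasOwnProperty.call(pricesByRegion, region)) {
    return { status: 404, price: null, note: "unknown region" };
  }
  return { status: 200, price: pricesByRegion[region], note: "" };
}

console.log(priceForRegion("east").price);   // 100
console.log(priceForRegion(undefined).note); // region required for pricing
```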


Developers and CRM administrators must be supervised when applying default values to enterprise data

Developers, once aware of these issues, can quickly correct them. However, when the data is consumed and analyzed further downstream, business stakeholders and subject-matter experts need to work together to guide the developer.


Most are using a dangerous coding technique

The approach taught to developers in many languages is the (short-circuited) logical OR, represented by two vertical bars, a “double pipe”, viz. A || B, where A is the variable and B is the default (for example, see Crockford, Douglas. JavaScript: The Good Parts. Beijing: O'Reilly, 2008. Print. Page 21).


The problem with this in many languages is that zero (0), null (see Hello, I'm Mr. Null. My Name Makes Me Invisible to ... - Wired), blank (“”), and false are all interpreted by this pattern as “falsy”, so the default value is applied even though a value was present!
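
A quick illustration in JavaScript (the variable names are made up for the demonstration):

```javascript
// Each left-hand value is legitimate data, yet || replaces it
// because JavaScript treats it as falsy.
const sessionCount = 0 || 5;      // 5, not 0
const referrer = "" || "unknown"; // "unknown", not ""
const optedIn = false || true;    // true, not false
const score = null || 10;         // 10, not null

console.log(sessionCount, referrer, optedIn, score); // 5 unknown true 10
```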


What we really want as a rule is to apply appropriate defaults if and only if data is truly missing.  In JavaScript, a missing value is represented by undefined: reading a property that was never set yields undefined.
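
As a minimal sketch of that rule, a small helper can compare strictly against undefined. The name withDefault is hypothetical and chosen only for this example.

```javascript
// Return the fallback only when the value is truly missing (undefined).
function withDefault(value, fallback) {
  return value === undefined ? fallback : value;
}

console.log(withDefault(0, 5));           // 0
console.log(withDefault(false, true));    // false
console.log(withDefault(null, 10));       // null
console.log(withDefault(undefined, 10));  // 10
```

An empty string is preserved in the same way: withDefault("", "unknown") returns "".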


A safer way to default variable assignments in current JavaScript

In Node.js/JavaScript, default-assignment logic that applies a default only when the data is truly missing looks like the following (which is similar to code used in our analytics pixel endpoint):



// old way
clickstreamVariables.sessionId = (clickstreamVariables.sessionId === undefined) ?
generateNewUUID() : clickstreamVariables.sessionId;
clickstreamVariables.userId =
(clickstreamVariables.userId === undefined) ? generateNewUUID() : clickstreamVariables.userId;
clickstreamVariables.sessionCount =
(clickstreamVariables.sessionCount === undefined) ? 0 : clickstreamVariables.sessionCount;
clickstreamVariables.sequence =
(clickstreamVariables.sequence === undefined) ? 0 : clickstreamVariables.sequence;
clickstreamVariables.userAgent =
(clickstreamVariables.userAgent === undefined) ? "unknown" : clickstreamVariables.userAgent;
clickstreamVariables.callingPage =
(clickstreamVariables.callingPage === undefined) ? "" : clickstreamVariables.callingPage;
clickstreamVariables.referringPage =
(clickstreamVariables.referringPage === undefined) ? "" : clickstreamVariables.referringPage;
clickstreamVariables.time = (new Date()).toISOString();
// *** yuk! ***
       
Although the code works, it’s hard to wade through it and maintain it if the organization decides that new default values should be used.


A cleaner approach using ES6

An easier-to-read approach is the following:


{
 let {
   sessionId = generateNewUUID(),
   userId = generateNewUUID(),
   sessionCount = 0,
   sequence = 0,
   userAgent = "unknown",
   callingPage = "",
   referringPage = "",
   time = (new Date()).toISOString()
 } = clickstreamVariables;
 // ...
 clickstreamVariables = { userId, sessionId, sessionCount, sequence, userAgent, callingPage, referringPage, time };
}


Code Explanation
The code above takes advantage of some of the features in ES6, the new version of JavaScript, to wit:
    • Variables declared with let (or const) inside matched braces, { … }, are local to that block and are not “hoisted” up to function or global scope.
    • The let statement lets you declare a variable locally without overwriting the value of a variable with the same name that may exist in a higher scope.
    • Unlike variables declared with var, which are initialized to undefined as soon as their enclosing scope is entered, variables declared with let cannot be accessed until the let statement has been evaluated; accessing one earlier throws a ReferenceError.
    • Values can be extracted from an object and assigned to local variables. The

      let {
         sessionId = generateNewUUID(),
         userId = generateNewUUID(),
         sessionCount = 0,
         sequence = 0,
         userAgent = "unknown",
         callingPage = "",
         referringPage = "",
         time = (new Date()).toISOString()
       } = clickstreamVariables;

      statement extracts the properties named on the left side of each equals sign if they are present in the clickstreamVariables object; otherwise, it applies the default value specified on the right side of the equals sign.  NOTE:  the default is applied only if the named property is missing or explicitly set to undefined; null, 0, false and “” are kept as they are.
    • Once the object has been destructured, the original object is then “reconstructed”, taking advantage of the fact that in ES6 the following two lines of code produce the same object:
      • var obj = { x: x };
      • var obj = { x };
    • Thus, the line

      clickstreamVariables = { userId, sessionId, sessionCount, sequence, userAgent, callingPage, referringPage, time };

      overwrites the original variable, applying defaults for missing values.
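
The points above can be seen together in a small, self-contained sketch. The input object and the variable names here are hypothetical and exist only to show block scope, defaults that apply only to undefined, and the shorthand property syntax.

```javascript
const sequence = "outer";

// Hypothetical input: one property is zero, one is null, one is missing.
const input = { sessionCount: 0, userAgent: null };
let rebuilt;

{
  // These bindings exist only inside this block; `sequence` shadows the outer one.
  let { sessionCount = 1, userAgent = "unknown", sequence = 0 } = input;

  // Shorthand: { sessionCount } is the same as { sessionCount: sessionCount }.
  rebuilt = { sessionCount, userAgent, sequence };
}

console.log(rebuilt);  // { sessionCount: 0, userAgent: null, sequence: 0 }
console.log(sequence); // outer (the block-scoped binding did not leak)
```

Only the missing sequence property received its default; the 0 and the null were kept because they are values, not absences.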


Full code sample

To run this code, you will either need to run node.js with the flag --harmony-destructuring or install the Scratch JS extension in Google Chrome.


/* jshint expr: true */
// for node.js, node --harmony-destructuring is needed for this sample to work

(function(undefined) {
 "use strict";

 var log = console.log.bind(console), // dummy logger for demo
     info = console.info.bind(console),
     warn = console.warn.bind(console),
     error = console.error.bind(console),
     assess = (l,r, eq = true) => {
       var test = (l == r), // not strict equality test
        text = `${l} == ${r}`;
       if ((eq && (test === true)) || (!eq && (test !== true))) {
         info(text, true);
       }
       else {
         error(text, false);
       }
     };
 
 // the problem
 //
 // the following does not work reliably when foo="", foo = 0, foo = null, foo = false due to the "falsy" behavior of Javascript
 // foo = foo || bar
 //
 // defaults should only be applied when the value is MISSING (i.e. undefined)
 //
 // the following should not get overridden with defaults -- there are values
 //
 assess("" || "default", "");
 assess(0 || 1, 0);
 assess(null || {}, null);
 assess(false || {}, false);
 
 //
 // we only really want to apply defaults when undefined
 //
 assess(undefined || 1, 1);
 
 var oldValues,
     newValues,
     clone = (old) => JSON.parse(JSON.stringify(old)), // poor man's deep clone
     generateNewUUID = () => "newId", // dummy function for demo
     clickstreamVariablesNewUser = {
       callingPage: "http://www.example.com/foo",
       referringPage: "https://www.google.com"
     },
     clickstreamVariablesExistingUserNewSession = {
       userId: "UUIDv4_value",
       sessionCount: 1,
       callingPage: "http://www.example.com/foo"
     },
     clickstreamVariablesSubsequentPageView = {
       userId: "UUIDv4_value",
       sessionId: "UUIDv4_value",
       sessionCount: 1,
       sequence: 1,
       userAgent: "Mozilla/5.0 (X11; CrOS x86_64 7834.66.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.111 Safari/537.36",
       callingPage: "http://www.example.com/foo",
       referringPage: "http://www.example.com/bar"
     },
     sendUpodatedValuesBackToBrowser = (val) => val, // dummy function
     logAndUpdateValues = function logAndUpdateValues(clickstreamVariables) {
       
       // old way
       // clickstreamVariables.sessionId = (clickstreamVariables.sessionId === undefined) ? generateNewUUID() : clickstreamVariables.sessionId;
       // clickstreamVariables.userId = (clickstreamVariables.userId === undefined) ? generateNewUUID() : clickstreamVariables.userId;
       // clickstreamVariables.sessionCount = (clickstreamVariables.sessionCount === undefined) ? 0 : clickstreamVariables.sessionCount;
       // clickstreamVariables.sequence = (clickstreamVariables.sequence === undefined) ? 0 : clickstreamVariables.sequence;
       // clickstreamVariables.userAgent = (clickstreamVariables.userAgent === undefined) ? "unknown" : clickstreamVariables.userAgent;
       // clickstreamVariables.callingPage = (clickstreamVariables.callingPage === undefined) ? "" : clickstreamVariables.callingPage;
       // clickstreamVariables.referringPage = (clickstreamVariables.referringPage === undefined) ? "" : clickstreamVariables.referringPage;
       // clickstreamVariables.time = (new Date()).toISOString();
       //
       // *** yuk! ***
       
       //  new way
       //  more readable
       //  note BLOCK scope and LET are in use
 
       {
         let {
           sessionId = generateNewUUID(),
           userId = generateNewUUID(),
           sessionCount = 0,
           sequence = 0,
           userAgent = "unknown",
           callingPage = "",
           referringPage = "",
           time = (new Date()).toISOString()
         } = clickstreamVariables;
         
         sessionCount += (sequence === 0) ? 1 : 0;
         sequence += 1;
         clickstreamVariables = { userId, sessionId, sessionCount, sequence, userAgent, callingPage, referringPage, time };
 
         log(JSON.stringify(clickstreamVariables));
         sendUpodatedValuesBackToBrowser(clickstreamVariables);
         return clickstreamVariables;
       }
     },
     test = {
       eq: (variableName, expected) => assess(newValues[variableName], expected),
       neq: (variableName, expected) => assess(newValues[variableName], expected, false)
     },
     runDemo = () => {
 
       oldValues = clone(clickstreamVariablesNewUser);
       newValues = logAndUpdateValues(clickstreamVariablesNewUser);
       test.neq("userId", "");
       test.neq("sessionId", "");
       test.eq("sessionCount", 1);
       test.eq("sequence", 1);
       test.eq("userAgent", "unknown");
       test.eq("callingPage", oldValues.callingPage);
       test.eq("referringPage", oldValues.referringPage);
       
       oldValues = clone(clickstreamVariablesExistingUserNewSession);
       newValues = logAndUpdateValues(clickstreamVariablesExistingUserNewSession);
       test.eq("userId", oldValues.userId);
       test.neq("sessionId", oldValues.sessionId);
       test.eq("sessionCount", oldValues.sessionCount + 1);
       test.eq("sequence", 1);
       test.eq("userAgent", "unknown");
       test.eq("callingPage", oldValues.callingPage);
       test.eq("referringPage", "");
 
       oldValues = clone(clickstreamVariablesSubsequentPageView);
       newValues = logAndUpdateValues(clickstreamVariablesSubsequentPageView);
       test.eq("userId", oldValues.userId);
       test.eq("sessionId", oldValues.sessionId);
       test.eq("sessionCount", oldValues.sessionCount);
       test.eq("sequence", oldValues.sequence + 1);
       test.eq("userAgent", oldValues.userAgent);
       test.eq("callingPage", oldValues.callingPage);
       test.eq("referringPage", oldValues.referringPage);
 
     };
 
 runDemo();
})();


Conclusion

ECMAScript 6 (ES6) introduces features that make it much easier to write and maintain code that properly handles missing values.


Cautionary Note:

There are other new features in the code sample above.  If you have questions, please let me know. You Don’t Know JS: ES6 & Beyond by Kyle Simpson is a good reference, as is MDN.
