
05 December 2016

How to Collect and Log Web Clickstream Data



Clickstream Data


What is clickstream data?

The path a user takes through a website is generally referred to as the user's clickstream.



Why collect clickstream data?

When aggregated using map-reduce techniques and then analyzed with unsupervised machine-learning techniques such as clustering and association analysis, clickstream data can tell you many things, including:

  • how users use your site
  • if there are different behavioral groups within users
  • what marketing works or does not work

Clickstream data can also be used to build recommender systems, perform ROI analysis on marketing (via Shapley values), and improve lead scoring.




Client-side collection

Since much of the traffic to most websites is bot traffic, it is helpful to collect data via a client-side script: many bots do not evaluate JavaScript, and many legitimate bots (e.g. search-engine spiders) identify themselves as such.

A simple method for collection is to add a <script> tag with an immediately-invoked function expression (IIFE) just before the closing </body> tag. An example of such a script is shown below.


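A minimal sketch of such a script follows. The "dl", "dr", and "z" parameter names follow the Google Analytics Measurement Protocol convention discussed later; the "/pixel.gif" endpoint path is an assumption.

```javascript
// Builds the tracking-pixel URL for the current page and its referrer.
function pixelUrl(pageUrl, referrer) {
  return "/pixel.gif" +                          // assumed endpoint path
    "?dl=" + encodeURIComponent(pageUrl) +       // current page
    "&dr=" + encodeURIComponent(referrer) +      // referring page
    "&z=" + Math.random().toString(36).slice(2); // cache buster
}

// The IIFE placed just before </body>: it creates an image element, and
// assigning src causes the browser to request the pixel.
(function () {
  if (typeof document === "undefined") { return; } // not running in a browser
  var img = document.createElement("img");
  img.src = pixelUrl(document.location.href, document.referrer);
  document.body.appendChild(img);
})();
```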
The code above simply creates an image tag and assigns a source to it. The assignment of a src value causes the browser to fetch the requested resource.

  • Cache busting
    Since many browsers cache static resources, a random value is added to the URL as a cache buster.
  • Referrer
    For the purposes of clickstream data, it is useful to include the URL of the page that caused the current page to be loaded; this is called the "referrer" and is added to the pixel image URL as a parameter named "dr" (the name "dr" is used to conform to the parameter names used in the Google Analytics Measurement Protocol).



Server-side ASP.NET MVC Listener


Pixel Endpoint

The code on the server side has three primary responsibilities:

  1. to set/update certain cookie values (explained below),
  2. to log the values, and
  3. to return something that the browser will accept (this response will carry the cookies)

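The three responsibilities can be sketched as a framework-neutral handler (the article's actual implementation is an ASP.NET MVC action; the cookie and field names here are placeholders for the logic detailed in the following sections):

```javascript
// A standard transparent 1x1 GIF, base64-encoded.
const GIF_1X1 = Buffer.from(
  "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7", "base64");

function handlePixel(requestCookies, query, log) {
  // 1. set/update certain cookie values (here, just a "seq" counter)
  const cookies = Object.assign({}, requestCookies);
  cookies.seq = String((parseInt(cookies.seq, 10) || 0) + 1);
  // 2. log the values
  log(JSON.stringify({ cookies: cookies, query: query }));
  // 3. return something the browser will accept (this response carries the cookies)
  return { status: 200, contentType: "image/gif", cookies: cookies, body: GIF_1X1 };
}
```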



Cookies

In general, there are four anonymous values that need to be tracked (if your site allows users to log in, it may be useful to add a fifth cookie to allow you to aggregate cross-device behavior). The values to be tracked/collected are:

  • Session ID
    A temporary anonymous identifier that allows you to group together the page view records captured by the end point. The value will be different each time the user visits your site anew but will remain constant while the user is using your site.
  • Sequence #
    A temporary integer value that allows you to order the page view log records. The value will increment as the user navigates the site.
  • Client ID
    A semi-permanent anonymous identifier that allows you to group together multiple sessions from the same browser/client. NOTE: the value is specific to the browser/machine/user combination: a different logged in user on a machine will have a different client ID; if the same user uses multiple browsers (e.g. Chrome and Firefox), each browser will have a unique client ID.
  • Session count
    A semi-permanent counter that allows you to analyze how user behavior changes over subsequent visits.



Getting the value of a cookie from the client request

The code for retrieving a cookie value from the client request is fairly straightforward. One needs to guard against the case where the cookie does not exist.

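The article's server code is C# (ASP.NET MVC); the same guard logic is sketched here in JavaScript, with `cookies` standing in for the request's cookie collection:

```javascript
// Returns the named cookie's value, or a fallback when the cookie is
// absent (e.g. a first-time visitor).
function getCookieValue(cookies, name, fallback) {
  return Object.prototype.hasOwnProperty.call(cookies, name)
    ? cookies[name]
    : fallback;
}
```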



Incrementing the sequence value

The "seq" cookie is simply a counter. It, along with the time of the request, which is logged, allows the analyst to study how users navigate a site. Along with the previous page, which is passed up as the "dr" parameter, the sequence value is useful for multi-tab browsing scenarios.

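The counter logic can be sketched as follows (the cookie name "seq" matches the text; reading from the request's cookie collection is assumed):

```javascript
// Computes the next sequence number; a missing or malformed cookie
// means this is the first page view of the session.
function nextSequence(cookies) {
  var current = parseInt(cookies["seq"], 10);
  return isNaN(current) ? 1 : current + 1;
}
```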



Incrementing the session count

A new session will not have a session ID. If this is the case, the session count needs to be incremented. The session count cookie should be semi-permanent.

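A sketch of that rule, assuming cookie names "sid" (session ID) and "sc" (session count):

```javascript
// Increments the visit count only when no session ID cookie is present,
// i.e. when the user is starting a new session.
function nextSessionCount(cookies) {
  var count = parseInt(cookies["sc"], 10) || 0;
  return cookies["sid"] === undefined ? count + 1 : count;
}
```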



Setting the client ID

The client ID should be semi-permanent and should only be set if there is not one already set.

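Sketched below; "cid" is an assumed cookie name, and the ID generator (e.g. a GUID factory) is injected so the logic stays testable:

```javascript
// Only mints a new client ID when none exists, keeping the value
// semi-permanent for this browser/machine/user combination.
function ensureClientId(cookies, makeId) {
  return cookies["cid"] !== undefined ? cookies["cid"] : makeId();
}
```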



Setting the session ID

The session ID is a temporary value. Its value will be cleared when the user closes the browser. Although it may be tempting to define a session timeout period, any value chosen would be arbitrary, so it is important that the logged data remain agnostic and that any session-timeout adjustments be made during analysis.

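A sketch, with "sid" as an assumed cookie name. Note that when the cookie is written, no expiration date should be set: a cookie without an expiry is a session cookie and is cleared when the browser closes, so no arbitrary timeout gets baked into the logged data.

```javascript
// Reuses the existing session ID if present; otherwise generates one.
function ensureSessionId(cookies, makeId) {
  return cookies["sid"] !== undefined ? cookies["sid"] : makeId();
}
```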



Setting cookie values

To safeguard the cookie values and the user, it is important that the cookies be marked HttpOnly (preventing tampering by client script and browser plugins/add-ons) and Secure (i.e. sent over HTTPS only – your site should be HTTPS-only).

To make your pixel useful over all of your web assets, it is helpful to set the domain. Suppose you have domain names like the following:

  • www.example.com
  • blog.example.com
  • response.example.com

To have one pixel connect your users over all of these domains, simply set the domain of the cookie to "example.com".

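Serializing a cookie with the recommended attributes can be sketched as below; with Domain=example.com the cookie is sent to www.example.com, blog.example.com, response.example.com, etc.

```javascript
// Builds a Set-Cookie header value with the HttpOnly, Secure, and
// Domain attributes described above.
function setCookieHeader(name, value, domain) {
  return name + "=" + encodeURIComponent(value) +
    "; Domain=" + domain +
    "; Secure; HttpOnly";
}
```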





Logging Values

Given the power of Map-Reduce technologies that are available in Hadoop, R, MongoDB, etc., it makes sense to store the initial data in JSON format.

Four additional pieces of information are added to the serialized data.

  • the time,
  • the user agent
    which is useful for distinguishing mobile from desktop sessions,
  • the page calling the pixel end point, and
  • the referring page that caused the page with the pixel code to be loaded

The code for serializing the values follows:

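A sketch of building the serialized record is shown below; the field names are assumptions, but the four added pieces of information mirror the list above.

```javascript
// Serializes one page-view record to JSON for map-reduce processing.
function buildLogRecord(cookieValues, query, userAgent, now) {
  return JSON.stringify({
    time: now.toISOString(),
    userAgent: userAgent,   // distinguishes mobile from desktop sessions
    page: query.dl,         // page calling the pixel endpoint
    referrer: query.dr,     // page that caused the pixel page to load
    cookies: cookieValues
  });
}
```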



Full code with unit tests

The full source code for this article is available at https://github.com/stand-sure/Clickstream

30 August 2016

Predicting 2016 Presidential Election Results via Clustering

Goal

The goal of this article is to
  • Predict how undecided states will vote in the 2016 Presidential Election
using
  • U.S. Presidential Election results from 1980 - 2012
  • data indicating which states are likely to vote for either Mrs. Clinton or Mr. Trump in 2016
The states that are omitted are the undecided states for which predictions are made.

Business Applications

Suppose
  • candidates = products
  • states = consumers
Then the following things can be inferred using the method shown in this article:
  • Consumer Characteristics:
    • Birds of a feather flock together.
    • If data from a new consumer is collected,
      • the new consumer probably shares qualitative/meta traits with the group to which (s)he appears to belong.
    • Different marketing messages might make sense based on the inferred demographic and psychographic traits.
  • Lead scoring:
    • It is likely that new consumers will behave in the sales process the same way that other members of their group have.
    • If the number of prospects exceeds the capacity of sales professionals, prioritizing the prospects in clusters that have historically performed better should improve business results.
  • Preference prediction/Recommender systems:
    • It is likely that if a consumer “likes” a product, (s)he will probably also like another product in the product’s cluster. (“Like” can mean a product view, shopping cart action, purchase, etc.)
      • A “similar products” section (with more domain-specific wording) on the product page(s) on your website may cause users to look at more products (increasing time on site, pages/session and (hopefully) leads and sales).
      • Marketing emails recommending these products may also be useful.

Clustering Overview

The basic idea of clustering is to group together observations that are similar to each other.

Visually, our brains understand clustering. In the image above, there appear to be four groupings in the data.

When there are more than two variables being observed, an algorithmic approach is useful.

Election Data

The raw election data used in this article is in the form:

 state candidate year votes party
1    AK     Obama 2012     0     D
2    AL     Obama 2012     0     D
3    AR     Obama 2012     0     D
4    AZ     Obama 2012     0     D
5    CA     Obama 2012    55     D
6    CO     Obama 2012     9     D

The data is transformed using functions from packages including dplyr, tidyr, and plyr.

To make the code more readable, the pipe operators from the magrittr package are used.

Below is an example of data manipulation.

candidateParty <- results %>%
 select(year, candidate, party) %>%
 distinct() %>%
 arrange(year, candidate, party)

There is a row for each candidate, state, and election year, containing the number of electoral college votes the candidate received.

Vote Totals

EC Votes by Year and candidate

year  candidate  votes    year  candidate  votes
1980  Carter        49    1996  Dole         159
1980  Reagan       490    2000  Bush         271
1984  Mondale       13    2000  Gore         266
1984  Reagan       524    2004  Bush         286
1988  Bentsen        1    2004  Kerry        251
1988  Bush         426    2008  McCain       173
1988  Dukakis      111    2008  Obama        365
1992  Bush         168    2012  Obama        332
1992  Clinton      370    2012  Romney       206
1996  Clinton      379

Summary Results

Winner by Year

year  candidate  party
1980  Reagan     R
1984  Reagan     R
1988  Bush       R
1992  Clinton    D
1996  Clinton    D
2000  Bush       R
2004  Bush       R
2008  Obama      D
2012  Obama      D

% of time state predicts winner

State  Win Rate    State  Win Rate    State  Win Rate
AK     0.556       KY     0.778       NY     0.667
AL     0.556       LA     0.778       OH     1.000
AR     0.778       MA     0.667       OK     0.556
AZ     0.667       MD     0.667       OR     0.667
CA     0.778       ME     0.778       PA     0.778
CO     0.889       MI     0.778       RI     0.556
CT     0.778       MN     0.444       SC     0.556
DC     0.444       MO     0.778       SD     0.556
DE     0.778       MS     0.556       TN     0.778
FL     0.889       MT     0.667       TX     0.556
GA     0.556       NC     0.667       UT     0.556
HI     0.556       ND     0.556       VA     0.778
IA     0.778       NE     0.600       VT     0.778
ID     0.556       NH     0.889       WA     0.667
IL     0.778       NJ     0.778       WI     0.667
IN     0.667       NM     0.889       WV     0.556
KS     0.556       NV     1.000       WY     0.556

State party rates

State  Dem   Rep     State  Dem   Rep     State  Dem   Rep
AK     0.00  1.00    KY     0.22  0.78    NY     0.78  0.22
AL     0.00  1.00    LA     0.22  0.78    OH     0.44  0.56
AR     0.22  0.78    MA     0.78  0.22    OK     0.00  1.00
AZ     0.11  0.89    MD     0.78  0.22    OR     0.78  0.22
CA     0.67  0.33    ME     0.67  0.33    PA     0.67  0.33
CO     0.33  0.67    MI     0.67  0.33    RI     0.89  0.11
CT     0.67  0.33    MN     1.00  0.00    SC     0.00  1.00
DC     1.00  0.00    MO     0.22  0.78    SD     0.00  1.00
DE     0.67  0.33    MS     0.00  1.00    TN     0.22  0.78
FL     0.33  0.67    MT     0.11  0.89    TX     0.00  1.00
GA     0.22  0.78    NC     0.11  0.89    UT     0.00  1.00
HI     0.89  0.11    ND     0.00  1.00    VA     0.22  0.78
IA     0.67  0.33    NE     0.10  0.90    VT     0.67  0.33
ID     0.00  1.00    NH     0.56  0.44    WA     0.78  0.22
IL     0.67  0.33    NJ     0.67  0.33    WI     0.78  0.22
IN     0.11  0.89    NM     0.56  0.44    WV     0.44  0.56
KS     0.00  1.00    NV     0.44  0.56    WY     0.00  1.00

State Clusters

One question to answer is how many clusters to use.

If one wants a numeric way to estimate the number of clusters, the nScree function from the nFactors package can be used, as follows:

states[,-1] %>%
 nFactors::nScree() %>%
 nFactors::plotnScree()

The suggested number of clusters is a good minimum to use.

Since it is important that cluster assignments have business/analytical value, it is generally useful to increase the number of clusters (provided they are stable) until useful distinctions based on other data stand out.

kmeans clustering


kmeans is an automated approach to clustering that works as follows:
  1. k random centers are selected within the bounds of the data (you pick the value of k)
  2. the distance between each data point and each random center is calculated and the data point is assigned to the cluster corresponding to the nearest center
  3. the centers are then recalculated as the mean (a.k.a. average) of the points in the cluster
  4. Steps #2 and #3 are repeated until the centers stop moving (or until the maximum number of iterations allowed is reached (this prevents infinite loops))
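The steps above can be sketched directly (the article itself uses R's kmeans; this toy JavaScript version seeds with the first k points rather than random ones for brevity):

```javascript
// A toy k-means: points are arrays of numbers, k is the cluster count.
function kmeans(points, k, maxIter) {
  let centers = points.slice(0, k).map(p => p.slice()); // step 1 (deterministic seed)
  let assign = new Array(points.length).fill(-1);
  for (let iter = 0; iter < (maxIter || 100); iter++) {
    // step 2: assign each point to its nearest center (squared distance)
    const next = points.map(p => {
      let best = 0, bestD = Infinity;
      centers.forEach((c, j) => {
        const d = c.reduce((s, cj, t) => s + (cj - p[t]) * (cj - p[t]), 0);
        if (d < bestD) { bestD = d; best = j; }
      });
      return best;
    });
    // step 4: stop once the assignments (and hence centers) are stable
    if (next.every((a, i) => a === assign[i])) break;
    assign = next;
    // step 3: recompute each center as the mean of its members
    centers = centers.map((c, j) => {
      const members = points.filter((_, i) => assign[i] === j);
      return members.length === 0 ? c
        : c.map((_, t) => members.reduce((s, m) => s + m[t], 0) / members.length);
    });
  }
  return { centers: centers, cluster: assign };
}
```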

It should be noted that different runs of a kmeans algorithm can produce different clusters based on which random centers are used at the beginning of the algorithm. This is an effect of the algorithm and not the data.

In the snippet below, code is shown to calculate 6 clusters from the state data. The first column, which contains the state name, is excluded and only the 0/1 preference columns are used.

The code also shows an easy way to plot clusters, via clusplot from the cluster package [note: clusplot needs more (non-parallel) rows than columns].

set.seed(8675309)
nCenters <- 6
states.km <- states[,-1] %>%
 kmeans(centers = nCenters,
        nstart = 1e2)
clstr <- states.km$cluster

library(cluster)
library(functional) # provides Curry

clusplot2 <- Curry(clusplot,
                  color=T,
                  shade=T,
                  lines=0,
                  labels=2)

clusplot2(states[,-1],
         clstr,
         main = "States")

clusterboot

One of the key clustering challenges is determining if the cluster is “real”.

The centers initially chosen by the kmeans algorithm can impact the result.

One way to see if the clusters are “real” is to see if they hold up when variations of the data are used via a sampling process that allows replacement (this is called bootstrapping).

The clusterboot function in the fpc package performs bootstrap resampling and repeatedly runs the clustering algorithm indicated.

library(fpc)
nRuns = 100

clusterboot2 <- Curry(clusterboot,
                     clustermethod = kmeansCBI,
                     runs = nRuns,
                     iter.max = 1000,
                     seed = 8675309,
                     count = F)

suppressWarnings({
 states.cboot <- clusterboot2(states[,-1],
                              krange = nCenters)
})

clstr <- states.cboot$partition
clusplot2(states[,-1],
         clstr,
         main = "States")
stabilities <- states.cboot$bootmean %>%
 cut(breaks=c(0,
              0.6,
              0.75,
              0.85,
              1),
     labels=c("unstable",
              "measuring something",
              "stable",
              "highly stable"))

Cluster 1 - highly stable - breakdown rate: 0.03

state  demRate  repRate  winRate
CA     0.67     0.33     0.78
CT     0.67     0.33     0.78
DE     0.67     0.33     0.78
IL     0.67     0.33     0.78
MD     0.78     0.22     0.67
ME     0.67     0.33     0.78
MI     0.67     0.33     0.78
NH     0.56     0.44     0.89
NJ     0.67     0.33     0.78
NM     0.56     0.44     0.89
PA     0.67     0.33     0.78
VT     0.67     0.33     0.78

Cluster 1

Dem   Rep   Win
0.66  0.34  0.79

Cluster 2 - stable - breakdown rate: 0.26

state  demRate  repRate  winRate
DC     1.00     0.00     0.44
HI     0.89     0.11     0.56
MN     1.00     0.00     0.44
RI     0.89     0.11     0.56

Cluster 2

Dem   Rep   Win
0.94  0.06  0.50

Cluster 3 - stable - breakdown rate: 0.12

state  demRate  repRate  winRate
IA     0.67     0.33     0.78
MA     0.78     0.22     0.67
NY     0.78     0.22     0.67
OR     0.78     0.22     0.67
WA     0.78     0.22     0.67
WI     0.78     0.22     0.67

Cluster 3

Dem   Rep   Win
0.76  0.24  0.69

Cluster 4 - highly stable - breakdown rate: 0

state  demRate  repRate  winRate
AK     0.00     1.00     0.56
AL     0.00     1.00     0.56
AZ     0.11     0.89     0.67
ID     0.00     1.00     0.56
IN     0.11     0.89     0.67
KS     0.00     1.00     0.56
MS     0.00     1.00     0.56
NC     0.11     0.89     0.67
ND     0.00     1.00     0.56
NE     0.10     0.90     0.60
OK     0.00     1.00     0.56
SC     0.00     1.00     0.56
SD     0.00     1.00     0.56
TX     0.00     1.00     0.56
UT     0.00     1.00     0.56
WY     0.00     1.00     0.56

Cluster 4

Dem   Rep   Win
0.03  0.97  0.58

Cluster 5 - stable - breakdown rate: 0.07

state  demRate  repRate  winRate
AR     0.22     0.78     0.78
GA     0.22     0.78     0.56
KY     0.22     0.78     0.78
LA     0.22     0.78     0.78
MO     0.22     0.78     0.78
MT     0.11     0.89     0.67
TN     0.22     0.78     0.78
WV     0.44     0.56     0.56

Cluster 5

Dem   Rep   Win
0.24  0.76  0.71

Cluster 6 - highly stable - breakdown rate: 0.12

state  demRate  repRate  winRate
CO     0.33     0.67     0.89
FL     0.33     0.67     0.89
NV     0.44     0.56     1.00
OH     0.44     0.56     1.00
VA     0.22     0.78     0.78

Cluster 6

Dem   Rep   Win
0.36  0.64  0.91

Candidate Clusters

Below is the code to create candidate clusters. The first three columns contain non-preference data and are not used in the cluster computation.

Because there are fewer rows than columns, a cluster plot is not possible without some manipulation.

candidates <- results %>%
 spread(state, votes)

candidates[,-c(1:3)] <- ifelse(candidates[,-c(1:3)]>0,1,0)

nCenters <- 8
candidates %<>%
 as.data.frame()
rownames(candidates) <-
 paste(candidates$candidate, candidates$year)
suppressWarnings({
 candidates.cb <- clusterboot2(
   candidates[,-c(1:3)],
   krange = nCenters)
})

# hack to print a candidates graph
# clusplot(candidates[,-c(1:3)], candidates.cb$partition, ...) would fail
# because there are more columns (variables) than rows (observations)
# SVD is a technique used to factor the data into three matrices
# it discovers "latent" variables and concentrates the most significant to be first
# it is useful for reducing the number of dimensions to improve computational time
# the u matrix represents the rows rotated (candidates)
# the d matrix is a diagonal that represents scaling factors (i.e. strength)
# the v matrix represents the columns rotated (states)
# the hack here allows us to have two coordinates to plot
u <- candidates[,-c(1:3)] %>%
 svd() %$%
 u
rownames(u) <- paste(candidates$candidate, candidates$year)
clusplot2(u,
         candidates.cb$partition,
         main = "Candidates")

Cluster 1 - stable - breakdown rate: 0.23

Candidate  Year  Party  Outcome
Gore       2000  D      lose
Kerry      2004  D      lose

Cluster 2 - stable - breakdown rate: 0.21

Candidate  Year  Party  Outcome
Bush       2000  R      win
Bush       2004  R      win

Cluster 3 - highly stable - breakdown rate: 0.12

Candidate  Year  Party  Outcome
Bush       1992  R      lose
Dole       1996  R      lose

Cluster 4 - stable - breakdown rate: 0.31

Candidate  Year  Party  Outcome
Obama      2008  D      win
Obama      2012  D      win

Cluster 5 - stable - breakdown rate: 0.27

Candidate  Year  Party  Outcome
Bush       1988  R      win
Reagan     1980  R      win
Reagan     1984  R      win

Cluster 6 - stable - breakdown rate: 0.17

Candidate  Year  Party  Outcome
Bentsen    1988  D      lose
Carter     1980  D      lose
Dukakis    1988  D      lose
Mondale    1984  D      lose

Cluster 7 - highly stable - breakdown rate: 0.1

Candidate  Year  Party  Outcome
Clinton    1992  D      win
Clinton    1996  D      win

Cluster 8 - stable - breakdown rate: 0.22

Candidate  Year  Party  Outcome
McCain     2008  R      lose
Romney     2012  R      lose

Predictions

To assign a new candidate to a cluster,
  • calculate the distance between each candidate and each cluster center
  • assign the candidate to the cluster with the minimum distance

calcCandidateMostLike <- function (candData) {
 dists <- matrix(NA, nrow = 1, ncol = nCenters)
 for (i in 1:nCenters) {
   ri <- candidates.cb$result$result$centers[i,]
   dri <- rbind(ri, candData) %>%
     dist()
   dists[i] <- dri
 }
 
 mostLike <- dists %>% which.min()
 candidates.cb$result %$%
   data.frame(candidate = names(partition),
              partition = partition) %>%
   filter(partition == mostLike) %>%
   select(candidate)
}

Data from http://projects.fivethirtyeight.com/2016-election-forecast/, gathered a couple weeks ago, is used to create candidate vectors for the 2016 Presidential Election.

For this analysis, any state that the FiveThirtyEight results show preferring a candidate at 60% or more is assumed to be won by that candidate. States with a smaller preference are left as undecided, and predictions are made about how they will vote.

# 60% cut off
clinton2016 <- c("Clinton", 2016, "D",
                0, 0, 0, 0, 1, # AK, AL, AR, AZ, CA,
                1, 1, 1, 1, 0, # CO, CT, DC, DE, FL,
                0, 1, 0, 0, 1, # GA, HI, IA, ID, IL,
                0, 0, 0, 0, 1, # IN, KS, KY, LA, MA,
                1, 1, 1, 1, 0, # MD, ME, MI, MN, MO,
                0, 0, 0, 0, 0, # MS, MT, NC, ND, NE,
                1, 1, 1, 1, 1, # NH, NJ, NM, NV, NY,
                0, 0, 1, 1, 1, # OH, OK, OR, PA, RI,
                0, 0, 0, 0, 0, # SC, SD, TN, TX, UT,
                1, 1, 1, 1, 0, # VA, VT, WA, WI, WV,
                0) # WY

trump2016 <- c("Trump", 2016, "R",
              1, 1, 1, 1, 0, # AK, AL, AR, AZ, CA,
              0, 0, 0, 0, 0, # CO, CT, DC, DE, FL,
              1, 0, 0, 1, 0, # GA, HI, IA, ID, IL,
              1, 1, 1, 1, 0, # IN, KS, KY, LA, MA,
              0, 0, 0, 0, 1, # MD, ME, MI, MN, MO,
              1, 1, 0, 1, 1, # MS, MT, NC, ND, NE,
              0, 0, 0, 0, 0, # NH, NJ, NM, NV, NY,
              0, 1, 0, 0, 0, # OH, OK, OR, PA, RI,
              1, 1, 1, 1, 1, # SC, SD, TN, TX, UT,
              0, 0, 0, 0, 1, # VA, VT, WA, WI, WV,
              1) # WY

Ignoring the holdout/undecided states, the current candidates are most similar to the following candidates.

Candidates most like Clinton

candidate
Gore 2000
Kerry 2004

Candidates most like Trump

candidate
McCain 2008
Romney 2012

Using the decided states, we observe the following results for the clusters:

election2016 <- data.frame(state = states$state,
                          stringsAsFactors = F)

election2016$Clinton_2016 <-
 clinton2016[-c(1:3)] %>%
 as.numeric()

election2016$Trump_2016 <-
 trump2016[-c(1:3)] %>%
 as.numeric()

election2016 %<>%
 join(states.cboot$result$partition %>%
        adply(1,
              .id = c("state")),
      by = "state") %>%
 select(state, Clinton_2016, Trump_2016,
        cluster = 4)

cluster2016 <- election2016 %>%
 ddply("cluster", summarise,
       Clinton = sum(Clinton_2016),
       Trump = sum(Trump_2016))

cluster2016 %>%
 kable(caption="Cluster preferences")

Cluster preferences

cluster  Clinton  Trump
1        12       0
2        4        0
3        5        0
4        0        15
5        0        8
6        3        0

To make predictions about the undecided states, assume that undecided states will follow the preference of their historic cluster.

undecidedStates <- election2016 %>%
 ddply("state", summarise,
       votes = Clinton_2016 + Trump_2016) %>%
 filter(votes == 0)

undecidedStates %<>%
 join(election2016, "state") %>%
 select(state, cluster) %>%
 join(cluster2016, "cluster")

undecidedStates %>%
 select(state, cluster) %>%
 kable(caption="Undecided States")

Undecided States

state  cluster
FL     6
IA     3
NC     4
OH     6

The undecided states are then assigned to the candidate favored by their cluster, as follows:

# assign undecided states based on previous cluster
election2016[election2016$state %in%
              undecidedStates[undecidedStates$Clinton > 0,
                              "state"],]$Clinton_2016 <- 1
election2016[election2016$state %in%
              undecidedStates[undecidedStates$Trump > 0,
                              "state"],]$Trump_2016 <- 1

undecidedStates %>%
 ddply(.(state), mutate,
       VotingFor = ifelse(Clinton > 0,
                          "Clinton",
                          "Trump")) %>%
 select(state, VotingFor) %>%
 arrange(state) %>% kable(caption = "Undecided State Predictions",
                          col.names = c("State","Voting For"))

Undecided State Predictions

State  Voting For
FL     Clinton
IA     Clinton
NC     Trump
OH     Clinton

After the assignment of the undecided states, we see that the current candidates are most like the following.

Clinton most like

candidate
Obama 2008
Obama 2012

Trump most like

candidate
McCain 2008
Romney 2012

Thus, we would conclude that Mrs. Clinton will win.