

<p><style>
 .blog img { padding-left: 40px; }
</style></p>
<p><script type="text/javascript" src="/javascripts/prism.js"></script></p>
<link rel="stylesheet" href="/stylesheets/prism.css"/>

<p>Here's how to compute candidate market segmentations using <a href="http://www.r-project.org/">R</a> and profile them in <a href="http://protobi.com">Protobi</a>, using the most recent
<a href="http://protobi.com/post/stackoverflow-developer-survey">StackOverflow Developer Survey</a> as a case study.</p>
<p>This yields a simple segmentation of developers based on which tasks they spend their time on (note
how little time, in any country, is spent looking for a new job...).</p>
<p><img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-overall.png" /></p>

<h1 id="overview">Overview</h1>
<p>There are many good ways to develop candidate segmentations, including K-means clustering, CART/CHAID, and even sheer business insight.
Analysts can reasonably disagree on the best
algorithm for a given task, or even combine multiple approaches, for example using Random Forests.</p>
<p>Whatever the algorithm, almost every segmentation analysis generates multiple alternative segmentations.
Additionally, it's common practice to give each segment a mnemonic name.</p>
<p>But how do you evaluate and choose among the candidates?  How do you get a qualitative sense of each segment, to choose a name?</p>
<p>We argue here that a good first step is simply to look at the candidates, and we show how to do so using Protobi.</p>
<p>The goal here is not to show the <b>only</b> way to go about it, but to show one practical workflow.</p>
<h1 id="stackoverflow-survey">StackOverflow Survey</h1>
<p>StackOverflow conducted a survey of its members.  Click here to <a href="https://app.protobi.com/v3/datasets/53e0ec7e04e4be020000000b#filter/">explore the data in Protobi</a>.</p>
<p>One of the questions was "In an average week, how do you spend your time at work?"
for an array of tasks, such as "New feature development", "Meetings", or "Looking for a new job".</p>
<p>Respondents selected a single choice of "None", "1-2 hours", "2-5 hours", "5-10 hours", "10 to 20 hours", or "20+ hours".
The above graph shows percent of respondents who selected "10 to 20 hours" or "20+ hours" for each task.</p>
<p>This example develops a segmentation based on responses to this section.</p>
<h1 id="latent-class-segmentation-in-r">Latent Class Segmentation in R</h1>
<p>Here we run Latent Class Analysis with the <a href="http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf">poLCA package for R</a>
to derive candidate segmentations,
and attach predicted segment membership back to the original data file for evaluation.
(An analogous process can be done using LatentGold, QUICK CLUSTER in SPSS or FASTCLUS in SAS.)</p>
<p>The basic steps are:</p>
<ol>
<li>Import data</li>
<li>Recode basis variables</li>
<li>Segment respondents into various numbers of clusters</li>
<li>Merge back predicted segment membership</li>
<li>Export data</li>
</ol>
<p><b>Step 1:</b>  Load the relevant packages and read the input dataset.  The data is available in SAV format as
<a href="/data/examples/2013_StackOverflowRecoded.sav">2013_StackOverflowRecoded.sav</a> and in CSV format as
 <a href="/data/examples/2013_StackOverflowRecoded.csv">2013_StackOverflowRecoded.csv</a>.</p>
<p><b>Step 2:</b>  A quirk of <code>poLCA</code> is that the variables used as the segmentation basis must be coded as a sequence
of integers starting with 1 (i.e. <code>1, 2, 3, ...</code>).   Here the values are already coded as a sequence of integers,
but starting at 0, so we increment them by 1 using the <code>recode</code> function from the <code>car</code> package.</p>
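<p>Because the recode here is just a shift by one, the same result can be sketched in base R without repeating a line per column.  This is a minimal sketch on a toy stand-in data frame; it assumes the basis columns are numeric and named <code>q14_1</code>, <code>q14_2</code>, and so on, as in the program further down.</p>
<pre><code class="language-r">
# Toy stand-in for the survey data frame; the real `so` comes from read.csv().
so <- data.frame(q14_1 = c(0, 3, 5, NA), q14_2 = c(1, 0, 4, 2))

basis_in  <- grep("^q14_[0-9]+$", names(so), value = TRUE)  # "q14_1", "q14_2", ...
basis_out <- sub("^q14_", "rs14_", basis_in)                # "rs14_1", "rs14_2", ...

# Shift every 0..5 code up by one so the values run 1..6; NA stays NA.
so[basis_out] <- so[basis_in] + 1

print(so)
</code></pre>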
<p><b>Step 3:</b>  Run cluster analyses to create solutions with 2, 3, 4, 5 and 6 clusters, respectively.
The <code>poLCA</code> algorithm treats all basis variables as categorical, not ordinal or continuous,
so it can't recognize the inherent ordinality in our coding, whereas the more sophisticated algorithms in
 <a href="http://statisticalinnovations.com/products/latentgold.html">LatentGold</a> from Statistical Innovations can.</p>
<p>Note that we set <code>na.rm=FALSE</code>, which means that respondents with missing values
on the basis variables are retained in the analysis rather than dropped.</p>
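<p>Before choosing a value for <code>na.rm</code>, it can help to count how many respondents have a missing value on at least one basis variable, since those are the rows the setting affects.  A minimal sketch in base R, using a toy stand-in for the recoded basis columns:</p>
<pre><code class="language-r">
# Toy stand-in for the recoded basis columns (rs14_1 ... rs14_9 in the real data).
basis <- data.frame(rs14_1 = c(1, 2, NA, 6),
                    rs14_2 = c(3, NA, NA, 1),
                    rs14_3 = c(2, 2, 5, 4))

n_total    <- nrow(basis)
n_complete <- sum(complete.cases(basis))   # rows with no NA on any basis item

cat("Respondents:", n_total, "\n")
cat("Complete on all basis items:", n_complete, "\n")
cat("Missing at least one basis item:", n_total - n_complete, "\n")
</code></pre>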
<p><b>Step 4:</b>  Finally, we merge the predicted class memberships from each solution back to the main data frame.</p>
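<p>Steps 3 and 4 can also be written as a loop over the number of classes.  The sketch below runs on simulated stand-in data rather than the survey file, and the column names <code>x1</code> to <code>x4</code>, <code>clu2</code> and <code>clu3</code> are hypothetical.  It additionally sets <code>nrep</code> (several random starts per model) and turns off <code>poLCA</code>'s console output and plots; those settings are our assumption, not part of the original analysis.</p>
<pre><code class="language-r">
library(poLCA)

set.seed(42)  # poLCA uses random starting values; the seed itself is arbitrary

# Simulated stand-in data: 4 items with 3 levels each (not the survey data).
n <- 1000
toy <- data.frame(x1 = sample(1:3, n, replace = TRUE),
                  x2 = sample(1:3, n, replace = TRUE),
                  x3 = sample(1:3, n, replace = TRUE),
                  x4 = sample(1:3, n, replace = TRUE))
f <- cbind(x1, x2, x3, x4) ~ 1

for (k in 2:3) {
  fit <- poLCA(f, toy, nclass = k, na.rm = FALSE, nrep = 3,
               verbose = FALSE, graphs = FALSE)
  toy[[paste0("clu", k)]] <- fit$predclass   # one predicted class per row
  cat("classes:", k, " BIC:", fit$bic, "\n")
}
</code></pre>
<p>Because latent class models are fit from random starting values, repeated runs can land on different solutions or label the same segments in a different order, so fixing a seed makes the merged columns reproducible.</p>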
<p><b>Step 5:</b> Export as a new CSV file.  That's the data we'll view in Protobi.</p>
<p>The complete R program is below:</p>
<pre><code class="language-r">
# Step 1: Load packages and data (the install.packages() calls are only needed once per machine)
install.packages("car");
install.packages("poLCA");
install.packages("scatterplot3d");  # required by poLCA
install.packages("MASS"); # required by poLCA
library(car); # for recoding
library(poLCA); # for segmentation

so <- read.csv("2013_StackOverflowRecoded.csv", header=TRUE, sep=",")

# Step 2: Recode basis variables to positive integers starting at one
so$rs14_1 <- recode(so$q14_1,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_2 <- recode(so$q14_2,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_3 <- recode(so$q14_3,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_4 <- recode(so$q14_4,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_5 <- recode(so$q14_5,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_6 <- recode(so$q14_6,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_7 <- recode(so$q14_7,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_8 <- recode(so$q14_8,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_9 <- recode(so$q14_9,"5=6;4=5;3=4;2=3;1=2;0=1;")

# Step 3: Compute segmentation using above columns as the segmentation basis
q14rs <- cbind(rs14_1, rs14_2, rs14_3, rs14_4, rs14_5, rs14_6, rs14_7, rs14_8, rs14_9) ~ 1

q14clu2 <- poLCA(q14rs, so, nclass=2, na.rm=FALSE); # BIC(2): 163650.3
q14clu3 <- poLCA(q14rs, so, nclass=3, na.rm=FALSE); # BIC(3): 161445.0
q14clu4 <- poLCA(q14rs, so, nclass=4, na.rm=FALSE); # BIC(4): 160172.9
q14clu5 <- poLCA(q14rs, so, nclass=5, na.rm=FALSE); # BIC(5): 159428.3
q14clu6 <- poLCA(q14rs, so, nclass=6, na.rm=FALSE); # BIC(6): 159209.8

# Step 4: merge estimated segment membership back to main data frame
so$q14_clu2 <- q14clu2$predclass
so$q14_clu3 <- q14clu3$predclass
so$q14_clu4 <- q14clu4$predclass
so$q14_clu5 <- q14clu5$predclass
so$q14_clu6 <- q14clu6$predclass

# Step 5: export augmented data as a new CSV
write.table(so, file="2013_StackOverflowRecoded_lca.csv", sep=",", col.names=TRUE,qmethod="double", na="", row.names=FALSE)
</code></pre>
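<p>The BIC values noted in the comments above give one quantitative view of the candidates: lower is better, and the size of each drop shows how much an extra class buys.  A small sketch using exactly those values:</p>
<pre><code class="language-r">
# BIC values copied from the comments in the program above.
bic <- c("2" = 163650.3, "3" = 161445.0, "4" = 160172.9,
         "5" = 159428.3, "6" = 159209.8)

# Change in BIC from adding one more class (negative = improvement).
delta <- diff(bic)
print(round(delta, 1))

# The solution with the lowest BIC among those fitted.
print(names(bic)[which.min(bic)])
</code></pre>
<p>BIC keeps falling through six classes, but each added class improves it by less than the one before, which is one reason to also judge the candidates on how interpretable they are.</p>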

<h1 id="visualize-segments-in-protobi">Visualize segments in Protobi</h1>
<p>We had already created a project based on the original dataset and organized its view nicely, so we updated the
project in place with the new augmented dataset.  Updating in place lets us keep the same
map while adding or dropping fields, possibly with new records or field values.</p>
<p>There are several new fields, one for each cluster solution, including the 3-cluster solution, <code>q14_clu3</code>.
At first its segments are unnamed, with just the values 1, 2 and 3.  We can get a sense of their character by drilling into each
value and looking for significant differences.</p>
<h3 id="candidate-segment-1">Candidate segment 1</h3>
<p>For instance, below we click into value <code>q14_clu3</code> = <code>1</code>:</p>
<p><img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-1.png" />
<img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-1-q14.png" /></p>
<p>Here the values for respondents in this segment are shown in blue.  The baseline distribution for all respondents is shown as a light grey shadow for comparison.</p>
<p>We can see that segment <code>1</code> is significantly less likely (as indicated by the gray arrow icon)
to spend a lot of time on new features or refactoring,
and more likely to spend a lot of time on meetings, technical support, new skills and everything else.  We might call these "All but dev".</p>
<h3 id="candidate-segment-2">Candidate segment 2</h3>
<p>Below is segment <code>2</code>.  These respondents are quite the opposite, focused almost exclusively on new features and code quality:</p>
<p> <img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-2.png" />
 <img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-2-q14.png" /></p>
<h3 id="candidate-segment-3">Candidate segment 3</h3>
<p>Finally, here is segment <code>3</code>.  These respondents are even more likely than segment <code>2</code> to spend a lot of time on new features and quality,
yet also more likely than segment <code>1</code> to spend a lot of time in meetings, tech support and learning new skills.
We might call this segment "Dev and growth" (to be literal) or perhaps "Entrepreneur" (to apply a bit of descriptive license).</p>
<p> <img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-3.png" />
 <img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-3-q14.png" /></p>
<h1 id="profile-a-segmentation">Profile a segmentation</h1>
<p>So now we can name the segments:</p>
<p> <img style="margin: 0 auto; width: 360px;" src="/images/blog/so-segmentation/so-clu3-named.png" /></p>
<p>Clicking and contrasting is fun and informative for exploratory analysis.  But to present it to the client,
we might aim for a more concise crosstab (which we can copy to Excel and create a stylized custom chart):</p>
<p> <img style="margin: 0 auto; width: 440px;" src="/images/blog/so-segmentation/so-clu3-crosstab.png" /></p>
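<p>The same kind of profile can be approximated in R before loading the data into Protobi.  This is a rough sketch on a toy data frame, not the survey data; it assumes that the recoded values 5 and 6 correspond to "10 to 20 hours" and "20+ hours", and the column name <code>heavy_1</code> is hypothetical.</p>
<pre><code class="language-r">
# Toy stand-in: one recoded task item (1..6) and a 3-class membership column.
so <- data.frame(
  rs14_1   = c(6, 5, 2, 1, 5, 3, 6, 2),
  q14_clu3 = c(2, 2, 1, 1, 3, 1, 3, 2)
)

# Assumption: recoded values 5 and 6 are "10 to 20 hours" and "20+ hours".
so$heavy_1 <- so$rs14_1 >= 5

overall    <- mean(so$heavy_1, na.rm = TRUE)
by_segment <- tapply(so$heavy_1, so$q14_clu3, mean, na.rm = TRUE)

# Percent of respondents spending 10+ hours on the task, overall and per segment.
print(round(100 * c(overall = overall, by_segment), 1))
</code></pre>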
<h1 id="compare-alternative-segmentations">Compare alternative segmentations</h1>
<p>Wait ... what about the four-cluster solution?  How's that different?  Might that be better?  Let's take a look!</p>
<p>One thing we can do is crosstab two candidate solutions.  That's easy to do in Protobi by dragging the header of one to the header of the other.</p>
<p>For instance, we can compare the 4-cluster solution to the 3-cluster solution.  Here we can see that:</p>
<ul>
<li>segment <code>4-1</code> corresponds to <code>3-3</code> ("Dev and growth")</li>
<li>segment <code>4-4</code> corresponds to <code>3-2</code> ("Primarily dev")</li>
<li>segments <code>4-2</code> and <code>4-3</code> split <code>3-1</code> ("All but dev")</li>
</ul>
<p>   <img style="margin: 0 auto; width: 440px;" src="/images/blog/so-segmentation/so-clu4-vs-clu3.png" /></p>
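<p>A comparable crosstab of two membership columns can be sketched in R with <code>table</code>.  The vectors below are toy values constructed to mirror the correspondence described above, not the survey results.</p>
<pre><code class="language-r">
# Toy stand-in for two membership columns on the same respondents.
clu3 <- c(3, 3, 2, 2, 1, 1, 1, 1)
clu4 <- c(1, 1, 4, 4, 2, 2, 3, 3)

xt <- table(clu4 = clu4, clu3 = clu3)
print(xt)

# Row percentages: where each 4-cluster segment lands in the 3-cluster solution.
print(round(100 * prop.table(xt, margin = 1)))
</code></pre>
<p>Rows that concentrate in a single column indicate a segment that carries over intact; several rows sharing one column indicate a segment that was split.</p>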
<h1 id="summary">Summary</h1>
<p>This post provides a brief tutorial on how to estimate candidate segmentations in an external software package
and visualize the resulting segmentations in Protobi.</p>
<p>Try Protobi with your next segmentation project, and let our expert analysts show you how.</p>
