How many ideas might you expect to find in customers' responses to an open-ended survey question? Here's an empirical analysis of verbatim text coding from a recent survey, comparing actual data to the values expected under Zipf's Law and Heaps' Law.

The survey question was "Why did you choose the product you selected?". Respondents provided free-text responses. A professional analyst coded 200 responses in Protobi using the new verbatim coding widget.

Four responses were blank and were excluded from this analysis. Codes are sanitized for display.

Frequency distribution of individual codes
The first graph shows the frequency distribution. It exhibits a "long-tail" shape: a few high-frequency codes and many lower-frequency ones.
Frequency of individual codes vs Rank (Zipf's Law)

According to Zipf's Law we'd expect the frequency of each code to be inversely proportional to its rank:

z(r) = z_{max} \cdot r ^ {-\alpha}

Zipf's law can be derived from the power-law probability distribution, which describes many long-tail phenomena.

Here the blue line represents actual frequencies. The green line is the theoretical expectation, with exponent \alpha = 1.0 and intercept equal to the frequency of the most common response. The grey line is the least-squares estimate, with exponent \alpha = 1.001.

Simply put, an exponent of 1 (a slope of -1 on log-log axes) means we'd expect the 2nd most common code to appear about 1/2 as often as the 1st, the 3rd most common code about 1/3 as often, the 4th about 1/4 as often, and so on.
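As a sketch of how such a fit can be done, here is the log-log least-squares estimate in Python. The frequencies below are illustrative placeholders, not the actual survey data:

```python
import numpy as np

# Illustrative code frequencies, sorted by rank.
# Placeholders, not the actual survey data.
freqs = np.array([48, 25, 16, 12, 9, 8, 6, 5, 4, 3, 3, 2, 2, 1, 1])
ranks = np.arange(1, len(freqs) + 1)

# Zipf's law z(r) = z_max * r^-alpha is linear on log-log axes:
# log z = log z_max - alpha * log r, so fit a line by least squares.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
alpha = -slope

# Theoretical curve with alpha = 1, anchored at the top frequency
expected = freqs[0] / ranks
```

The estimated `alpha` is the (negated) slope of the fitted line; an estimate near 1.0 indicates a classic Zipf distribution.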

Per the graph below, the data match Zipf's Law extremely well (except at the tails, which is often the case in practice):

Unique codes encountered vs responses seen (Heaps' Law)

A practical question is "How many respondents do we need to discover all the codes there are to be found (above some minimum prevalence)?". Or conversely, with our planned sample size, how many ideas are likely to remain undiscovered? This is a variation of the Coupon Collector's Problem [2] (or Baseball Card Collector's problem?).
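For intuition, the classic uniform-probability version of the coupon collector's problem has a closed-form expectation: collecting all n equally likely codes takes n · H_n responses on average, where H_n is the nth harmonic number. A minimal sketch, noting that the uniform assumption is a simplification (real codes follow a skewed, Zipf-like distribution, so the true wait is longer):

```python
import math

def expected_draws(n: int) -> float:
    """Expected number of responses needed to observe all n codes,
    assuming each code is equally likely (coupon collector: n * H_n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))

# With, say, 50 equally likely codes, expect roughly 225 responses
# before every code has appeared at least once.
```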

The number of distinct codes we'd expect to encounter at a given sample size should be described by Heaps' Law:

N(t) \approx k \cdot t ^ {\beta}

where N(t) is the number of distinct codes we would expect to find in t responses, and k and \beta are estimated empirically.
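A sketch of estimating k and \beta by least squares on log-log axes. A synthetic Zipf-distributed stream stands in for the real coded responses, which aren't reproduced here:

```python
import numpy as np

# Synthetic stream of 196 coded responses; each draw is a code id.
# Stand-in for the real data.
rng = np.random.default_rng(0)
codes = rng.zipf(2.0, size=196)

# N(t): number of distinct codes seen after each of the first t responses
seen, distinct = set(), []
for c in codes:
    seen.add(int(c))
    distinct.append(len(seen))

t = np.arange(1, len(codes) + 1)
# Heaps' law N(t) ~ k * t^beta is linear on log-log axes
beta, log_k = np.polyfit(np.log(t), np.log(distinct), 1)
k = np.exp(log_k)
```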

It turns out that Heaps' Law is a consequence of Zipf's Law above [1]. And in the special case where \alpha = 1 in the Zipf distribution, there is an exact formula for Heaps' Law based on the Lambert W function [3]:

N(t) = \frac {t} {W(t)}
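A sketch of evaluating this curve, assuming SciPy is available (`scipy.special.lambertw` returns a complex value, so we take the real part):

```python
from scipy.special import lambertw

def expected_codes(t: float) -> float:
    """Exact Heaps' curve N(t) = t / W(t) for the Zipf alpha = 1 case."""
    return t / lambertw(t).real

# e.g. expected number of distinct codes among 196 non-blank responses
n_hat = expected_codes(196)
```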

(Amazingly, this dataset yields \alpha = 1.001, and the Lambert W function also appeared in a completely unrelated analysis we recently did to calculate optimal price in a discrete choice model.)

In the graph below, the blue line shows the actual number of unique codes encountered in the first t responses. The grey line is the least-squares estimate, yielding an exponent \beta = 0.460, well within the 0.4 to 0.6 range reported for other analyses of English-language text. Again the data match Heaps' Law very well.

So what...

We're actively exploring ways to make open-ended text responses a rich source of insight for market researchers, and to make text analysis fun and easy. This is an early step, looking at the data with thought-leading clients.

A practical outcome may be simple diagnostic metrics that help identify whether the data is undercoded (i.e. there may be ideas yet to be discerned) or overcoded (i.e. we may be making more distinctions than the sample size supports).

Research questions

Looking at the text responses and the process of coding them raises a number of interesting questions. At a high level:

  • When do "interesting" ideas appear? The most common answers are presumably already largely known, and the rarest responses may not be relevant.
  • What makes a response "interesting" to an end client? Which ideas does the client consider to be the "pearls"?
  • Do end clients and product/marketing managers code differently than analysts? Do they make different distinctions?
Other questions are more technical:
  • Do most verbatim questions follow these curves? Do they fall in a close or wide range?
  • How often do responses include multiple ideas that match several codes?
  • Is it common for codes to coalesce and split as analysis proceeds?
  • Can the computer learn from the initial codes and provide good auto-guesses as coding proceeds?

If you have text survey data and are interested in mining it further, contact us at


  • [1] Lü L, Zhang Z-K, Zhou T (2010) Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems. PLoS ONE 5(12): e14139. doi:10.1371/journal.pone.0014139
  • [2] Marco Ferrante, Monica Saltalamacchia (2014) The Coupon Collector's Problem. MATerials MATemàtics, Volum 2014, treball no. 2, 35 pp. ISSN 1887-1097. Publicació electrònica de divulgació del Departament de Matemàtiques de la Universitat Autònoma de Barcelona.
  • [3] R.M. Corless, G.H. Gonnet, D.E.G. Hare, D.M. Jeffrey and D. E. Knuth, "On Lambert's W Function", Technical Report CS-93-03, University of Waterloo, January 1993.