Hofstadter’s Law: It always takes longer than you expect, even when you take into account Hofstadter’s Law.

— Douglas Hofstadter, Gödel, Escher, Bach

Writing prose and writing code have a lot in common, but perhaps the biggest similarity is that neither writers nor programmers can get things done on time. Writers are notorious for missing deadlines. Programmers are notorious for being wildly off with estimates. The question is, why?

Today, I had an idea for how to answer this question. What I found was eye-opening.

Looking into my books

I wrote both of my books, Hello, Startup and Terraform: Up & Running, using Atlas, which manages all the content in Git. That means that every line of text, every edit, and every change was captured in the Git commit log.

So what does it really take to write two books?

Hello, Startup

Let’s start with my first book, Hello, Startup, which is 602 pages long and contains roughly 190,000 words. I ran cloc in the Hello, Startup Git repo and got the following output (truncated for readability):

-------------------------------------------------------------
Language                     files          blank        code
-------------------------------------------------------------
AsciiDoc                        18           3392       20495
HTML                            20             35        3149
Java                           103            654        2308
Ruby                             8             28         110
JSON                             1              0          79
Python                           1             34          67
CSS                              4             25          58
[...]
-------------------------------------------------------------
SUM:                           182           4241       26571
-------------------------------------------------------------

So the 602 pages come from 26,571 lines of text. The vast majority of those lines are in AsciiDoc, which is the Markdown-like language used in Atlas to write almost all the content. The rest consists of HTML and CSS, which are used in Atlas to define the layout and structure of the book, plus a whole bunch of other programming languages (Java, Ruby, Python, etc.) which are used in the many code examples throughout the book.

But the 602 pages and 26,571 lines we see are just the final result. They don’t capture the roughly 10 months of writing, rewriting, editing, proofreading, copyediting, research, note-taking, and other work it took to get there. To get more insight, I used git-quick-stats to analyze the entire commit log for the book:

Herman van der Veer:
  insertions:    163756 (95%)
  deletions:     131425 (95%)
  files:         1693 (59%)
  commits:       544 (32%)

So, I added 163,756 lines and deleted 131,425 lines, for a total of 295,181 lines of code churn. That is, I wrote and deleted 295,181 lines of text to produce a final output of 26,571 lines. That’s a ratio of more than 10:1! For every 1 line that got published, I actually had to write 10!
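The arithmetic behind that ratio can be checked with a small sketch. The function name `churn_ratio` is made up for illustration; the inputs are the figures quoted above.

```python
def churn_ratio(insertions: int, deletions: int, final_lines: int) -> tuple[int, float]:
    """Return (total churn, churn per line of final output)."""
    churn = insertions + deletions
    return churn, churn / final_lines


# Figures quoted above for Hello, Startup.
churn, ratio = churn_ratio(163_756, 131_425, 26_571)
print(f"{churn:,} lines of churn, {ratio:.1f}:1")  # 295,181 lines of churn, 11.1:1
```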

I admit that lines added and removed in Git are not a perfect measure of the editing process, but if anything, this data understates the work involved, as much of the editing process isn’t reflected in the Git commit log at all. For example, I wrote the first few chapters in Google Docs before switching to Atlas and I made many rounds of edits on my computer without commits in between.

The data is far from perfect, but I suspect the order of magnitude is correct: a 10:1 ratio of raw text to published text.

Terraform: Up & Running

Let’s see if the numbers are similar for my second book, Terraform: Up & Running, which is 206 pages and contains roughly 52,000 words. Here’s the (truncated) output from cloc:

---------------------------------------------------------------
Language                     files          blank          code
---------------------------------------------------------------
AsciiDoc                         8           1268          4283
HCL                             64            507          1730
Markdown                        40            583          1453
Go                               8            164           427
Bourne Shell                    20             57           142
[...]
---------------------------------------------------------------
SUM:                           170           2670          8410
---------------------------------------------------------------

Those 206 pages come from 8,410 lines of text. Again, the majority of the text is AsciiDoc, though you can see this book has even more code examples, primarily in HCL, which is the underlying language used in Terraform. There’s also lots of Markdown, which I used to document those HCL examples.

Let’s use git-quick-stats to check the edit history for this book:

Herman van der Veer:
  insertions:    32209 (85%)
  deletions:     22402 (89%)
  files:         1662 (70%)
  commits:       256 (28%)

Over roughly 5 months, I added 32,209 lines and deleted 22,402 lines, for a total of 54,611 lines of code churn. In this case, the editing process is even more understated, as Terraform: Up & Running started as a blog post series, which went through considerable churn before I moved it over to Atlas and Git. The blog post series is about half the length of the book, so it seems reasonable to increase the total churn by 50%. That gives us 54,611 * 1.5 = 81,916 lines of churn to produce 8,410 lines in the final result.

Again, we see a ratio of roughly 10:1! No wonder writers miss deadlines. We’re being held to a schedule for a 250 page book, but to write such a book, we actually have to write 2,500 pages.
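Here is the same adjustment as a quick worked calculation. It is only a sketch: the 50% uplift is the rough guess described above, not a measured value, and the variable names are illustrative.

```python
git_churn = 32_209 + 22_402    # insertions + deletions recorded in Git
adjusted = git_churn * 3 // 2  # add 50% for the blog-series drafts, rounded down
final_lines = 8_410            # total reported by cloc

print(git_churn, adjusted, round(adjusted / final_lines, 1))  # 54611 81916 9.7
```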

What about programming?

So how does writing a book compare to writing code? I decided to take a look at a few open source Git repos of various levels of maturity, from a few months old, all the way up to 23 years old.

terraform-aws-couchbase (2018)

terraform-aws-couchbase is a set of modules open sourced in 2018 to deploy and manage Couchbase on AWS. Here’s the (truncated) cloc output:

---------------------------------------------------------------
Language                      files          blank         code
---------------------------------------------------------------
HCL                              30            662         2298
Bourne Again Shell                7            326         1622
Markdown                         21            664         1270
Bourne Shell                     13            260         1069
Go                                7            211          795
[...]
---------------------------------------------------------------
SUM:                             94           2177         7481
---------------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    24417 (100%)
  deletions:     13276 (100%)
  files:         926 (100%)
  commits:       333 (100%)

That’s 37,693 lines of code churn to produce 7,481 final lines of code, or a 5:1 ratio. Even on a repo that’s less than 5 months old, we’ve already rewritten every line 5 times! No wonder software estimation is hard. We come up with a time estimate for ~7,500 lines of code, not realizing that we’ll have to write ~35,000 lines of code to get there!

Let’s see what happens when we look at a slightly older codebase.

Terratest (2016)

Terratest is an open source library that was created in 2016 for testing infrastructure code. Here’s the (truncated) cloc output:

-------------------------------------------------------------
Language                     files          blank        code
-------------------------------------------------------------
Go                              78           1204        4466
Markdown                        16            446         949
HCL                             20            159         451
[...]
-------------------------------------------------------------
SUM:                           136           1857        6140
-------------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    29247 (100%)
  deletions:     19879 (100%)
  files:         1254 (100%)
  commits:       394 (100%)

That’s 49,126 lines of code churn to produce 6,140 final lines of code, or an 8:1 ratio for this ~2 year old repo. But Terratest is still fairly young, so let’s go back in time a bit more.

Terraform (2014)

Terraform is an open source tool released in 2014 for managing infrastructure as code. Here’s the (truncated) cloc output:

--------------------------------------------------------------
Language                      files          blank        code
--------------------------------------------------------------
Go                             3858         232296     1326665
Markdown                        218           6783       23160
JSON                             24             15        7685
HCL                             794            917        4885
Assembly                         33            281        2459
Protocol Buffers                 18            583        1827
Bourne Shell                     36            328        1717
Perl                             16            271        1554
[...]
--------------------------------------------------------------
SUM:                           5045         241890     1371718
--------------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    7568830 (100%)
  deletions:     5377136 (100%)
  files:         82960 (100%)
  commits:       17125 (100%)

That’s 12,945,966 lines of code churn to produce 1,371,718 final lines of code, or a 9:1 ratio. Terraform is about 4 years old now, but it hasn’t hit a 1.0 release yet, so even with a 9:1 ratio, it’s not a fully “mature” codebase. Let’s go back a few more years.

Express.js (2010)

Express is a popular open source JavaScript web framework that was released in 2010. Here’s the (truncated) cloc output:

-------------------------------------------------------------
Language                      files          blank       code
-------------------------------------------------------------
JavaScript                      147           3282      11673
Markdown                          8            734       3314
[...]
-------------------------------------------------------------
SUM:                            174           4033      15325
-------------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    123129 (100%)
  deletions:     101082 (100%)
  files:         10696 (100%)
  commits:       5024 (100%)

That’s 224,211 lines of code churn to produce 15,325 final lines of code, or a 14:1 ratio. At the time of writing, Express is ~8 years old, on version 4.x, and the most popular and battle-tested web framework for Node.js. It seems that once we’re north of a 10:1 ratio, we can confidently say that a codebase is “mature.” Let’s see what happens if we go back even further in time.

jQuery (2006)

jQuery is a popular open source JavaScript library that came out in 2006. Here’s the (truncated) cloc output:

---------------------------------------------------------------
Language                     files          blank          code
---------------------------------------------------------------
JavaScript                     177          11020         44390
HTML                            50            116          1958
[...]
---------------------------------------------------------------
SUM:                           250          11500         47559
---------------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    398332 (100%)
  deletions:     331814 (100%)
  files:         12143 (100%)
  commits:       6090 (100%)

That’s 730,146 lines of code churn to produce 47,559 final lines of code, or a 15:1 ratio for this ~12 year old repo. Let’s go back another 10 years and see what we find.

MySQL (1995)

MySQL is a popular open source relational database that came out in 1995. Here’s the (truncated) cloc output:

-------------------------------------------------------
Language              files          blank         code
-------------------------------------------------------
C++                    3189         336798      1859803
C/C++ Header           3394         134989       812273
JavaScript             3668          39539       439770
JSON                     89             17       103555
Pascal                  187          13973        92623
CSS                     558           2420        92385
C                       215          15307        90811
Java                    526          12790        54568
Perl                    125           7522        23273
CMake                   264           2965        17860
SQL                      41           1225        17667
[...]
-------------------------------------------------------
SUM:                  12835         577395      3662869
-------------------------------------------------------

And here are the git-quick-stats totals:

total:
  insertions:    35992625 (100%)
  deletions:     22570374 (100%)
  files:         570117 (100%)
  commits:       79845 (100%)

That’s 58,562,999 lines of code churn to produce 3,662,869 final lines of code, or a 16:1 ratio for this ~23 year old repo. Wow! Roughly speaking, every single line of MySQL has been rewritten 16 times.

Conclusion

Here’s a summary of the data we’ve seen for books:

Name                         Churn      Lines    Ratio
------------------------------------------------------
Hello, Startup             295,181     26,571     11:1
Terraform: Up & Running     81,916      8,410     10:1

And here’s a summary of what we’ve seen for programming:

Name                       Released         Churn        Lines    Ratio
-----------------------------------------------------------------------
terraform-aws-couchbase        2018        37,693        7,481      5:1
Terratest                      2016        49,126        6,140      8:1
Terraform                      2014    12,945,966    1,371,718      9:1
Express                        2010       224,211       15,325     14:1
jQuery                         2006       730,146       47,559     15:1
MySQL                          1995    58,562,999    3,662,869     16:1
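The churn and ratio columns can be recomputed from the raw git-quick-stats and cloc totals quoted earlier. The sketch below does that for the six codebases; `PROJECTS` and `summarize` are illustrative names, and the ratios are printed to one decimal place, so the whole-number ratios in the summary are approximations of these values.

```python
# (name, insertions, deletions, final lines of code) as quoted in this post
PROJECTS = [
    ("terraform-aws-couchbase", 24_417, 13_276, 7_481),
    ("Terratest", 29_247, 19_879, 6_140),
    ("Terraform", 7_568_830, 5_377_136, 1_371_718),
    ("Express", 123_129, 101_082, 15_325),
    ("jQuery", 398_332, 331_814, 47_559),
    ("MySQL", 35_992_625, 22_570_374, 3_662_869),
]


def summarize(projects):
    """Return (name, churn, final lines, churn-to-lines ratio) for each project."""
    rows = []
    for name, insertions, deletions, lines in projects:
        churn = insertions + deletions
        rows.append((name, churn, lines, churn / lines))
    return rows


for name, churn, lines, ratio in summarize(PROJECTS):
    print(f"{name:<26}{churn:>12,}{lines:>12,}{ratio:>8.1f}:1")
```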

So what do all these numbers mean?

The 10:1 rule of writing and programming

Given that my data set is limited, I can only draw a few preliminary conclusions:

  1. The ratio of “raw materials” to “finished product” in a book is roughly 10:1. Keep this in mind the next time an editor asks you for a timeline! If you want to write a 300 page book, you’ll probably have to write around 3,000 pages.

  2. Similarly, the ratio of “code churn” to “lines of code” in mature and non-trivial software is also at least 10:1. Keep this in mind the next time a manager or customer asks you for a time estimate! To build a 10,000 line app, expect to write roughly 100,000 lines.

These can be summarized as what I shall dub the 10:1 rule of writing and programming:

Good software and good writing require that every line has been rewritten, on average, at least 10 times.

Next steps

Of course, lines of code and lines of writing are an imperfect measure, but I think given enough data, we may be able to determine if the 10:1 rule actually holds up, and if it’s useful to help improve estimation.

Some questions I’d love to know the answer to:

  • Could we use the ratio of code churn to lines of code as a quick measure of the maturity of a piece of software? For example, for critical pieces of infrastructure, such as databases, programming languages, and operating systems, should we only trust code that has at least a 10:1 ratio?
  • Does the amount of churn depend on the type of software? For example, Bill Scott found that at Netflix, only about 10% of the UI code lasted more than a year, and the other 90% had to be thrown away. What are the rates of churn in backend code, databases, CLI tools, and so on?
  • What percentage of code churn comes after the initial release? That is, what percentage of work can be classified as “software maintenance?”
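The last question could be approached along these lines, assuming you can export per-commit insertions and deletions together with commit dates. This is a sketch only: the function name `post_release_share` is hypothetical, and the sample history is made up purely to show the calculation.

```python
from datetime import date


def post_release_share(commits, release: date) -> float:
    """Fraction of total churn from commits dated on or after `release`."""
    total = sum(ins + dels for _, ins, dels in commits)
    after = sum(ins + dels for day, ins, dels in commits if day >= release)
    return after / total if total else 0.0


# Made-up sample history, NOT real data: (commit date, insertions, deletions)
sample = [
    (date(2018, 1, 10), 500, 0),
    (date(2018, 2, 1), 300, 200),
    (date(2018, 3, 15), 150, 100),
    (date(2018, 6, 1), 400, 350),
]

print(post_release_share(sample, release=date(2018, 3, 1)))  # 0.5
```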

If you have written a book and can do a similar analysis, I’d love to hear what numbers you found! And if anyone has time to automate this analysis, I’d love to see what the ratios are across a variety of open source projects. Share your thoughts in the comments!
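As a starting point for automating this, here is a minimal sketch that tallies insertions and deletions from `git log --numstat`. It is an assumption that this yields totals comparable to what git-quick-stats reports; the two may differ in how they treat merges, authors, or binary files. The helper names `parse_numstat` and `repo_churn` are made up for illustration.

```python
import subprocess


def parse_numstat(text: str) -> tuple[int, int]:
    """Sum the added/removed columns of `git log --numstat` output."""
    insertions = deletions = 0
    for line in text.splitlines():
        parts = line.split("\t", 2)  # added, removed, path
        if len(parts) != 3:
            continue  # blank separator lines between commits
        added, removed, _path = parts
        if not (added.isdigit() and removed.isdigit()):
            continue  # binary files are reported as "-"
        insertions += int(added)
        deletions += int(removed)
    return insertions, deletions


def repo_churn(repo_dir: str) -> tuple[int, int]:
    """Run git in `repo_dir` and return (insertions, deletions) over all history."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--numstat", "--pretty=format:"],
        check=True, capture_output=True, text=True,
    )
    return parse_numstat(result.stdout)


# Illustrative input in the numstat format (file names are made up).
sample = "10\t2\tch01.asciidoc\n-\t-\timages/cover.png\n\n3\t7\tch02.asciidoc\n"
print(parse_numstat(sample))  # (13, 9)
```

Dividing the sum of the two totals by the final line count from cloc gives the churn ratio used throughout this post.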

Update, 08/13/18

This blog post hit the front page of Hacker News and Reddit’s r/programming!

Two quick notes from those discussions:

First, it looks like similar 10:1 rules show up in film, journalism, music, and photography! How cool is that?

Second, a common response is that even a single character change may show up in Git as an “inserted line” or “deleted line”, so when you see 100,000 lines were changed, it doesn’t mean that all the text in those lines was rewritten. This is true, but as I wrote above, there are also many types of changes missing from the data:

  1. I don’t do a commit for every single line that I change. In fact, I may change a line 10 times, and commit only once.
  2. This is actually even more pronounced for code. While doing a code-test cycle, I may change a few lines of code 50 times over, but only do one commit.
  3. For my books, a lot of edit rounds and writing happened outside of Git (e.g., I wrote some of the chapters in Google Docs or Medium and O’Reilly does copyediting in a PDF).

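My guess is that these two effects roughly cancel out: Git overstates some changes by counting a one-character edit as a whole line, and understates others by missing everything that never got committed. It won’t be exact, of course, and the actual ratio may be 8:1 or 12:1, but the order of magnitude is probably correct, and 10:1 is easier to remember.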

Update, 08/14/18

GitHub user Decagon created a repo called hofs-churn that contains a Bash script to easily calculate code churn for your Git repos. He also ran it across a variety of open source repos such as React.js, Vue, Angular, RxJava, and many others, and the results are pretty interesting!