Tuesday, April 24, 2012

OpenStack Nova Internals – pt1 – Overview


For the last 18 months or so I've been working on the OpenStack Nova project as a core developer. Initially, the project was small enough that you could find your way around the code base pretty easily. Other than being PEP8 compliant there weren't a whole lot of idioms you needed to follow to get your submissions accepted. But, as with any project, the deeper you get the more subtle the problems become. Consistently dealing with things like exception handling, concurrency, state management, synchronous vs. asynchronous operations and data partitioning becomes critical. As a core reviewer it gets harder and harder to remember all these rules and as a new contributor it's very intimidating trying to get your first submission accepted.

For that reason I thought I'd take a little detour from my normal blog articles and dive into some of the unwritten programming idioms behind the OpenStack project. But to start I'll quickly go over the OpenStack source code layout and basic architecture.

I'm going to assume you know about Cloud IaaS topics (image management, hypervisors, instance management, networking, etc.), Python (or you're a flexible enough programmer where the languages don't really matter) and Event-driven frameworks (aka Reactor Pattern).

Source Layout

Once you grab the Nova source you'll see it's pretty easy to understand the main code layout.
git clone https://github.com/openstack/nova.git
The actual code for the Nova services is in ./nova and the corresponding unit tests are in the related directory under ./nova/tests. Here's a short explanation of the Nova source directory structure:


├── etc
│   └── nova
├── nova
│   ├── api - the Nova HTTP service
│   │   ├── ec2 - the Amazon EC2 API bindings
│   │   ├── metadata
│   │   └── openstack - the OpenStack API
│   ├── auth - authentication libraries
│   ├── common - shared Nova components
│   ├── compute - the Nova Compute service
│   ├── console - instance console library
│   ├── db - database abstraction
│   │   └── sqlalchemy
│   │         └── migrate_repo
│   │               └── versions - Schema migrations for SqlAlchemy
│   ├── network - the Nova Network service
│   ├── notifier - event notification library
│   ├── openstack - ongoing effort to reuse Nova parts with other OpenStack parts.
│   │   └── common
│   ├── rpc - Remote Procedure Call libraries for Nova Services
│   ├── scheduler - Nova Scheduler service
│   ├── testing
│   │   └── fake - “Fakes” for testing
│   ├── tests - Unit tests. Sub directories should mirror ./nova
│   ├── virt - Hypervisor abstractions
│   ├── vnc - VNC libraries for accessing Windows instances
│   └── volume - the Volume service
├── plugins - hypervisor host plugins. Mostly for XenServer.


Before we get too far into the source, you should have a good understanding of the architectural layout of Nova. OpenStack is a collection of Services. When I say a service I mean it's a process running on a machine. Depending on how large an OpenStack deployment you're shooting for, you may have just one of each service running (perhaps all on a single box) or you can run many of them across many boxes.

The core OpenStack services are: API, Compute, Scheduler and Network. You'll also need the Glance Image Service for guest OS images (which could be backed by the Swift Storage Service). We'll dive into each of these services later, but for now all we need to know is what their jobs are. API is the HTTP interface into Nova. Compute talks to the hypervisor running each host (usually one Compute service per host). Network manages the IP address pool as well as talking to switches, routers, firewalls and related devices. The Scheduler selects the most appropriate Compute node from the available pool for new instances (although it may also be used for picking Volumes).

The database is not a Nova service per se. The database can be accessed directly from any Nova service (although it should not be accessed from the Compute service ... something we're working on cleaning up). Should a Compute node be compromised by a bad guest we don't want it getting to the database.

You may also run a stand-alone Authentication service (like Keystone) or the Volume service for disk management, but it's not mandatory.

OpenStack Nova uses AMQP (specifically RabbitMQ) as the communication bus between the services. AMQP messages are written to special queues and one of the related services picks them off for processing. This is how Nova scales. If you find a single Compute node can't handle the number of requests coming in, you can throw another Compute node service into the mix. Same with the other services.

If AMQP is the only way to communicate with the Services, how do the users issue commands? The answer is the API service. This is an HTTP service (a WSGI application in Python-speak). The API service listens for REST commands on the HTTP service and translates them into AMQP messages for the services. Likewise, responses from the services come in via AMQP and the API service turns them into valid HTTP Responses in the format the user requested. OpenStack currently speaks EC2 (the Amazon API) and OpenStack (a variant of the Rackspace API). We'll get into the gory details of the API service in a later post.

But it's not just API that talks to the services. Services can also talk to each other. Compute may need to talk to Network and Volume to get necessary resources. If we're not careful about how we organize the source code, all of this communication could get a little messy. So, for our inaugural article, let's dive into the Service and RPC mechanism.

Notation

I'll be using the Python unittest notation for modules, methods and functions. Specifically,

nova.compute.api:API.run_instance equates to the run_instance method of the API class in the ./nova/compute/api.py file. Likewise, nova.compute.api.do_something refers to the do_something function in the ./nova/compute/api.py file.

Talking to a Service

With the exception of the API service, each Nova service must have a related Python module to handle RPC command marshalling. For example:

The network service has ./nova/network/api.py
The compute service has ./nova/compute/api.py
The scheduler service has ./nova/scheduler/api.py

... you get the idea. These modules are usually just large collections of functions that make the service do something. But sometimes they contain classes that have methods to do something. It all depends if we need to intercept the service calls sometimes. We'll touch on these use cases later.

The Scheduler service nova.scheduler.api has perhaps the simplest interface, consisting of a handful of functions.

Network is pretty straightforward with a single API class, although it could have been implemented with functions because I don't think there are any derivations yet.

Compute has an interesting class hierarchy for call marshaling, like this:
BaseAPI -> API -> AggregateAPI
and BaseAPI -> HostAPI

But, most importantly, nova.compute.api.API is the main work horse and we'll get into the other derivations another day.

So, if I want to pause a running instance I would import nova.compute.api, instantiate the API class and call the pause() method on it. This will marshal up the parameters and send them to the Compute service that manages that instance by writing them to the correct AMQP queue. Finding the correct AMQP topic for that compute service is done with a quick database lookup for that instance, which happens via the nova.compute.api:BaseAPI._cast_or_call_compute_message method. With other services it may be as simple as importing the related api module and calling a function directly.
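
To make that flow a bit more concrete, here's a toy sketch of the marshalling pattern. This is not Nova's code: the in-memory INSTANCE_HOSTS table, the "compute.<host>" topic format, the message layout and the fake cast() are all assumptions standing in for the real database lookup and the AMQP write.

```python
# Toy stand-in for the marshalling pattern (every name here is hypothetical).

# Pretend database: which compute host manages which instance.
INSTANCE_HOSTS = {"instance-1": "host-a", "instance-2": "host-b"}

# Pretend message bus: topic name -> list of messages written to it.
SENT = {}


def cast(topic, message):
    """Fire-and-forget: write the message to the topic and return nothing."""
    SENT.setdefault(topic, []).append(message)


class API(object):
    def _topic_for(self, instance_id):
        # The "quick database lookup" that finds the right compute service.
        host = INSTANCE_HOSTS[instance_id]
        return "compute.%s" % host

    def pause(self, instance_id):
        # Marshal the method name and arguments into a plain dict.
        message = {"method": "pause_instance",
                   "args": {"instance_id": instance_id}}
        cast(self._topic_for(instance_id), message)


# The caller only sees a normal method call; the routing is hidden.
API().pause("instance-1")
```

The point of the pattern is that callers never build messages or pick queues themselves; that knowledge lives in one api module per service.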

Casts vs. Calls

AMQP is not a proper RPC mechanism, but we can get RPC-like functionality from it relatively easily. In nova.rpc.__init__ there are two calls to handle this: cast() and call(). cast() performs an asynchronous invocation on a service, while call() expects a return value and therefore is a synchronous operation. What call() really does under the hood is dynamically create an ephemeral AMQP topic for the return message from the service. It then waits in an eventlet green-thread until the response is received.
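
If it helps, here's a rough sketch of the difference using nothing but the standard library. It is a simplification under loud assumptions: plain queue.Queue objects stand in for AMQP topics, an OS thread stands in for the eventlet green-thread, and the TOPICS dict, message layout and compute_service() loop are made up for the example.

```python
import queue
import threading
import uuid

# Toy message bus: topic name -> queue. A stand-in for AMQP, not nova.rpc.
TOPICS = {"compute": queue.Queue()}


def cast(topic, message):
    """Asynchronous: enqueue the message and return immediately."""
    TOPICS[topic].put(message)


def call(topic, message, timeout=5):
    """Synchronous: create a throwaway reply topic and block until answered."""
    reply_topic = "reply-%s" % uuid.uuid4().hex
    TOPICS[reply_topic] = queue.Queue()
    try:
        cast(topic, dict(message, reply_to=reply_topic))
        return TOPICS[reply_topic].get(timeout=timeout)
    finally:
        del TOPICS[reply_topic]  # the reply topic is ephemeral


def compute_service():
    """Hypothetical service loop: handle messages until told to stop."""
    while True:
        message = TOPICS["compute"].get()
        if message["method"] == "stop":
            break
        result = "%s done" % message["method"]
        if "reply_to" in message:  # only call() asks for an answer
            TOPICS[message["reply_to"]].put(result)


worker = threading.Thread(target=compute_service)
worker.start()
cast("compute", {"method": "pause"})              # returns right away
answer = call("compute", {"method": "get_info"})  # blocks for the reply
cast("compute", {"method": "stop"})
worker.join()
```

Notice that call() is just cast() plus a reply topic and a wait, which is exactly why it costs more.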

Exceptions can also be sent through this response topic and regenerated/rethrown on the caller's side if the exception is derived from nova.exception:NovaException. Otherwise, a nova.rpc.common:RemoteError is thrown.
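
A minimal sketch of that rethrow rule might look like the following. The classes are local stand-ins with the same names as the Nova ones, and the KNOWN registry, the reply dict layout and the InstanceNotFound subclass are assumptions for illustration, not how nova.rpc actually serializes failures.

```python
class NovaException(Exception):
    """Local stand-in for nova.exception:NovaException."""


class InstanceNotFound(NovaException):
    """Hypothetical subclass used only for this example."""


class RemoteError(Exception):
    """Local stand-in for nova.rpc.common:RemoteError."""


# Exceptions the caller knows how to rebuild, keyed by class name.
KNOWN = {cls.__name__: cls for cls in (NovaException, InstanceNotFound)}


def serialize_failure(exc):
    """Callee side: turn an exception into a plain dict for the reply."""
    return {"failure": type(exc).__name__, "message": str(exc)}


def raise_from_reply(reply):
    """Caller side: rethrow known exceptions, wrap everything else."""
    cls = KNOWN.get(reply["failure"])
    if cls is not None:
        raise cls(reply["message"])
    raise RemoteError("%s: %s" % (reply["failure"], reply["message"]))
```

So a serialized InstanceNotFound comes back out as InstanceNotFound on the caller's side, while something like a KeyError comes back wrapped in RemoteError.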

Ideally, we should only be performing asynchronous cast()'s to services since call()'s are obviously more expensive. Be careful on which one you choose. Try not to be dependent on return values if at all possible. Also, try to make your service functions idempotent if possible since it may get wrapped up in some retry code down the road.

If you're really interested in how the rpc-over-amqp stuff works, look at nova.rpc.impl_kombu.

Fail-Fast Architecture

OpenStack is a fail-fast architecture. What that means is if a request does not succeed it will throw an exception which will likely bubble all the way up to the caller. But, since each OpenStack Nova service is an operation in eventlet, it generally won't destroy any threads or leave the system in a funny state. A new request can come in via AMQP or HTTP and get processed just as easily. Unless we are doing things that require explicit clean-up it's generally ok to not be too paranoid a programmer. If you're expecting a certain value in a dictionary, it's ok to let a KeyError bubble up. You don't always need to derive a unique exception for every error condition ... even in the worst case, the WSGI middleware will convert it into something the client can handle. That's one of the nice things about threadless/event-driven programming. We'll get more into Nova's error handling in later posts.
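
Here's a small sketch of what that looks like in practice. It is a generic WSGI example, not Nova's middleware: the app() handler, the CatchAllMiddleware class and the request() helper are all hypothetical names invented for the illustration.

```python
def app(environ, start_response):
    """Hypothetical handler that trusts its input and does no checking."""
    settings = {"flavor": "m1.small"}
    value = settings[environ["PATH_INFO"].strip("/")]  # may raise KeyError
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [value.encode("utf-8")]


class CatchAllMiddleware(object):
    """Hypothetical middleware: turn any uncaught exception into a 500."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        try:
            return self.application(environ, start_response)
        except Exception as exc:
            body = ("%s: %s" % (type(exc).__name__, exc)).encode("utf-8")
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")])
            return [body]


def request(path):
    """Tiny test client: returns (status, body) for a GET on path."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(CatchAllMiddleware(app)(
        {"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return seen["status"], body
```

The handler never checks whether the key exists. A bad path just raises KeyError, the middleware turns it into a 500, and the next request is handled as if nothing happened.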

Well, that sort of explains about Nova's source code layout and how services talk to each other. Next time we'll dig into service managers and drivers to see how services are implemented on the callee side.

Monday, March 26, 2012

Out-of-Band Communications ...

Over the years, from junior to senior, I've always had difficulty with one particular group of people in office environments ... smokers. Not because of the time they take for their breaks, or the "dirty" habit, or the health risks they're putting themselves at. Hell, I can waste time and put myself at far greater risk in a heartbeat. As a non-smoker I just noted that people disappeared for a half an hour several times a day. Huh?! I remember they'd all come back in together, laughing out loud about some silly thing that happened during the break and I sort of envied them.

And for a time it seemed to be a largely positive thing. Whenever there was a problem in the office or with a customer or with a piece of code, it was usually brought to light by the smokers. The smokers invented the Standup Meeting. In fact, they were having two or more standup meetings a day more than the rest of us (and that was before the standup meeting was ever a "thing").

But after a while, not all the decisions the smokers were making were being shared with the rest of the group. They would return to their desks and press on with their newly minted insights. I know many of us non-smokers would put on our jackets and suffer the Canadian Winter and the smoke just so we could be privy to the conversations. We had to. If we wanted to have a say in the decision making process we had to go outside.

The reality was, the smokers were directing the company.

And there weren't that many of them either. The ratio of smokers to non-smokers was perhaps 10-20% at most. But they did cover a wide swath of the departments. Sales, marketing, development, QA, senior management, HR ... the smokers usually had a great sample of the entire company. They knew everyone. They would wave and say "Hi" to many of the new people far before the rest of us could. We had to depend on company mixers or "team building events" that might happen once or twice a year.

What were they doing right? They were talking, cross-discipline, about work ... frequently.

Think about that. They pushed themselves away from their desks, in a concerted manner, to go and talk with each other. No one held a gun to their heads. It wasn't a reminder on their calendars. All praise nicotine addiction!

What were they doing wrong? Their conversations were Out-of-Band.

Out-of-Band Communications can be a big problem with any large team. Agile is all about communications and sharing, but after a while the Stand-up Meetings become drudgery, people stop reading the wiki pages and the reports. Peer-programming duos become entrenched, emails get auto-filtered ... and in-person conversations take over. Why? Because anything impersonal is dull as dishwater. 

(Yes, stand up meetings are impersonal. Sit down Mr. Agile Consultant)


And please don't think I'm picking on the smokers. They were just how I learned of this phenomenon. It's the people that work together in the same department shutting out other departments. It's geographically separated divisions setting their own vision. It's the echo chamber. Replace "smokers" with "coffee fiends" or "gym rats" or "let's go for lunch"-ers ... same result.

Informal groups form and decisions get made. Political lines get drawn and consensus established.

And who can blame them?! Talking to a small group in person is way more satisfying than trying to garner the attention of a faceless sea via electronic tools. Phone calls are always inconvenient for someone.

When your agile project is nothing more than a stack of story codes you've lost the meaning of agile.

This is a tough nut to crack. Are more stand ups the answer? No. But managers need to be aware when these out-of-band communications start to take over and find ways to keep them in the open. Developers need to be aware of the posses they belong to and force themselves to keep them open.

We need more craving-based meetings :) We all need to make a concerted effort to talk to the people in our team/division/company more and when we have an insight, we have to share it. Tiny groups are very powerful things, don't let them become fiefdoms.

Be aware of your out-of-band communications.

To be a better developer, I know this is something I need to work on.

Tuesday, November 08, 2011

Python Halifax!

A one-day Python gathering here in Halifax? Say it ain't so?!

It's true ... register now!

http://pythonhalifax.org/

Tuesday, August 30, 2011

The Pain of Unit Tests and Dynamically Typed Languages


Don't get me wrong. I love me some Python and I love me some unit tests. The problem is, I find strict unit tests in dynamically typed languages aren't nearly as useful as they are in statically typed languages.

I can hopefully better illustrate this with an example. Let's say we have some code like below. Note: Don't worry about what's actually happening (I completely made it up), but just be aware that decide() calls on compute() and both have conditionals/exceptions/loops, etc which make them good candidates for unit testing … in that, they could easily bust our application if someone starts messing with either of them.

Also note that decide() is dependent on a bunch of other functions: get_seed(), get_first()/second()/third() … which we will assume are equally unit tested.


So, we write our unit tests and they're really good. They mess with the fence post cases of compute(). They make get_whatever() exception out to test the retry code. All of this is done with Mocks and Fakes so we're not dependent on the other code. These are unit tests after all; we shouldn't be calling external functions/methods. This means that in our decide() unit tests, we've mocked out compute(). We green bar our test suites and all is good.


Until someone changes compute() to something like this:


Now it's some weird function that takes a boolean (or something that can be coerced into a boolean) and returns a string. Fortunately the author fixed the unit tests for compute() and everything green bars again.

But the application no longer works. decide() is broken … and we won't discover it until decide() gets called.

Why? Because with most dynamically typed languages, the contracts between functions or methods are decided at runtime. When compute() is called by decide() the virtual machine will try to coerce the values for first and second as best as it can. Likewise, when the response from compute() comes back to decide() the virtual machine will try to apply the +100 to it the best way it can. Maybe it will work, maybe it won't. Almost certainly it won't be what the developers intended.
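
Here's a rough Python sketch of that failure mode. The function bodies are invented for illustration: I'm assuming decide() was written against an old compute(first, second) that returned a number, and that compute() has since been changed to take a single boolean-ish value and return a string.

```python
from unittest import mock


def get_first():
    return 3


def get_second():
    return 4


def compute(flag):
    """The *changed* compute(): takes a boolean-ish value, returns a string."""
    return "yes" if flag else "no"


def decide():
    """Still written against the old compute(first, second) -> number."""
    return compute(get_first(), get_second()) + 100


def strict_unit_test():
    """Call-out depth zero: compute() is mocked, so the stale contract hides."""
    fake = mock.Mock(return_value=5)
    with mock.patch.dict(globals(), {"compute": fake}):
        assert decide() == 105
        fake.assert_called_once_with(3, 4)
    return "green"


def run_for_real():
    """What production sees: the contract only fails at runtime."""
    try:
        decide()
    except TypeError as exc:
        return "broken: %s" % type(exc).__name__
    return "works"
```

The mocked test happily accepts two arguments and hands back a number, so it stays green. In Python this particular mismatch blows up with a TypeError the moment decide() runs for real; other languages might quietly coerce the values instead, which is arguably worse.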

If we are using a statically typed language such as C/C++/Java/C# things are a little easier. Our function would probably look something like


So, when the signature of compute() changes, the compiler can catch it and complain long before the unit tests are ever run.

What does this mean for us Python/Ruby/PHP developers? It means we have to start moving towards integration tests in addition to our unit tests. Unit tests alone are not enough to let us sleep at night.

As I've said before, integration tests, while great, are a royal pain in the buttocks. They are fragile and difficult to maintain. Do we need to go full-on end-to-end integration testing? No. A normal unit test has a call-out depth of zero. We only test the function in question. But, what we can do is start to write some 1-depth unit tests (or baby integration tests, if you prefer). A 1-depth unit test would allow decide() to call compute() (and get_first/second/third/seed, etc) but no further. All calls beyond the 1-function call depth would be stubbed out as normal.
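
As a sketch of what a 1-depth test could look like: decide() is allowed to really call compute() and the get_*() helpers, and only the next level down is stubbed. All of the function bodies and the read_source() dependency are hypothetical, made up to give the test something to stub.

```python
from unittest import mock


def read_source(name):
    """Depth-2 dependency (hypothetical): imagine a slow external lookup."""
    raise RuntimeError("not available in tests")


def get_first():
    return read_source("first")


def get_second():
    return read_source("second")


def compute(first, second):
    return first * second


def decide():
    return compute(get_first(), get_second()) + 100


def one_depth_test():
    """decide() really calls compute() and get_*(); only depth 2 is stubbed."""
    values = {"first": 3, "second": 4}
    stub = mock.Mock(side_effect=lambda name: values[name])
    with mock.patch.dict(globals(), {"read_source": stub}):
        result = decide()
    assert result == 112
    assert stub.call_count == 2
    return result
```

Because compute() is the real thing here, changing its parameters or return type breaks this test right away, which is the whole point of going one level deep.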

How do we decide where to place our 1-depth unit tests? I find that file or module boundaries are good places; places where it's not really easy to scan with your eyes. I will also skip third party libraries. I'm really just worried about the code I can immediately control. Or perhaps you might want to write them between highly dependent and equally complex functions?


Do you need to include every external call? Probably not. Make it a judgment call. Is the return type from a call sufficiently complex that it's likely to change? That's a great place to push your call depth down a level. Does one of your function's parameters have a dictionary of assorted types? If so, that sounds suitably fragile. Use common sense, there's no one-size-fits-all rule to this.

But, I would strongly encourage you to place these tests in a separate directory away from your strict unit tests. Developers should know what they're getting into when they open those files.

I look forward to hearing your thoughts on this.

Thursday, June 30, 2011

Effective Unit Tests and Integration Tests


A unit is the smallest testable part of an application.

More often than not, I see unit tests that forget this core tenet. Picture a method or function as a sandbox. When you are writing a unit test, never let a call get outside of your sandbox.


Your unit test should be testing the conditional logic and correctness of the code in your sandbox. Let another test worry about all that external/dependent code. When writing a suite of unit tests we're interested in code coverage. We want to exercise as much of our code base as possible. That's a lot of tests. We need to make sure these tests are not fragile, difficult to write or hard to maintain. And we need to make sure they're quick to execute.

The code in your sandbox really only has one or two purposes:
  • To execute some external operation, and/or
  • To compute some result
Your unit test should be testing all possible code paths through your sandbox. What can cause your code path to change?
Each of these constructs can make one pass through your code different from the next one. That's the stuff your unit tests should be exercising.

Computing a result includes:
  • Checking return values
  • Checking thrown exceptions
  • Checking set member variables and/or globals
Your unit test should be throwing all kinds of input at your code to ensure the return values are correct. Also that your code throws the appropriate exceptions on bad input. If your code doesn't return any values you may need to check that any global or member variables were correctly set.

How to stay in your sandbox
Nearly all programming languages have some sort of Fake, Mock or Stub library available for it. The purpose of these libraries is to short-circuit the external code and replace it with your test code. Something that:
  1. Always returns an expected value
  2. Has no state
  3. Has no external state dependencies (i.e. a database initialized to known values)
  4. Returns nearly instantly
  5. Has no conditional logic in it
In "XUnit Test Patterns", Gerard Meszaros has a nice breakdown of Fakes, Mocks, Stubs and Dummies:
  • Dummy objects are passed around but never actually used. Usually they are just used to fill parameter lists.
  • Fake objects actually have working implementations, but usually take some shortcut which makes them not suitable for production (an in memory database is a good example).
  • Stubs provide canned answers to calls made during the test, usually not responding at all to anything outside what's programmed in for the test. Stubs may also record information about calls, such as an email gateway stub that remembers the messages it 'sent', or maybe only how many messages it 'sent'.
  • Mocks are objects pre-programmed with expectations which form a specification of the calls they are expected to receive.
I try to only use Dummies and Mock objects for unit tests. I don't like storing state in my test doubles ... after a while you spend so much time making an effective stub you don't know if you're testing the stub or the code in the sandbox. That's why I like mock objects. They usually just consist of "return value" or "raise exception" ... simple, maintainable and readable.

If you find yourself putting an If statement in a Mock, you're doing something wrong.

How can you verify an If-Then statement?
Most of the time you just want to be sure that an external method or function was actually called. Do you need to actually call that method? Hell no. Not in a unit test. Many of these Mock libraries also have provisions for "Was Called" semantics. Essentially a flag is set if a Mock method was called. If your library doesn't have one, it's pretty darn simple to set one up:
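
A hand-rolled version might look something like this sketch. The WasCalled class, notify() and send_email() are hypothetical names for the example; in Python, unittest.mock's Mock objects already give you a called attribute that does the same job.

```python
class WasCalled(object):
    """Minimal hand-rolled mock: returns a canned value and sets a flag."""

    def __init__(self, return_value=None):
        self.return_value = return_value
        self.called = False

    def __call__(self, *args, **kwargs):
        self.called = True
        return self.return_value


def send_email(address):
    """External operation we never want a unit test to really perform."""
    raise RuntimeError("no network in unit tests")


def notify(user, send=send_email):
    """Code in the sandbox (hypothetical): only mails users who opted in."""
    if user.get("wants_email"):
        send(user["address"])


# One mock per branch of the If, checked afterwards via the flag.
opted_in = WasCalled()
notify({"wants_email": True, "address": "a@example.com"}, send=opted_in)

opted_out = WasCalled()
notify({"wants_email": False, "address": "b@example.com"}, send=opted_out)
```

After this runs, opted_in.called is True and opted_out.called is False, so both branches are verified without ever touching the real send_email(). Note there's no conditional logic inside the mock itself.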

So, rather than let your code call outside of your sandbox, replace the external method (via monkey-patching, dependency injection or some other mechanism) to use your mock instead.

Does all code need a Unit Test?
No. Consider the following gist:


What would a unit test accomplish here? Nothing. Just make sure you have tests around each of the methods within foo().

When Unit Tests aren't sufficient
Let's say we did write a unit test for foo() in the above example. Would it accomplish anything? No, because we aren't checking the semantic ordering of the calls within foo(). Putting logic in to ensure that do_this() was called before do_that() and do_something_else() would just be a waste of time.

What if foo() was changed to this:


Our unit test would still pass just fine. But it would be wrong. What we need here is an Integration Test. Integration tests do check the semantic ordering of statements. With an integration test we allow ourselves to step outside of our sandbox. We may even use test doubles to help us build our integration tests (including Fakes and Stubs).
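
To illustrate, here's a sketch under the assumption that the change was simply a reordering of the same three calls. The bodies of do_this(), do_that() and do_something_else() are placeholders, and was_called_test() is a hypothetical strict unit test that only checks each dependency was invoked.

```python
from unittest import mock


def do_this():
    raise RuntimeError("external code; never run in a unit test")


def do_that():
    raise RuntimeError("external code; never run in a unit test")


def do_something_else():
    raise RuntimeError("external code; never run in a unit test")


def foo():
    do_this()
    do_that()
    do_something_else()


def foo_reordered():
    do_that()              # assumed change: same calls, different order
    do_something_else()
    do_this()


def was_called_test(func):
    """Strict unit test: only checks that each dependency was called."""
    names = ("do_this", "do_that", "do_something_else")
    fakes = {name: mock.Mock() for name in names}
    with mock.patch.dict(globals(), fakes):
        func()
    return all(fake.called for fake in fakes.values())
```

Both was_called_test(foo) and was_called_test(foo_reordered) come back True. The mocks have no idea that the order matters, and teaching them would mean baking the implementation into the test.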

But now our mandate for these tests has changed: No longer are we concerned with code coverage, but instead we're interested in testing specific usage scenarios that are critical to our customer's acceptance of the software. Other than manual testing, integration testing is the only way to ensure we're building software that will work for the customer.

We don't need 100% code coverage, instead we want the 80% of the everyday scenarios the user will be experiencing when they use our software. If we were making a word processor, do we need integration tests for mail merge, form builder and the math editor? Probably not. Sure, it would be nice to have, but it's not critical. What we absolutely need are integration tests for: launch, enter some text, setting basic styles like bold, italics, etc, printing, saving and loading, page flow, etc.

Do we need to do integration testing for every edge case where an error might occur? No (again, it would be nice to have) ... but realistically we need to test: disk full, disk failed, out of memory, out of paper, etc. The most common scenarios.


But, if integration testing is so good and so important, why not just do integration testing?

Because integration testing is hard, slow, brittle, hard to read and hard to maintain. Everything unit tests aren't. In other words ... they're a pain in the ass. Integration testing is sometimes frowned upon because of these limitations, but they're being compared to unit tests. Integration tests and unit tests are very different animals. They serve two different purposes and give two different levels of comfort to the developer. I don't even think they're comparable at all. The worst mistake a development team can make (regarding testing) is to mix and match their integration tests with their unit tests. You get the worst of both worlds. Keep them clearly separate.

Also, treat your integration tests like your core code base. Unit tests can be hacky since they're so small. But integration tests are complicated and need to be carefully maintained. They need to be documented properly. They need fantastic logging capabilities with rich output. They need to follow the same coding styles as your core code base. They need to be structured, refactored and updated so that they're always easily readable. Unit tests are rarely refactored, since they're so small and atomic, there's usually nothing to change.

So, to grow an effective body of Unit and Integration tests for your application, remember these rules:

  • Don't step outside your sandbox for unit tests
  • Use Mocks and Dummies for your unit tests
  • Check all branching logic in your unit tests
  • Go for very high code coverage for your unit tests
  • Integration tests are difficult beasts. Go for high-impact user stories in your integration testing. Mostly Happy Day scenarios.
  • Don't spend a lot of time on the edge-case failure conditions in your integration tests
  • Keep your integration test code clean and maintainable. Refactor as frequently as your core code base. 
  • Set up a dedicated integration test server that runs on every commit to trunk (they're slow and hard to set up remember)

Wednesday, June 22, 2011

Technical Rabbit Holes …

Last time we had a good discussion about iterations and how there's a potential for dead-air and confusion near the end of a time-box. Tricky problem with lots of nuances. But there's another place where we can get fooled by the perceived safety of time boxes and iterations. Complex code.

Most interesting software is complex. You're not really treading the high-wire when you're cookie cutting variations on the same app over and over again. The fun stuff comes in when you're solving new or hard problems.
The issue with difficult problems is it's easy to go down technical rabbit holes. To me, technical rabbit holes are when there's a gap between what the developer is working on and the understanding of it by the customer. It's not that the developer is doing something dishonest and obscuring their work. It's not that the customer is unable to understand the problem. The problem is trust. There's too much of it. The developer has just enough rope to potentially hang his part of the project and the customer is willing to let it happen.

And, I know I like to pick on Scrum, but this is not Agile's fault. Agile does a great job of forcing small units of work to be produced and highlights customer interaction and feedback. Agile solved the old problem of “Going Dark” as Steve McConnell correctly describes it: The customer anxiously waiting at the end of a pipe for a product to drop out.

So, then, how can a technical rabbit hole occur?

Well, we know what a good project should look like. We may go off course a little from iteration to iteration, but the customer is there to keep us on track and reach the goal.
When there is no clearly accessible customer or a poor customer proxy we know it's easy to go off course for real. We keep hitting our iterations, we keep delivering code, but it's not the right code (it's not solving the customer's problem).
Dev shops often forget that there are two aspects of writing software: Research and Development. Too often these functions are rolled together into just “Development” and sadly the total becomes less than the sum of the parts. The customer needs to understand whether the developer is doing R or D.
In the above illustration, the Y axis is the customer's understanding of what the developer is doing. Green = “I see what's happening.” Red = “I don't really understand, but you say it's important.” You can see that for nearly two iterations, the customer wasn't really comfortable with what the developer was doing. Two iterations. Probably in the best case, that's four weeks of time. That's a long time.

Can you see how this could go bad?
How many iterations do you have to go before the customer finally says “Stop!”?

And, again, it's not that the developer is doing something bad. He's not being tricky or deceitful. He's most likely really chasing down some complex code. The problem is he's doing it in a manner that abandons the customer. Agile development means developing code together with the customer. It's not Us vs. Them. The customer put too much trust in the developer.

Many times these things work out. But there are times when they don't, when the rabbit hole goes too deep and the developer gets lost. Now we've just spent a lot of time and money and we aren't any closer to the goal for it.

What can we do to identify and prevent technical rabbit holes?

Get a Second Opinion
As with any possible fatal disease, you should get a second opinion. Bring in another developer or two to bounce your approach off. Go deep. Talk about all the nitty gritty issues you foresee. Make them own the problem as much as you do. Ask questions, do they understand your vision? Do they understand your development plan? Do they have any concerns about your approach?

It's these people that are going to help you stay in the light. Perhaps they can articulate your plan to the customer better than you can? Perhaps they can see other ways to deliver on your vision without requiring the customer to put blind faith in you?

Never swim alone in deep water.

Incremental Completeness
Technical rabbit holes can easily occur in any project with some degree of technical complexity. What we should be striving for is Incremental Completeness. Delivering small chunks of working code (with tests) to trunk that not only help the developer with the R part of Research and Development, but also show the customer that we understand where the destination is.

Every iteration, try to put something in the code drop that the customer can see the results of. Perhaps even a simple command line tool the user can run on the release to play with the feature in some fashion (the customer is getting a working build every iteration, right?)

Which gets us to our next point ...

Educate the Customer

Don't assume the customer knows anything about what you're building. They may be able to spout off all kinds of technical mumbo jumbo, but is it code? If they don't understand the piece you're building, how can they expect to confirm they have it at the end of the iteration?

Use analogies for the data structures you're building. Try to put it in terms they'll understand. Draw pictures. Share the whiteboard. Explain the complexity.

You may not get it at the first sitting and don't try to force it in a single go. You're telling them a story. Put it in their terms. You're bringing them on a journey, take your time. Just make sure they're not just nodding mindlessly at you.

You're not trying to turn them into computer scientists, but you want to get to the point where the customer can call YAGNI. You want them to tighten the requirements to the point where they can say “we're not going to hit that condition frequently enough … so just print an error message.”

Always Commit
This probably goes without saying, assuming you're using a modern day revision control system, but you have to commit and push your code. Do it frequently. Ideally, no later than every 3 days. Get others to look at your branch. Keep it in the light. Let other developers see what you've been up to. Tell them the same story you told the customer … what do they think? Do they agree that your description is a fair explanation of what you're actually doing? Do they think you're on course?

Don't lose sight of trunk

Merge. Merge. Merge. Pull from trunk frequently. Keep those tests green. If you're doing something that is going to break trunk, use a scaffolding technique: build around the existing mechanism and provide a means to turn the new code on and off. Later, once your implementation is complete, you can remove the dead code. But, until then, you still have working code. Each iteration you should be pushing something working (with tests) back into trunk.
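To make the scaffolding idea concrete, here's a minimal Python sketch. Everything in it is made up for illustration: the two schedulers, the `USE_NEW_SCHEDULER` environment variable and the `schedule` entry point are assumptions, not code from any real project. The point is only that the old mechanism stays the default while the new one lives behind a switch.

```python
import os


def legacy_scheduler(jobs):
    """The existing mechanism: run jobs in the order they arrived."""
    return list(jobs)


def new_scheduler(jobs):
    """The work in progress: run the shortest job names first."""
    return sorted(jobs, key=len)


def use_new_scheduler(env=os.environ):
    """Hypothetical toggle: off unless USE_NEW_SCHEDULER=1 is set."""
    return env.get("USE_NEW_SCHEDULER", "0") == "1"


def schedule(jobs, env=os.environ):
    """Single entry point, so callers never care which path ran."""
    if use_new_scheduler(env):
        return new_scheduler(jobs)
    return legacy_scheduler(jobs)
```

With the toggle off, trunk behaves exactly as before, so the existing tests stay green. Tests for the new path flip the toggle explicitly, and once the new code has proven itself the legacy function and the switch can be deleted in one small commit.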

Letting trunk get away from you is a sure way to get lost in a technical rabbit hole. If you're adding new code when your tests don't run, you're in trouble. Do you remember what you changed that broke the tests? Are you getting closer or further away from them working again?

As a developer, it's probably your greatest risk.

Mr. Customer: Share the vision

This one is for the customer ... it's ok to sound like a broken record. Keep talking about what the final product will look like. Give lots of detail. Make sure everyone knows where they are going. Make sure everyone knows what's important and what's not. Help the development team understand your business, who your customers are, what they want to experience from this software.

If you keep everyone focused on the goal (and the deadline), it's harder to wander off course. You want the developer to say “You know, I'd like to do this differently, but there won't be time right now. Let me find an intermediate step.”

TL;DR

Technical Rabbit-Holes can be prevented with Eyeballs and Education. Show people what you're doing and explain how you're doing it. Don't let too much trust get the better of you.

Good luck … and keep your head above ground!

Thursday, April 28, 2011

Iterations and Time-boxing are (Mostly) Useless

Sorry, this is a rather long post, but I've been thinking a lot about Scrum and iterations lately and haven't had time to blast this out in smaller chunks.

If you know me, you know I'm not a big fan of Scrum. I don't like the hand-waving concept of Velocity and I don't like the fact that it places a priority on process over code. XP, I feel, is much more important for software development. Agile (aka customer responsiveness) can be achieved without the dogma. But that's not the point of this post. In this post I want to question the value of timeboxing and look at the benefits that modern revision control systems grant developers. So, stick with it; I spend some time setting it up before getting into the meat of the post.

Back in the 90's I was using Spiral Development Model on a project. Spiral is Waterfall with shorter, fixed-length, time spans between milestones. In XP, this idea was formalized as Iterations by dropping the Design phase and using automated testing to replace the long, drawn-out, testing phase.

The rationale behind iterations and Spiral was simple: Estimating is hard. The further out you look, the greater your error rate.
Spiral's insight was to reduce risk by keeping the milestones close together (and ensuring that each milestone is a shippable product).

Scrum expanded on this idea and permitted the functionality of the project to change from iteration to iteration. It does this by bringing the customer back into the picture each iteration for sign off and review of “what's most important now”. All good stuff.
At the start of the project, the customer has a general idea of what they want from the development effort. Working with the team, these requirements manifest themselves as Stories (aka Requirements ... friggin' Scrum wants to rename everything).
Now the developers enter the picture and give some guesses about how difficult each story is. This is usually done as T-Shirt Sizes (Small, Medium, Large) or some other metric. In a perfect world, these sizings are not equated to Time ... they are relative efforts and no planning occurs more than one iteration out. But in reality, the customer needs to have some idea of when to expect their product. It could be as coarse a resolution as 1st Half of Next Year or 4th Calendar Quarter, etc.

So ... we're back to using Time again. T-Shirt sizes get equated to some unit of time (let's say, Small = 1 day, Medium = 2 days, Large = 3 days). Iteration sizes are set into something manageable such as 2-3 weeks each.

Finally, given the number of developers it's a pretty easy exercise to give a Wild Assed Guess (WAG) as to when the product will ship. If everyone is happy, the dance can begin with the first iteration.
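Here's roughly what that arithmetic looks like as a Python sketch. The size-to-days mapping is the one from the example above; the backlog, the team size and the 10 working days per iteration are numbers I've made up for illustration, and the function name `wag_iterations` is hypothetical.

```python
import math

# T-Shirt size to developer-days, as in the example above.
DAYS_PER_SIZE = {"S": 1, "M": 2, "L": 3}


def wag_iterations(story_sizes, developers, iteration_days):
    """Return a rough count of iterations needed to clear the backlog."""
    total_days = sum(DAYS_PER_SIZE[size] for size in story_sizes)
    capacity_per_iteration = developers * iteration_days
    return math.ceil(total_days / capacity_per_iteration)


# Hypothetical backlog: 10 small, 8 medium and 6 large stories (44 days).
backlog = ["S"] * 10 + ["M"] * 8 + ["L"] * 6
iterations = wag_iterations(backlog, developers=2, iteration_days=10)
```

Two developers give 20 developer-days per iteration, so 44 days of stories rounds up to 3 iterations. It assumes every story parallelizes perfectly and nothing changes, which is exactly why it's only a WAG.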
When the developers get into the first iteration they start coding on the most important stories (as selected by the customer). Things are wonderful. But, when they get into the second iteration some things may have changed. The priority of the stories may have changed, the first iteration dev effort may have uncovered new stories or perhaps, there is already technical debt that needs to be addressed. Regardless, the original WAG deadline is likely changing (for better or worse).

The reality is that time-boxing the iteration introduces waste near the deadline. Let's say I'm 2 weeks into a 3 week iteration and I've finished all my work early, what should I do? Should I start working on another story from the next iteration (which includes bug fixes)? Should I perform some other non-development task such as manual testing or documentation? Should I refactor?

Surprisingly, the Source Revision Control System (RCS) that you are using can have a significant impact on your decision. Personally, I think Distributed Revision Control Systems (DRCS) such as git, bazaar or mercurial are the most significant change in software engineering within the last 10 years. Not because having a non-central repository is so revolutionary, but because branching and merging have become such low-cost operations that the speed of software development has increased dramatically. Development shops that utilize these tools can see some big improvements in how they deal with the “Rough Edge” at the end of the sprint.

Let's compare dev shops that use an RCS that supports fast/low-cost branching and merging to those that don't.

As I mentioned above, having rigid deadlines means potentially having dead time on your hands.
You either have to find small, low-risk stories that you can take on within the remaining time or do busy-work to fill the gaps. If your customer is more willing to let the deadline slip to get the functionality they desire, things aren't much better. You're simply moving the dead time out a little further.
But if you're using an RCS that has low-cost branches and good merge capabilities, things get much easier because the developers can keep working on upcoming stories without affecting the current iteration.
“But I can branch and merge with svn or …” I hear you say. Yes, you can, but not quickly. When keeping in sync with trunk is expensive, it doesn't happen. When you don't sync with trunk, the cost of your merges increases: merges are larger and your unit test maintenance efforts increase. When branching and merging costs are low, it's easy for developers to start a new branch and keep working without upsetting trunk or the deliverable for the current iteration. Look at the NVIE branching model as your reference standard.

So, the million dollar question ... if I can keep working and still ship my promised set of stories as they become available, why time box in the first place?

Pull models such as Kanban don't really rely on timeboxing. Instead, developers pull stories from the backlog when they need to. The customer can re-prioritize the backlog as they desire. Finally, when a feature is completed it's merged into trunk and becomes part of the deliverable. The customer can decide which version of trunk they want to use and doesn't need to wait for an N week deadline to pass before grabbing the product. Developers are encouraged to keep the Work In Progress (WIP) to a minimum to prevent having lots of half-finished efforts.
We, as developers, can still give estimates on each feature and we can still do our burn up/down charts to track progress, but we don't need to actually block out time frames. Actually what we're doing is taking the idea of iterations to the logical extreme: each story is an iteration.
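If it helps to see the pull model as code, here's a toy Python sketch. The `KanbanBoard` class and its method names are mine, invented for illustration; real Kanban tools track far more than this. It shows the three moving parts described above: a backlog the customer can reorder, a WIP limit, and finished stories landing in trunk one at a time.

```python
from collections import deque


class KanbanBoard:
    """Toy pull-model board: backlog -> in progress -> trunk."""

    def __init__(self, backlog, wip_limit):
        self.backlog = deque(backlog)  # highest priority first
        self.in_progress = []
        self.trunk = []                # stories merged and shippable
        self.wip_limit = wip_limit

    def reprioritize(self, ordered_stories):
        """The customer may reorder the backlog at any time."""
        self.backlog = deque(ordered_stories)

    def pull(self):
        """A developer pulls the next story, unless WIP is at its limit."""
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        story = self.backlog.popleft()
        self.in_progress.append(story)
        return story

    def finish(self, story):
        """A completed story merges to trunk; no iteration boundary needed."""
        self.in_progress.remove(story)
        self.trunk.append(story)
```

Note there's no clock anywhere in it. A story ships when `finish` is called, and the only thing that stops a developer from pulling more work is the WIP limit.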

I can think of only one situation where time boxing is useful and that's in Scrum of Scrums. When the senior managers/customers have to monitor more than one team of developers, they need to be able to snapshot the state of progress across teams. The ideal situation, however, would be a Scrum of Kanbans:
Senior stakeholders get a top-down view of the state of the development process without slowing down the developers every 2-3 weeks. For them, this is a simple synchronization mechanism for keeping the cats herded.

This gets us back to one last point on estimating. As I've said before, estimating sucks. Developers hate having to break down tasks to such a fine resolution that they can be tracked in 4-8 hr intervals. If I have to think about a problem to 4hr resolution I may as well just code it. Why not 1 hour? Why not task out to every 30 minutes? Simple, it's a case of diminishing returns. The time it takes to document the tasks outweighs the effort itself.

I think a better approach is not to track tasks at high resolution. Instead, keep things in the 1-3 day range with a clear deliverable and simply track the sentiment of the developer towards the story. On the first day of the iteration, the developer should be very positive that they can deliver this story in this iteration (otherwise, why did they sign up for it?) ... but as the iteration progresses their sentiment may change. And it's going to go north or south very quickly. “Oh, that's not good”, “This is getting bad.”, “I'm screwed”.
This is what we should be getting from our Standup Meetings. When we report status and obvious impediments to our peers we should also be giving a vibe on our ability to deliver.

This gets even better for the Senior Managers: not only can they see their Scrum of Kanban stories getting executed, but they can boil up the sentiment from each of the teams to see how the iteration is going.
What do you think? Would your daily development process be better if you didn't have to break down tasks to super-fine resolution? As a manager, could you better estimate ability-to-deliver based on higher-level sentiment vs. tasks completed?
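For what it's worth, boiling up sentiment doesn't need much machinery. Here's a small Python sketch under some loud assumptions: the three sentiment labels, their scores and the team and story names are all invented for illustration, and averaging is just one crude way to summarize a team's vibe.

```python
# Hypothetical three-point scale for a developer's vibe on a story.
SCORES = {"positive": 1, "worried": 0, "screwed": -1}


def team_sentiment(reports):
    """Average the per-story sentiment reported at standup.

    reports maps a story name to one of the labels in SCORES.
    """
    return sum(SCORES[label] for label in reports.values()) / len(reports)


def roll_up(teams):
    """Boil each team's standup reports up to a single number."""
    return {name: team_sentiment(reports) for name, reports in teams.items()}


standups = {
    "api": {"story-1": "positive", "story-2": "positive"},
    "compute": {"story-3": "worried", "story-4": "screwed"},
}
summary = roll_up(standups)  # {"api": 1.0, "compute": -0.5}
```

A manager looking at that summary knows which team to go talk to without anyone having tracked a single 4 hr task.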

UPDATE: Glen Campbell was nice enough to give his perspective on this post. Jump over there to read his post and the comments to see my response.