Monday, September 16, 2013

Notification Usage in OpenStack – Report Card

Notifications are the key to creating a comprehensive monitoring and billing solution for OpenStack. Notifications are messages placed on the OpenStack message bus (AMQP, ZeroMQ, etc) whenever an important event occurs.

The original spec for the notification system is available here:

During new instance creation, Nova emits a couple of dozen events which can help operators maintain quality of service, billing accuracy and error detection … at the very least. Without notifications, downstream systems would have to parse log files or access the production database directly.

Edit: log files are also a very reliable mechanism for billing purposes, but for lower-volume applications they do involve a more complex deployment than using a queuing system.

Notifications are atomic messages, meaning they have everything they need to give the consumer information about the event that occurred. They are not transactional in nature. However, there are ways to tie related events together through a common Request-ID, which is assigned as soon as a request hits the service API. And, since notifications all go through the queuing system, they're resilient to high-load situations.
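To make this concrete, here's a hedged sketch of what a notification message might look like. The field names are modeled on Nova's compute notifications and are illustrative only — exact fields vary by service and release:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical illustration of a notification envelope; the request_id in
# the payload is what lets consumers tie related events together.
request_id = "req-" + str(uuid.uuid4())

notification = {
    "message_id": str(uuid.uuid4()),
    "event_type": "compute.instance.create.start",
    "publisher_id": "compute.host1",
    "priority": "INFO",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {
        # The payload is a nested set of key-value pairs, defined per service.
        "instance_id": str(uuid.uuid4()),
        "tenant_id": "5c1980c0",
        "request_id": request_id,
    },
}

# Notifications are atomic: everything a consumer needs is in the message.
print(json.dumps(notification, indent=2))
```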

Adoption of notifications is very easy if the service utilizes the Oslo common library. If you don't use the Oslo library you run the risk of generating non-compliant messages. One project recently ran into this problem when they were generating non-dict payloads on exceptions.

The message payload is left to each service to define and it's just a nested set of key-value pairs. The messages from Nova are the best defined and there are some efforts to provide optional standards-based notification formats.

Other projects have their own payload formats:

Projects like Ceilometer and StackTach consume these events and produce very valuable metering, monitoring and billing information.

Which Notifications Should Services Emit?

An OpenStack service should emit an event for the following conditions:
  1. For each CrUD operation. Anything that mutates the state of the system.
  2. Retrieval operations, the R in CRUD, are optional and likely too chatty, but if the retrieval is something the client is billed for, an event should be emitted. This is usually done as an end-of-day operation under a specific “---.usage” event.
  3. Most notifications are sent on the INFO queue, but exceptions and error conditions in the system should be emitted on the ERROR queue.
  4. Around any operation that is complex, multi-staged or could take a long time to perform, separate .start and .end notifications should be emitted. This can help with performance tuning and stall detection.
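The rules above can be sketched as a small decorator that brackets a long-running operation with .start and .end events and reports failures on the ERROR queue. The `notify` helper here is a hypothetical stand-in for the real oslo notifier:

```python
import functools

EVENTS = []

def notify(priority, event_type, payload):
    # Hypothetical stand-in for the oslo notifier; a real service would
    # publish to the INFO or ERROR queue on the message bus instead.
    EVENTS.append((priority, event_type, payload))

def notified(event_type):
    """Bracket an operation with .start/.end events, emitting .error on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            notify("INFO", event_type + ".start", {"args": args})
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                notify("ERROR", event_type + ".error", {"exception": str(exc)})
                raise
            notify("INFO", event_type + ".end", {"args": args})
            return result
        return wrapper
    return decorator

@notified("compute.instance.resize")
def resize_instance(instance_id, flavor):
    # Stand-in for the real (long-running, multi-stage) operation.
    return {"instance_id": instance_id, "flavor": flavor}

resize_instance("inst-1", "m1.large")
```

The paired .start/.end events are what make stall detection possible: a .start with no matching .end is a strong hint something is stuck.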

The Importance of Usage Notifications

Throughout the day CrUD notifications are generated. If these are billable events we need to have a secondary notification that summarizes the individual notifications. The sum of the individual events should match the summarized notification. Having two sources of billable data allows for later audits.

Depending on the volume of the summary it may be better to generate these .usage events on a per-tenant basis as opposed to a per-resource basis. One event per active tenant vs. one event per billable resource.
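A hedged sketch of rolling individual billable events up into one .usage summary per tenant (the event shapes and names here are illustrative, not from any OpenStack project):

```python
from collections import defaultdict

def summarize_usage(events):
    """Roll per-resource billable events up into one .usage event per tenant.

    The sum of the individual events should match the summary totals,
    which is what enables a later audit of the two data sources.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        totals[event["tenant_id"]] += event["amount"]
        counts[event["tenant_id"]] += 1
    return [
        {"event_type": "billing.usage", "tenant_id": tenant,
         "total": totals[tenant], "event_count": counts[tenant]}
        for tenant in totals
    ]

events = [
    {"tenant_id": "t1", "amount": 1.5},
    {"tenant_id": "t1", "amount": 2.5},
    {"tenant_id": "t2", "amount": 4.0},
]
summaries = summarize_usage(events)
```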

If your service does not produce billable events, the usage events probably aren't required.

The Swift Exception

Edit: Clarified a few points after feedback from the Swift team. 

Swift doesn't use Oslo rpc (but it does use other oslo components) and has to deal with tremendous transactional volume. Nearly every Swift operation is a billable action, so they have to use a completely different scheme for billing support. This involves a highly distributed log-parsing and storage mechanism. Swift has extensive support for statsd metric generation, which could likely be integrated with Ceilometer's UDP collector. While the insights that can be collected from raw metrics are not as rich as an interrelated set of notifications, Swift does offer very powerful hooks for monitoring systems.

Notifications vs. Web Hooks

Most services use web hooks (HTTP requests) to other OpenStack services for interop notifications. Ideally, this will change when Taskflow is available. Until then, web hooks are fine for interop since they use the public APIs of these services.

Notifications are used primarily for monitoring and billing. Billing especially needs the power of the queuing system underneath it in case the target system is unavailable.

Concerns about Oslo use in Libraries

The Taskflow team has a concern about oslo adoption. Oslo requires a copy-paste from the master repo into the target project rather than being a standard Python library (i.e., available via pip install).

This is a known concern for the oslo team and is being addressed here (similar concerns exist with the oslo.config library):

Hopefully this will shake out in the Icehouse release to make it easier for projects and libraries to use Oslo.


The Report Card

The following table shows the current state of notification support in OpenStack Services (and some incubation wanna-be projects). We would be happy to have any notifications coming out, but to be a good citizen, each type of notification listed above should be supported. Not supporting Oslo is an instant failure.

If there are errors in this report card, please let me know and I'll adjust accordingly. 
[Report-card table — columns: Oslo Adoption, Usage Audit, .info & .error; rows: Object Storage, Image Service (uses custom notifier), Block Storage, Ceilometer *, Heat *, Database Service, Queue Service "Marconi" *; cell notes include "(and .warn)" and "(on User and Project objects)". The individual pass/fail marks did not survive formatting.]
* The team has indicated that notification support via oslo is slated for the Icehouse release (pending blueprint approval).
** Probably not critical for this service.
*** Glance plans to move to the standard oslo implementation with more extensive notifications in Icehouse.

Edit: removed TaskFlow from the report card since it's a library (vs a service) and can hook into any notification system via its emitter mechanism. 

The Results

Things are good. Notification support in OpenStack is strong, especially in the most visible projects. Newer projects are jumping on the notification bandwagon and there should be much greater adoption during the Icehouse release. Swift is a notable exception, but perhaps there will be an effort for it to tie into Ceilometer at some point.

While there are still some growing pains around the adoption of the Oslo library, its use is generally pervasive. 

We have big hopes that the TaskFlow project will mature and that existing projects will start to use it. Having a common state management library will mean a central location for notification generation. You can learn more about it here:

Monday, April 01, 2013

The Monitoring Stack (the state of the art)

Last week I had the pleasure of attending the first annual Monitorama conference. This was a conference aimed "towards advancing the state of open source monitoring and trending software."

For many of us starting in this area, our concept of monitoring consisted of top, some Apache, MySQL and application log files, and perhaps an external ping service that tells us when our web site is unavailable. Anything beyond that generally ran into the commercial product realm. We were scared off monitoring by those old monolithic products that required huge licensing fees and armies of professional services people. Thankfully, times have changed.

And our application footprint has grown. No longer are we just deploying web servers and databases. Our application stack starts with our automated testing framework and runs through continuous integration and continuous deployment. Jenkins, Travis, Puppet/Chef, etc ... they're all critical. It also includes our deployment partners ... that army of SaaS applications we use to make our life easier. Any SaaS solution worth its salt has a status API available for tracking availability. Our monitoring needs are now wide and diverse.

My first exposure to the next generation of monitoring tools came with the awesome Etsy post "Measure Anything, Measure Everything".

The concept of Measure Everything wasn't new to me. I'd been working on StackTach for OpenStack around the same time and understood the value of getting a visual representation of the internals of an application. Even from my old management days we used to say "you can't manage what you can't measure." I lived this with my Google Analytics experiences from running various web sites and my software development management interests were aiming towards Six Sigma techniques over the hand-wavey agile methods. Essentially, numbers are good. But this was giving us a way to apply those same measurement techniques to running software. It was a lens into the black box. Could the days of parsing log files be over?

The first generation of these new monitoring tools included Zenoss, Nagios, RRDtool, Cacti, Munin and Ganglia, to name a few. They were built out of necessity and often have some really nasty warts that people just hate. The latest generation of tools has learned from those mistakes.

The Etsy tool chain started with statsd and graphite. This introduced me to the concept of using UDP packets for instrumenting running applications ... which was pretty brilliant. For those unfamiliar with statsd and graphite, here's the flow:

  • your application wants to measure something, so it sends a UDP packet to the statsd service. UDP packets are lossy and unreliable but fast for large amounts of data. Most large video networks send via UDP packets.
  • statsd is a node.js in-memory data aggregator (it accumulates received data and every so often sends it to graphite)
  • graphite is a django app that archives received data and gives a funky web interface for presenting and querying the data. 
There are a number of cool things happening here:
  1. adding statsd integration to an existing application is very easy. No special libraries needed and sockets are available in nearly all languages. 
  2. since statsd uses UDP there is very little risk of the production application crashing if statsd fails. The packets just get lost. 
  3. since statsd is in-memory, it can process a lot of data very quickly. But rather than take on the task of archiving and disk access, it simply forwards the results to something that can do it better. 
  4. graphite has an easy REST interface which makes it easily accessible by technical product managers to create their own dashboards and status reports. 
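Sending a metric to statsd really is just a UDP packet. Here's a minimal sketch, assuming a statsd daemon on localhost:8125; the `name:value|type` wire format is the standard statsd protocol:

```python
import socket

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a statsd metric over UDP.

    If nothing is listening, the packet is simply lost -- the application
    never blocks and never crashes, which is exactly the point.
    """
    message = "%s:%s|%s" % (name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(message.encode("ascii"), (host, port))
    finally:
        sock.close()
    return message

# Counters ("c"), timers in milliseconds ("ms") and gauges ("g")
# all use the same one-line format.
send_metric("app.requests", 1)
send_metric("app.response_time", 42, "ms")
```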
Side note: if your application is written in python and you want to experiment with this stuff without touching your existing code base, have a look at the Tach application. This monkey-patches your python application and sends the output to statsd or graphite directly. Pretty cool. Although it was originally written for use with OpenStack, it can work with anything.

But the real insight here is a set of atomic, well focused tools that could be put together to create a monitoring stack. The tool chest of the devops team just expanded. 

As our experiences with statsd and graphite had grown within the company we also saw where the monitoring stack failed. A UDP-based approach won't work for billing or auditing. For these scenarios you need to have a reliable transport for events. In OpenStack we publish event notifications to AMQP queues for consumption by various other tools. These are important events, often with large payloads. When the StackTach application is unavailable these queues can grow very quickly, and we don't want to drop events. This is manageable for something like OpenStack Compute, but other applications like Storage produce an incredible amount of data across a wide range of servers. Using a notification-based system would be difficult. Instead we needed to look at syslog-based archiving and processing solutions. The new monitoring stack offers tools like LogStash and, in the OpenStack case, Slogging.

Then there is the post-processing. To add value to the raw events we often need to apply other functions to the data such as times series averaging. This can be tricky. We need to wait for all the collected data to arrive before we can start the post-processing. We may need to ensure proper ordering. Historically this would be done with cron jobs and batch processing, but the new monitoring stack includes tools like Riemann which can do this post-processing inline. 

It seems evident that Nagios isn't going anywhere any time soon, but there are some other tools offering alternatives, such as Shinken and Sensu.

Recently our team has been working on bringing what we've learned with StackTach to the OpenStack-blessed monitoring solution called Ceilometer. Without standing back and looking at the larger monitoring community it would have been very easy to want to recreate an entire monitoring stack on our own. But now it's clear that we can focus on the minimal set of missing functionality and augment that from an already powerful set of tools. This is a very attractive proposition for one simple reason ... the project has an end in sight. There are lots of fun problems out there to tackle and knowing you don't have to reinvent the wheel is very compelling.

There is a cost though. The monitoring stack today consists of a variety of tools, all written in different languages and each with different care and feeding instructions. One could argue that the workload on operations will only increase by mixing and matching. My knee-jerk reaction is to agree, but I know that the greater win is to get familiar with all of these new tools. In production, these monitoring tools need monitoring as well. So we may have to monitor Java, Ruby, Python and C# VMs running bytecode from a potential variety of languages.

If this all seems too daunting, perhaps the hosted offerings are a better choice for you. For nearly every open source offering there is a hosted equivalent. Look at Loggly, Papertrail, PagerDuty, Librato, Datadog, Hosted Graphite, Boundary, New Relic, etc.

This brings me back to Monitorama. The Monitorama conference had a format that worked very well for me for the following reasons:
1. It was only two days long.
2. Day One focused on hearing about "the state of the art" from industry leaders.
3. Day Two was tactical with the tools and included a hackathon which let you understand where the real-world pain lived in each of these components.
4. It was small enough so you could actually talk to people and have meaningful conversations.

The Day One talks made it clear that "Alert Fatigue" (a term borrowed from the medical industry) is a big problem. Too many alerts hitting our inbox. Some are important, most are noise. There are people working on it, but it's perhaps the biggest source of angst for operations currently.

Side story: for the hackathon I started work on a tool that allowed members of the company to track external events that might affect production. Things like sales events, big holidays, new customer deployments or internal events such as new code deployments, hardware upgrades, etc. The idea was to have these events show up on the spikes in the dashboard graphs so we could say "That spike was due to Foo and that ravine was due to Blah." I made some good progress for the day and then one of the other attendees showed me his side project Anthracite, which does all this and more. The author was sitting in the room next to me. What are the odds?

For a while I was getting disillusioned with this space because I saw it dominated with commercial solutions or that the problem was so big it would be a lifetime of work to build as open source. But now I see there are viable open source components and there is enough of the stack available that we can focus on some of the smaller missing pieces. Also, there is a smart community out there facing the exact same problems and actively working on solutions. There is a light at the end of the tunnel.

I may not attend ... but definitely the US one next year. But for now, I've got some products to learn.

(ps> if you made it this far, I wrote a little about monitoring from an OpenStack perspective here)

Wednesday, October 31, 2012

Debugging OpenStack with StackTach and Stacky

At the OpenStack Grizzly Design Summit I introduced StackTach v2. I first introduced StackTach earlier this year but never really had the time to update it. Since then I've been working on business metric collection efforts and StackTach was the perfect tool for it, so it got dusted off. The latest changes to StackTach are up on GitHub.

Also, I'm introducing Stacky, a command line tool for StackTach. It's in a separate repo on GitHub (since you'll likely want to install Stacky in many places).

Here's a video that explains what it's all about, how to install and use it (best viewed full screen at max resolution)

EDIT: This video is a little old. There has been a lot of new functionality in StackTach around auditing and usage as well as new functions in Stacky. But you'll still get the general idea about what's going on.

Look forward to your feedback!

Monday, October 22, 2012

My Travel Kit ...

This is a cross-post from my other blog. I've had lots of people asking me to write about my travel regimen ... so here it is: My Travel Kit

Thursday, September 27, 2012

OpenStack Nova Internals – pt2 – Services

This is a long one, so go grab a coffee.

Recently, I was working on a little OpenStack instrumentation problem. Let me tell you about it.

"You can't manage what you can't measure," as the old saying goes. Nowhere is that truer than when trying to improve software performance. And that's what I was tasked with ... measuring OpenStack performance.

As I mentioned in Part 1, OpenStack uses RabbitMQ to send messages between services (a service being an OpenStack Nova component like scheduler, compute, network, api, etc.)

I could already measure times from the client perspective using novaclient's --timings option. I could measure times inside a service using tach, which could be sent up to statsd or graphite. But there's another black hole that calls could go in ... Rabbit itself. As I mentioned, we stuff an RPC call on the queue and the related service picks it up and handles it. As every admin knows, if the processing service (the "worker") is too slow, the queues can grow quite large. We need to track that.

For our RPC calls, the "inflight" time is pretty simple on the surface:

Total Call Time = Time in Queue + Time in Service (+ Time in Queue for the response on a two-way call)

We could certainly monitor the message rate in rabbit and look for the change in processing speed, but we can get more information by injecting a real message in the system. For a normal production system you'd certainly want to watch both. Ideally, we can find out what the cause of the slow down is, or at least get some hints.

So, I thought I'd write a little utility program to inject a ping()-like message into the queue and send it to each service (you'll see why in a bit). We'll record the most basic information:
1. Time to the service
2. Time from the service

But non-HTTP OpenStack services are built on eventlet which, as I mentioned before, is an async library. While these libraries make programming easier by eliminating most classic thread/locking problems, the downside is that the number of greenthreads can grow. And a call can get held up when the number of greenthreads grows. Most nova services are pretty lightweight, but some have to make HTTP or out-of-process binary calls to get the work done.

For example,
  • the network service may have to talk to the switch, 
  • the compute node may have to talk to a busy hypervisor or 
  • the image service might be slow. 
The services have to be able to handle this rush of calls.

Can we track that?

The most basic way would be to run a time.sleep(1) call, during which eventlet will just pass control to another greenthread and revisit it later. But what if eventlet is busy due to a lot of greenthreads? Our 1 second call will take longer. That's our overhead. (Later, we can talk to eventlet directly and ask how many greenthreads are active. But that's another post.)
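The sleep-probe idea can be sketched without eventlet at all: ask to sleep for a fixed interval, measure how long it actually took, and treat the overshoot as scheduler overhead (under eventlet, extra greenthreads inflate this number):

```python
import time

def measure_sleep_overhead(interval=0.05):
    """Request a sleep of `interval` seconds and report the overshoot.

    Under eventlet this sleep would yield to the hub; a busy hub (many
    greenthreads) resumes us late, and the overshoot is our overhead.
    """
    start = time.monotonic()
    time.sleep(interval)
    elapsed = time.monotonic() - start
    return elapsed - interval

overhead = measure_sleep_overhead()
```

On an idle system the overhead is a few milliseconds; on an overloaded one it can balloon, which is the signal we're after.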

That was the plan ... so how to implement this? Here were my requirements:
  • I want something that can get a list of all active / enabled services and ping them. But that means I need to put a ping() method in every service in OpenStack Nova. Where should that live? 
  • Also, this program is going to run a long time, so I'd like to leverage existing service deployment and process control scripts (puppet/chef/etc). 
  • I'd like to reuse the Nova RPC library rather than duplicate that effort. 
  • But I also need to talk to the database ... can I reuse the Nova DB library? 
  • This is all going to require a configuration file to set up ... could I reuse the Nova configuration mechanism? 
  • I'll need unit tests. Nova has a very nice testing framework that integrates nicely with the continuous integration / code review system. The mechanism will have to change as Nova changes. I don't want to have to hear from users that the protocols are out of sync (in Nova, the API is strictly versioned, but the RPC protocol is only loosely versioned).
Perhaps just writing a utility program isn't the correct approach. Perhaps I should make this another Nova service.

Perhaps I can bite off more than I can chew and try to get this accepted as a core service? If not, I'll refactor and pull it out as an optional external service.

Also, since the service is so simple I thought it would make for a good "how to" post ... so here we are!

Alright, enough chat ... how do we do that?

Let's start with the launcher: the ./bin file that's going to fire this puppy up. We're not an HTTP service; we don't need paste or auth or any of that stuff. We just need to get our configuration and spin up eventlet. Just look at the launcher for the compute node.

Pretty simple: load stuff, load stuff, load stuff ... we're "main" so parse the args, configure logging, place our monkey patches (another time :), create the service, make it available, pump the events until we die.

Of course, the magic line is service.Service.create(binary='nova-compute'). Something funky is going on there. Somehow that thing is finding the implementation for the compute service. Let's look at the ./nova/ create() method. Yes, that's magic in there.

"nova-compute" turns into "compute", which becomes "compute_manager", and the "--compute_manager" flag is looked up for the code to load. The default is defined in nova/

And, nova.compute.manager.ComputeManager is loaded (or whatever you set it to in nova.conf)

Oh, notice the --periodic_interval flag in there? That's pretty cool, that's how often our internal timer should trigger. It's like our built-in cron service. We're going to use that to issue our ping()'s.

So what does a nova.???.manager.???Manager look like? Let's make one. In our case it will be nova.inflight.manager.InflightManager (we'll have to make a new ./nova/inflight directory for it with an empty file)

This is the core of the inflight service I wanted (there's some other sugar in there to actually do the work, but that's not really important for here). What is important are lines 80-89. This is the handler for the periodic task event that will occur every N seconds. In our case, we're going to send a ping to the first item in the list of services and then move that service to the end of the list. A circular queue of pings.
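The circular-queue behaviour is simple to sketch; the `ping` method here is a hypothetical stand-in for the real RPC cast:

```python
from collections import deque

class PingRotation:
    """Ping one service per periodic tick, round-robin."""

    def __init__(self, services):
        self.services = deque(services)
        self.pinged = []

    def periodic_tick(self):
        # Ping the first service in the list, then rotate it to the end --
        # a circular queue of pings driven by the periodic task timer.
        service = self.services[0]
        self.ping(service)
        self.services.rotate(-1)

    def ping(self, service):
        # Stand-in for the real RPC call to the service's queue topic.
        self.pinged.append(service)

rotation = PingRotation(["scheduler", "compute", "network"])
for _ in range(4):
    rotation.periodic_tick()
# pinged order: scheduler, compute, network, scheduler
```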

Of course, to make this work we will need to include an --inflight_manager=nova.inflight.manager.InflightManager flag to our nova.conf

And, for illustration, you can see how we can add new flags to nova. The nice part is we only have to define them in the place they are used and the framework will include them in the grand configuration.

When we run ./bin/inflight-manager executable it will launch the framework and load our new InflightManager class. And then, every few seconds, the check_inflight() method will get called.

Next, look at lines 56-60. This little dictionary is a map of the service type to the api for that service. If we look at the topic column in the Service table (or service['topic'] in the result set) we'll have the type of the service we are hoping to talk to. This lets us talk to it in the correct way.

Now it gets a little tricky. Since I wrote part 1 of this series there have been some changes to the RPC abstraction in OpenStack. Previously I mentioned there was an API onto the service which was responsible for marshalling parameters and results to/from AMQP. This is still the case, but now there's a new layer just underneath it. (Most) all services now also have a related file. The difference between my_service/ and my_service/ is as follows:

  • is the thing that other services should use to talk to the service, just like before.
  • handles the light-weight versioning of the RPC protocol that I hinted at earlier.
The idea is, eventually, we'll be able to have a means to mix old and new services in large deployments by impedance matching the RPC protocols. Currently, it'll just puke.

Since our inflight service should be able to test itself, and it will show up in the list of Services in the database, we're going to need a ./inflight/ file. The meat of it is pretty simple:

Note the version number in there. We'll need to bump that whenever we change the api. The topic is just the name of the queue within Rabbit that the message will be written to. In the same way, the service framework will look in the "inflight" queue topic for methods to call. We'll have to add that --inflight_topic flag to our nova.conf.
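A minimal sketch of what such an rpcapi class might look like. The names and structure are illustrative only; the real base class and RPC plumbing live in openstack.common:

```python
class InflightAPI(object):
    """Client-side RPC proxy for the hypothetical inflight service.

    BASE_RPC_API_VERSION must be bumped whenever the API changes; the
    topic names the Rabbit queue the message is written to.
    """
    BASE_RPC_API_VERSION = "1.0"

    def __init__(self, topic="inflight", rpc=None):
        self.topic = topic
        self.rpc = rpc  # stand-in for the real RPC layer

    def inflight(self, context, sent_at):
        # Marshal the call into the loosely-versioned wire format.
        message = {"method": "inflight",
                   "args": {"sent_at": sent_at},
                   "version": self.BASE_RPC_API_VERSION}
        if self.rpc is not None:
            return self.rpc.call(context, self.topic, message)
        return message

api = InflightAPI()
message = api.inflight(context={}, sent_at=1234567890)
```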

Where are we?

We have a service that can launch and generate calls to other services periodically. We have an RPC API on our service so other services can call us (in reality we'll only be calling ourselves).

What we're missing is an inflight() call in each of the other service API's. OOP 101 says we can put this in a common base class and make it available to all services. And it's almost just that simple :) Then, we need to put a handler in each of the service ???Manager classes to actually do the work.

All of the RPC stuff within nova has been moved into openstack.common, since it's something that glance, quantum and other components can use if desired.

In order to add this common inflight() method to our base RPC API class, we're going to have to make a change to openstack.common. This is a different project. What happens behind the scenes is this: When something is submitted to openstack nova, it gets merged in a different branch. Our CI tools layer the openstack.common project on top of it. So, when we git clone nova we get a copy of openstack.common ... but we shouldn't make changes to it in nova itself. Instead we need to make our changes to openstack.common separately and remerge with nova trunk to see those changes in our working branch.

Tricky and somewhat confusing, I know. But it makes sense in the big picture.

Just remember, don't mess with stuff in ./nova/nova/openstack/* unless you're in the openstack.common repository.

Again, the flow is:
Service API -> Service RPC Proxy -> AMQP -> Service Manager

Let's add inflight() to the rpc proxy object in openstack/common/rpc/ of the openstack.common project (the baseclass for all rpcapi implementations)

There. Now we can call inflight() from our InflightManager class and have the message sit on the wire.

Next we need to add a common inflight() implementation to each of the service managers (including our InflightManager). Once again, we'll add it to the base class of all the Managers. Remember, the Managers are the part of the service that contains the implementation of the service methods.

Fortunately, this part is pretty easy. All Managers derive from nova/ and don't really have to sweat much about the versioning since that's the sender's job. We just have to add our implementation which, in this case, will just spawn a greenthread that sleeps for a second and returns the actual time.

Phew. We did it.

In reality, there are some other minor tweaks and actual code to make this service really useful, but the purpose here was to illustrate making your own service. If you like, you can look at the whole thing in the review branch.

If this was at all useful I look forward to your feedback in the comments or via Twitter.

Next time, we'll look at the HTTP interface and how REST calls are dealt with.

Tuesday, July 17, 2012

How OUYA Can Win ...

Nearly 40,000 other people and I recently backed OUYA: A New Kind of Video Game Console. To understand why, simply watch the video ... it says all the right things:
1. We love video games
2. Mobile development is hard
3. It used to be easy to write games
4. Console development is too expensive

If you're a hobbyist programmer the attraction is pretty clear: Open the box and start coding a game your friends can play on the TV. You can be a hero in the eyes of your buddies.

There are a million places OUYA could fall apart. The most obvious being the founders are a bunch of lunatics and spent their $5+ million on bubble wrap. But, let's assume they actually deliver (a much more interesting scenario.)

For me, the biggest hurdle is that Android development is still very complicated.

I got into programming when I was a kid, on a TRS-80. I spent way too much time in my room writing games. I reproduced all the games I played at the arcade on that machine in BASIC. Centipede, Missile Command, Space Invaders, Asteroids ... and they all sucked. BASIC was too slow. I learned Z-80 Assembly Language and started work on my own game (Blubber Blaster, but that's another story). This one little box took me all the way into complex programming.

The hook was set deep. I knew programming was my thing and games got me addicted.

Why was it so addictive? Easy, there were no roadblocks. Kids are all about instant gratification and the early computers had no roadblocks. You turned on the computer and POW, you were ready to go.

Look at that.

No Operating System to speak of. No discernible file system. No menus to confuse you.  Just start coding. The simplicity is brilliant.

Fortunately, the documentation was equally brilliant.

This manual was excellent. I wore that book out. In the end I think it just disintegrated. But it was everything I needed to get started writing stupid games.

Until I figured out how to use the cassette to CSAVE and CLOAD my programs, when I shut the computer off, all my work disappeared ... and the next day, I would faithfully retype them. But, again, the beauty was the simplicity. Save and Load operations had a single context: "Save this program. Load that program." There were no directories, no file types, no upgrades ... nothing. Save it. Load it.

The editor was a command line joke. But it was simple enough for me to get started while not getting in my way. Yes, later it slowed me down, but it worked well enough for long enough to get me hooked.

BASIC is another story. BASIC, rightfully, gets a bad rap for line numbers and GOTO statements, but as a budding game developer it held a special allure. First of all, the entire program had to fit in a single file (there was no file system, remember). No modules, no import statements, no linking ... nothing. One size fits all. There was minimal syntactical sugar. Commas and semicolons were about the extent of it, and yet they still tripped us up all the time. How many times did I type in some game code from a book only for it to fail with a missing comma? Too many. Perhaps that's half the reason I needed to learn to do it myself.

Also, BASIC had a FOR-loop. That's it. FOR X = 1 to 10 STEP 2 ... done. Your entire game had to exist within a for-loop or a GOTO loop. Think about that. No double buffering concerns. No event pumps. No structures. No polymorphic actor classes. It was "Draw image. Erase image. Draw new image." I learned the principles of animation by understanding how graphics were drawn on the computer screen. My Atari and C-64 friends had the luxury of sprites and sound registers, but us TRS-80 users were essentially given a bag of sand and told to go make a CPU.

It kept things simple. You could hold an entire game in your head. Keyboard input was done by PEEK'ing a memory-mapped location. We didn't have to register listeners or event handlers.


Anyway ... let's wrap this up. I hope you see where I'm going with it.

OUYA has a great opportunity to bring this sort of excitement back to beginning game developers if they keep it simple. Reduce the roadblocks. Hide the complexity. Android development is the exact opposite of everything I've described in this article. OUYA needs to focus just as much on the out-of-the-box programming experience as it does the controllers. I don't give a rat's ass whether the OUYA has the latest Tegra processor or not. I want a simple programming environment that I can put in front of my 7yo and make him forget LEGO ever existed. Even Scratch is too complex in my mind. Strong documentation and dirt simple. Otherwise, a Raspberry Pi + Python/Pygame would have a better chance of beating them.

Hopefully they keep this in mind. I'm rooting for them!

Tuesday, April 24, 2012

OpenStack Nova Internals – pt1 – Overview

For the last 18 months or so I've been working on the OpenStack Nova project as a core developer. Initially, the project was small enough that you could find your way around the code base pretty easily. Other than being PEP8 compliant, there weren't a lot of idioms you needed to follow to get your submissions accepted. But, as with any project, the deeper you get the more subtle the problems become. Consistently dealing with things like exception handling, concurrency, state management, synchronous vs. asynchronous operations and data partitioning becomes critical. As a core reviewer it gets harder and harder to remember all these rules, and as a new contributor it's very intimidating trying to get your first submission accepted.

For that reason I thought I'd take a little detour from my normal blog articles and dive into some of the unwritten programming idioms behind the OpenStack project. But to start I'll quickly go over the OpenStack source code layout and basic architecture.

I'm going to assume you know about Cloud IaaS topics (image management, hypervisors, instance management, networking, etc.), Python (or you're a flexible enough programmer where the languages don't really matter) and Event-driven frameworks (aka Reactor Pattern).

Source Layout

Once you grab the Nova source (git clone), you'll see it's pretty easy to understand the main code layout. The actual code for the Nova services is in ./nova and the corresponding unit tests are in the related directories under ./nova/tests. Here's a short explanation of the Nova source directory structure:

├── etc
│   └── nova
├── nova
│   ├── api - the Nova HTTP service
│   │   ├── ec2 - the Amazon EC2 API bindings
│   │   ├── metadata
│   │   └── openstack - the OpenStack API
│   ├── auth - authentication libraries
│   ├── common - shared Nova components
│   ├── compute - the Nova Compute service
│   ├── console - instance console library
│   ├── db - database abstraction
│   │   └── sqlalchemy
│   │         └── migrate_repo
│   │               └── versions - Schema migrations for SqlAlchemy
│   ├── network - the Nova Network service
│   ├── notifier - event notification library
│   ├── openstack - ongoing effort to share common parts with other OpenStack projects
│   │   └── common
│   ├── rpc - Remote Procedure Call libraries for Nova Services
│   ├── scheduler - Nova Scheduler service
│   ├── testing
│   │   └── fake - “Fakes” for testing
│   ├── tests - Unit tests. Sub directories should mirror ./nova
│   ├── virt - Hypervisor abstractions
│   ├── vnc - VNC libraries for accessing instance consoles
│   └── volume - the Volume service
├── plugins - hypervisor host plugins. Mostly for XenServer.

Before we get too far into the source, you should have a good understanding of the architectural layout of Nova. OpenStack is a collection of Services. When I say a service I mean it's a process running on a machine. Depending on how large an OpenStack deployment you're shooting for, you may have just one of each service running (perhaps all on a single box) or you can run many of them across many boxes.

The core OpenStack services are: API, Compute, Scheduler and Network. You'll also need the Glance Image Service for guest OS images (which could be backed by the Swift Storage Service). We'll dive into each of these services later, but for now all we need to know is what their jobs are. API is the HTTP interface into Nova. Compute talks to the hypervisor running on each host (usually one Compute service per host). Network manages the IP address pool as well as talking to switches, routers, firewalls and related devices. The Scheduler selects the most appropriate Compute node from the available pool for new instances (although it may also be used for picking Volumes).

The database is not a Nova service per se. The database can be accessed directly from any Nova service (although it should not be accessed from the Compute service ... something we're working on cleaning up). Should a Compute node be compromised by a bad guest, we don't want it getting to the database.

You may also run a stand-alone Authentication service (like Keystone) or the Volume service for disk management, but it's not mandatory.

OpenStack Nova uses AMQP (specifically RabbitMQ) as the communication bus between the services. AMQP messages are written to special queues and one of the related services picks them off for processing. This is how Nova scales. If you find a single Compute node can't handle the number of requests coming in, you can throw another Compute service into the mix. Same with the other services.

If AMQP is the only way to communicate with the Services, how do the users issue commands? The answer is the API service. This is an HTTP service (a WSGI application in Python-speak). The API service listens for REST commands on the HTTP service and translates them into AMQP messages for the services. Likewise, responses from the services come in via AMQP and the API service turns them into valid HTTP Responses in the format the user requested. OpenStack currently speaks EC2 (the Amazon API) and OpenStack (a variant of the Rackspace API). We'll get into the gory details of the API service in a later post.

But it's not just API that talks to the services. Services can also talk to each other. Compute may need to talk to Network and Volume to get necessary resources. If we're not careful about how we organize the source code, all of this communication could get a little messy. So, for our inaugural article, let's dive into the Service and RPC mechanism.


I'll be using the Python unittest notation for modules, methods and functions. Specifically,

nova.compute.api:API.run_instance equates to the run_instance method of the API class in the ./nova/compute/api.py file. Likewise, nova.compute.api.do_something refers to the do_something function in the ./nova/compute/api.py file.

Talking to a Service

With the exception of the API service, each Nova service must have a related Python module to handle RPC command marshalling. For example:

The network service has ./nova/network/api.py
The compute service has ./nova/compute/api.py
The scheduler service has ./nova/scheduler/api.py

... you get the idea. These modules are usually just large collections of functions that make the service do something, but sometimes they contain classes with methods instead. It all depends on whether we sometimes need to intercept the service calls. We'll touch on these use cases later.

The Scheduler service nova.scheduler.api has perhaps the simplest interface, consisting of a handful of functions.

Network is pretty straightforward with a single API class, although it could have been implemented with functions since I don't think there are any derived classes yet.

Compute has an interesting class hierarchy for call marshaling, like this:
BaseAPI -> API -> AggregateAPI
and BaseAPI -> HostAPI

But, most importantly, nova.compute.api.API is the main workhorse and we'll get into the other derived classes another day.

So, if I want to pause a running instance I would import nova.compute.api, instantiate the API class and call the pause() method on it. This will marshal up the parameters and send them to the Compute service that manages that instance by writing them to the correct AMQP queue. Finding the correct AMQP topic for that Compute service requires a quick database lookup for that instance, which happens via the nova.compute.api:BaseAPI._cast_or_call_compute_message method. With other services it may be as simple as importing the related api module and calling a function directly.
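To make the marshalling pattern concrete, here's a simplified, self-contained sketch of it. This is not actual Nova code: ComputeAPI, the fake db dict and rpc_cast() are stand-ins for nova.compute.api.API, the instance-table lookup and the RPC layer, respectively.

```python
sent_messages = []  # stands in for the AMQP bus


def rpc_cast(topic, message):
    """Pretend to drop an asynchronous message on an AMQP topic."""
    sent_messages.append((topic, message))


# fake instance table: instance_id -> host that manages it
db = {"instance-1": "compute-host-3"}


class ComputeAPI:
    def pause(self, instance_id):
        # find which compute node manages this instance (a DB lookup in Nova)
        host = db[instance_id]
        # marshal the call onto that node's topic queue
        rpc_cast("compute.%s" % host,
                 {"method": "pause_instance",
                  "args": {"instance_id": instance_id}})


api = ComputeAPI()
api.pause("instance-1")
print(sent_messages[0][0])  # compute.compute-host-3
```

The caller never talks to the compute host directly; it only needs to know which topic queue to write to, which is exactly what makes the services loosely coupled.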

Casts vs. Calls

AMQP is not a proper RPC mechanism, but we can get RPC-like functionality from it relatively easily. In nova.rpc.__init__ there are two functions to handle this: cast() and call(). cast() performs an asynchronous invocation on a service, while call() expects a return value and is therefore a synchronous operation. What call() really does under the hood is dynamically create an ephemeral AMQP topic for the return message from the service. It then waits in an eventlet green thread until the response is received.

Exceptions can also be sent through this response topic and regenerated/rethrown on the caller's side if the exception is derived from nova.exception:NovaException. Otherwise, a nova.rpc.common:RemoteError is thrown.

Ideally, we should only be performing asynchronous cast()'s to services, since call()'s are obviously more expensive. Be careful about which one you choose. Try not to be dependent on return values if at all possible. Also, try to make your service functions idempotent where possible, since they may get wrapped up in some retry code down the road.
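The cast/call distinction can be illustrated with plain in-memory queues standing in for AMQP. Again, this is a toy, not Nova's implementation: the topics dict, the ephemeral reply topic naming and the pretend service loop are all made up for illustration.

```python
import itertools
import queue
import threading

topics = {"compute": queue.Queue()}  # topic name -> message queue
_reply_ids = itertools.count()


def cast(topic, message):
    """Asynchronous: drop the message on the topic and return immediately."""
    topics[topic].put(message)


def call(topic, message):
    """Synchronous: create an ephemeral reply topic and block waiting on it."""
    reply_topic = "reply.%d" % next(_reply_ids)
    reply_queue = queue.Queue()
    topics[reply_topic] = reply_queue
    message["_reply_to"] = reply_topic
    topics[topic].put(message)
    result = reply_queue.get(timeout=1)  # wait for the service's answer
    del topics[reply_topic]              # the reply topic is throwaway
    return result


def service_loop(n):
    """A pretend compute service consuming n messages from its queue."""
    for _ in range(n):
        msg = topics["compute"].get(timeout=1)
        if "_reply_to" in msg:  # only call()'s expect an answer
            topics[msg["_reply_to"]].put("%s done" % msg["method"])


threading.Thread(target=service_loop, args=(2,), daemon=True).start()
cast("compute", {"method": "pause_instance"})        # returns immediately
result = call("compute", {"method": "get_console"})  # blocks for the reply
print(result)  # get_console done
```

Note that the cast never produces a response at all, which is exactly why it's the cheaper operation: nothing waits, nothing holds a reply topic open.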

If you're really interested in how the rpc-over-amqp stuff works, look at nova.rpc.impl_kombu.

Fail-Fast Architecture

OpenStack is a fail-fast architecture. What that means is that if a request does not succeed, it throws an exception which will likely bubble all the way up to the caller. But, since each OpenStack Nova service handles requests in eventlet green threads, a failure generally won't destroy any threads or leave the system in a funny state. A new request can come in via AMQP or HTTP and get processed just as easily. Unless we are doing things that require explicit clean-up, it's generally ok not to be too paranoid a programmer. If you're expecting a certain value in a dictionary, it's ok to let a KeyError bubble up. You don't always need to derive a unique exception for every error condition ... even in the worst case, the WSGI middleware will convert it into something the client can handle. That's one of the nice things about threadless/event-driven programming. We'll get more into Nova's error handling in later posts.
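Here's a tiny sketch of that fail-fast idea: the handler just lets a plain KeyError bubble up, and a middleware-style wrapper at the edge turns anything uncaught into an error response. The names here (handle_request, fault_wrapper) are invented for illustration and aren't Nova's actual middleware.

```python
def handle_request(params):
    # fail fast: if "instance_id" is missing, just let the KeyError bubble up
    return {"status": "paused", "instance": params["instance_id"]}


def fault_wrapper(handler, params):
    """Outermost layer: convert any uncaught exception into an HTTP-ish fault."""
    try:
        return 200, handler(params)
    except Exception as exc:
        # the process keeps running; only this one request fails
        return 500, {"fault": type(exc).__name__}


print(fault_wrapper(handle_request, {"instance_id": "i-1"}))
# (200, {'status': 'paused', 'instance': 'i-1'})
print(fault_wrapper(handle_request, {}))
# (500, {'fault': 'KeyError'})
```

The handler stays clean because it never defends against missing data; the single wrapper at the edge is the only place that has to care about turning exceptions into client-visible errors.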

Well, that sort of explains about Nova's source code layout and how services talk to each other. Next time we'll dig into service managers and drivers to see how services are implemented on the callee side.