Wednesday, February 5, 2014

Getting From Here to There

As part of my work to make myself a better leader, I'm reading the book "What Got You Here Won't Get You There." The phrase itself resonates with me strongly now as I'm in reviews season and writing reviews for a number of senior engineers and engineering managers. One of the standard refrains I see in self-review "areas for improvement" is the desire to improve something very technology specific. A manager might say "I want to jump in more to the code so I can help take problems off my team." A senior engineer looking to grow to the staff engineer level might say something like "I want to get better at understanding distributed systems." Heck, even I am tempted to put "I want to spend more time in the early phases of architecture design so that I can help improve our overall technology stack." We're all falling victim to the problem of "what got us here won't get us there."

Engineers should spend their first several years on the job getting better at technology. I consider that a given and don't love to see that goal in reviews even for junior engineers, unless the technology named is very specific. Of course you're getting better at technology, programming, engineering, that's your job. You are a junior individual contributor, and your contribution is additive. You take tasks off of a team's list, get them done, create value by doing, and work on getting better at doing more and more. You got here because you put this coding time in.

After a certain point, it is more important to focus on what will make you a multiplier on the team. Very very few engineers write code of such volume and complexity that simply by writing code they enhance the entire organization. For most of us, even those in individual contributor roles, the value comes through our work across the team, teaching junior engineers, improving processes, working on the architecture and strategy so that we simply don't write as much code to begin with. There is a certain level of technical expertise that is necessary to get to this multiplier stage. As an individual contributor, a lot of that expertise is in knowing what you don't know. What do you need to research to make this project successful? You don't have to be a distributed systems expert but you should know when you're wading into CAP theorem territory.

It's harder for managers. Every time you switch jobs, you're interviewed with an eye towards the question, "can this person write code?" We don't trust managers that can't code, we worry that they're paper-pushers, out of touch. But when you hire an engineering manager, you often don't want them writing code. Managers that stubbornly hold onto the idea that they must write a lot of code are often either overworked, bottlenecks, or both. The further up you go in the management chain, the less time you will have to write code in your day job. The lesson here is not "managers should carve out lots of time to write code". It is, instead, don't get pushed into management until you've spent enough time coding that it is second nature. If you think that writing more code is the unlock for you to manage your team better, to grow as a better leader, you're going backwards. The way forward into true team leadership is not through writing more code and doing the team's scutwork programming. It is through taking a step back, observing what is working, what is not working, and helping your team fix it from a macro level. You're not going to code your team out of crunch mode by yourself, so spend your time preventing them from getting into crunch mode in the first place.

Coding is how you got here, engineering is what we do. But growing to levels of leadership requires more than just engineering. You've gotta go beyond what got you where you are, and get out of your comfort zone. Work on what makes you a multiplier, and you'll get there.

Wednesday, January 15, 2014

SOA and team structure

This week I sat on a panel to discuss my experiences moving to a service-oriented architecture (SOA) at Rent the Runway. Many of the topics were pretty standard: when to do such a thing, best practices, gotchas. Towards the end there was an interesting question. It was phrased roughly like this:
What do you do when you need to create a new feature, and it crosses all sorts of different services? How do you wrangle all the different teams so that you can easily create new features?
I think this is a great question, and illustrates one of the common misperceptions, and common failure modes, of going to a service-oriented architecture.

Let me be clear: SOA is not designed to separate your developers from each other. In my team, developers may work across many different services in order to accomplish their tasks, and we try to make it clear that the systems are "internal open source". You may not be the expert in that system, but when a feature is needed, it's expected that the team creating the feature will roll up their sleeves and get coding.

At big companies SOA is sometimes done in order to create areas of ownership and development. At my previous company, SOA was (among other things) a better way to expose data in an ad-hoc, but monitored, manner, without having to send messages or allow access to databases. The teams owning the services may be in a totally different area of the company from their clients. To access new data, you needed to coordinate with the owning team to expose new endpoints. It created overhead, but data ownership and quality of the services themselves was an important standard and losing velocity to maintain these standards was an acceptable tradeoff.

That is not the way a small startup should approach SOA, but if you don't anticipate this it may become an unintended outcome. When your SOA is architected as mine is, with a different language powering the services than that which powers the client experience (Java on the backend, Ruby on the frontend), you start to segregate your team into backend and frontend developers. We address this separation by working in business-focused teams that have both backend and frontend developers. Other SOA-based companies approach the challenge of separation by creating microservices that only do enough to support a single feature, so that a new feature doesn't always require touching existing functionality. Some companies do SOA with the same framework (say, Ruby on Rails) that powers their user-facing code, so there's no language or framework barrier to overcome when crossing service boundaries.

SOA is a powerful model for creating scalable software, but many developers are reluctant to adopt it because of this separation myth. There are many ways to approach SOA that work around this downside. Acknowledging the risk here and architecting your teams with an eye to avoiding it is important in successfully adopting this model in an agile environment.

Tuesday, December 31, 2013

2013: The Constant Introspection of Management

I've been a serious manager for a little over a year now, and it has been my biggest challenge of 2013. Previously I had managed small teams but looking back, while I thought that would prepare me to lead multiple teams and manage managers myself, in reality nothing could be further from the truth.

The hardest part for me of going from individual contributor/architect/tech lead to managing in anger has been the lack of certainty. I believe that, for experienced managers, management can have a level of rigor and certainty, but I'm not there yet. I am striving to be a compassionate manager and that requires developing a level of emotional intelligence that I have never before needed. And so as a result I would have to say that the past year has probably been one of the most emotionally draining of my life (and that is not even counting the baby I had in the middle of it).

I do not want to be a dispassionate leader who views people like pawns on a chessboard. But the emotional resilience that is required for management makes me understand how folks with that tendency may find it easier. A successful manager needs to care about her people without taking the things they do personally. Every person who quits feels like an indictment of all the ways you failed them. If only I had given better projects, fought harder for their salary, coached them better, done more to make them successful! If you wonder why it seems sometimes that management roles get taken up by sociopaths, just think that for every interaction you have with a difficult coworker, your manager has probably had to deal with ten of them. It's not a surprise that a certain bloody-mindedness develops, or, more likely, survives.

In addition to the general emotional angst of all those people, there's the general feeling of utter incompetence. As an engineer, I know how to design successful systems. I can look back on a career of successful projects, and I know many of the best practices for building systems and writing code. Right now, if I were to design a system that ultimately failed for a technical reason, I would be able to pinpoint where the mistakes were made. I am a beginner all over again when it comes to the big-league management game, and it's discouraging. I miss doing what I'm good at, building systems, and I'm afraid that I've given it up for something I may never do particularly well.

One of my friends who has faced the same struggle put it best. 
I like the autonomy/mastery/purpose model of drive. This feels like an issue with mastery. Not building means moving away from something where you have mastery to something new. There’s fear of losing mastery.[1]
As a new manager I believe you lose both autonomy and mastery for a time being, and arguably autonomy is lost forever. You are always only as good as your team, and while some decisions may ultimately rest on your shoulders, when you choose to take the "servant leadership" path you do sacrifice a great deal of autonomy. But I think for many engineers the loss of mastery hits hardest. When you've spent ten plus years getting really, really good at designing and developing systems, and you leave that to think about people all day? It's hard, and no, there isn't always time for side projects to fill the gap. In an industry that doesn't always respect the skills of management, this is a tough pill to swallow. After all, I can become the greatest manager in the world, but if I wanted to work in that role at Google they would still give me highly technical interviews.

So why do it? In the end, it has to be about a sense of purpose. I want to have a bigger impact than I will ever be able to have as an architect or developer. I know that leading teams and setting business direction is the way to ultimately scratch the itch I have for big impact, for really making a lasting difference. And I know that a great manager can have a positive impact on many, many people. So here's to growing some management mastery, and making 2014 a year of purpose and impact.

Wednesday, July 31, 2013

Replatforming? The Proof is in the Hackday

It's pretty common for teams, especially startups, to get to a point with their tech stack where the platform they've been working on just isn't cutting it anymore. Maybe it won't scale, maybe it won't scale to the increased development staff, maybe it was all written so poorly you want to burn it to the ground and start fresh. And so the team takes on the heroic effort we know and love: replatforming.

When you're replatforming because the current system can't handle the necessary load, it's pretty easy to see if your effort was successful. Load test, or simply let your current traffic hit it and watch it hum along where you once were serving 10% failures. But if the replatforming is done to help development scaling, how do you know the effort was a success?

I accidentally discovered one answer to this question today. You see, Rent the Runway has been replatforming our systems for almost the last two years. We've moved from a massive, and massively complex Drupal system to a set of Java services fronted by a Ruby thin client. Part of the reason for this was load, although we arguably could've made Drupal handle the load better by modernizing certain aspects of our usage. But a major reason for the replatforming was that we simply weren't able to develop code quickly and cleanly, and the more developers we added the worse this got. We wanted to burn the whole thing to the ground and start fresh.

We didn't do this, of course. We were running a successful business on that old hideous platform. So we started, piece by piece, to hollow out the old Drupal and make it call Java services. Then, with the launch of Our Runway, we began to create our new client layer in Ruby using Sinatra. Soon we moved major pages of our site to be served by Ruby with Java on the backend. Finally, in early July, we moved our entire checkout logic into Java, at which point Drupal is serving only a handful of pages and very little business logic. 

So yesterday we had a hackday, our first since the replatforming. We do periodic hackdays although we rarely push the entire team to participate, and yesterday's hackday was one of the rarer full-team hackdays. Twenty-some tech team members and four analytics engineers participated, with demos this morning. I was blown away with what was accomplished. One project by a team of four enabled people to create virtual events based on hashtags and rent items to those events, pulling in data from other social media sources and providing incentives to the attendees to rent by giving them credits or other goodies as more people rented with that hashtag. This touched everything from reservations to checkout. We had several projects done by solo developers that fixed major nasty outstanding issues with our customer service apps, and some very nice data visualizations for both our funnel and our warehouse. All told we had over ten projects that I would like to see in production, whether added to existing features, as new features, or simply in the data dashboard displayed in our office. 

Compare this to the last full-team hack day and the differences are striking. That hack day had very few fully functional projects. Many were simulations of what we could do if we could access certain data or actions. Most didn't work, and only a handful were truly something we could productionize. The team didn't work any less hard, nor was it any less smart than it is now. The major difference between that hack day and now is that we've replatformed our system. Now adding new pages to the site is simple and doesn't require full knowledge of all the legacy Drupal. Getting at data and actions is a matter of understanding our service APIs, or possibly writing a new endpoint. Even our analytics data quality is better. 

So if you're wondering whether your replatforming has really made a difference in your development velocity, try running a hackday and seeing what comes out. You may be surprised what you learn, and you'll almost certainly be impressed with what your team can accomplish when given creative freedom and a platform that doesn't resist every attempt at creativity.

Monday, May 20, 2013

ZooKeeper and the Distributed Operating System

From the draft folder, this sat moldering for a few months. Since I think the topic of "distributed coordination as a library/OS fundamental" has flared up a bit in recent conversations, I present this without further editing.

I'm preparing to give a talk at Ricon East about ZooKeeper, and have been thinking a lot of what to cover in this talk. The conference is focused on distributed systems at a somewhat advanced level, so I'm thinking about topics that expand beyond the basics of "what is ZooKeeper" and "how do you use it." After polling twitter and getting some great feedback I've decided to focus on the question that many architects face: When should I use ZooKeeper, and when is it overkill?

This topic is interesting to me in many ways. In my current job as VP of Architecture at Rent the Runway, we do not yet use ZooKeeper. There are things that we could use it for, but in our world most of the distributed computing we do is pure horizontally scalable web services. We're not yet building out complex networks of servers with different roles that need to be centrally configured, managed, or monitored beyond what you can easily do with simple load balancers and nagios. And many of the questions I answer on the ZooKeeper mailing list are those that start with "can ZK do this?" The answer that I prefer to give is almost always "yes, but keep these things in mind before you roll it out for this purpose." So that is what I want to dig into more in my talk.

I've been digging into a lot of the details of ZAB, Paxos, and distributed coordination in general as part of the talk prep, and hit on an interesting thought: What is the role of ZooKeeper in the world of distributed computing? You can see a very clear breakdown right now in major distributed systems out there. There are those that are full platforms for certain types of distributed computing: the Hadoop ecosystem, Storm, Solr, Kafka, that all use ZooKeeper as a service to provide key points of correctness and coordination that must have higher transactional guarantees than these systems want to build intrinsically into their own key logic. Then there are the systems, mostly distributed databases, that implement their own coordination logic: MongoDB, Riak, Cassandra, to name a few. This coordination logic often makes different compromises than a true independent Paxos/ZAB implementation would make; for an interesting conversation check out a Cassandra ticket on the topic.

In thinking about why you would want to use a standard service-type system vs implementing your own internal logic, it reminds me very much of the difference between modern SQL databases and the rest of the application world. The best RDBMSs are highly tuned beasts. They cut out the middleman as much as possible, taking over functionality from the OS and filesystem as it suits them to get absolutely the best performance for their workload. This makes sense. The competitive edge to the product they are selling is its performance under a very well-defined standard of operation (SQL with ACID guarantees), as well as ease of operation. And in the new world of distributed databases, owning exactly the logic for distributed coordination (and understanding where that logic falls apart in the specific use cases for that system) will very likely be a competitive edge for a distributed database looking to gain a larger customer base. After all, installing and administering one type of thing (the database itself) is by definition simpler than installing and administering 2 things (the database plus something like ZooKeeper). It makes sense to prefer to burn your own developer dollars to engineer around the edge cases, so as to make a simpler product for your customers.

But ignoring the highly tuned commercial case of distributed databases, I think that ZooKeeper, or a service like it, is a necessary core component of the "operating system" for distributed computing. It does not make sense for most systems to implement their own distributed coordination, any more than it makes sense to implement your own file system to run your RESTful web app. Remember, to do distributed coordination successfully requires more than just, say, a client library that perfectly implements Paxos. Even with such a library, you would need to design your application up-front to think about high availability. You need to deploy it from the beginning with enough servers to make a sane quorum. You need to think about how the rest of the functioning of your application (say, garbage collection, startup/shutdown conditions, misbehavior) will affect the functioning of your coordination layer. And for most of us, it doesn't make sense to do that up-front. Even the developers at Google didn't always think in such terms, the original Chubby paper from 2006 mentions most of these reasons as driving the decision to create a service rather than a client library.

Love it or hate it, ZooKeeper or a service like it is probably going to be a core component of most complex distributed system deployments for the foreseeable future. Which is all the more reason to get involved and help us make it better.

Friday, February 8, 2013

Branching Is Easy. So? Git-flow Is Not Agile.

I've had roughly the same conversation four times now. It starts with the question of our deployment/development strategy, and some way in which it could be tweaked. Inevitably, someone will bring up the well-known git branching model blog post. They ask, why not use this git-flow workflow? It's very well laid out, and relatively easy to understand. Git makes branching easy, after all. The original blog post in fact contends that because branching and merging is extremely cheap and simple, it should be embraced.
As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else.
But here's the thing: There are reasons beyond tool support that would lead one to want to encourage or discourage branching and merging, and mere tool support is not reason enough to embrace a branch-driven workflow.

Let's take a moment to remember the history of git. It was developed by Linus Torvalds for use on the Linux project. He wanted something that was very fast to apply patches, and supported the kind of distributed workflow that you really need if you are supporting a huge distributed team. And he made something very, very, very fast, great for branching and distributed work, and difficult to corrupt.

As a result git has many virtues that align perfectly with the needs of a large distributed team. Such a team has potentially long cycles between an idea being discussed, being developed, being reviewed, and being adopted. Easy and fast branching means that I can go off and work on my feature for a few weeks, pulling from master all the while, without having a huge headache when it comes to finally merge that branch back into the core code base. In my work in ZooKeeper, I often wish I bothered to keep a git-svn sync going because reviewing patches is tedious and slow in svn. Git was made to solve my version control problems as an open source software provider.

But at my day job, things are different. I use git because a) git is FAST and b) Github. Fast makes so much of a difference that I'm willing to use a tool with a tortured command line syntax and some inherent complexity. Github just makes my life easier, I like the interface, and even through production outages I still enjoy using it. But branching is another story. My team is not a distributed team. We all sit in the same office, working on shared repositories. If you need a code review you can tap the shoulder of the person next to you and get one in 5 minutes. We release frequently; I'm trying to move us into a continuous delivery model that may eventually become continuous deployment if we can get the automation in place. And it is for all of these reasons that I do not want to encourage branching or have it as a major part of my workflow.

Feature branching can cause a lot of problems. A developer working on a branch is working alone. They might be frequently pulling in from master, but if everyone is working on their own feature branch, merge conflicts can still hit hard. Maybe they have set things up so that an automated build will still run through every push they make to that branch, but it's just as likely that tests are only being run locally and the minute this goes into master you'll see random failures due to the various gremlins of all software development. Worst of all, it's easy for them to work in the dark, shielded from the eyes of other developers. The burden of doing the right thing is entirely on the developer and good developers are lazy (or busy, or both). It's too easy to let things go for too long without code review, without integration, and without detecting small problems. From a workflow perspective, I want something that makes small problems come to light very early and obviously to the whole team, enabling inherent communication. Branching doesn't fit this bill.

Feature branching also encourages thinking about code and features as all or none. That makes sense when you are delivering a packaged, versioned product that others will have to download and install (say, Linux, or ZooKeeper, or maybe your iOS app). But if you are deploying code to a website, there is no need to think of the code in this binary way. It's reasonable to release code behind feature flags that is not complete but flagged off, for purposes of keeping the integration of that new code in for testing in other environments. Learning how to write code in such a way as to be chunkable, flaggable, and almost always safe to go into production is a necessary skill set for frequent releases of any sort, and it's essential if you ever want to reach continuous deployment.

Release branching may still be a necessary part of your workflow, as it is in some of our systems, but even the release branching parts of the git-flow process seems a bit overly complex. I don't see the point in having a develop branch, nor do I see why you would care about keeping master pristine, since you can tag the points in the master timeline where you cut the release branch. (As an aside, the fact that the original post refers to "nightly builds" as the purpose of the develop branch should raise the eyebrows of anyone doing continuous integration.)  If you're not doing full continuous deployment you need to have some sort of branch that indicates where you cut the code for testing and release, and hotfixes may need to go into up to two places, that release branch and master, but git-flow doesn't solve the problem of pushing fixes to multiple places. So why not just have master and release branches? You can keep your release branches around for as long as you need them to get live fixes out, and even longer for historical records if you so desire.

Git is great for branching. So what? Just because a tool offers a feature, and does it well, does not mean that feature is actually important for your team. Building a whole workflow around a feature just because you can is rarely a good idea. Use the workflow that your team needs, don't cargo cult an important element of your development process.

Sunday, December 30, 2012

Make it Easy

One of my overriding principles is: make it easy for people to do the right thing.

This seems like it should be a no-brainer, but it was not always obvious to me. Early in my career I was a bit of a self-appointed build cop. The team I worked on was an adopter of some of the agile/extreme programming principles, and the result of that was a 40+ person team all working against the same code base, which was deployed weekly for 3 distinct business purposes. All development was done against trunk, using feature flags. We managed to do this through the heavy use of automated unit/integration testing; to check code in, you were expected to write tests of course, and to run the entire test suite to successful completion before checking in.

Unsurprisingly, people did this only to a certain level of compliance. It drove me crazy when people broke the build, especially in a way that indicated they had not bothered to run tests before they checked in. So I became the person that would nag them about it, call them out for breaking things, and generally intimidate my way into good behavior. Needless to say, that only worked so well. People were not malicious, but the tests took a LONG time to run (upwards of 4 hours at the worst), and on the older desktops you couldn't even get much work done while the test suite ran. In the 4 hours that someone was running tests another person might have checked in a conflicting change that caused errors; was the first person really supposed to re-merge and run tests for another 4 hours to make sure things were clean? It was an unsustainable situation. All my intimidation and bullying wasn't going to cause perfect compliance.

Even ignoring people breaking the build, this was an issue we needed to tackle. And so we did, taking several months improve the overall runtime and make things easier. We teased out test suites into specific ones for the distinct business purposes combined with a core test suite. We made it so that developers could run the build on distributed hardware from their local machine. We figured out how to run certain tests in parallel, and moved database-dependent tests into in-memory databases. The test run time went way down, and even better, folks could kick off the tests remotely and continue to work on their machine, so there was much less reason to try and sneak in an untested change. And lo and behold, compliance went way up. All the sudden my build cop duties were rarely required, and the whole team was more likely to take on that job rather than leaving it to me.

Make it easy goes up and down the stack, far beyond process improvements. I occasionally find myself at odds with folks that see the purity of implementing certain standards and ignore the fact that those standards, taken to extreme, make it harder for people to do the right thing. One example is REST standards. You can use the http verbs to modify the meanings of your endpoints and make them do different things, and from a computer-brain perspective, this is totally reasonable. But this can be very bad when you must add the human brain perspective to the mix. Recently an engineer proposed that we change some endpoints from being called /sysinfo (which would return OK or DEAD depending on whether a service was accepting requests), and /drain (which would switch the /sysinfo endpoint to always return DEAD), into one endpoint. That endpoint would be /sys/drain. When called with GET, it would return OK or DEAD. When called with PUT, it would act as the old drain.

To me, this is a great example of making something hard. I don't see the http verb, I see the name of the endpoint, and I see the potential for human error. If I'm looking for the status-giving endpoint, I would never guess that it would be the one called "drain", and I would certainly not risk trying to call it to find out. Even knowing what it does, I see myself accidentally calling the endpoint with GET, now I didn't drain my service before restarting it. Or I accidentally called it with PUT and now it's been taken out of the load balancer. To a computer brain, GET and PUT are very different, and hard to screw up, but when I'm typing a curl or using postman to call an endpoint, it's very easy for me as a human to make a mistake. In this case, we're not making it easy for people using the endpoints to do the right thing, we're making it easy for them to be confused, or worse, to act in error. And to what benefit? REST purity? Any quest for purity that ignores human readability does so at its peril.

All this doesn't mean I want to give everyone safety scissors. I generally prefer to use frameworks that force me and my team to do more implementation work rather than making it trivially easy. I want to make the "easy" path the one that forces folks to understand the implementation to a certain level of depth, and encourages using only the tools necessary for the job. This makes better developers of my whole team, and makes debugging production problems more science than magic, not to mention the advantage it gives you when designing for scale and general future-proofing.

Many great engineers are tripped up by human nature, when there's really no need to be. Look at your critical human-involving processes and think: am I making it easy for people to do the right thing here? Can I make it even easier? It might take more work up front on your part, or even more verbosity in your code, but it's worth it in the long run.