Jeff Ruane

Lessons Learned

In September, my current employer hired me to work on a large, complex project with a tight deadline. That deadline was a few days ago, so it's time to pause and reflect on the last few months. A strict NDA covers this project, so I can't go into too many details, but I hope it's still worth sharing. If nothing else, writing this will help me sort out my thoughts.

Technical project managers and account managers had already worked with the clients for several months by the time I was hired, so the requirements were more or less set from the start. Requirements drifted, of course, but to my surprise, drift and scope creep were minimal overall. I was hired because I have specific experience with the data involved: public utility data (energy bills and meter readings, for example). I felt comfortable almost immediately. I knew the pitfalls and common issues we would encounter, so I was confident.

Unfortunately, I had mistaken overconfidence for confidence. I was convinced of my ability to complete the work quickly and give accurate estimates. That conviction became a blind spot: I would assure stakeholders that we were on track even though I should've expected "unknown unknowns," in the great words of a war criminal.

During my first few weeks, I performed some HR-required onboarding tasks, talked to IT/DevOps about access, and started reviewing some of the relevant code bases. After that, I spent a few weeks meeting with the non-technical project team and creating architecture plans with my boss. Several weeks spent poring over data specifications with stakeholders before writing code may seem like overkill when the timeline was only three months long. However, the longer I've worked in data, the more I'm convinced that this work isn't just "best practice." It's essential.

When data specifications and objects aren't carefully examined, defined, and checked for internal consistency, the resulting problems snowball over time. For example, significant toil occurs when the engineering team doesn't know the difference between fields like bill_amount and total_bill_amount. If the project team doesn't know either, the question goes back to the client, who has to track down whoever designed or maintains the system. It's not unheard of that nobody in our organization or the client's knows the difference between two fields, or the precise definition of a single field. When that happens, either the client has to redesign their system or engineers are forced to guess what the data means, and an excruciating operational burden is guaranteed to follow. Data will eventually be corrupted, and engineers will have to figure out how to remove the corrupt data and replace it with (hopefully!) corrected data. Often, this happens on a live production database. I promise you, dear reader, these are "when, not if" scenarios if the work isn't done upfront to define data objects down to the most minute detail.
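To make the idea concrete, here's a tiny sketch of the spec-first mindset. The field names and definitions below are invented for illustration (this is not our actual schema), but the principle is real: once every field has a written definition, an ingestion-time check can catch undefined fields before they become a production cleanup job.

```python
# Hypothetical spec: every field gets an explicit, written definition.
# Note how close "bill_amount" and "total_bill_amount" are -- exactly the
# kind of pair that causes confusion when left undefined.
SPEC = {
    "bill_amount": "Charges for the current billing period only, pre-tax",
    "total_bill_amount": "Total due, including taxes, fees, and carried-over balance",
    "meter_reading_kwh": "Cumulative meter register value, in kWh",
}

def undefined_fields(record):
    """Return any fields in an incoming record that the spec never defined."""
    return [field for field in record if field not in SPEC]

# "total_due" was never specified -- caught at ingestion time,
# not at 2 a.m. on a live production database.
record = {"bill_amount": 83.12, "total_due": 91.40}
print(undefined_fields(record))  # prints ['total_due']
```

A check this simple won't resolve what a field *means*, but it forces the "nobody knows the difference" conversation to happen before the data lands, not after.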

With data definitions and architecture diagrams in hand, it was time to start on the technical implementation. The road got rockier at that point. To further complicate matters, most of the engineers are located in India, which made even the most straightforward communication difficult.

This is not to say that asynchronous work can't be done well. Plenty of organizations have figured out how to make it work (GitLab, with their Handbook, being an excellent example). These solutions are not one-size-fits-all, however. They're usually built on department- or organization-wide agreements and a strong culture of enforcing those agreements. Plenty of tools help with asynchronous work (Jira, Git, Slack, email), but they only reduce friction for an existing communication strategy. They can't be the solution in and of themselves.

However, I'd previously worked with many people from the team in India. I even flew to Mumbai a few years ago and worked in person with them. "This is old hat," I figured, "I know how to make this engineering org structure work." Again, overconfidence.

The project's technical lead lives in India as well. He's an incredibly sharp young man, not to mention friendly and helpful, and I have a tremendous amount of respect for him. That said, having to run technical decisions past someone across an 11.5-hour time difference creates a painful situation. I didn't feel empowered to make decisions independently, as the only answer I ever got regarding architecture was "Ask Ravi." There were two engineers in my local timezone I could talk to about issues, but they had neither much experience in the data domain nor the authority to make those kinds of decisions.

When the tech lead took a well-deserved vacation for a week, my confidence started to wane, and I slowly became aware that the project could be in jeopardy. I realized that the bus factor² was dangerously low. If one person being gone for a week can grind development to a halt, or his boss has to interrupt his vacation for work-related issues, we have a severe problem. This was intensified because, as I mentioned, the project is under an NDA, so there was a limited pool of people I could work with on issues as they came up. At one point, one of those few people took paternity leave for a month, leaving a single North American engineer working on this with me. Bear in mind that this project was significant enough that the company might not make it if we didn't finish on time.

Around this same time, with a little over a month and a half until the deadline, I started working on implementation. I quickly learned that our AWS environments have an extremely conservative access policy. DevOps is shielded from the rest of the organization (which is fine—it's a small team, and if everyone who needed help Slacked them directly, I'd be worried for their sanity). To make an IT or DevOps request, we create a ticket through their ticketing system and wait. This is also fine; a sustainable workflow for IT teams is hard to achieve, and this is a good start. The issues arose when I discovered that I had access to almost nothing I needed to do my job. I didn't have permission to look at the logs of a failed job run. I didn't have permission to examine IAM roles and policies to debug AWS permission issues. I didn't have permission to create a new instance to test something. Mind you, dear reader, this was in the development AWS account, not production. A few times, I had to create a ticket just for an SRE to look at a log and tell me what it said. That's too strict an access policy, and it wreaks havoc on the development cycle.
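For the curious: the access I was missing is cheap to grant in a dev account. Below is a sketch of the kind of read-only IAM policy document that would have unblocked me. The action names are real IAM actions for CloudWatch Logs and IAM inspection, but the wide-open `"Resource": "*"` scoping is deliberately naive for illustration, not a recommendation for any particular org.

```python
import json

# Illustrative read-only policy for a *development* account.
# Grants log reading and IAM inspection -- nothing here can modify anything.
DEV_READONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Enough to read the logs of a failed job run myself
            "Sid": "ReadLogs",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
            ],
            "Resource": "*",
        },
        {
            # Enough to inspect (not change) roles when debugging permissions
            "Sid": "InspectIam",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:ListRolePolicies",
                "iam:ListAttachedRolePolicies",
            ],
            "Resource": "*",
        },
    ],
}

print(json.dumps(DEV_READONLY_POLICY, indent=2))
```

Attach something like this to developer roles in the dev account, and "create a ticket so an SRE can read a log to me" disappears from the workflow entirely.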

Eventually, the CTO and the director of engineering caught wind that there could be an issue with that timeline. They talked to me in ways that I was not comfortable with. I was scolded repeatedly in meetings with several peers over a week or two. I would try to give better updates, leave notes all over Jira and Slack, and work until 2 or 3 in the morning, but it was never sufficient. My CTO asked me directly, "What have you even been doing the last two months?" He told me and my boss that he didn't think we had any idea what we were doing. I was furious and heartbroken. I felt worthless.

I don't want to speak poorly of anybody. On a personal level, he's a sweet and empathetic person. Still, in professional conversations, he can be blunt, and he may have let his anxiety about delivering the project get the best of him. He's the CTO, and he shouldn't beat around the bush; I get it. However, he crossed the line at a few points, and it left me feeling like I had a target on my back. In a blameless culture, it doesn't matter how we got into a situation; what matters is where we go from here, and I don't think any part of how things went down was in that spirit. If he had performance concerns, I would've preferred a private conversation. I haven't had a one-on-one with him since, but I plan to express my opinions when I do.

In retrospect, he was partially correct. We should have raised concerns that the design phase was taking longer than expected. That time was well spent, however, and I will fight for that position: if we had spent significantly less time on it, "data issues" would now be on the list of blockers and delays, and we'd likely still be working through bugs. Those issues take significant time to solve because they often require client input. We also should've asked him to assign more people to the project sooner. On the other side of the coin, though: how were the CTO and director of engineering unaware of timeline risks until we were at a crisis point? I'll take some responsibility. I should have raised alarms once I realized how slow the development cycle would be. But if this is a make-or-break project for the company, it's in their job description to ensure things are on track and progressing smoothly. That's why they make the big bucks, and I won't take responsibility for an executive neglecting one of their fundamental job functions.

Since those conversations, development has been progressing at a feverish pace. He assigned two more engineers and a couple of QA engineers. As of last week's deadline, we are feature-complete and working through lingering edge cases. A few of us had to pull a few all-nighters. There may have been a few tears shed, but we fucking did it, and it feels incredible. These last few days could be the textbook definition of teamwork, dedication, and grit.

Back to my overconfidence, though. I had done so many projects similar to this one that I thought I knew it all. But this was a new company, new infrastructure, and a new culture. Sure, I was an expert in the data when I was hired. However, the development cycle was still an unknown. The project workflow and stakeholder expectations were still unknowns. The relationship with IT/DevOps was unknown. I had no business confidently giving a timeline without knowing those things. That confidence blinded me, too. I knew, of course, that I'd run into hurdles, but I thought, "I can figure out anything with some time." But if I can't access the logs, the infrastructure the project runs on, or the roles and policies governing that infrastructure, then no amount of time will let me figure it out. I relied on other teams in ways I didn't expect, and I didn't recognize that blind spot until very late in the project timeline.

The lesson here isn't to be noncommittal about timelines or to live in constant self-doubt. The lesson is to learn the right questions to ask, and to ask them of the people who can help. Passivity is an absolute killer. If I had actively sought answers to these issues early, I could've talked to DevOps and had tickets queued up for them to work through at a normal pace, all while the architecture planning continued on schedule. I'm called a senior engineer, and this process has greatly humbled me. I still need to build more technical leadership skills before taking the next step to staff engineer.

  1. For additional context, Andrew Bosworth wrote an excellent blog post on the topic, which I'd highly recommend.

  2. From Wikipedia: "The bus factor is a measurement of the risk resulting from information and capabilities not being shared among team members, derived from the phrase 'in case they get hit by a bus.'"

#work #writing