Governance Controls and DevOps

This post describes the impact DevOps methodologies and tooling can have on the existing governance controls at established companies. I’ve been on a journey the past few years in exactly this space and thought sharing that journey might be interesting for others. To be clear, the journey is still very much in flight. Some of what you’ll read is aspirational, some is actual and some has evolved.

Early in my career, I was part of a team of five engineers at a small microelectronics company. We did everything. Milling the hardware fixtures for our test systems. Writing the software to execute a suite of tests on a particular hybrid electronic device. Recording, packaging and sending the test data. Internal auditing ourselves…what? Whatever the task was, one or all of us would take it on.

Fast forward and I was working at a company writing software for point-of-sale systems. For one particular project, I needed a server and submitted a request to the “infrastructure team”. Approximately three months later, a meeting showed up on my calendar and we talked about what that server should be, why I needed it, how long it would be before I got it, etc. Once the server was delivered, there was a whole other “release management” team I needed to engage to get builds and deploys running. This was when I started to understand the concept of segregation of duties and the governance controls that require it.

In my current role, I work in the infrastructure engineering space: we build servers, deploy apps, harden for PCI and do all sorts of other odd jobs. I’ve been faced with a challenge: how can I ensure my engineers don’t have to monitor a queue, and how can I ensure the developers and site reliability engineers (aka full stack engineers) are able to deploy at will rather than wait in an infrastructure queue? It is our queue the application developers dread. To our application developer partners the answer is easy: “just let me log into the server. Oh, and can I just build my own servers too?” This is where my Information Security and Governance partners remind me that we have controls, that our vendors and legal contracts require us to follow those controls, and that simply letting the application development engineers onto the servers with full control is unacceptable.

I found myself in an interesting spot. I identify as a software engineer, and being able to do whatever I want is super appealing. As someone accountable for the infrastructure engineering discipline, I understand the controls and the ramifications of not adhering to them, so what my engineering mind wants to do is super appalling. I figured there had to be some happy place between appealing and appalling. I discovered these governance controls are not set in stone: we can talk about them, evolve them to reflect how the teams work best and align them to agile and DevOps methodologies. Thus the journey began.

I’m a big fan of this framework when it comes to tackling architecture, process, engineering and general problem solving: [problem > attributes of problem > pattern to be solved > potential solutions > solution]

Here are the problem statements that define our starting reality:

  • My governance controls require segregation of duties between infrastructure and application teams
  • Application teams practice both agile and DevOps principles and embody site reliability engineering methodologies
  • Site reliability engineers need to support their applications
  • Applications cannot be deployed manually on the server by the developer who wrote the code, in order to prevent collusion
  • Only infrastructure engineers are allowed to be “root” on servers
  • Only infrastructure engineers are allowed to even log onto a server
  • Periodic reviews are conducted to collect evidence to demonstrate all of these statements are accurate and being met

If I boil these problem statements down to attributes that define the desired behavior, I have three. To me they are simple attributes:

  • Engineers are unable to take unilateral action on production changes
  • Engineers are able to access servers at appropriate levels for support
  • Unexpected actions are detected

These attributes are generalized to “engineer” rather than infrastructure engineer or application developer. Infrastructure as code is a thing; we should generalize the craft of engineering across infrastructure and applications because there are all sorts of automation parallels we’ll benefit from. Taking each attribute to a deeper level…

Engineers are unable to take unilateral action on production changes
We are describing a few things here. We want engineers to take action and make changes in production. However, we do not want those changes to be unilateral, which means someone else has to be involved. Ideally the other engineer involved will have domain knowledge of the change being pushed into production. This is how application development engineers work and it is how infrastructure engineers should work. ‘Unable’ means not only that another engineer is involved, but that the change cannot programmatically move to production until some sort of approval is recorded.

Engineers are able to access servers at appropriate levels for support
On-call can suck and everyone knows it, yet accepts it as part of their professional life in technology. With the adoption of agile we expect our application teams to support, in production, the code they have crafted. Yet we don’t let them onto the server…? We cannot have unilateral action, so we cannot have access to servers with the ability to make changes. Worth noting: the changes we are concerned with are binary / compiled file changes and state changes, not restarting a service or truncating logs due to space errors. “Are able to access” does not automatically mean root access for all! Rather, there are access gradients between root and read-only.
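
As a rough sketch of what those gradients could look like, here is a small Python illustration mapping support roles to allowed actions. The role names and actions are made up for this post, not our actual policy.

    # Illustrative only: role names and actions are placeholders, not real policy.
    ACCESS_GRADIENTS = {
        "read-only": {"read_logs", "view_configs"},
        "operator": {"read_logs", "view_configs", "restart_service", "truncate_logs"},
        "admin": {"read_logs", "view_configs", "restart_service", "truncate_logs",
                  "deploy_binaries", "change_state"},
        "root": {"*"},  # unrestricted, reserved for infrastructure engineers
    }

    def is_allowed(role, action):
        """Return True if the given role may perform the given action on a server."""
        allowed = ACCESS_GRADIENTS.get(role, set())
        return "*" in allowed or action in allowed

    # An on-call site reliability engineer can restart a service...
    print(is_allowed("operator", "restart_service"))  # True
    # ...but cannot push a new binary onto the box.
    print(is_allowed("operator", "deploy_binaries"))  # False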

Unexpected actions are detected
Segregation of duties is all about ensuring the right groups are taking expected actions. If we’re going to change how the world works, we need some ability to detect when unexpected actions take place. This is part of the auditing and evidence for the various compliance and governance bodies to whom we are accountable. One nuance of this attribute is how we know an action is unexpected. For us, we’ve decided that the scope is all actions: every action is evaluated, and we detect the ones we deem unexpected among them. Another nuance is where we detect these actions. On servers? In databases? With tools? The answer is yes to all.

There are three patterns that best describe the attributes of the problem we’re trying to solve. These patterns align to the three attributes described above:

  • Peer reviews and pipelines
  • Access / automation
  • Anomaly detection (information radiation)

Peer reviews and pipelines
All sorts of goodness comes from performing peer reviews. Engineers cross-train each other on the code they are crafting. The opportunity for collusion is reduced because no change has fewer than two sets of eyes on it. Tools can enforce peer reviews as a programmatic gate before allowing changes to flow to an environment, and they can record actions, leaving a trail of activity. Pipelines also provide all sorts of goodness. Code is sent through frameworks into environments, including production. Checks and balances can be built into the framework for things like undesired actions (e.g. “delete * from”) and security scans. Automated testing is also very much a thing when you have pipelines.
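
To make the idea of a programmatic gate concrete, here is a minimal Python sketch that blocks a change if no peer approval is recorded or if the change contains an undesired action. The function name, pattern list and data shapes are mine for illustration; they are not a description of any particular tool.

    import re

    # Illustrative patterns we never want flowing toward production; a real gate
    # would have a far more extensive (and smarter) list of checks and scans.
    UNDESIRED_PATTERNS = [
        re.compile(r"delete\s+\*?\s*from", re.IGNORECASE),  # destructive SQL
        re.compile(r"rm\s+-rf\s+/", re.IGNORECASE),         # destructive shell
    ]

    def gate_change(diff_text, author, approvers):
        """Return True only if the change may flow to the next environment."""
        # 1. Peer review: at least one approver who is not the author.
        if not (set(approvers) - {author}):
            print("Blocked: no peer approval recorded.")
            return False
        # 2. Undesired-action scan over the proposed change.
        for pattern in UNDESIRED_PATTERNS:
            if pattern.search(diff_text):
                print(f"Blocked: change matches undesired pattern {pattern.pattern!r}.")
                return False
        return True

    # A change approved only by its own author does not pass the gate.
    print(gate_change("UPDATE config SET ttl = 60;", "alice", ["alice"]))         # False
    print(gate_change("UPDATE config SET ttl = 60;", "alice", ["alice", "bob"]))  # True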

Access / Automation
Defining the access teams need to be productive, while still providing a slight boundary between team responsibilities, is something we value. This is in contrast to all access or no access. It means we define a pattern of access around responsibilities in the organization. Engineers across many teams may need to get on a server. Some will need to read logs. Some will need to restart services. Others may need to bend the server to their will. Automation is a complementary pattern to the access pattern. We’ve taken the stance that no human should ever ‘have’ to get on a server. When a human does, it is an exception: there is a broken process or missing capability, and we should go develop some automation that allows our engineers to support their ecosystem.
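
One way to picture the “automation instead of a shell” stance is a catalog of pre-approved server actions that engineers invoke through tooling rather than by logging on. The catalog entries, function and audit line below are hypothetical, a sketch of the idea rather than our actual implementation.

    # Hypothetical catalog of pre-approved, automated server actions.
    ACTION_CATALOG = {
        "restart_service": {"requires_ticket": False},
        "truncate_logs": {"requires_ticket": False},
        "reboot_server": {"requires_ticket": True},
    }

    def run_action(action, server, engineer, ticket=None):
        """Run a catalogued action on a server on behalf of an engineer."""
        entry = ACTION_CATALOG.get(action)
        if entry is None:
            # Not in the catalog: the exception case, i.e. a broken process or
            # missing capability we should go build automation for.
            raise ValueError(f"No automation exists for {action!r}; request it.")
        if entry["requires_ticket"] and not ticket:
            raise PermissionError(f"{action!r} requires a change ticket.")
        # Every invocation leaves a trail for the detection pattern below.
        print(f"AUDIT {engineer} ran {action} on {server} ticket={ticket}")
        # ...dispatch to the actual automation tooling here...

    run_action("restart_service", "merch-db-01", "alice")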

Anomaly detection (information radiation)
To make decisions or detect events, much less declare those events anomalies, we need information radiation. (Note to self: dedicate a post to information radiation.) The pattern here is that our tools log useful information. Our assets log useful information. The logs are picked up and streamed to an aggregation point. This useful information contains data elements that enable us to correlate events across tools and assets. These events are classified as things we expect, things we do not expect and the favorite ‘other’ category. We then bucket those events and set up an action for each: ignore, alert or investigate.
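
In code, the bucketing could look something like this sketch: each correlated event is classified as expected, unexpected or ‘other’ and mapped to ignore, alert or investigate. The sources, actions and rules are placeholders I made up for illustration.

    # Placeholder rules: which (source, action) pairs we expect or explicitly do not.
    EXPECTED = {
        ("jenkins", "deploy"),     # pipeline-driven deploys are expected
        ("chef", "configure"),     # configuration management runs are expected
    }
    UNEXPECTED = {
        ("ssh", "binary_change"),  # a human changing binaries over SSH is not
    }

    def classify(event):
        """Bucket a correlated event and return the response action."""
        key = (event.get("source"), event.get("action"))
        if key in EXPECTED:
            return "ignore"
        if key in UNEXPECTED:
            return "alert"
        return "investigate"       # the favorite 'other' bucket

    print(classify({"source": "jenkins", "action": "deploy"}))      # ignore
    print(classify({"source": "ssh", "action": "binary_change"}))   # alert
    print(classify({"source": "coffee_maker", "action": "brew"}))   # investigate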

Finally, some solutions so we can go and solve something. Taking the same three patterns, I’ll describe what we have implemented, are thinking about implementing or are somewhere in between on.

Peer reviews and pipelines
We use source control (Git) and its mechanisms of repositories, forks, pull requests and commits. Engineering teams use this internally as well as externally with other engineering teams where collaboration happens. The aspirational component is that we envision a world of completely open-sourced (internally) repos where any team can view and suggest changes. The pipeline is how the code from that repository, upon a successful commit, gets delivered to production, or to any environment. Pipeline tools include Chef and Jenkins, and they offer a nice non-human-on-asset way of touching systems. The engineers who traditionally make up a segregated group, like Linux engineers, create this pipeline with the appropriate access controls and connect it to the appropriate repositories. The access control point is maintained at the source control level: teams can fork and open pull requests, while the owning engineering team is the one to make the commit.
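
As a hedged sketch of the check the pipeline could run before promoting a merge, here is a tiny Python illustration; the pull request record, team roster and names are invented, and in practice the source control tool’s own access controls (e.g. protected branches) would enforce this.

    # Invented-for-illustration team roster and pull request record.
    OWNING_TEAM = {"linux-eng-1", "linux-eng-2"}

    def may_promote(pr):
        """Promote only changes merged by the owning team and reviewed by a peer."""
        merged_by_owner = pr["merged_by"] in OWNING_TEAM
        peer_reviewed = bool(set(pr["approvers"]) - {pr["author"]})
        return merged_by_owner and peer_reviewed

    pr = {
        "author": "app-dev-1",         # another team forked and opened the pull request
        "approvers": ["linux-eng-1"],  # the owning team reviewed it
        "merged_by": "linux-eng-2",    # and an owning-team engineer made the commit
    }
    print(may_promote(pr))  # True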

Access / Automation
Our engineers use secrets to access servers. Servers are classified, loosely, by type and department; an example would be a database server in merchandising. Engineers have a role and a department. The solution is simple: engineers’ secrets are placed on assets according to roles and departments. We define a scope for what any given role can do on a server and use automation to land the appropriate secrets on the appropriate servers with the appropriate role. Pretty simple. Automation is the answer for the things engineers are unable to do with their role-defined access levels. We view this situation as an exception. Can’t restart the whole box? There should be automation to facilitate that action. Can’t disable AV? That’s good, you shouldn’t.
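
A minimal sketch of that mapping, with made-up roles, departments and server names: compute which engineers’ secrets (say, SSH public keys) should land on which servers, and with what scope.

    # Made-up inventory, engineers and scopes for illustration only.
    SERVERS = [
        {"name": "merch-db-01", "type": "database", "department": "merchandising"},
        {"name": "pos-app-01", "type": "app", "department": "point-of-sale"},
    ]
    ENGINEERS = [
        {"name": "alice", "role": "sre", "department": "merchandising"},
        {"name": "bob", "role": "infrastructure", "department": "point-of-sale"},
    ]
    ROLE_SCOPE = {"sre": "operator", "infrastructure": "admin"}

    def plan_secret_placement(servers, engineers):
        """Yield (server, engineer, scope): whose secret lands where, at what level."""
        for server in servers:
            for eng in engineers:
                if eng["department"] == server["department"]:
                    yield server["name"], eng["name"], ROLE_SCOPE[eng["role"]]

    for placement in plan_secret_placement(SERVERS, ENGINEERS):
        print(placement)
    # ('merch-db-01', 'alice', 'operator')
    # ('pos-app-01', 'bob', 'admin')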

Anomaly detection (information radiation)
This part is in a 100% aspirational state; everything before this is somewhere between reality and aspiration. There are several ways to solve for this; I’ll use an ELK stack as my example. Logs from everything (source control repositories, tools, servers, pipelines, automation tools, even the coffee maker) are ingested into ELK. Queries process the logs looking for things we’ve defined as anomalies or things we left in the ‘other’ bucket. Alerts and / or dashboards are populated and away we go. Teams would have processes to deal with alerts, and we would use this information aggregation to show the auditors that our controls are programmatically enforced and that circumvention is detected and reviewed.
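
One possible shape for those queries, assuming the official Elasticsearch Python client (8.x-style API) and placeholder index and field names, would be something like:

    # Rough sketch only: the index, field names and anomaly definition are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # the ELK aggregation point

    def find_unexpected_binary_changes():
        """Find binary changes on servers that did not come through the pipeline."""
        resp = es.search(
            index="server-events-*",
            query={
                "bool": {
                    "must": [{"term": {"event.action": "binary_change"}}],
                    "must_not": [{"term": {"event.source": "jenkins"}}],
                }
            },
        )
        return [hit["_source"] for hit in resp["hits"]["hits"]]

    for event in find_unexpected_binary_changes():
        # Feed these into alerts, dashboards and the evidence we show auditors.
        print("ALERT:", event)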

That’s the journey thus far. I expect we’ll make good progress on iteration one or two of each of these solutions this year. We’ll probably add a few more problem statements, and we’ll keep evaluating the patterns for correctness and completeness. An interesting conversation thread is around change management and what happens to it when you have something like what I’ve described here. Expect a future post talking about that angle and our journey.

One last note to self: images would probably be a useful thing…

Thanks for reading.
