Marc O'Morain on adding Windows support to CircleCI

Daniel Compton 0:00 Hello, welcome to The REPL - a podcast diving into Clojure programs and libraries. This week I’m talking about Windows and Clojure with Marc O’Morain, one of the developers on CircleCI’s new Windows support. Welcome to the show, Marc.

Marc O’Morain 0:11 Thank you, Daniel. Thanks for having me! I’m a big fan of the podcast. I listen every week, so I’m delighted to be asked to be a guest.

Daniel Compton 0:18 Great. It’s awesome to have you on. So there’s kind of two recent news items from CircleCI. The big one that I want to talk about first was Windows support. I imagine… it sounds like quite a lot of work. And so what was the reason for putting in all of this work to support Windows?

Marc O’Morain 0:33 Absolutely. We’re delighted to launch our Windows support. It’s something that our customers have been asking us for, first and foremost. Our customers that use us for Linux/macOS also have Windows workloads, which they wanted to have on one platform. Secondly, there’s a big untapped market there for us. So, the Stack Overflow Developer Survey this year had an interesting figure, which was that 40% of the developers are running Windows on their desktop. So those were the two big reasons why we wanted to build Windows support.

Daniel Compton 1:01 Nice. And so what does this mean now? What can people do now that they couldn’t before?

Marc O’Morain 1:06 Use all the platforms. So, when you push code to CircleCI, you can now have a workflow, and you can run jobs on Windows, macOS, and Linux all together in the same workflow. So you can build your iOS apps, build any Docker containers, and also any Windows apps, Windows phone apps, any Windows workload.

Daniel Compton 1:26 All within the same code base. So one push would trigger all of these builds at the same time.

Marc O’Morain 1:31 Exactly, yep, and the caches and workflows and all the features of CircleCI that you expect are all there on Windows, and all interoperate between the platforms. So one neat example there is the workspaces we have which let you simply pass a folder of data from job to job in the workflow. You could start with a Linux job that builds some data, puts it in a workspace, and then run a Windows job and take that data out. It’s a fully integrated solution. It’s not a separate system that we bolted on.

Daniel Compton 2:03 I see. So let’s say I had like a Go program where I’m pretty sure Go can cross-compile for different architectures. Could I have like a Linux Go build to build my Windows binaries and then pass it through workspaces and run the tests - or whatever I want to run - on Windows? Could you do that kind of thing?

Marc O’Morain 2:21 Yeah. Interestingly, that’s how we build the system ourselves. So the piece of CircleCI that actually executes your config - we call it the build agent - that is a Golang program that we build for all platforms in one Linux job. So we build it for Linux, macOS, and Windows, cross-compile it, and those binaries get put into a Docker image. And then separately, we have a macOS job and a Windows job that run the tests. I don’t think we actually use the same binary that we built in the upstream job, but that’s certainly possible. That was an interesting one for us because as we started building the Windows support, that was the first thing we needed to do. So the very first build we ran was to install the Go tools and start from a build of the system itself to bootstrap us.

Daniel Compton 3:06 Yep. So traditionally, this kind of CI/CD workload has not been so well-suited for Windows. There have been… things in the past but certainly the way that CircleCI does it with very ephemeral kind of containers - and ephemeral stuff has noise. I’ve worked on Windows environments in the past where it just hasn’t really been a very good fit for the nature of the software we were building. So what kind of teams are you finding are able to take advantage of this?

Marc O’Morain 3:35 So, the Windows build itself, we’ve had some beta partners using it for a couple of months now. And they have largely been existing CircleCI customers that had Windows workloads they wanted to run - the easiest folk for us to get involved to help with that. And they typically have applications that are cross-platform. So customers that were using us for their Linux build and their Mac build, a lot of desktop apps and a lot of libraries that need to be distributed across all platforms, have been the initial users of this. Also, you were talking about how the tooling has evolved. And it’s funny, I haven’t seen much of the recent evolution of Windows tooling. My developer origin story, as it were: I started out working in game development when I first left university, and back then we used CruiseControl.NET to run builds on a Windows Server, if any of your listeners remember that. And you’re right, it was absolutely not suited to that ephemeral job. We would have… I think it was TortoiseSVN, where we used a Subversion checkout of the code, and each build would check out the latest on top of the old code. And you’d end up with all these scenarios where a deleted file would remain in the system. You know, essentially, you’re just reusing the same folder over and over. Since then, I’ve been at CircleCI for a number of years, so I haven’t seen the evolution of the Windows platform, but what we have seen is the cloudification of Windows. So we run Windows builds on AWS (Amazon’s cloud) and GCP (Google Compute Engine on Google Cloud Platform). And we can spin up Windows machines… I think it takes about 55 seconds for us to boot a Windows machine. That was today’s numbers, so we can spin them up very quickly, and we throw the whole machine away after the build has run.

Daniel Compton 5:24 Right. So every single build gets a new Windows

Marc O’Morain 5:27 Yeah, it’s a full new EC2 instance, or GCP just calls them instances. So it’s a whole new VM spun up. So the team that I’m on, we call ourselves the ‘machine team’. We deal with the non-Docker builds in CircleCI. So macOS builds, Windows builds, and what we have been calling ‘machine builds’, which I think we’re gonna refer to more as VM builds. So there you have a whole EC2 instance to yourself. And one of the things we’ve been building hand in hand with this Windows launch is better support for us to scale the VMs, because I think the Windows builds take about 55 seconds to boot up. Some of the beefier Docker images for Linux can take 100 seconds, 120. So that’s a latency we need to hide from the user - you’re not going to be happy if you push your code and it takes 120 seconds to spin up the machine, so we have to boot the machines in advance. That’s an interesting problem we’ve been working on. So we’ve had a system that was good enough for the last year or so with some hard-coded values, a system that worked well enough as long as you didn’t look behind the curtain. And we’ve been investigating using control theory and queueing theory to try and work out the best way to boot the machines. So it’s been an interesting problem to work on, and it’s been a neat problem, and I love a problem where you can go and find the literature and solve the problem as it’s meant to be solved. So in looking into the queueing theory, we found queueing theory had been invented in… I think it was 1908, Wikipedia said, by Mr. Erlang, who has the programming language named after him. And the problem he was trying to solve was telephone switches. He was trying to optimize, I don’t know whether it was the number of people or the amount of machinery they needed, but to run a telephone exchange, they were working out what their peak expected number of calls was. 
And I’m going to say it’s people… listeners might correct me if I’m wrong, but how many servers would be needed, to physically plug in all the plugs and to keep those calls going. And he seemed to spend 8 to 10 years inventing a new branch of mathematics - queueing theory - that solves the problem. Whereas we were able to go to his Wikipedia page and just start from there.

Daniel Compton 7:44 [Laughs]

Marc O’Morain 7:46 It has been 110 years of evolution in this branch of mathematics, and we’ve been able to just lift very standard queueing algorithms. There’s a couple of interesting points in how we do the scaling, so the first is that our jobs are ephemeral, and we throw them away. So a lot of the standard… like Amazon’s, I think they call them Auto Scaling Groups, and I think they call it instance groups in Google. They’re designed for maintaining a fleet of servers. So you can imagine you have maybe eight web servers in a scaling group, and then as your load goes up, you add a 9th or 10th; whereas we are constantly spinning up new VMs and shutting them down again as soon as jobs are finished. So our workload is nearly the exact correct shape, but very different at the same time. Certainly, we are investigating whether we are going to use some of the auto-scaling tooling, but we’re finding - because our problem is just a slightly different shape - we’re having to build our own systems. But the other thing with queueing theory is that a lot of it works really well when the arrival rate of your jobs is random, which we’ve noticed holds for a busy image on our system. A standard Ubuntu 16.04 image, say, is going to be really popular; you might be running 5-10 thousand jobs an hour. So the arrival rate of that is random because there are thousands of developers across the world pushing code. But for lesser-used images - Windows has been a great example - while we were in our preview period, we might get 40 minutes with no builds coming through, and then one developer would push and he or she would have a workflow with maybe three or four jobs in it, so we get four jobs at once. 
It’s not random; it is very much correlated what sort of workflow *, and when plugging those numbers into some standard queueing theory models, if you said, ‘well, we get four jobs an hour’, it would say, ‘well, you need one server booted’, but when all four come along at once, you need four booted. So we’re having to add a bit more intelligence to the system. I would say machine learning, but that’d be a lie - it’s just standard queueing theory.
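The sizing problem Marc describes can be sketched with the Erlang B formula, one of the standard results from Erlang’s work. This is only an illustration under stated assumptions, not CircleCI’s implementation: the function names, the 5-minute job duration, the blocking target, and the `largest_batch` floor are all hypothetical. It shows how a model that assumes random arrivals suggests one pre-booted VM for four jobs an hour, while a workflow that delivers all four jobs at once needs four.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Erlang B: probability that an arriving job finds all servers busy.

    offered_load is in erlangs (arrival rate multiplied by mean holding time).
    Uses the standard numerically stable recursion on the server count.
    """
    blocking = 1.0  # with zero servers, every arrival is blocked
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def servers_needed(offered_load: float, max_blocking: float) -> int:
    """Smallest pool size whose blocking probability is within the target."""
    if max_blocking <= 0:
        raise ValueError("max_blocking must be positive")
    servers = 0
    while erlang_b(servers, offered_load) > max_blocking:
        servers += 1
    return servers


def pool_size(jobs_per_hour: float, minutes_per_job: float,
              max_blocking: float, largest_batch: int = 1) -> int:
    """Pre-booted VMs to keep ready (hypothetical helper).

    largest_batch is an assumed correction for correlated arrivals: a
    workflow that fans out into N jobs needs N machines at the same moment,
    whatever the hourly average says.
    """
    offered_load = jobs_per_hour * minutes_per_job / 60.0
    return max(servers_needed(offered_load, max_blocking), largest_batch)


if __name__ == "__main__":
    # Four jobs an hour, each assumed to hold a VM for 5 minutes.
    print(pool_size(4, 5, max_blocking=0.3))                   # random arrivals
    print(pool_size(4, 5, max_blocking=0.3, largest_batch=4))  # one 4-job workflow
```

Taking the maximum of the two numbers is the crudest possible fix; a fuller model would treat batch arrivals explicitly.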

Daniel Compton 10:00 So, Windows is quite a different system in many ways: different OS than Linux and Mac. Linux and Mac seem to have shared quite a lot of similarities in their Unix underpinnings. So what did you find when you came to add Windows? Were there assumptions that you’d made that no longer held? So what was that process like?

Marc O’Morain 10:22 Yes, the shell is the big difference. So one very conscious decision we made was to go with Windows Server 2019. I don’t remember the exact release, but the later one.

Daniel Compton 10:33 Datacenter edition.

Marc O’Morain 10:35 Datacenter edition also. So that gets us the Docker containers, and Windows Server 2019 comes with OpenSSH installed as part of the base install. And our system for booting VMs - typically Mac and Linux - assumes an SSH connection, and it assumes an authorized_keys file. So the build agent itself - the Go binary that bootstraps the whole thing - will generate a new key pair, and then we boot the VM and ask the service that has booted the VM to add this public key to the authorized_keys file. So the builds can then SSH in and run. By using Windows Server 2019, we got that out of the box. We had to install Bash in the image, which - thankfully - we get through the official Git client for Windows, which comes with an MSYS system that gives us a Bash shell. So we have a Bash shell. We have been looking at supporting other versions of Windows, so Windows 10 has been something that’s been mentioned in the past. And if we went with that, we would need to install enough into the base image to meet the sort of API we have, which is an SSH connection. The shell has been interesting. So we made a ton of assumptions: that /tmp is a folder you can write to; that Bash is the default shell. And we started out down the path then of adding - you can imagine - a bunch of ifs through everything, saying, if you’re on Windows, then powershell -c this way, and otherwise, if you’re on Mac and Linux, execute in Bash. * support is Command, cmd.exe. And annoyingly, for every shell we’ve seen, we had a hard-coded assumption that you could execute the shell name, -c, and then a string. And that’s true in PowerShell, Bash, DataStage - every shell except for cmd.exe. So we had to go in and put a big dirty “if” in the middle of everything, saying, build your shell string this way, unless it’s cmd.exe. We swapped, actually, the default shell. So the default shell that customers see on CircleCI is PowerShell. 
But when we SSH in a layer deeper, we get Bash as the default shell, which was great for building the prototype: /tmp is there, you get your GNU tools. So we were able to remove the ifs in the code saying ‘if PowerShell/Bash’ and just write a sort of sanitized Bash script for bootstrapping ourselves. That… worked great, and now we’re starting to see the limitations; we’re starting to hit issues where tar still isn’t quite the same; when you boot a SIG* in bash shell, it’s not real Bash; we’re starting to see some finicky issues with symlinks and changing paths and stuff. So I think we’re gonna have to reevaluate now and go back to a different system for bootstrapping the machine.
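The “big dirty if” for cmd.exe can be illustrated with a small sketch. This is not the build agent’s code (which is written in Go); the function name `shell_invocation` is hypothetical. It relies only on two well-known facts: Bash and PowerShell both accept `-c` followed by a command string, and cmd.exe takes `/c` instead.

```python
from typing import List


def shell_invocation(shell: str, script: str) -> List[str]:
    """Build the argument vector that runs `script` under `shell`.

    Hypothetical helper: most shells take `<shell> -c <string>`;
    cmd.exe is the exception and takes `/c`.
    """
    name = shell.lower()
    if name in ("cmd", "cmd.exe"):
        return [shell, "/c", script]
    # bash, sh, and powershell (where -c is short for -Command) land here
    return [shell, "-c", script]


if __name__ == "__main__":
    print(shell_invocation("bash", "echo hello"))
    print(shell_invocation("powershell", "Write-Output hello"))
    print(shell_invocation("cmd.exe", "echo hello"))
```

Returning an argument list instead of one joined string avoids a second layer of quoting when the result is handed to something like `subprocess.run`.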

Daniel Compton 13:29 That sounds messy, perhaps, to figure that out.

Marc O’Morain 13:32 Yeah. Although, now, going back with the better ideas, what we had been doing was remotely going into the machine and installing our agent and setting some paths and adding some data. And a new model we’re looking at now is where we would put the agent in, and as part of it starting up, it can bootstrap itself.

Daniel Compton 13:51 So it will be loaded on an AMI beforehand.

Marc O’Morain 13:54 No, we actually use SCP to put the binary in itself. But we have a… we call them different words internally, inner and outer. The agent itself starts in its outer mode, where we’ll get a VM for running the build - be that Linux, Mac or Windows - and then it uses SCP to copy a copy of itself in and then it invokes itself, and off it goes. And that outer process, as part of copying the binary into the VM itself, also copies in some config data, sets up some paths, and sets environment variables. Because we’re doing that remotely, we’re sort of having to say, ‘if the target platform is Windows then treat it specially’. Whereas if we do that once we’ve copied the binary in, and have the binary bootstrap itself, it knows statically that it’s on Windows. So we can have much better-factored code, where we can have a Windows setup path and a Linux setup path rather than having to introspect where we’re running.

Daniel Compton 14:53 Right. And so, you mentioned, you’ve got some Go in here, and CircleCI is a famous Clojure user. So there’s Clojure. So how did you split up those responsibilities? And what’s doing what?

Marc O’Morain 15:06 You’re right, Circle is a Clojure shop, and we have some smatterings of Go around the place. And the Go, we’ve chosen for the specific places where we can’t really run Clojure. So, interestingly, in CircleCI 1.0, we used to have Clojure that we ran inside the build, which was our inference system, which would go into a repository and try and make guesses about what language you were using and try to generate a config for you. I think we got Clojure down to a 9 MB JAR, and it started in about 200 milliseconds.

Daniel Compton 15:40 That’s pretty good.

Marc O’Morain 15:40 It would take a second or two to run. But it was… we were copying a 9 MB .jar in, and we had to have the JVM inside the build, and it was slow and restrictive. So with CircleCI 2.0, the core of the system is that you bring your own Docker image. So we can’t rely on there being a JVM in there, let alone the exact JVM we require. So we needed something that would boot quickly and be small, so a compiled language makes sense in that scenario. So I think we looked at Rust and Go - this is maybe three years ago - and we found Go was better at statically compiling itself, and including all of its libs in one executable. I think Rust has improved a lot since, but at the time, it was trying to dynamically link to libc. And again, in a container, we can’t rely on there being any specific libc. So Go won the day there. So that’s our build agent, as we call it - the small executable inside the system. The other big place we use Go is our CLI tool, which my team rewrote about a year ago as part of our orbs initiative. So that’s a tool with which you can validate your config locally; you can call some of our APIs with it. And that is a tool that we wanted to run on our developers’ machines and customers’ machines. So again, we actually did prototype that as a .jar. So we wrote the initial version in Clojure, because it was what we enjoy writing code in most, and then for the actual release, we swapped it over to using Go, so that we would have - again - one binary that we could easily ship to platforms where we’re not aware of what the execution environment is going to be at the time.

Daniel Compton 17:24 Right. So what were some of the other challenges you faced building the system?

Marc O’Morain 17:28 The main one was Unix versus Windows and the different shells, as we’ve gone through. The other issue that we had was, we needed to build a full product. So we had an MVP since, I think, we had the first build running on the system in maybe March this year. But we needed to build out all of the features that we have. So customers expect to be able to SSH into a build, caches, workspaces, getting a proper image that has all the software installed, making sure the image can run containers. Interestingly, the very first feature request we had from the very first customer we had running Windows builds was a need to be able to run Linux containers inside the Windows image.

Daniel Compton 18:09 [Laughs] Great!

Marc O’Morain 18:12 And that was feature request #1 inside Windows: to run Linux. And to support all these features, we needed to build them up, and along the way, we had some fights with file paths. So we use blob storage, typically Amazon S3, for storing caches, workspaces, test results and all the files that come out of your build container. And we had assumptions that the relative path to the cache or workspace on the machine we could also use as an S3 key. So for s3://your_project/your_build/bar-10-foo*, we could then put bar-10-foo* on the machine. So we had code for generating one path with forward slashes in it, but on Windows, we then needed to break it in half: the Windows backslashes locally and the forward slashes in the S3 path. So they were the trickier bugs to track down, where the code would work, but it would be generating… you have two different systems trying to generate the same path, and one uses the wrong slashes. They can’t communicate about where files are. The neat thing though, that went really well with the project, and one of the things I’m happiest with, is that we were able to use orbs to do it. So, the orbs are our way of packaging config and sharing config between projects. And there’s a bit of templating in there as well. I was involved in the first Mac builds on CircleCI 2.0, and we had to introduce new config syntax. So, for a Mac build under your job, you can say you want macos, and under that, you say xcode-10.3 or 10.2 or whatever you want. So that was additional syntax we added to the config. For Windows, we were able to use orbs, so you import the windows orb, and then in your job, you say the executor is windows-vs2019 for the image with VS 2019 installed in it. And then we sort of rewrite the config, and it actually ends up looking like a Linux build with a Windows image under the hood, as the CircleCI systems see it. 
But to the customer, we were able to add this neat sugar to the config and not actually change the schema of the file at all or add anything - which is really neat, to be able to add features in userland like that.
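The slash mismatch Marc mentions can be sketched with Python’s `pathlib`, which models Windows and POSIX paths separately. This is a minimal illustration, not CircleCI’s code: the function names and the example prefix are hypothetical. The idea is to split a relative path into its components once, then join them with forward slashes for the S3 key and with the platform’s own separator for the local path.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_s3_key(prefix: str, relative_path: str, windows: bool) -> str:
    """Turn a local relative path into a blob-store key (always forward slashes)."""
    path_type = PureWindowsPath if windows else PurePosixPath
    parts = path_type(relative_path).parts
    return "/".join([prefix.strip("/"), *parts])


def to_local_path(root: str, key_suffix: str, windows: bool) -> str:
    """Turn the relative part of a key back into a local path for the platform."""
    path_type = PureWindowsPath if windows else PurePosixPath
    return str(path_type(root, *key_suffix.split("/")))


if __name__ == "__main__":
    # Hypothetical project/build prefix and file names, for illustration only.
    print(to_s3_key("my-project/42", r"cache\deps\lib.dll", windows=True))
    print(to_local_path(r"C:\work", "cache/deps/lib.dll", windows=True))
    print(to_local_path("/tmp/work", "cache/deps/lib.dll", windows=False))
```

Because the `Pure*` path classes do no filesystem access, the same code can reason about Windows paths from a Linux process, which matters when one system writes the key and another reads it.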

Daniel Compton 20:32 Yeah, I mean, I think that’s a mark of a well-designed system when you can extend it in ways that you didn’t think about originally, and yet, it still feels quite natural.

Marc O’Morain 20:40 Yeah, and it was only our third time trying it, so…

Daniel Compton 20:43 [Laughs]

Marc O’Morain 20:45 Five years in. Yeah, we learned a lot in 1.0. We knew where we had hamstrung ourselves and what we could do with config. So we’ve made some better choices in 2.0 config, and then with 2.1, we’ve tried to improve again. Something we’ve been pushing for more is trying to build a platform that features can be built on top of by us and by our customers. So orbs was the first push in that direction. We’re also looking at new data APIs - we have a team working on that right now - for us to build better insights and analytics about your CI system, but also to enable our customers to mine the same data. So we’re taking an API-first approach there, and we’re building the API and then the UI on top of it. At the same time, our customers are getting access to the new APIs. Again, it helps us iterate. And again, with the orbs, the neat thing with them is that they are pinned at a specific version in your config. So it lets us iterate on the orbs and make breaking changes all we want. But we won’t break any builds because customers have pinned exactly the version they wanted.

Daniel Compton 21:50 Yeah, that’s a good thing to have.

Marc O’Morain 21:52 One issue we have at the scale we’re at now is we’re running about a million builds a day, 30 million builds a month. So, when you think about making a change, and you’re thinking, ‘this change will have zero effect on customers, the chances of this happening are one in 10,000’ - that’s 100 times a day that’s going to break. We find that anywhere there is any ambiguity in our config syntax, customers will have exploited it and will be relying on it, so we need to work really hard to keep things compatible. So, the more we can push into things like orbs that are pinned by the customer and committed into their source control, the more freedom we have to evolve the platform forward without breaking builds.
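The arithmetic behind “100 times a day” is worth spelling out. The sketch below uses only the two figures from the conversation (a million builds a day, a one-in-10,000 chance per build); the function names are hypothetical, and treating builds as independent is an assumption made for the second calculation.

```python
def expected_breakages(builds_per_day: int, one_in: int) -> float:
    """Expected number of affected builds per day for a 1-in-`one_in` problem."""
    return builds_per_day / one_in


def chance_of_at_least_one(builds_per_day: int, one_in: int) -> float:
    """Probability that at least one build is hit, assuming independent builds."""
    return 1.0 - (1.0 - 1.0 / one_in) ** builds_per_day


if __name__ == "__main__":
    print(expected_breakages(1_000_000, 10_000))
    print(chance_of_at_least_one(1_000_000, 10_000))
```

At this volume a “rare” edge case is effectively certain to occur every day, which is the argument for pushing change into versions the customer pins.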

Daniel Compton 22:35 Yeah, I had issues - this was many years ago - it wasn’t even a CircleCI issue, it was an npm issue, where I went away over Christmas, and the builds were running before Christmas, and then after Christmas, they were failing. And there was maybe a week or two weeks off in between, and I came back, and all the builds were failing with some very obscure errors inside JavaScript land. And I was thinking, now where’s this change, and so I was looking all the way through CircleCI. CircleCI had not changed over that time. And eventually, I realized it was some transitive dependency way deep down that had released some breaking change. And I hadn’t numbered* the version correctly, and it was meant to be compatible, except it really wasn’t. And so, eventually, I found this thread of hundreds of people who all had run into the same breaking issue as me. So I’m definitely a big fan of pinning as hard as you can on every single thing to just make sure nothing can move away from you.

Marc O’Morain 23:33 I guess there is the pinning, and then there’s the Clojure adherence to never shipping breaking changes, which is so glorious, living in that world.

Daniel Compton 23:41 Yeah, that’s… I’ve had an old Mac project that needed to be updated for 64-bit, and I thought, ‘I will just take a look, is this easy or hard for me to do it for the author?’ - because I think the author wasn’t that involved anymore. The code hadn’t been touched for about 10 years, I think. And it was so old. Xcode is pretty good about deprecation warnings and providing fixes, but it was so old that all of the deprecation warnings had been removed from this version of Xcode. For me to have migrated, I would have needed to find a version of Xcode from four years ago to do one migration, and then migrate it forward, again and again, just to get to the current thing. There were things that were deprecated, and the recommendations themselves had been deprecated by the time I was running this in 2019. So, compare that to Clojure projects, where you open them up from 5, 8, 10 years ago, and they just run - apart from recent changes to the JVM itself, but Clojure itself… very compatible.

Marc O’Morain 24:38 I’m trying to think back through, so I’ve been involved in a bunch of the upgrades of our… So Circle 2.0 was built as a suite of services all built around the monolith that we were slowly but surely carving out. And I’ve been involved… I think when I started at Circle we were on Clojure 1.4 or 1.5, then upgraded to 1.6, 1.7 maybe, and the big change was that the hash algorithm had changed, and we had a bunch of tests that were asserting on the sequence of key-values you get out of a map: if you turn a map into a seq, we were asserting that you got things in a certain order. So the first of a map gave a certain key and value. So they were bad tests that I had to go through and fix. And then a colleague of mine had the misfortune of the one breaking change we hit in Clojure, which was when they introduced static linking, is that the term?

Daniel Compton 25:35 I know what you’re referring to. I think that’s the term.

Marc O’Morain 25:38 It would be like with the compiler, but sort of inline the * reference.

Daniel Compton 25:41 Direct linking.

Marc O’Morain 25:42 Direct linking. And we were monkey-patching clojure.test.

Daniel Compton 25:46 [Laughs]

Marc O’Morain 25:48 We still are monkey-patching clojure.test. And clojure.test… because it’s in Clojure core, and the Clojure core JAR itself is built with that direct linking enabled, so that call was no longer de-referencing* of our… it was invalid*. We had to… someone else on my team ended up having to write circleci.test, our own test runner, to work around this issue. And then I did the * that came into the next upgrade, where, again, nothing needed changing. Or the most recent one I did was… I think we skipped 1.9 and went straight to 1.10, and the namespace macro is now specced.

Daniel Compton 26:32 Yeah, that’s true.

Marc O’Morain 26:34 It requires an import… the documented way to call them is with a keyword require and a keyword import within the namespace macro, but it would accept the symbol ‘require’ with no colon in front, and import. We’re using some very old libraries in Circle for sending chat notifications for HipChat and Flowdock and Campfire, and all these pre-Slack chat systems. And a bunch of those libraries used the unsupported old syntax, and I was desperately forking them on GitHub because I couldn’t find the old maintainers to have them commit my one-character patch to the import statement at the top. Some of them we were able to fix because we found out that the chat applications that we were sending notifications for had actually gone out of business two years prior. The world has migrated to Slack. I don’t even know what Flowdock is, but we have chat notifications support.

Daniel Compton 27:33 I think they sponsored a Clojure coding challenge, many years ago, four or five years ago. I think they used it for that.

Marc O’Morain 27:40 Oh, that would make sense why Circle has support for them then if there’s a Clojure link.

Daniel Compton 27:45 Yeah, I think, that’s my memory. So Windows, you pay for a license for Windows, and you generally don’t for Linux. So, how does licensing work for you and CircleCI Windows?

Marc O’Morain 27:56 For the customers, there’s nothing for them to do. They can build on Windows. From the start, we’ve had our biz dev and legal teams working with Microsoft, Google and Amazon on the licensing, so there’s nothing for customers to do, and thankfully there was nothing for me and engineering, or other departments, to do either. And there’s an extra cost incurred in running Windows. I think the cost to us, I’m not sure, might be double or something. But we have our new pricing system, which is our usage-based pricing. So, the first plan we have is our Performance pricing, which is usage-based pricing: you spend your dollars on credits, and then you can spend credits on Linux builds, Mac builds, and Windows builds. And there’s a different cost-per-minute for each of those. So the Windows builds, I think the machines maybe have two or four times the number of cores that the Linux ones do. And we have the Windows license fee as well, and so there’s a higher price for the Windows credits versus the Linux ones.

Daniel Compton 29:00 So that’s quite a new change also, the pricing. Well, actually, how new is that? I noticed it not that long ago, it feels like.

Marc O’Morain 29:08 It feels to me like it’s been there for about a year. But we might not have had it public; we’ve been working with, again, partners on the pricing for quite some time. The container-based pricing that we’re migrating away from, the idea there was you would pay… the first one was free, and then $50 per container. Originally, it was $19; I remember we had no free plan. And that limits the number of parallel builds you can run at once, which is a bit of an artificial restriction because with us running all the builds on the cloud providers, we’re not ourselves limited in the number of builds we can run. And we found, for large companies, they wanted to have a large number of parallel builds running from 9 am to 5 pm in their local time. And then they didn’t want to be paying for those containers overnight. So they wanted usage-based pricing. So the usage-based pricing is very much something our customers wanted - to be paying for their use - and it’s letting us increasingly give wider parallelism. Our goal is to remove our restrictions on parallelism, but we have some work to do there. I think our initiative is called ‘task throttling’ because we’re not affected by how many builds are running at once, but when they all come at once, we have problems. Nomad is the task scheduling system we use - HashiCorp Nomad - for actually allocating tasks to compute resources. And the VMs that my team deal with, we’ve got to hide that boot time latency. So when a flood of jobs comes in at once, that gives us spikes in our load. So we need to work on smoothing those out and also stop jacking*. So you can imagine, if you have a project, and you send in 100 tasks to run at once, and another customer sends in one or two, we don’t want your hundred builds to go into the queue first; we want to let one of yours in, then let the other customer’s builds through, and then run your other 99. So we need to be fair in that. 
And also, generally throttling per account; one account can’t just flood all of the builds in at once.

Daniel Compton 31:15 I see. Does queueing theory come in there as well?

Marc O’Morain 31:18 Queues themselves rather than… I’m not too familiar with how the team are actually working on that project, but it has been one of these systems where the behavior you need is a queue, but you also need to be able to look deeper down the queue and try to look past. If you’ve sent in 100 builds, and this other person has one build behind, we need to be able to look deeper than the hundred builds to find it. So you can’t just look at the top of the queue. So that’s my favorite pattern, which is the database-as-a-queue pattern that you need.
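The fairness behaviour described here, looking past one customer’s hundred builds to find another customer’s single build, can be sketched as round-robin scheduling across per-account queues. This is an illustration only, not how CircleCI or Nomad implement task throttling; the function name `fair_order` and the account labels are hypothetical.

```python
from collections import OrderedDict, deque
from typing import Deque, Hashable, Iterable, List, Tuple


def fair_order(submissions: Iterable[Tuple[Hashable, object]]) -> List[Tuple[Hashable, object]]:
    """Reorder (account, job) pairs so accounts take turns.

    Jobs from the same account keep their relative order; accounts are
    visited round-robin in order of first arrival.
    """
    queues: "OrderedDict[Hashable, Deque[object]]" = OrderedDict()
    for account, job in submissions:
        queues.setdefault(account, deque()).append(job)

    ordered: List[Tuple[Hashable, object]] = []
    while queues:
        # Snapshot the keys so drained accounts can be removed mid-round.
        for account in list(queues):
            ordered.append((account, queues[account].popleft()))
            if not queues[account]:
                del queues[account]
    return ordered


if __name__ == "__main__":
    # One account floods the queue with 100 jobs; another submits one afterwards.
    arrivals = [("big", n) for n in range(100)] + [("small", 0)]
    print(fair_order(arrivals)[:3])
```

A plain first-in-first-out queue would run the small account’s job last; here it runs second, and the remaining 99 jobs follow.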

Daniel Compton 31:48 So what else have CircleCI, and you in particular, been working on?

Marc O’Morain 31:53 For me, stability and performance have been a huge focus this year. Around March this year, we had a shaky time with our uptime, and we assembled a small tiger team to try and address the acute performance issues that we were having there. So we did a bunch of profiling and fixing of largely database systems. We’d grown along many axes, we found. We had more customers than we had previously. Since we launched workflows, we have more jobs. So what used to trigger one build could now trigger 8-10 builds. We have customers pushing in 100 builds at once as part of a workflow. And then within the workflows, they’re using orbs. So there’s text expansion going on there. We ended up really straining our database. We had some acute remedial fixes to do then, and subsequently, we have a couple of teams now doing the longer-term fixes. We identified where the growth has been hurting us and where we need to work on stability. So we made huge improvements there in the last few months. And then with my team in particular, with the VMs, what we’ve been looking at is cloud stability. At busy times, the cloud providers we use run out of compute in different zones. The guarantee we get from Google and Amazon is that in a particular region we can get compute, but individual zones run out of capacity, and that’s not an incident with the cloud provider, it’s just that one zone is busy, so we get errors saying ‘try again later’. So we’ve been building a system of circuit breakers to allow us to detect when these operations are failing in different zones. So if we get more failures to boot a machine than successes within a certain zone over a couple of minutes, we then mark that cloud zone as having its circuit open, and for, I think, about 10 minutes we stop sending any requests to that zone. 
Then once the 10 minutes have elapsed, we go into a half-open state of the circuit: we send a few requests to boot machines in the zone and, again, monitor those closely, and if they are successful, we put the zone back into operation. Otherwise, we leave it for another 10 minutes. That protects us against capacity issues and also against actual problems where a particular zone or region just starts giving 500 errors, network problems and stuff in the clouds. So, trying to protect ourselves as much as possible has been a big focus for my team, along with the auto-scaling work, and also doing some more reactive scaling. So if we get a lot of builds, or if a cloud zone goes down, or, a big issue we get as well, when a partner such as GitHub has a period where their webhooks stop delivering - which can happen from time to time - we get no webhooks from GitHub for 10-20 minutes, say, and then we get them all at once, once they come back.
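
The closed / open / half-open cycle described above can be sketched as a small per-zone state machine. This is an illustrative sketch only: the `ZoneBreaker` class, its parameter names, and the defaults (a two-minute window, a ten-minute cooldown, three probes, a minimum of five samples before tripping) are assumptions loosely based on the numbers mentioned in the conversation, not CircleCI's code.

```python
import time
from collections import deque


class ZoneBreaker:
    """Hypothetical circuit breaker for VM boot requests in one cloud zone."""

    def __init__(self, window=120.0, cooldown=600.0, probes=3,
                 min_events=5, clock=time.monotonic):
        self.window = window          # seconds of history to consider
        self.cooldown = cooldown      # seconds to stay open
        self.probes = probes          # trial requests allowed when half-open
        self.min_events = min_events  # assumed: avoid tripping on one error
        self.clock = clock
        self.state = "closed"
        self.events = deque()         # (timestamp, succeeded)
        self.opened_at = 0.0
        self.probes_sent = 0
        self.probe_successes = 0

    def allow(self):
        """Return True if a boot request may be sent to this zone now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                return False
            # Cooldown elapsed: let a few trial requests through.
            self.state = "half-open"
            self.probes_sent = 0
            self.probe_successes = 0
        if self.state == "half-open":
            if self.probes_sent >= self.probes:
                return False
            self.probes_sent += 1
        return True

    def record(self, succeeded):
        """Record the outcome of a boot request."""
        now = self.clock()
        if self.state == "open":
            return  # late result from before the circuit opened
        if self.state == "half-open":
            if not succeeded:
                self._open(now)  # wait another cooldown period
            else:
                self.probe_successes += 1
                if self.probe_successes >= self.probes:
                    self.state = "closed"
            return
        self.events.append((now, succeeded))
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        successes = len(self.events) - failures
        if len(self.events) >= self.min_events and failures > successes:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.events.clear()
```

A caller would keep one breaker per zone, check `allow()` before asking that zone for a machine, and pass the result of each boot attempt to `record()`.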

Marc O’Morain 34:54 And that then opens the floodgates. So we’ve been trying to protect ourselves from these by adding reactive scaling to our system. Along with booting enough machines for what we predict the growth to be through the day, we’re also reacting to the number of builds that are starting to queue up. And we’ve been automating a bunch of processes that had been manual, where alerts would trigger about backlogs, different VMs. So we would have runbooks where we’d have to take remedial action. And we’ve been automating a bunch of that, which has been freeing our team to work on the feature work. So Windows and work on new things, because we have automated the toil away - that’s currently the track we’re taking. So that’s the performance work. We’ve been making a bunch of underlying changes that customers will see the secondary benefits from. So we have a new setting in the advanced settings for projects, which has been on by default since September last year, which is pipelines. So a pipeline is our term for one execution of your project. A pipeline contains workflows, and workflows contain builds, and that pipeline system is where we’ve built orbs. It’s where the performance pricing is based. If you want to run Windows, pipelines need to be enabled. And there have been some very subtle, incompatible changes in pipelines. So we’ve been able to turn it on for new projects because we don’t risk breaking a new project. And we’ve been turning it on slowly for more and more projects. And that has been part of carving up the *; we’ve been able to really isolate systems nicely with our new pipeline system. So it’s something our engineers are really pushing forward, to get all projects across to use pipelines. And then the users of CircleCI will see new projects like Windows and orbs and new things coming on the back of that. We’re building a new UI. So you’ll see on the job page, we have a new UI, and we’re really loving that project. 
We’ve got a great team, they call themselves the ‘X team’ at CircleCI, building that project. They are so enthusiastic about the project, it’s great to see. And they’ve been building it with what they termed a ‘WAFL’, which is an acronym for ‘Well-architected, functionally limited’. So the way they’ve been building the new UI is, rather than an MVP, they’ve been… which would typically be like a scrappy*, you know, get something working, and then build it out from there, they’ve been building - with best practices - a very limited new build page. And then day by day and week by week, they’ve been adding all the features to it. So going in with a really solid vertical slice through the feature, and then broadening that out to add all the functionality on the build page. I think they have the build page largely done and are then moving on to other pages through the system. Now, that’s been a big focus for us, we’re really excited by that. We recently GA’d what we call our restricted contexts. Your contexts are where you store your secrets, your environment variables that might have API tokens, etc. in them. And you can now use GitHub Teams to restrict which users have access to particular contexts. So, you could have… anyone on your team can push and run your tests and run your code coverage, but only certain designated folk can actually deploy to production. And that gets very neat when you tie it with the manual approval jobs. So you can put a job into your workflow, which is a manual approval that puts up a button that someone needs to push, and that essentially acts like a sudo. So maybe me and you, Daniel, we’re on the same team, I push code, I can run the tests and do the code coverage, but you have the ability to push to production. So that approval job will require you to push the button. And then from that point on the workflow runs with your permissions, not mine.

Daniel Compton 38:47 That’s clever!

Marc O’Morain 38:48 So it lets people hide their credentials but keep a known set of folk * credentials, and they’re restricting them from the entire *

Daniel Compton 39:00 Just going back to your frontend. CircleCI was a big Om user for a long time, but I understand that the new UI is written in JavaScript almost entirely or entirely, perhaps.

Marc O’Morain 39:11 Yeah, entirely in JavaScript and TypeScript. We were a big Om user, and there is some Om Next in there as well. And I believe prior to ClojureScript there was CoffeeScript as well, and I believe there may still be some CoffeeScript in there, too. The frontend team made a choice to swap from ClojureScript to JavaScript. They had a bunch of reasons for swapping. One main reason was that we found ourselves… I think we were using ClojureScript for about four years, and it was about one year ago that we made the swap. But during those four years, we found we were one of the biggest users of Om. So, it was very difficult for us to get help, and there wasn’t the community, the StackOverflow community of answers on how to solve different problems. And we found we were writing a bunch of tools ourselves, and we found the React community already had these tools and better. Another reason for the change was that we found we were building a lot of our own tools in ClojureScript while the React community already had the tools. There are the native browser plugins for Chrome that have the React UI debugger, there is the GraphQL debugger, etc., all built in. And we were sort of building our own web-based tooling to add these sorts of debuggers, where they were built in with React. For hiring it was a big change for us because we could hire people that already knew React and TypeScript, and all the tooling we were using. So they were productive on day one, rather than having to learn ClojureScript and also learn our frontend, which had been, like I said, quite monolithic and had a bunch of technologies in there. So we were able to make a clean cut away from the old system to the new. Another thing that the team pointed out to me was that the things they love about ClojureScript weren’t present in JavaScript when we started with ClojureScript. 
But in the four years that had passed, the React community had adopted a bunch of those patterns, like the Redux pattern of having a very small area of your application that can actually change the global state. And that state atom that Om uses, that’s sort of a standard React/Redux application now, whereas at the time with Om, that was a very niche, very new superpower for ClojureScript.

Daniel Compton 41:31 That’s nice. And one thing I’ve been curious about is, you allow people to just run builds for free - limited, but still for free - based off just a push to GitHub. And how do you… what is the story with crypto miners and other malicious usages of the CircleCI platform? How do you address that?

Marc O’Morain 41:52 Yeah, it’s interesting you bring it up because last Saturday I actually spent several hours trying to swat away the crypto miners from the platform. It’s fun to see them come in because then immediately alerts go off and we’re very quick to ban them. I really enjoy seeing the tenacity of them; we had the miners in the past where they would typically peg all the CPUs, which was very easy to spot, and we were able to detect and eradicate them from the system. And then they started a very interesting way of doing it, where they would run a build using Docker, which would run a headless Chromium browser, which would then point at a Blogspot blog, and the Blogspot blog would have some JavaScript which would mine crypto. So they would run the browser headless on CircleCI, executing JavaScript from Blogspot, mining cryptocurrency.

Daniel Compton 42:50 It feels like a very inefficient way to mine cryptocurrency.

Marc O’Morain 42:53 I would have thought so, yeah! They also don’t get to do it for very long.

Daniel Compton 42:57 [Laughs]

Marc O’Morain 42:59 With the one at the weekend, they were actually mining the cryptocurrency during the Docker build. So the build looked like they were doing a regular Docker build, but the steps of running the Docker build were actually mining the crypto itself. The issue at the weekend was that their Docker builds ended up crashing, and we have a system where if a build hard-crashes, we restart it five times. So we had systems that were catching the miners, but then because of the way their builds were crashing and being canceled, they’d run again. So we needed to quickly cut that out as well. Interestingly, we have a shared Slack channel with some of our competitors in the space where we have a cordial atmosphere of sharing all the latest tips and tricks of what the miners are up to, so we can swat them. So we have a shared hive mind of how to find them and eradicate them.

Daniel Compton 43:56 I like that.

Marc O’Morain 43:57 It’s always fun to have the boardrooms of companies being competitive while the engineers have a cordial attitude to solving their problems.

Daniel Compton 44:06 Yeah, that’s really cool. I’m very glad to hear that. The other thing which is quite recent in the news is that CircleCI has raised their Series D funding round for quite a lot of money. So what does that mean for the company and for the product and the team? Talk me through those things.

Marc O’Morain 44:24 Yeah, we’re delighted with the funding. We’ve raised our Series D, and first and foremost, that’s going to allow us to continue our laser focus on CI/CD. So we’re going to focus entirely on CI/CD and adding new compute types to the platform, and we’re going to start adding support for more VCS providers. So right now, we support GitHub and Bitbucket; we’re going to start looking at GitLab and other providers and making our system agnostic of the VCS type. Another focus for us is the control that the larger enterprises need - features like the restricted contexts and audit logging features. You can imagine, large enterprises like banks have very strict requirements for who can push to production and want full logging at the top end. And that’s something we’ve been… in the last two years, we’ve done a bunch of work there; that’s something we’re just getting better and better at. We’re building a new UI, which we’re really happy with, and also focusing our growth of hiring into EMEA and APAC.

Daniel Compton 45:30 Can you just expand those acronyms in case people haven’t…?

Marc O’Morain 45:34 Oh, sorry, EMEA is Europe, Middle East, and Africa, I believe, and APAC is the Asia-Pacific region. So in Europe specifically - I’m in Cavan in Ireland, and we’ve been a remote team; our engineering at Circle is nearly 100% remote. And we’ve been hiring in Europe for as long as I’ve been at the company, but we haven’t been particularly targeted in it. So we recently changed our strategy to focus on a smaller number of countries. We are now focusing on Ireland, the UK, and Germany, in particular, and trying to hire folk there. So you’ll see increased focus from us now in those countries as we try to grow our teams directly there, rather than having an open hiring req on the website that just says ‘remote-friendly’. We’ve got specific roles on the site that mention Ireland, the UK, and Germany.

Daniel Compton 46:23 Right. So on a pretty light note, I noticed that CircleCI somehow managed to get the domain circle.ci, which I thought was very clever. It must have been on Twitter that a link with it came through. How did you manage to get that?

Marc O’Morain 46:38 Yeah, it came up recently; I needed to do some digging on it, because when we built the new CLI tool, we wanted to… so you can install the new CLI tool through Brew or through snap on Ubuntu, or through GitHub releases, but you can also do the curl-pipe-to-bash trick to install it, and I wanted a neat URL for that. So I had to go digging and find out how to get circle.ci/cli, and you can pipe that straight to bash and it installs the CLI tool. But I joined the company in November 2014, and in October 2014, the month before I joined, we snagged that domain for $10 - which I think is a bargain. Our SRE team are saying it’s a bane for them because it’s not through a regular registrar, so we need to pay every year with PayPal for circle.ci. So we can’t use our credit card.

Daniel Compton 47:28 This goes to the Ivory Coast, doesn’t it, in Africa?

Marc O’Morain 47:31 It’s an African registrar. I’m not certain exactly where the registrar is. But when I went looking for more info on it, what I was told was, “Damn that domain, and having to pay for it through PayPal.” It’s a neat domain, certainly.

Daniel Compton 47:46 I’m sure the marketing team is happy with it.

Marc O’Morain 47:49 Yeah, and we have it hooked up through Bitly, so we can shorten links with Bitly. But rather than using bit.ly, it’s circle.ci/ and then we get a little miniaturized URL on the end, so that’s really nice. Good for tweeting.

Daniel Compton 48:05 Great. Well, if people are interested in working with CircleCI, they can go to the CircleCI careers page, which is circleci.com/careers. And I want to say thanks for coming on and sharing the internals of what’s going on. I know, when I asked you for an interview, I wasn’t exactly sure how much you were going to be able to tell me, and it’s been really interesting to get the inside scoop on a lot of things going on at Circle. So, thanks so much for coming on and sharing what you know!

Marc O’Morain 48:34 Oh, you’re very welcome. If anyone ever meets me at a conference, I’d be happy to share what’s happening under the water as the swan furiously paddles to keep the platform running.

Daniel Compton 48:47 [Laughs] Great. Thanks so much! Have a great day!

Marc O’Morain 48:50 Thank you very much! Goodbye.
