This is the first part of a multi-part series taking a deep dive into Ocient’s Bazel-based build system. This post covers the motivation behind transitioning to Bazel and how we manage external dependencies.

This post assumes a working knowledge of Bazel. If you are new to Bazel, the Bazel website will be a better place to start.

Obligatory: the views expressed in this blog are my own and do not constitute those of my employer (Ocient).

Background

Why Another Blog Series About Bazel?

During this project I read a ton of blogs about Bazel. It seemed like it was the cure to all ailments. Hermetic and faster builds? Sign me up! Everything anyone needed to know about Bazel was already out there. What can I add?

What I quickly found is that many of these blogs used contrived examples to show “how cool Bazel is” by building a hello-world-like program. Either that, or the authors worked at organizations with strict C++ standards around compile time and never pushed C++ compilation to the limit (more on this later). Ocient was in race mode when significant portions of our codebase were written, which means we have some… how do I put this gently… “legacy code”. We needed a flexible build system that could scale well and handle some of our very specific build requirements (more on this later).

My goal in writing this series is to describe a “real-life” switch to Bazel for a project with many external dependencies, an extremely nuanced build system, and very custom solutions for building C++ code. I hope to speak fairly technically about the work we did to make everything build, so you can use these design patterns and ideas to facilitate your own switch to Bazel, speed up your Bazel builds, or just get inspiration.

Here are some stats about the Ocient codebase.

  • Over 3,000,000 lines of C++ code
  • Just under 100,000 lines of Python code
  • Over 70,000 commits authored by more than 100 people

Existing Build System

Before this project started, our build infrastructure was Make-based with a distcc backend. A developer would run a command like make all -j 100, which meant “build everything and run 100 compilation actions at the same time (and do them remotely if slots remain)”.

If 100 simultaneous actions sounds like a lot, it is because it is! Our build servers are not homogeneous, but most have around 128 cores and 500GB of RAM. Each one cost on the order of $10,000; we had around 8 of them when this project started and are now up to 17. Post-Bazel, we run with 1000 concurrent actions as the default.

Our project was divided into 20+ libraries, where each library contained related code and unit tests. The dependencies of the main binary target (“rolehostd”) are shown below. Do not worry too much about the names of the libraries; the point of this graph is to illustrate that we have a reasonable list of internal dependencies.

Build Types

We had three poorly named build types as well as a variety of subtypes.

  • RELEASE builds are the most aptly named. They have all the optimizations we support (-O2 and others) and all asserts are compiled out. To get a sense of how many optimizations we have, here are some of our largest object files: the top three in this mode are 1264MB, 1177MB, and 1173MB. With -j36 it took over 3 hours to build.

  • TEST_RELEASE builds are the same as RELEASE builds but with asserts turned on and a few of our most egregious inlining optimizations turned off. These are the builds we use for most of CI since they provide a good tradeoff between program execution speed, compile time, and correctness.

  • DEBUG builds have no optimizations (-O0; note we are investigating the use of -Og vs -O0). In addition, these builds use shared objects for each library for faster linking (more on this later). These builds also include an address sanitizer and a memory sanitizer. This is the default build type.

Here are a few example build subtypes as well. I am not including all of them for brevity.

  • NO_SHARED is only used with DEBUG builds. It means build without the address sanitizer or memory sanitizer, and statically link all libraries rather than building shared objects. Great naming!

  • NO_DISTCC means do not use our distributed build system; instead, run everything locally.

  • DEBUG_CONTAINERS means build with debug containers that have asserts in them to check for proper usage. This found extremely serious bugs in our code, and I would highly recommend you use something like it in your organization.

Putting this all together, a sample command might look like make all DEBUG_CONTAINERS=y, which means “make all with DEBUG_CONTAINERS and use a DEBUG build”.

Build Targets

For those familiar with Make, you might be wondering why I am specifying all rather than a specific target (e.g. binary-target). Without going into too much detail, it is because the impact of the other things included in all is negligible. The main thing we build is called “rolehostd”, but for the purposes of this post I am just going to call it our “binary target”. In addition, each library also produces a gtest binary that runs all the unit tests in that library.

Specific Build Requirements

I’m sure people are going to read this and say “why don’t you do X instead of Y?”. However, with each one of those changes the scope creeps and the criteria for calling this project done grow and grow. It was important to keep the impact outside of the build system as minimal as possible. Many of these hard requirements crept up on Ocient. For example, we never tested building with a different version of gcc; lo and behold, once you stop continuously testing something, you eventually lose compatibility.

  1. We strip debug symbols from many object files because our binary is so large it would otherwise fail to link. Sections need to be within 2GB of each other since offsets are only 32 bits. This is worthy of a post by itself.
  2. We use linker scripts to move sections around in the binary since these addresses can conflict with the memory regions the address sanitizer reserves (a sketch of how this and -mcmodel=large can be expressed in Bazel follows this list).
  3. We build with -mcmodel=large. This is also worthy of a post by itself.
  4. We require gcc 8.2 with a patch. We are using features in C++20. Downgrading or upgrading gcc would be a non-trivial amount of work.
  5. DEBUG builds required creating a shared object for each library.
  6. We use gdb 10.
  7. Our developer containers run Ubuntu 16.04.
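
To make items 2 and 3 concrete, here is a minimal sketch of how a linker script and -mcmodel=large can be plumbed into a Bazel cc_binary. The target and file names are made up; our real setup is more involved.

# BUILD.bazel (hypothetical, simplified target; names are illustrative)
cc_binary(
    name = "rolehostd_sketch",
    srcs = ["main.cpp"],
    # Item 3: compile with the large code model.
    copts = ["-mcmodel=large"],
    # Item 2: hand our linker script to the linker.
    # additional_linker_inputs makes the script available to the link action,
    # and $(location ...) expands to its path inside linkopts.
    additional_linker_inputs = ["sections.ld"],
    linkopts = ["-Wl,-T,$(location sections.ld)"],
)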

Problems with the Existing Build System

So hopefully by now you have some background. Now let’s talk about the problems. At the time this project started we had about 44 employees in development roles (not counting interns); by the time the project was finished we had about 55 (an increase of roughly 25%). We were already starting to see the build system show its age. Frequently, during times of high demand, compile actions would time out on remote nodes and then have to be run locally. Since our hardware is shared, this had a hugely negative impact on our CI nodes as well as developer containers. For example, if a developer built while our distcc build farm was saturated, the other people on the shared machine would notice a significant slowdown and a significant increase in memory usage. To minimize saturating the distcc build farm, we lowered the concurrency of builds in CI, and it was not unheard of for CI to take longer than 8 hours.

Other than slowness, there were other problems with Make/distcc.

  1. Distcc offered no support for debug symbol fissioning. This was the primary driver to upgrade: we needed a way to keep our massive amount of debugging information.
  2. Build artifacts could not be cached and shared between developers. When we tried with ccache, we found bugs at scale.
  3. There was little in the way of elasticity, which meant resources were often underused.
  4. Semi-frequent build errors caused by stale build artifacts sitting around. Often we would tell people to blow away all their build artifacts and, like magic, the build would start working again, wasting a lot of their time.

With all that being said, it was time to find a new solution.

Bazel Transition

Bazel Perks

Besides solving the problems with Make/distcc, Bazel comes with some other goodies.

  1. Support for debug symbol fissioning
  2. Remote execution and cache (for compilation and test actions; see the .bazelrc sketch after this list)
  3. Better resource utilization (will get into this in a later post)
  4. Build sandbox
  5. Excellent built-in profiling
  6. Potential for auto-scaling based on demand
  7. Linking could be done on the build farm
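
To give a flavor of what the first two perks look like in practice, here is a minimal .bazelrc sketch. The endpoints are placeholders rather than our real infrastructure, and the exact flags we run with are beyond the scope of this post.

# .bazelrc (illustrative values only)
# Perk 1: split debug info into .dwo files instead of carrying it in every object file.
build --fission=yes
# Perk 2: run compile and test actions on a remote build farm and share a cache between developers.
build --remote_executor=grpc://buildfarm.internal.example:8980
build --remote_cache=grpc://buildcache.internal.example:8980
# Post-Bazel we default to 1000 concurrent actions (mentioned earlier).
build --jobs=1000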

Moving our Toolchain to Bazel

In our Make-based system there was a bootstrapping step when building. First, a new developer had to build our toolchain, a 2-3 hour process they had to go through on their first day (boring!). Only once the toolchain was built could they use the tools in it to actually build our system.

The real toolchain pain came when something in the toolchain was updated. Every single developer would have to remake their toolchain, and depending on the change, a developer might also have to blow away all existing object files and archives. Many, many times developers would skip that step and hit nightmarish linker bugs because a stale object was being linked with an incompatible one.

Our toolchain consisted of two things: the tools used to build Ocient (think g++), and some external dependencies, some of which were linked into our binary and some of which were not (think curl and gdb). These external dependencies were versioned by git sha and stored on our internal servers. On top of that, there were additional external dependencies kept outside of the toolchain, locked to a specific commit hash (you know… to keep everything simple /s). Before we could make any progress on Bazel we had to port our toolchain and external dependencies to Bazel.

Something I briefly looked into while building this was a cross-compiler toolchain. I am in no way claiming I fully understand this topic; it is hard to get right, but it allows for different GCCs for different targets (a 32-bit vs a 64-bit gcc, for example).

Our First Approach

We noted that every developer was building the same set of tools. We could build this once and have every developer clone it down, with the toolchain build running as an automated job whenever the toolchain changed. We used Make for this since it allowed us to reuse some of the existing toolchain code. We spent the time to migrate all of our toolchain and external dependencies to our Bazel toolchain. The new process was to just run an rclone command and a new developer was ready to go.

This worked well until it didn’t. We found we were frequently updating our external dependencies, and each time we did, we had to deploy a new toolchain while supporting the old one until no one used it anymore. This was just horrible for us (and great for the rest of the development organization): we would have anywhere from 10-minute to 8-hour iteration times for toolchain changes. In the worst case we would rebuild the toolchain and our binary only to discover that something else in the toolchain needed to change. This led us to the second approach.

This was the first point where I was disappointed with the Bazel docs and the literature on the subject of external dependencies. It appeared to me that the suggested way of dealing with external deps was to replace the project’s build system with a BUILD.bazel and a WORKSPACE file. I assume this works great for organizations the size of Google, since they can afford to spend the development effort on things like this. However, for Ocient, every second I spent rewriting a build system for an open source piece of software was a second not spent adding features to Ocient. This was made even more frustrating by the fact that these projects already have a working build system.

  • We have 47 external C++ dependencies (60+ including tools built)
  • Of those, 7 are header-only (require no building)
  • In Bazel’s default approach we would need to maintain a build system for 40 external dependencies

Our Second Approach

What if we could build most of our external dependencies in Bazel, leave only a select few things in our toolchain, and utilize each external project’s existing build system? This avoids the problem of frequent toolchain upgrades while still getting caching and hermetic builds. There is another good blog post detailing this here (scroll to the very bottom of the article). The solution uses rules_foreign_cc to wrap non-Bazel build systems in Bazel by listing the artifacts we expect them to produce. For headers output by libraries, you only need to specify a folder rather than every file in the folder (which is huge). This is the approach we are still using today and the one I would recommend you use.
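
One thing the examples below take for granted is that rules_foreign_cc itself has been pulled into the WORKSPACE. Here is a minimal sketch of that setup; the version, URL, and checksum are illustrative, so substitute the release that matches your Bazel version.

# WORKSPACE (illustrative version; fill in the sha256 for the release you pick)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "rules_foreign_cc",
    sha256 = "<sha256 of the chosen release>",
    strip_prefix = "rules_foreign_cc-0.9.0",
    urls = ["https://github.com/bazelbuild/rules_foreign_cc/archive/refs/tags/0.9.0.tar.gz"],
)

load("@rules_foreign_cc//foreign_cc:repositories.bzl", "rules_foreign_cc_dependencies")

# Registers the toolchains (cmake, ninja, make, etc.) that rules_foreign_cc uses to drive foreign build systems.
rules_foreign_cc_dependencies()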

I’m not claiming we have the most complicated dependency setup, but I do think it is more complicated than the average Bazel blog post, which is why I am going to share the code for some of our dependencies.

Let’s start simple. This is a fairly popular C++ library known as TCLAP. It is very simple because it is a header-only library.

# WORKSPACE
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "tclap",
    # Hold on, I thought you said we don't want to replace any existing build systems?
    # Yes, that is true but it is easier to just rip out the header files than it is to get TCLAP to build using its own build system.
    # By just creating a cc_library with only the header files we can #include them later in our program (since this library does not produce any archive or shared object files)
    build_file_content = """
cc_library(
  name = "tclap",
  # Find all the header files for the library. See the bazel docs on cc_library for more info.
  hdrs = glob(["include/**/*.h"]),
  # This is the magic line right here.
  # Without this, whenever we #include a TCLAP header we would need to preface it with "include/".
  # Maybe that is your style, but it still would not work:
  # TCLAP headers include other TCLAP headers, and those #includes do not start with "include/".
  # We need to strip the include prefix.
  strip_include_prefix = "include/",
  # Make this visible to everyone
  visibility = ["//visibility:public"],
)
    """,
    # The sha to use for this library. If the sha does not match the download this will fail.
    sha256 = "9f9f0fe3719e8a89d79b6ca30cf2d16620fba3db5b9610f9b51dd2cd033deebb",
    # After unzipping, strip this top-level directory from all paths in the archive.
    strip_prefix = "tclap-1.2.1",
    # The locations to download the files from. ext_location is the URL prefix of our internal package mirror (defined near the top of the WORKSPACE; see the next example).
    urls = [ext_location + "tclap-1.2.1.tar.gz"],
    # WORKSPACE file for this dependency. Usually can be left blank.
    # This is used more often if this library is a Bazel dependency.
    workspace_file_content = "",
)
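
Once the repository is declared, using TCLAP from our own code is just a normal deps entry. A hypothetical consumer (the target name and source file are made up):

# BUILD.bazel in one of our own packages (hypothetical target)
cc_binary(
    name = "arg_parsing_demo",
    srcs = ["arg_parsing_demo.cpp"],
    # The cc_library we declared in build_file_content above.
    deps = ["@tclap//:tclap"],
)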

Let’s knock it up one notch. This is another real-life example, this time for jsoncpp.

# WORKSPACE
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Select all files and use them as the input to the cmake rule
all_content = """filegroup(name = "all", srcs = glob(["**"]), visibility = ["//visibility:public"])"""

ext_location = "http://external-packages-location/"

# Standard Bazel practice: declare the package as an http_archive so it is downloaded before the analysis phase
http_archive(
    name = "jsoncpp",
    # "Select all files"
    build_file_content = all_content,
    # About 50% of our external dependencies need at least one patch to work.
    # This is due to the flags we use when compiling, general bugs, and static analysis problems.
    # We often found patching to be relatively easy.
    patch_args = ["-p0"],
    patches = [
        "//toolchain/patches:jsoncpp-clang-fix.patch",
    ],
    # The sha to use for this library. If the sha does not match the download this will fail.
    sha256 = "90618516abaed488d23f7b7e358341075073cbdce3d1ed0329bb23cdaaa66183",
    # After unzipping, strip this top-level directory from all paths in the archive.
    strip_prefix = "jsoncpp-1.7.2",
    # The locations to download the files from.
    urls = [ext_location + "jsoncpp-1.7.2.zip"],
    # WORKSPACE file for this dependency. Usually can be left blank.
    # This is used more often if this library is a Bazel dependency.
    workspace_file_content = "",
)

# BUILD.bazel
load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")
cmake(
    name = "jsoncpp",
    # Understanding this requires some knowledge of cmake.
    # The build command is run with these defined.
    # To understand what these mean you need to read the cmake documentation as well as the documentation of the project.
    # In this case, each line is pretty self explanatory with what it does.
    # A good hint to see these configuration options is to run `cmake -LA`
    cache_entries = {
        "CMAKE_BUILD_TYPE": "Release",
        "BUILD_STATIC_LIBS": "on",
        "BUILD_SHARED_LIBS": "on",
        "JSONCPP_WITH_WARNING_AS_ERROR": "off",
        "JSONCPP_WITH_STRICT_ISO": "off", # Issue all the warnings demanded by strict ISO C and ISO C++ - found in the source code of this library
    },
    # I think going deep into this is worthy of its own post.
    # This enables -mcmodel=large so that we can statically link our binary.
    # This also allows us to switch on the build type.
    # For debug containers, we have to enable debug containers here.
    # By putting everything behind the "EXTERNAL_FEATURES" variable we can change this in the future and make sure the change happens to every external dependency.
    features = EXTERNAL_FEATURES,
    # Reference to the source files; this is the "all_content" filegroup from the WORKSPACE.
    lib_source = "@jsoncpp//:all",
    # comment out out_shared_libs and uncomment out_static_libs to statically link
    # out_static_libs = ["libjsoncpp.a"],
    # Prefer to use shared libs here.
    # We used to use the static archive and we left it here for illustration.
    out_shared_libs = [
        "libjsoncpp.so",
        "libjsoncpp.so.1",
        "libjsoncpp.so.1.7.2",
    ],
    # Everything should be able to reference this dependency.
    visibility = ["//visibility:public"],
)
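
Internal code then depends on this cmake target like any other cc_library; rules_foreign_cc exposes the headers and shared objects through CcInfo. A hypothetical consumer (the //third_party package path is a placeholder; ours is different):

# BUILD.bazel (hypothetical consumer)
cc_library(
    name = "config_loader",
    srcs = ["config_loader.cpp"],
    hdrs = ["config_loader.h"],
    # The cmake() target defined above; adjust the package path to wherever it lives in your repo.
    deps = ["//third_party:jsoncpp"],
)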

Another example, this time for DPDK. I skipped the changes to the WORKSPACE file since they are similar (except we have more patches for DPDK). Warning: it gets grosser!

# BUILD.bazel
load("@rules_foreign_cc//foreign_cc:defs.bzl", "make")
make(
    name = "dpdk",
    # See above
    features = EXTERNAL_FEATURES,
    # See above
    lib_source = "@dpdk//:all",
    # YUCK! Now we need to make changes to this library if we change some gcc flags.
    # However, this avoids the problem of rewriting DPDK's build system.
    # We are passing CFLAGS and other various arguments to the meson build system that mirror the arguments we build with.
    # Note $EXT_BUILD_DEPS and $INSTALLDIR.
    # These variables are provided by rules_foreign_cc for scripting.
    # Later we will see how to get a list of such variables.
    # DPDK uses ninja/meson. These are the commands used to run ninja/meson.
    make_commands = [
        "CFLAGS=\"-march=broadwell -mcmodel=large -fPIC\" $EXT_BUILD_DEPS/bin/meson.py -Dprefix=$INSTALLDIR -Ddisable_drivers=* -Dmachine=broadwell --buildtype=release --libdir=lib --default-library=shared build",
        "cd build",
        "ninja",
        "ninja install",
    ],
    # Yuck, more grossness...
    # We need to know all outputs so Bazel can build its dependency graph.
    # We found these by running the DPDK install and copying all the libraries output.
    static_libraries = [
        "librte_acl.a",
        "librte_bbdev.a",
        "librte_bitratestats.a",
        "librte_bpf.a",
        "librte_bus_dpaa.a",
        "librte_bus_fslmc.a",
        "librte_bus_ifpga.a",
        "librte_bus_pci.a",
        "librte_bus_vdev.a",
        "librte_cfgfile.a",
        "librte_cmdline.a",
        "librte_common_octeontx.a",
        "librte_compressdev.a",
        "librte_cryptodev.a",
        "librte_distributor.a",
        "librte_eal.a",
        "librte_efd.a",
        "librte_ethdev.a",
        "librte_eventdev.a",
        "librte_flow_classify.a",
        "librte_gro.a",
        "librte_gso.a",
        "librte_hash.a",
        "librte_ip_frag.a",
        "librte_jobstats.a",
        "librte_kni.a",
        "librte_kvargs.a",
        "librte_latencystats.a",
        "librte_lpm.a",
        "librte_mbuf.a",
        "librte_member.a",
        "librte_mempool.a",
        "librte_mempool_bucket.a",
        "librte_mempool_dpaa2.a",
        "librte_mempool_dpaa.a",
        "librte_mempool_octeontx.a",
        "librte_mempool_ring.a",
        "librte_mempool_stack.a",
        "librte_meter.a",
        "librte_metrics.a",
        "librte_net.a",
        "librte_pci.a",
        "librte_pdump.a",
        "librte_pipeline.a",
        "librte_port.a",
        "librte_power.a",
        "librte_rawdev.a",
        "librte_reorder.a",
        "librte_ring.a",
        "librte_sched.a",
        "librte_security.a",
        "librte_table.a",
        "librte_timer.a",
        "librte_vhost.a",
        "librte_telemetry.a",
    ],
    # Requires meson
    tools_deps = ["@meson//:meson"],
    # Anyone can see this
    visibility = ["//visibility:public"],
)

End of Part 1

That is it for part 1! If you have any feedback or see any mistakes, please drop me a line. The upcoming posts will cover the BUILD.bazel files we use to actually build our code, how we deal with protobuf dependencies, how we do remote execution, and any other topics people are interested in.

Ocient is Hiring

Ocient is hiring for all sorts of roles across development. If you are interested in working on build systems or any other aspect of a distributed database, apply and drop me an email to let me know you applied.