Moving SCM Off The Cloud

Huh, it’s been… half a decade since I last wrote a blog post here. Time has no meaning anymore. Well, I’ve been working on some interesting side projects, so I thought I might document some of my thoughts as I go about them. Nothing very structured, but maybe it’ll be of use to someone (or useful training data for our AI overlords.)

I’ve been working on a new game project, and I’m getting to the point where I want to start adding art assets. In the business SaaS space, where I’ve lived for the past… I don’t want to think about how many years it’s been… it’s generally fine to throw your limited art assets straight into your git repo. A few MB here and there might eventually creep your repo into a few gigabytes, but it’s usually not a big deal, even for fairly large teams. Games though? Oh, your repo is going to grow far, far faster. Whatever, storage is basically infinite, right? As of today, a single consumer-grade HDD is 28TB. Surely our cloud overlords have kept pace, and now offer a healthy amount of storage for a reasonable price… right?

Oh no.

Tools like Dropbox are a bit more generous with their offerings (3TB for just $17 a month per user! What a deal!), but you’re going to have a whole host of additional problems managing your git repo’s version control and Dropbox’s own versioning together. I don’t want to deal with that mess – it would be far more convenient for me to have everything in one place. AND the idea of paying a monthly fee in perpetuity also doesn’t sit right with me when I have a datacenter of storage capacity locally. I think it’s finally time to look at a self-hosted option.

Defining My Criteria

  1. Keep it simple, use existing solutions.
    • Nothing I want to do here is truly unique. I’m assuming this problem is common enough that there are several off-the-shelf solutions. The scope and complexity of my projects are also not generally enterprise-grade, so I don’t need super-complex stuff.
  2. Use my existing hardware
    • Any self-respecting data hoarder has a file server. And a backup for that server. And then some more servers for when you run low on space. And then some offsite backup servers. In this day and age, there’s zero reason to run a dedicated machine just for this function.
  3. Support future users
    • Hobby solo projects are fine, but when it comes to collaboration, my solution needs to make it fairly painless for other people to contribute. This is the biggest reason why I use Github for most projects.
  4. LFS support
    • LFS has now been around for a decade – I think it’s safe to assume I can use LFS for all my projects now. If something doesn’t support LFS by now, that’s an indication it’s not tooling worth investing in.

Synology

My primary fileserver is a Synology Rackstation. While I’m no stranger to rolling my own petabyte+ servers, there is immense value in having a deployed server that “just works”. So naturally, the first thing I checked was whether Synology offers any SCM solution. And yes, they do!

It’s relatively painless to get started, as long as you’re familiar with SSH. I was able to spin up a repo using my existing Synology credentials pretty easily. But almost immediately, I realized I didn’t like a lot of this.

For one, managing multiple repositories is very clearly going to become cumbersome. You create a shared folder that holds all your repositories. By default, you’re instructed to share the folder containing all your repositories with each Synology user account that you want to have repo access. If I want per-repo access, I have to manage multiple different shared folders. Spinning up a quick repo is no longer quick, and I have to be far more mindful of security than I might want to be. And again, since these accounts are the same as my primary fileserver’s accounts, I absolutely would want to be mindful of security.

Annoying, but manageable. What’s not acceptable, though, is the lack of LFS support. Out of the box, there is none, because LFS transmits over HTTP, not SSH, and Synology’s Git Server doesn’t support that. There are some workarounds, but once I started looking into them, I thought about all the other issues this approach has and decided to look elsewhere.
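
(For reference, the usual shape of those workarounds, as I understand it, is to keep git on SSH but point LFS at a separate HTTP endpoint via an .lfsconfig committed at the repo root. The URL here is purely illustrative:)

# .lfsconfig, checked in at the repo root; URL is illustrative
[lfs]
	url = https://nas.example.com:8443/gamerepo.git/info/lfs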

Gitlab

If I’m honest with myself, I think the reason I really wanted to try Synology’s simple git server was an aversion rooted in my days as a sysadmin maintaining a Gitlab instance, tying it into a Jenkins server, and integrating that with Testrail. My goal now is to develop my software project, not to spend my time doing more server administration. So initially I avoided the more robust tools. But Synology’s offering not doing what I wanted forced me to take a fresh look at the alternatives.

During my time at Smartsheet I got intimately familiar with Gitlab (and with administering a grandfathered legacy Github Enterprise account that has far better pricing than Github’s modern pricing structure, but I digress), so I could roll my own Gitlab instance. But… I just don’t like it. It’s very Enterprise-focused, and for smaller projects I think there are better options out there.

Gitea

Browsing around, I stumbled upon Gitea. I think it took me about 5 minutes to get an instance up and running. If I didn’t know better, I’d swear this was just a green version of Github. Most things that I use on a regular basis are very similar to the way Github does things.

Migration from Github was also easy. I know Git doesn’t technically have servers and all that, but it was nice being able to point at my (properly authed) Github private repos and have them land fully in Gitea, without having to figure out how to create a repo and import my existing code without losing history – I mess that up like half the time.

This was all a long-winded way of saying Gitea has been great for the limited amount I’ve used it. I haven’t set up any CI/CD stuff yet, but as a simple git server? It’s now my self-hosted go-to!

Stop Erasing My Updates!

A while back I was writing a story for a game idea. I was trying out some web-based SaaS writing tool and started writing at my desk. After a few minutes I realized this was going to be a long writing session, so I switched over to my laptop to write on the couch. I was in the zone! All sorts of cool ideas just flowed onto the screen. It was awesome. After several hours, I eventually called it a night, closed my laptop, and went to bed. The next morning, I decided to continue writing, and got out my laptop again. All my work was gone. What happened?! Turns out, I had left the page open on my desktop, and it was periodically auto-saving the mostly empty text from that version, overwriting the content coming from my laptop’s version. All my work was lost, in part because of bad API design.

Frustrated by the experience, I decided to make my own writing tool. (Why switch to another tool when you can reinvent the wheel for fun!) From the very beginning, my plan was to design something that avoids the pitfalls I’ve seen so many APIs make, by implementing safeguards at the API level to prevent the client from hurting itself.

What went wrong?

In the above situation, the server, API, and client all committed grievous sins that together caused a bad outcome. The server certainly could have saved the document history. And the client certainly shouldn’t have kept sending save requests every five minutes when nothing had changed. But the API could have worked around both of those problems and prevented other bad scenarios from occurring.

The Naive Solution

At first glance, the solution I’d pick seems simple. Add document versioning, have the server send the version number when it sends the document, and have the client send back the version number it believes to be the latest as part of the update request. If the numbers match, fantastic, you’re good to go. If not, the server should reject the operation, and the client then has to inform the user and give them some way to resolve the conflict.

/article/{articleId}:
  patch:
    parameters:
      - name: articleId
        in: path
        required: true
        schema:
          type: integer
      - name: lastVersion
        in: query
        required: true
        description: The last known version of the document
        schema:
          type: integer
        example: 42
    requestBody:
      description: The new text to replace the existing document
      content:
        text/plain:
          schema:
            type: string
          example: Hello dark and stormy world.
    responses:
      '200':
        description: OK
      '409':
        description: lastVersion does not match the latest version
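
To make that flow concrete, here’s a minimal sketch of what the server-side check could look like. This is Express-flavored TypeScript with a made-up in-memory store, not anything from a real implementation:

import express from "express";

const app = express();
app.use(express.text()); // document text arrives as text/plain

// Hypothetical in-memory store, just for illustration.
const articles = new Map<number, { version: number; body: string }>();

app.patch("/article/:articleId", (req, res) => {
  const article = articles.get(Number(req.params.articleId));
  if (!article) {
    res.sendStatus(404);
    return;
  }

  // Reject the update unless the client's lastVersion matches ours.
  if (Number(req.query.lastVersion) !== article.version) {
    res.sendStatus(409); // stale client: re-fetch and resolve the conflict
    return;
  }

  article.body = req.body;
  article.version += 1; // bump, so anyone holding the old version gets rejected
  res.sendStatus(200);
});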

But this is of course a totally custom solution, and for a problem as common as this, I bet there are some established conventions out there that can handle this problem. After all, there’s no need to reinvent the wheel. Let’s explore a few.

The No Versioning Solution

Let’s say adding document history is too much to ask of the server or database design. Well, surely you’ve at least got a “last modified” timestamp in there somewhere, right? Turns out, there’s a solution built into the HTTP spec that we can use: If-Unmodified-Since.

Essentially, all we have to do is send the time at which our last known version was created, and if there have been no changes since that time, the server accepts the request to update.

/article/{articleId}:
  patch:
    parameters:
      - name: articleId
        in: path
        required: true
        schema:
          type: integer
      - name: If-Unmodified-Since
        in: header
        required: true
        description: Timestamp of the last response
        schema:
          type: string
        example: Sat, 29 Oct 1994 19:43:31 GMT
    requestBody:
      description: The new text to replace the existing document
      content:
        text/plain:
          schema:
            type: string
          example: Hello dark and stormy world.
    responses:
      '200':
        description: OK
      '412':
        description: The document has changed since If-Unmodified-Since
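
Server-side, the check might look something like this (same made-up store idea as before, except it now tracks a modifiedAt timestamp in epoch milliseconds):

// Same sketch as before, but keyed off a last-modified time instead of a version.
const articlesByTime = new Map<number, { modifiedAt: number; body: string }>();

app.patch("/article/:articleId", (req, res) => {
  const article = articlesByTime.get(Number(req.params.articleId));
  if (!article) {
    res.sendStatus(404);
    return;
  }

  // Date.parse understands HTTP-dates ("Sat, 29 Oct 1994 19:43:31 GMT").
  const cutoff = Date.parse(req.get("If-Unmodified-Since") ?? "");
  if (Number.isNaN(cutoff)) {
    res.sendStatus(400); // header missing or not a valid date
    return;
  }

  // HTTP-dates only carry second precision, so truncate before comparing.
  const lastModified = Math.floor(article.modifiedAt / 1000) * 1000;
  if (lastModified > cutoff) {
    res.sendStatus(412); // Precondition Failed: the doc changed since then
    return;
  }

  article.body = req.body;
  article.modifiedAt = Date.now();
  res.sendStatus(200);
});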

There are a couple problems with this approach, however.

First, the client can simply set the time to the end of the universe, which bypasses the safety check while still technically satisfying the requirement to send a timestamp. You should always assume that at some point, someone using your API will misuse it. People don’t use APIs because they want to follow elegant design. They use APIs because they want to get something done. And if setting this property to the end of time means they don’t have to bother with tracking last-known times, it means they can get to their desired goal of updating the document faster and with less cognitive load. Your API should protect the client from making poor decisions whenever possible.
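
That misuse takes exactly one line of effort (articleId and newText stand in for whatever the client already has):

// Why bother tracking timestamps when you can just always "win"?
await fetch(`/article/${articleId}`, {
  method: "PATCH",
  headers: { "If-Unmodified-Since": "Fri, 31 Dec 9999 23:59:59 GMT" },
  body: newText,
});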

The second issue is how timestamps are handled in HTTP land. HTTP dates are strings whose smallest unit of time is seconds. If you have multiple updates within the same second, data loss will still happen as the second request overwrites the first. And if you’re following the standard, you MUST use standard HTTP dates:

A recipient MUST ignore the If-Unmodified-Since header field if the received field-value is not a valid HTTP-date.

No Milliseconds since Unix Epoch for you! Is it likely that two clients will try to update at exactly the same second? No, probably not. But when designing systems, why take that chance? We can do better.

Etags!

There’s another standard HTTP header we can use instead: ETag (short for “entity tag”). The server sends out an ETag whose value somehow represents the latest version of the document. How that representation is formed is up to you. You can use a simple version number, like in my naive solution, you can use milliseconds since the Unix epoch, or you can use something more complex if that better suits your needs.

The client doesn’t send the ETag back under the same name. Instead, they send a different header: If-Match. Here, the intent is made clear by the name: only accept this update if it matches the specified identifier.


/article/{articleId}:
  patch:
    parameters:
      - name: articleId
        in: path
        required: true
        schema:
          type: integer
      - name: If-Match
        in: header
        required: true
        description: The last known version of the document
        schema:
          type: string
        example: '"42"'
    requestBody:
      description: The new text to replace the existing document
      content:
        text/plain:
          schema:
            type: string
          example: Hello dark and stormy world.
    responses:
      '200':
        description: OK
      '412':
        description: If-Match does not match the latest version

This is essentially the same as my naive version, but using established conventions. There’s nothing really gained from my original way of doing things, so following the convention is preferable: it’s a format people are already familiar with, and it makes the intent clearer.
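
Client-side, the round trip ends up looking something like this, assuming the server includes an ETag header on reads (articleId is a placeholder):

const articleId = 42; // whatever document the user has open

// Read the document and remember the version tag the server handed us.
const getRes = await fetch(`/article/${articleId}`);
const etag = getRes.headers.get("ETag");
let text = await getRes.text();

// ...the user edits `text` here...

// Only apply the update if nobody else changed the document in the meantime.
const patchRes = await fetch(`/article/${articleId}`, {
  method: "PATCH",
  headers: { "If-Match": etag ?? "" },
  body: text,
});

if (patchRes.status === 412) {
  // Someone else saved first: surface the conflict instead of clobbering it.
}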

Think about APIs

I’m a believer that a strong API helps foster a great user experience. Your API should be more than just a dumb communication protocol to let the client throw things into your database. It should consider both the intentions of the client and possible bad things the client might do. Don’t just accept anything because the client asks you – make sure they know what they’re asking you to do. It might be a bit more effort, but your end users will thank you for a smooth experience!

PUT vs PATCH

Most of the time, we consume APIs, not write them. And even when you do write an API, it’s almost always to modify or expand an existing API. That means following existing conventions that were put into place when the API was first written, no matter what, because maintaining consistency is important and breaking existing clients is absolutely unacceptable. The unforeseen problem is that you tend to treat the design patterns you’re familiar with as the correct ones. So when you eventually do get the opportunity to design a brand new API from the ground up, it’s important to take the time to relearn what you think you know.

That’s the position I find myself in now. By far, the API I’m most familiar with is Smartsheet’s, since I consumed it as a mobile developer for a number of years, and designed many additions and improvements. The amount of stuff you can do with that API is staggering – it’s far more complex than many enterprise APIs out there, but it’s also really old, which means there have been a lot of opportunities for less-than-ideal designs to work their way in. There are a number of trappings from that design that I carried straight into my current project, simply assuming them to be correct, because that’s the way I’ve always done them™! It’s time to take a step back and get back to basics.

Put that PUT down!

Let’s say I want to update an existing document using a RESTful API. That document’s bound to have lots of different properties such as the name of the document, font information, owner of the document, and of course the actual text of the document. If I’m just modifying the text of the document, but leaving the rest of the properties alone, the clear solution is to PUT only the properties I want to modify, and omit the ones I don’t want to change. Right?

No! PUT is intended to replace an entire object, not to modify parts of it. If you’re looking to modify just a section, PATCH is the better option. You’ll find a lot of APIs behave this way, though, in part because PATCH wasn’t even a thing until 2010, and wasn’t a widely adopted part of the standard until later (as all new standards take time to proliferate).
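
To make the difference in intent concrete, here’s a quick sketch (the endpoint and document fields are hypothetical):

const articleId = 42; // hypothetical document

// PUT: replace the entire document. Anything you omit is simply gone.
await fetch(`/article/${articleId}`, {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "My Novel",
    font: "Garamond",
    owner: "me",
    text: "It was a dark and stormy night...",
  }), // every property, every time
});

// PATCH: modify only what you name. Everything else stays untouched.
await fetch(`/article/${articleId}`, {
  method: "PATCH",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "It was a dark and stormy night..." }),
});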

But even now, a decade later, it’s still an opinionated subject. And I agree that it’s not strictly necessary to use PATCH to apply a partial update. But then again, you could also send a large body of information along with your GET request and have updates applied that way – as long as the server and client both agree to it, you can do whatever you want!

The reason different HTTP verbs exist is to help make intent clear. To a new viewer of your API, PATCH much more clearly broadcasts the intent of the action – to modify part of the object, not replace it. Given the opportunity to make a choice between the two, following conventions is preferable.

Unless…

There’s one time I’d argue against using PATCH over PUT. And that’s if your API is already using PUT as the established convention. Even if you’re writing a new endpoint, if the rest of the existing API consistently follows one established format, for the sake of your users, keep using that format. The intent of your local API will be better understood if it is internally consistent, even if that goes against the format everyone else uses. After all, only nerds read formal specification docs. Your users are cooler than that, so do them the favor of not having to stop and think about every endpoint.

tl;dr

In summation, if you’re designing a new API from the ground up, I implore you to use PATCH over PUT when partially modifying stuff, as this makes your intent clearer. But maintaining consistency is more important still, as it reduces cognitive load on your users and the frustration they’ll feel as they switch between endpoints.