Design for a small-scale self-hosted Git service

In many ways this is a sort of wishlist for a self-hosting Git solution that is much more lightweight than the big players (GitLab etc). I love GitLab and will continue to use it to host my important repos + stuff I expect to collaborate on, but there are many features (PRs + issues + very fancy web view) that I just don’t need for, say, private scripts.1 Even though I am referring to this hypothetical Git service as if I plan to make it, no promises that I actually do.2

Additionally, I don’t want private scripts under a centralized hosting service out of my control. It’s not that I’m paranoid about having my private scripts on GitLab, but it just doesn’t feel right.3

Also, GitLab’s namespaces are quite dry (at least on GitLab SaaS). One particular point of annoyance is that every user has a namespace. I was very bothered that users could not have subgroups, but that really disguised the fundamental problem I had with GitLab users: users should not have a namespace by default. It’s so clunky that signing up users, which are used for access control, also reserves a namespace! What if I’m just using GitLab to help maintain X software, or what if I sign up and never use the service, period? There are a lot of group/organization names (GitLab and GitHub respectively) that I wanted to use that are already some person’s username. Said person invariably has made 0 commits in the history of ever, and the namespace is so much drier for that. Boo!

OK, so that calls for a private Git server. But most of the solutions do not have “subgroups” like GitLab, which is a total deal-breaker for me, and GitLab is way overkill for a private Git server that literally only exists to sync files between my desktop and laptop. But at the same time, I have no plans to SSH into the server and pull from/push to repos that way. In particular, I want to set up API routes so that other people can see some repos I decide to make public, and so they can create/push to new repos as well.4

As far as I know there is no Git hosting service that does all of this for you. These are my plans for building one. Tentative name: Glee.

Filesystem and permissions

Overview

Let me tell you why I like GitLab’s subgroup functionality so much: It’s like a filesystem. That’s it, that’s all I want from a Git hosting service. As far as I know only GitLab supplies that: nesting with a depth of >1. And that’s great, but the number one complaint I have about GitLab (sub)groups is that permissions are inherited in an opaque manner. Projects inherit permissions from their group somehow: is it on creation? Is it persistent? I have no clue, even as I’m writing this, and I care so little about finding out that I’d rather write my own Git hosting service.

Permissions should be handled according to the following two rules:

  1. The most specific permission option should be used.
  2. If no permission is set, it “looks up” for the default option.

Here’s an example. Say that we have project C, and its path is A/B/C (so A and B are directories, C is a repo). Let us say that for some permission P, a directory/repo can either have yes, no, or inherit. So if A has yes and B/C are both inherit, then C inherits the status from B, which inherits the status from A, which has permission P. Therefore, C also has it. But if A has yes and B has no, then C inherits from B, which explicitly does not have permission P. Therefore, C does not have permission P.

That’s how permissions should work: keep going up until you reach a directory with the permission explicitly set (i.e. not inherit). Of course, if it’s just inherit all the way up, there should be a default value. It doesn’t particularly matter what it is, just as long as it’s sensible and made clear.
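The lookup rule can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the `SETTINGS` table, the `resolve` function, and the fallback value `"no"` are hypothetical names and choices, not part of any existing Glee code.

```python
# Hypothetical table: path -> explicit setting for one permission P.
# A path that is missing from the table is treated as "inherit".
SETTINGS = {
    "A": "yes",
    "A/B": "inherit",
    "A/B/C": "inherit",
}

DEFAULT = "no"  # assumed fallback when it is "inherit" all the way up


def resolve(path, settings, default=DEFAULT):
    """Walk from `path` up to the root and return the first explicit value."""
    parts = path.split("/")
    while parts:
        value = settings.get("/".join(parts), "inherit")
        if value != "inherit":
            return value  # the most specific explicit setting wins
        parts.pop()  # move up one directory
    return default


print(resolve("A/B/C", SETTINGS))
print(resolve("A/B/C", {"A": "yes", "A/B": "no"}))
```

The first call finds nothing explicit until it reaches A, and the second stops at B, which matches the two cases in the example above.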

Users

This is the design for a small-scale Git server, so every user should be trusted. Hosting services like GitHub and GitLab have intricate user permissions which I have used exactly zero times. What I am about to describe does not scale for large enterprises, because it is not supposed to.

There are three levels of permissions: none (i.e. you can perform this action without being logged in — think public repositories), user, and admin. For any particular repository, you can set view to none, user, or admin, and you can set push to user or admin. Obviously users inherit the permissions of all visitors, and admins inherit the permissions of users.

The reason GitHub/Lab needs access control is that anyone can sign up for an account. Instead, I think it’s better to authenticate each user during signup. Here I think an O(u) cost (u is the number of users) is better than an O(p) cost (p is the number of projects). This is because I expect the number of project-side operations to be far greater than the number of users.

There are a number of ways you can deal with verifying user accounts. One way is just to allow anyone to make an account (as in GitHub, GitLab, or indeed, any popular public-facing website), and only verify accounts that come from trusted maintainers. This can get kind of annoying because you have to trudge through potential troll/spam/test accounts5 to verify the one or two new legitimate users.

The only viable solution that I see is requiring admin intervention to create an account. On the mathadvance.org mail server, because we use the Mailcow suite, the admin has to directly make a user account. I kind of hate this approach, because it puts the onus on the user to log in to their account and change their password, and if they don’t then sucks to be you. Forced password resets are sort of a bandaid on this, but the consequence of a user not following through and using their account should not be that a garbage account gets created.

So here is my proposed solution. The signup form has these five fields:

Email
Real Name
Password
Confirm Password
Signup Code

All of the fields are self-explanatory except Signup Code. The signup code is a one-use temporary code that an admin generates that expires in, say, 48 hours (which is perfectly reasonable for any actual contributor to sign up in). The idea is that if you want someone to make an account, you give them a signup code. That way, if they don’t follow through, your temporary code expires in 48 hours anyway and there is no harm no foul.

I’m thinking of storing signup codes in /tmp, so they get cleaned up, and putting a timestamp along with the code in the file. So something like this:

bqIApG2okZH2NrAJVWKVQkQpvSIwV86L
1646357721036

Where the first line is the token, and the second line is the Unix timestamp (in milliseconds).
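Generating and checking such a code could look roughly like the sketch below. The function names, the 32-character alphanumeric token, and the reading of the timestamp as the creation time (rather than the expiry time) are all assumptions of mine; the design above does not pin these down.

```python
import secrets
import string
import time

ALPHABET = string.ascii_letters + string.digits
LIFETIME_MS = 48 * 60 * 60 * 1000  # 48 hours, in milliseconds


def new_signup_code(now_ms=None):
    """Return the two-line file body: token, then creation time in ms."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    token = "".join(secrets.choice(ALPHABET) for _ in range(32))
    return f"{token}\n{now_ms}\n"


def is_valid(file_body, submitted, now_ms=None):
    """Check a submitted code against a stored file body."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    token, created = file_body.split()
    fresh = now_ms - int(created) < LIFETIME_MS
    # Constant-time comparison, so timing does not leak how much matched.
    matches = secrets.compare_digest(token.encode(), submitted.encode())
    return fresh and matches
```

To make the code one-use, whatever calls `is_valid` would also have to delete the file once a signup succeeds; that part is left out here.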

Directories

The project will follow XDG specifications, so there will be two directories where stuff6 is stored. We have $XDG_DATA_HOME/glee for data generated by interfacing with Glee and $XDG_CONFIG_HOME/glee for manually edited config files.

Here is how $XDG_DATA_HOME/glee is going to look:

repos/
    test-repo/
        actual_file.txt
        .git/
    dir/
        nested-repo/
            actual_file.txt
            .git/
users/
    dennisc
repo-data.json
redirects.json

Here is what repo-data.json contains.7

{
    "test-repo": {
        "perms": {
            "view": "user",
            "push": "admin"
        },
        "history": ["dir/test-repo"]
    },
    "dir": {
        "perms": {
            "view": "any",
            "push": "admin"
        },
        "history": [],
        "subpaths": {
            "nested-repo": {
                "perms": {
                    "view": "user",
                    "push": "default"
                }
            }
        }
    }
}

If an object (like the value for key dir) has field subpaths then it is a directory, and its subpaths are contained in the object value of key subpaths. Otherwise, it is not a directory and is a repository. If you want, you can think of the entire JSON object as listing the subpaths of $XDG_DATA_HOME/glee/repos.

Put differently: if the subpaths field doesn’t exist, then the path itself must be a repo, and otherwise it is a directory.
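Telling repos from directories by the presence of subpaths makes the tree easy to walk. Here is a small sketch that lists every repo path; the `list_repos` helper is a hypothetical name of mine, and the data is the example from above inlined as a string so the snippet runs on its own.

```python
import json

# The example repo-data.json from above, inlined for illustration.
REPO_DATA = json.loads("""
{
    "test-repo": {
        "perms": {"view": "user", "push": "admin"},
        "history": ["dir/test-repo"]
    },
    "dir": {
        "perms": {"view": "any", "push": "admin"},
        "history": [],
        "subpaths": {
            "nested-repo": {"perms": {"view": "user", "push": "default"}}
        }
    }
}
""")


def list_repos(tree, prefix=""):
    """Yield the path of every repository (an entry without `subpaths`)."""
    for name, entry in tree.items():
        path = f"{prefix}{name}"
        if "subpaths" in entry:
            # A directory: recurse into its children.
            yield from list_repos(entry["subpaths"], prefix=path + "/")
        else:
            yield path


print(sorted(list_repos(REPO_DATA)))
```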

The history array is the prior locations that a particular path was in. If it proves to be too much an implementation hassle/I decide it isn’t useful (it’s not how the webserver will determine redirects), I will cut it out. In practice I think the hardest thing to do will be to define a simple, intuitive spec around its behavior when moving stuff. Should the history of nested-repo be added to if dir is moved? I am inclined to say yes.

The redirects.json file is for redirecting from old paths to new paths, provided that the old path is not used by something else. Here’s an example that corresponds with the previous one:

{
	"dir/test-repo": "test-repo"
}

Now if you move A to B and then B to C, you have two approaches: A redirects to B, which then redirects to C, or the link from A is rewritten to go directly to C. The former makes every redirect more costly (an extra hop each time), while the latter only adds a cost to renames. Since redirects will be far more common than renames (we hope), it is better to make renames more expensive. GET is more common than PUT/POST.

This is why a history array might be useful: look through the history, edit anything that appears in redirects.json. Then again, when moving B to C, you could just look at all key-val pairs with value B, edit them to C. Since this is a small-scale Git hosting service (why would perms be so broad/users require admin auth otherwise?) I don’t envision such a distinction mattering at all, since moving is a fairly rare operation. So here’s where my head’s at: no history array, when moving A to B, scan all redirects with value A and edit the value to B.

A redirect entry will be deleted if a new repo is created at the old location (here, dir/test-repo), or if an existing repo is moved there.
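The move rule I settled on (no history array, rewrite every redirect that points at the old path) plus the deletion rule can be sketched like this. The `move` function is a hypothetical helper, it only handles exact paths (not repos nested under a moved directory), and it treats the redirect table as a plain dict with the same shape as redirects.json.

```python
def move(redirects, old, new):
    """Return an updated copy of the redirect table after moving old -> new."""
    updated = {}
    for src, dst in redirects.items():
        if src == new:
            # Something now lives at `new`, so a redirect from it is deleted.
            continue
        # Anything that pointed at the old path now points at the new one,
        # so no redirect ever needs more than one hop.
        updated[src] = new if dst == old else dst
    updated[old] = new
    return updated


table = move({}, "A", "B")
table = move(table, "B", "C")
print(table)
```

After the two moves, both A and B point straight at C, so a GET never has to follow a chain.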

Obviously, redirects will respect view permissions (so the old URL will just return “no repo” if it redirects to a private repo, i.e. one you don’t have permission to view).

The users directory contains user info, probably username + hashed password + permissions. I don’t think I know enough about Git to say for sure what should be handled by Git/the OS and what should be handled by the program.

As for $XDG_CONFIG_HOME/glee, here is what’s going into it:

perms.toml

(Yeah, that’s it for now; I may add more conf files if the need arises, but if we only need one conf file that’s perfect.)

And inside perms.toml:

# The list of all roles besides `any` and `admin`
# The order that they are defined in is how permissions are inherited
# For instance, if `roles = ["user", "mod"]`, then `mod` inherits
# the permissions of `user` since `mod` comes after `user`.
roles = ["user"]

# The default role given to a new account.
signup_role = "user"

# Permissions assigned to the "default" key
[defaults]
view = "admin"
push = "admin"

You will notice that any is not in roles. This is because any literally means anybody, signed in or not. So it’ll be a reserved keyword, something that you can’t put in roles. Same for admin, there are special permissions only admins have (like granting signup tokens).
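As a rough sketch of how the role ordering could be turned into a permission check: `rank_table` and `allowed` are hypothetical names, and the roles list is passed in directly rather than parsed from perms.toml to keep the example self-contained.

```python
def rank_table(roles):
    """Map each role to a rank: any < configured roles (in order) < admin."""
    for reserved in ("any", "admin"):
        if reserved in roles:
            raise ValueError(f"{reserved!r} is reserved and cannot be in roles")
    ordered = ["any", *roles, "admin"]
    return {role: rank for rank, role in enumerate(ordered)}


def allowed(account_role, required_role, ranks):
    """True if `account_role` is at least as privileged as `required_role`."""
    return ranks[account_role] >= ranks[required_role]


ranks = rank_table(["user", "mod"])
print(allowed("mod", "user", ranks))   # mod comes after user, so it inherits
print(allowed("any", "user", ranks))   # a visitor is not a user
```

A visitor who is not signed in would be checked with the role `any`, which is why it sits at the bottom of the ordering.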

Even though I said before (and probably will say later) that having a bajillion levels of access control is stupid, I think I’ll be keeping roles extensible. It’s lightweight and totally opt-in (just don’t add more roles if you don’t want more). Just because Glee isn’t designed to scale doesn’t mean I won’t nab an easy opportunity to make it scale better.

You will notice that the default permissions are kind of conservative. That’s by design; you don’t want to accidentally expose private repos before you read up on how default permissions work.8

By the way, config files are in TOML, data files in JSON.

Webview

I want to make something simplistic like the Linux kernel’s Git webview.9 Actually, maybe even more so: I don’t think I need about, diff, or stats, and I probably won’t even implement syntax highlighting. This makes the link for raw content shorter, since there only is raw content. Maybe I’ll allow formatted view through a URL param like

https://glee.dennisc.net/glee.git/tree/.gitignore?fmt

and have links to files/directories direct to the fmt version, plaintext otherwise.10

Oh and also, branches via

https://glee.dennisc.net/glee.git/tree/.gitignore?b=dev
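Pulling the repo, file path, fmt flag, and branch out of such a URL is straightforward with the standard library. This is a sketch under my own assumptions: `parse_view_url` is a hypothetical helper, and a missing b parameter is returned as None so the caller can pick the default branch.

```python
from urllib.parse import parse_qs, urlsplit


def parse_view_url(url):
    """Split a webview URL into repo, file path, fmt flag, and branch."""
    parts = urlsplit(url)
    # keep_blank_values=True so a bare `?fmt` (no value) is still seen.
    query = parse_qs(parts.query, keep_blank_values=True)
    repo, _, file_path = parts.path.lstrip("/").partition("/tree/")
    return {
        "repo": repo,
        "path": file_path,
        "formatted": "fmt" in query,
        "branch": query.get("b", [None])[0],
    }


print(parse_view_url("https://glee.dennisc.net/glee.git/tree/.gitignore?b=dev"))
```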

I have no intentions of totally eschewing JS, by the way. I am not nearly as militant as some other people about “no JS!!” (Ad banners annoy me as much as anyone else, but to extrapolate that with “all JS bad” is a stretch. Though some contexts, particularly high-security ones, are totally right about no JS.) If I can avoid JS though I will make an effort to, particularly since I care about people using command-line browsers. Currently I plan to have the API return a list of directories and files inside a repo and format accordingly (this includes links, with fmt if appropriate); same for non-repo directories, and format with JS appropriately.

But GitWeb, CGit, etc. are not really the sort of solution I want, since as far as I know you can’t log in to the web interfaces. Which sucks for collaboration, and also sucks for personal use because what’s the point of a webview if I can’t even see all my projects, most of which are supposed to be private?

Other VCS

I say this is a Git hosting service (and indeed that is what I will support, first and foremost) but in principle nothing I have said is incompatible with something like Pijul (which I have wanted to try for quite a while!). So a Pijul integration is something I might want to consider, depending on my experiences with it.

The Caveat

The thing about this sort of design is that it has so few details, someone must have done it before me. Maybe I am wrong and everyone else decided to use 7 levels of access control but only 1 level of nesting (username/repo or org/repo). I hope I am not, though, so if you know a self-hosted Git service that sounds something like this, please let me know so I don’t have to build it myself. Because I would really rather not.