How to find out which Git hosting application a repository is served by?

For a small Git helper script, based on this blog post, I'd like to be able to "discover" which Git hosting app (e.g. GitLab CE/EE, Gitea, GHE) a given remote URL ([email protected]:namespace/project.git) points to.

Using curl --head, I found that results range from several identifying strings to none at all, so headers alone seem too inaccurate to feed into a heuristic. Going by the page body may provide more data for the heuristic, but seems equally crude.

Is there a more elegant or standardised way to find the app type? Something like a "server_agent"?


I understand that, for security reasons, detailed info such as the app version will likely not be served. I also noticed that Shodan has no "product" search for these apps. Does that mean it's fundamentally not possible to reliably identify them without HTML parsing?


Solution 1:

I do not think there is any "standardized" approach to finding the hosting application. The Git protocol itself does not provide any such thing. In HTTP (which most Git hosting apps use as the transfer protocol), the Server header is probably the best match - but of course, as you noted, there is no requirement for it to be meaningful (or even present).
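
As a rough sketch of that header check in Python (the equivalent of curl --head): the function names below are made up for illustration, and nothing here is guaranteed by any hosting app. Note that the Server value often names only the web server or reverse proxy in front of the app, which says nothing about the app itself.

```python
import urllib.request
from typing import Mapping, Optional


def fetch_headers(url: str, timeout: float = 5.0) -> dict:
    """Send a HEAD request and return the response headers.

    Raises urllib.error.HTTPError for 4xx/5xx answers and
    urllib.error.URLError if the host cannot be reached.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Header names are case-insensitive, so normalise them to lower case.
        return {name.lower(): value for name, value in response.getheaders()}


def server_hint(headers: Mapping[str, str]) -> Optional[str]:
    """Return the Server header value, or None if it is missing or blank.

    Expects header names in lower case, as produced by fetch_headers().
    """
    value = headers.get("server", "").strip()
    return value or None
```

A caller would do something like `server_hint(fetch_headers("https://git.example.com/"))` (hypothetical host) and treat a `None` or generic result as "no information", not as a negative answer.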

"Does that mean it's fundamentally not possible to reliably identify them without HTML parsing?"

Yes. If the server chooses not to identify itself via the Server header, you can only guess (based on other headers, HTML responses, and so on).

So it seems there is no reliable way to do what you want. Maybe it helps to see it as an XY problem? If you describe what you want to do with the information, you may find a different solution.

Maybe you can try probing the server? Or ask the user?
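
A minimal sketch of what probing could look like: request one well-known path per product and see which one answers. The paths in `PROBES` are assumptions based on the REST API prefixes these products document (`/api/v4/` for GitLab, `/api/v1/` for Gitea, `/api/v3/` for GitHub Enterprise); check them against the versions you care about. A reverse proxy, a login wall, or a locked-down instance can make any of them misleading, so the result is still only a hint.

```python
import urllib.error
import urllib.request
from typing import Callable, List

# Assumed probe paths; verify them against the products and versions you target.
PROBES = {
    "GitLab": "/api/v4/version",
    "Gitea": "/api/v1/version",
    "GitHub Enterprise": "/api/v3/meta",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # a 4xx/5xx answer is still an answer
    except (urllib.error.URLError, OSError):
        return 0


def probe(base_url: str, status_of: Callable[[str], int] = http_status) -> List[str]:
    """List the apps whose probe path exists (200) or exists but wants auth (401/403)."""
    candidates = []
    for app, path in PROBES.items():
        if status_of(base_url.rstrip("/") + path) in (200, 401, 403):
            candidates.append(app)
    return candidates
```

If `probe()` returns zero or several candidates, falling back to asking the user is probably the sensible default. The `status_of` parameter exists only so the logic can be exercised without network access.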

Solution 2:

"Does that mean it's fundamentally not possible to reliably identify them without HTML parsing?"

More or less, that is true. As sleske correctly states, there's no reliable way to use headers to identify the application/technology behind an HTTP server, as servers often choose not to provide this information.

Parsing the HTML of the site's top-level home page may or may not yield any useful information. If you are familiar with these services, you could probably make a good guess, but it would be just that: a guess. With enough sophistication, you can probably get very good at guessing, but nothing is 100% certain.

You may also be able to make some positive determinations based on the remote URL and/or the application's behavior, if it is publicly accessible (that is, by probing the server, as sleske also suggested).

For example, most SCM servers other than GitLab do not have deeply nested remote URLs. The remote URL [email protected]/foo/bar/project.git is not possible on GitHub, Bitbucket, or Gitea, but is possible on GitLab.
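
A small sketch of that nesting check, assuming remotes in either URL form (`https://host/ns/project.git`) or scp-like form (`user@host:ns/project.git`). The function names and the `git.example.com` host used in the usage note are hypothetical.

```python
import re
from typing import List
from urllib.parse import urlparse


def namespace_segments(remote: str) -> List[str]:
    """Return the path segments that come before the project name."""
    if "://" in remote:
        # URL form, e.g. https://host/ns/project.git or ssh://git@host/ns/project.git
        path = urlparse(remote).path
    else:
        # scp-like form: [user@]host:ns/project.git
        match = re.match(r"^(?:[^@/]+@)?[^:/]+:(.+)$", remote)
        path = match.group(1) if match else ""
    parts = [part for part in path.split("/") if part]
    return parts[:-1]  # drop the project itself


def looks_nested(remote: str) -> bool:
    """True if the project sits more than one namespace level deep."""
    return len(namespace_segments(remote)) > 1
```

With this, `looks_nested("git@git.example.com:foo/bar/project.git")` is true and `looks_nested("git@git.example.com:namespace/project.git")` is false. Only the positive case carries information: a nested path points towards GitLab, while a flat path is possible on all of them.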

You may also find that certain UI kits (particular combinations of JavaScript, CSS, etc.) or other unique elements in the response are used exclusively by certain SCM products or versions. Error responses (both over HTTP and SSH) can also be revealing.
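
To make the guessing a bit more systematic, one option is to count how many known markers of each product show up in the raw response (headers plus body). This is only a sketch: the markers in `SIGNATURES` (session cookie names and product names) are assumptions you would have to verify and extend yourself, and a match is evidence, not proof.

```python
from typing import Dict, List

# Illustrative signatures only; every marker here is an assumption to verify.
SIGNATURES: Dict[str, List[str]] = {
    "GitLab": ["_gitlab_session", "gitlab"],
    "Gitea": ["i_like_gitea", "gitea"],
    "GitHub Enterprise": ["_gh_sess", "github"],
}


def score_response(text: str, signatures: Dict[str, List[str]] = SIGNATURES) -> Dict[str, int]:
    """Count how many markers of each app occur in the response text."""
    haystack = text.lower()
    return {
        app: sum(1 for marker in markers if marker.lower() in haystack)
        for app, markers in signatures.items()
    }


def best_guess(scores: Dict[str, int]) -> str:
    """Return the top-scoring app, or 'unknown' if nothing matched or the top is tied."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return "unknown"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "unknown"
    return ranked[0][0]
```

Returning "unknown" on a tie is a deliberate choice here: for a helper script, a wrong answer is usually worse than asking the user.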