How GitHub became the nexus of software automation

One day, Linux's creator made a utility called Git to keep track of all the contributions to the Linux kernel. That triggered a string of events leading to the establishment of GitHub as the de facto automated supply chain for software -- not just open source.

GitHub is an example of a web service that absorbs the function of an entire industry's supply chain, but it took a few versions for it to become the software we now know and use.

To call GitHub a website is to call Italy a place to eat. GitHub is the leading practitioner of an emerging marketplace -- and yes, it may legitimately be called a "market" because it does generate revenue. It earned, by several estimates, over $200 million in revenue in 2017, and was evidently valuable enough to Microsoft to prompt it to purchase GitHub outright, in a $7.5 billion all-stock deal last June.

It is accurate and fair to say that GitHub created a market in the supply of open-source software, and the automation of its deployment. There are other competitors in this market, most notably GitLab and Atlassian's Bitbucket. It's the presence of those players that legitimizes this market.

What GitHub has become is the most effective example to date of a web service that absorbs the function of an entire industry's supply chain. Open-source software has been shared online in the past, with SourceForge being one of the most effective practitioners. But the distribution of software through SourceForge, and sites like it, takes place using a content management system -- a platform best suited for folks using web browsers.

Git going

GitHub is engineered to use a tool called git, created for Linux by the man who created Linux itself, Linus Torvalds. GitHub is an automated supply chain for the distribution of open-source software, both among the people who develop it and to the folks who use it. Its automation ensures that the distribution channel provides the newest, safest version available to users, while at the same time distributing less stable works in progress to developers. It also provides the most stable versions of any other components upon which a shared code element may depend.

"The cloud is increasingly a core priority for developers. And at GitHub, everything we do should be about making a developer's life better at every stage of their work and lifecycle," said Nat Friedman, incoming GitHub CEO, during a June joint press conference with Microsoft. "That includes helping them make it easier to build in the cloud.

"GitHub is an open platform," Friedman continued. "So we have the ability for anyone to plug their cloud services into GitHub, and make it easier for you to go from code to cloud. And it extends beyond the cloud as well: Code to mobile, code to edge device, code to IoT. Every workflow that a developer wants to pursue, we will support."

Linus Torvalds' intentions

The git tool is not, contrary to what you may read elsewhere, an exclusive part of Linux. Indeed, versions for Windows and Mac OS X are distributed freely. Technically, git (which also is not an acronym that stands for anything in particular) is introduced as a distributed version control system. Made originally for the Linux command line, it establishes a filing system aside from the computer's own file system where multiple versions of an evolving file may be stored and retrieved. The database where these versions are exchanged is called a repository (or "repo" for short). Any version of a file that is put into a repository may be extracted from it, in the same condition. From git's perspective, this does not have to be software. It can be the manuscript of a book, the instruction manuals for separate versions of an evolving machine, or a person's diary.
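
The idea of a repository that hands back any stored version "in the same condition" can be sketched in a few lines of Python. This is a toy model, not git's implementation: the class name ToyRepository and its put and get methods are invented here for illustration. The one detail borrowed from git is how it names a file's contents, by hashing a short "blob" header plus the bytes themselves with SHA-1.

```python
import hashlib


class ToyRepository:
    """Hypothetical in-memory store; illustrates content addressing only."""

    def __init__(self):
        self._objects = {}  # maps object ID -> the exact bytes stored

    def put(self, content: bytes) -> str:
        # git names a file's contents by hashing "blob <size>\0" + the bytes.
        header = b"blob " + str(len(content)).encode() + b"\0"
        object_id = hashlib.sha1(header + content).hexdigest()
        self._objects[object_id] = content
        return object_id

    def get(self, object_id: str) -> bytes:
        # Retrieval returns the bytes exactly as they were stored.
        return self._objects[object_id]


repo = ToyRepository()
draft_1 = repo.put(b"Dear diary, today I wrote a kernel.\n")
draft_2 = repo.put(b"Dear diary, today I rewrote the kernel.\n")

# Both versions remain retrievable, each under its own ID.
assert repo.get(draft_1) == b"Dear diary, today I wrote a kernel.\n"
assert draft_1 != draft_2
```

Because the ID is derived from the content, two different drafts can never overwrite one another, which is why the example works just as well for a diary as for source code.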

Torvalds wrote the original git as a source code management system for his own personal contributions to the Linux kernel. Part of his inspiration was an existing repository-based version control management system called Concurrent Versions System (CVS). If you'll recall the ancient world of disk-based databases, locks were imposed on records that were being retrieved and potentially updated, to ensure that no two clients have different views of the same record. CVS had a similar concept using a repository of content versions that forked from each other like branches of a tree. A branch could be "checked out" by someone looking to make updates or alterations. Once those changes were made, a revised version of the content would be merged back into the trunk, but instead of pruning the old branch, a tag would mark its original location, enabling it to be restored if necessary.

More to the point, Torvalds hated CVS. But the alternative he had chosen for maintaining Linux was a piece of proprietary software called BitKeeper. It may have been the first truly scalable content repository system. Yet it owned the keys to its own kingdom -- specifically, the metadata that describes the history of contributions to a repository. This metadata is absolutely critical to the construction of an operating system kernel, but was only available to BitKeeper licensed users. BitKeeper extended such a license to Linux contributors on a read-only basis.

When another Linux contributor apparently tried to reverse-engineer the metadata for himself -- an act which Torvalds would publicly condemn -- BitKeeper's publishers yanked the license, leaving Torvalds to either wrestle with CVS instead or devise an alternative.

From a centralized to a distributed repository

That alternative, git, would steer clear of the centralized repository model that CVS embodied, keeping instead the distributed model that BitKeeper had demonstrated. One upshot of this model is that a multitude of contributors could offer their own updates to a branch without some arbitrarily designated upper-class person being given "commit access" -- the right to declare one such contribution the "official" one.

As Torvalds told a Google-sponsored conference in May 2007, the distributed model purposefully avoids the kind of politics that ended up destroying his efforts with BitKeeper.

"Since you do not want everybody to write to the central repository because most people are morons," he told his audience, "you create this class of people who are ostensibly not morons. And most of the time what happens is that you make that class too small, because it is really hard to know if a person is smart or not -- and even if you make it too small, you will have problems. So this whole commit access issue -- which some companies are able to ignore by just giving everybody commit access -- is a huge psychological barrier, and causes endless hours of politics in most open-source projects."

Intentionally or not, Torvalds' architectural decision was the catalyst for the movement that transformed open source from a revolution into an establishment. "Openness" never made sense in a community with a more arcane organizational hierarchy than the average corporation. With the aid of a distributed repository model, any individual -- even an anonymous one -- could claim a fork of an existing project, and contribute the changes necessary to make that fork his own.

Signing in

What GitHub contributes to this picture are these important components:

  • The social framework for coordinating git among multiple users;
  • A basic system of identity for individual contributors (as opposed to their employers or their projects);
  • A basic website with which to present and explain the software (or other content) to the outside world;
  • The context for projects to be integrated with continuous integration (CI) pipeline platforms such as Travis CI, CircleCI, and Jenkins.

One of the principal challenges the open-source community faced throughout its formative years was, quite ironically, the lack of a common, programmable infrastructure spanning all of its contributors. Yes, they had content management systems, but they were not truly clouds, nor were they systems engineered for the distribution and deployment of software.

Though this may not have been Torvalds' original intention, the pairing of git on developers' PCs with GitHub on the web resulted in an easily automated system whereby any individual may participate in a massive collaborative project, even without an invitation. Any GitHub member may fork an open-source repository. She may then opt to clone it to her PC locally (forking is not the same as cloning, despite how some tutorials phrase this).

Next, the GitHub user configures git to point to that repository. Any changes she commits will create new versions within the repository. Such a change will not physically copy the entire repository; instead, it records a new snapshot that builds on the original. She can experiment by making branches -- evolutionary pathways that diverge from one another, especially to test many methods for achieving a result.
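
One way to picture why a change does not copy the whole repository: each saved change (a commit) keeps a pointer to the change that came before it, and a branch is just a movable name for the latest commit on one pathway. The Python below is a simplified sketch under that assumption. Commit, ToyHistory, and the branch names are hypothetical, and real git records considerably more.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"]  # None for the very first commit


class ToyHistory:
    """Hypothetical model: branches are names pointing at commits."""

    def __init__(self) -> None:
        root = Commit("initial import", None)
        self.branches: Dict[str, Commit] = {"main": root}

    def branch(self, new_name: str, start: str) -> None:
        # Creating a branch copies a pointer, not the history behind it.
        self.branches[new_name] = self.branches[start]

    def commit(self, branch_name: str, message: str) -> None:
        # A new commit links back to the previous tip of the branch.
        self.branches[branch_name] = Commit(message, self.branches[branch_name])

    def log(self, branch_name: str) -> List[str]:
        # Walk parent pointers from the branch tip back to the root.
        messages = []
        node = self.branches[branch_name]
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


history = ToyHistory()
history.branch("try-faster-sort", start="main")
history.commit("try-faster-sort", "swap in merge sort")
history.commit("main", "fix typo in README")
```

After these calls the two branches share the "initial import" commit but diverge from there, so an experiment on one pathway leaves the other untouched.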

The pull request

The act of contribution -- of requesting that changes or (presumably) improvements be committed or merged, either upstream or to someone else's repository -- is the process you may have heard about called a pull request. This is the most important social process in the entire system. It is a means for a contributor to ask the owner of another repository -- usually the maintainer of the project -- to evaluate the changes she has made, and to either accept them and merge them into his own repository, or reject them as he sees fit.
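
Stripped to its essentials, a pull request is a proposal that stays open until the maintainer decides what to do with it. The Python sketch below models only that decision and the conversation around it. PullRequest, Status, and the example names are hypothetical and do not reflect GitHub's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    OPEN = "open"
    MERGED = "merged"
    REJECTED = "rejected"


@dataclass
class PullRequest:
    author: str
    title: str
    changes: List[str]
    status: Status = Status.OPEN
    comments: List[str] = field(default_factory=list)

    def discuss(self, who: str, remark: str) -> None:
        # The conversation is part of the request itself.
        self.comments.append(f"{who}: {remark}")

    def review(self, accept: bool, target: List[str]) -> None:
        # Only the maintainer's decision moves changes into the target.
        if self.status is not Status.OPEN:
            raise ValueError("pull request already closed")
        if accept:
            target.extend(self.changes)
            self.status = Status.MERGED
        else:
            self.status = Status.REJECTED


maintainer_repo = ["initial import"]
pr = PullRequest("contributor", "Speed up sorting", ["swap in merge sort"])
pr.discuss("maintainer", "Looks good, thanks for the benchmarks.")
pr.review(accept=True, target=maintainer_repo)
```

Nothing reaches the maintainer's repository until review is called with accept=True, which is the point of the exercise: the contributor proposes, and the owner disposes.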

It may not be a coincidence that GitHub uses the abbreviation "PR" to refer to a pull request (we journalists immediately think it means "press release"). A social etiquette has formed itself around the proper method for introducing a pull request to the community, particularly to make it more appealing than a contact request on LinkedIn. Users are being advised to be more friendly, more persuasive, more -- to coin a phrase -- "open" about the intent of the changes they're trying to make. It's an effort to maintain the human element of the open-source process, rather than assume that people are equally fine with plain-vanilla automation.

(Image: GitHub Inc.)

"Pull requests can happen in any workflow that you use whenever you want to incorporate changes into your code base," explained GitHub trainer Eric Hollenberry (pictured right, above) during a 2016 company conference, to an attendee and GitHub member who didn't know pull requests existed. "A pull request, by definition, is a conversation around a change. You create that conversation at whatever point you would use in your workflow, and then it would result in a change that you merge in, at the end."

The Docker factor

The power of GitHub to galvanize a widely diverse gathering of contributors reached critical mass with the introduction of containerization. Prior to the standardization of containers by Docker, the principal means of sharing code online was through a common compression format such as TAR or ZIP. There were automated scripts everywhere for getting and putting contributions from and to staging. But "everywhere" is a difficult place to patrol and manage. Some SourceForge users adopted plug-ins for their respective development environments (their IDEs), such as Eclipse. And those plug-ins would be standardized after a fashion, but only with respect to those IDEs.

As a file format, Docker's container image is nothing particularly novel. It is packaged as a stack of layers, each an ordinary TAR archive that is typically compressed with gzip (itself based on Lempel-Ziv), so standard archive utilities can make sense of it. But it's the way that image is produced that was so revolutionary, particularly the use of something called a Dockerfile. It's a kind of recipe with instructions for how to assemble the software and everything it needs to run. Those instructions are executed automatically by Docker's own build command, and they include directions on how to acquire, unpack, and install all the other components on which the software relies, so that a project's repository need not carry those components itself. Those instructions do involve repositories, and Docker Inc.'s own is called Docker Hub.

The Dockerfile filled in a major gap in the automation process for GitHub and other online repository systems based on git. Instead of relying on scripts of their own devising (or, ironically again, that they'd shared with one another) to stage and deploy their software, as SourceForge users had to do, contributors now had a means that everyone could freely adopt. This made software obtained through pull requests both usable and testable, in an isolated context that would not disrupt other software, including other pull requests.

What the stewards of open-source projects finally had available to them was a means with which they could evaluate and approve the work of others, and marshal the process of committing their work upstream, without really having to think about the process much.

Business case

So where, in the act of automating all this free sharing of software between free individuals freely, does that $200 million of annual revenue enter the picture? What GitHub realized -- in this case, quite intentionally -- was that the exact same mechanism used to unite the broader open-source community for common projects could automate private enterprises' own internal efforts to produce their own software, open source or not. In effect, GitHub Enterprise could leverage the open-source community's infrastructure as a platform for collaborating in the open-source style, but not necessarily with the aim of attaining an open-source license.

This is where the deployment platform became an industry. The private repository is becoming the cloud-based distribution center for enterprise developers, and enterprises are willing to pay subscription fees to have it.

And yes, this is where an open-source project cut directly into one of Microsoft's key profit centers. For years, collaboration and version control have been the ingredients justifying Microsoft charging premiums for Visual Studio Team Foundation Server, and later for Visual Studio Team Services. Like CVS long, long ago, Visual Studio's native version control system is centralized, although in 2013 it adopted git as a distributed alternative. Still, VS has offered a sophisticated, very graphical management system for code evaluations and commitments.

Embrace and extend redux?

Microsoft's acquisition of GitHub certainly bears at least some resemblance to the "embrace and extend" business policies of its not-all-that-distant past. A decade-and-a-half ago, a headline boasting such an acquisition on ZDNet or Betanews (my old haunt) would have dozens of readers crying, "Conspiracy!"

Today, you still hear some cries, but you have to hunt them down. "When I read about this I thought, great, now they will be tracking everyone's downloads," wrote one Reddit user. "Now I think they will be tracking, advertising, and selling everyone's download information."

But that comment was buried beneath several other comments, many of which admitted indifference, and a few of which supported Microsoft's move outright. The company has already become a principal contributor to both Linux and Kubernetes, and for those developers whose relationships with open-source contributors have been made more intimate by GitHub, Microsoft is just as much a part of their lives as Red Hat.

"Buying GitHub does not mean Microsoft has engaged in some sinister plot to own the more than 70 million open-source projects on GitHub," wrote Linux Foundation executive director Jim Zemlin last June. "Most of the important projects on GitHub are licensed under an open-source license, which addresses intellectual property ownership. The trademark and other IP assets are often owned by a non-profit like the Linux Foundation. And let's be quite clear: The hearts and minds of developers are not something one buys -- they are something one earns."

Protecting the idea

The open-source licenses around projects trusted to GitHub repositories already protect them from any organization or other individual entity claiming ownership over them. Just as Google's 2006 acquisition of YouTube didn't transfer ownership of users' possums-chasing-squirrels videos, Microsoft's oversight of GitHub won't change the legal status of the projects it hosts.

What does deserve further scrutiny as time goes on, however, is how this deal, by legitimizing the open-source delivery pipeline as a top-tier industry, will alter the character of the open-source movement. If it was ever truly a counter-culture, it certainly isn't one now. Although GitHub's profitability may not be directly due to the popularity of sharing code, it is indeed tied to the automated, pipelined supply chain to which open source gave rise. If that model is truly as influential as open-source proponents assert it to be, then nothing Microsoft might do to change it one way or the other should, in the long run, have any noticeable effect.
