Skip to content

Git Guide

Barry Pollard edited this page Oct 10, 2021 · 73 revisions

Many of the contributors may not be familiar with Git and it can be a confusing world for those new to it with perplexing terms like clone, fork, branch, merge conflicts and rebase. This guide aims to provide some information to those of you new to Git about the best way we think of working with it it is and also serve as a quick reference to some of the Git terms, or commands to use.

Table Of Contents

Table of Contents

Introduction to Git

This section will give some basic background to complete newbies to Git. Feel free to skip to the next section where we talk about how we use it on the Web Almanac if you understand the basics already.

What is Git?

Git is a distributed version control system. That mouthful basically means it's a way of keeping track of changes to our source code and other content - especially when many, many people are changing various parts of it. The distributed part of it is what makes Git so interesting (and so complicated!) - there can be many copies of our code, and that can cause fun and games when trying to keep it in sync!

The original and primary copy of a code base (called a repository or repo) is hosted on a server (e.g. GitHub), people will be working on copies in their local machine, and people may have forked a copy of the repo to another one also hosted on the server - and then that forked copy may also be copied locally to your machine. Add in different branches in any of those copies and it can quickly become quite confusing.

Git often involves working with the command line, which might be less familiar and a bit intimidating for those of you less technically minded. Graphical front end tools exist to try to replicate this command line functionality but it's helpful to have some familiarity with using Git on the command line and with a guide like this, hopefully that becomes less daunting a prospect!

What is GitHub and how is it different than Git?

GitHub is not Git, but it is one of the most commonly used instances of Git and adds various features on top of the core versioning of code that Git handles. The main thing GitHub gives you is a Git server to store your code, and a nice web front end to manage it all through. Using the web front end you can view (and even change!) code, raise issues, open and review pull requests, use GitHub Actions to automate things (e.g. test code) and even host wiki pages like this one.

In this Wiki I've tried to differentiate between Git concepts and commands and those specific to GitHub. Other instances of Git that you might be familiar with, or use in work or other projects, include GitLab and BitBucket. They have many of the same features as GitHub.

GitHub also have a graphical front end tool called GitHub Desktop for working on on Git locally and syncing it back to GitHub. Check out the GitHub Desktop section for tips on how to use it.

The Web Almanac makes extensive use of GitHub to help us manage the project and allow all the many disparate contributors to collaborate easily.

Installing Git

While it is possible to work just using GitHub's website - especially if just comment on issues and adding your advice - managing the code really is best done locally on your own computer and then pushing changes back up to GitHub. Git is very popular and widely available and can be installed on Windows, Mac and Linux. You may already have it installed - to check open a command line and type:

git --version

If you see a version number returned then you've passed the first step!

If not, then for Windows I recommend installing and using Git Bash which is a Linux-like command line. For MacOS the built in Terminal available under Launchpad is fine and running the above version check will prompt you to install XCode and Git. For Linux I presume you'll be familiar with how to install this.

Git Repos

A Git Repository or Repo is a collection of all the code that makes up a project. Well that's not strictly true as a project may also depend on other programs and libraries, but typically they are not stored in the project repo - only the code specific to this project is stored in the repo along with config files that are used to install any necessary libraries to run the code and instead installed (e.g. using a command like npm install for node modules).

The Web Almanac repo is available on GitHub at: https://github.com/HTTPArchive/almanac.httparchive.org However, some of the analyst may also be working on one of the other HTTP Archive repos for example to implement custom metrics to gather extra information during the crawl that we will use for analysis.

Git Branches

A repo will usually contain a number of branches. These are copies of the code where you can work independently on a particular item. The name branch is used because, like a tree, these can diverge from each other - though, unlike a tree, they are usually merged back when the work is complete.

There will be one main (or master) branch which everything should be merged back into when ready. Traditionally these have been called the master branch, but many projects are trying to use more inclusive language and have switched to using the name main or similar instead. The Web Almanac moved to using main in June 2020 though at present the other HTTP Archive repos still use master.

Creating a branch is very quick and is integral to how Git works. Git stores branches in an incredibly efficient way and doesn’t literally have a copy of the same code, but only differences basically. So do not feel like it's a big deal to create a branch (it's not!) and frequently creating small branches, and merging them back in to the main branch when ready is the best way to use Git. Creating large branches or reusing branches for lots of different changes is not the best way of using Git and will lead to issues.

GitHub Pull Requests

Once your changes are ready to merge back to main you open a pull request (often shortened to PR), which creates a special type of GitHub issue which can be used to merge your changes into the main branch.

A pull request is really a GitHub concept and at the end of the day is basically a fancy way of actioning a merge in Git. Bitbucket also use the term Pull Request, while GitLab uses Merge Request. It should also not be confused with git pull, which is a Git command to pull down changes from the server (e.g. GitHub) into your local copy.

An example pull request on GitHub is shown below: Example Pull Request

In this pull request there are the following tabs:

  • Conversation - this allows you to give some info using GitHub markdown (including screenshots if you want). Reviewers can comment, and ask questions for you to answer, before merging the pull request into the main code.
  • Commits - this shows a list of links to all the individual changes you made to the code. It's not that useful a tab to be honest!
  • Checks - this shows all the automated checks run on your code so we know it's good! These are setup in the main repo using GitHub Actions (or similar) and the results are also shown at the bottom of the Conversation tab for open pull requests.
  • Files Changed - this is one of the most useful tabs and shows each line of code changed. Reviewers should look those this tab, and can click on individual lines to make comments or code improvement suggestions which are added to the Conversation tab and the person who opened the pull request (called the pull request author) can then answer or address the concern (including accepting any suggested code changes directly into this change with a click).

You can tag people to review your pull request, assign it to someone to deal with (not used much as kind of repeat of the author and reviewers), add labels...etc.

At the bottom of the Conversation tab you will see the following:

Bottom of a pull request with "Squash and Merge" and "Close" buttons

This shows on this PR that all checks have passed and this is ready to merge. Clicking the big green "Squash and Merge" button will copy (the "Merge" part) all this code into main branch with one single commit (the "Squash" part). Usually you don't need to have all the 100s of commits you have have done while developing this code change so "Squash" is what you want but you can change it if you want.

You can also Close this pull request if you change your mind with the Close button at the bottom, or add a comment with the Comment button if you make a big change to it since opening that you want people following the pull request to be aware of.

Please note you do NOT need to Close and Reopen the pull request (or even open a new pull request) when you need to make changes based on review feedback - simply pushing changes to the branch will cause any open pull request from that branch to automatically be updated and checks to automatically be rerun. It is expected (and a good thing!) to change your code based on feedback and this is very much part of the workflow of pull requests.

GitHub Forks

As well as branches, GitHub has the concept of forks, which basically means taking a complete copy of the repo (and all its branches at that time) into your own GitHub account. You can then create a branch in that fork, and then open a pull request to to merge code from your branch on your fork, all the way back to the the original repo (called the upstream repo). It may sound like an Inception level of abstraction and confusion but it actually works quite well once you get your head around it.

Note there is some confusion as to the name fork as traditionally that term was used when you wanted to take a project in a different direction than the original developers - so you forked the code and never merged back again. In GitHub a fork is used to make changes outside of the original repo but usually with the intention of merging them back into the original repo once complete.

Why would you fork when you can just work in the original repo? Well many projects don't want people messing with the original repo so restrict permissions to only allow core contributors to create branches in the original repo. Others must therefore fork to make changes and then open pull requests to the original repo for review before they are committed.

And it's important to use the correct terminology when working with forks. Tempting as it is, the original repo should always be referred to as "original" or "upstream", and never "main" or "master" - which refer to branches within a repo. Similarly a "local" copy, or "clone" refers to the copy on your PC as we shall see and that can be of the original repo or a fork.

Another extra bit of hassle with a fork, is that you must keep it reasonably up to date with the original, upstream repo. To do that you periodically merge or rebase the fork back to the original repo which pulls down changes into your fork. We'll explain how to do that later.

Cloning a Git Repo

To work on a project in GitHub you would normally clone a repo, which simply means taking a copy of it on your local PC. It is possible to make small edits on the GitHub.com website but it’s quite limited and often doesn’t allow you to run code locally to test it for example. You can clone a repo by clicking on green Code button on the repo's home page:

Code button in GitHub

This offers a number of options:

  • "Clone with SSH" is the recommended way but is a little more complicated to set up, but allows you to interact with GitHub without entering your GitHub password each time, and is basically mandatory if using 2FA for your GitHub account.
  • "Clone with HTTPS" works but requires you to enter your password each time so gets a little tiresome.

Once you copy the SSH or HTTPS URL on the command line simply go to the command line on your PC, into a directory you want to create the copy in and type the following (assuming SSH):

git clone git@github.com:HTTPArchive/almanac.httparchive.org.git

You can clone a local copy of the original repo, if you plan to (and have access to work on branches of that, or you can clone a fork of the original repo. The above example command clones the original repo location, and not the fork location - you should change the git address to the forked version when working from a fork.

After running this command you'll see the repo being downloaded locally. You can then branch, edit any of the files, or add new files, or even delete files to your hearts content. Any changes you make will only be on your PC and then you push changes back up to GitHub. We'll cover that later.

Just like with a fork, you need to keep any local up to date with both the original, upstream repo, and the GitHub version. This is done by using the git pull, git merge and git rebase commands. We'll explain how to do all that below.

Git Merge Conflicts

When keeping all the different copies in sync you will inevitably run into the dreaded "merge conflict" - a rite of passage every developer must go through. This happens were you've changed some code, but so has someone else, and their changes has been merged into main, so when you attempt to merge (either by syncing main back to your branch to update your branch with any new changes since branching, or by attempting to open a pull request from your branch) Git will give up and say "I dunno what to do here - you deal with it!".

In actually fact, dealing with merge conflicts is actually very simple. When you open the conflicted file you'll see something like this:

If you have questions, please
<<<<<<< HEAD
open an issue
=======
ask your question in Slack
>>>>>>> branch-a

In this case someone changed the line to "open an issue" and merged that to main (aka HEAD) and you've also changed it to "ask your question in Slack". Git is warning you that it has been changed since branching but you also changed it. You simply need to decide what line you want and then delete all the other lines (including the ones starting <<<<, ==== and >>>>. Then git add the "resolved" file to your branch.

You can even do it directly on GitHub.

Merge conflicts get a bad name and people think they are scary to deal with but Git actually makes it fairly easy. It will also usually only complain if the exact same line has changed — two people working on different parts of the same file usually won't see any merge conflicts.

Of course if you're both working on lots of the same code, across lots of files they can be a real pain to deal with - this is one of the main reasons to resync your branch back to the original main branch frequently, and also to work on small PRs rather than big unwieldy ones!

Recommended way to use Git for the Web Almanac

When working on the Web Almanac we recommend forking the Web Almanac repo to your copy and working on that. Web Almanac contributors do have access to create branches in the original repo, and that can be useful if you want to work on a branch with another contributor, or if you become one of the core contributors that works on the code a lot, but in general working on a fork is the way you contribute to most GitHub projects so it's good practice to use this even for the Web Almanac when you don’t need to.

To create your own Fork do the following:

  1. Click the Fork button in GitHub and wait for it to complete.
  2. Clone your fork locally - for me (tunetheweb) that means running this command on your PC: git clone git@github.tunetheweb/almanac.httparchive.org.git
  3. Add a link to upstream - git remote add upstream git@github.com:HTTPArchive/almanac.httparchive.org.git (change this to the HTTPS URL if not using SSH)

It is also strongly recommended not to work on the main branch of your forked repo. When creating a new branch you will usually branch from main, so once your main has extra changes in it, it is no longer possible (or at least easy!) to create a clean branch for other work.

If you are only working on one thing at once, then using main in your fork may seem the quickest and easiest thing to do, but you'd be surprised how often you may want to pause that work for a bit and work on something else instead - and that's one of the advantages of using Git. For example if you are waiting on a pull request to be reviewed, you may want to work on another feature in the meantime on a fresh branch, which is completely independent of your other work. Or perhaps someone discovers an urgent, and easily fixed, bug in the code that you can quickly fix, before coming back to the current work. See the Switching between branches section below for more info on how to switch branches.

Working on a separate branch to main allows main to be kept in sync with upstream main, which allows new branches to be created more easily. It also allows you to merge upstream main into your branch periodically more easily. It also keeps your history of changes without a long history on each future pull request. Finally, it also also you to completely reset your main back to the same as upstream if you get it completely messed up, without losing any history in other branches.

Resyncing your main branch to upstream

As discussed above it is recommended to use the main branch in your fork only to sync up with the main branch in the original Web Almanac repo. To do that you can do one of several things:

  1. You can merge upstream main into your main branch - however if you have any changes, the this will add a commit message for each upstream change into your main. This means the next time you branch and open a pull request you'll see a long history of old commits, which is a bit noisy and annoying when they were nothing to do with your old pull request. Of course you shouldn't have any changes on main (because you're not using this as advised above aren't you?) so this shouldn't be an issue, but you never know...
  2. You can rebase your main branch to be based on upstream main. This is a special Git command that basically makes it look like you just forked your main off of upstream main and then applies any changes you have made in your repo on top of that, so should lead to less risk of old commit messages, but it will still add commits for any changes you added. And again, you shouldn't have any changes on main so this shouldn't be an issue, but you never know...
  3. You can blast away your main and basically reset it back to upstream main. This sounds extreme but is actually my preferred option to ensure main is super clean and reset back to the same as upstream main. Note this will (intentionally!) blast away any changes from your main and you will lose them completely but, again, there shouldn't be any changes here if following above advice.

So I prefer option 3 - it's the cleanest and ensures main is exactly the same as upstream, and that there will be no long history of commit messages in your next branch and pull request.

Commands to force reset your main branch to upstream

You can use the below commands to reset your local fork to upsteam and then push those changes to GitHub.

Note you should ONLY do this if you are happy to lose all changes to your main branch and completely reset it as if you'd just forked it. If you have changes in flight on the main branch then you will need to either wait until they have been merged, or move them to a new branch before you can do a clean reset of main.

Check if upstream already exists:

git remote -v

Add an upstream remote, if not already added previously:

git remote add upstream git@github.com:HTTPArchive/almanac.httparchive.org.git

Then force reset your main branch:

git fetch upstream
git checkout main
git reset --hard upstream/main  
git push origin main --force 

After this your should visit your forked repo on GitHub and check you get a message that "This branch is even with HTTPArchive:main.":

A forked repo which is even with upstream

Creating and working on a branch

So, when you're ready to make your first changes, do the following:

  1. Resync your local copy to upstream as per above (unless you've just forked it, in which case you can skip this step)
  2. Make sure you are in the main branch - git checkout main
  3. Branch main to a new branch with a meaningful branch name. For example to work on an issue to add new social media icons you'd run this command - git checkout -b new-social-media-icons (note the -b which means "create a new branch please").
  4. Make your changes using your favourite code editor (I recommend Visual Studio Code myself).
  5. If making code changes to the website then test them - follow instructions in src/README.md to run the environment and then use npm run generate to build the site. Then you can use pytest and npm run test to run the automated tests to check its working (and again after making any changes!). For changes to SQL or chapters this should not be necessary as they will be run by GitHub anyway when you open the Pull Request.
  6. Add any new files you added in this change that you want tracked in git - e.g. git add static/images/*.png. You also need to do this to re-add merge conflict files that you have resolved. You typically do not need to add other files that are already in the repo that you have changed, as Git will track those already.
  7. Commit your changes - git commit -a. This means “save all changes to Git (for tracked files)”. If you only want to commit some files, then git add just the files you want and then use git commit without the -a option. When you commit you need to add a message. Git will automatically open your default Git editor - usually vi which can take some getting used to but basically type your message and then type Esc : w q ! to enter command mode (Esc), start a command (:) save your file (w for write) and quit (q) and force override any errors on save (!). See the next section for an alternative method if you're not comfortable with vi. And btw if you want to cancel the commit at this stage, remove any message you added and type Esc q ! to quit without saving any changes and Git will see you didn’t add a message and cancel the commit.
  8. Push your committed changes to GitHub - git push
  9. If there are any changes already in GitHub (e.g. of other people are also working on this branch), then you will get and error and need to do a git pull to pull them down locally before you are able to push any more changes back up to GitHub.
  10. Repeat steps 4-9 as often as you want until you are happy to open a pull request to merge this back into the Web Almanac code base.

Committing changes - to vi or not to vi

My preferred method of committing changes is to use this:

git commit -a

This takes all the changes for existing tracked files and adds them to the commit. New files still need to be added with git add but all files currently tracked by Git are automatically included without having to use git add. This then opens the default Git editor (usually vi) and lists the files for this commit, any files not included and allows you to add the message and complete the commit. If you close vi without adding a message then it cancels the commit.

However vi can be quite a barrier to people as it's quite a confusing editor with a "command" and an "edit" mode, and needing to remember command sequences (like ESC + w + q + !). For those not familiar with linux and vi this might be quite a stumbling block.

You can provide the message on the command line so you don't have to got near vi using a sequence of commands like this:

git status
git add file1 file2
git commit -m "Committing file1 and file2"

This does require you to add the files or folders first so a bit more painful than git commit -a. Alternatively you use the -a and -m switches together:

git status
git commit -a -m "Committing all open files"

The downside is that, unlike the vi method, it won't show you the list of files it's going to commit, so is a bit more dangerous, hence why I prefer the vi method instead. However, if you do a git status before you commit, you should see the files that will be committed. Plus you can always revert changes if you need to.

So, in summary vi method is preferred but can be more complicated to those not familiar with it so can give commit message on command line but take care with it.

Keeping your branch up to date

It is also recommended to merge any changes that have happened to the Web Almanac code (in its main branch) into your branch periodically in case it affects your code, and particularly important to do this just before opening a PR.

To merge changes into a forked repo from upstream main do the following:

  1. Commit any changes in the branch you are working on.
  2. Update you local copy of upstream - git fetch upstream
  3. Merge the changes from upstream main into your branch - git merge upstream/main
  4. Commit the merge - git commit -a
  5. Push all your changes up to GitHub - git push

Or to merge from a branch on the main repo do the following:

  1. Commit any changes in the branch you are working on.
  2. Update you local copy of main - git fetch origin main
  3. Merge the changes from upstream main into your branch - git merge main
  4. Commit the merge - git commit -a
  5. Push all your changes up to GitHub - git push

Switching between branches

Git allows working on several branches at once. This allows you to work on something else while you are stuck on one bit of work (waiting for answers, or pull request feedback, or you just fancy a break!). Use git checkout to switch between branches and use git status to ensure all your changes are committed when switching between branches.

For example, let's say you are working on feature1 branch:

git checkout main
git pull
git checkout -b feature1
# Make some changes
git commit -m "Commit my changes for feature1"
# Make some more changes
git commit -m "Commit some more changes for feature1"
# Push changes to Github.com if you want to (always good to do this in case your computer dies!)
git push
...etc.

And then you want to take a break from feature1, in which case you should open a new branch - in most cases you want to branch from main again, and not from feature1 branch so make sure you flip back to main again if this is the case. If working on a fork, you should also check your main is up to date first - see the Resynching your main branch to upstream section above.

# Check your branch is clean and everything has been committed
git status
# Create a new branch from main (note I've not included the resyncing of main to upstream here for forks)
git checkout main
git pull
git checkout -b feature2
# Make some changes and push
git commit -m "Commit my changes for feature2"
git push

You are now free to switch back to feature1 if you want using git checkout (note you don't need the -b flag as that's only needed to create a new branch that doesn't exist, whereas just switching between branches that already exist don't need it):

git checkout feature1

And then switch back to feature2 later:

git checkout feature2

The primary concern with multiple branches like this is getting yourself confused! Using git status here is your friend to double check you're on the correct branch and all outstanding changes have been committed.

As with lots of things in Git, there are many ways to handle branches (including the git branch command), but I'd encourage you to stick with few commands and use git checkout, git commit, git push, git pull and git status as much as possible as you can do most things with those and it's already getting confusing!

The other alternative is to just create a new folder and clone the whole repo again and manage it completely separately. This can be easier, safer and less confusing for those less familiar with Git if working on a limited number of branches. However it doesn't scale very well and is not the way you're gonna get the most out of Git so as soon as you go beyond a second branch I'd strongly encourage you get used to checking out between branches.

I do encourage separate folders however if you have different repos (e.g. your do some of the main work on the HTTPArchive repo, and some work on your own fork) as switching repo that a folder points to, while also possible, is adding yet more confusion to an already complex thing! 🙂

Opening a Pull Request

Once you are finished making changes, you should take the following steps to open a pull request back to the original repo to accept your code into the Web Almanac:

  1. Merge in any changes that happened to the Web Almanac code since you branches (see above).
  2. Run all the automated tests if making website code changes - pytest if changing Python code and npm run test to test every page on the website (note you don't need to do this to make SQL or chapter edits as GitHub will run these tests for you).
  3. Make sure all your changes are pushed to GitHub.
  4. Open a pull request in GitHub.
  5. If the pull request closes an issue then you can add "Close #123" or "Fixes #123" in the first comment and GitHub will automatically close issue #123 when the pull request is merged. If it doesn't fully close the issue, then you should instead say something like "Makes progress on #123" which will create a link on original issue but not close it.
  6. Assign pull request reviewers if there's anyone in particular you think would be good to review it. If not then someone will pick it up shortly.

As mentioned above, you can make more changes to your branch and push them up to GitHub and the Pull Request will automatically be updated. There is no need to close the PR and reopen a new one.

Actioning feedback from pull requests

When your pull request is reviewed, the first bit of feedback you're likely to see if from the automated tests that run every time a pull request is opened. They take a few minutes to run and then you will then give you a satisfying green tick, or a scary red cross.

Merge conflict checks

The first check that GitHub itself will do is check for any merge conflicts and these must be resolved before the pull request can be merged. If you merge main to your code before submitting a pull request, then it's unlikely you'll get any of these (unless someone's got in real quick while you were opening it!) which is why it's recommended to do that, but other pull requests can be action while yours is being reviewed so can still happen. Smaller pull request, touching few files reduces the chance of this so again, where at all possible, break up changes into smaller batches.

Linter checks

Another check that we perform is linting the code in the pull request. This runs automated checks for errors or code styling and formatting issues that don't match the code conventions we use.

Below is an example if what you will see if any of these Lint checks fail:

Failed GitHub Lint Checks

Click on Details will bring you into the test run:

Lint Action Run

And then opening the Lint Code Base line will show the full details.

Lint Action Results

In this case the README.md file must begin with a heading and not launch straight into the text. Fix the problem (e.g. by adding a header in this case), commit and push to rerun the tests and resolve the issue.

Some of the linting suggestions may come across as pedantic but they are important as they ensure a sense of consistency across our code base and reduce the chance of errors or misunderstandings.

If you have any problems understanding exactly what is wrong then reach out in the Converstion tab of the pull request and ask for help and someone will hopefully point out what's wrong. Or improve our linting config if you've genuinely discovered something which should not be flagged!

Code review feedback

After the automated tests have passed the code will be reviewed manually by another contributor. They can ask questions, or make suggestions to change the code.

Look upon a good review as an opportunity to learn and improve your code. Feedback can be tough to hear after you've worked hard on code, but stay aware that it is meant with the best possible intention to provide feedback to you for this and future commits, and to keep our codebase to a high standard. It is not a personal sleight upon you or your code, and if you are getting annoyed with feedback I suggest you take a break and step away and read it again later, in a fresh light. Of course if if you feel that a reviewer is acting inappropriately then please raise it - we have a code of conduct and want all contributors to feel welcome. Feel free to also reach out to the project leads if you would like to discuss something privately.

When a reviewer makes a code suggestion you can accept it right in GitHub and it will automatically update your branch. As this happens in GitHub directly, just make sure you do a git pull next time you are working locally on your code to pull down these changes. It's quite slow to accept a lot of changes this way, so if there are a lot of changes then sometimes better to change locally in your favourite editor, and then push one commit with the fixes, and then mark each of the suggestions as resolved. Any suggestions on lines which have changes since they were raised, will be marked as outdated which makes them easy to spot.

Reviewers may also make small, seemingly pedantic changes - usually they will include the word "nit", to indicate this is a "nitpick" in these. Like the linting, they can seem needless, but they help maintain our code quality.

It should also be noted that not all suggestions may be valid! Reviewers can make a mistake just as easily as the pull request author (more easily in some ways as they often will be reviewing based on reading the code rather than testing it). Feel free to push back on suggestions when you disagree with them. However, it is best to get consensus between reviewer and pull request author where at all possible so explain why you don’t think you should make the change being suggested rather than just ignoring the suggestion or resolving it without a comment.

After addressing feedback, please re-request a review by clicking the little two arrow icon next to the reviewer name, or make a comment (e.g. "All feedback addresses. I think this is good to merge now."). Sometimes it's difficult to know if someone is still working on feedback and a pull request may be left unintentionally, getting out of date, because reviewers are not aware that it's good for a re-review or merge.

Only the Web Almanac maintainers (currently @rviscomi and @tunetheweb) can merge pull requests, but every contributor can review pull requests, and merging becomes a lot easier (and so more likely!) if someone else has already had a look through the changes. Please, please, please help us review Pull Requests to help spread the load!

Draft pull requests

It is also possible to open a draft pull request, if you want early feedback on your code or approach. Please remember to convert to a full pull request when happy.

Additionally if making a large number of changes (for example if you need to update every chapter to reformat the way out figures work or something), then best to do this in only one or two files initially, open a pull request (whether draft or full) and get feedback on your approach before you spend time updating all the files! It's much easier to review code changes if they are not drowned out by lots of content changes that needs that code in the same pull request, and much less likely to result in merge conflicts. Then the rest of the files can be added to the pull request, or a separate one opened for those (maybe a new pull request per language if you want different translators to approve any changes).

Reviewing pull requests

As mentioned above we strongly encourage contributors to help review pull requests. This is a voluntary, collaborative effort for us all and depending on one or two people creates a bottleneck and a single point of failure for the project.

Even if a review pulls up nothing, it is important to approve the pull request - this indicates that it has been reviewed and is just as useful (if not more so) as commenting on code or making suggestions.

Do also be conscious of how feedback will be read. We have many first time contributors who may not be as familiar with language (either programming language or English language!) so do try to avoid technical terms, colloquialism...etc. Though we are aware of commonly used acronyms and terms in the Git world like LGTM (“Looks Good To Me”).

Do be mindful as well that pull request authors have voluntarily spent time on this and we wish to encourage that and foster an inclusive environment. Offensive language is forbidden by our code of conduct.

Do remember that a code review is also about reviewing the authors changes, and not about showing off your own knowledge! Try not to get side-tracked but instead raise a new issue if you want to consider something else that comes up during a code review but is not directly related.

On that note, do also remember that code can be improved incrementally. Small changes are much better in Git. So, while it’s OK to point out a fuller solution do remember that “Perfection is the enemy of good” and accepting a change that is an improvement and then improving further in future iterations can often be better than holding out for the perfect solution that may never come. Of course that doesn’t mean we should accept code that regresses the quality, or seems like the wrong way of doing it completely!

And finally, we strongly encourage positive GitHub reactions - particular for new contributors. They give quick, encouraging, feedback and add a playful, fun tone:

GitHub Heart and Rocket reactions

We discourage the negative ones though (thumbs down 👎, or confused 😕). Better to add a comment (politely!) explaining your concerns and then if others agree with you, they can "thumbs up" your comment. This keeps things on a positive tone and also means your cause for concerns are fully understood.

GitHub Desktop

GitHub Desktop is a Windows and MacOS app that provides a visual interface to GitHub. It reduces the need to use and understand Git via the command line.

This section will provide some tips on performing some common tasks via the GitHub Desktop

Installing GitHub Desktop

First make sure you have Git installed. See our section on Installing Git for more details.

You can then download the install file from GitHub Desktop, with further instructions from their Installing and configuring GitHub Desktop document. Your main tasks will be to Authenticate with GitHub and Configuring Git for GitHub Desktop so that the systems know who you are.

Cloning the HTTP Archive

If you have not done already, you will want to clone a copy of the almanac.httparchive.org into your computer. The simplest way is to follow Cloning a repository from GitHub to GitHub Desktop where you go to the repository on the website and select "Open with GitHub Desktop". This will open a window where you can click "Clone" and the job will be done.

clone web

Navigating GitHub Desktop

Once you have cloned repositories you will be able to select them via the "Current repository" toolbar button, just under the menu on the left. By default the sidebar will show you what edits have been made to the repository, and the main section shows actions you may want to perform.

current repository

Updating your repository (Pull origin)

Over time the original repository will get updated and your copy will become out of date. GitHub Desktop will highlight if your repository is out of date, with an option to pull any changes from the origin so that you have the latest versions.

pull origin

Making your own edits (creating a branch)

You want to create your own branch before you start to change the HTTP Archive repository as you do not have permission to edit the 'main' branch. A branch is a way for you to group your own edits so you can later submit (push) them for review. Then, when they are approved, they will get merged back into the main branch.

Before creating a branch, make sure you're currently on the 'main' branch and it is up to date (see above).

If you click on the "Current branch" tab in the toolbar you will see all the public branches in play. To create your own branch, enter a new name in the textbox at the top and click the "Create new branch" button. I'd recommend following the naming style used be the existing branches. In my case I used the chapter name, year, then reason. e.g. 'seo-2020-demo'

create branch

Publishing your branch

At the moment your branch is only known to you. If you want others to see it, then you need to publish it. GitHub Desktop will prompt you to do that.

publish

Once published you and others can select your branch on the GitHub website.

branch website

Editing your branch

You can edit the repository using your favourite editor. As you edit, GitHub Desktop will show you what changes you have made.

edits

Note that you can change branches at any time, but I suggest you commit and push any edits (see next) before you switch as things can get confusing. If you are working with multiple branches, always keep an eye out to make sure you're on the right one when working.

Committing and pushing your edits to the web

Every once in a while you want to store and document your changes. This can help you or others in the future. You also have to commit before you can share (push) your changes with anyone. You can quickly commit your current edits via the form to the bottom left.

commit

Once you have commits you will be prompted to push those commits to GitHub. I typically do this straight after committing.

push to web

Getting your changes accepted

At this point you have a branch with edits committed and everything pushed to GitHub. Once you are happy with your work, you want it to be reviewed, approved and merged back into the main repository.

For this I switch back to the website, as it is there you will be communicating with reviewers. To get this stage started you need to create a pull request. Go to the repository on the GitHub website, make sure your branch is selected, then click the Pull request link and follow the instructions. This will notify the reviewers who will help you to get your changes live.

pull web

Working with the Web Almanac

You will have to use the Web Almanac if you want to make changes to things like the custom metrics which are run during the monthly crawl. Using this repository is a bit more complex as you have to fork it to your own account, make a branch on your fork, make edits, then push your branch back to the original.

Forking the Web Almanac

You should only need to do this once. It creates a connected copy of the repository in your account. Go to the Web Almanac web page and click on the Fork button (top right) to start the process.

fork

Once completed you will see a copy of the repository in your own repository list.

Cloning the Web Almanac

Once you have the forked repository you can clone it to your machine. This is done in the same way as before, the difference is you clone from the repository in your account.

clone fork

Keeping the forked repository up to date

The main/master branch of your fork should be kept in sync with the original repository (rebased). Especially before you create any branches to make edits. Details on how to do this are in the Resyncing your main branch to upstream section. Note that this repository uses 'master' and not 'main'. It involves opening a cmd prompt and running commands. These are the commands I ended up using to rebase my repository:

cd C:\Users\tony\Documents\GitHub\legacy.httparchive.org

#git remote add upstream /url/to/original/repo
git fetch upstream
git checkout master
git reset --hard upstream/master  
git push origin master --force 

Making your own edits

This is done in the exact same way as before. Create a branch from your master (make sure master is up to date using the above process), publish the branch, edit the files in the branch, commit your edits, push back to GitHub.

With a forked repository the process to get your edits accepted is about the same as before. Go to the web page for your copy of the repository and create a pull request.

Glossary of terms

This is a list of terms to those less familiar with Git/GitHub:

  • branch - a copy of the code within a repo, where you may be working on code that is not ready to commit back to the main branch. Note Git actually only stores differences so it’s not really a copy, but that’s just an efficiency in Git and you can consider it a copy to all intents and purposes.
  • fetch - a git command which downloads all changes from a remote repo (i.e. a server) to a local one
  • fork - a complete copy of a repo and all it's branches.
  • LGTM - Short hand for Looks Good To Me - typically used approving a pull request
  • local - a complete copy of a repo on your PC
  • main - the primary branch of the Web Almanac repo. Other repos use master for this.
  • master - an alternative name for main branch
  • merge - to copy changes from one branch into another branch
  • merge request - what a pull request is also known by particularly in GitLab an alternative to GitHub
  • origin - the server version if the repo (the opposite of local)
  • pull - to fetch changes from a remote repo, and then merge them into this branch in one step
  • pull request - a way to merge changes back to the main branch. A pull request is a special issue type that allows the potential merge to be reviewed and commented on before it is merged.
  • rebase - to bring a branch up to date, as if it had been created from now, while maintaining the existing changes on top.
  • repo/repository - a git project which is basically a collection of files - which may exist in several branches.
  • upstream - the original repo that a fork was created from
Clone this wiki locally