Use GitHub to detect, track, & notify about website changes

Note: the example code for this post can be found here. See the repo’s README for instructions on how to install & run the script.

Are there pages on your websites where you need to monitor what has changed? Do you need an easy way to see the differences before and after changes? Or do you need to keep a history of website changes so that you can go back later and see when a specific change occurred?

Although there are many good website monitoring products available, you can often use GitHub’s actions and notifications to perform website change monitoring tasks and customize the results for your specific needs.

The problem: tracking website changes

Recently, I was working with the SEO people at my company. They said that they regularly download sitemap and robots.txt files and then use a text editor to “diff” compare the previous and current versions of the files. This helps them confirm expected changes and fix unexpected changes on the websites.

They said that they try to run these “diff” checks every week, but because the process is manual, repetitive, and tedious, the checks often only get run every few weeks or even months.

Brainstorming

This got me thinking about the heart of any version control system (VCS). Its main job is to save and track differences between file revisions. It also shows you what is new, what is deleted, and what has changed. The SEO people were retrieving and diff-ing files manually. Could git be used to simplify what they were doing, plus do a better job of tracking the changes?

So, what if the SEO sitemap and robots.txt files were saved as files in an “SEO” git repository? And this repository’s only purpose would be to track file differences and maybe contain scripts that update those files? Then, the scripts could run regularly to retrieve those files off the websites and save them to the repository (overwriting any current file versions, if necessary).
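
For example, such a repository might be laid out something like this (a hypothetical sketch; the example repo linked above shows a real structure):

detect-file-changes/        # the "SEO" repository
  .github/workflows/        # scheduled GitHub Actions workflow (covered below)
  files/                    # retrieved website files (robots.txt, sitemap.xml, ...)
  scripts/                  # data retrieval & commit scripts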

After files are retrieved, if there are any differences between the new file versions and the previous versions saved in git, a “git status” command would show that there are changes to be committed to the repository. If there are no differences between the new files and the saved files, then a “git status” command would show a “nothing to commit” message.
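
For example, after a retrieval script overwrites files/robots.txt with a changed version, running “git status” would report something like:

  $ git status
  On branch main
  Changes not staged for commit:
    modified:   files/robots.txt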

Okay – so far, so good. Then, the script would just need to commit any new changes back to the repository. Additionally, the repository could be configured to email recipients any time changes are pushed to the repository’s origin. This is nice because the scripts would run regularly, but people would only receive new emails when new changes are detected.

So this script would eliminate the repetition and tedium of downloading files and checking for differences. The SEO team would only need to take action when they received a new email message. Then they could go straight to the GitHub link, view the commit details, and easily see all file additions, deletions, and modifications thanks to GitHub’s UI pages showing changes (in red & green).

A GitHub commit details page showing additions & deletions to a file

The last necessary piece would be to find a way to regularly run the scripts at an interval that would be useful for the team. Perhaps a scheduler could run the scripts on some system, or maybe a continuous integration (CI) system (like Jenkins or CircleCI) could run them?

Implementation

Website data retrieval script

Data retrieval scripts can be customized to perform any task (using any language you prefer). Basically: what data do you want to retrieve from your websites, and what storage format would be useful to you? (Bonus: this is what can make your custom tool more powerful for your specific needs than any off-the-shelf tool you purchase!)

Examples:

  • For files like sitemap XMLs, robots.txt, or bot IP lists, you can have a script perform HTTP GET calls to retrieve the files and save them to a local directory (see the sketch after this list).
  • For static web page data, you can make HTTP GET calls to pages, retrieve the HTML data, and then use parsing libraries (like jQuery, Cheerio, Beautiful Soup, etc) to find page elements. You can then save parsed element data to a file in your preferred data structure format (e.g. JSON, XML, text, etc).
  • For dynamic web pages, you could use any web UI automation tool to open a browser, extract page element data, and save it to a file in any format.
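
Here’s a minimal sketch of the first case, assuming a Node.js script (Node 18+, where fetch is built in) and a placeholder URL:

// retrieve-robots.mjs - fetch a site's robots.txt file and save it locally
import fs from 'node:fs/promises';

const url = 'https://www.example.com/robots.txt'; // placeholder URL
const response = await fetch(url); // fetch is built into Node 18+
if (!response.ok) {
  throw new Error(`GET ${url} failed with status ${response.status}`);
}
await fs.mkdir('files', { recursive: true });
await fs.writeFile('files/robots.txt', await response.text());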

We use this script flexibility not only to capture files like sitemaps and robots.txt, but also to track page metadata, specific tags (H1, H2, div) & their content, etc., as sketched below.
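
And here’s how the parsing case might look with the cheerio package (the URL, selectors, and JSON output shape are illustrative assumptions, not the exact code we run):

// extract-tags.mjs - parse a page's title and H1/H2 tags, save them as JSON
import fs from 'node:fs/promises';
import * as cheerio from 'cheerio';

const html = await (await fetch('https://www.example.com/')).text();
const $ = cheerio.load(html);

const pageData = {
  title: $('title').text().trim(),
  h1: $('h1').toArray().map((el) => $(el).text().trim()),
  h2: $('h2').toArray().map((el) => $(el).text().trim()),
};
await fs.writeFile('files/page-data.json', JSON.stringify(pageData, null, 2));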

Committing to Git from a script

After the script runs and file updates have been made, how can a script add the files, commit the changes, and push them back to the Git repository’s origin? I thought about shelling out to the command line and making git calls directly, but this sounded like a hack, and capturing & validating the results would likely be pretty error-prone.

Since my script was a Node.js script, I searched npmjs.com to see if there was a good package that allowed me to interact with git directly. I found this package:

simple-git – “A lightweight interface for running git commands in any node.js application.” Voila!

Given GitHub’s security requirements, however, you do have to define a personal access token in the repository and use it when defining the remote origin URL. (You can no longer just use your GitHub username & password):

const remote = `https://${username}:${personalAccessToken}@${repoUrl}`;
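
The username, personalAccessToken, and repoUrl values shouldn’t be hardcoded; as covered below, the script reads them from environment variables. A sketch using the variable names from this repo’s workflow file:

// read credentials from the environment (never hardcode them)
const username = process.env.GH_USERNAME;
const useremail = process.env.GH_USEREMAIL;
const personalAccessToken = process.env.GH_PERSONAL_ACCESS_TOKEN;
const repoUrl = process.env.DETECT_FILE_CHANGES_REPO_URL; // e.g. github.com/owner/repo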

After that, pushing changes is as simple as defining your GitHub user information, configuring the remote origin, and then making calls to add, commit, and push your changes:

// git add, commit, and push changes
import simpleGit from 'simple-git';
const git = simpleGit();

const branch = 'main';
const commitMessage = 'Update retrieved website files.'; // example message
await git
  .addConfig('user.name', username)
  .addConfig('user.email', useremail)
  .removeRemote('origin')
  .addRemote('origin', remote)
  .add('./*')
  .commit(commitMessage)
  .push(['-u', 'origin', branch]);
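
One refinement: simple-git can also report whether anything actually changed, so the script can skip the commit & push (and therefore the notification email) when nothing differs. A sketch using simple-git’s status() call, with illustrative log messages:

// check the working tree first; only commit & push when files differ
const status = await git.status();
if (status.isClean()) {
  console.log('No changes detected; nothing to commit.');
} else {
  console.log(`${status.modified.length} modified file(s) found.`);
  // ...then add, commit, and push as shown above
}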

The interesting thing is that this repository now contains:

  • files that mirror website files (or your custom data files)
  • scripts that both capture file data and directly update their own repository.

So this isn’t a self-modifying script per se, but having code that updates its own repository is definitely a different use case from most code I’ve written.

Script run-time environment & Scheduling

As mentioned earlier, the scripts need to run somewhere on a regular interval. That run-time environment needs to have environment variables set up that the scripts can use (you don’t want to hardcode these values!). Because we need permission to push commits back to the repository, these environment variables include:

  • GitHub username
  • GitHub user email
  • GitHub personal access token (that you set up directly in the repository)
  • The repository’s URL

Here’s where GitHub helps us out on several different fronts:

  1. We could set up our own environment or use a continuous integration (CI) system to clone the repo and run these scripts. That works, but it would be another system to integrate. GitHub has CI functionality already built in with GitHub Actions. As stated on their home page:

    “GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub.”

    Using GitHub Actions, we can set up an Ubuntu environment, set up Node, clone the repository, install the repository’s packages, and run the scripts.
  2. We want to be able to run the scripts on a custom, regular schedule. GitHub Actions lets you run workflows triggered by different events. One trigger type is “schedule”, which uses “cron” time/date syntax to specify the exact days, hours, and minutes when a workflow will run.

Here is an example of a GitHub Actions workflow file that runs an “update-files” npm script every Monday at 14:00 UTC:

# This workflow runs the 'npm run update-files' script every Monday at 14:00 UTC.
# Be sure to define these environment variables in your repository's
# Settings / Secrets / Actions section:
# - GH_USEREMAIL - your email account used for GitHub
# - GH_PERSONAL_ACCESS_TOKEN - your GitHub personal access token (see repo README)

name: Node.js CI

on:
  schedule:
    # cron times are UTC times. For Pacific timezone:
    # Mar-Nov: UTC = PDT time + 7
    # Nov-Mar: UTC = PST time + 8
    - cron: '0 14 * * 1'

env:
  GH_USERNAME: ${{github.repository_owner}}
  GH_USEREMAIL: ${{secrets.GH_USEREMAIL}}
  GH_PERSONAL_ACCESS_TOKEN: ${{secrets.GH_PERSONAL_ACCESS_TOKEN}}
  DETECT_FILE_CHANGES_REPO_URL: github.com/${{github.repository}}
  
jobs:
  build:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        node-version: [18.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
    - run: npm ci
    - run: npm run update-files

  3. Notice also the environment variables defined near the top of the workflow file. Two variables are constructed from GitHub’s default environment variables, while the other two use encrypted secret values that you define in your repository. In this way, you can use these variables in your scripts while keeping the values secure.

Notification of changes

Lastly, you can configure the repository so that an email gets sent every time a code change is pushed to the repository. Email addresses can be added in the repository’s “Settings / Email notifications” section. The body of each email shows the script’s output and contains hyperlinked URLs that take you straight to the GitHub commit details page, which shows all file additions, deletions, and modifications:

  Branch: refs/heads/main
  Home:  https://github.com/jantonypdx/detect-file-changes
  Commit: 6f36d349aef07f1f045131776a5c1c51de29c609
      https://github.com/jantonypdx/detect-file-changes/commit/6f36d349aef07f1f045131776a5c1c51de29c609
  Author: jantonypdx <email-address>
  Date:  2022-07-04 (Mon, 04 Jul 2022)

  Changed paths:
    M files/robots.txt

  Log Message:
  -----------
  1 modified 'robots.txt' file found.

Summary

Using only a GitHub repository of code & scripts, you can:

  1. Write scripts to extract any data you can access on your websites.
  2. Regularly run scripts at intervals that you define.
  3. Capture data from your websites and save it in formats that are useful to you.
  4. Push files back to the repository only when the files change.
  5. Notify users when the files change (i.e. when there are changes on the website).
  6. Provide users with a clear & simple way of seeing exactly what changed (additions, deletions, and modifications).

For a detailed example of a repository & script that checks a website’s robots.txt file for changes, please see the example code in this GitHub repository.
