Nginx: Automate Whitelists

John H Patton
Published in Level Up Coding
4 min read · Feb 16, 2022


Adding rate limits or shun rules to the nginx configuration is a good way to deal with abusive or malicious bot traffic. However, when a good bot is caught by those rules, SEO value drops, and the problem is a pain to deal with until the bot is whitelisted.

Vendor CIDR Blocks

Many reputable vendors publish and periodically update lists of their IP address ranges, expressed as CIDR blocks, that can be used to update whitelist rules. Google, for example, publishes such a list here:
https://developers.google.com/search/apis/ipranges/googlebot.json

And a complete guide to whitelisting Google can be found here:
https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot

For this tutorial, the googlebot list will be used to create the process.

Reload Method — OpenSource Nginx

If using Nginx Plus, see the KeyVal Method below this section.

If the open source version of nginx is in use, create a script that generates an nginx configuration file containing a map variable. The Nginx Map Comparisons post has information on what an nginx map is.

The curl utility will be used to retrieve the lists for parsing. If the published list is in JSON format, like the Googlebot lists, the jq utility can parse it into a usable format.

Script:

Script Variables

There are two variables that can be updated to match the needs of the target environment.

GOOGLE_WHITELIST_CONF

Set this to an absolute path where the configuration file should be written. In a default nginx configuration, *.conf files under /etc/nginx/conf.d are included automatically from the http context, which is where a geo map must be defined. It’s a good idea to put the file in a folder that is covered by an include conf.d/*.conf; directive.

RELOAD_CMD

Update this value to match the command necessary to reload the nginx configuration. Running it after the file is rewritten ensures any updates to the map are activated.
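As a rough illustration of how these two variables might be used together, the sketch below writes new map content to GOOGLE_WHITELIST_CONF and runs RELOAD_CMD only when the content actually changed. The install_conf helper name, the default path, and the default reload command are assumptions made for this example and are not taken from the original script.

```shell
#!/usr/bin/env bash
# Assumed defaults; adjust both for the target environment.
GOOGLE_WHITELIST_CONF="${GOOGLE_WHITELIST_CONF:-/etc/nginx/conf.d/google_whitelist.conf}"
RELOAD_CMD="${RELOAD_CMD:-nginx -s reload}"

# install_conf (hypothetical helper): read new config from stdin, replace the
# target file only when the content changed, then run the reload command.
install_conf() {
  local tmp
  # The temp file sits next to the target so mv is a same-filesystem rename.
  # Its name does not end in .conf, so an include of *.conf will not load it.
  tmp="$(mktemp "${GOOGLE_WHITELIST_CONF}.XXXXXX")" || return 1
  cat > "$tmp"

  # Refuse to install an empty file, for example after a failed download.
  if [ ! -s "$tmp" ]; then
    rm -f "$tmp"
    echo "new config is empty; keeping existing file" >&2
    return 1
  fi

  # Nothing changed: skip the reload.
  if [ -f "$GOOGLE_WHITELIST_CONF" ] && cmp -s "$tmp" "$GOOGLE_WHITELIST_CONF"; then
    rm -f "$tmp"
    return 0
  fi

  chmod 644 "$tmp"
  mv "$tmp" "$GOOGLE_WHITELIST_CONF" || return 1
  # RELOAD_CMD is left unquoted on purpose so "nginx -s reload" splits into words.
  $RELOAD_CMD
}
```

Skipping the reload when nothing changed keeps a frequently scheduled job from reloading nginx needlessly. Running nginx -t before the reload would be a sensible extra guard.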

Dependencies

There are two binary dependencies required to operate the script: curl and jq. Make sure these are installed on the system running this script.
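A script like this can fail early with a clear message when a dependency is missing. The following is a minimal sketch; the require_cmds helper name is an assumption for illustration.

```shell
# require_cmds (hypothetical helper): report every missing command on stderr
# and return non-zero if any of the named commands is not on the PATH.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing dependency: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling require_cmds curl jq || exit 1 near the top of the script would stop it before any file is touched.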

Parsing

The JSON format of the Googlebot list makes parsing simple. The jq filter .prefixes[] | .[] outputs each IP address block surrounded by double quotes, one per line. Wrapping this in a while read loop allows each value to be formatted as a map entry.
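The parsing and formatting steps described above could be sketched roughly as two small functions. The function names are assumptions for illustration; the jq filter is the one described in the text, and the quoted output is passed through unchanged.

```shell
# extract_prefixes: read Googlebot-style JSON on stdin and print one
# double-quoted CIDR block per line (requires jq).
extract_prefixes() {
  jq '.prefixes[] | .[]'
}

# format_geo_map: read one CIDR block per line on stdin and print an nginx
# geo block that sets the named variable to 1 for each block.
format_geo_map() {
  local var_name="${1:-is_google}" cidr
  printf 'geo $remote_addr $%s {\n' "$var_name"
  while IFS= read -r cidr; do
    # Skip blank lines so the output never contains an empty entry.
    if [ -n "$cidr" ]; then
      printf '    %s 1;\n' "$cidr"
    fi
  done
  printf '    default 0;\n}\n'
}
```

With curl and jq installed, a pipeline such as curl -fsS followed by the Googlebot list URL, piped through extract_prefixes and then format_geo_map is_google, would print the finished block.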

Scheduling the Script

The script can be added to a crontab on each webtier host that needs an updated whitelist, or it can be run from a CI/CD system against each webtier host. The whitelists should be updated regularly. Once a day may be enough, but several runs a day shorten the window in which a newly published range is not yet whitelisted; the right interval depends on how often the source list changes.
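For the crontab option, an entry might look like the line printed by the sketch below. The cron_entry helper, the script path used in the usage note, the minute offset, and the logger tag are all assumptions for illustration.

```shell
# cron_entry (hypothetical helper): print a crontab line that runs the given
# script every N hours at minute 17 and sends its output to syslog.
cron_entry() {
  local script="$1" every_hours="${2:-6}"
  printf '17 */%s * * * %s 2>&1 | logger -t google-whitelist\n' "$every_hours" "$script"
}
```

For example, cron_entry /usr/local/bin/update-google-whitelist.sh 6 prints a line that can be pasted into the crontab with crontab -e. An off-the-hour minute avoids piling onto jobs that run at minute 0.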

Configuration File Explanation

The configuration created by the script is a “geo” map, which compares an IP address against CIDR blocks. It uses $remote_addr as the source variable:

geo $remote_addr $is_google { … }

The geo map uses $remote_addr by default, so the source variable can be omitted from the definition. It is set in the examples for completeness.

A simple way to tell what $remote_addr is in the environment is to log it and check the client IP address value from requests with a Googlebot user agent. If $remote_addr is a private IP or the address of a firewall, load balancer, or CDN, a different source for the real client IP will be required. Setting the proxy directive in the geo map or configuring the realip module for the environment may be required to get at this value reliably.

The documentation for ngx_http_geo_module and ngx_http_realip_module explains how to get at the correct IP address if $remote_addr contains an unexpected IP address.
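As one possible realip setup, the sketch below prints directives for the http context that make nginx take the client address from a forwarding header when the request arrives from a trusted proxy range. The helper name, the 10.0.0.0/8 range, and the X-Forwarded-For header are assumptions; use the values for the actual load balancer or CDN. The realip module must be compiled into the nginx build.

```shell
# print_realip_conf (hypothetical helper): emit realip directives for the
# http context. $1 = trusted proxy CIDR, $2 = header carrying the client IP.
print_realip_conf() {
  local trusted_cidr="${1:-10.0.0.0/8}" header="${2:-X-Forwarded-For}"
  cat <<EOF
# Trust the forwarding header only for requests arriving from this range.
set_real_ip_from  ${trusted_cidr};
real_ip_header    ${header};
# Walk the header from the right, skipping other trusted addresses.
real_ip_recursive on;
EOF
}
```

With these directives in place, $remote_addr holds the address taken from the header, so the geo map can keep using it unchanged.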

The geo map output format should look like the following:

geo $remote_addr $is_google {
    "66.249.72.224/27" 1;
    default 0;
}

Usage

The $is_google geo map variable can be used in whitelists. A good way to create a whitelist is by building a cascade of maps to form a $whitelist map variable. Here is an example of how to build this kind of whitelist:
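One way such a cascade might look is sketched below, written as a shell function that prints a hypothetical whitelist.conf so it fits the generated-file approach. The $is_monitoring variable and its address range are invented for illustration; only $is_google and $whitelist come from this article.

```shell
# print_whitelist_conf (hypothetical helper): emit a small cascade of maps
# that ends in a single $whitelist variable.
print_whitelist_conf() {
  cat <<'EOF'
# Hypothetical second source: an internal monitoring range.
geo $remote_addr $is_monitoring {
    192.0.2.0/24 1;
    default      0;
}

# Last link in the cascade: no other source matched.
map $is_monitoring $whitelist_monitoring {
    1       1;
    default 0;
}

# First link: Googlebot is whitelisted, anything else falls through.
map $is_google $whitelist {
    1       1;
    default $whitelist_monitoring;
}
EOF
}
```

Each additional source adds one more map whose default points at the next link, so a request is whitelisted as soon as any source matches.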

The resulting $whitelist variable can be used in an exclusion rule or as a flag in other variables or directives to ensure actions are performed on only non-whitelisted requests or vice-versa.
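As an example of an exclusion rule, the sketch below prints a rate-limit configuration in which whitelisted requests get an empty key; nginx does not account requests whose limit_req_zone key is empty. The helper name, the zone name, its size, and the rate are assumptions for illustration.

```shell
# print_rate_limit_conf (hypothetical helper): emit http-context directives
# that exempt whitelisted requests from a per-IP rate limit.
print_rate_limit_conf() {
  cat <<'EOF'
# Whitelisted requests get an empty key, which limit_req_zone ignores.
map $whitelist $limit_key {
    1       "";
    default $binary_remote_addr;
}

limit_req_zone $limit_key zone=per_ip:10m rate=10r/s;
EOF
}
```

Inside a server or location block, a directive such as limit_req zone=per_ip burst=20; would then throttle only non-whitelisted clients.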

KeyVal Method — Nginx Plus Version

If Nginx Plus is in use, a keyval zone can handle live updates, with no need to rewrite a configuration file and reload every nginx webtier instance.

A zone_sync stream configuration is recommended in the Nginx Plus configuration. Without it, the script needs to run against each instance to set the keyval data on every nginx instance.

KeyVal and Stream

The keyval configuration that holds the Googlebot IP addresses in this example uses a stream configuration to enable keyval data sync. This example uses DNS resolution to inform nginx of all nginx instance IP addresses in the cluster, but this can be adjusted to list individual nginx server instances if necessary. See the documentation for zone_sync_server for guidance.

The keyval_zone is configured with a TTL of 10 years, a state file for restoration on full restart, and is set to a type of ip to do an IP address comparison to a CIDR block. If there’s a match, $is_google will be set to “1”.
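A configuration along these lines might look like the sketch below, printed from a shell function to stay consistent with the other examples. The resolver address, cluster hostname, sync port, zone name, zone size, and state file path are all assumptions; only the 10 year timeout, the state file, the ip type, and the $is_google variable come from the description above.

```shell
# print_keyval_conf (hypothetical helper): emit a sketch of the stream and
# http pieces of an Nginx Plus keyval setup with zone synchronization.
print_keyval_conf() {
  cat <<'EOF'
# Top level of nginx.conf: zone synchronization between cluster nodes.
stream {
    resolver 10.0.0.2 valid=20s;    # assumed internal DNS resolver

    server {
        listen 9000;                # assumed sync port
        zone_sync;
        # One DNS name that resolves to every instance in the cluster.
        zone_sync_server nginx-cluster.example.internal:9000 resolve;
    }
}

# Inside the existing http { } block:
keyval_zone zone=google_whitelist:1m timeout=10y type=ip
            state=/var/lib/nginx/state/google_whitelist.keyval sync;
keyval $remote_addr $is_google zone=google_whitelist;
EOF
}
```

When no key matches, a keyval variable is empty rather than "0", so downstream maps should treat anything other than "1" as not whitelisted.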

The whitelist.conf file from the reload method will work with this method:

Script

The script that updates the keyval memory zone is below. If the nginx cluster is not using a stream configuration for data synchronization, the script will need to run against each nginx instance individually, so a zone_sync configuration is recommended to simplify operating nginx with shared data. The script is written for bash 3.x or higher and requires the following variables to be set or updated in the script:

NGINX_CLUSTER_API_SCHEME - set to http or https (default)
NGINX_CLUSTER_API_SERVER - set to an nginx IP address or internal DNS name
NGINX_CLUSTER_API_PORT - set to the port number of the enabled API
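A minimal sketch of how these variables might drive the update is shown below. It assumes the Nginx Plus API is enabled with write access, and it introduces two extra variables, API_VERSION and KEYVAL_ZONE, plus default values and helper names that are not part of the original script. Each CIDR block is added with its own POST, and a 409 response (key already present) is treated as success. Removing entries that Google has retired is not covered here.

```shell
#!/usr/bin/env bash
NGINX_CLUSTER_API_SCHEME="${NGINX_CLUSTER_API_SCHEME:-https}"
NGINX_CLUSTER_API_SERVER="${NGINX_CLUSTER_API_SERVER:-nginx.example.internal}"  # assumed
NGINX_CLUSTER_API_PORT="${NGINX_CLUSTER_API_PORT:-8443}"                        # assumed
API_VERSION="${API_VERSION:-8}"                # assumed API version
KEYVAL_ZONE="${KEYVAL_ZONE:-google_whitelist}" # assumed zone name

# keyval_url: print the API endpoint for the keyval zone.
keyval_url() {
  printf '%s://%s:%s/api/%s/http/keyvals/%s\n' \
    "$NGINX_CLUSTER_API_SCHEME" "$NGINX_CLUSTER_API_SERVER" \
    "$NGINX_CLUSTER_API_PORT" "$API_VERSION" "$KEYVAL_ZONE"
}

# keyval_payload: print the JSON body for one CIDR block. Surrounding
# double quotes, as produced by jq without -r, are stripped first.
keyval_payload() {
  local cidr="$1"
  cidr="${cidr#\"}"
  cidr="${cidr%\"}"
  printf '{"%s":"1"}\n' "$cidr"
}

# add_cidr: POST one CIDR block to the zone. 201 means created and 409 means
# the key already exists; both leave the zone in the desired state.
add_cidr() {
  local status
  status="$(curl -s -o /dev/null -w '%{http_code}' -X POST \
    -H 'Content-Type: application/json' \
    -d "$(keyval_payload "$1")" "$(keyval_url)")"
  case "$status" in
    201|409) return 0 ;;
    *) echo "failed to add $1 (HTTP $status)" >&2; return 1 ;;
  esac
}
```

Feeding the jq output through a while read loop that calls add_cidr for each line would populate the zone. With zone_sync in place, one run against any instance is enough.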

The script can be cron’d or scheduled in a build job, as needed.

Good luck!
