Negative Cache with Nginx

John H Patton
Published in Level Up Coding · Feb 4, 2021


Without Using a Separate Subsystem

Photo by Alexander Popov on Unsplash

REST API calls that check the state of something before allowing an action to complete can be heavily abused by non-human bot clients. Typically, with a REST API, the response status informs the calling service whether the action can proceed or completed successfully. A positive response code in the 2xx range normally lets the action continue. A negative response code, typically in the 4xx range but sometimes in the 5xx range, indicates the action is blocked or has failed. A system can become degraded, or even suffer an outage, under heavy load on an endpoint that keeps reporting that an action cannot complete, especially when that failure state is caused not by the transaction itself but by some other aspect of the system.

Example

A good example of this kind of API call is a stock availability check used to determine whether an add-product-to-cart action can complete. If an e-commerce site has a low-stock, high-demand product, an increased volume of bot traffic hitting the add-to-cart API results in heavy load on the commerce system's cart service and on the inventory system. This typically happens when a high-demand product is not yet available or has sold out. Products that enter a sold-out state after a frenzied sale will continue to throw errors at a high rate on certain endpoints, even though there is no stock left to sell. Operations teams can feel helpless watching their systems suffer under this kind of traffic, knowing there is little to be done until the frenzy subsides.

The state of this product is not going to change until there is available stock, so it might be a good idea to cache the negative response on these calls, or on the request resource that triggers them, with a short Time-To-Live (TTL) to alleviate unnecessary load on the backend systems. Putting a cache in place on the negative responses allows the overburdened upstream to recover. This is called, unsurprisingly, a negative cache.

NOTE: This article focuses on using nginx-plus to handle a negative cache, but open source nginx or OpenResty should work in much the same manner. Additionally, if there's a CDN in front of nginx, like Akamai or Fastly, this logic should be moved to that layer if it makes sense for the implementation and is possible to do. In some cases, it may make sense to have a negative cache layer at both the edge and the origin; just remember that the TTLs can stack.

Unexpected Errors

Scenario: Marketing launches a new product, and activating the product page wakes up the reseller bots even though the product has zero stock until launch. Heavy traffic causes significant errors on the site, which triggers an incident response. Support technicians rally to fix the issue.

An inventory system will usually provide an availability check URI which is called when a request is made to add a product to the user's bag. When a reseller bot is active on a site, it will usually walk through a series of requests:

  1. create a session
  2. create a cart by adding any product to bag
  3. add desired product to bag repeatedly until the add to bag request succeeds
  4. remove the cart-creation product from the bag (the one added in step 2)
  5. add shipping/billing info
  6. add payment
  7. submit order

Many bot agents will be activated across various cloud providers in an effort to scoop up as much product as possible as soon as a desired product comes online. Add-to-cart activity will increase, putting load on the system even though there is no product available yet. The add-to-cart availability check is a good place to put a negative cache to reduce this overhead.

NOTE: Adding a short-TTL negative cache to time-sensitive endpoints can provide a huge benefit, but it must be monitored and tuned to meet the needs of the system.

PII / PCI-DSS and Cacheability

Before enabling any cache, the response being cached should be evaluated for PII or PCI-DSS data. If the response has any headers or body data that could expose one client's information to another client, the payload must be modified to exclude those headers or body data before it is cached.

Ideally, cacheable resources should not have any response data tied to the client’s session or any data that can interfere with caching the response.

NOTE: if there's a Set-Cookie header in the response, it will not be cached by default, and a Vary header changes how (or whether) the response is cached. There may be other considerations to caching a response for the system, but these are the most important. See the proxy_hide_header and proxy_ignore_headers directives for details on caching these types of responses.
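
A minimal sketch of how those directives might be applied is below; placing them in the availability location{} block (shown later) is an assumption:

    # Do not pass the upstream Set-Cookie header on to the client.
    proxy_hide_header    Set-Cookie;

    # Ignore Set-Cookie and Vary from the upstream so the response can
    # still be cached (only do this after the PII / PCI-DSS review above).
    proxy_ignore_headers Set-Cookie Vary;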

The Negative Cache

For this example, product availability requests will be given a short negative cache for out of stock responses.

NOTE: depending on the system architecture, the availability check may be part of the add-product-to-bag function and there may be more involved. That add-product-to-bag resource may also be a good place to put this cache.

Availability URI

Let's define a basic availability request and response for a product that has no stock:

Request:
GET /api/product/check_availability/{product_number}

Response:
{ "quantity_available": NUMBER }

The response status codes are:
* 200 if the product is available (10 or more in stock)
* 201 if the product is low on stock (fewer than 10 in stock)
* 400 if the product is unavailable
* 500 if the product number is invalid

Cache Key

In order to cache the availability call for an item, we need to create a cache key that makes sense for the item. In this case, we’ll separate the product number out of the URI into a variable using an nginx map:
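
A minimal sketch of such a map, placed in the http{} block, might look like the following; the exact URI pattern and the variable name are assumptions based on the availability endpoint above:

    # Pull the product number out of the availability URI into a
    # variable that can be used in the cache key.
    map $uri $availability_product_number {
        default                                            "";
        "~^/api/product/check_availability/(?<pn>[^/]+)$"  $pn;
    }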

Since nginx maps are only evaluated when they’re used, this is an effective way to capture the product number for a cache key. It’s only evaluated where we need it: in the location block handling availability calls.

Cache Configuration

To configure the cache for this call, a proxy_cache_path directive needs to be configured in the http{} block. There are also some proxy_cache_* settings that should be configured for this, which will be set in the location block. Here's the proxy module configuration for the negative cache:
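
A sketch of that http{} block configuration might look like the following; the cache zone name (negative_cache) and the cache directory are assumptions:

    # Cache storage: 10MB of keys, entries removed 5 minutes after last use.
    proxy_cache_path /var/cache/nginx/negative
                     levels=1:2
                     keys_zone=negative_cache:10m
                     max_size=100m
                     inactive=5m
                     use_temp_path=off;

    # Serve a stale cached response on errors, timeouts, cache updates,
    # or upstream 5xx responses.
    proxy_cache_use_stale error timeout updating
                          http_500 http_502 http_503 http_504;

    # Only one request at a time populates a given cache element;
    # duplicate requests wait up to 5 seconds for it to finish.
    proxy_cache_lock on;
    proxy_cache_lock_timeout 5s;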

  • This configuration will serve from cache, as long as the 5-minute inactive TTL has not expired, whenever the backend is having issues and responds with a 500, 502, 503, or 504 status code.
  • Nginx will serve from cache on error, on timeout, or when the cache is being updated for a request.
  • Duplicate requests for the same cache key will wait up to 5 seconds for an in-flight cache update before being proxied to the backend server for a response.

The directives are documented in the nginx proxy module documentation here:
http://nginx.org/en/docs/http/ngx_http_proxy_module.html

In the product availability location{} block, the following directives will complete the cache configuration:
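
A sketch of that location{} block, assuming the cache zone and map variable from the earlier sketches and a hypothetical upstream named inventory_backend:

    location ~ ^/api/product/check_availability/ {
        proxy_cache      negative_cache;
        proxy_cache_key  "$request_method$availability_product_number";

        # Negative cache: only out-of-stock (400) responses are cached,
        # and only for 1 minute of active TTL.
        proxy_cache_valid 400 1m;

        # Uncomment if the availability check is made as a POST request.
        # proxy_cache_methods GET HEAD POST;

        # Strip and ignore session-related headers as discussed earlier.
        proxy_hide_header    Set-Cookie;
        proxy_ignore_headers Set-Cookie Vary;

        proxy_pass http://inventory_backend;
    }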

  • This will cache responses with a status of 400 for 1 minute in active cache, with an inactive cache TTL of 5 minutes.
  • If the request method is POST, the proxy_cache_methods directive will need to be uncommented.

Conclusion

Remember to remove any headers from the cached response that may cause problems, like Set-Cookie, Vary, or any headers that might expose PII / PCI-DSS data.

Place the negative cache on request URIs that might be targeted in a “failed” state at a high rate to reduce backend impact.

Thanks for reading.
