Automated requests and Cloudflare's TLS fingerprinting
My (initially unsuccessful) attempts to fetch an RSS feed via curl
When making automated requests to an RSS feed, I recently came across some very peculiar behaviour. The RSS feed (lawgazette.com.sg/feed/) would load just fine in my browser (Firefox) but not when I used curl.
The unminified HTML that came back was Cloudflare's challenge page.
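For a script, it helps to tell a challenge page apart from the feed itself before trying to parse anything. Below is a minimal sketch of such a check. It relies on the `cf-mitigated: challenge` response header that Cloudflare documents for challenge pages; the fallback on status code, `Server` header and HTML content type is my own heuristic, and the function name is made up for illustration.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(
    status: int, headers: Mapping[str, str], body: str = ""
) -> bool:
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalise names and values first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Documented signal: challenge responses carry `cf-mitigated: challenge`.
    if lowered.get("cf-mitigated", "") == "challenge":
        return True

    # Fallback heuristic (an assumption, not documented behaviour): an error
    # status served by Cloudflare with an HTML body where a feed was expected.
    served_by_cloudflare = lowered.get("server", "") == "cloudflare"
    is_html = (
        "text/html" in lowered.get("content-type", "")
        or body.lstrip().lower().startswith("<!doctype html")
    )
    return status in (403, 503) and served_by_cloudflare and is_html
```

The function takes plain values rather than a response object, so it can be fed from whichever HTTP client happens to be in use.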
Pretty simple, I thought: Cloudflare is probably running all sorts of 'real browser' checks, e.g. calling HTML canvas APIs, checking which fonts are installed, looking at the user agent, etc. But then I noticed that I had the JShelter browser extension enabled, which thwarts most fingerprinting attempts. Odd.
If I had been ssh'ed into a VPS, I would have chalked it up to IP address blacklisting or greylisting, but I was calling curl from my own residential IP address.
Since the page I was loading was an RSS feed, I popped it into a desktop RSS reader, Fluent Reader, just to check, fully expecting that the HTTP request sent by the RSS reader would trigger the Cloudflare challenge. But the feed loaded just fine. Very odd.
At this point, I recalled that Fluent Reader was an Electron app, so it was possible that the request was being sent via headless Chrome, which might pass the 'real browser' checks. It seemed unlikely that the RSS feed was being fetched in the frontend headless Chrome layer rather than the Node.js backend, not least because the former would give rise to CORS issues, but I decided to try fetching the RSS feed in a non-Electron app, QuiteRSS, just to be sure.
As far as I could tell, the HTTP requests being sent by QuiteRSS and my own HTTP requests via curl were identical, save for minor differences in the TLS version. Surely not? I added the --tlsv1.3 flag to force curl to use TLS 1.3. Nope, still didn't work.
But that got me thinking. Maybe there were minute differences in the way TLS was implemented in curl and in web browsers like Chrome and Firefox that Cloudflare was taking advantage of to identify curl requests and show challenge pages in response.
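To make "minute differences" concrete, here is a rough sketch of JA3, one publicly documented TLS fingerprinting scheme. It hashes fields of the ClientHello (version, cipher suites, extensions, curves, point formats) in the order the client sent them. Whether JA3 specifically is what tripped the challenge here is an assumption on my part, and the numeric values in the example are illustrative rather than captured from a real curl or Firefox handshake.

```python
import hashlib
from typing import Iterable

# GREASE placeholder values (RFC 8701): 0x0a0a, 0x1a1a, ..., 0xfafa.
# JA3 ignores them because browsers randomise them on every connection.
GREASE = {(n << 12) | 0x0A00 | (n << 4) | 0x0A for n in range(16)}


def ja3_string(
    version: int,
    ciphers: Iterable[int],
    extensions: Iterable[int],
    curves: Iterable[int],
    point_formats: Iterable[int],
) -> str:
    """Build the comma-separated JA3 string from ClientHello fields."""

    def join(values: Iterable[int]) -> str:
        # Decimal values, dash-separated, in the order they were offered.
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(
    version: int,
    ciphers: Iterable[int],
    extensions: Iterable[int],
    curves: Iterable[int],
    point_formats: Iterable[int],
) -> str:
    """MD5 of the JA3 string, which is the form usually compared or logged."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Two hypothetical clients offering the same cipher suites in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 10, 11], [29, 23], [0])
client_b = ja3_hash(771, [4866, 4865, 4867], [0, 10, 11], [29, 23], [0])
print(client_a == client_b)
```

Because the order of the lists is part of the hash, two clients that support exactly the same features can still be told apart. Under a scheme like this, changing only the TLS version a client negotiates would leave most of the fingerprint untouched.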
Fortunately for me, I am far from the first person to be plagued by this issue, and others far more talented than I had already implemented a solution: lwthiker/curl-impersonate, which is a derivative of curl that performs HTTP and TLS handshakes identical to those of Chrome and Firefox. I installed the dependencies, downloaded the binary, and made my HTTP request:
sudo apt install -y libnss3 nss-plugin-pem ca-certificates
curl-impersonate-ff \
  -H 'Accept: application/atom+xml,application/rss+xml;q=0.9,application/xml;q=0.8,text/xml;q=0.7,*/*;q=0.6' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0' \
  https://lawgazette.com.sg/category/notices/disciplinary-tribunal-reports/feed/
It worked beautifully.
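For automated fetching, the same command can be driven from a script. The following is a minimal sketch, assuming the `curl-impersonate-ff` binary is on the `PATH`; the function names, the timeout and the choice of `--silent --show-error --fail` are mine rather than anything curl-impersonate prescribes.

```python
import subprocess
from typing import List

# Same Accept header as in the manual command.
ACCEPT = (
    "application/atom+xml,application/rss+xml;q=0.9,"
    "application/xml;q=0.8,text/xml;q=0.7,*/*;q=0.6"
)


def build_command(url: str, binary: str = "curl-impersonate-ff") -> List[str]:
    """Return the argv list for fetching a feed with curl-impersonate."""
    # --fail makes curl exit non-zero on HTTP errors (e.g. a 403 challenge),
    # so a blocked request surfaces as an exception, not as HTML to parse.
    return [binary, "--silent", "--show-error", "--fail", "-H", f"Accept: {ACCEPT}", url]


def fetch_feed(url: str, binary: str = "curl-impersonate-ff", timeout: float = 30.0) -> str:
    """Run curl-impersonate and return the response body as text."""
    result = subprocess.run(
        build_command(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if curl reports a failure
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with the long Accept header.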
It looks like most major RSS readers (and other 'bots') don't have to use this workaround because they are whitelisted by Cloudflare.[1] Aside from the usual suspects (Googlebot, Bingbot, etc.), the whitelist also has entries from other tech solution providers such as OpenAI, Better Uptime, Slack, and Telegram, as well as RSS readers such as Feedly, Feeder, and Feedbin. Under this system, 'friendly bots' have to be submitted to Cloudflare to be whitelisted after verification.
This is fine, but should there be a need to get Cloudflare's permission before fetching an RSS feed? Since the verification methods that Cloudflare supports for validating traffic are reverse DNS and IP range whitelisting only, the system doesn't seem to allow for whitelisting of client-side solutions that run on the user's machine, where the IP address cannot be determined in advance.
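To illustrate why reverse DNS verification leaves client-side tools out, here is a rough sketch of a forward-confirmed reverse DNS check, the general technique that this kind of bot verification is usually built on. It is not Cloudflare's implementation. The function names and the `.crawler.example` suffix are hypothetical, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """A-record lookup: hostname to its IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Check that the IP's hostname is on an allowed domain and maps back to it."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 1: the PTR record must sit under a domain the bot operator controls.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False

    # Step 2: the hostname must resolve back to the same IP, otherwise anyone
    # who controls their own PTR records could claim to be the bot.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

A desktop RSS reader running on a residential connection fails at step 1: the PTR record for that address, if there is one, belongs to the ISP and not to the reader's developer.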
Separately, what remains unclear to me is why a challenge page was shown for this particular RSS feed. Many sites are behind Cloudflare these days (including this one), and Cloudflare doesn't show a challenge page for vanilla curl requests to my feed, so why that one? My suspicion is that it has something to do with the Cloudflare security configuration for that site:
Maybe it has 'Bot Fight Mode' or Cloudflare's general 'Under Attack' mode turned on? The showing of the challenge page is clearly not intentional: it's an RSS feed, so it's meant to be fetched in an automated fashion, not read in a browser directly. It's not difficult to get around this at the moment, but there shouldn't be a need for developers to play this sort of cat-and-mouse game for non-malicious applications.[2]
Considering how widespread usage of Cloudflare (and other similar solutions such as Akamai) is, there does appear to be a risk to RSS as a technology as well as to the open web more broadly if this sort of blocking, solely based on the nature of the request, becomes the default option or the recommended option on the basis of 'security'.
[1] This list is not exhaustive.
[2] This unintended blocking of RSS feeds as a result of heavy-handed anti-DDoS and anti-bot measures has also been highlighted by others. See e.g. Kevin Cox, "Problems with Cloudflare Bot Blocking" (2021), among others.