Turn any page into a JSON API with Cloudflare Workers

Cloudflare Workers are Cloudflare's answer to edge/serverless functions: code that lives "on the edge". You can read more about them in the Cloudflare documentation.

As I understand them, they let me write some JavaScript that runs on someone else's server. The cool thing about Cloudflare Workers is that they can intercept a request, tweak it, and return it (like a self-owned man-in-the-middle attack). One example of this is adding an extra cookie to the request.

However, this post isn't covering that. Instead, we are going to utilise HTMLRewriter, a class which allows you to search, modify and replace parts of the HTML response.

The original purpose of the HTMLRewriter is to modify the response - you could use it to add some dynamic content to a static page, for example. What it also offers, though, is a powerful API for analysing and extracting data from the page - we can leverage that to pull out key, consistent elements and assign them to keys in a JSON object.

For the examples in this post, we'll turn a blog post from this website into a "simple" API endpoint.

Note: for this blog, we will be using the Quick edit in the worker interface - but you can develop locally.

Setting up the worker

Log into Cloudflare and create a new Worker service - call it what you wish and select HTTP handler. Once ready, click the Quick edit button in the top right.

To get started, copy and paste the below code:

// Set the page to get
const url = 'https://www.mikestreety.co.uk/blog/how-to-delete-a-git-branch/'

// Set up a fetch handler
addEventListener('fetch', event => {
	event.respondWith(handleRequest(event.request));
});

// Handle the request
async function handleRequest(request) {
	// Load the URL
	let response = await fetch(url);
	// Get the contents
	let html = await response.text();

	// Return the contents
	return new Response(html, {
		headers: {
			'content-type': 'text/html;charset=UTF-8',
		},
	});
}

You can click the Send button in the debugger and view the output, or click Save and deploy and visit the page in the browser. This returns the HTML of the page directly, without doing anything to it.

Modify the HTML

If you want to change the response, you can introduce the HTMLRewriter. As a first example, this updates the h1 on the page to show the current date:

async function handleRequest(request) {
	let response = await fetch(url);

	// Start our HTML Rewriter
	return new HTMLRewriter()
		// Select the h1
		.on('h1', {
			// Modify the element
			element(element) {
				element.setInnerContent(new Date().toString());
			}
		})
		// Pass in the response
		.transform(response)
}

One thing to note is the transform() function goes last.

The on() method takes a CSS selector, much like jQuery. This can be a class, an ID, an HTML element or a mix of them, so you can be as specific as you like.

Initial JSON array

Now we know how to initialise the HTMLRewriter, we can start to see how we could utilise it. As well as element, the handler object passed to on() can define a text method, which receives the text inside the matched element.

As a proof of concept, let's grab the contents of the h1 and the GitLab link at the bottom of the post.

async function handleRequest(request) {
	let response = await fetch(url);

	// Set up a placeholder object
	let output = {};

	// Start our HTML Rewriter
	await new HTMLRewriter()
		.on('h1', {
			text(text) {
				// The text arrives in chunks, so we need to append
				output['h1'] = (output['h1'] === undefined ? '' : output['h1']) + text.text;
			}
		})
		.on('.meta.source a', {
			element(element) {
				// The element handler runs once per matching element
				output['git_source'] = element.getAttribute('href');
			}
		})
		.transform(response)
		// Read the transformed body so the handlers run
		.arrayBuffer();

	// Convert the output to JSON
	const json = JSON.stringify(output, null, 2)

	// Return the JSON
	return new Response(json, {
		headers: {
			'content-type': 'application/json;charset=UTF-8'
		}
	});
}

This is a bit more complicated and there are a few gotchas:

  • The text is actually in text.text
  • The text handler is called several times per element because the text arrives in chunks, so you have to append to the item if it exists
  • The transformed response needs to be read (here with arrayBuffer()) for the handlers to run

However, when you click the Send button, you should see a two-property JSON object, so you can start to see how our API could develop.
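
The chunked text handling described above could be wrapped in a small helper. This is only a sketch: collectText is a hypothetical name, not part of the HTMLRewriter API, and it relies on each chunk exposing text and lastInTextNode. The chunks are simulated here so the snippet runs outside a Worker.

```javascript
// Hypothetical helper: builds a text handler that joins the chunks for one key
function collectText(output, key) {
	let buffer = '';
	return {
		text(chunk) {
			// Each call delivers only part of the text node
			buffer += chunk.text;
			// lastInTextNode is true on the final chunk of a text node
			if (chunk.lastInTextNode) {
				output[key] = (output[key] || '') + buffer;
				buffer = '';
			}
		}
	};
}

// Simulate the chunks the rewriter might deliver for a single h1
const output = {};
const handler = collectText(output, 'h1');
handler.text({ text: 'How to delete ', lastInTextNode: false });
handler.text({ text: 'a Git branch', lastInTextNode: false });
handler.text({ text: '', lastInTextNode: true });

console.log(output.h1); // → How to delete a Git branch
```

In the Worker itself, this would be passed straight to on(), for example .on('h1', collectText(output, 'h1')).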

Additional properties

From here you can find and add any element on the page. Some creative selectors might be needed, but if you are scraping your own HTML, you can add IDs and classes to suit.
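
As the number of properties grows, one option is to describe them in a single object and register the handlers in a loop. The sketch below is just one way to do it: fields and registerFields are hypothetical names, and it only covers values read from an attribute.

```javascript
// Hypothetical field map: output key -> selector and attribute to read
const fields = {
	git_source: { selector: '.meta.source a', attribute: 'href' },
	img: { selector: '.photo img', attribute: 'src' }
};

// Registers one element handler per field on any object with an on() method
function registerFields(rewriter, fields, output) {
	for (const [key, { selector, attribute }] of Object.entries(fields)) {
		rewriter.on(selector, {
			element(element) {
				// Keep the first match only; getAttribute returns null if missing
				if (output[key] === undefined) {
					output[key] = element.getAttribute(attribute);
				}
			}
		});
	}
	return rewriter;
}
```

In a Worker this might be used as registerFields(new HTMLRewriter(), fields, output).transform(response), followed by reading the body as before.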

Here are some more examples of the types of elements & how you might select them:

Image src

.on('.photo img', {
	element(element) {
		// Data
		output['img'] = element.getAttribute('src')
	}
})

Link & text

.on('p a', {
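	text(text) {
		// The text arrives in chunks, so append rather than overwrite
		output['link_text'] = (output['link_text'] === undefined ? '' : output['link_text']) + text.text
	},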
	element(element) {
		output['link_url'] = element.getAttribute('href');
	}
})
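
A selector like 'p a' will usually match more than one link, so a single key ends up holding a mix of them. One possible approach is to collect every match into an array instead. This sketch assumes the element handler fires before that element's text chunks; collectLinks is a hypothetical helper name.

```javascript
// Hypothetical helper: collects every matching link as { url, text }
function collectLinks(output, key) {
	output[key] = [];
	return {
		element(element) {
			// Runs once per matching element, so start a new entry
			output[key].push({ url: element.getAttribute('href'), text: '' });
		},
		text(chunk) {
			// Append each chunk to the most recently started link
			const current = output[key][output[key].length - 1];
			if (current) {
				current.text += chunk.text;
			}
		}
	};
}
```

It would be registered with something like .on('p a', collectLinks(output, 'links')).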

Date from a string

.on('.time', {
	text(text) {
		if(text.text) {
			let date = (new Date(Date.parse(text.text))).toISOString().split('T')[0]
			output['date'] = date
		}
	}
})
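
One thing to watch with the date example is that toISOString() throws if the string can't be parsed. A slightly more defensive version could look like the sketch below, where toIsoDate is a hypothetical helper and the input strings are made up for illustration.

```javascript
// Hypothetical helper: returns YYYY-MM-DD, or null if the string is not a date
function toIsoDate(value) {
	const timestamp = Date.parse(value.trim());
	// Date.parse returns NaN when it cannot understand the string
	if (Number.isNaN(timestamp)) {
		return null;
	}
	return new Date(timestamp).toISOString().split('T')[0];
}

console.log(toIsoDate('2022-03-14T10:30:00Z')); // → 2022-03-14
console.log(toIsoDate('not a date')); // → null
```

Inside the text handler, you would then only set output['date'] when the helper returns a value.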

Conclusion

Hopefully this gives you a good start on using the HTMLRewriter to convert an HTML document into a JSON endpoint.

Written by Mike Street

Mike is a CTO and Lead Developer from Brighton, UK. He spends his time writing, cycling and coding. You can find Mike on Mastodon.