
Scraping The Swyftx Apiary Docs Into A JSON File With Puppeteer

The previous post "Authenticating Use For The Swyftx API With Node.js" went through how we can create a simple workflow for our Swyftx API by following the Swyftx docs.

In this post, we're going to hack a way to create a record of the API calls we can make to the Swyftx API along with the expected params, request args and response body for each possible request.

This is in no way a complete list of all the API calls we can make to the Swyftx API, but it's a good starting point to get a feel for what we can do with the API and make adjustments as we go along from a more centralized, offline data source.

In general, I would look for a source of truth behind the API, such as an OpenAPI document. As far as I can tell, nothing like that is available, so I will do my best to generate that spec myself.

In order to do so, we need to do the following "bigger" steps:

  1. Scrape the data from the Swyftx documentation website.
  2. Convert the data into a JSON data structure to use later.
  3. Generate an OpenAPI spec from the JSON data structure.
  4. Write a way to generate our TypeScript wrapper API based on the OpenAPI spec.

In this post, we will be covering steps one and two.

Source code for the project can be found on my GitHub

Prerequisites

  1. Basic familiarity with npm
  2. Basic familiarity with Node.js
  3. Familiarity with the Swyftx docs
  4. Check out the previous post in the series "Authenticating Use For The Swyftx API With Node.js"
  5. This post expects you to be familiar with Puppeteer. I will not be covering the syntax and code I am writing here. You can see other posts I have written using Puppeteer here.

Getting started

Something worth noting before we go too far: the work we are doing today is essentially a "one-and-done" script for a specific use case. I am opting to work with JavaScript for the sake of swiftness as well as writing code that, to be brutally honest, you wouldn't want to run in production and would want to tidy up if you were working alongside other devs. The aim of the game today is to scrape data and get it into an offline document to import later.

To get started, let's create a folder swytfx-apiary-to-api and set up our Node.js project.

    $ mkdir swytfx-apiary-to-api
    $ cd swytfx-apiary-to-api
    # Create a file to work from
    $ touch index.js
    # Create a folder for the debugging screenshots we take later
    $ mkdir imgs
    # initialise npm project with basics
    $ npm init -y
    $ npm install puppeteer

We will be using Puppeteer to scrape the Swyftx API docs - note that the linked website has some caveats that I will speak to in the next section.

Exploring the docs website

If we head to the Swyftx docs and dig into the HTML, you will realize that the website is actually an iframe. Scraping through the iframe would be a nightmare, so instead we follow the iframe to its source and feed that source HTML into the scraper.

In general, what I like to do is find the common class names and attributes that are used across the website and use those to create a JSON data structure that we can use to generate our API.

By opening up the developer tools and searching around the HTML, I found the following:

  1. .actionInvitation is the class name applied to all of the action buttons on the page to display the request example on the right-hand side.
  2. Every time you click on .actionInvitation, there is a new .row.machineColumnContent div element appended that contains the request information (on the right-hand side). It looks as if the last-of-type for that element is what displays, while the others have the class .hidden appended.
  3. Within this, there are three elements that contain the information that I need for requests, parameters and response body under the class names .request, .parameterRow and .machineColumnResponse respectively.

If the above doesn't make too much sense, then hopefully it will when we write the code. The tl;dr is that I am finding the HTML elements that contain important data that I might be able to scrape to build out my desired JSON file.
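As a rough sketch of how those findings could be kept in one place, the selectors can live in a small lookup object. The object name SELECTORS, its keys and the activePanelSelector helper are hypothetical names for illustration; only the selector strings themselves come from the page.

```javascript
// Class names observed in the Apiary HTML, as listed above.
// The object name and keys are hypothetical; only the selector strings come from the page.
const SELECTORS = {
  actionButton: ".actionInvitation",
  panel: ".row.machineColumnContent",
  request: ".request",
  parameterRow: ".parameterRow",
  response: ".machineColumnResponse",
};

// Only the most recently appended panel appears to be visible;
// the earlier ones have the `.hidden` class added.
const activePanelSelector = () => `${SELECTORS.panel}:last-of-type`;

console.log(activePanelSelector());
```

Keeping the strings together like this makes it easier to adjust the scraper if the class names on the docs page ever change.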

Planning the JSON data structure

After identifying the HTML elements that I need to scrape, I will start by creating a JSON data structure that I can use to generate the API.

It is up to us how we want the final JSON file to look, but I will opt for this data structure:

    {
      "endpoints": [
        {
          // url path (not inclusive of the base url)
          "url": "/auth/refresh/",
          "method": "POST",
          "request": {
            "requestExampleValue": {
              // example request body
            }
          },
          // array of params as objects
          "parameters": [
            {
              "paramKeyValue": "assetId",
              "paramRequirementValue": "Required",
              "paramDescriptionValue": "Asset ID. See Get Market Assets id for all asset ids"
            }
            // ... more params
          ],
          "responses": [
            {
              "responseStatusValue": "200",
              "responseExampleValue": {
                "accessToken": "eyJhbGciOiJSUzI1N...",
                "scope": "app.account.read ..."
              }
            },
            {
              "responseStatusValue": "500",
              "responseExampleValue": {
                "error": {
                  "error": "StillLoading",
                  "message": "Please try again or contact support."
                }
              }
            }
            // ... more responses
          ]
        }
        // ... more endpoints
      ]
    }

The above describes one entry in the endpoints array of my JSON object, but it shows how each entry may look.

Each entry will cover the following:

  1. url - the URL path for the API endpoint.
  2. method - the HTTP method for the API endpoint.
  3. request - the request object that contains the request example.
  4. parameters - the array of parameters that the request accepts.
  5. responses - the array of responses that are possible for the request.

Now that we have the aim of what we want to build, we can move onto the scraper.
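To make the planned shape a little more concrete, here is a minimal sketch of a checker for a single entry. The validateEndpoint function is a hypothetical helper and not part of the scraper, and the rule that url starts with "/" is my assumption based on the example above.

```javascript
// Hypothetical helper: returns a list of problems found in one `endpoints` entry.
const validateEndpoint = (entry) => {
  const problems = [];
  // Assumption: every path is relative to the base url and starts with "/".
  if (typeof entry.url !== "string" || !entry.url.startsWith("/")) {
    problems.push("url must be a path starting with '/'");
  }
  if (typeof entry.method !== "string") {
    problems.push("method must be a string");
  }
  if (typeof entry.request !== "object" || entry.request === null) {
    problems.push("request must be an object");
  }
  if (!Array.isArray(entry.parameters)) {
    problems.push("parameters must be an array");
  }
  if (!Array.isArray(entry.responses)) {
    problems.push("responses must be an array");
  }
  return problems;
};

const example = {
  url: "/auth/refresh/",
  method: "POST",
  request: { requestExampleValue: {} },
  parameters: [],
  responses: [{ responseStatusValue: "200", responseExampleValue: {} }],
};

console.log(validateEndpoint(example)); // no problems reported for a well-formed entry
console.log(validateEndpoint({ url: "auth/refresh/" })); // several problems reported
```

A check like this could be run over the scraped entries afterwards to spot any that came out malformed.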

Setting up the scraper

I've said it once and I will say it again: this code will be hack-y. I'm not going to bother making it pretty as it is a means to an end.

To set up our code to work with Puppeteer, add the following to index.js:

    const puppeteer = require("puppeteer");

    const url = `https://jsapi.apiary.io/apis/swyftx.html`;

    /**
     * @see https://stackoverflow.com/questions/52497252/puppeteer-wait-until-page-is-completely-loaded/52501934
     */
    const waitTillHTMLRendered = async (page, timeout = 30000) => {
      const checkDurationMsecs = 1000;
      const maxChecks = timeout / checkDurationMsecs;
      let lastHTMLSize = 0;
      let checkCounts = 1;
      let countStableSizeIterations = 0;
      const minStableSizeIterations = 3;

      while (checkCounts++ <= maxChecks) {
        let html = await page.content();
        let currentHTMLSize = html.length;

        let bodyHTMLSize = await page.evaluate(
          () => document.body.innerHTML.length
        );

        console.log(
          "last: ",
          lastHTMLSize,
          " <> curr: ",
          currentHTMLSize,
          " body html size: ",
          bodyHTMLSize
        );

        if (lastHTMLSize != 0 && currentHTMLSize == lastHTMLSize)
          countStableSizeIterations++;
        else countStableSizeIterations = 0; //reset the counter

        if (countStableSizeIterations >= minStableSizeIterations) {
          console.log("Page rendered fully...");
          break;
        }

        lastHTMLSize = currentHTMLSize;
        // Plain timer rather than page.waitFor, which newer Puppeteer releases no longer provide
        await new Promise((resolve) => setTimeout(resolve, checkDurationMsecs));
      }
    };

    const scrapedData = {
      endpoints: [],
    };

    const main = async () => {
      let browser;
      try {
        browser = await puppeteer.launch({
          headless: true,
          args: [`--window-size=1920,1080`],
          defaultViewport: {
            width: 1920,
            height: 1080,
          },
        });

        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "load" });
        await waitTillHTMLRendered(page);

        const elHandleArray = await page.$$(".actionInvitation");
      } catch (err) {
        console.error(err);
      } finally {
        // browser is undefined if the launch itself failed
        await browser?.close();
      }
    };

    main();

The code itself does the following:

  1. Imports the puppeteer library.
  2. Sets the base url as https://jsapi.apiary.io/apis/swyftx.html.
  3. Adds a helper function waitTillHTMLRendered. This may not be required, but I have included it because my first attempt, which waited for an idle network, finished waiting before the site had loaded. We could alternatively use a delay or something similar.
  4. Defines a main function that is called at the end of the script.

The main function currently sets up a new browser and page, navigates to the base url and waits for the page to load. Afterwards, we create an array of clickable elements identified by the .actionInvitation class which we earlier determined to be important to update the right-hand side.

The configuration for puppeteer is set up such that the window is large enough to emulate the desktop functionality of the web page.

We also have a variable scrapedData that will hold the data that we will write to file.
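The stability check inside waitTillHTMLRendered can be easier to follow when pulled out of the browser. The sketch below is a hypothetical pure-function version of the same counting logic: checksUntilStable is an illustrative name, and the sizes array stands in for successive page.content().length readings.

```javascript
// Pure-function version of the counting logic used in waitTillHTMLRendered.
// `sizes` stands in for successive HTML length readings (hypothetical input).
const checksUntilStable = (sizes, minStableSizeIterations = 3) => {
  let lastSize = 0;
  let stableCount = 0;

  for (let i = 0; i < sizes.length; i++) {
    const current = sizes[i];

    // A reading counts as stable only if it matches the previous non-zero reading.
    if (lastSize !== 0 && current === lastSize) stableCount++;
    else stableCount = 0; // reset the counter

    // Return how many checks were needed before the page looked settled.
    if (stableCount >= minStableSizeIterations) return i + 1;

    lastSize = current;
  }

  return -1; // never stabilised within the readings provided
};

console.log(checksUntilStable([100, 250, 400, 400, 400, 400])); // 6
console.log(checksUntilStable([100, 250, 400])); // -1
```

With the default of three stable iterations, the page has to report the same size on four consecutive checks before the helper stops waiting.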

Scraping the data

After we arrive at a point where we have all .actionInvitation elements stored in the variable elHandleArray, we can iterate through each button.

What we want to do on each iteration is the following:

  1. Click the button.
  2. Take a screenshot (this was more for me to debug and see the state of each iteration - totally optional but I will commit all the images if you want a reference point on the commit).
  3. Grab the URL (without the base that we use) and the HTTP method, and store them in our data struct under url and method.
  4. Grab the request example and store it to our data struct under request if they exist.
  5. Grab the parameters and store them to our data struct under parameters if they exist.
  6. Grab the responses and store them to our data struct under responses if they exist.
  7. Push all these interim data structs to our scrapedData object.
  8. Manually increment the index. You may opt to use another form of iteration, but I normally just opt for whatever works first without strange race conditions with the async-await syntax. Performance is not an issue here.
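For the last step, one alternative to a manual counter is Array.prototype.entries(), which still plays nicely with await inside a for...of loop. This is only a sketch: fakeHandles and clickAll are hypothetical stand-ins so the example runs without a browser.

```javascript
// Stand-ins for Puppeteer element handles (hypothetical); each click resolves asynchronously.
const fakeHandles = ["a", "b", "c"].map((name) => ({
  click: async () => name,
}));

const clickAll = async (handles) => {
  const log = [];
  // entries() yields [index, value] pairs, so no manual counter is needed.
  // Awaiting inside for...of still runs the clicks one at a time, in order.
  for (const [index, el] of handles.entries()) {
    const clicked = await el.click();
    log.push(`${index}:${clicked}`);
  }
  return log;
};

clickAll(fakeHandles).then((log) => console.log(log));
```

The sequential behaviour matters here because each click changes what the right-hand side of the page shows.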

Step (2) was used by myself to make sure that the right-hand side was updating as expected. I will add the first, second and last iteration images to show you what was being automated in terms of Puppeteer clicking our buttons and the right-hand side updating:

First iteration

Second iteration

Final iteration

The code described above looks like the following:

    let index = 0;
    for (const el of elHandleArray) {
      const tempEndpointObj = {};
      console.log("@ STARTING INDEX:", index);
      await el.click();
      await page.screenshot({ path: `imgs/debugging-${index}.png` });

      // Note: this is a hack to avoid issue where new nodes are constantly added to the DOM
      // while older ones only have the `.hidden` class added.
      const machineColumnContent = await page.waitForSelector(
        ".row.machineColumnContent:last-of-type"
      );

      const urlEl = await machineColumnContent.$(".uriTemplate");
      const value = await urlEl.evaluate((el) => el.textContent);
      console.log("@ PATH", value);
      tempEndpointObj.url = value;

      const methodEl = await machineColumnContent.$(".destinationMethod");
      const methodValue = await methodEl.evaluate((el) => el.textContent);
      console.log("@ METHOD", methodValue);
      tempEndpointObj.method = methodValue;

      // Get Required Body values
      const requestExample = await machineColumnContent.$(".request");
      if (requestExample) {
        const requestBody = await requestExample.$(".rawExampleBody");
        const requestBodyValue = await requestBody?.evaluate((el) =>
          el.textContent.trim()
        );
        console.log("@ REQ", requestBodyValue);
        tempEndpointObj.request = {
          requestExampleValue: requestBodyValue
            ? JSON.parse(requestBodyValue)
            : null,
        };
      } else {
        tempEndpointObj.request = {
          requestExampleValue: {},
        };
      }

      // Get Parameters
      const paramsListArr = await machineColumnContent.$$(".parameterRow");
      const finalParameterArr = [];

      if (paramsListArr && paramsListArr.length) {
        for (const param of paramsListArr) {
          const paramKey = await param.$(".parameterKey");
          const paramKeyValue = await paramKey?.evaluate((el) =>
            el.textContent?.trim()
          );

          const paramRequirement = await param.$(".parameterRequirement");
          const paramRequirementValue = await paramRequirement?.evaluate((el) =>
            el.textContent?.trim()
          );

          const paramDescription = await param.$(".parameterDescription");
          const paramDescriptionValue = await paramDescription?.evaluate((el) =>
            el.textContent?.trim()
          );

          console.log(
            "@ PARAM",
            paramKeyValue,
            paramRequirementValue,
            paramDescriptionValue
          );

          finalParameterArr.push({
            paramKeyValue,
            paramRequirementValue,
            paramDescriptionValue,
          });
        }
      }

      tempEndpointObj.parameters = finalParameterArr;

      // Get Responses
      const responseListArr = await machineColumnContent.$$(
        ".machineColumnResponse"
      );
      const finalResponseArr = [];

      if (responseListArr && responseListArr.length) {
        for (const response of responseListArr) {
          const responseStatus = await response.$(".responseStatusCode");
          const responseStatusValue = await responseStatus?.evaluate((el) =>
            el.textContent?.trim()
          );

          const responseExample = await response.$(".rawExampleBody");
          const responseExampleValue = await responseExample?.evaluate((el) =>
            el.textContent?.trim()
          );

          console.log("@ RESPONSE", responseStatusValue, responseExampleValue);

          finalResponseArr.push({
            responseStatusValue,
            responseExampleValue: responseExampleValue
              ? JSON.parse(responseExampleValue)
              : null,
          });
        }
      }

      tempEndpointObj.responses = finalResponseArr;

      // push temp endpoint object to final object
      scrapedData.endpoints.push(tempEndpointObj);
      index++;
    }

The best way to grok the above is to just read through it line-by-line.
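If the repetition in the loop gets in the way while reading, the "find a child element, then read its trimmed text" pattern could be collapsed into one helper. This is a sketch only: textOf is a hypothetical helper that is not in the original script, and fakeElement and fakeRow are stand-ins that mimic just the two handle methods the loop relies on ($ and evaluate) so the example runs without a browser.

```javascript
// Hypothetical helper: find `selector` under `parent` and return its trimmed text,
// or undefined when the element is missing.
const textOf = async (parent, selector) => {
  const handle = await parent.$(selector);
  return handle?.evaluate((el) => el.textContent?.trim());
};

// Minimal stand-ins for element handles, for demonstration only.
const fakeElement = (text) => ({
  evaluate: async (fn) => fn({ textContent: text }),
});

const fakeRow = {
  $: async (selector) =>
    selector === ".parameterKey" ? fakeElement("  assetId ") : null,
};

const demo = async () => [
  await textOf(fakeRow, ".parameterKey"),
  await textOf(fakeRow, ".parameterDescription"),
];

demo().then((values) => console.log(values));
```

With a helper like this, each of the parameter and response lookups in the loop would shrink to a single line.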

When you run the script, we have a scrapedData object that looks like the following:

    {
      "endpoints": [
        {
          "url": "/auth/refresh/",
          "method": "POST",
          "request": {
            "requestExampleValue": {
              "apiKey": "7r4hTa2Yb..."
            }
          },
          "parameters": [],
          "responses": [
            {
              "responseStatusValue": "200",
              "responseExampleValue": {
                "accessToken": "eyJhbGciOiJSUzI1N...",
                "scope": "app.account.read ..."
              }
            },
            {
              "responseStatusValue": "500",
              "responseExampleValue": {
                "error": {
                  "error": "StillLoading",
                  "message": "Please try again or contact support."
                }
              }
            }
          ]
        },
        {
          "url": "/auth/logout/",
          "method": "GET",
          "request": {
            "requestExampleValue": null
          },
          "parameters": [],
          "responses": [
            {
              "responseStatusValue": "200",
              "responseExampleValue": {
                "success": "true"
              }
            },
            {
              "responseStatusValue": "500",
              "responseExampleValue": {
                "error": {
                  "error": "StillLoading",
                  "message": "Please try again or contact support."
                }
              }
            }
          ]
        }
        // ... other entries omitted
      ]
    }

According to my code, we managed to scrape 74 documented endpoints. This may not be perfect and there will surely be edge cases, but this is already a massive time saver when compared to manually writing out all 74 endpoints with their data from the website.
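One edge case worth guarding against: the loop calls JSON.parse directly on each example body, so a single body that is not valid JSON would throw and end the run in the catch block. A minimal sketch of a more forgiving parser is below. The parseExample name and the { unparsed } fallback shape are my own assumptions, not part of the original script.

```javascript
// Hypothetical helper: parse an example body, keeping the raw text
// when the scraped string is not valid JSON.
const parseExample = (raw) => {
  if (!raw) return null; // missing or empty example body
  try {
    return JSON.parse(raw);
  } catch (err) {
    // Keep the original text so nothing scraped is lost.
    return { unparsed: raw };
  }
};

console.log(parseExample('{"success": "true"}'));
console.log(parseExample("not json"));
console.log(parseExample(undefined));
```

Swapping this in for the bare JSON.parse calls would let the scrape finish and leave the awkward entries to be cleaned up by hand later.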

Saving the data to a file

This part is the easiest. We can use the built-in fs module to save the data to a file.

    const fs = require("fs");

    async function main() {
      // ... after the iterations

      // write out the scraped data to file
      fs.writeFileSync(
        "./data.json",
        JSON.stringify(scrapedData, null, 2),
        "utf-8"
      );
    }

The above code has a lot omitted, so reference the source code (or the code in the next section) to see where in the script it belongs.

The code writes our data structure out to the data.json file at the root of our project.
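To show the round trip, here is a small sketch that writes a cut-down object in the same shape and reads it back. It deliberately writes to a temporary folder rather than ./data.json, and the two entries are trimmed-down versions of the examples from earlier.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Write to a temp directory so the sketch does not touch the real data.json.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "swyftx-"));
const file = path.join(dir, "data.json");

// Cut-down stand-in for the scraped data (illustration only).
const scrapedData = {
  endpoints: [
    { url: "/auth/refresh/", method: "POST" },
    { url: "/auth/logout/", method: "GET" },
  ],
};

fs.writeFileSync(file, JSON.stringify(scrapedData, null, 2), "utf-8");

// Later on, the file can be loaded without re-scraping the website.
const loaded = JSON.parse(fs.readFileSync(file, "utf-8"));
const summary = loaded.endpoints.map((e) => `${e.method} ${e.url}`);

console.log(summary);
```

The two-space indentation passed to JSON.stringify keeps the file readable enough to edit by hand when an entry needs fixing.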

Final code

As previously mentioned, the code itself is a one-and-done script that would require more TLC to be up to my standards.

That being said, here is the final script:

    const puppeteer = require("puppeteer");
    const fs = require("fs");

    const url = `https://jsapi.apiary.io/apis/swyftx.html`;

    /**
     * @see https://stackoverflow.com/questions/52497252/puppeteer-wait-until-page-is-completely-loaded/52501934
     */
    const waitTillHTMLRendered = async (page, timeout = 30000) => {
      const checkDurationMsecs = 1000;
      const maxChecks = timeout / checkDurationMsecs;
      let lastHTMLSize = 0;
      let checkCounts = 1;
      let countStableSizeIterations = 0;
      const minStableSizeIterations = 3;

      while (checkCounts++ <= maxChecks) {
        let html = await page.content();
        let currentHTMLSize = html.length;

        let bodyHTMLSize = await page.evaluate(
          () => document.body.innerHTML.length
        );

        console.log(
          "last: ",
          lastHTMLSize,
          " <> curr: ",
          currentHTMLSize,
          " body html size: ",
          bodyHTMLSize
        );

        if (lastHTMLSize != 0 && currentHTMLSize == lastHTMLSize)
          countStableSizeIterations++;
        else countStableSizeIterations = 0; //reset the counter

        if (countStableSizeIterations >= minStableSizeIterations) {
          console.log("Page rendered fully...");
          break;
        }

        lastHTMLSize = currentHTMLSize;
        // Plain timer rather than page.waitFor, which newer Puppeteer releases no longer provide
        await new Promise((resolve) => setTimeout(resolve, checkDurationMsecs));
      }
    };

    const scrapedData = {
      endpoints: [],
    };

    const main = async () => {
      let browser;
      try {
        browser = await puppeteer.launch({
          headless: true,
          args: [`--window-size=1920,1080`],
          defaultViewport: {
            width: 1920,
            height: 1080,
          },
        });

        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "load" });
        await waitTillHTMLRendered(page);

        const elHandleArray = await page.$$(".actionInvitation");

        // Make sure the folder for the debugging screenshots exists
        fs.mkdirSync("imgs", { recursive: true });

        let index = 0;
        for (const el of elHandleArray) {
          const tempEndpointObj = {};
          console.log("@ STARTING INDEX:", index);
          await el.click();
          await page.screenshot({ path: `imgs/debugging-${index}.png` });

          // Note: this is a hack to avoid issue where new nodes are constantly added to the DOM
          // while older ones only have the `.hidden` class added.
          const machineColumnContent = await page.waitForSelector(
            ".row.machineColumnContent:last-of-type"
          );

          // Grab the API path for this endpoint
          const urlEl = await machineColumnContent.$(".uriTemplate");
          const value = await urlEl.evaluate((el) => el.textContent);
          console.log("@ PATH", value);
          tempEndpointObj.url = value;

          // Grab the endpoint REST method
          const methodEl = await machineColumnContent.$(".destinationMethod");
          const methodValue = await methodEl.evaluate((el) => el.textContent);
          console.log("@ METHOD", methodValue);
          tempEndpointObj.method = methodValue;

          // Get Required Body values
          const requestExample = await machineColumnContent.$(".request");
          if (requestExample) {
            const requestBody = await requestExample.$(".rawExampleBody");
            const requestBodyValue = await requestBody?.evaluate((el) =>
              el.textContent.trim()
            );
            console.log("@ REQ", requestBodyValue);
            tempEndpointObj.request = {
              requestExampleValue: requestBodyValue
                ? JSON.parse(requestBodyValue)
                : null,
            };
          } else {
            tempEndpointObj.request = {
              requestExampleValue: {},
            };
          }

          // Get Parameters
          const paramsListArr = await machineColumnContent.$$(".parameterRow");
          const finalParameterArr = [];

          if (paramsListArr && paramsListArr.length) {
            for (const param of paramsListArr) {
              const paramKey = await param.$(".parameterKey");
              const paramKeyValue = await paramKey?.evaluate((el) =>
                el.textContent?.trim()
              );

              const paramRequirement = await param.$(".parameterRequirement");
              const paramRequirementValue = await paramRequirement?.evaluate(
                (el) => el.textContent?.trim()
              );

              const paramDescription = await param.$(".parameterDescription");
              const paramDescriptionValue = await paramDescription?.evaluate(
                (el) => el.textContent?.trim()
              );

              console.log(
                "@ PARAM",
                paramKeyValue,
                paramRequirementValue,
                paramDescriptionValue
              );

              finalParameterArr.push({
                paramKeyValue,
                paramRequirementValue,
                paramDescriptionValue,
              });
            }
          }

          tempEndpointObj.parameters = finalParameterArr;

          // Get Responses
          const responseListArr = await machineColumnContent.$$(
            ".machineColumnResponse"
          );
          const finalResponseArr = [];

          if (responseListArr && responseListArr.length) {
            for (const response of responseListArr) {
              const responseStatus = await response.$(".responseStatusCode");
              const responseStatusValue = await responseStatus?.evaluate((el) =>
                el.textContent?.trim()
              );

              const responseExample = await response.$(".rawExampleBody");
              const responseExampleValue = await responseExample?.evaluate(
                (el) => el.textContent?.trim()
              );

              console.log(
                "@ RESPONSE",
                responseStatusValue,
                responseExampleValue
              );

              finalResponseArr.push({
                responseStatusValue,
                responseExampleValue: responseExampleValue
                  ? JSON.parse(responseExampleValue)
                  : null,
              });
            }
          }

          tempEndpointObj.responses = finalResponseArr;

          // push temp endpoint object to final object
          scrapedData.endpoints.push(tempEndpointObj);
          index++;
        }

        // write out the scraped data to file
        fs.writeFileSync(
          "./data.json",
          JSON.stringify(scrapedData, null, 2),
          "utf-8"
        );
      } catch (err) {
        console.error(err);
      } finally {
        // browser is undefined if the launch itself failed
        await browser?.close();
      }
    };

    main();

Summary

Today's post is about how to scrape data for the Swyftx API from Apiary using Puppeteer.

The end-goal was to take data from an AJAX-loaded page and write it to a file in a data structure that we can reference and re-write without continually needing to re-scrape.

As we move forward in the series and the project, we will potentially be re-writing and re-scraping sections of the website to handle edge cases that we run into. We don't require the Node.js wrapper we are creating to be perfect on the first iteration once we generate it; we just want a solid foundation to do a lot of the heavy lifting.

For what it is worth, this project itself took about an hour of work (took more to write this dang post), but will be super useful for cutting down time.

Stay tuned for the next post that will take our data structure and (hopefully) convert it into a valid OpenAPI v3.1 schema file before generating our TypeScript API.

Resources and further reading

Photo credit: joelfilip

Dennis O'Keeffe

@dennisokeeffe92
  • Melbourne, Australia

Hi, I am a professional Software Engineer. Formerly of Culture Amp, UsabilityHub, Present Company and NightGuru.
I am currently working on workingoutloud.dev, Den Dribbles and LandPad.
