Note 1: After writing the majority of this article, I realized that it has a lot in common with this excellent article about the OAuth 2.0 standard: https://www.nango.dev/blog/why-is-oauth-still-hard.

Note 2: I wrote some of the pipelines described below years ago. The data providers might have improved somewhat in the meantime; it's too much work to rewrite the pipelines, though.

In my day job I often consider myself a data plumber rather than a data engineer, because the work is more about plugging pipes together than solving complicated engineering problems. The sad reality is that data work is a lot of toil and a lot of workarounds that mitigate design decisions other people have made and that you have no influence over.

So consider the following a rant and a warning to aspiring data engineers: a lot of the things you will have to deal with are infuriating, easy to avoid, and you have a snowball’s chance in hell of fixing them, as they are part of the interface you use to interact with 3rd parties and other people rely on these interfaces (see also Hyrum’s law). I hope this article might inspire someone to design a better system, though. It is also part instruction, or at least encouragement, for when you think to yourself: “It can’t be that stupid/convoluted/counterintuitive!” My answer to that is: “Yes, yes, it can, it will probably get worse, you are not wrong!” - a small consolation for everyone who has to deal with similar challenges.

I examined 34 APIs before mental fatigue set in, but overall I manage about 70 such interfaces myself and another 30 via a 3rd-party ingestion tool like Fivetran (which has its own set of problems). Below is a heavily condensed version of how I get the data out of each of those thirty-odd APIs.

The Simple and Easy Ones

Bundle up your single request with some credentials in any way, shape or form. Return data as JSON. That’s all you have to do to land yourself in this category of the most formidable specimens of APIs. Your developer loves you, congratulations! Tokens and secrets in this tier are normally fixed; they are provided through a user interface or simply communicated by the account manager on the data provider side. A minimal sketch of what these pipelines boil down to follows the list.

  1. API token in URL as parameter, any configuration is already included in the endpoint

  2. Bearer token in GET header

  3. Bearer token in GET header - same data provider as above, but now with header “accept-encoding” set to “gzip”, which is a trivial thing to implement but I had to peruse the documentation in great detail to find this info.

  4. API token in URL, some dimensions in URL, metrics fixed, response is JSON

  5. Username/password in GET header (with various settings for content-type etc.)

  6. Bearer token in GET request - data is split into multiple subaccounts with different credentials

  7. API key and secret as basic auth, report configuration as URL parameter in GET requests

  8. Bearer token in header, GET request

  9. Token in URL of GET request, two sub-accounts

  10. api_key and id in URL parameters, returns “;”-delimited CSV text, fetch no more than 31 days at once
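
The sketch: the endpoint, parameter names and token below are made up, and each provider in this tier merely picks a slightly different spot for the credentials, as the list above shows.

```python
import requests

# Hypothetical endpoint and token -- providers in this tier differ mainly
# in where exactly the credentials go.
BASE_URL = "https://api.example-provider.com/v1/report"
API_TOKEN = "fixed-token-from-the-account-manager"

# Variant A: token as a URL parameter (e.g. items 1 and 4)
resp = requests.get(BASE_URL, params={"api_token": API_TOKEN, "date": "2023-01-01"})
resp.raise_for_status()
data = resp.json()

# Variant B: bearer token in the header, gzip requested explicitly (e.g. items 2 and 3)
resp = requests.get(
    BASE_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}", "Accept-Encoding": "gzip"},
    params={"date": "2023-01-01"},
)
resp.raise_for_status()
data = resp.json()
```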

The Unnecessary Step Dancers

These are not really a challenge either, but I consider them an extra tier because they try to add security by layering an extra step between request and response. Dynamic tokens do have some cryptographic advantages over username and password and can be validated without storing the username and password server-side. Since the endpoints to get the token are publicly available and in close proximity to the data endpoints, I consider this layer somewhat superfluous as long as you practice good discipline with the security of your passwords. A minimal sketch of this two-step flow follows the list below.

They are like the developer’s second favourite child.

  1. Username & Password in POST body - security token is returned - POST token in header, request configuration in body - data is returned as JSON

  2. username, account_id, secret_key: get a session with username and password, use session, account_id and secret_key in a POST request to get an XML report

  3. username, password in POST header to retrieve token, POST token in header of request to get the data

  4. POST client_id and client_secret as payload to the token endpoint, send (GET) token as header to data endpoint

  5. POST email/password to a login endpoint, receive token, POST token in header to report endpoint and configuration in payload

  6. Username and password in POST request to get a token, token in header of GET request to get the actual report, minor inconvenience: returns dates as YYYYMMDD while request parameters for dates have format YYYY-MM-DD

  7. POST id, username, password, secret to token endpoint as payload, POST config with bearer token as header to reporting endpoint
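
The sketch, with made-up endpoints and field names; the entries above differ mainly in whether the credentials go into the header or the payload, and whether the data request is a GET or a POST.

```python
import requests

AUTH_URL = "https://api.example-provider.com/auth/token"   # hypothetical
REPORT_URL = "https://api.example-provider.com/v1/report"  # hypothetical

# Step 1: trade username/password for a short-lived token
auth_resp = requests.post(AUTH_URL, json={"username": "...", "password": "..."})
auth_resp.raise_for_status()
token = auth_resp.json()["access_token"]  # the field name varies per provider

# Step 2: use the token for the actual data request
report_resp = requests.get(
    REPORT_URL,
    headers={"Authorization": f"Bearer {token}"},
    params={"start_date": "2023-01-01", "end_date": "2023-01-31"},
)
report_resp.raise_for_status()
data = report_resp.json()
```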

The Prod with a Stick Clique

The designers of the next two groups have decided that they do not want to return data immediately in response to an incoming request, but instead react to an incoming request by starting a detached reporting process. This can then take a few minutes to complete, during which you have to check repeatedly whether the report is ready.
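
In code, the pattern common to both groups looks roughly like this sketch; the endpoints, status values and field names are invented, and every provider uses its own vocabulary for "pending" and "done".

```python
import time
import requests

TRIGGER_URL = "https://api.example-provider.com/v1/reports"       # hypothetical
STATUS_URL = "https://api.example-provider.com/v1/reports/{id}"   # hypothetical
HEADERS = {"Authorization": "Bearer ..."}

# Kick off the detached reporting process
job = requests.post(TRIGGER_URL, headers=HEADERS, json={"date": "2023-01-01"})
job.raise_for_status()
report_id = job.json()["report_id"]

# Poll until the provider declares the report ready
for _ in range(60):  # give up after roughly 30 minutes
    status = requests.get(STATUS_URL.format(id=report_id), headers=HEADERS)
    status.raise_for_status()
    body = status.json()
    if body["status"] == "done":
        break
    time.sleep(30)
else:
    raise TimeoutError(f"Report {report_id} never finished")

# Fetch the finished report
report = requests.get(body["download_url"], headers=HEADERS)
report.raise_for_status()
```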

The Static Report Generation Nation

This subset requires you to configure a report either in a GUI or via a specialised API endpoint. The report gets assigned an ID which is sometimes permanent, i.e. it stays the same when you change the configuration of the report, and sometimes changes with every change in configuration. In the latter case, the new ID has to be passed on to the pipelines so that they fetch the up-to-date report and not an earlier version. When fetching the data, you usually don’t have to care about the report’s configuration anymore; the ID of the report and the date of the data are enough to fetch what you need.

Sometimes, you also have to schedule the exact time for the daily update of these predetermined reports and make sure you fetch the data after this point in time. (cough Daylight Saving Time cough)

  1. Requires you to create a report in the GUI, pass the ID to the API, and update the key and key_id every 3 months, which is not possible programmatically. The report covers 7 days; since the data keeps changing for a while, the date of the report is one day after the 7-day timeframe. The data provider says that all data is probably complete after 3 days, but they are still figuring it out.

  2. Create the report configuration in the GUI or via the API (one-time), then send username/password in a POST header, receive a token that expires in 10 minutes, send the token in a GET header to get the file list for all reports, filter the file list for the ones matching the pre-configured reports and the dates for which you need the data, and fetch the zipped CSV file from the filtered location (a sketch of this filtering follows the list). Fetching historical data is difficult, since the files are not created retroactively by default; it requires generating a separate report with the exact same configuration, but for a different timeframe.

  3. Create a report template in the GUI and take note of its ID. Then obtain a JWT access token, which requires a manual signature process based on client_id, client_secret, realm, base_url, scope, and the SHA-256 hash algorithm: POST everything to an endpoint to receive the token, then POST the token in the header to the report endpoint to create a report from the template, parametrized with dates. Finally, repeatedly check the endpoint whether the report has completed, and fetch the finished report from the same endpoint.

  4. Use basic auth with username and password, POST the config id in the payload, check the status endpoint whether report generation has finished, and GET the report from a different endpoint (indicated in the successful response of the status endpoint) with basic auth when finished. It returns a plain-text CSV file with “;” as separator. Not a big deal, but another infraction against convention that takes a little while to figure out.

  5. Send account name, api_key and token in the GET request header, get the ids of all your marketing campaigns from a first endpoint, then asynchronously create a report for each campaign and check for completion before the CSV is created. This generates hundreds to thousands of requests per day, totaling about an hour of pipeline runtime, just for this one data provider.
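
As a sketch of the file-list filtering from item 2 — the file names, fields and date format here are assumptions, not the provider's actual schema:

```python
from datetime import date

# Hypothetical response of the provider's "list files" endpoint
file_list = [
    {"name": "daily_sales_2023-01-01.csv.zip", "url": "https://files.example.com/1"},
    {"name": "daily_sales_2023-01-02.csv.zip", "url": "https://files.example.com/2"},
    {"name": "weekly_summary_2023-01-02.csv.zip", "url": "https://files.example.com/3"},
]

report_prefix = "daily_sales"               # matches the pre-configured report
wanted_date = date(2023, 1, 2).isoformat()  # the data date you actually need

matches = [
    f for f in file_list
    if f["name"].startswith(report_prefix) and wanted_date in f["name"]
]
# -> the zipped CSV behind matches[0]["url"] is what gets fetched
```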

The Dynamic Report Download Road

Here, you have to send a request that contains a report configuration to an endpoint every single time (i.e. no preconfigured report templates), wait until the report is generated and then fetch the finished report from a different endpoint.

  1. ClientId & Client Secret in POST payload (but urlencoded) to get a token, then POST the token in the header and the request configuration in the body (but as JSON). Cue repeatedly checking the status of the report and fetching the data when ready. Then you count the rows in the JSON of the finished report and redo the whole process if this row count exceeds the maximum rows per response set by the data provider - but this time with paging parameters (see the sketch after this list). You have no way to figure this out before the complete report is generated.

  2. Send bearer token and api_key in the header of a POST request with the configuration of the report as payload. Then POST to the report status check endpoint, and read the CSV from the very same endpoint when ready

  3. Username, password in the payload of a POST request to receive a token. Then the token is POSTed in the header to a “reports” endpoint; a GET request to another endpoint with the access token in the header gets the names of the entries from the first request; dates are timezone-aware

  4. Base64-encoded basic authentication in POST headers, two subaccounts with different credentials, then receive a bearer token and send it in the POST header, wait for report generation, fetch the zip-compressed CSV from the download URL
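
A sketch of the row-count dance from item 1; the row limit, parameter names and endpoint are made up, and the repeated status polling between requests is omitted for brevity:

```python
import requests

RESULT_URL = "https://api.example-provider.com/v1/reports/{id}/result"  # hypothetical
HEADERS = {"Authorization": "Bearer ..."}
MAX_ROWS = 100_000  # assumed ceiling on rows per response

resp = requests.get(RESULT_URL.format(id="abc123"), headers=HEADERS)
resp.raise_for_status()
rows = resp.json()["rows"]

if len(rows) >= MAX_ROWS:
    # The report hit the ceiling: throw it away and request it again,
    # this time page by page (each page means another round of generation + polling).
    rows, page = [], 1
    while True:
        paged = requests.get(
            RESULT_URL.format(id="abc123"),
            headers=HEADERS,
            params={"page": page, "page_size": MAX_ROWS},
        )
        paged.raise_for_status()
        chunk = paged.json()["rows"]
        if not chunk:
            break
        rows.extend(chunk)
        page += 1
```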

The Security Theater Troupe

Luckily, it is only a one-man show.

  1. Submit username, password and api_key. The password is shared among all API users in the same organization, though. I am under the impression that the provider of the API wanted to implement usage monitoring per client, but somehow their backend doesn’t properly support it. It seems they just switched password and api_key around, creating a unique identifier for each user, while the original api_key for the whole organization is now the same “password” for everyone. Besides that, the actual requests are simple: api_key in the header, username and password as basic auth (sketched below). The data is a mess, though: it’s e-commerce transaction data, but you can only retrieve the current state of a transaction, not its order history.
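
For completeness, the request itself really is that plain; the endpoint, header name and credentials here are placeholders:

```python
import requests

resp = requests.get(
    "https://api.example-provider.com/v1/transactions",  # hypothetical endpoint
    headers={"X-Api-Key": "unique-per-user-key"},         # effectively the user identifier
    auth=("username", "organization-wide-password"),      # the shared "password", see above
    params={"date": "2023-01-01"},
)
resp.raise_for_status()
transactions = resp.json()  # current state only -- no order history
```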

The Dirty Data Doozies

While there is nothing particularly difficult about the way the data is fetched, the format of the data itself poses a problem here:

  1. Sending a bearer token in header returns a very, very large amount of CSV text. Handling large amounts of pure CSV is a great way to spend a week. This adds another layer of complexity, as the cloud machines that work on that data must be powerful enough to handle this amount of data pretty much in memory. If you are working in an international context like I do, content with “exotic” encodings and a few emojis added in for good measure might also throw you off: any library you use to parse the content and the data sinks must be able to handle the encoding of the text, any tidying up of the characters must be done on the fly. Have fun debugging in memory on a remote machine if there is a problematic entry somewhere around line 50,000 in your data.

  2. Accountname, api_key and token are sent in a GET header - the API responds with JSON, but in a variable structure and with missing entries. This requires you to make an educated guess about what a complete JSON object might look like, i.e. which attribute-value pairs might occur and how they are nested. Then you have to fill in any missing values within that structure and hope that you didn’t miss a possible variant of the JSON object’s structure (spoiler: you did) before you can squeeze the data into a table in your data warehouse (see the sketch after this list).

  3. Send the token as a parameter in the URL; the response is structured differently depending on whether it is paged, i.e. exceeds the threshold for maximum rows per response, or not. When the response is paged, though, the last page of the set has - obviously - the same structure as an unpaged response, as there are no following pages left. As a result, every response has to be parsed twice: once to determine the structure of the response (paged/unpaged) and a second time to actually parse the results according to their structure.
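
A sketch of the guessing game from item 2. The "complete" structure below is exactly the kind of educated guess described above, not a documented schema:

```python
# Educated guess at what a complete record looks like; anything the provider
# omits gets filled with a default before loading into the warehouse.
TEMPLATE = {
    "campaign": {"id": None, "name": None},
    "clicks": 0,
    "cost": 0.0,
    "currency": None,
}

def normalize(record: dict, template: dict = TEMPLATE) -> dict:
    """Fill in missing keys (recursively) so every row ends up with the same shape."""
    out = {}
    for key, default in template.items():
        value = record.get(key, default)
        if isinstance(default, dict):
            out[key] = normalize(value or {}, default)
        else:
            out[key] = default if value is None else value
    return out

api_response = [  # two of the many shapes the API actually returns
    {"campaign": {"id": 1, "name": "brand"}, "clicks": 10, "cost": 1.5, "currency": "EUR"},
    {"campaign": {"id": 2}, "clicks": 3},  # name, cost and currency missing
]
rows = [normalize(r) for r in api_response]
```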

The Extra Special

Unique in their madness. Indescribable. One-of-a-terrible-kind.

  1. key_id, issuer_id, private key, and a Python library with version golfing, i.e. finding the right combination of upstream libraries that works with this library and still supports the popular framework I use to run the pipeline. It still fails occasionally because reasons, and requires regular active opt-in in the GUI.

  2. Authentication with GCP service account credentials, creating a list of possible file matches in a Google Cloud Storage bucket, seeing which ones actually exist (no permission to list the available files), and fetching the data as a zip file. It’s sometimes a few days late, as is all data from this OG data provider.

  3. Use a client library with email, password, domain, consumer_key and consumer_secret, use the Python client to log in with mail and password again (which were already provided to the client object, but whatever), POST the report configuration to the report endpoint and then do the usual waiting for the report to complete.

  4. Username and password as URL parameters, pass the config as URL parameters as well, pass the time info Y, M, D, h - but not m - as separate (!) parameters, read the CSV file from the endpoint, and get dates back in German format but floating-point data in English format (see the sketch after this list).
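
For item 4, a tiny sketch of untangling the mixed locales; the column names and values are invented:

```python
from datetime import datetime

# One row of the hypothetical CSV: German date format, English decimal separator
raw = {"datum": "24.12.2020", "umsatz": "1234.56"}

parsed = {
    "date": datetime.strptime(raw["datum"], "%d.%m.%Y").date(),  # day.month.year
    "revenue": float(raw["umsatz"]),  # '.' as decimal separator despite the German dates
}
print(parsed)  # {'date': datetime.date(2020, 12, 24), 'revenue': 1234.56}
```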

Encore: How to creatively use Time and Date in an API?

Normally, when fetching time series data, one must provide a start and an end date for the requested data to the API. In most cases, the requested data is then returned per date (or, if you need a finer resolution, per hour, but that’s rarely necessary for financial reporting). Input and output formats of dates can vary widely. Some boring data providers take YYYY-MM-DD as the input format for the parameters of an API and return the very same format in their response. Some require YYYY-MM-DD as a request parameter, but then return dates in the format YYYYMMDD.

But this is just basic stuff - that’s not going to throw off a seasoned dev. Let’s try something more fun:

  1. Make the input dates so that they are always interpreted as 00:00 of a day, meaning that you have to specify explicitly that you want to request data until 23:59:59 of the end date, or skip ahead to the day after the end date when requesting your data.

  2. Data is usually not available in real time, so it takes a while - often days - until data providers have run all their processes and can provide their data to us, the consumers. So we have the date on which the data in question occurred and the date on which the report for that date is available. While most providers name their reports and files to download according to the data date, so that requesting the date 2020-12-25 will yield the data for that date, some of them prefer to work with the report date. This means requesting the date 2020-12-25 will yield the report created on that date, containing the data for 2020-12-24. That’s one of the FAANGs, by the way.

  3. Another FAANG delinquent thinks it is fun to let you puzzle together the dates in their response, which contains two columns with strings like “Jan 1, 2023” and “1:11:11 AM PDT”, leaving you to figure out the formatting string and assemble the UTC timestamp.

  4. Relatively easy to handle is the variety of different datetime formats that I have seen in use, such as

    • YYYY-MM-DDhh:mm,
    • YYYY-MM-DDThh:mm:ss,
    • YYYY-MM-DD hh:mm,
    • YYYY-MM-DDThh:mm:ssZ.

    Still annoying, though, especially since you can’t be sure whether they denote naive timestamps or Zulu/UTC/GMT time (except for the last variant, obviously) without consulting the documentation - and sometimes the data provider didn’t even bother to give it a thought and document it.
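
A sketch of the defensive parsing this forces on you; the format list mirrors the bullet points above, and treating the naive variants as UTC is an assumption you still have to verify with each provider:

```python
from datetime import datetime, timezone

# The datetime formats listed above
FORMATS = [
    "%Y-%m-%d%H:%M",        # YYYY-MM-DDhh:mm
    "%Y-%m-%dT%H:%M:%S",    # YYYY-MM-DDThh:mm:ss
    "%Y-%m-%d %H:%M",       # YYYY-MM-DD hh:mm
    "%Y-%m-%dT%H:%M:%SZ",   # YYYY-MM-DDThh:mm:ssZ
]

def parse_timestamp(value: str) -> datetime:
    """Try each known format; assume UTC for the naive variants (verify per provider!)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"Unknown datetime format: {value!r}")

print(parse_timestamp("2023-01-01T12:30:00Z"))  # 2023-01-01 12:30:00+00:00
```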