how to detect file size / type while mid-download using axios or other requestor?

Ok so this isn't as easy to solve as one might expect. Ideally the HTTP headers `Content-Length` and `Content-Type` would exist so the user knows what to expect, but these aren't required headers, and even when present they can be inaccurate or outright wrong.
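For completeness, here's a sketch of reading those headers off an axios response when they do happen to be present (`inspectHeaders` is just an illustrative helper; header values arrive as strings and either header may be missing):

```javascript
// Sketch: pull size/type hints from response headers. axios lower-cases
// header names; treat both values as untrusted hints, not ground truth.
const inspectHeaders = (res) => ({
  size: res.headers['content-length'] ? Number(res.headers['content-length']) : null,
  type: res.headers['content-type'] ?? null,
})
```

Since servers can omit or misreport these, the stream-based check below is still needed as the source of truth.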

The solution I've found for this problem, which looks to be very reliable, involves two things:

  1. Making the request as a Stream
  2. Reading the file signature that many file formats carry in their first few bytes; these are commonly known as Magic Numbers/Bytes.
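To make the second point concrete, here's a minimal sketch of magic-byte detection (`sniffFileType` and the signature table are my own illustrative names; the byte sequences are the well-known signatures for these formats):

```javascript
// A few well-known file signatures (magic bytes) and a helper that matches
// them against the start of a chunk/Buffer.
const SIGNATURES = [
  { type: 'png',  bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: 'pdf',  bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: 'gif',  bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  { type: 'jpeg', bytes: [0xff, 0xd8, 0xff] },
  { type: 'zip',  bytes: [0x50, 0x4b, 0x03, 0x04] },
]

// Returns the matching type name, or 'unknown' if no signature matches
const sniffFileType = (chunk) =>
  SIGNATURES.find(({ bytes }) => bytes.every((b, i) => chunk[i] === b))?.type ?? 'unknown'
```

You'd feed this the first chunk of the stream; note that HTML has no magic number, which is why my snippet below sniffs for the text `html` instead.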

A great way to combine these two things is to stream the response and read the first bytes to check for the file signature. Once you know whether the file is a format you support/want, you can either process it as you normally would, or cancel the request before reading the next chunk of the stream, which prevents overloading your system. The same loop also lets you measure the file size more accurately, which I show in the following snippet.

Here's how I implemented the solution mentioned above:

const axios = require('axios')

// sizeLimit defaults to ~1 MB here; the original snippet assumed it was defined elsewhere
const getHtml = async (url, { timeout = 10000, sizeLimit = 1000000, ...opts } = {}) => {
  const source = axios.CancelToken.source()
  const timeoutId = setTimeout(() => source.cancel('Request cancelled due to timeout'), timeout)
  try {
    const res = await axios.get(url, {
      headers: {
        connection: 'keep-alive',
      },
      cancelToken: source.token,
      // Use stream mode so we can inspect the first chunk before receiving the rest.
      // Chunk size depends on the stream's highWaterMark (16 kB by default in Node) and the network.
      responseType: 'stream',
      ...opts,
    })
    const stream = res.data
    let firstChunk = true
    let size = 0
    // Not to be confused with ArrayBuffer (the object) ;)
    const bufferArray = []
    // Async-iterator syntax for consuming the stream. Iterating over a stream consumes it fully,
    // but returning from or breaking out of the loop destroys it, cancelling the download
    for await (const chunk of stream) {
      if (firstChunk) {
        firstChunk = false
        // Only check the first 100 relevant (whitespace-excluded) chars of the chunk for "html".
        // This could only misfire on a raw text file that happens to contain the word html at the
        // very top - very unlikely, and even then it wouldn't break anything
        const stringChunk = String(chunk).replace(/\s+/g, '').slice(0, 100).toLowerCase()
        if (!stringChunk.includes('html')) return { error: `Requested URL is detected as a file. URL: ${url}\nChunk's magic 100: ${stringChunk}` }
      }
      size += Buffer.byteLength(chunk)
      if (size > sizeLimit) return { error: `Requested URL is too large.\nURL: ${url}\nSize: ${size}` }
      bufferArray.push(Buffer.from(chunk))
    }
    // After the stream is fully consumed, concatenate the chunks into one big buffer,
    // convert it to a string and return that
    return { html: Buffer.concat(bufferArray).toString() }
  } finally {
    // Clear the timeout on every path: success, early return, and error alike
    clearTimeout(timeoutId)
  }
}