Extend file type (updated) #603

Merged

Changes from 8 commits

Commits (36)

bda3f46  Allow specification of custom detectors + readme update (Jul 14, 2023)
6007eff  Simplify logic in runCustomDetectors (Jul 14, 2023)
c3dba6e  add custom detectors to fileTypeFromStream (Jul 14, 2023)
fab97ae  fix linting issue (Jul 14, 2023)
c7c3190  Execute custom detectors before default ones (Jul 17, 2023)
4bcddff  add tests (Jul 17, 2023)
733bfac  fix docs (Jul 17, 2023)
37e1e57  compatibility with Node.js 14 and 16 (Jul 17, 2023)
bfd18b1  Remove blank space (FredrikSchaefer, Jul 25, 2023)
ee4cb2c  Wrap custom detectors into file type options (Jul 25, 2023)
ad6d44f  Merge branch 'extend-file-type-updated' of github.com:FredrikSchaefer… (Jul 25, 2023)
29930bf  Adjust fileTypeFromFile(...) to recent changes (FredrikSchaefer, Jul 25, 2023)
7ea6efd  Moved custom detectors from function to constructor argument (Oct 17, 2023)
748ffee  fix fileTypeStream (add back fileTypeOptions) (Oct 17, 2023)
2adec69  Update documentation (Oct 17, 2023)
0d1464c  add check for illegal tokenizer position change (Oct 18, 2023)
6b6188c  Update core.d.ts (FredrikSchaefer, Oct 23, 2023)
6806753  Update core.d.ts (FredrikSchaefer, Oct 23, 2023)
61e052e  Update readme.md (move custom detectors section as suggested by revie… (Oct 23, 2023)
eed198d  Remove fileType prefix from class member functions (Oct 24, 2023)
ff84f3e  Make runCustomDetectors private (Oct 24, 2023)
326ccd1  Add class based approach to fileTypeStream (Oct 24, 2023)
011fa53  Change error handling for read operations of custom detectors (Oct 25, 2023)
b346f7c  Remove obsolete @throws from documentation (Oct 25, 2023)
9e24ed9  Make usage of FileTypeParser class consistent (Oct 25, 2023)
a926bf2  Rename stream(...) to toDetectingStream(...) (Oct 25, 2023)
5e2a0fd  Fix error handling (Oct 25, 2023)
f38565d  Suggested changes to simplify code (Borewit, Oct 25, 2023)
e25c294  Merge pull request #2 from sindresorhus/extend-file-type-updated-sugg… (FredrikSchaefer, Oct 25, 2023)
080ac75  Fix TypeScript declaration (Oct 25, 2023)
de706c5  Remove comments from unit tests and redundant empty line (Borewit, Nov 6, 2023)
331502d  Make code examples executable. (Borewit, Nov 6, 2023)
9d85f05  Remove empty comment lines (Borewit, Nov 6, 2023)
ede94d9  Remove unused `fileTypeOptions` parameter from typings (Borewit, Nov 6, 2023)
ca6e449  Adjust number code and comment style suggestions (Borewit, Nov 10, 2023)
a50e37a  Update core.d.ts (sindresorhus, Nov 10, 2023)
50 changes: 45 additions & 5 deletions core.d.ts
@@ -318,6 +318,38 @@ export type ReadableStreamWithFileType = ReadableStream & {
readonly fileType?: FileTypeResult;
};

/**
Function that allows specifying custom detection mechanisms.

An iterable of detectors can be provided as an argument to the file type detection methods.

The detectors are called before the default detections in the provided order.

Custom detectors can be used to add new `FileTypeResult` entries or to modify the return behaviour of existing `FileTypeResult` detections.

If the detector returns `undefined`, the `tokenizer.position` should be 0 (unless it's a stream). That allows other detectors to parse the file.
Collaborator

Not sure if I agree with the "unless it's a stream". Essentially you can iterate to other detectors if you took a bite of the apple. Only peek is allowed; if you read, you have consumed the tokenizer, which is very similar to a stream.

I fear this is an area where we can expect a lot of questions from users.

Contributor Author

Hey, thanks for the comment.

Yeah, I guess you're right that we can expect a lot of questions.

I just mindlessly took this information from this previous discussion.

Let me suggest a more detailed explanation here:

> If the detector returns `undefined`, the `tokenizer.position` should typically be 0. This allows easy parsing by other detectors, unless subsequent custom detectors specify otherwise. Additionally, the detector shouldn't consume the tokenizer; while peeking is non-consuming, reading is.

What do you think of this?

I'm really open to anything here!

Collaborator

See also my other comment: https://github.com/sindresorhus/file-type/pull/603/files#r1356979704

I suggest something like this:

> If the detector returns `no_match`, it is not allowed to read from the tokenizer (the `tokenizer.position` must remain 0); otherwise the following scanners will read from the wrong file offset.

> If the detector returns `undefined`, the scanner is certain the file type cannot be determined, not even by other scanners.

`no_match` represents option 1 explained here.

Contributor Author

I agree with your point that custom detectors should be able to interrupt detection. However, I see two small downsides of the suggested approach:

  1. The wording `no_match` does not really make it clear to me whether it means no match for this detector, or no match at all.

  2. The standard `FileTypeParser` returns `undefined` when no file type could be recognized. Therefore, requiring the custom detectors to return something else is a bit counterintuitive.

I therefore suggest doing it the other way around:

> If the detector returns `undefined`, it is not allowed to read from the tokenizer (the `tokenizer.position` must remain 0); otherwise the following scanners will read from the wrong file offset.

> If the detector returns `file_type_undetectable`, the detector is certain the file type cannot be determined, not even by other scanners. The `FileTypeParser` interrupts the parsing and immediately returns `undefined`.

Contributor Author

Okay, one could argue that `file_type_undetectable` also does not clearly say whether it means the file type is undetectable for this detector or for all detectors, but it still makes it a bit clearer in my opinion.

Collaborator

> I agree with your point that custom detectors should be able to interrupt detection. However, I see two small downsides of the suggested approach:
>
> 1. The wording `no_match` does not really make it clear to me whether it means no match for this detector, or no match at all.
>
> 2. The standard `FileTypeParser` returns `undefined` when no file type could be recognized. Therefore, requiring the custom detectors to return something else is a bit counterintuitive.
>
> I therefore suggest doing it the other way around:
>
> If the detector returns `undefined`, it is not allowed to read from the tokenizer (the `tokenizer.position` must remain 0); otherwise the following scanners will read from the wrong file offset.
>
> If the detector returns `file_type_undetectable`, the detector is certain the file type cannot be determined, not even by other scanners. The `FileTypeParser` interrupts the parsing and immediately returns `undefined`.

Sounds good to me. The second case can also be something like: the detector started reading but for some reason failed to determine the file type. Not ideal, but it can happen. If the detector starts reading, there is no way back.

We could also check the position after each custom scanner. It may not actually be 0; there is also an iterated use case with the ID3 header. The position should remain unchanged.

Contributor Author

Good idea! Just pushed a commit taking care of that check.
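
For illustration, a minimal sketch of the kind of position check discussed above; this is not the code added in the PR, and the function name and error message are placeholders:

```
async function runDetectorsWithPositionCheck(tokenizer, detectors) {
	for (const detector of detectors) {
		const initialPosition = tokenizer.position;
		const fileType = await detector(tokenizer);
		if (fileType) {
			return fileType;
		}

		// A detector that reports no match must not have consumed the tokenizer.
		if (tokenizer.position !== initialPosition) {
			throw new Error('Custom detector changed the tokenizer position without returning a file type');
		}
	}

	return undefined;
}
```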


Example detector array, which can be extended and provided as an argument to each public method:

const customDetectors = [
	async tokenizer => {
		const unicornHeader = [85, 78, 73, 67, 79, 82, 78]; // "UNICORN" as decimal string
		const buffer = Buffer.alloc(7);
		await tokenizer.peekBuffer(buffer, {length: unicornHeader.length, mayBeLess: true});
		if (unicornHeader.every((value, index) => value === buffer[index])) {
			return {ext: 'unicorn', mime: 'application/unicorn'};
		}

		return undefined;
	},
];

@param tokenizer - An [`ITokenizer`](https://github.com/Borewit/strtok3#tokenizer) usable as source of the examined file.
@param fileType - FileTypeResult detected by the standard detections or a previous custom detection. Undefined if no matching fileTypeResult could be found.
@returns supposedly detected file extension and MIME type as a FileTypeResult-like object, or `undefined` when there is no match.
*/
export type Detector = (tokenizer: ITokenizer, fileType?: FileTypeResult) => Promise<FileTypeResult | undefined>;

/**
Detect the file type of a `Buffer`, `Uint8Array`, or `ArrayBuffer`.

@@ -326,19 +358,21 @@ The file type is detected by checking the [magic number](https://en.wikipedia.or
If file access is available, it is recommended to use `.fromFile()` instead.

@param buffer - An Uint8Array or Buffer representing file data. It works best if the buffer contains the entire file, it may work with a smaller portion as well.
@param customDetectors - Optional: An Iterable of Detector functions. They are called in the order provided.
@returns The detected file type and MIME type, or `undefined` when there is no match.
*/
export function fileTypeFromBuffer(buffer: Uint8Array | ArrayBuffer): Promise<FileTypeResult | undefined>;
export function fileTypeFromBuffer(buffer: Uint8Array | ArrayBuffer, customDetectors?: Iterable<Detector>): Promise<FileTypeResult | undefined>;

/**
Detect the file type of a Node.js [readable stream](https://nodejs.org/api/stream.html#stream_class_stream_readable).

The file type is detected by checking the [magic number](https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files) of the buffer.

@param stream - A readable stream representing file data.
@param customDetectors - Optional: An Iterable of Detector functions. They are called in the order provided.
@returns The detected file type and MIME type, or `undefined` when there is no match.
*/
export function fileTypeFromStream(stream: ReadableStream): Promise<FileTypeResult | undefined>;
export function fileTypeFromStream(stream: ReadableStream, customDetectors?: Iterable<Detector>): Promise<FileTypeResult | undefined>;

/**
Detect the file type from an [`ITokenizer`](https://github.com/Borewit/strtok3#tokenizer) source.
@@ -348,6 +382,7 @@ This method is used internally, but can also be used for a special "tokenizer" r
A tokenizer propagates the internal read functions, allowing alternative transport mechanisms, to access files, to be implemented and used.

@param tokenizer - File source implementing the tokenizer interface.
@param customDetectors - Optional: An Iterable of Detector functions. They are called in the order provided.
@returns The detected file type and MIME type, or `undefined` when there is no match.

An example is [`@tokenizer/http`](https://github.com/Borewit/tokenizer-http), which requests data using [HTTP-range-requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests). A difference with a conventional stream and the [*tokenizer*](https://github.com/Borewit/strtok3#tokenizer), is that it is able to *ignore* (seek, fast-forward) in the stream. For example, you may only need and read the first 6 bytes, and the last 128 bytes, which may be an advantage in case reading the entire file would take longer.
@@ -366,7 +401,7 @@ console.log(fileType);
//=> {ext: 'mp3', mime: 'audio/mpeg'}
```
*/
export function fileTypeFromTokenizer(tokenizer: ITokenizer): Promise<FileTypeResult | undefined>;
export function fileTypeFromTokenizer(tokenizer: ITokenizer, customDetectors?: Iterable<Detector>): Promise<FileTypeResult | undefined>;

/**
Supported file extensions.
@@ -399,6 +434,7 @@ A smaller sample size will result in lower probability of the best file type det
**Note:** Requires Node.js 14 or later.

@param readableStream - A [readable stream](https://nodejs.org/api/stream.html#stream_class_stream_readable) containing a file to examine.
@param customDetectors - Optional: An Iterable of Detector functions. They are called in the order provided.
@returns A `Promise` which resolves to the original readable stream argument, but with an added `fileType` property, which is an object like the one returned from `fileTypeFromFile()`.

@example
@@ -416,11 +452,15 @@ if (stream2.fileType?.mime === 'image/jpeg') {
}
```
*/
export function fileTypeStream(readableStream: ReadableStream, options?: StreamOptions): Promise<ReadableStreamWithFileType>;
export function fileTypeStream(readableStream: ReadableStream, options?: StreamOptions, customDetectors?: Iterable<Detector>): Promise<ReadableStreamWithFileType>;

/**
Detect the file type of a [`Blob`](https://nodejs.org/api/buffer.html#class-blob).

@param blob
@param customDetectors - Optional: An Iterable of Detector functions. They are called in the order provided.
@returns The detected file type and MIME type, or `undefined` when there is no match.

@example
```
import {fileTypeFromBlob} from 'file-type';
Expand All @@ -434,4 +474,4 @@ console.log(await fileTypeFromBlob(blob));
//=> {ext: 'txt', mime: 'plain/text'}
```
*/
export declare function fileTypeFromBlob(blob: Blob): Promise<FileTypeResult | undefined>;
export declare function fileTypeFromBlob(blob: Blob, customDetectors?: Iterable<Detector>): Promise<FileTypeResult | undefined>;
38 changes: 26 additions & 12 deletions core.js
@@ -10,16 +10,16 @@ import {extensions, mimeTypes} from './supported.js';

const minimumBytes = 4100; // A fair amount of file-types are detectable within this range.

export async function fileTypeFromStream(stream) {
export async function fileTypeFromStream(stream, customDetectors) {
const tokenizer = await strtok3.fromStream(stream);
try {
return await fileTypeFromTokenizer(tokenizer);
return await fileTypeFromTokenizer(tokenizer, customDetectors);
} finally {
await tokenizer.close();
}
}

export async function fileTypeFromBuffer(input) {
export async function fileTypeFromBuffer(input, customDetectors) {
if (!(input instanceof Uint8Array || input instanceof ArrayBuffer)) {
throw new TypeError(`Expected the \`input\` argument to be of type \`Uint8Array\` or \`Buffer\` or \`ArrayBuffer\`, got \`${typeof input}\``);
}
@@ -30,12 +30,12 @@ export async function fileTypeFromBuffer(input) {
return;
}

return fileTypeFromTokenizer(strtok3.fromBuffer(buffer));
return fileTypeFromTokenizer(strtok3.fromBuffer(buffer), customDetectors);
}

export async function fileTypeFromBlob(blob) {
export async function fileTypeFromBlob(blob, customDetectors) {
const buffer = await blob.arrayBuffer();
return fileTypeFromBuffer(new Uint8Array(buffer));
return fileTypeFromBuffer(new Uint8Array(buffer), customDetectors);
}

function _check(buffer, headers, options) {
@@ -59,9 +59,23 @@ function _check(buffer, headers, options) {
return true;
}

export async function fileTypeFromTokenizer(tokenizer) {
async function runCustomDetectors(tokenizer, detectors) {
if (detectors) {
for (const detector of detectors) {
const fileType = await detector(tokenizer);
if (fileType) {
return fileType;
}
}
}

return undefined;
}

export async function fileTypeFromTokenizer(tokenizer, customDetectors) {
try {
return new FileTypeParser().parse(tokenizer);
return await runCustomDetectors(tokenizer, customDetectors)
|| await new FileTypeParser().parse(tokenizer, customDetectors);
} catch (error) {
if (!(error instanceof strtok3.EndOfStreamError)) {
throw error;
@@ -78,7 +92,7 @@ class FileTypeParser {
return this.check(stringToBytes(header), options);
}

async parse(tokenizer) {
async parse(tokenizer, customDetectors) {
this.buffer = Buffer.alloc(minimumBytes);

// Keep reading until EOF if the file size is unknown.
@@ -211,7 +225,7 @@
}

await tokenizer.ignore(id3HeaderLength);
return fileTypeFromTokenizer(tokenizer); // Skip ID3 header, recursion
return fileTypeFromTokenizer(tokenizer, customDetectors); // Skip ID3 header, recursion
}

// Musepack, SV7
@@ -1602,7 +1616,7 @@ class FileTypeParser {
}
}

export async function fileTypeStream(readableStream, {sampleSize = minimumBytes} = {}) {
export async function fileTypeStream(readableStream, {sampleSize = minimumBytes} = {}, customDetectors) {
const {default: stream} = await import('node:stream');

return new Promise((resolve, reject) => {
@@ -1618,7 +1632,7 @@ export async function fileTypeStream(readableStream, {sampleSize = minimumBytes}
// Read the input stream and detect the filetype
const chunk = readableStream.read(sampleSize) ?? readableStream.read() ?? Buffer.alloc(0);
try {
const fileType = await fileTypeFromBuffer(chunk);
const fileType = await fileTypeFromBuffer(chunk, customDetectors);
pass.fileType = fileType;
} catch (error) {
if (error instanceof strtok3.EndOfStreamError) {
1 change: 1 addition & 0 deletions fixture/fixture.unicorn
@@ -0,0 +1 @@
UNICORN FILE CONTENT
4 changes: 2 additions & 2 deletions index.js
@@ -1,10 +1,10 @@
import * as strtok3 from 'strtok3';
import {fileTypeFromTokenizer} from './core.js';

export async function fileTypeFromFile(path) {
export async function fileTypeFromFile(path, customDetectors) {
const tokenizer = await strtok3.fromFile(path);
try {
return await fileTypeFromTokenizer(tokenizer);
return await fileTypeFromTokenizer(tokenizer, customDetectors);
} finally {
await tokenizer.close();
}
98 changes: 92 additions & 6 deletions readme.md
@@ -105,7 +105,7 @@ console.log(fileType);

## API

### fileTypeFromBuffer(buffer)
### fileTypeFromBuffer(buffer, customDetectors)

Detect the file type of a `Buffer`, `Uint8Array`, or `ArrayBuffer`.

@@ -126,7 +126,13 @@ Type: `Buffer | Uint8Array | ArrayBuffer`

A buffer representing file data. It works best if the buffer contains the entire file, it may work with a smaller portion as well.

### fileTypeFromFile(filePath)
#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.

### fileTypeFromFile(filePath, customDetectors)

Detect the file type of a file path.

@@ -145,7 +151,14 @@ Type: `string`

The file path to parse.

### fileTypeFromStream(stream)
#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.


### fileTypeFromStream(stream, customDetectors)

Detect the file type of a Node.js [readable stream](https://nodejs.org/api/stream.html#stream_class_stream_readable).

@@ -164,7 +177,14 @@ Type: [`stream.Readable`](https://nodejs.org/api/stream.html#stream_class_stream

A readable stream representing file data.

### fileTypeFromBlob(blob)
#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.


### fileTypeFromBlob(blob, customDetectors)

Detect the file type of a [`Blob`](https://developer.mozilla.org/en-US/docs/Web/API/Blob).

@@ -189,7 +209,18 @@ console.log(await fileTypeFromBlob(blob));
//=> {ext: 'txt', mime: 'plain/text'}
```

### fileTypeFromTokenizer(tokenizer)
#### blob

Type: [`Blob`](https://developer.mozilla.org/en-US/docs/Web/API/Blob)

#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.


### fileTypeFromTokenizer(tokenizer, customDetectors)

Detect the file type from an `ITokenizer` source.

@@ -248,7 +279,13 @@ Type: [`ITokenizer`](https://github.com/Borewit/strtok3#tokenizer)

A file source implementing the [tokenizer interface](https://github.com/Borewit/strtok3#tokenizer).

### fileTypeStream(readableStream, options?)
#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.

### fileTypeStream(readableStream, options?, customDetectors)

Returns a `Promise` which resolves to the original readable stream argument, but with an added `fileType` property, which is an object like the one returned from `fileTypeFromFile()`.

@@ -297,6 +334,13 @@ Type: [`stream.Readable`](https://nodejs.org/api/stream.html#stream_class_stream

The input stream.

#### customDetectors

Type: `Iterable<Detector>`

Optional: An Iterable of [Detector](#custom-detectors) functions. They are called in the order provided.


### supportedExtensions

Returns a `Set<string>` of supported file extensions.
@@ -469,6 +513,48 @@ The following file types will not be accepted:
- `.csv` - [Reason.](https://github.com/sindresorhus/file-type/issues/264#issuecomment-568439196)
- `.svg` - Detecting it requires a full-blown parser. Check out [`is-svg`](https://github.com/sindresorhus/is-svg) for something that mostly works.


## Custom detectors

A custom detector is a function that allows specifying custom detection mechanisms.

An iterable of detectors can be provided as an argument to the file type detection methods.

The detectors are called before the default detections in the provided order.

Custom detectors can be used to add new `FileTypeResult` entries or to modify the return behaviour of existing `FileTypeResult` detections.

If the detector returns `undefined`, the `tokenizer.position` should be 0 (unless it's a stream). That allows other detectors to parse the file.

Example detector array, which can be extended and provided as an argument to each public method:

```
const customDetectors = [
	async tokenizer => {
		const unicornHeader = [85, 78, 73, 67, 79, 82, 78]; // "UNICORN" as decimal string
		const buffer = Buffer.alloc(7);
		await tokenizer.peekBuffer(buffer, {length: unicornHeader.length, mayBeLess: true});
		if (unicornHeader.every((value, index) => value === buffer[index])) {
			return {ext: 'unicorn', mime: 'application/unicorn'};
		}

		return undefined;
	},
];
```
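
The `customDetectors` array above can be passed to any of the detection methods. A minimal usage sketch with `fileTypeFromFile` and the `fixture/fixture.unicorn` file added in this PR (its content starts with the bytes `UNICORN`):

```
import {fileTypeFromFile} from 'file-type';

const fileType = await fileTypeFromFile('fixture/fixture.unicorn', customDetectors);

console.log(fileType);
//=> {ext: 'unicorn', mime: 'application/unicorn'}
```
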
#### tokenizer

Type: [`ITokenizer`](https://github.com/Borewit/strtok3#tokenizer)

Usable as source of the examined file.

#### fileType

Type: FileTypeResult

Object having an `ext` (extension) and `mime` (mime type) property.

Detected by the standard detections or a previous custom detection. Undefined if no matching fileTypeResult could be found.
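
The same array can also be passed to `fileTypeStream`, where the custom detectors are the third argument, after the options object. A sketch reusing `customDetectors` and the fixture file from above:

```
import fs from 'node:fs';
import {fileTypeStream} from 'file-type';

const stream = await fileTypeStream(fs.createReadStream('fixture/fixture.unicorn'), {}, customDetectors);

console.log(stream.fileType);
//=> {ext: 'unicorn', mime: 'application/unicorn'}
```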

## Related

- [file-type-cli](https://github.com/sindresorhus/file-type-cli) - CLI for this module