
Creating a custom magic file database

syndicated from planet-php.net on January 17, 2019

The Unix file utility uses a "magic" database to determine what type of data a file contains, independently of the file's name or extension.

Here is how I created a custom magic database for testing purposes:



Test files

First, I created some files to run the tests on:



test.html


test.foo
<?php echo 'foo'; ?>


test.23
Test 23


Let's see what the standard magic database detects here:

$ file test.*
test.23:   ASCII text
test.foo:  PHP script, ASCII text
test.html: HTML document, ASCII text

$ file -i test.*
test.23:   text/plain; charset=us-ascii
test.foo:  text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii


Magic database

The magic database contains the rules that are used to detect the type.

It's a plain text file with one rule per line. A line may continue the rule on the previous line (continuation lines start with >), so that rules can be combined. The full documentation is available in the magic(5) man page.

Here is my simple file that detects "23" within the first 16 bytes of the file and returns the "text/x-23" MIME type:

my-magic
0 search/16 23 File containing "23"
!:mime text/x-23


We can already use it:

$ file -m my-magic test.23 
test.23: File containing "23", ASCII text
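
The remark that lines may refer to the previous line can be illustrated by extending this rule with one continuation line. The following is only a sketch: the file name my-magic-chained, the second test file other.23, and the added ">0 string Test" line are made up for illustration and are not part of my actual magic file. It assumes the file command is installed and runs in a scratch directory.

```shell
#!/bin/sh
# Sketch: a level-0 rule followed by a continuation line (leading ">").
# my-magic-chained, other.23 and the ">0 string Test" line are hypothetical.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

printf 'Test 23\n'    > test.23     # satisfies both lines of the rule
printf 'Answer: 23\n' > other.23    # satisfies only the level-0 line

# The quoted 'EOF' keeps the backslash in "\b" literal.
# "\b" tells file not to insert a space before the continuation's message.
cat > my-magic-chained <<'EOF'
0 search/16 23 File containing "23"
!:mime text/x-23
>0 string Test \b, starting with "Test"
EOF

# The continuation line is only evaluated when the level-0 line matched.
file -m my-magic-chained test.23 other.23
```

Both files should be reported as containing "23", but only test.23 should get the additional "starting with" text, because the continuation line is tested only after its parent line has matched.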


Compilation

If you want to use it many times, compile it into a binary file, which loads faster. The -C option writes the result to my-magic.mgc in the current directory:

$ file -C -m my-magic
$ file -m my-magic.mgc test.*
test.23:   File containing "23", ASCII text
test.foo:  ASCII text
test.html: ASCII text

$ file -i -m my-magic.mgc test.*
test.23:   text/x-23; charset=us-ascii
test.foo:  text/plain; charset=us-ascii
test.html: text/plain; charset=us-ascii

The HTML and PHP files that were detected correctly earlier are no longer recognized, because my own magic database does not contain the rules of the standard magic file (/usr/share/misc/magic.mgc).

You may, however, pass multiple magic files, separated by a colon (:):

$ file -i -m my-magic.mgc:/usr/share/misc/magic.mgc test.*
test.23:   text/x-23; charset=us-ascii
test.foo:  text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii
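
When the result is consumed by a script rather than read by a human, it can help to print only the bare MIME type. The sketch below wraps the combined lookup in a small shell function using file's -b (brief) and --mime-type options. The function name detect_mime is made up for illustration, and the location of the system database is taken from the path above; it may differ on other systems, so the sketch only appends it if it exists.

```shell
#!/bin/sh
# Sketch: compile the custom magic file and print only the MIME type.
# detect_mime is a hypothetical helper name, not part of file(1).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

printf 'Test 23\n' > test.23

cat > my-magic <<'EOF'
0 search/16 23 File containing "23"
!:mime text/x-23
EOF

file -C -m my-magic    # writes my-magic.mgc to the current directory

# Custom rules first, then the system database if it is at the assumed path.
magic=my-magic.mgc
system_magic=/usr/share/misc/magic.mgc
if [ -e "$system_magic" ]; then
    magic="$magic:$system_magic"
fi

detect_mime() {
    # -b drops the "filename:" prefix, --mime-type omits the charset part
    file -b --mime-type -m "$magic" -- "$1"
}

detect_mime test.23
```

For test.23 this should print text/x-23, matching the first line of the output above without the file name and charset.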


Programming language detection

With this knowledge, I wrote a magic file that detects the programming language in source code files, so that phorkie can automatically choose the correct file extension: MIME_Type_PlainDetect.