The JPEG Sheriff

Feeding Time, Eating Habits


  1. preface and clarification

    What follows is not to be considered an "attack" on, or even criticism of, any list, any producer of a list, or any program using such lists. If anything, it is an attempt to clarify why I made certain design decisions.

    I wrote the Sheriff from scratch, based primarily on the ScanMaster list and later the Qx list. After it was done I started looking at other lists that I could find. This page describes my thoughts and feelings regarding the suitability of certain formats for the Sheriff alone. One always gets the question "Why don't you read list such-and-so?" Well, this page describes why, as it relates to the Sheriff.

    A secondary reason for the extent of this "diatribe" is to provide an overview of the variety of formats and to attempt to put them into historic perspective. By doing this I hope to contribute a little to the process of coming to standardized lists in the general sense.

    Maybe there already is a FAQ for textual file formats. In that case I would like to know the URL so I can link to it. It is better to come to one standard FAQ than to duplicate one another's work. In case you're wondering: no, I did not, at the time, know about the existence of Checker and others for Win32. Had I known, I would have done a Java version instead.


  2. introduction

    Since there are more people that look at JPEG images than there are those that create them, it might be worthwhile to take the time to look at what the Sheriff will eat. And what it will not eat.

    Of course it eats lists of files. But not just any file. Its favorite meal consists of lists it itself has generated. Preferably with all the options on top. That way it knows all the necessary ingredients are present and no necessary vitamins have been left out.

  3. why so picky

    A good question. In order to fully appreciate its dietary requirements, indeed to grok them as it were, it is necessary to look at some of the lists floating around the Usenet. For this I have picked a reasonably random selection based primarily on what was present at the time.

    One of the formats that seems to be popular is often termed the CSV list. Probably this refers to Comma Separated Value list. CSV lists originated, as far as I am aware, with at least WordStar for CP/M. That wordprocessor used those lists to facilitate mail-merging form letters. WordStar CSV lists look like this:

    In short they are in USASCII. Each line forms one entry for the mail merger to mail a form letter to. Each line is separated from the next by a CR/LF (CarriageReturn, LineFeed) pair. Each item in a line is separated from the next, if present, with a comma. An item may be numeric or it may be textual. Textual items are surrounded with quotes.
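
    The rules above can be tried out with a small sketch. It is not the way WordStar or the Sheriff reads such lists, merely Python's standard csv module set up to follow the same convention: quoted items are text, bare items are numbers. The function name parse_classic_csv and the sample data are made up for illustration.

```python
import csv
import io


def parse_classic_csv(text):
    """Parse CSV text where quoted items are text and bare items are numbers."""
    # newline="" leaves the CR/LF pairs for the csv module to deal with.
    stream = io.StringIO(text, newline="")
    # QUOTE_NONNUMERIC converts every unquoted item to a float and
    # keeps every quoted item as a string.
    reader = csv.reader(stream, quoting=csv.QUOTE_NONNUMERIC)
    return [row for row in reader]


# Hypothetical mail-merge data, not taken from any real list.
sample = '"Smith","12 Main Street, Apt 4",1978\r\n"Jones","3 High Road",42\r\n'

for row in parse_classic_csv(sample):
    print(row)
```

    Note that the comma inside the quoted address does not split the item. A reader this strict has no patience for unquoted words, though: a bare item such as calanp01.jpg makes it raise a ValueError, since it cannot be turned into a number.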

    Alas, over time this format has deteriorated somewhat. A benign example of this deterioration can be found in the checker_wysiwyg.txt file I stumbled across. Symptomatic of this light form of deterioration is the lack of quotes surrounding textual items that form one whole word. In other words:

    Though this still looks quite reasonable, the venom is in its tail. What, exactly, makes a word? Normally a period ends a sentence and, by extension, the word it is placed after. Yet in this case it is clear that does not hold true for the item calanp01.jpg. Even so, if that is all then it is easily overcome.

    Unfortunately, life is never simple. As may become evident from looking at our next example. This is taken from the SWA_Phantom.csv file that was found hidden in a posted zip file of the same name. Albeit with a '.zip' extension. The CSV file as presented by Scanners With Attitudes looks thus:

    From looking at this snippet it becomes clear that the period still is treated like a normal alphabetic character. Furthermore, the space has now also been demoted from its traditional special status as a separating character used to delimit words to just another letter. At least, if we assume that Kim Page is to be treated as a single item and not as the item Kim followed by the item Page. Quite natural if you happen to be a warm blooded human being with the heart more or less in the right place. But alas, the Sheriff happens to be a program and full Artificial Intelligence was a bit much for a first release. The heart, by the way, is an option available at a slightly inflated price.

    But wait, our trained eye has spotted quotes. Even double quotes as prescribed by the good old WordStar format. Unfortunately they have been put to wrong use. Granted it looks quite fine, but then, I have not shown you the ace up my sleeve yet. The quotes in this case surround the item 206,351. Quite reasonable considering that it has a comma in it. By surrounding the item with double quotes the author is telling us to treat it as a single item nonetheless. Methinks it is about time to show you that ace:

    This happens to be the first line of that list. I think it is safe to assume that the author is trying to tell us something with it. Presumably that would be that the first item is the filename of the JPEG image. The second item its (file)size and the last item will contain the models portrayed by the image. Presumably we are also to deduce that only three items will ever be present. Probably meaning that if we happen to find a comma in the third item we must treat it as yet another regular letter. Again, quite reasonable if you happen to be a human. With the optional heart.

    But another thing it tells us is that the second item is the filesize. Yet of all the items found in the file only one was surrounded by quotes. Namely the second one. The one supposed to stand for the size. Again, no problem if you happen to be a human with a lot of understanding. Unfortunately one of the things sorely lacking in the current release of the Sheriff. Like compassion. A somewhat less understanding program will look at the double quotes and think: text! And treat it thus. A good thing too since there are commas in them letters.

    But let us cut to the chase and take a look at the venom in its tail. Not only is the filesize a number that will be treated as text, even were we to treat it as a number we would run into trouble. For there is a comma in that number. Depending on the international settings of the installed Windows that may or may not be correct. Fortunately the Sheriff will discard all non-numerals from items it treats as numbers. Even so another problem with the CSV format has been highlighted. To wit, sometimes the specifications will be applied in reverse.
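
    Discarding the non-numerals takes only a few lines. The sketch below shows the idea, not the Sheriff's actual code; the name parse_size is invented for the occasion.

```python
def parse_size(item):
    """Read a file size by keeping only the digits 0-9 of the item."""
    digits = "".join(ch for ch in item if ch in "0123456789")
    if not digits:
        raise ValueError(f"no digits in {item!r}")
    return int(digits)


# "206,351" is the quoted size discussed above; "55.463" uses a period
# as the thousands separator instead.
for raw in ['"206,351"', "206351", "55.463"]:
    print(raw, "->", parse_size(raw))
```

    This works because a file size in bytes is always a whole number, so any period or comma can only be a thousands separator. The same trick would mangle a true decimal fraction.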

    Another problem that alas was not present in this list was the case of the comma in the last item. Judging from that first line there are only supposed to be three items. The question remains whether it is possible to build sufficient artificial intelligence into the next release so that the Sheriff will be able to read and act accordingly. One can try, but one should not hold one's breath waiting. An example of the vaunted comma in a single text item can be found in the checkfile known as checker_simulator.csv. I think this particular selection is the most appropriate:

    A real beauty, indeed.
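
    Short of artificial intelligence, one possible workaround is to trust the promised column count and fold every surplus comma back into the last item. The sketch below does just that. It assumes the count of three is known in advance, and both the function split_fixed_columns and the sample line (file name and description alike) are hypothetical.

```python
import csv


def split_fixed_columns(line, columns=3):
    """Split one CSV line, folding surplus commas into the last column."""
    # csv.reader honours double quotes, so "206,351" stays one item.
    fields = next(csv.reader([line]))
    if len(fields) > columns:
        # Everything beyond the expected count belongs to the last column;
        # put back the commas that the split removed.
        head = fields[:columns - 1]
        tail = ",".join(fields[columns - 1:])
        fields = head + [tail]
    return fields


# Hypothetical line: the file name and the description are made up.
line = 'example01.jpg,"206,351",Kim Page, second scan'
print(split_fixed_columns(line))
```

    This only helps when the stray commas sit in the last column. A comma in an unquoted first or second item would still shift everything one place to the right.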

    Even though there is undoubtedly a lot more to tell about the Comma Separated Value file format, I am inclined to let it rest for a while. I think we have seen enough examples to agree with the acronym in so far as that it has indeed commas between the values. At the very least. In short, it used to be a nice and clean format. Back in 1978. But I think today we can do without it. If you come across a CSV file you need to use, for now just change (most of) the commas in it to spaces. Double quotes are also to be considered harmful.

  4. a little variation

    A variant of the CSV format is the tab delimited list, which I first saw used with Microsoft Works, though doubtless it was in use before that. The tab delimited list is comparable to the CSV list. It also consists of multiple items per line, but here the items are separated from one another by tabulation characters. The tab characters are produced by Ctrl-I (or the Tab key) and are represented by the ASCII code 9. Since the tab character is not a regular character but a control character - it originated with manual typewriters, where it was used to move the carriage to the next tabulation stop, and in fact it still is - it can not form part of normal text. Which makes it ideal as a separating character. More so since tabs are also used to tell wordprocessors to act like typewriters and tabulate. I.e. make neat looking columns.

    There is also a slight disadvantage to their usage. Which is that the tabs are all there is. The format, again as far as I am aware, has no provision to indicate whether a column is textual or numeric. The CSV format uses the double quotes to indicate text. All else are numbers. Since the tab delimited list does not have this, you can only make an educated guess. Fortunately the Sheriff is well educated. Indeed, as you might have guessed, the Sheriff has no problem with a diet consisting of tab delimited lists. An example of their use would be the list as posted by RonScan. But there is hardly a point in including it since HTML does not do tabs very well. And using tables would be self defeating.
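
    To give an idea of what such an educated guess might look like, here is a minimal sketch. It is my own illustration and not the Sheriff's recognizer: a column is called numeric when every non-empty value in it consists of digits, optionally grouped with periods or commas. The function names and the tab layout of the sample lines are assumptions.

```python
def looks_numeric(value):
    """True if the value is digits, optionally grouped by '.' or ','."""
    return value.replace(".", "").replace(",", "").isdigit()


def guess_column_types(lines):
    """Guess per column whether a tab delimited list holds numbers or text."""
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    types = []
    for index in range(width):
        # Collect the non-empty values found in this column.
        values = [row[index] for row in rows if index < len(row) and row[index]]
        numeric = bool(values) and all(looks_numeric(v) for v in values)
        types.append("number" if numeric else "text")
    return types


# Hypothetical tab delimited lines (the layout is invented for this sketch).
sample = [
    "lexrip01.jpg\t55.463\tD2A5E9B7\tAnna Nicole Smith\n",
    "lexrip02.jpg\t60.514\t162BC753\t\n",
]
print(guess_column_types(sample))
```

    Being a guess, it can be wrong. A column of CRCs that all happen to consist of decimal digits only would be taken for numbers.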

    This is not to say that it is perfect. For what is. It took me a while to find an example and realize that I had found it. It was only after the addition of a special tab delimited format recognizer that it became apparent. This problem holds for any delimited format so I shall provide an example using CSV, tabs being difficult to visualize.

    This list, the RonScan 212 list, had a full complement of information. To wit, filename, filesize, dimension, CRC and description, the description being the name of the model depicted. The list in comma separated format looks like this:

    a-mlms02.jpg, 99094, 768, x, 529, 524A21F0, , Anna Maria
    adalms01.jpg, 107409, 768, x, 570, 9789260F, , Alexah Adams
    adrlin01.jpg, 118276, 768, x, 562, 3878323E, , Rhonda Adams

    In other words, the parts of the dimension have been separated and an extra delimiter was inserted before the description. Now the Sheriff has no problem with this when read as a text file, for then a tab character is seen as so much whitespace. Since one or more spaces are used to delimit fields it does not see anything unusual. But not so with true tab delimited files. There it will read several fields, each of which has been separated from the next with a tab character. The fields it would expect in the above case are <filename>, <filesize>, <dimension>, <CRC>, followed by the description.

    Unfortunately that is not what it will get. The fields as implied by the format of the file would be <filename>, <filesize>, <width>, <"x">, <height>, <CRC>, <"">, <description>. Even though there have been some updates to the Sheriff since I started this analysis of common formats, the AI module was not one of them. In this particular case it will try to interpret the third field as a dimension. Since it is not (a valid dimension field is < number "x" number >), it will not recognize any line from the file as a valid image verification line.
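
    The rejection is easy to reproduce. The sketch below checks a field against the < number "x" number > pattern; the regular expression and the name parse_dimension are my own shorthand for that rule, not the Sheriff's source.

```python
import re

# A dimension field: number, the letter x, number, with optional spaces.
DIMENSION = re.compile(r"^\s*(\d+)\s*x\s*(\d+)\s*$")


def parse_dimension(field):
    """Return (width, height), or None if the field is not a dimension."""
    match = DIMENSION.match(field)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Fields as implied by the delimiters in the first line of the list above.
fields = ["a-mlms02.jpg", "99094", "768", "x", "529", "524A21F0", "", "Anna Maria"]

print(parse_dimension(fields[2]))              # the third field on its own
print(parse_dimension(" ".join(fields[2:5])))  # width, "x" and height rejoined
```

    The third field on its own is rejected, while the three parts rejoined with spaces pass. Which is exactly why the same line reads fine as a plain text file.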

  5. the real meal

    So show us a full meal, already! Ok, here it is:

    Generated by: The JPEG Sheriff (beta release 15 April 1997 16:20)
    Generated at: Do, 17 Apr 1997 07:26:37
    Find it at: http://www.worldonline.nl/~iboa
    
    Filename       Filesize      W x H      CRC-32   Description
    ------------  ----------  -----------  --------  -----------------
    lexrip01.jpg      55.463   450 x  400  D2A5E9B7  Anna Nicole Smith
    lexrip02.jpg      60.514   450 x  400  162BC753  
    lexrip03.jpg      64.541   450 x  400  A9E3C6DC  
    lexrip04.jpg      58.010   450 x  400  B0069E1A  
    lexrip05.jpg      74.106   450 x  400  D714E567  
    lexrip07.jpg      39.550   450 x  400  53E3377C  
    lexrip08.jpg      43.281   450 x  400  9F66795B  
    lexrip09.jpg      52.126   450 x  400  30E68CE5  
    lexrip10.jpg      63.781   450 x  400  BEFB5E72  
    lexrip11.jpg      60.671   450 x  400  6B1471CA  
    lexrip12.jpg      68.684   450 x  400  97B06EF6  
    ------------  ----------  -----------  --------  -----------------
    
    Count of files: 11  
    Total of sizes: 640.727   
    

    A list generated by yours truly. Though LexHaring did post a Sheriff list to the Usenet, I seem to have missed it. This of course also means that there is no guarantee that the shown CRCs are the true ones.

    Even though this diet is made by the Sheriff for the Sheriff it does not mean that it is perfect. Its worst drawback is that spaces in the filename will put it off. Now this can easily be solved by building in a recognizer specialized in tab delimited lists. I might in fact even do just that. One of these days. On the bright side, however, these lists are easy to read. They can also be mailed or posted without having to UUencode or MIMEify (AKA Base64 encoding) them first.
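
    For those who want to read such a list with a program of their own, a rough sketch follows. The regular expression is an assumption based on the columns shown above (filename, filesize, W x H, CRC-32 and an optional description) and is not how the Sheriff itself goes about it. The name parse_line is made up as well.

```python
import re

# filename, size, width x height, 8 hex digit CRC-32, optional description
LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<size>\d[\d.,]*)\s+"
    r"(?P<width>\d+)\s*x\s*(?P<height>\d+)\s+"
    r"(?P<crc>[0-9A-Fa-f]{8})(?:\s+(?P<description>.*\S))?\s*$"
)


def parse_line(line):
    """Return a dict for a verification line, or None for anything else."""
    match = LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Drop the thousands separators before converting the size.
    entry["size"] = int(re.sub(r"[.,]", "", entry["size"]))
    entry["width"] = int(entry["width"])
    entry["height"] = int(entry["height"])
    entry["description"] = entry["description"] or ""
    return entry


print(parse_line("lexrip01.jpg      55.463   450 x  400  D2A5E9B7  Anna Nicole Smith"))
print(parse_line("------------  ----------  -----------  --------  -----------------"))
# A made-up file name with a space in it is not recognized:
print(parse_line("lex rip01.jpg     55.463   450 x  400  D2A5E9B7"))
```

    Header, ruler and summary lines fall through because their second column is not a number. The last call shows the drawback mentioned above: with a space in the filename the second word lands in the size column and the whole line is rejected.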

    It will also read such a file as is. As well as anything vaguely resembling it. Including CSV lists with the commas and quotes turned into spaces or tabs. However, the more columns are present the better it will be able to determine if a line is valid or not. To demonstrate we can feed it the above list and tell it to expect nothing but the mandatory filenames and filesizes. In this pathological case it would generate the following wanted list:

    Generated by: The JPEG Sheriff (beta release 18 April 1997 03:34)
    Generated at: Vr, 18 Apr 1997 03:36:23
    Generated as: wanted files list
    Find it at: http://www.worldonline.nl/~iboa
    
    Filename       Filesize   Description
    ------------  ----------  ---------------------------------------
    lexrip01.jpg      55.463  450 x  400  D2A5E9B7  Anna Nicole Smith
    lexrip02.jpg      60.514  450 x  400  162BC753
    lexrip03.jpg      64.541  450 x  400  A9E3C6DC
    lexrip04.jpg      58.010  450 x  400  B0069E1A
    lexrip05.jpg      74.106  450 x  400  D714E567
    lexrip07.jpg      39.550  450 x  400  53E3377C
    lexrip08.jpg      43.281  450 x  400  9F66795B
    lexrip09.jpg      52.126  450 x  400  30E68CE5
    lexrip10.jpg      63.781  450 x  400  BEFB5E72
    lexrip11.jpg      60.671  450 x  400  6B1471CA
    lexrip12.jpg      68.684  450 x  400  97B06EF6
    ------------  ----------  ---------------------------------------
    
    Count of files: 11  
    Total of sizes: 640.727
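
    The pathological case can be imitated with a few lines of code. The sketch below expects nothing but a filename and a filesize and hands whatever is left over to the description, which is how the dimension and CRC end up there. It is an illustration under that assumption only; parse_minimal is a made-up name and the Sheriff's own logic may well differ.

```python
WANTED_LIST = """\
Filename       Filesize   Description
------------  ----------  ---------------------------------------
lexrip01.jpg      55.463  450 x  400  D2A5E9B7  Anna Nicole Smith
lexrip02.jpg      60.514  450 x  400  162BC753
lexrip03.jpg      64.541  450 x  400  A9E3C6DC
lexrip04.jpg      58.010  450 x  400  B0069E1A
lexrip05.jpg      74.106  450 x  400  D714E567
lexrip07.jpg      39.550  450 x  400  53E3377C
lexrip08.jpg      43.281  450 x  400  9F66795B
lexrip09.jpg      52.126  450 x  400  30E68CE5
lexrip10.jpg      63.781  450 x  400  BEFB5E72
lexrip11.jpg      60.671  450 x  400  6B1471CA
lexrip12.jpg      68.684  450 x  400  97B06EF6
------------  ----------  ---------------------------------------

Count of files: 11
Total of sizes: 640.727
"""


def parse_minimal(line):
    """Split a line into filename, size and whatever text remains."""
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None
    digits = parts[1].replace(".", "").replace(",", "")
    if not digits.isdigit():
        return None  # second column is not a size, so not a valid line
    description = parts[2].strip() if len(parts) > 2 else ""
    return parts[0], int(digits), description


entries = [e for e in map(parse_minimal, WANTED_LIST.splitlines()) if e]
print("Count of files:", len(entries))
print("Total of sizes:", sum(size for _, size, _ in entries))
print("First description:", entries[0][2])
```

    With only two columns to go on, the sole check left is whether the second word is a number. That is enough to skip the header and summary lines here, but it is a much weaker test than having the dimension and CRC to verify as well.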