ASORT Version 1.1 - Copyright 1989,1996 Roger E. Donais.

ASORT is a text extract and sort utility that creates an output file of
fixed length lines, is able to sort on a multi-part key, and can handle
large files.  ASORT maintains similar performance for random input,
ordered input, reversed input, unique keys and identical keys.

In a few applications, ASORT has been used to create fixed length
records while maintaining the same record sequence by being given a
single character key that was known to be a space between columns or
known to be beyond the end of the longest input line.

ASORT has also been used to add spaced columns within an existing text
file by defining both key and field segments that were known to be
beyond the end of the longest input line.


INPUT FILE:
-----------
The input file is expected to be a carriage-return line-feed delimited
ASCII text file.  Source lines may be of any length, but cannot exceed
4096 characters.


END-OF-LINE:
------------
The end of an input record is marked by a carriage return (ascii 13),
line-feed (ascii 10), or form-feed (ascii 12).  Consecutive carriage
returns, line-feeds, and form-feeds are discarded, so blank lines are
also discarded.  ASORT does *NOT* recognize a control Z (ascii 26) as
the logical end of a file.  The control Z and any subsequent
characters will be processed as one or more records.  On the typical
sort, this often places a control Z as the first character in the
output file.  If this will cause problems, you should use one of the
many available utilities that will physically truncate a file at the
logical end of file mark.
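
The record rules above can be illustrated with a short Python sketch.
It is not taken from ASORT; the function name and the sample data are
hypothetical, and it only mirrors the behavior described in this
section.

```python
import re

# Record terminators named above: CR (13), LF (10) and FF (12).
# A run of terminators counts as one break, so blank lines vanish.
TERMINATORS = re.compile(rb"[\r\n\x0c]+")


def split_records(data: bytes) -> list[bytes]:
    """Split raw input into records, dropping empty ones."""
    return [rec for rec in TERMINATORS.split(data) if rec]


# Hypothetical input that ends with a control Z (ascii 26).
sample = b"beta\r\n\r\nalpha\x0cgamma\r\n\x1a"
records = split_records(sample)
print(records)          # the control Z survives as its own record
print(sorted(records))  # and sorts ahead of the printable text
```

Because ascii 26 collates below every printable character, a plain
ascending sort moves that final record to the front, which is the
effect described above.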


OUTPUT FILE:
------------
The output file is defined as one or more segments that are to be
extracted from each source line.  Each extract segment is defined as a
starting column and ending column separated by a colon, or as a
starting column and segment length separated by the letter "x".  Any
extract segment that is beyond the end of a short source line will be
space filled.  Portions of the source line that are not described in
the extract definition are discarded.  Each line of the completed
extract file will be the combined length of the concatenated extract
segments followed by a carriage-return and line-feed character
sequence (ascii 13 and 10).  Multiple segment descriptions are comma
separated with no intervening spaces.
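
As a rough illustration of the extract rules, here is a Python sketch.
The function names are hypothetical, the end column is assumed to be
inclusive (as "column 1 through column 1" in the EXAMPLE section
suggests), and accepting both "x" and "X" is an assumption.

```python
def parse_segment(spec: str) -> tuple[int, int]:
    """Return (start column, length) for 'start:end' or 'startxlength'.

    Columns are 1-based; the end column is assumed to be inclusive.
    """
    if ":" in spec:
        start, end = (int(n) for n in spec.split(":"))
        return start, end - start + 1
    start, length = (int(n) for n in spec.lower().split("x"))
    return start, length


def extract(line: str, spec_list: str) -> str:
    """Concatenate the listed segments, space filling short lines."""
    parts = []
    for spec in spec_list.split(","):
        start, length = parse_segment(spec)
        segment = line[start - 1:start - 1 + length]
        parts.append(segment.ljust(length))  # pad past end of line
    return "".join(parts)


# Hypothetical 10-character source line.
out = extract("ABCDEFGHIJ", "1:1,6:13,3x2")
print(repr(out + "\r\n"))  # every output line ends with CR LF
```

Every output line has the same length (here 1 + 8 + 2 characters
before the line ending), whatever the length of the source line.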


SORT KEY:
---------
The sort key is defined as one or more key segments that are to be
constructed from each source line.  Each key segment is defined as a
starting column and ending column separated by a colon, or as a
starting column and segment length separated by the letter "x".
Multiple sort key segment descriptions are comma separated with no
intervening spaces.

The key description list is read from left to right.  The leftmost
segment definition represents the major sort key, and the rightmost
segment definition represents the minor, or least significant, key.

Any key segment that is beyond the end of a short source line will be
padded with spaces.  This means that non-existent key segments will
have the same collating sequence as an existing space filled key
segment.

An optional letter suffix may be applied to each sort key segment to
control case sensitivity and specify the ascending or descending order
to be applied to that key segment.  The letter "A" or "a" specifies an
ascending sort sequence, and the letter "D" or "d" specifies a
descending sequence.  Use an uppercase "A" or "D" to force conversion
to upper case, thereby removing case sensitivity; and use a lowercase
"a" or "d" to retain case sensitivity.
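
The effect of the letter suffixes can be sketched in Python as shown
here.  This is an illustration only: the function names are
hypothetical, and inverting the bytes of a segment to obtain a
descending order is an assumed technique, not necessarily the one
ASORT uses.  It works here because every key segment is padded to a
fixed length.

```python
def key_segment(line: bytes, start: int, length: int,
                suffix: str = "a") -> bytes:
    """Build one fixed-length key segment (1-based start column)."""
    seg = line[start - 1:start - 1 + length].ljust(length)  # space pad
    if suffix.isupper():          # "A" or "D": fold to upper case
        seg = seg.upper()         # bytes.upper() changes ASCII only
    if suffix.lower() == "d":     # descending: invert every byte
        seg = bytes(255 - b for b in seg)
    return seg


def sort_key(line: bytes, segments) -> bytes:
    """Join the segments, major key first, into one comparable key."""
    return b"".join(key_segment(line, *seg) for seg in segments)


# Hypothetical data: column 1 upper case descending ("D"), then
# columns 3 through 7 case sensitive ascending ("a").
lines = [b"b apple", b"B pear", b"a fig", b"b Apple"]
segments = [(1, 1, "D"), (3, 5, "a")]
ordered = sorted(lines, key=lambda ln: sort_key(ln, segments))
print(ordered)
```

The "b" and "B" records collate together ahead of "a" because the
first segment ignores case and is descending, while "Apple" precedes
"apple" because the second segment keeps its case.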

A pseudo numeric key segment may be declared by appending a minus sign,
"-", either to one of the lettered options or directly to the physical
description.  If used, this option must be the last character of a
segment description.

Each segment of the sort key retains its ASCII characteristics;
therefore non-numeric characters may be contained in a right justified
numeric field.  An entry is treated as negative when the first
non-space character is a minus sign, "-"; when the last non-space
character is a minus sign, "-"; or when the first non-space character
is a left parenthesis, "(", and the last non-space character is a
right parenthesis, ")".  The minus sign or parentheses marking a
negative value are discarded, leading and trailing spaces are removed
from the remaining portion, and the result is then right justified in
a space filled field.
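
The normalization just described can be sketched in Python as
follows.  The function name is hypothetical, and the sketch only
reports the sign; how ASORT orders negative entries relative to
positive ones is not stated here, so that step is left out.

```python
def numeric_segment(field: str) -> tuple[bool, str]:
    """Return (is_negative, right justified text) for one key field."""
    width = len(field)
    body = field.strip(" ")
    negative = False
    if len(body) >= 2 and body.startswith("(") and body.endswith(")"):
        negative, body = True, body[1:-1]   # (123) style
    elif body.startswith("-"):
        negative, body = True, body[1:]     # leading minus sign
    elif body.endswith("-"):
        negative, body = True, body[:-1]    # trailing minus sign
    # Remove spaces left inside the sign markers, then right justify.
    return negative, body.strip(" ").rjust(width)


# Hypothetical six-character fields.
for field in ["  -12 ", " 12-  ", "(3.50)", "7     "]:
    print(repr(field), numeric_segment(field))
```

Non-numeric characters pass through untouched, which matches the
"pseudo numeric" description: the field is still compared as text,
only with the sign removed and the remainder right justified.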

Justification and case conversion are applied only to the sort key.
Each section of the data extract is faithfully copied from the source
file retaining the same character case, justification, and/or centering
that occurred in the original file.

If the completed sort records are larger than the available memory,
ASORT will create an intermediate work file.  Sufficient free disk
space must be available on the scratch drive to contain the work file,
and sufficient memory must exist to provide a minimal buffer for each
intermediate block flushed to disk.


USAGE:
------
ASORT is invoked using the following syntax -

    asort <infile >outfile <extract> <sort key> [<temp path>]

    where   <infile        is a standard DOS redirected input
                           specification.

            >outfile       is a standard DOS redirected output
                           specification.

            <extract>      is a comma separated field description list.

            <sort key>     is a comma separated field description list.

            <temp path>    is an optional DOS drive/path specification
                           for the intermediate work file.

    and each extract and sort key field entry is specified as -

        <start column>:<end column> OR <start column>X<field width>

    and each sort key field may have one letter suffix and/or a
    justification suffix

            a      case sensitive ascending sequence (default option)
            d      case sensitive descending sequence
            A      case insensitive ascending sequence
            D      case insensitive descending sequence
            -      right justified / pseudo numeric


EXAMPLE:
--------
asort  <e:\test1.def  >tempdata.$$$
          1:1,6:13,53x2,92:98,55x2  1:1D,6:13A,55x2-,92:98-,53x2A  D:\

Will extract data from TEST1.DEF located in the root directory of drive
E: and write the sorted result to TEMPDATA.$$$ located in the current
directory of the default drive. The root directory of drive D: will be
used to contain the intermediate work file.

The extract will consist of the concatenation of

          column 1 through column 1
    and   column 6 through column 13
    and   two characters starting at column 53
    and   column 92 through column 98
    and   two characters starting at column 55

Records (carriage return - line feed delimited lines) will be sorted
from major to minor keys as -

    column 1 through column 1                  upper case descending
    column 6 through column 13                 upper case ascending
    two characters starting at column 55       ascending numeric
    column 92 through column 98                ascending numeric
    two characters starting at column 53       upper case ascending


MESSAGES:
---------
All runtime messages are written to the stderr device, which is by
default directed to the CRT display.  When invoked with no parameters,
ASORT displays a full screen help message.  Additional messages that
may be displayed are -

      ASORT COMMAND ERROR:
            indicates an error was detected in the command line.  The
            command line will be displayed with a caret (^) marking the
            position where the error was detected.

      ** insufficient memory **
            indicates that there is not enough memory available to
            provide the necessary merge buffers.

      ** unable to create intermediate file **
            indicates an error occurred when ASORT attempted to create
            the intermediate work file (ASORT000.$$$).

      ** intermediate file write error **
            indicates an error occurred when writing to the
            intermediate work file.

      ** output file write error **
            indicates an error occurred when writing to the redirected
            stdout file handle.


SYSTEM REQUIREMENTS:
--------------------
ASORT was tested using a 9-Meg file consisting of 128,000 seventy (70)
character carriage-return line-feed delimited records.  The record
extract was defined as the original seventy (70) characters, and the
sort key was defined as the first ten (10) characters.  A TSR was used
to consume all but 64k of available memory. The sort was completed in
less than forty-nine minutes using a 16 MHz 386 equipped with a 16ms
MiniScribe 9380E.  The same test completed in less than five minutes
when 490,000 bytes of memory was reported available by Turbo Power's
MAPMEM utility.

In an actual task, ASORT was fed a 450 meg file containing
approximately 1.2 million records.  The task was performed on a 486DX25
workstation connected to a Novell network through an 8-bit Thomas
Conrad ArcNet card.  The task completed in just under 7-1/2 hours.

If you are concerned about really large files, you can compute the
amount of memory per record and minimal buffer requirement as -

    9 + extract length + key length + number of numeric sort fields

Assuming 100k of memory was available for work space, approximately 600
merge buffers would be available to sort a file of 78 character records
using a 78 character key, which would be sufficient to allow sorting a
28-Meg file.
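
The arithmetic behind these figures can be checked with a short
Python sketch.  The per-record formula is the one given above; the
rest is one reading that happens to reproduce the quoted numbers and
is an assumption, not a description of ASORT's internals: that each
block flushed to disk holds as many records as fit in the work space,
and that one merge buffer is needed per block.

```python
def bytes_per_record(extract_len: int, key_len: int,
                     numeric_fields: int = 0) -> int:
    """Memory per record, per the formula in the text."""
    return 9 + extract_len + key_len + numeric_fields


def estimate_capacity(workspace: int, extract_len: int, key_len: int,
                      numeric_fields: int = 0) -> tuple[int, int, int]:
    """Return (slots, max records, max output bytes)."""
    slots = workspace // bytes_per_record(extract_len, key_len,
                                          numeric_fields)
    # ASSUMPTION: `slots` records per block and `slots` blocks in all.
    max_records = slots * slots
    # Each output line is the extract plus a CR LF pair.
    return slots, max_records, max_records * (extract_len + 2)


slots, records, out_bytes = estimate_capacity(100_000, 78, 78)
print(slots, records, out_bytes, round(out_bytes / 2**20, 1))
```

With 100,000 bytes of work space this gives 165 bytes per record and
606 slots, in line with the "approximately 600" quoted above.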

Naturally, the more memory, the faster the sort!

The size of the intermediate file can be computed as the number of
records times the sum of:

    (extract length + key length + number of numeric sort fields)

Thus, the above example will require approximately 56-meg of scratch
disk space in order to produce a 28-meg output file.  Although memory
is an important factor in speed, it will usually not be the limiting
factor for the size of file that can be created.
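
The disk estimate can be reproduced with the Python sketch below.  The
record count of 360,000 is an assumed round figure (600 blocks of 600
records) chosen to match the 28-meg example; the function names are
hypothetical.

```python
def intermediate_file_bytes(records: int, extract_len: int,
                            key_len: int,
                            numeric_fields: int = 0) -> int:
    """Work file size, per the formula in the text."""
    return records * (extract_len + key_len + numeric_fields)


def output_file_bytes(records: int, extract_len: int) -> int:
    """Output size: each line is the extract plus CR LF."""
    return records * (extract_len + 2)


RECORDS = 360_000  # assumed record count for the 28-meg example
work = intermediate_file_bytes(RECORDS, 78, 78)
out = output_file_bytes(RECORDS, 78)
print(work, out)
```

The work file comes to about twice the output size because every
record is stored together with its 78 character key.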


HISTORY:
--------
The algorithm has migrated through Turbo-3, Turbo-4, Turbo-5, Turbo-C,
and finally to the assembly language version presented here.  Of the
high level languages, Turbo-3 performed best, sorting a 10-meg file of
80 character records on a 10-MHz Wyze in 35 minutes.  Turbo-4 and 5
took one minute longer.  Turbo Pascal 4 & 5 produced substantially
smaller executable files; they lost the race with Turbo-C when sorting
small files, but performed substantially better when sorting larger
files.  This improvement was probably due to my ability to control the
Pascal heap better than the Turbo-C heap.  For all its inefficiency,
the assembly version is one fourth the size of the smallest high level
implementation, and six times faster than the best of them.

ASORT was used as the foundation for its sister program, VSORT, which
is a direct replacement for the DOS SORT utility that can handle large
files and optionally truncate output lines to a specified maximum
length and/or at a specified character value.

Roger E. Donais <rdonais@southeast.net>
6121 Collins Road #127
Jacksonville, FL 32244

