Success Stories
CoSORT: The Emerging ETL Engine
(published July 1999, Database Trends)
by M. Denis Hill
CoSORT, from IRI, Inc., is an essential ingredient in many of today's
favorite data warehousing recipes. Its coroutine sort architecture,
which was originally intended to assist users migrating from legacy
system sorts, is designed to accept and produce an interactive stream
of records. Like a roux that can be the basis of many sauces, CoSORT's
fundamentally correct core technology enables it to embrace a number
of key functions of data warehouse ETL (extraction, transformation,
and loading).
The Start of a Sort
CoSORT technology was born 30 years ago in the mainframe world,
but the product is now known only on UNIX and NT. CoSORT was the
first independent sort developed for open systems, moving from CP/M
in 1978, to DOS in 1980, then UNIX in 1985, and Windows NT in 1990.
It is now the world's fastest and most widely licensed commercial-grade
sort package for UNIX systems, and the top-performing sort on NT
according to PC Week.
According to IRI engineer Sue Strickland, a combination of luck
and foresight prepared the company for the data warehousing boom.
"When we developed SortCL [CoSORT's sort control language]
for CoSORT in 1992, data warehousing (much less ETL) was a gleam
in some eyes," she says. "Our intent was to build a familiar DML
for sorting and reporting for UNIX and NT users leaving the mainframe.
Fortunately, the nature of our sort architecture is to accept and
produce an interactive stream of records: As they are read in, they
can be selected and modified. As they are transformed, they can
be sorted, aggregated, and reformatted for loading. Because we are
not locked to a single mainframe sort syntax, it's easier to expand
the power of the language. So for data warehousing, SortCL
can be used as both a front-end manipulation tool and a back-end
transformation engine. As it now happens, sorting is just another
CoSORT option!"
Made for Marts
As the volume of corporate, financial, scientific, and government
data grows, so expands the need for products like CoSORT. Used wherever
sorting, loading and report generation occur, CoSORT is best known
for its open approach to mainframe legacy sort and batch COBOL migrations
in UNIX and Windows NT, and now more so for its role in accelerating
database utility operations and data warehouse manipulation.
Among the organizations capitalizing on CoSORT's efficient handling
of volume data is VIPS Information Solutions, which employs the
tool to speed 70GB Red Brick PTMU parallel loads. In addition to
this medical and financial data warehouse application, Bill McCaslin,
IRI's data warehouse segment manager, notes that Ardent Software
recommends CoSORT for the sort/aggregate stage of its DataStage product.
Exo Solutions, Hyperconsultoria, EBE Computing, Hyperion, and New
Dimensions integrate SortCL and other CoSORT pieces into
their standalone load and OLAP tools. CoSORT is often chosen as
the sort engine under the hood for data warehousing with Cincom's
Supra, Sabre's Airmax, Micro Focus COBOL, and Software AG's Natural.
SAS and BMC also easily integrate CoSORT.
A Tool of Many Talents
The CoSORT package is actually a collection of standalone utilities
and APIs for file sorting; for one-pass extraction, sorting, summarization,
and reporting; and for providing sort functionality within databases,
data warehouses, and application programs. The central sort engine
is a minimal time algorithm in a coroutine architecture that transfers
records through memory.
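The coroutine idea can be illustrated with a short sketch. The Python below is not CoSORT code or SortCL syntax; it is a minimal illustration, using generators as a stand-in for coroutines, of how records might be selected, modified, and sorted as they stream through memory. The record layout, the stage names (select, transform, sort_stage), and the sample data are all hypothetical.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # hypothetical record shape: field name -> value


def select(records: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Pass along only the records that satisfy the condition."""
    for rec in records:
        if keep(rec):
            yield rec


def transform(records: Iterable[Record], fn: Callable[[Record], Record]) -> Iterator[Record]:
    """Modify each record as it streams through."""
    for rec in records:
        yield fn(rec)


def sort_stage(records: Iterable[Record], key: str) -> Iterator[Record]:
    """Consume the incoming stream, order it by one key, and stream it back out."""
    yield from sorted(records, key=lambda rec: rec[key])


# Hypothetical sample input.
source = [
    {"id": 3, "region": "east", "amount": 120},
    {"id": 1, "region": "west", "amount": 75},
    {"id": 2, "region": "east", "amount": 40},
]

# Chain the stages: keep "east" records, double the amount, then sort by id.
pipeline = sort_stage(
    transform(
        select(iter(source), lambda rec: rec["region"] == "east"),
        lambda rec: {**rec, "amount": rec["amount"] * 2},
    ),
    key="id",
)
result = list(pipeline)
```

Each stage pulls records from the one before it, so selection and modification happen as records are read. Only the sort stage has to see every record before it can emit the first one.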
The adaptability of CoSORT may be attributed, in part, to its support
of any file size, record format, or data type, including: alpha
and binary forms, C and COBOL numerics, EBCDIC, zoned decimal, floating
point, currency, and Julian and multinational timestamps. For non-standard
or encrypted data, CoSORT even supports user exits to perform special
compare procedures. The same is true for nonstandard input and output
sources and criteria. The usual input and output are from and to
new or existing files, tape or optical devices, stdin/stdout (and
pipes), and application programs.
The most popular of CoSORT's several standalone end-user utilities
is its sort control language, SortCL. The SortCL interface
uses familiar mainframe sort commands, but in a more intuitive and
explicit SQL-based framework, with centralized data dictionaries
and one's own symbolic field names. SortCL's cutting-edge
record mapping technology performs precision field selection and
extraction, multi-key comparisons, record grouping and filtering,
advanced drill-down summary functions, horizontal mathematical and
expression evaluations, field-level data type translations, and
multi-output reformatting for report generation. Some of its speed
and resource economy derives from performance of these functions
in a single pass through the data.
The CoSORT package includes command line conversion tools to automatically
build UNIX and NT SortCL scripts from MVS and other sort
parms. It incorporates drop-in replacements for the Win32, UNIX
/bin/sort (called sort), SAS System 7 (PROCsort), and Micro Focus
COBOL sort verbs. A user-friendly interactive and batch interface
provides on-line help. An open API for application development is
included. The API supports direct C, COBOL, and FORTRAN calls to
CoSORT's central sort engine. To facilitate balancing performance
with system needs, CoSORT provides sophisticated resource tuning
facilities. Script commands, environment variables and control files
can be used to optimize CPU, memory and disk parameters.
Customizable ETL Language
For data warehouse and data mart applications, CoSORT's SortCL
performs source data extraction, data cleansing, sorting, reformatting,
data type conversion, aggregation, and indexing, all in a single
pass. Most operational data in commercial and public sector enterprises
either reside internally in sequential flat files or mainframe (relational)
database tables, or are imported from data tapes and transmissions
generated externally. These historical databases are optimized for
ad hoc queries and transactions, rather than for extraction. IRI's
SortCL accepts multiple input files (large-scale tables or
flat file data dumps), or records streaming through pipes, to perform
conditional selection on records for downstream processes.
Beyond conditional include or omit criteria, additional record
filtering functions can be used to "horizontally" select virtual
records for sorting, reformatting, translation, aggregation and
output reporting. SortCL's data cleansing, though not as sophisticated
as fuzzy logic tools dedicated to the task, includes conditional
or unconditional elimination, reduction, or writing to an error
file of duplicate records, headers, fields, and bytes. This data
scrubbing increases the efficiency of downstream warehousing processes.
CoSORT resolves complex conditions based on inter- and intra-field
events, removes duplicates, and separates data into different outputs
and structures.
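As a rough picture of this kind of scrubbing, the sketch below applies an include condition, removes duplicates on a chosen key, and routes everything it drops to separate outputs. It is plain Python rather than SortCL, and the function name scrub, the field names, and the sample records are assumptions made for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


def scrub(
    records: Iterable[Record],
    key_fields: Sequence[str],
    include: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split records into accepted, rejected, and duplicate outputs in one pass."""
    seen = set()
    accepted: List[Record] = []
    rejected: List[Record] = []    # failed the include condition
    duplicates: List[Record] = []  # repeated an already-seen key
    for rec in records:
        if not include(rec):
            rejected.append(rec)
            continue
        key = tuple(rec[field] for field in key_fields)
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            accepted.append(rec)
    return accepted, rejected, duplicates


# Hypothetical sample input.
records = [
    {"cust": "A1", "date": "1999-06-01", "amount": 10},
    {"cust": "A1", "date": "1999-06-01", "amount": 10},  # duplicate key
    {"cust": "B2", "date": "1999-06-02", "amount": -5},  # fails the condition
    {"cust": "C3", "date": "1999-06-03", "amount": 7},
]

accepted, rejected, duplicates = scrub(
    records,
    key_fields=("cust", "date"),
    include=lambda rec: rec["amount"] > 0,
)
```

In this sketch the rejected and duplicate lists play the role of an error file: nothing is silently discarded, so the dropped records remain available for inspection.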
The parallel coroutine sort engine distributes the data across
multiple CPUs to provide the fastest possible reordering of the
virtual data based on specified key fields and collating sequences.
The results are merged and prepared for output mapping.
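The split, sort, and merge pattern described here can be sketched in a few lines. This is a generic illustration, not IRI's algorithm: the function name parallel_sort and the workers parameter are hypothetical, and a thread pool is used only to keep the example self-contained. In CPython, threads would not actually spread the sorting across multiple CPUs; a process pool would be the closer analogue.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def parallel_sort(records: Sequence[T], key: Callable[[T], object], workers: int = 4) -> List[T]:
    """Sort chunks concurrently, then merge the sorted runs into one ordered list."""
    if not records:
        return []
    size = -(-len(records) // workers)  # ceiling division: records per chunk
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker produces one sorted run.
        runs = list(pool.map(lambda chunk: sorted(chunk, key=key), chunks))
    # A k-way merge combines the sorted runs without re-sorting everything.
    return list(heapq.merge(*runs, key=key))


# Hypothetical sample input: (key, payload) pairs.
rows = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
ordered = parallel_sort(rows, key=itemgetter(0), workers=2)
```

The merge step is what keeps the approach efficient: each run is already in order, so the final pass only has to compare the heads of the runs.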
SortCL expedites and eases the pain of large-scale conditional
selection and reformatting. It uses intuitive command syntax and
symbolic field references, along with references to centralized
data dictionaries (where the metadata are defined and stored), to
make output field and file layouts simple to declare and modify.
Remapping from input to output includes field repositioning, resizing,
padding, and mathematical operations, as well as data type conversion
and aggregation.
While fixed or variable position fields are mapped from input to
output, their data can be relocated, resized, and converted by type;
e.g., from EBCDIC to ASCII, or from mixed packed decimal to signed
and zoned decimal. This eliminates the many mainframe binary forms
undesirable for subsequent data propagation, mining and access tools
running on open systems.
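To make the type conversions concrete, here is a small sketch of two of them: decoding EBCDIC text and unpacking a packed decimal field into an ordinary integer. It is illustrative Python, not CoSORT's conversion code. The choice of code page 037 for EBCDIC, the function names, and the sample bytes are assumptions; the packed decimal layout assumed is the common one with two digits per byte and the sign in the final nibble (B or D for negative).

```python
def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC bytes to text, assuming code page 037."""
    return raw.decode("cp037")


def unpack_packed_decimal(raw: bytes) -> int:
    """Convert a packed decimal field to an int.

    Assumed layout: two decimal digits per byte, with the last
    nibble holding the sign (0xB or 0xD negative, other sign codes positive).
    """
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)    # high nibble
        nibbles.append(byte & 0x0F)  # low nibble
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble in packed decimal field")
    if sign < 0x0A:
        raise ValueError("invalid sign nibble in packed decimal field")
    value = int("".join(str(d) for d in digits) or "0")
    return -value if sign in (0x0B, 0x0D) else value


# Hypothetical sample fields from a mainframe record.
name_field = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # "HELLO" in EBCDIC
amount_field = bytes([0x12, 0x34, 0x5D])            # digits 12345, negative sign

name = ebcdic_to_text(name_field)
amount = unpack_packed_decimal(amount_field)
```

Once the values are ordinary text and integers, downstream tools on open systems can read them without knowing anything about the original mainframe encoding.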
Counts, sums, totals, averages, maximum and minimum values based
on multiple inter- and intra-record break conditions are possible
using SortCL to produce sophisticated EIS detail and drill-down
summary reports. SortCL's aggregation is also widely used
for ad hoc MIS reports.
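A break-level summary of this kind can be sketched as follows. The example is generic Python rather than SortCL: the function name summarize, the field names, and the sample rows are hypothetical, and it handles a single break field only.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List


def summarize(rows: Iterable[Dict], break_field: str, value_field: str) -> List[Dict]:
    """Produce one summary row per distinct value of the break field."""
    # groupby only groups adjacent rows, so the input is sorted on the break field first.
    ordered = sorted(rows, key=itemgetter(break_field))
    report = []
    for key, group in groupby(ordered, key=itemgetter(break_field)):
        values = [row[value_field] for row in group]
        report.append({
            break_field: key,
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        })
    return report


# Hypothetical sample input.
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 30},
    {"region": "east", "amount": 50},
]
report = summarize(sales, break_field="region", value_field="amount")
```

Because the rows are sorted on the break field first, each group is complete when the field value changes, which is why sorting and aggregation fit naturally into the same pass.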
Selecting, sorting, reformatting, and aggregating operational data
prepares it for database repopulation not only qualitatively, but
quantitatively as well: the amount of data going back in is vastly
reduced. For example, 100 million rows can be aggregated down to 10 million,
greatly improving the efficiency of a loading utility
like Red Brick's PTMU. This reduces future query and transaction
times on the new enterprise warehouse since the table data are now
in sorted order.
Rising to the Opportunities
The popularity of CoSORT in data warehousing is boosting the
visibility of the company, according to vice president of business
development, David Friedland. "IRI, Inc. (a.k.a., CoSORT) has been
a profitable, privately held company since 1978," says Friedland.
"We've grown to 20 outlets worldwide caring for more than a thousand
customers without any external capital. That, plus the fact that
much of the product is silently embedded into other applications,
has limited awareness of the CoSORT brand. Nevertheless, the market
potential for CoSORT is too big not to more thoroughly exploit at
this point. Industry trends and alliances we're continuing to form
will combine to position the company and product line for more rapid
growth over the next 6-18 months."
As to the future of data warehousing, IRI's McCaslin predicts trends
toward better tools and industry consolidation. "CoSORT is already
involved in Web logging and reporting warehouses because of the
explosive growth of web traffic and on-line transaction volumes.
So rather than the analysis of legacy archives, I think the growth
in data sources and the need for advanced extraction and transformation
will shift to EDI and more immediate trend/profile analyses."
Whatever the source of ingredients for the data warehouses and
data marts of the future, CoSORT's developers at IRI are likely
to be in the kitchen, cooking up essential tools to whip large volumes
of data into digestible chunks and palatable forms.
