Contact: Lisa Mangino, CoSORT USA 321.777.8889 x224

Success Stories
CoSORT: The Emerging ETL Engine
(published July 1999, Database Trends)

by M. Denis Hill

CoSORT, from IRI, Inc., is an essential ingredient in many of today's favorite data warehousing recipes. Its coroutine sort architecture, which was originally intended to assist users migrating from legacy system sorts, is designed to accept and produce an interactive stream of records. Like a roux that can be the basis of many sauces, CoSORT's fundamentally correct core technology enables it to embrace a number of key functions of data warehouse ETL: extraction, transformation, and loading.

The Start of a Sort
CoSORT technology was born 30 years ago in the mainframe world, but the product is now known only on UNIX and NT. CoSORT was the first independent sort developed for open systems, moving from CP/M in 1978, to DOS in 1980, then UNIX in 1985, and Windows NT in 1990. It is now the world's fastest, and most widely licensed, commercial-grade sort package for UNIX systems, and the top performing sort on NT according to PC Week.

According to IRI engineer Sue Strickland, a combination of luck and foresight prepared the company for the data warehousing boom. "When we developed SortCL [CoSORT's sort control language] for CoSORT in 1992, data warehousing (much less ETL) was a gleam in some eyes," she says. "Our intent was to build a familiar DML for sorting and reporting for UNIX and NT users leaving the mainframe. Fortunately, the nature of our sort architecture is to accept and produce an interactive stream of records: As they are read in, they can be selected and modified. As they are transformed, they can be sorted, aggregated, and reformatted for loading. Because we are not locked to a single mainframe sort syntax, it's easier to expand the power of the language. So for data warehousing, SortCL can be used as both a front-end manipulation tool and a back-end transformation engine. As it now happens, sorting is just another CoSORT option!"

Made for Marts
As the volume of corporate, financial, scientific, and government data grows, so does the need for products like CoSORT. Used wherever sorting, loading, and report generation occur, CoSORT is best known for its open approach to mainframe legacy sort and batch COBOL migrations in UNIX and Windows NT, and now more so for its role in accelerating database utility operations and data warehouse manipulation.

Among the organizations capitalizing on CoSORT's efficient handling of volume data is VIPS Information Solutions, which employs the tool to speed 70GB Red Brick PTMU parallel loads. In addition to this medical and financial data warehouse application, Bill McCaslin, IRI's data warehouse segment manager, notes that Ardent Software recommends CoSORT for the sort/aggregate stage of its DataStage. Exo Solutions, Hyperconsultoria, EBE Computing, Hyperion, and New Dimensions integrate SortCL and other CoSORT pieces into their standalone load and OLAP tools. CoSORT is often chosen as the sort engine under the hood for data warehousing with Cincom's Supra, Sabre's Airmax, Micro Focus COBOL, and Software AG's Natural. SAS and BMC also easily integrate CoSORT.

A Tool of Many Talents
The CoSORT package is actually a collection of standalone utilities and APIs for file sorting; for one-pass extraction, sorting, summarization, and reporting; and for providing sort functionality within databases, data warehouses, and application programs. The central sort engine is a minimal-time algorithm in a coroutine architecture that transfers records through memory.
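To make the coroutine idea concrete, here is a minimal Python sketch of a sort that accepts records one at a time through memory and releases them in order once input ends. It only illustrates the general pattern and is not CoSORT's engine or API; the names `sort_coroutine` and `sink` are hypothetical.

```python
import heapq

def sort_coroutine(key, sink):
    """Hypothetical coroutine: accept records via send(), emit them in key order on close()."""
    heap = []
    arrival = 0  # tie-breaker so equal keys keep arrival order and records are never compared
    try:
        while True:
            record = yield  # control returns to the caller after each record
            heapq.heappush(heap, (key(record), arrival, record))
            arrival += 1
    except GeneratorExit:
        # Input is finished: drain the heap into the caller's sink in sorted order.
        while heap:
            sink.append(heapq.heappop(heap)[2])

incoming = [("delta", 4), ("alpha", 1), ("charlie", 3), ("bravo", 2)]
ordered = []
sorter = sort_coroutine(key=lambda rec: rec[0], sink=ordered)
next(sorter)              # prime the coroutine so it is waiting at the first yield
for rec in incoming:
    sorter.send(rec)      # the caller could select or modify each record before sending
sorter.close()            # signals end of input
```

Because the caller feeds records one by one, any selection or modification can happen in the same loop, before the sort ever sees the record.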

The adaptability of CoSORT may be attributed, in part, to its support of any file size, record format, or data type, including: alpha and binary forms, C and COBOL numerics, EBCDIC, zoned decimal, floating point, currency, and Julian and multinational timestamps. For non-standard or encrypted data, CoSORT even supports user exits to perform special compare procedures. The same is true for nonstandard input and output sources and criteria. The usual input and output are from and to new or existing files, tape or optical devices, stdin/stdout (and pipes), and application programs.
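A user exit for special compare procedures can be pictured as a caller-supplied comparison function that the sort invokes in place of its built-in ordering. The Python sketch below shows that idea with the standard library's `functools.cmp_to_key`; the function `compare_part_numbers` and the sample data are hypothetical and say nothing about CoSORT's actual exit interface.

```python
from functools import cmp_to_key

def compare_part_numbers(a, b):
    """Hypothetical compare exit: order codes like 'A-10' by prefix, then numerically."""
    prefix_a, number_a = a.split("-")
    prefix_b, number_b = b.split("-")
    if prefix_a != prefix_b:
        return -1 if prefix_a < prefix_b else 1
    # Returns -1, 0, or 1 depending on the numeric comparison.
    return (int(number_a) > int(number_b)) - (int(number_a) < int(number_b))

codes = ["B-2", "A-10", "A-9", "B-11"]
custom_order = sorted(codes, key=cmp_to_key(compare_part_numbers))
plain_order = sorted(codes)  # plain string ordering, for contrast
```

With the custom compare, `A-9` sorts ahead of `A-10`, which plain string ordering would get wrong.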

The most popular of CoSORT's several standalone end-user utilities is its sort control language, SortCL. The SortCL interface uses familiar mainframe sort commands, but in a more intuitive and explicit SQL-based framework, with centralized data dictionaries and one's own symbolic field names. SortCL's cutting-edge record mapping technology performs precision field selection and extraction, multi-key comparisons, record grouping and filtering, advanced drill-down summary functions, horizontal mathematical and expression evaluations, field-level data type translations, and multi-output reformatting for report generation. Some of its speed and resource economy derives from performance of these functions in a single pass through the data.

The CoSORT package includes command line conversion tools to automatically build UNIX and NT SortCL scripts from MVS and other sort parms. It incorporates drop-in replacements for the Win32, UNIX /bin/sort (called sort), SAS System 7 (PROCsort), and Micro Focus COBOL sort verbs. A user-friendly interactive and batch interface provides on-line help. An open API for application development is included. The API supports direct C, COBOL, and FORTRAN calls to CoSORT's central sort engine. To facilitate balancing performance with system needs, CoSORT provides sophisticated resource tuning facilities. Script commands, environment variables and control files can be used to optimize CPU, memory and disk parameters.

Customizable ETL Language
For data warehouse and data mart applications, CoSORT's SortCL performs source data extraction, data cleansing, sorting, reformatting, data type conversion, aggregation, and indexing, all in a single pass. Most operational data in commercial and public sector enterprises reside internally in sequential flat files, mainframe (relational) database tables, or are imported from data tapes and transmissions generated externally. These historical databases are optimized for ad hoc queries and transactions, rather than for extraction. IRI's SortCL accepts multiple input files (large-scale tables or flat file data dumps), or records streaming through pipes, to perform conditional selection on records for downstream processes.

Beyond conditional include or omit criteria, additional record filtering functions can be used to "horizontally" select virtual records for sorting, reformatting, translation, aggregation, and output reporting. SortCL's data cleansing, though not as sophisticated as fuzzy logic tools dedicated to the task, includes conditional or unconditional elimination, reduction, or writing to an error file of duplicate records, headers, fields, and bytes. This data scrubbing increases the efficiency of downstream warehousing processes. CoSORT resolves complex conditions based on inter- and intra-field events, removes duplicates, and separates data into different outputs and structures.
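The include/omit and duplicate-handling steps described above can be sketched in a few lines of Python. This is an illustration under assumed names (`cleanse`, the sample rows, and the three output lists are all hypothetical), not SortCL syntax; it routes each record to a kept output, a duplicates output standing in for the error file, or an omitted output in one pass.

```python
def cleanse(records, include, key):
    """Hypothetical one-pass filter: apply an include condition, then divert duplicates."""
    seen = set()
    kept, duplicates, omitted = [], [], []
    for record in records:
        if not include(record):
            omitted.append(record)      # fails the include condition
        elif key(record) in seen:
            duplicates.append(record)   # would be written to an error file
        else:
            seen.add(key(record))
            kept.append(record)
    return kept, duplicates, omitted

rows = [
    ("1001", "east", 250),
    ("1002", "west", 0),
    ("1001", "east", 250),
    ("1003", "east", 75),
]
kept, duplicates, omitted = cleanse(
    rows,
    include=lambda r: r[2] > 0,   # keep only records with a positive amount
    key=lambda r: r[0],           # duplicates are judged on the ID field
)
```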

The parallel coroutine sort engine distributes the data across multiple CPUs to provide the fastest possible reordering of the virtual data based on specified key fields and collating sequences. The results are merged and prepared for output mapping.
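The split, sort, and merge pattern in the preceding paragraph can be sketched as follows. This is a generic illustration, not CoSORT's implementation: the function names are hypothetical, and Python threads merely stand in for multiple CPUs (in CPython they would not deliver a real speedup for this workload).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def parallel_sort(records, key, chunk_size=2, workers=2):
    """Hypothetical sketch: sort chunks concurrently, then merge the sorted runs."""
    chunks = list(chunked(records, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_runs = list(pool.map(lambda chunk: sorted(chunk, key=key), chunks))
    # heapq.merge combines already-sorted runs without re-sorting them.
    return list(heapq.merge(*sorted_runs, key=key))

merged = parallel_sort([5, 3, 8, 1, 9, 2], key=lambda value: value)
```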

SortCL expedites and eases the pain of large-scale conditional selection and reformatting. It uses intuitive command syntax and symbolic field references, along with references to centralized data dictionaries (where the metadata are defined and stored), to make output field and file layouts simple to declare and modify. Remapping from input to output includes field repositioning, resizing, padding, mathematical operations, as well as data type conversion and aggregation.
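As a rough picture of dictionary-driven remapping, the Python sketch below declares input and output layouts by symbolic field name and uses them to reposition, resize, and pad the fields of a fixed-width record. The layouts, field names, and helper functions are hypothetical and are not SortCL's data dictionary format.

```python
# Hypothetical input layout: field name -> (offset, length)
INPUT_LAYOUT = {"account": (0, 6), "name": (6, 10), "balance": (16, 8)}
# Hypothetical output layout: (field name, width, alignment), in output order
OUTPUT_LAYOUT = [("name", 12, "left"), ("account", 8, "right"), ("balance", 10, "right")]

def parse_fixed(record, layout):
    """Slice a fixed-width record into named fields, trimming padding."""
    return {name: record[start:start + length].strip()
            for name, (start, length) in layout.items()}

def format_fixed(fields, layout):
    """Write named fields in a new order, truncating or padding to each width."""
    parts = []
    for name, width, align in layout:
        value = fields[name][:width]
        parts.append(value.ljust(width) if align == "left" else value.rjust(width))
    return "".join(parts)

source = "000123" + "Smith".ljust(10) + "00004250"
fields = parse_fixed(source, INPUT_LAYOUT)
remapped = format_fixed(fields, OUTPUT_LAYOUT)
```

Changing an output layout then means editing one declaration rather than the code that applies it, which is the convenience the symbolic references are meant to provide.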

While fixed or variable position fields are mapped from input to output, their data can be relocated, resized, and converted by type; e.g., from EBCDIC to ASCII, or from mixed packed decimal to signed and zoned decimal. This eliminates the many mainframe binary forms undesirable for subsequent data propagation, mining and access tools running on open systems.
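Two of the conversions mentioned here are easy to show in miniature. The Python sketch below decodes EBCDIC text using the standard `cp037` codec (one common EBCDIC code page, chosen here as an assumption) and unpacks a packed decimal field using the usual sign-nibble convention (0xC positive, 0xD or 0xB negative, 0xF unsigned). The function names are hypothetical, and this is not how CoSORT performs the conversion internally.

```python
from decimal import Decimal

def ebcdic_to_text(raw):
    """Decode EBCDIC bytes to a Python string; cp037 is an assumed code page."""
    return raw.decode("cp037")

def unpack_packed_decimal(raw, scale=0):
    """Decode packed decimal: two digits per byte, sign in the final low nibble."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)          # last byte holds one digit plus the sign
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)  # apply the implied decimal point

text = ebcdic_to_text(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6]))
positive = unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]))
negative = unpack_packed_decimal(bytes([0x12, 0x34, 0x5D]), scale=2)
```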

Counts, sums, totals, averages, and maximum and minimum values based on multiple inter- and intra-record break conditions can be computed with SortCL to produce sophisticated EIS detail and drill-down summary reports. SortCL's aggregation is also widely used for ad hoc MIS reports.
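Break-level aggregation of this kind can be sketched in Python with `itertools.groupby`, which starts a new group whenever the break field changes in sorted input. The function `summarize`, the field names, and the sample data are hypothetical; this is an illustration of the technique, not SortCL itself.

```python
from itertools import groupby
from operator import itemgetter

def summarize(records, break_key, value_key):
    """Sort on the break field, then emit one summary row per break group."""
    ordered = sorted(records, key=itemgetter(break_key))
    summary = []
    for group, rows in groupby(ordered, key=itemgetter(break_key)):
        values = [row[value_key] for row in rows]
        summary.append({
            break_key: group,
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        })
    return summary

sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 40},
    {"region": "east", "amount": 300},
    {"region": "west", "amount": 60},
]
report = summarize(sales, break_key="region", value_key="amount")
```

Four detail rows collapse to two summary rows here, which is the same kind of volume reduction that makes aggregated data cheaper to load.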

Selecting, sorting, reformatting, and aggregating operational data prepares it for database repopulation not only qualitatively, but quantitatively as well: the amount of data going back in is vastly reduced. For example, 100 million rows can be aggregated down to 10 million, vastly improving the efficiency of a loading utility like Red Brick's PTMU. Future query and transaction times on the new enterprise warehouse are also reduced, since the table data are now in sorted order.

Rising to the Opportunities
The popularity of CoSORT in data warehousing is boosting the visibility of the company, according to vice president of business development, David Friedland. "IRI, Inc. (a.k.a., CoSORT) has been a profitable, privately held company since 1978," says Friedland. "We've grown to 20 outlets world wide caring for more than a thousand customers without any external capital. That, plus the fact that much of the product is silently embedded into other applications, has limited awareness of the CoSORT brand. Nevertheless, the market potential for CoSORT is too big not to more thoroughly exploit at this point. Industry trends and alliances we're continuing to form will combine to position the company and product line for more rapid growth over the next 6-18 months."

As to the future of data warehousing, IRI's McCaslin predicts trends toward better tools and industry consolidation. "CoSORT is already involved in Web logging and reporting warehouses because of the explosive growth of web traffic and on-line transaction volumes. So rather than the analysis of legacy archives, I think the growth in data sources and the need for advanced extraction and transformation will shift to EDI and more immediate trend/profile analyses."

Whatever the source of ingredients for the data warehouses and data marts of the future, CoSORT's developers at IRI are likely to be in the kitchen, cooking up essential tools to whip large volumes of data into digestible chunks and palatable forms.