Transform and Clean Big Data in the Same Pass
Challenges:
Data cleansing can be complicated, time-consuming, and expensive. The data quality functions inside your tools may not satisfy your business rules or do the whole job. Custom functions may have to run in separate batch steps, or within a special "script transform component" that you must connect to your tool's data flow and run in smaller chunks. When data volumes are large, cleansing times can really add up. The bottom line? If you have more than one million rows, you may find that improving data quality is an inefficient or cumbersome process.
Solutions:
CoSort's SortCL tool can scrub many large files at the same time it is transforming, protecting, and/or reporting on them. Native scrubbing functions you can perform and combine include:
• de-duplication
• character validation
• data homogenization
• find (scan) and replace
• horizontal and conditional vertical selection
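To make the idea of combining these functions in one pass concrete, here is a minimal sketch in Python. It is not SortCL syntax and does not reflect how CoSort works internally; the field names (id, name, state, phone) and the rules applied to them are assumptions chosen only for illustration.

```python
import re

# Hypothetical lookup used for homogenization (an assumption for this sketch).
STATE_ALIASES = {"FLORIDA": "FL", "FLA": "FL"}

def scrub(rows):
    """Clean an iterable of dict records in a single pass over the data."""
    seen = set()
    for row in rows:
        # Horizontal selection: drop rows that have no id at all.
        if not row.get("id"):
            continue
        # Character validation: accept only all-digit ids.
        if not row["id"].isdigit():
            continue
        # Data homogenization: map state spellings to one canonical form.
        state = row.get("state", "").strip().upper()
        state = STATE_ALIASES.get(state, state)
        # Find and replace: collapse runs of whitespace in the name.
        name = re.sub(r"\s+", " ", row.get("name", "")).strip()
        # De-duplication: skip records whose cleaned key was already seen.
        key = (row["id"], name.lower())
        if key in seen:
            continue
        seen.add(key)
        out = {"id": row["id"], "name": name, "state": state}
        # Conditional vertical selection: keep the phone column for FL rows only.
        if state == "FL" and row.get("phone"):
            out["phone"] = row["phone"]
        yield out
```

Because scrub is a generator, each record is read once, cleaned, and handed on, so no separate batch step is needed for any individual rule.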
For advanced cleansing (based on complex business rules) at the field level, you can plug in your own functions or those in data quality vendor libraries. SortCL now supports custom transformations during the inrec or outfile phases of your job script. This means you can declare a cleansing function for any field in either place (i.e., up to two DQ routines per field, per job). One example in the CoSort documentation uses a Melissa Data address standardization library.
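The "up to two routines per field" idea can be sketched generically in Python. This is not SortCL job script syntax, and the names run_job, on_input, and on_output are hypothetical; the two hook points merely stand in for an input-side phase and an output-side phase, and the small string functions stand in for a vendor library call.

```python
import re

def collapse_spaces(value):
    """Stand-in cleansing routine: normalize internal whitespace."""
    return " ".join(value.split())

def expand_street(value):
    """Stand-in cleansing routine: spell out the abbreviation 'St'."""
    return re.sub(r"\bSt\b\.?", "Street", value)

def run_job(records, transform, on_input=None, on_output=None):
    """Apply per-field hooks before and after the job's own transform."""
    on_input = on_input or {}
    on_output = on_output or {}
    for rec in records:
        rec = dict(rec)  # work on a copy so the caller's data is untouched
        # Input-side hooks: at most one routine per field.
        for field, fn in on_input.items():
            if field in rec:
                rec[field] = fn(rec[field])
        # The job's regular transformation step.
        rec = transform(rec)
        # Output-side hooks: a second routine may target the same field.
        for field, fn in on_output.items():
            if field in rec:
                rec[field] = fn(rec[field])
        yield rec

def add_label(rec):
    """Example transform: derive a mailing label from two fields."""
    return {**rec, "label": f'{rec["street"]}, {rec["city"]}'}
```

For example, passing on_input={"street": collapse_spaces} and on_output={"street": expand_street} cleans the street field twice within one pass: once before the label is derived and once before the record is written.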
The bottom line? With CoSort and the data quality library functions you have, you can cleanse your data in the same I/O in which you filter, transform, protect, and/or present it.
See also:
Select/Filter
Custom Transforms
Products > CoSort > SortCL
"Data scrubbing is about as enjoyable as cleaning an encrusted frying pan with a worn sponge. Specialized cleansing tools can be expensive, ranging in price from $20,000 to $300,000, depending on the scope of the project and the systems involved..."
John Edwards, CIO Magazine
Available Melissa Data Cleansing Functions:
• AddressObject
• AddressDoctor
• CleanAddress
• NameObject
• DQ*Plus
• DPV
• PhoneObject
• GeoCoder Object
• RBDI
• PersonatorAPI
• RightFielderAPI
• DoubleTakeAPI
1-800-333-SORT
1-321-777-8889