...a Disc Interchange technical article
Mainframe to PC Conversion Issues
Mainframe files can use concepts foreign to PC languages and applications. To ensure a smooth data conversion, review the layouts of your mainframe files and decide how to deal with any fields or data types your PC application can't understand directly.
Mainframe-type files can be written to any tape that supports variable block mode (which allows blocks of differing sizes to be written to a single tape). This includes 4mm, 8mm, AIT, LTO, DLT and TK, some QIC, and others, but the most common are 9-track, 3480/90, 3490E, 3570, and 3590 tapes.
A 2400 foot 9-track tape can hold about 42 MB maximum at 1600 BPI, and 160 MB at 6250 BPI. A 3480 can hold about 210 MB maximum. 3490 is a compressed 3480 and typically holds about 400 MB. A 3490E typically holds about 1600 MB. 3570s hold between 5 and 21 GB, and 3590s hold from 10 GB to 180 GB, depending on the drive model (B, E, or H), the tape length, and compression. More details are provided on the Identifying Media detail pages.
Files are recorded the same way on all of these media. Any of these tapes can contain multiple files, and a single file can span multiple tapes (a multivolume set).
Other media are described on the Identifying Media page.
Mainframe Tape Formats
The tape format describes how data is recorded on the tape. It has two main components: fixed-block or variable-block recording, and labeled or unlabeled files.
- Fixed Block: Fixed block tapes are by far the most common. Fixed-length data records are written to tape in groups, resulting in a tape where all the blocks (except the last) are of the same size.
- Variable Block: Variable block tapes write records of varying size to tape blocks which therefore also vary in size. There are several methods for doing this.
- Labeled Tape: Each data file on a labeled tape is preceded by a special file called a "header label", and is followed by a "trailer label" of a similar type. These labels contain information about the data file they bracket: the DSN (file name), record size, block size, creation date, and more. The label also tells you the type of file and blocking: Fixed, Variable, or Undefined.
- Unlabeled Tape: Unlabeled tapes omit the labels and write just the raw data to tape, again in either Fixed block or Variable block files. As such, there is no file name or record size information on the tape.
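As a rough illustration of fixed-block recording, the sketch below splits one tape block back into its fixed-length records. The function name `deblock_fixed` and the sample data are hypothetical. The record length is assumed to come from the tape label or, for an unlabeled tape, from the file layout.

```python
def deblock_fixed(block: bytes, record_length: int) -> list:
    """Split one fixed-block tape block into its fixed-length records."""
    if record_length <= 0:
        raise ValueError("record length must be positive")
    if len(block) % record_length != 0:
        raise ValueError("block size is not a multiple of the record length")
    return [block[i:i + record_length] for i in range(0, len(block), record_length)]


# Three 4-byte records written to tape as one 12-byte block.
records = deblock_fixed(b"AAAABBBBCCCC", 4)
print(records)
```

The last block of a file may be shorter than the others, but it should still be a whole number of records, which is what the length check relies on.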
Data (Field) Types
Many mainframe data types are not compatible with PC data types. If all the fields in the mainframe record are of type "character", or "alpha-numeric", meaning the entire file is composed of the letters A-Z, the digits 0-9, spaces and punctuation (i.e. no binary types), then a simple EBCDIC to ASCII conversion will usually work. But mainframe numbers are often stored in a binary format. The most common record types and field types are described below.
- Fixed Length Records: Most mainframe data is stored in a fixed-field, fixed-record format, where every field, and therefore every record, is fixed in width. There are no delimiters between either fields or records. Most PC databases can import this type of file if we append a record delimiter to the mainframe data when we do the conversion. This is generally the most direct and least expensive way to convert this data.
- Delimited: A very popular record type for PCs, this trims trailing spaces from each field and puts a delimiter, usually a comma or tab, between fields to mark the end of the field. Records are usually delimited with CR-LF (carriage-return, line-feed). Delimited records are almost never found on a mainframe.
- Databases: Database programs store data internally in many proprietary formats. Most programs will have import and export functions to read in standard data types and write out standard data types.
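A minimal sketch of the fixed-length to delimited conversion described above, assuming every field is a character field. The 20-byte layout, the field names, and the choice of EBCDIC code page `cp037` are illustrative assumptions; your file's layout and code page may differ.

```python
import csv
import io

# Hypothetical layout: (field name, start offset, length) within a 20-byte record.
LAYOUT = [("last_name", 0, 12), ("first_name", 12, 8)]
RECORD_LENGTH = 20


def fixed_to_delimited(data: bytes, codec: str = "cp037") -> str:
    """Convert fixed-length EBCDIC character records to comma-delimited text."""
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("file length is not a multiple of the record length")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\r\n")
    for pos in range(0, len(data), RECORD_LENGTH):
        # Safe only because every byte is a character: one byte decodes to one character.
        record = data[pos:pos + RECORD_LENGTH].decode(codec)
        # Trim the trailing padding from each field, as delimited files do.
        writer.writerow([record[start:start + length].rstrip()
                         for _, start, length in LAYOUT])
    return out.getvalue()


sample = ("Smith".ljust(12) + "John".ljust(8)
          + "Jones".ljust(12) + "Mary".ljust(8)).encode("cp037")
csv_text = fixed_to_delimited(sample)
print(csv_text)
```

Decoding the whole record at once only works when the record has no binary or packed fields. Those need field-by-field conversion from the layout.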
- Alpha-Numeric, or Character fields: These fields are composed of only letters, punctuation, and the digits 0-9 represented as characters. Mainframe character fields are in EBCDIC, and can be converted to ASCII for a PC without loss of information by a simple translation table.
- Binary fields: Binary fields can be integer fields, floating point fields, bit or coded fields, and other types. Mainframe binary values are not usually stored the same way as PC binary. To convert these we need to know the type of binary field, the number of bytes or words, and the byte and word order.
- COBOL comp fields: These are also binary fields, and the exact type is dependent on both the compiler and the CPU of the computer. We need the same information as for binary fields.
- COBOL comp-3 fields: Also called "packed fields", this is a standard COBOL numeric data type that stores ("packs") two digits into each byte. The last nybble (half byte) of the field is the sign. This format is standard across compilers and CPUs.
- IBM Signed fields: Also called "Zoned", these fields overpunch the sign onto the last (or first) digit of the field. The rest of the field is numeric (character) data. These fields should not be converted to ASCII with a translation table because of the sign overpunch.
- Leading sign numeric: This is the standard numeric data type on a PC. It is composed of a leading sign, and all the digits are regular characters. For example, "-12345", or "+12345", or " 12345". COBOL "display" fields are also of this type.
- Implied Decimal: Implied decimal can apply to any kind of numeric field (character or binary), and simply means there is a decimal point implied at a specified location, but not actually stored in the file. For example, the number 123 with an implied decimal of two digits represents the actual value 1.23. Using implied decimal saves space in the file.
If your mainframe data includes any of these field types other than alpha-numeric, we will need a layout in order to write a program to convert them, and we will need to know what data type to convert each field to.
Aside from the technical issues of the data type, there is the representation of the data. We will look at three common situations, with the intent of getting you to think about the data you will be working with after the conversion. If you can identify data that needs to be altered or cleaned up prior to ordering your conversion, we may be able to perform that work at a lower cost as part of the conversion than if we do it as a separate job afterwards. This is only a sampling of the many possibilities.
- Coded Fields: COBOL programmers, and others, sometimes assign binary codes or bit patterns to a field (usually a 1 byte field). For example, hex 00 may represent a certain customer status, hex 01 another status, hex 02 another, etc. Because these are binary codes they need to be converted for most PC applications.
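The packed, zoned, binary, and coded fields described above might be decoded along the following lines. This is a sketch under stated assumptions, not a general converter: the sign nybble values follow the usual IBM convention (D or B negative, anything else positive), zoned fields are assumed to carry the sign on the last digit, binary integers are assumed to be big-endian two's complement, and the function names and the `STATUS_CODES` table are hypothetical.

```python
from decimal import Decimal

NEGATIVE_SIGNS = (0x0B, 0x0D)  # assumed IBM convention; other sign values are positive


def _to_decimal(digits, negative, scale):
    """Build a Decimal from digit values, a sign, and an implied decimal count."""
    if any(d > 9 for d in digits):
        raise ValueError("field contains a non-decimal digit")
    value = int("".join(str(d) for d in digits))
    return Decimal(-value if negative else value).scaleb(-scale)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed (comp-3) field: two digits per byte, sign in the last nybble."""
    nybbles = []
    for byte in raw:
        nybbles.append(byte >> 4)
        nybbles.append(byte & 0x0F)
    sign = nybbles.pop()
    return _to_decimal(nybbles, sign in NEGATIVE_SIGNS, scale)


def unzone(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an EBCDIC zoned field whose sign is overpunched on the last digit."""
    digits = [byte & 0x0F for byte in raw]
    zone = raw[-1] >> 4
    return _to_decimal(digits, zone in NEGATIVE_SIGNS, scale)


def binary_to_int(raw: bytes) -> int:
    """Decode a signed big-endian binary integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)


# Hypothetical coded field: one-byte binary codes mapped to readable values.
STATUS_CODES = {0x00: "ACTIVE", 0x01: "CLOSED"}

packed = unpack_comp3(bytes.fromhex("12345D"), scale=2)   # two implied decimals
zoned = unzone(bytes.fromhex("F1F2F3F4D5"), scale=2)
binary = binary_to_int(bytes.fromhex("FFFFFFFE"))
print(packed, zoned, binary, STATUS_CODES[0x01])
```

The `scale` argument carries the implied decimal position from the layout, since nothing in the data itself records it.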
- Dates: Dates can be stored many ways. For example, a date of February 1st 2007 could be represented like this:
MMDDYY like 020107
MMDDYYYY like 02012007
DDMMYY like 010207
DDMMYYYY like 01022007
YYYYMMDD like 20070201
YYDDD like 07032
YYYYDDD like 2007032
The last two are called Julian dates, where the DDD is the day of the year, from 1 to 365 (366 in a leap year). With Y2K issues we started to see 4-digit years in Julian dates. Most PC applications don't understand Julian dates, so you may want us to convert them to Gregorian dates.
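Converting a Julian (day-of-year) date to a calendar date can be sketched as follows. The function name and the `pivot` rule for two-digit years are assumptions made for illustration. The right century window depends on your data.

```python
from datetime import date, timedelta


def julian_to_gregorian(text: str, pivot: int = 50) -> date:
    """Convert a YYDDD or YYYYDDD day-of-year date to a calendar date."""
    if len(text) == 7:
        year, day = int(text[:4]), int(text[4:])
    elif len(text) == 5:
        yy, day = int(text[:2]), int(text[2:])
        # Assumed windowing rule: YY below the pivot means 20YY, otherwise 19YY.
        year = 2000 + yy if yy < pivot else 1900 + yy
    else:
        raise ValueError("expected YYDDD or YYYYDDD")
    result = date(year, 1, 1) + timedelta(days=day - 1)
    if result.year != year:
        raise ValueError("day of year out of range")
    return result


print(julian_to_gregorian("2007032").strftime("%m%d%Y"))  # 02012007
print(julian_to_gregorian("07032").isoformat())           # 2007-02-01
```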
- Parsing: In this context, parsing is the process of splitting the elements of one field into separate fields. For example, if you have a list of names that you want to sort by last name but the name field is a full name (e.g. "John Smith"), then you need to parse the full name field into first name and last name fields.
Likewise, if you need to sort a list by ZIP code for a bulk mailing, but the city, state, and ZIP code are all in one field, you need it parsed into separate fields for city, state, and ZIP code.
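A small sketch of both kinds of parsing, assuming very regular input. The function names, the pattern, and the sample address are hypothetical. Real name and address data usually needs more rules (suffixes, multi-word surnames, foreign addresses).

```python
import re

# Assumed input shape: "City, ST 12345" or "City ST 12345-6789" (US addresses only).
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>.+?),?\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_city_state_zip(text: str):
    """Return (city, state, zip), or None if the field does not fit the pattern."""
    match = CITY_STATE_ZIP.match(text.strip())
    if match is None:
        return None
    return match.group("city"), match.group("state"), match.group("zip")


def parse_full_name(full_name: str):
    """Treat the last word as the last name and everything before it as the first name."""
    first, _, last = full_name.strip().rpartition(" ")
    return first, last


name = parse_full_name("John Smith")
place = parse_city_state_zip("Springfield, IL 62701")
print(name, place)
```

Returning `None` for fields that do not match lets you list the exceptions for manual review instead of guessing.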
This only touches on the possibilities. If you review your data you will likely find some things that need improvement. Call us to see if we can improve your data.
Redefined Fields and Records
- Redefined Fields: Mainframe languages, especially COBOL, often reuse, or "redefine", an area in a record to save space. A common example is a mailing list where the addressee may be either a person or a company, but never both.
To include both an individual name field and a company name field would waste space, since only one of them would ever be filled, so the name field can be reused (redefined) as company name. Further, the individual name is usually composed of two fields, last name and first name, so for example, bytes 1-12 might be last name, and bytes 13-20 first name. But when redefined, bytes 1-20 would be the company name. There should be another field in the record that indicates which definition (individual or company name) is used in this particular record. Most PC applications do not deal with this well, especially when the field boundaries are different.
For example, take two records, one with an individual's name of "Smith John " and the other with the company name "Disc Interchange ". If you ignore the redefinition and treat the field as the company definition, then "Disc Interchange" will be correct, but the mail to John Smith will be addressed to "Smith John". If you treat the fields as name fields and put the first name before the last name, then the name will be correct, like "John Smith", but the company name will get scrambled, like "ange Disc Interch". If your application can't deal with this, we can convert the data to a record with both individual name fields and company name fields.
Often the redefined fields are of a different type altogether, for example when a character field is redefined as a binary field. This is much more serious than the example above, because the original field and the redefined field require different conversions (character and binary).
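The name example above can be sketched as a conversion that writes separate individual and company fields. The function name is hypothetical, and the `is_company` flag stands in for whatever indicator field your layout actually provides.

```python
def split_name_area(area: str, is_company: bool) -> dict:
    """Split the redefined 20-character name area into separate output fields."""
    if len(area) != 20:
        raise ValueError("name area must be exactly 20 characters")
    if is_company:
        return {"last_name": "", "first_name": "", "company_name": area.rstrip()}
    return {
        "last_name": area[:12].rstrip(),    # bytes 1-12
        "first_name": area[12:].rstrip(),   # bytes 13-20
        "company_name": "",
    }


person = split_name_area("Smith".ljust(12) + "John".ljust(8), is_company=False)
company = split_name_area("Disc Interchange".ljust(20), is_company=True)
# Reading the company record with the individual definition scrambles it.
wrong = split_name_area("Disc Interchange".ljust(20), is_company=False)
print(person, company)
print(wrong["first_name"], wrong["last_name"])  # ange Disc Interch
```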
- Redefined Records: Complex data sets usually cannot store all their data in just one record type, so they have multiple record types.
For example, medical records may have one record type to identify a patient (name, address, etc.), another record type for treatment data, and a third for payment information. These could be stored in three files, or in one. If they are stored in one file, then that file has "multiple record types", or "redefined records".
PC databases can make use of relational tables (or files), but usually can't deal with all three record types in one file. DISC can split the data into three files so you can build a relational database on your PC.
For a number of reasons, data from mainframes will often contain "junk". That junk is often random binary values that will cause problems when brought into your PC database. There are four primary reasons for this junk:
- Databases that are not properly initialized when created can have literally any value in a byte.
- Unused fields are often initialized to nulls (hex 00), and if never populated they remain as nulls.
- It's common practice to reserve spare space in "filler" fields. Filler fields are commonly not initialized, and can therefore contain anything.
- Sometimes when you get a file there will be fields in the file for "internal use" that are not specified on the layout. Since these are not specified, they could be anything, and are often binary values.
This binary junk can cause a number of problems, from funny characters in your data to crashing your database. One of the most serious is a control-Z (1A hex) in a file; this signifies end-of-file to many PC applications, so the database will stop importing the file when it sees a control-Z.
DISC has written several programs to scan your files to catch these problems and fix them before they cause you any grief. We routinely scan all jobs for control codes, bytes with the high bit set, irregular records (short or long records, a CR or LF in the middle of a record), control-Z, and other problems. We don't just blindly convert your file.
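The kind of scan described above might look like the following. This is a simplified illustration, not DISC's actual programs. It assumes the data has already been translated to ASCII and consists of fixed-length records with no delimiters. The function name and sample data are hypothetical.

```python
def scan_records(data: bytes, record_length: int) -> list:
    """Return (record number, offset, problem) tuples for suspicious bytes."""
    findings = []
    for number, start in enumerate(range(0, len(data), record_length)):
        record = data[start:start + record_length]
        if len(record) < record_length:
            findings.append((number, None, "short record"))
        for offset, byte in enumerate(record):
            if byte == 0x1A:
                findings.append((number, offset, "control-Z"))
            elif byte in (0x0A, 0x0D):
                findings.append((number, offset, "CR or LF inside record"))
            elif byte < 0x20:
                findings.append((number, offset, "control code"))
            elif byte >= 0x80:
                findings.append((number, offset, "high bit set"))
    return findings


sample = b"AB\x1aCD" + b"EF\x00GH" + b"IJ\xffKL" + b"MN\nOP" + b"QR"
for finding in scan_records(sample, 5):
    print(finding)
```

Reporting the record number and offset makes it possible to check each hit against the layout before deciding whether to blank it, translate it, or leave it alone.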