Small Steps of Main to BIG

Serde for Cobol Layout to Hive table

<< 02 Transfer from Mainframe to HDFS >> 04 Troubleshooting

Cobol-to-Hive Serde

The Cobol-to-Hive serde requires:

  1. Mainframe files to be in their raw format (transfer the file as binary). See here for more
  2. The COBOL layout to be provided

DDL for using this serde

Purpose

A simple Hive CREATE TABLE statement suffices to use the serde. Column names are created automatically from the COBOL layout provided. The mainframe file format, FB or VB, needs to be specified in the DDL via the input format.

For an FB file, use INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'; the fixed record length also needs to be specified with the 'fb.length' property in TBLPROPERTIES.

For a VB file, use INPUTFORMAT 'com.savy3.mapred.MainframeVBInputFormat'.

Example

FB file -


ADD JAR |path|/CobolSerde.jar;
CREATE TABLE Cobol2Hive
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe' 
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/home/hduser/hive/warehouse/ram.db/lolol'
TBLPROPERTIES ('cobol.layout.url'='/namenode/user/ram/cobol_layout/maincobol.copybook',
                      'fb.length'='450');

VB file -


ADD JAR |path|/CobolSerde.jar;
CREATE TABLE Cobol2Hive
ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe' 
STORED AS
INPUTFORMAT 'com.savy3.mapred.MainframeVBInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/home/hduser/hive/warehouse/ram.db/lolol'
TBLPROPERTIES ('cobol.layout.url'='/namenode/user/ram/cobol_layout/maincobol.copybook');

How to provide Cobol layout/copybook

The current version supports two options:

cobol.layout.url -- the COBOL layout/copybook must be in HDFS, and its location is provided with this option. Example: TBLPROPERTIES ('cobol.layout.url'='/namenode/user/ram/cobol_layout/maincobol.copybook');

cobol.layout.literal -- the COBOL layout is typed directly into the DDL statement; there is a limit of 100 characters. Example:

TBLPROPERTIES ('cobol.layout.literal'='01 WS-VAR. 05 WS-NAME PIC X(12). 05 WS-MARKS-LENGTH PIC 9(2). 05 WS-marks OCCURS 0 to 6 TIMES DEPENDING ON WS-MARKS-LENGTH. 10 WS-MARK PIC 999. 05 WS-NICKNAME PIC X(6)');

How is the COBOL copybook interpreted?

Identifying the cobol fields

Since COBOL fields in a layout can span multiple rows, the input layout is split on the character combination ". " to obtain each field description.
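As a rough sketch of this splitting step (in Python, not the serde's actual Java code):

```python
# Sketch of splitting a copybook string into individual field
# descriptions. Each field definition ends with a period, so splitting
# on ". " yields one description per field even when the layout
# spans multiple rows.
layout = ("01 WS-VAR. 05 WS-NAME PIC X(12). "
          "05 WS-NICKNAME PIC X(6).")

fields = [f.strip().rstrip('.') for f in layout.split('. ') if f.strip()]
print(fields)
# ['01 WS-VAR', '05 WS-NAME PIC X(12)', '05 WS-NICKNAME PIC X(6)']
```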

Unsupported characters in field names

COBOL field names may contain "-", which is not supported in Hive column names, so it is converted to "_". For example, WS-NAME is converted to WS_NAME in Hive.
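The conversion rule is a straight character substitution, sketched here in Python (illustrative, not the serde itself):

```python
# Hive column names cannot contain "-", so each "-" in a COBOL field
# name is replaced with "_".
def to_hive_column(cobol_name):
    return cobol_name.replace('-', '_')

print(to_hive_column('WS-NAME'))  # WS_NAME
```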

Duplicate field names

COBOL fields are preceded by a level number, and every field with a level number is converted to a corresponding column. Duplicate names are possible in a COBOL layout; in such scenarios the serde creates a new column name by suffixing a number. For example:

... 
05 FILLER PIC X(10).
05 FILLER PIC X(20).

will be converted to FILLER and FILLER_1 columns in Hive.
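The suffixing rule can be sketched as follows (illustrative Python, not the serde's Java implementation):

```python
# Sketch of the duplicate-name rule: the first occurrence keeps its
# name; each repeat gets a numeric suffix (FILLER, FILLER_1, ...).
def dedupe(names):
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}_{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

print(dedupe(['FILLER', 'FILLER']))  # ['FILLER', 'FILLER_1']
```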

Redefines clause

A COBOL layout can contain a REDEFINES clause, which is similar to a union. For simplicity, separate columns are created for both definitions. For example:

05 WS-RED PIC X(4).
05 WS-VAR REDEFINES WS-RED.
   10 WS-VAR1 PIC X(2).
   10 WS-VAR2 PIC X(2).

will be converted to


WS_RED  string, 
WS_VAR1 string, 
WS_VAR2 string

If the value of WS-RED in the file is "CHAR", the table will have data as below:

WS_RED  WS_VAR1  WS_VAR2
CHAR    CH       AR
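Since WS-VAR redefines the same four bytes as WS-RED, the serde effectively reads the same region of the record twice. A sketch in Python of how "CHAR" yields those three values:

```python
# Sketch: a REDEFINES field re-reads the same bytes at the same offset,
# so the redefined columns overlap the original column's region.
record = "CHAR"

ws_red  = record[0:4]   # WS-RED  PIC X(4) -> whole region
ws_var1 = record[0:2]   # WS-VAR1 PIC X(2) -> first half (redefined)
ws_var2 = record[2:4]   # WS-VAR2 PIC X(2) -> second half (redefined)
print(ws_red, ws_var1, ws_var2)  # CHAR CH AR
```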

OCCURS clause

A COBOL layout can contain an OCCURS clause, which is similar to an array. For simplicity, each instance is converted to a separate column based on the number of instances. For example:

05 WS-marks OCCURS 0 to 6 TIMES DEPENDING ON WS-MARKS-LENGTH.
   10 WS-MARK PIC 999.
05 WS-NICKNAME PIC X(6).

will be converted to

WS_MARK     int, 
WS_MARK1    int,
WS_MARK2    int,
WS_MARK3    int,
WS_MARK4    int,
WS_MARK5    int,
WS_NICKNAME string    
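The expansion can be sketched as follows (illustrative Python; the suffix scheme is taken from the example above, where the first instance is unsuffixed):

```python
# Sketch of OCCURS expansion: a field occurring up to n times becomes
# n separate Hive columns; the first keeps the base name and the rest
# are suffixed 1..n-1.
def expand_occurs(name, times):
    hive = name.replace('-', '_')
    return [hive if i == 0 else f"{hive}{i}" for i in range(times)]

print(expand_occurs('WS-MARK', 6))
# ['WS_MARK', 'WS_MARK1', 'WS_MARK2', 'WS_MARK3', 'WS_MARK4', 'WS_MARK5']
```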

Field Type Conversions

COBOL PICTURE clause definitions are used to determine the corresponding Hive data types. Below is the mapping, where m and n are digit counts:

Mainframe             Hive
PIC X(n)              string
PIC A(n)              string
PIC 9(n)              tinyint if n < 3
                      smallint if n < 5
                      int if n < 10
                      bigint if n < 19
                      string if n >= 19
PIC 9(m)V9(n)         decimal(m,n)
PIC 9(n) COMP-3       tinyint if n < 3
                      smallint if n < 5
                      int if n < 10
                      bigint if n < 19
                      string if n >= 19
PIC 9(m)V9(n) COMP-3  decimal(m,n)
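An illustrative sketch of applying this mapping (Python, not the serde's Java implementation; the tinyint cutoff of n < 3 is reconstructed from the threshold pattern in the table):

```python
import re

# Map a PICTURE clause string to a Hive type, following the table above.
def pic_to_hive(pic):
    # Decimal forms: PIC 9(m)V9(n), with or without COMP-3
    m = re.fullmatch(r'PIC 9\((\d+)\)V9\((\d+)\)(?: COMP-3)?', pic)
    if m:
        return f'decimal({m.group(1)},{m.group(2)})'
    # Integer forms: PIC 9(n), with or without COMP-3
    m = re.fullmatch(r'PIC 9\((\d+)\)(?: COMP-3)?', pic)
    if m:
        n = int(m.group(1))
        if n < 3:  return 'tinyint'
        if n < 5:  return 'smallint'
        if n < 10: return 'int'
        if n < 19: return 'bigint'
        return 'string'
    # Alphanumeric/alphabetic: PIC X(n), PIC A(n)
    if re.fullmatch(r'PIC [XA]\(\d+\)', pic):
        return 'string'
    return 'string'  # fallback for clauses not covered in this sketch

print(pic_to_hive('PIC 9(4)'))       # smallint
print(pic_to_hive('PIC 9(5)V9(2)'))  # decimal(5,2)
```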


Updates

The code has been updated; below are the changes.


For easy reference, the jar is located at https://github.com/rbheemana/Cobol-to-Hive/tree/gh-pages/target
  1. Use com.savy3.hadoop.hive.serde3.cobol.CobolSerDe
  2. The new serde generates cobol.layout.generated, which shows how the serde parsed the layout; useful for verifying COBOL layout discrepancies
  3. The new serde generates "cobol.hive.mapping", which displays the mapping of COBOL fields to Hive columns and each field's length in the mainframe file; useful for debugging
  4. Fixed COMP-3 issues
  5. Added more debug information when the serde throws an exception
  6. Code structure modified for easier readability and future changes
  7. Column comments are updated with the corresponding COBOL field, offset, and length

Comments

Want to leave a comment? Visit this post's issue page on GitHub (you'll need a GitHub account. What? Like you already don't have one?!).