Small Steps of Main to BIG

Serde for Cobol Layout to Hive table


Conversion of mainframe data to Hadoop distributed storage:

This idea poses two basic questions:

  1. How to transfer the data from mainframe to Hadoop? See more here
  2. Once transferred, how to interpret the data? See more here

Before answering the above questions, let's examine whether there is any advantage in going down this path.

Who is this useful for?

Currently, many banking and insurance companies are trying to implement their business models on Hadoop technologies. For almost all of these Hadoop projects there will be a source-system file that resides on the mainframe (hard fact) and needs to be brought over to Hadoop. The Sqoop team recognized this and tried to create a mainframe file-extraction process similar to its database import, but it is not sufficient. I will leave the demerits of Sqoop's mainframe extraction to this blog; my point here is to highlight the significance of the conversion.

This conversion is difficult because we need to establish a translator between two different eras of programming models. One talks EBCDIC and the other ASCII :-)

Pain-Areas:

Below is a short list I could come up with:

  1. Data representation conversion from EBCDIC format (mainframe) to ASCII (Java/Hadoop). This is complex because direct conversion fails in the presence of COMP-3 fields (a decoding sketch follows this list).
  2. Fields are separated by offset (mainframe) instead of delimiters (Hadoop).
  3. Dynamic array declarations on the mainframe are hard to convert: the array length is stored in a separate field that must be read before the actual array data. (For example: WS-FIELD OCCURS 1 TO 50 TIMES DEPENDING ON WS-FIELD-LENGTH.)
  4. REDEFINES on the mainframe is conceptually similar to UNION in C++ but differs a lot in implementation.
  5. Mainframe files come in variable-length and fixed-length formats, which need to be handled differently.
  6. The folder structure on mainframes is hard to visualize.
  7. File versioning on mainframes (GDG) is a cool feature but hard to handle in Hadoop.
  8. Last but not least, conversion of computational (COMP) fields.
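
To make points 1, 2, and 8 a bit more concrete, here is a minimal Java sketch. It is not the SerDe code itself: the class and method names are illustrative, the offsets are assumed, and it only shows how an offset-based text field can be decoded from EBCDIC (code page 037) and how a COMP-3 packed-decimal field unpacks.

import java.nio.charset.Charset;

public class EbcdicFieldSketch {
    // EBCDIC code page 037; the JDK ships it as "Cp037" / "IBM037" in its extended charsets
    private static final Charset EBCDIC = Charset.forName("Cp037");

    // Decode an EBCDIC text (PIC X) field sliced by offset, not by delimiter
    static String readText(byte[] record, int offset, int length) {
        return new String(record, offset, length, EBCDIC);
    }

    // Unpack a COMP-3 (packed-decimal) field: two digits per byte,
    // last nibble is the sign (0xD = negative, 0xC/0xF = positive)
    static long readComp3(byte[] record, int offset, int length) {
        long value = 0;
        for (int i = offset; i < offset + length; i++) {
            int high = (record[i] & 0xF0) >> 4;
            int low  =  record[i] & 0x0F;
            if (i < offset + length - 1) {
                value = value * 100 + high * 10 + low;
            } else {                       // last byte: one digit plus the sign nibble
                value = value * 10 + high;
                if (low == 0x0D) value = -value;
            }
        }
        return value;
    }
}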

What?! Aren't these enough issues to start working on?

Enough complaining... tell me what the industry is doing now!!

Hmm.. :-( Most of the industry adds extra layers between the mainframe and Hadoop to convert the formats. Some convert mainframe files to the desired format on the mainframe itself before transferring them. Others use tools like Informatica PowerExchange, Syncsort DMX-h, etc.

My Approach

Develop a custom SerDe that exhibits the properties below:

  1. The COBOL layout is supplied through TBLPROPERTIES (similar to the AvroSerDe), and the Hive table definition is built from it automatically.
  2. The deserializer extracts each field's data based on its offset at runtime.
  3. EBCDIC-to-ASCII conversion is handled internally (a rough skeleton of such a SerDe follows this list).
  4. ....
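
For orientation, here is a rough, simplified skeleton of what such a SerDe could look like against the Hive 1.x SerDe API. It is not the actual com.savy3.cobolserde.CobolSerde implementation: parseLayout() is a placeholder, every field is treated as a string, and a binary-aware InputFormat delivering BytesWritable records is assumed.

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class CobolSerDeSketch extends AbstractSerDe {

  private static final Charset EBCDIC = Charset.forName("Cp037");

  private final List<String> columnNames = new ArrayList<>();    // derived from the layout
  private final List<Integer> columnLengths = new ArrayList<>(); // byte length per field
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // The layout arrives exactly as written in TBLPROPERTIES ('cobol.layout'='...')
    String layout = tbl.getProperty("cobol.layout");
    parseLayout(layout); // placeholder: a real parser would walk the PIC/OCCURS clauses

    List<ObjectInspector> fieldInspectors = new ArrayList<>();
    for (int i = 0; i < columnNames.size(); i++) {
      fieldInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, fieldInspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    byte[] record = ((BytesWritable) blob).getBytes(); // raw EBCDIC bytes from the record reader
    List<Object> row = new ArrayList<>();
    int offset = 0;
    for (int len : columnLengths) {                     // slice by offset, not by delimiter
      row.add(new String(record, offset, len, EBCDIC)); // EBCDIC-to-Unicode conversion per field
      offset += len;
    }
    return row;
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException { return inspector; }

  @Override
  public Class<? extends Writable> getSerializedClass() { return Text.class; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    return null; // read-only sketch: writing back to EBCDIC is out of scope
  }

  @Override
  public SerDeStats getSerDeStats() { return null; }

  private void parseLayout(String layout) {
    // Placeholder. A real implementation would fill columnNames/columnLengths
    // from the 05/10-level items, their PIC clauses, and OCCURS ... DEPENDING ON.
  }
}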

Benefits:

  1. Easier migration from mainframe systems to Hadoop.
  2. Removal of additional layers.
  3. Faster processing time.
  4. Cost savings, because this approach is open source. YAYYY!!

Glimpse of final usage:

Mainframe input file (actually in EBCDIC format, which is unreadable; converted to ASCII here for example purposes):

Ram Manohar  6123123123123123123king
heheh        5012012012012012comment
Lipi         3001001001darling
Kanu         2006006loving

Cobol Layout:

01 WS-VAR.
   05 WS-NAME PIC X(12).
   05 WS-MARKS-LENGTH PIC 9(2).
   05 WS-MARKS OCCURS 0 TO 6 TIMES DEPENDING ON WS-MARKS-LENGTH.
      10 WS-MARK PIC 999.
   05 WS-NICKNAME PIC X(6).
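
As a quick sanity check of the offsets this layout implies, here is a small hypothetical Java fragment (not the SerDe's code; it works on the already-converted ASCII sample above, so no EBCDIC decoding is applied): 12 bytes of WS-NAME, 2 bytes of WS-MARKS-LENGTH, that many 3-byte WS-MARK entries, then WS-NICKNAME.

public class LayoutWalk {
    public static void main(String[] args) {
        // First sample record, already converted to ASCII for readability
        String record = "Ram Manohar  6123123123123123123king";

        String name = record.substring(0, 12);                               // WS-NAME         PIC X(12)
        int marksLength = Integer.parseInt(record.substring(12, 14).trim()); // WS-MARKS-LENGTH PIC 9(2)
        int pos = 14;
        int[] marks = new int[marksLength];
        for (int i = 0; i < marksLength; i++) {                              // WS-MARK, repeated DEPENDING ON the length field
            marks[i] = Integer.parseInt(record.substring(pos, pos + 3));
            pos += 3;
        }
        // WS-NICKNAME PIC X(6); trailing blanks were trimmed in the sample display
        String nickname = record.substring(pos, Math.min(pos + 6, record.length()));

        // Prints: Ram Manohar 6 [123, 123, 123, 123, 123, 123] king
        System.out.println(name.trim() + " " + marksLength + " "
                + java.util.Arrays.toString(marks) + " " + nickname);
    }
}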

Hive DDL:

CREATE TABLE Cobol2Hive
ROW FORMAT SERDE 'com.savy3.cobolserde.CobolSerde'
LOCATION '/home/hduser/hive/warehouse/ram.db/lolol'
TBLPROPERTIES ('cobol.layout'='01 WS-VAR. 05 WS-NAME PIC X(12). 05 WS-MARKS-LENGTH PIC 9(2). 05 WS-MARKS OCCURS 0 TO 6 TIMES DEPENDING ON WS-MARKS-LENGTH. 10 WS-MARK PIC 999. 05 WS-NICKNAME PIC X(6).');

Output:

select * from Cobol2Hive;
OK
ws_name       ws_marks_length ws_mark ws_mark_1 ws_mark_2 ws_mark_3 ws_mark_4 ws_mark_5 ws_nickname 
Ram Manohar     6               123     123       123       123       123       123       king
heheh           5               12      12        12        12        12        null      comment
Lipi            3               1       1         1         null      null      null      darling
Kanu            2               6       6         null      null      null      null      loving


Comments

Want to leave a comment? Visit this post's issue page on GitHub (you'll need a GitHub account. What? Like you already don't have one?!).