Reality of sqoop-mainframe

Sqoop-mainframe

Syncsort submitted a patch (SQOOP-1272) that extends Sqoop to transfer data from the mainframe to Hadoop, allowing multiple mainframe data sets to be moved to HDFS in parallel.

How to use the utility is detailed at: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_mainframe_literal

What is missing?

However, based on my validation dated 10/27/2015, below are some of the important features that are missing.

  1. Sqoop-mainframe can only handle fixed-length files; variable-length files are not supported.
  2. Only folder-like structures can be downloaded, i.e., a PDS or a fixed-length GDG as a whole.
  3. EBCDIC to ASCII conversion returns invalid data when the records contain computational fields.

If your target file needs to be in Hive, I would suggest an alternate approach using a Hive custom SerDe.

GDG Issue

To be more specific, let's look at a common scenario. A mainframe system records its daily/monthly activity in a GDG. Usually a new generation of the GDG is created for the current day/month, while the previous day/month stays in an earlier generation. An example of such a dataset on the mainframe looks like this:

   MF.FBFILE.GDG          -- Base
   MF.FBFILE.GDG.G0001V00  -- 10/27/2015 data
   MF.FBFILE.GDG.G0002V00  -- 10/28/2015 data
   MF.FBFILE.GDG.G0003V00  -- 10/29/2015 data
   MF.FBFILE.GDG.G0004V00  -- 10/30/2015 data
   MF.FBFILE.GDG.G0005V00  -- 10/31/2015 data

The sqoop-mainframe utility will only let us download the entire GDG, meaning all the data (5 days in our example). But in most of our scenarios we are interested only in the latest file, today's data. Sqoop-mainframe will not let us pull a single generation; it is all or none. Mainframe systems usually store a year's worth of data in these GDGs, which makes it wasteful to pull every generation when we actually need a single day's data. In mainframe terminology, we can use MF.FBFILE.GDG(0) to get the latest generation each day and MF.FBFILE.GDG(-1) to get the previous day's data, and so on. This convention is not supported in sqoop-mainframe, which forces us to write extra logic to resolve the absolute generation name (such as G0005V00).
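
Until relative references are supported, one option is to resolve the absolute generation name ourselves before calling sqoop-mainframe. Below is a minimal Python sketch, assuming we can already obtain the list of generation dataset names from the mainframe (for example from a catalog listing); the function name and the GDG(0)/GDG(-1)-style indexing are illustrative and not part of sqoop-mainframe.

    import re

    def resolve_generation(dataset_names, relative=0):
        """Map a relative GDG reference (0, -1, -2, ...) to an absolute
        generation name such as MF.FBFILE.GDG.G0005V00.

        Assumes the generation numbers have not rolled over past G9999.
        """
        pattern = re.compile(r"\.G(\d{4})V(\d{2})$")
        generations = []
        for name in dataset_names:
            m = pattern.search(name)
            if m:
                generations.append((int(m.group(1)), int(m.group(2)), name))
        if not generations:
            raise ValueError("no GxxxxVyy generations found")
        generations.sort()                      # oldest first, newest last
        return generations[relative - 1][2]     # 0 -> newest, -1 -> previous

    names = [
        "MF.FBFILE.GDG.G0001V00",
        "MF.FBFILE.GDG.G0002V00",
        "MF.FBFILE.GDG.G0003V00",
        "MF.FBFILE.GDG.G0004V00",
        "MF.FBFILE.GDG.G0005V00",
    ]
    resolve_generation(names, 0)    # 'MF.FBFILE.GDG.G0005V00', like GDG(0)
    resolve_generation(names, -1)   # 'MF.FBFILE.GDG.G0004V00', like GDG(-1)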

EBCDIC to ASCII Issue

Sqoop-mainframe automatically converts EBCDIC data to ASCII and then stores it on Hadoop. This sounds great, but it fails when the data contains computational fields. Almost every file we need from the mainframe has computational fields (COMP, COMP-3, etc.). In these cases, sqoop-mainframe's EBCDIC to ASCII conversion interprets the data incorrectly and returns unreadable characters. To interpret mainframe data correctly, we need its corresponding COBOL copybook layout. Using the copybook, we have to exclude the computational fields and convert only the remaining (display) fields from EBCDIC to ASCII. The computational fields need to be converted separately based on their representation, usually big-endian binary for COMP and packed decimal for COMP-3.
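
To illustrate why a blind character translation breaks, here is a minimal copybook-aware sketch in Python. The record layout is hypothetical (a 10-byte PIC X(10) name followed by a 5-byte PIC S9(7)V99 COMP-3 amount), and this is not sqoop-mainframe's internal code; it only shows that display fields can go through an EBCDIC codec while packed-decimal bytes must be unpacked nibble by nibble.

    import codecs
    from decimal import Decimal

    def unpack_comp3(raw, scale=0):
        """Unpack an IBM packed-decimal (COMP-3) field into a Decimal."""
        digits = []
        for byte in raw[:-1]:
            digits.append((byte >> 4) & 0x0F)
            digits.append(byte & 0x0F)
        digits.append((raw[-1] >> 4) & 0x0F)            # last byte: digit + sign
        sign = -1 if (raw[-1] & 0x0F) in (0x0D, 0x0B) else 1
        value = int("".join(str(d) for d in digits))
        return Decimal(sign * value).scaleb(-scale)

    def parse_record(record):
        name = codecs.decode(record[0:10], "cp037").rstrip()   # EBCDIC text field
        amount = unpack_comp3(record[10:15], scale=2)          # COMP-3, 2 decimals
        return name, amount

    # 'SMITH' padded to 10 bytes in EBCDIC, plus 1234567.89 packed as x'123456789C'
    record = bytes([0xE2, 0xD4, 0xC9, 0xE3, 0xC8, 0x40, 0x40, 0x40, 0x40, 0x40,
                    0x12, 0x34, 0x56, 0x78, 0x9C])
    print(parse_record(record))    # ('SMITH', Decimal('1234567.89'))

Running the same record through a plain EBCDIC-to-ASCII character conversion would turn the last five packed bytes into garbage, which is exactly the unreadable output described above.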

Variable Length Files Issue

The most important missing feature is support for variable-length files. If we point sqoop-mainframe at a VB (variable-length) file, it behaves as if the file does not exist on the mainframe.
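
For context on why VB files need special handling: a variable-length record is not self-describing by position alone; each record starts with a 4-byte Record Descriptor Word (RDW) whose first two bytes hold the record length. The following Python sketch reads such records, assuming you somehow have a binary copy of the file with RDWs preserved (sqoop-mainframe does not give you this; the path and function name are hypothetical).

    import struct

    def read_vb_records(path):
        """Yield record payloads from a binary VB file whose RDWs are intact.

        RDW layout: 2-byte big-endian record length (including the 4-byte RDW
        itself) followed by 2 bytes that are normally zero.
        """
        with open(path, "rb") as f:
            while True:
                rdw = f.read(4)
                if len(rdw) < 4:
                    break
                (length,) = struct.unpack(">H", rdw[:2])
                yield f.read(length - 4)        # record bytes without the RDW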

Work-around or How to make it work?

So if anyone is planning to use sqoop-mainframe, they first need to write a mainframe job:

  • to convert VB files to FB files
  • to convert computational fields to text
  • to place the updated file in a GDG with only one generation
  • to delete the older generations in the GDG.

Comments

Want to leave a comment? Visit this post's issue page on GitHub (you'll need a GitHub account. What? Like you already don't have one?!).