Small Steps of Main to BIG

Serde for Cobol Layout to Hive table


Transfer of mainframe file to HDFS

Generally, when we need to move a file from another system into Hadoop, we first land the data on an edge node using a protocol such as FTP, SFTP, or cURL. But when you FTP data from a mainframe system (of course, the mainframe FTP port has to be open), you need to be extra careful to download the data in binary format, because the mainframe stores it in EBCDIC encoding. One could argue for converting the file to ASCII while FTPing the data. That works perfectly well when the file contains no computational (COMP) fields, for example copybooks, code, or plain text. The problem arises when the data does contain computational fields: those fields must be excluded from the ASCII conversion, and that becomes complicated because the positions of such fields can only be determined from the record layout.

I recommend importing the mainframe data without converting it to ASCII, that is, in EBCDIC format. You can do this by copying the data in binary format to an edge node using whatever protocol is common between the two systems and then using the hdfs dfs -put command to move it into HDFS. Or use my approach below.
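
For example, the two-step route looks like this once the binary file has landed on the edge node (both paths here are purely illustrative):

hdfs dfs -put /landing/customer.ebcdic /data/mainframe/customer.ebcdic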

Assuming FTP is the most common protocol, I have written a simple Java program that connects to the mainframe system using the FTP Client API and brings the data directly into HDFS.

Usage of my program to FTP the file

hadoop jar CobolSerde.jar com/savy3/util/FTPtoHDFS {Mainframe FTP server:port} {userid} {pwd} {Mainframe file name} {hadoop file location}
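
For reference, here is a minimal sketch of that approach using the Apache Commons Net FTPClient together with the Hadoop FileSystem API. It is not the actual FTPtoHDFS source; the class name, argument handling, and error handling are simplified for illustration, and details such as passive mode or mainframe SITE commands may be needed in practice.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical class name; not the actual com/savy3/util/FTPtoHDFS source.
public class MainframeFtpToHdfs {

    public static void main(String[] args) throws Exception {
        String host     = args[0]; // mainframe FTP server (port handling omitted)
        String user     = args[1];
        String password = args[2];
        String mfFile   = args[3]; // mainframe dataset name
        String hdfsPath = args[4]; // target HDFS location

        FTPClient ftp = new FTPClient();
        ftp.connect(host);
        ftp.login(user, password);
        // The crucial step: binary mode keeps the EBCDIC bytes untouched.
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (InputStream in = ftp.retrieveFileStream(mfFile);
             OutputStream out = fs.create(new Path(hdfsPath))) {
            // Stream the bytes straight into HDFS without any conversion.
            IOUtils.copyBytes(in, out, 4096, false);
        }
        ftp.completePendingCommand();
        ftp.logout();
        ftp.disconnect();
    }
}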

Now we have the mainframe file in HDFS, but how do we interpret it? See the next post, Cobol to Hive Serde, for details.


