Parse CSV Buffer

 

I have several applications where I store data in Excel spreadsheets, then process them by exporting the data into a a flat text file as quoted, comma separated fields.  These were easy to process using compiled QuickBASIC programs, but not so much for COBOL.  I searched for a solution on the Internet written in COBOL and found nothing, so I wrote this subprogram to parse the records into discrete fields.  Each field may still need additional formatting (right justification, padding with spaces/zeros, etc.), but the majority of the work is done with one call.

As delivered here, the program will process a passed buffer (single record read from a CSV text file) of 5,000 characters and will return up to 25 fields of 200 characters each.  This will accommodate most needs, but the subprogram can be expanded fairly easily. 

I also include a very simple test program that will just display the fields returned from the subprogram and a small file of test data exported from an Excel spreadsheet of test data.

To install the source for the subroutine and the test program, as well as the sample data, and compile them, execute the bash script: ParseCSVsetup [md5: 6ef9130a3aeb8db7785dbcdec7f6ec18].  Right click and save the script in the location where you want the source programs to reside, then execute it.  It will create two files containing the source programs, plus the sample data file, then execute the GnuCOBOL compiler to compile the subprogram and test program.  The GnuCOBOL compiler must be installed prior to executing the bash script.  

If you execute the test program it will display output to standard output:

jay@Phoenix $ ./testParse
Record: 0001
Field +000000001 Length: +0006 Data: Row ID                                                                                                                                                                                                  
Field +000000002 Length: +0008 Data: Order ID                                                                                                                                                                                                
Field +000000003 Length: +0010 Data: Order Date                                                                                                                                                                                              
Field +000000004 Length: +0014 Data: Order Priority                                                                                                                                                                                          
Field +000000005 Length: +0014 Data: Order Quantity                                                                                                                                                                                          
Field +000000006 Length: +0005 Data: Sales                                                                                                                                                                                                   
Field +000000007 Length: +0008 Data: Discount                                                                                                                                                                                                
Field +000000008 Length: +0009 Data: Ship Mode                                                                                                                                                                                               
Field +000000009 Length: +0006 Data: Profit                                                                                                                                                                                                  
Field +000000010 Length: +0010 Data: Unit Price                                                                                                                                                                                              
Field +000000011 Length: +0013 Data: Shipping Cost                                                                                                                                                                                           
Field +000000012 Length: +0013 Data: Customer Name                                                                                                                                                                                           
Field +000000013 Length: +0008 Data: Province                                                                                                                                                                                                
Field +000000014 Length: +0006 Data: Region                                                                                                                                                                                                  
Field +000000015 Length: +0016 Data: Customer Segment                                                                                                                                                                                        
Field +000000016 Length: +0016 Data: Product Category                                                                                                                                                                                        
Field +000000017 Length: +0020 Data: Product Sub-Category                                                                                                                                                                                    
Field +000000018 Length: +0012 Data: Product Name                                                                                                                                                                                            
Field +000000019 Length: +0017 Data: Product Container                                                                                                                                                                                       
Field +000000020 Length: +0019 Data: Product Base Margin                                                                                                                                                                                     
Field +000000021 Length: +0009 Data: Ship Date                                                                                                                                                                                               
 
Record: 0002
Field +000000001 Length: +0001 Data: 1                                                                                                                                                                                                       
Field +000000002 Length: +0001 Data: 3                                                                                                                                                                                                       
Field +000000003 Length: +0010 Data: 10/13/2010                                                                                                                                                                                              
Field +000000004 Length: +0003 Data: Low                                                                                                                                                                                                     
Field +000000005 Length: +0001 Data: 6                                                                                                                                                                                                       
Field +000000006 Length: +0006 Data: 261.54                                                                                                                                                                                                  
Field +000000007 Length: +0004 Data: 0.04                                                                                                                                                                                                    
Field +000000008 Length: +0011 Data: Regular Air                                                                                                                                                                                             
Field +000000009 Length: +0007 Data: -213.25                                                                                                                                                                                                 
Field +000000010 Length: +0005 Data: 38.94                                                                                                                                                                                                   
Field +000000011 Length: +0002 Data: 35                                                                                                                                                                                                      
Field +000000012 Length: +0018 Data: Muhammed MacIntyre                                                                                                                                                                                      
Field +000000013 Length: +0007 Data: Nunavut                                                                                                                                                                                                 
Field +000000014 Length: +0007 Data: Nunavut                                                                                                                                                                                                 
Field +000000015 Length: +0014 Data: Small Business                                                                                                                                                                                          
Field +000000016 Length: +0015 Data: Office Supplies                                                                                                                                                                                         
Field +000000017 Length: +0022 Data: Storage & Organization                                                                                                                                                                                  
Field +000000018 Length: +0048 Data: Eldon Base for stackable storage shelf, platinum                                                                                                                                                        
Field +000000019 Length: +0009 Data: Large Box                                                                                                                                                                                               
Field +000000020 Length: +0003 Data: 0.8                                                                                                                                                                                                     
Field +000000021 Length: +0010 Data: 10/20/2010                                                                                                                                                                                              
 ... continues for contents of 20 records

In order to dynamically call the subroutine from other programs, you should copy the object module (ParseCSV.so) to a location available through the COB_LIBRARY_PATH.

* Updated September 20, 2019 *  I was tired of looking for and copying the Working-Storage fields for this, and a few other, routines, so I updated the installation script to provide a copybook that may be used in the calling program for this routine.  For easiest use, you should copy the copybook file into the directory that GnuCOBOL searches during compile time for copy books (on my Linux system that is /usr/local/share/gnucobol/copy).  At the same time I modified the original code to utilize a variable length field for the input field, where the length is determined at execution time.

* Updated May 31, 2022 *  I received feedback from Alfred Cleve, who had done some performance benchmarking of this routine; he found that the INSPECT of the WORK-BUFFER replacing spaces with x'00' was a serious hit on performance, especially when processing a large number of records.  I had not done any performance testing, as my uses of the routine are usually for a small number of records, mostly below 10,000.  I have improved the performance by initializing the WORK-BUFFER to low-values, adding the parameter INITIAL to the PROGRAM-ID to ensure that each call of the subprogram gets a fresh copy of the local storage, and removed the INSPECT statement, as it is no longer necessary when the buffer is pre-initialized.  For comparison, here are the time values for a test execution of the routine using a sample file of 113,906 records:

Original code
Modified code
real 0m59.197s
user 0m58.649s
sys  0m0.516s
real 0m9.278s
user 0m8.762s
sys  0m0.512s

This page was last updated on May 31, 2022.