Parsing Unstructured File


The biggest challenge with ETL (Extract, Transform and Load of data) process is that sometimes the data source is in a file that cannot be easily parsed into a data table. Those files are designed for human to read, but they ended up being the only way we can get the data we need. We call these files unstructured files. They actually have structure. Any files created by computer have structures, but these files just don’t have the data processing folks’ favorite structures – fix width, tab or comma deliminated. Today, I would like to talk about how to parse these unstructured files into fix width, or tab deliminated files.

Let’s use the following report as an example. Click here to see the full width file from UnstructuredFile.Com .

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      1
                                    ABBA1000          5000 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00                                 

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      2
                                    ABBA1001          5001 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00                                 

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      3
                                    ABBA1002          5002 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00

Most of the content of this file are fix width transactions, which is easy to be parsed. however, there are information in the report heading section, such as the date and the page number that need to be parsed as well.  Also, the file includes pages, which adds an extra layer of grouping.

In order to parse it, first we need to seperate the files into sections, each section starts with the first character of the header, the H, and ends with the last line before the next header that starts with H.  In this example, we have three sections.

Within each section, there is a header, which is the first 4 lines. The 3rd and 4th line are actually the column names for the transactions, but when we do parsing, we can ignore them, and make them part of the header.

Transactions starts with 0 as the first character, and it’s already fix width.

Footer starts with the next line after last transaction. Sometimes we have to parse fields from the footer section, and here we will ignore everything from the footer.

In order to simply the data structure, we will normalize the data and append the fields in the header and footer to the fields from transactions.

Next, we need to define a template on how to parse the file. A template will look like this.

Header Section
  Literal “H”
  8 Characters (REPORT_DATE column)
  Any characters
  Literal “PAGE”
  Any Spaces
  Any numbers (PAGE column)
  Any Spaces
  Line Break

  …

Item Section
  Literal “0”
  1 Space
  16 Numbers (CARD_NUM column)
  4 Spaces
  4 Numbers (SERIAL_NUM column)

  …

  Line Break

Footer Section
  Any Characters
  Line Break 
  Any Characters
  Line Break

With the template file, we can write the code to parse the file. Here’s a pseudo code on how to parse the file.

FOR EACH MATCHING SECTIONS

    FIND MATCHING HEADER
    PARSE THE FIELDS FROM HEADER

    FIND MACHING ITEMS
        FOR EACH MATHCING ITEMS

            PARSE THE FIELDS FROM ITEM

        END LOOP

    FIND MATCHING FOOTER
    PARSE THE FIELDS FROM FOOTER

END LOOP

Here’s a screen shot of an application that used to design template and parse data. The upper box on the left side has the unstructured file. Header has been colored golden, and transctions green. All field that needs to be imported are colored navy. The lower box on the left side has the parsed data. The box on the right side is the template.

Click here to see the full width image from UnstructuredFile.com

Software will be available from http://www.unstructuredfile.com in May 2010.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s