Introductory Programming in Python: Lesson 20
Basic Parsing


What is Parsing?

The vast majority of your time as a bioinformaticist will be spent converting data from one format to another. You invariably want to massage your raw data into a format which you can plug into an analysis tool, or set of tools. Parsing is the name given to the reading and programmatic interpretation of flat file formats. There are two basic approaches to parsing files, namely the serial access approach, and the document object model approach.

General Comments on File Structure

The majority of file formats have a structure that involves a small amount of overall information, if any, followed by a collection of repeated structures. A FastA file, for example, has no header information and consists entirely of one repeated structure: a title line followed by sequence lines. If we write some code to read the repeated structure, and put this code in a loop, we have a parser. Multiple levels of nesting can be handled similarly, by writing some code for the most deeply nested structure, putting that code inside the loop for the structure that encloses it, and so on outwards.
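The idea of "code for the repeated structure, inside a loop" can be sketched for FastA as follows. This is a minimal sketch, not a complete FastA parser: the function name read_fasta and the example data are made up for illustration, and it assumes every title line starts with '>' and that blank lines can be ignored.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from an open FastA file, one record at a time."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Inner repeated structure: one of the record's sequence lines.
            chunks.append(line)
    # The last record has no following title line to close it.
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical example data, held in memory instead of a real file.
example = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n")
records = list(read_fasta(example))
for title, sequence in records:
    print(title, len(sequence))
```

The outer structure (the record) is handled by the check for a title line, and the inner structure (the sequence lines) by the final branch. Because the function yields each record as soon as it is complete, the caller decides whether to keep the records or discard them one by one.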

The Serial Access Approach

The serial approach involves reading the file in question from top to bottom, in sequence, acting on each structural element of the file format as it is encountered, then overwriting it with the next one.

Pros:

Files are usually accessed in a serial or sequential manner anyway, so this approach adds no complexity.
Because we are only keeping one structural element in memory at a time, we can handle large files with ease.

Cons:

We have to read all the data in the file up to the structure of interest, even if it's of no importance to us.
Some file formats contain back references, i.e. structures that point to previously encountered structures. If we have thrown those earlier structures away, then the structure we are dealing with at the moment does not have complete information, and we must make a second pass over the file to fill it in.
The problem of forward referencing, where a structure points to one that appears later in the file, is solvable without a second pass, although the solution is complex. Whenever we encounter a forward reference we must record where the reference was made and what it refers to. Every structure read from the file after that must be checked against the entire list of 'unfulfilled references', and if it matches one of them, the appropriate data from the structure is substituted into the structure that references it.
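The bookkeeping for forward references might look something like the sketch below. Everything about the record layout is an assumption made for illustration: each record is a tuple (record_id, value, ref_id), where ref_id is either None or the id of a record that appears later. Instead of scanning a list of unfulfilled references, the sketch keeps them in a dictionary keyed by the id being waited for, which does the same check with a single lookup.

```python
def resolve_forward_references(records):
    """Resolve forward references in a single pass over the records.

    Returns (resolved, pending): resolved is a list of
    (record_id, value, referenced_value) tuples, and pending holds any
    references whose target never appeared.
    """
    resolved = []
    pending = {}  # target id -> list of (record_id, value) still waiting for it
    for record_id, value, ref_id in records:
        # Check the new record against the unfulfilled references.
        for waiting_id, waiting_value in pending.pop(record_id, []):
            resolved.append((waiting_id, waiting_value, value))
        if ref_id is None:
            # Nothing to wait for, so the record is complete already.
            resolved.append((record_id, value, None))
        else:
            # Record where the reference was made and what it refers to.
            pending.setdefault(ref_id, []).append((record_id, value))
    return resolved, pending


# Hypothetical records: 'a' refers forward to 'c', and 'd' refers to a missing 'z'.
records = [
    ("a", "alpha", "c"),
    ("b", "beta", None),
    ("c", "gamma", None),
    ("d", "delta", "z"),
]
resolved, pending = resolve_forward_references(records)
print(resolved)
print(pending)
```

Note that records with an unfulfilled reference have to stay in memory until their target turns up, so a file with many long-range forward references loses some of the memory advantage of the serial approach.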

The Document Object Model (DOM) Approach

The document object model approach maps out the file format in memory, using data structures that mirror the structure of the file. The entire file is read into memory, then closed. From then on, everything is dealt with in memory. An outline of the DOM approach is:

Look at the file
Identify its structure
Create a data structure in memory that mirrors the file's structure
Read the whole file into that data structure
Close the file
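The steps above can be sketched for a FastA file as follows. This is only an illustration under some assumptions: the function name load_fasta and the example file contents are invented, a dictionary keyed by title line is just one possible in-memory structure, and records with duplicate titles would overwrite each other.

```python
import os
import tempfile


def load_fasta(path):
    """Read a whole FastA file into a dict mapping each title to its sequence."""
    chunks_by_title = {}
    title = None
    with open(path) as handle:  # the file is closed when this block ends
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                title = line[1:]
                chunks_by_title[title] = []
            elif line and title is not None:
                chunks_by_title[title].append(line)
    # Join each record's sequence lines now that the whole file has been read.
    return {name: "".join(chunks) for name, chunks in chunks_by_title.items()}


# Write a small hypothetical FastA file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
    example_path = tmp.name

sequences = load_fasta(example_path)
os.remove(example_path)

# Once loaded, records can be looked up in any order.
print(sequences["seq2"])
print(sequences["seq1"])
```

After load_fasta returns, the file is no longer needed, and every later lookup is a dictionary access rather than a read from disk.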

Pros:

Once the entire file is loaded, we can access the data in any order, and far faster than by reading through to it serially.
We gain a complete understanding of the file format's structure.
Because all the structures are available at once, and accessible directly without any tedious serial access, back referencing and forward referencing are as simple as looking up the relevant data.