Introductory Programming in Python: Lesson 20
Basic Parsing
What is Parsing?
The vast majority of your time as a bioinformaticist will be spent
converting data from one format to another. You invariably want to
massage your raw data into a format which you can plug into an analysis
tool, or set of tools. Parsing is the name given to the reading and
programmatic interpretation of flat file formats. There are two basic
approaches to parsing files, namely the serial access approach, and the
document object model approach.
General Comments on File Structure
The majority of file formats have a structure that involves a small
bit of overall information, if any, followed by a collection of
repeated structures. A FASTA file, for example, has no header
information, but consists entirely of the repeated structure of title
line followed by sequence lines. If we write some code to read the
repeated structures, and put this in a loop, we have a parser. Multiple
levels of nesting of structures can be handled similarly, by writing
some code for the most deeply nested structure, putting that code
inside the loop or code for the outer structure, and so on.
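As a rough illustration of "code for the repeated structure, placed in a
loop", here is a minimal sketch of a FASTA reader. The function name
read_fasta and the sample data are made up for this example, and the
sketch assumes every record starts with a title line beginning with ">".

```python
import io

def read_fasta(handle):
    """Yield (title, sequence) pairs from an open FASTA file, one at a time."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines belong to the most recent title line.
            chunks.append(line)
    # The last record has no following title line to close it.
    if title is not None:
        yield title, "".join(chunks)

# io.StringIO stands in for a real file in this example.
example = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n")
records = list(read_fasta(example))
```

The loop over lines is the outer structure (the file), and the handling
of a title line plus its sequence lines is the repeated structure.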
The Serial Access Approach
The serial approach involves reading the file in question from top
to bottom, in sequence, acting on each structural element of the file
format as it is encountered, then overwriting it with the next one.
Pros:
- Files are usually accessed in a serial or sequential manner
anyway, so this approach adds no complexity
- Because we are only keeping one structural element in memory at
a time, we can handle large files with ease.
Cons:
- We have to read all the data in the file up to the structure of
interest, even if it's of no importance to us.
- Some file structures contain back references, i.e. structures
that point to formerly encountered structures. If we have thrown
these previously encountered structures away, then the structure we
are dealing with at the moment does not have complete information.
We must reprocess the file on a second pass to fill in this
information.
- Forward references, i.e. structures that point to structures not
yet encountered, can be resolved without a second pass, although
the solution is complex. Whenever we encounter a forward reference
we must record where the reference was made, and to what it is
being made. Every structure read from the file after that must be
checked against the entire list of 'unfulfilled references', and if
it matches one of them, the appropriate data from the structure is
substituted into the structure referencing it.
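The bookkeeping for forward references might look something like the
sketch below. The file format here is hypothetical and invented purely
for illustration: each line holds "name value target", where target is
the name of a record further down the file, or "-" for no reference.
The sketch assumes every reference points forward, never backward.

```python
import io

def resolve_forward_references(handle):
    """Yield (name, value, referenced_value) in a single pass over the file."""
    # Unfulfilled references: target name -> records waiting for that target.
    pending = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, value, target = fields
        # Check the new record against the unfulfilled references.
        for waiting_name, waiting_value in pending.pop(name, []):
            yield waiting_name, waiting_value, value
        if target == "-":
            yield name, value, None
        else:
            # Record where the reference was made and what it points to.
            pending.setdefault(target, []).append((name, value))
    if pending:
        raise ValueError("unfulfilled references: " + ", ".join(sorted(pending)))

# io.StringIO stands in for a real file in this example.
data = io.StringIO("a 1 c\nb 2 -\nc 3 -\nd 4 e\ne 5 -\n")
resolved = list(resolve_forward_references(data))
```

Note that a record holding a forward reference is only emitted once its
target has been read, so the output order can differ from the file
order, and records waiting on a target stay in memory until then.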
The Document Object Model (DOM) Approach
The document object model approach refers to mapping out the file
format structure in memory using data structures that mirror the
structure in the file. The entire file is read into memory, then
closed. Henceforth everything is dealt with in memory. So an outline of
the DOM approach is:
- Look at the file
- Identify its structure
- Create a data structure in memory that mirrors the file's
structure
- Read the whole file into that data structure
- Close the file
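The outline above can be sketched for a FASTA file as follows. This is
only one possible in-memory model: a dictionary mapping each title to
its sequence. The name load_fasta and the sample data are assumptions
made for this example, and records with duplicate titles would
overwrite one another in this simple model.

```python
import io

def load_fasta(handle):
    """Read a whole FASTA file into a dict mapping title -> sequence."""
    records = {}
    title = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            title = line[1:]
            records[title] = []
        elif line and title is not None:
            records[title].append(line)
    # Join the collected sequence lines of each record into one string.
    return {title: "".join(chunks) for title, chunks in records.items()}

# io.StringIO stands in for a real file; the with-block closes it
# as soon as the whole file has been loaded.
with io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n") as handle:
    sequences = load_fasta(handle)

# From here on everything is dealt with in memory, in any order we like.
last = sequences["seq2"]
first = sequences["seq1"]
```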
Pros:
- We can access the data in any order, and far faster than reading
through to it serially, once the entire file is loaded.
- Building the in-memory model forces us to understand the file
format's structure completely
- Because all the structures are available at once, and
accessible directly without any tedious serial access, back
referencing and forward referencing are as simple as looking up the
relevant data.
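To see why references become simple lookups, consider a hypothetical
format (invented here for illustration) in which each line holds
"name value target", and target is either the name of another record
anywhere in the file or "-" for no reference. A minimal sketch, with
made-up function names:

```python
import io

def load_records(handle):
    """Load every "name value target" line into a dict keyed by name."""
    records = {}
    for line in handle:
        fields = line.split()
        if len(fields) == 3:
            name, value, target = fields
            records[name] = {"value": value, "target": target}
    return records

def referenced_value(records, name):
    """Return the value of the record that `name` refers to, or None."""
    target = records[name]["target"]
    if target == "-":
        return None
    # Back and forward references are resolved the same way: by lookup.
    return records[target]["value"]

# "a" refers forward to "c", and "c" refers back to "a".
data = io.StringIO("a 1 c\nb 2 -\nc 3 a\n")
records = load_records(data)
forward = referenced_value(records, "a")
backward = referenced_value(records, "c")
```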
Cons:
- Large files or data sets do not conveniently fit into
memory
- We have to read the whole file
Exercises