Introductory Programming in Python: Lesson 20
Basic Parsing


What is Parsing?

The vast majority of your time as a bioinformaticist will be spent converting data from one format to another. You invariably want to massage your raw data into a format that you can plug into an analysis tool, or set of tools. Parsing is the name given to the reading and programmatic interpretation of flat file formats. There are two basic approaches to parsing files: the serial access approach and the document object model approach.

General Comments on File Structure

The majority of file formats have a structure that involves a small amount of overall information, if any, followed by a collection of repeated structures. A FASTA file, for example, has no header information, but consists entirely of the repeated structure of a title line followed by sequence lines. If we write some code to read one repeated structure, and put this in a loop, we have a parser. Multiple levels of nesting can be handled similarly: write the code for the most deeply nested structure, put that code inside the loop for the structure that contains it, and so on outwards.
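The idea of "code for one repeated structure, placed in a loop" can be sketched as follows. This is a minimal illustration rather than a complete FASTA parser: the sample data, the name read_records and the assumption that every title line begins with ">" are choices made for this example only.

```python
import io

# Hypothetical sample data standing in for a FASTA file on disk.
SAMPLE = """>seq1 first example
ACGTAC
GGT
>seq2 second example
TTGA
"""

def read_records(handle):
    """Return a list of (title, sequence) pairs from a FASTA-style handle."""
    records = []
    title = None
    chunks = []
    for line in handle:              # the loop walks the file line by line
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):     # a title line starts a new record
            if title is not None:    # store the record we just finished
                records.append((title, "".join(chunks)))
            title = line[1:]
            chunks = []
        elif title is not None:      # sequence lines belong to the current record
            chunks.append(line)
    if title is not None:            # do not forget the final record
        records.append((title, "".join(chunks)))
    return records

records = read_records(io.StringIO(SAMPLE))
for title, sequence in records:
    print(title, len(sequence))
```

The sequence lines are the inner, nested structure here: they are collected in chunks and joined only when the record that owns them is complete.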

The Serial Access Approach

The serial approach involves reading the file in question from top to bottom, in sequence, acting on each structural element of the file format as it is encountered. Only the current element is held in memory; once it has been dealt with, it is discarded and replaced by the next one.
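One way to sketch the serial approach in Python is with a generator, which hands over one record at a time and keeps nothing else. The sample data and the name iter_records are assumptions made for this illustration; the task chosen (finding the longest sequence) is just an example of "acting on each element as it is encountered".

```python
import io

# Hypothetical sample data standing in for a FASTA file on disk.
SAMPLE = """>seq1
ACGT
>seq2
GGCC
TTAA
AC
>seq3
T
"""

def iter_records(handle):
    """Yield (title, sequence) pairs one at a time from a FASTA-style handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)   # hand over the finished record
            title, chunks = line[1:], []       # then start the next one
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)           # the last record in the file

# Act on each record as it arrives; only the running result is kept.
longest_title, longest_length = None, 0
for title, sequence in iter_records(io.StringIO(SAMPLE)):
    if len(sequence) > longest_length:
        longest_title, longest_length = title, len(sequence)

print(longest_title, longest_length)
```

Because each record is dropped as soon as the loop moves on, memory use depends on the size of the largest record, not on the size of the file.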

Pros:

Cons:

The Document Object Model (DOM) Approach

The document object model approach refers to mapping out the file format structure in memory, using data structures that mirror the structure in the file. The entire file is read into memory, then closed. From then on, everything is dealt with in memory. An outline of the DOM approach is:

  1. Look at the file
  2. Identify its structure
  3. Create a data structure in memory that mirrors the file's structure
  4. Read the whole file into that data structure
  5. Close the file
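The steps above can be sketched for a FASTA-style file as follows. This is an illustration under stated assumptions, not the only way to do it: the name load_fasta is invented for this example, a dictionary mapping title to sequence is just one possible mirror of the file's structure, and it assumes that titles are unique. The temporary file exists only so that the sketch can be run as it stands.

```python
import os
import tempfile

# Hypothetical sample data; written to a temporary file further down.
SAMPLE = """>seq1
ACGTAC
GGT
>seq2
TTGA
"""

def load_fasta(path):
    """Read a whole FASTA-style file into a dict mapping title -> sequence."""
    model = {}                      # step 3: a structure that mirrors the records
    title = None
    with open(path) as handle:      # the file is closed when this block ends (step 5)
        for line in handle:         # step 4: read the whole file into the structure
            line = line.strip()
            if line.startswith(">"):
                title = line[1:]
                model[title] = ""   # assumes each title appears only once
            elif line and title is not None:
                model[title] += line
    return model

# Create a small file on disk so there is something to load.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

model = load_fasta(path)
os.remove(path)                     # the file is no longer needed

# From here on, everything is dealt with in memory.
print(sorted(model))
print(model["seq2"])
```

Once the dictionary is built, any record can be looked up by title in any order, which the serial approach does not allow without reading the file again.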

Pros:

Cons:

Exercises
