phppdflib is a PHP class library for creating dynamic documents in Adobe's PDF format.
With the design goals in mind, the decision was made to produce a two-stage engine.
The first stage is the collection of data for the objects that will go into the resulting file. Most of the class methods are involved with this stage. Each method of this stage generally validates its input, then stores the necessary information in a structured array for later use.
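The validate-then-store pattern of stage one might look something like the following simplified sketch. The class name, method name, parameters, and array layout here are illustrative assumptions, not phppdflib's actual API:

```php
<?php
// Hypothetical sketch of a stage-one method: validate the input,
// then record it in a structured array for generate() to consume later.
class pdffile_sketch
{
    var $objects = array();   // the structured array built during stage one
    var $nextid  = 1;

    // Illustrative only: a "new page"-style method storing page geometry.
    function add_page($width, $height)
    {
        // Stage one is mostly validation...
        if (!is_numeric($width) || !is_numeric($height)
            || $width <= 0 || $height <= 0) {
            return false;
        }
        // ...followed by storing the data for stage two.
        $id = $this->nextid++;
        $this->objects[$id] = array(
            'type'   => 'page',
            'width'  => $width,
            'height' => $height,
        );
        return $id;   // handle the client uses to refer to this object
    }
}
```

The returned handle lets later calls (and ultimately stage two) refer back to the stored object without the client ever touching the array directly.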
The second stage is the processing of the data objects to convert them into the PDF format.
This stage is initiated by the generate() method, which calls many other methods to do the work. Other than generate(), these methods are intended only for internal use within the library.
generate() first preprocesses the objects in an attempt to combine as many as possible into mstream objects. This preprocessing reduces the number of array objects that are converted to PDF objects, thus reducing the size and complexity of the resultant PDF.
The generate() function then creates some static PDF objects, specifically the document catalog, the root pagenode, and the resource dictionary. While the location of these objects is static, the content of the root pagenode and resource dictionary is dynamic.
generate() then processes the objects in the structured library and converts them into properly formatted PDF objects, building the data stream as it goes. During this process, the byte size of each object is recorded, as needed for the generation of the PDF xref table. At the end of the process, the xref table is generated and appended to the data stream. A document trailer and end-of-file marker are then appended, and the stream is returned to the calling process.
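The offset bookkeeping that makes the xref table possible can be sketched as follows. This is a toy example, not the library's actual code: the objects are stand-ins and the trailer is reduced to its /Size entry, but the fixed-width xref entry format is the real one from the PDF specification:

```php
<?php
// Toy sketch of xref bookkeeping: as each object is appended to the
// data stream, its starting byte offset is recorded; at the end those
// offsets become the entries of the xref table.
function build_toy_pdf($objects)
{
    $stream  = "%PDF-1.3\n";
    $offsets = array();
    foreach ($objects as $num => $body) {
        $offsets[$num] = strlen($stream);   // byte offset of this object
        $stream .= "$num 0 obj\n$body\nendobj\n";
    }
    // xref entries are fixed-width: 10-digit offset, 5-digit generation.
    $xrefpos = strlen($stream);
    $stream .= "xref\n0 " . (count($objects) + 1) . "\n";
    $stream .= "0000000000 65535 f \n";    // entry 0: head of the free list
    foreach ($offsets as $off) {
        $stream .= sprintf("%010d 00000 n \n", $off);
    }
    $stream .= "trailer\n<< /Size " . (count($objects) + 1) . " >>\n";
    $stream .= "startxref\n$xrefpos\n%%EOF\n";
    return $stream;
}
```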
The two-step process has both advantages and disadvantages.
Since the addition of objects and their conversion into the PDF format are done separately, the user may create the PDF file in any order, except that parent objects must be created before their children. For example, a user may create all the pages in the PDF and then paint to them; or create and paint each page as a separate step; or any combination of the two. A user may even paint to pages out of order: for example, after all other data has been written, the script may add a footnote to each page denoting the total number of pages. It is the job of the generation process to reorganize the document hierarchy so that it forms valid PDF.
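As a usage sketch, the page-numbering example might look like this. The method names and parameters below are guesses for illustration only; generate() is the one call taken from the text above:

```php
// Hypothetical client script: paint pages first, number them afterwards.
// newpage() and drawtext() are illustrative guesses, not phppdflib's
// documented API.
$pdf = new pdffile();
$pages = array();
for ($i = 0; $i < 10; $i++) {
    $pages[$i] = $pdf->newpage("letter");
    // ... paint this page's content here ...
}
// Now that the total is known, revisit every page out of order:
$total = count($pages);
foreach ($pages as $i => $page) {
    $pdf->drawtext($page, "Page " . ($i + 1) . " of $total", 306, 36);
}
$data = $pdf->generate();
```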
The obvious disadvantage is that memory usage is increased. During the later stages of generation, the entire PDF is stored in memory twice (once as the structured array, and again as the PDF stream itself). Additionally, the library provides no method for freeing the memory after generation is complete. I think it unlikely that scripts will continue processing after generation is complete, but if they do, the library instance should be unset() to free the (possibly significant) memory allocated.
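Concretely, a long-running script might free the library like this (assuming the instance is held in $pdf; only the returned stream is kept):

```php
$data = $pdf->generate();   // the returned stream is all we still need
unset($pdf);                // drops the structured array and internal copies
// ... continue processing with $data only ...
```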
Significant focus is placed on making the resultant PDF file as small as possible, even at the expense of additional processing or memory usage during generation. The rationale is that most servers using the library will have significant CPU and memory resources, while many end users receiving the PDF files will have limited bandwidth. The only tunable aspect of this tradeoff (short of recoding the library) is the use of compression, which (because it uses a compression algorithm supported internally by PHP) does not seem to use significant amounts of memory or processor time compared to the rest of the process.
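PHP's built-in zlib support is presumably what makes the compression cheap. A minimal sketch of Flate-compressing a content stream follows; gzcompress() is PHP's standard zlib binding and /Filter /FlateDecode is the corresponding PDF stream filter, but how phppdflib actually wires this in is an assumption:

```php
<?php
// Sketch: compress a PDF content stream with zlib (FlateDecode).
// PDF viewers decompress the stream when its dictionary carries
// the /Filter /FlateDecode entry.
function flate_stream($content)
{
    $compressed = gzcompress($content, 9);   // maximum compression level
    return "<< /Length " . strlen($compressed)
         . " /Filter /FlateDecode >>\nstream\n"
         . $compressed . "\nendstream";
}
```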
I'm writing this prior to the code being written, to solidify my ideas. It's possible that the resultant system will differ from what is conceived here - I'll try to keep this updated.
The concept of the packer engine started with perl's Tk library and its included packer. phppdflib's packer will run somewhat differently, since the main constraining factor is available page space and phppdflib can create new pages at will - two things the perl packer doesn't factor into its reasoning.
The basic idea I have is that each page will store (in its library array) an array of rectangles that indicate the unused space on the page. Initially, a page will consist of a single rectangle bounded by the page margins (note that pages will have to remember the margins they were created with). All painting functions will remove the space they use from this pool, thus keeping track of how much space is still available on the page. Special functions will allow a client script to paint an object "in the next available space", and the packer should create new pages as needed to place objects.
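The free-space pool might be represented as simply as this; the coordinate convention, key names, and single uniform margin are assumptions for illustration:

```php
<?php
// Sketch: a page's free space starts as one rectangle inside the margins.
// Rectangles are stored as x/y of the lower-left corner plus width/height
// (PDF user space puts the origin at the bottom left).
function initial_field($pagewidth, $pageheight, $margin)
{
    return array(
        array(
            'x' => $margin,
            'y' => $margin,
            'width'  => $pagewidth  - 2 * $margin,
            'height' => $pageheight - 2 * $margin,
        ),
    );
}
```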
Inside the machine, painting an object will take the rectangle it is placed in and break it into sub-rectangles that identify the remaining space. To illustrate, a page initially consists of a single "field" (that's the term I'm going to coin, we'll see if it sticks):
+----------+
|          |
|          |
|          |
|          |
|          |
|          |
+----------+

When we place (for example) an image on the page, it allocates some space (a) and the remaining space is broken into new rectangles (b and c):
+----+-----+
|    |  a  |
|    |     |
| b  +-----+
|    |     |
|    |  c  |
|    |     |
+----+-----+

The allotment of remaining space is not arbitrary: it will attempt to keep the largest vertical area possible (b), since that's how text normally flows. A special function will exist to "fill in" text in the remaining space - the idea being that a user can place all her/his images in the document, and then automagically have the packer flow the text around the images.
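The split pictured above might be computed like this. It is a sketch under the assumption that the object sits flush in the field's top-right corner; a real packer would have to handle arbitrary placement:

```php
<?php
// Sketch: remove an object of size $ow x $oh from the top-right corner
// of $field, returning the remaining space as rectangles b and c.
// Rectangle b (to the left of the object) keeps the field's full
// height, since tall areas suit normal text flow.
function split_field($field, $ow, $oh)
{
    $b = array(                       // tall strip left of the object
        'x' => $field['x'],
        'y' => $field['y'],
        'width'  => $field['width'] - $ow,
        'height' => $field['height'],
    );
    $c = array(                       // block below the object
        'x' => $field['x'] + $b['width'],
        'y' => $field['y'],
        'width'  => $ow,
        'height' => $field['height'] - $oh,
    );
    return array($b, $c);
}
```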
Several issues remain unresolved. Serious development of the packer will probably start after the template functions have begun to solidify. Watch this space.