Apr 04 2008
Some issues with web page rendering
Creating a browser-within-the-browser for Farfalla seems to be the best solution for accessibility: it allows web pages to be rendered inside a sort of protected environment, where the pages themselves can be dynamically modified to meet the most basic accessibility requirements.
For example, by parsing the DOM tree of a document we can find common misuses of HTML tags and correct them, making the page ready for further processing. We can also change the CSS properties of each tag, so we can highlight some parts of the documents we are browsing. There are many more things we could do by working on the DOM tree through JavaScript, but there is a big problem, too: for security reasons, JavaScript can only access documents served from the same domain as the script itself (the same-origin policy).
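As a rough illustration of the kind of DOM processing described above, the following sketch walks a tree and changes the CSS properties of every element with a given tag name. The helper names `walk` and `highlightTag` are hypothetical and not part of Farfalla; the code only assumes that each node exposes `tagName`, `children` and `style`, as browser elements do.

```javascript
// Visit a node and all of its descendants, depth-first.
function walk(node, visit) {
  visit(node);
  const children = node.children || [];
  for (let i = 0; i < children.length; i++) {
    walk(children[i], visit);
  }
}

// Hypothetical helper: set the given CSS properties on every element
// whose tag name matches `tagName`. Returns how many elements changed.
function highlightTag(root, tagName, styles) {
  const wanted = tagName.toUpperCase();
  let count = 0;
  walk(root, function (el) {
    if (el.tagName && el.tagName.toUpperCase() === wanted) {
      Object.keys(styles).forEach(function (prop) {
        el.style[prop] = styles[prop];
      });
      count++;
    }
  });
  return count;
}
```

In a browser this could be called as highlightTag(document.body, 'a', { backgroundColor: 'yellow' }) to make every link stand out, provided the page comes from the same domain as the script.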
A PHP wrapper was built to overcome this issue: it calls file_get_contents() with an external URL to read the HTML of any web page. This requires that the web server we are working on allows fetching content from external URLs (in PHP, the allow_url_fopen setting). With this method we receive the whole content of a web page and store it in a variable. If we simply print it, we get a page with all of its text but with almost no images and, probably, no CSS formatting. This is because we have effectively copied the HTML into a document on our own web server, while every relative link, and the “src” of every image, is still a relative path, which the browser now resolves against our server instead of the original one.
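To see why the images go missing, it may help to look at how a relative path is resolved. The sketch below is in JavaScript rather than PHP and uses the standard URL class; the helper name `resolveAgainst` and the example hosts are made up for illustration.

```javascript
// Hypothetical helper: resolve a relative reference the way a browser
// would for a page located at `pageUrl`.
function resolveAgainst(relative, pageUrl) {
  return new URL(relative, pageUrl).href;
}

// The same "src" value, seen from the original page and from the wrapper.
// Both addresses are invented examples.
const fromOriginal = resolveAgainst('images/logo.png', 'http://original.example/site/index.html');
const fromWrapper = resolveAgainst('images/logo.png', 'http://proxy.example/farfalla/wrapper.php');
```

Here fromOriginal points at the original server, while fromWrapper points at a file on our own server that most likely does not exist.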
Luckily, HTML has a tool for this kind of problem: the <base> tag. It has to be placed in the <head> of the file we are rendering, which is very simple to do with PHP and some regexp replacement. This tag takes a URL as its “href” attribute and makes every relative link in the page use that URL as the base for its path.
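The regexp replacement could look roughly like the sketch below. It is written in JavaScript rather than in the wrapper's PHP, and the function name `injectBaseTag` is hypothetical; it is one possible way to do the insertion, not necessarily the one Farfalla uses.

```javascript
// Hypothetical helper: insert a <base> tag right after the opening <head>.
function injectBaseTag(html, baseUrl) {
  // Escape characters that would break the attribute value.
  const safeUrl = baseUrl.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  const baseTag = '<base href="' + safeUrl + '">';

  // Matches <head> with or without attributes, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // A replacer function keeps "$" in the URL from being read as a pattern.
    return html.replace(headOpen, function (match) {
      return match + baseTag;
    });
  }
  // Crude fallback for markup without a <head>: put the tag first.
  return baseTag + html;
}
```

Placing the tag immediately after the opening <head> means it comes before any stylesheet or script reference in the header, so those are resolved against the new base as well.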
We also have to extract the path to the page we are currently browsing. This can be done with the PHP function parse_url(), cutting off the unneeded parts of the URL (mainly the file name and its query string) via regexp replacement again.
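A minimal sketch of this step, again in JavaScript rather than PHP: the standard URL class plays the role of parse_url(), and a regexp strips the file name. The function name `basePathOf` is hypothetical.

```javascript
// Hypothetical helper: reduce a page URL to the directory it lives in,
// suitable for use as the "href" of a <base> tag.
function basePathOf(pageUrl) {
  const u = new URL(pageUrl);
  // Keep the path up to and including the last "/"; this drops the file name.
  // The query string and fragment are not part of `pathname` at all.
  const dir = u.pathname.replace(/[^\/]*$/, '');
  return u.origin + dir;
}
```

For example, basePathOf('http://example.com/a/b/page.php?x=1') gives 'http://example.com/a/b/'. Note that a URL whose last path segment is a directory written without a trailing slash is treated like a file name, which matches how browsers resolve relative links.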
