Introduction
This particular HTML browser project has been discontinued. Most of the
development took place in 1996; the last minor bugfix version dates from
2002-03-18 and is available for download.
It has been succeeded by a new ongoing project:
Synx.
A simple HTML Browser realized as a Java Applet
Scanning and Parsing
A complex language can be evaluated with the technique of scanners and
parsers. Before a text written in a (formal or informal) language can be
interpreted, it must first be broken into separate tokens. This
lexical process is done by the scanner, which must know what tokens to recognize.
Then the resulting TokenSequence is checked and the tokens are
grouped into meaningful units. This grammatical process is done by the parser, which must
know the rules of the grammar of the language. Once all tokens have been scanned and
grouped according to the grammatical rules, the interpretation
of the statements (or sentences) can start; what it does depends on the specific
task.
The Tokens
Tokens consist of
- a grammatical type to which they belong, like TAG
- a symbol that they match, like Assurance
- a token string which they raise when they occur, like BODY
Tokens declared literally in a file are specified one per line in the form
type|symbol|token|, for example TAG|<BODY>|BODY|.
Two kinds of tokens exist:
- definite tokens, which explicitly name the String that they match,
like Assurance
- indefinite or regular tokens, which match all Strings that
satisfy a certain Regular Expression, like [A-Z][0-9]*
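The difference between the two kinds of tokens can be sketched in a few lines of Java. This is only an illustration: the class TokenDecl, its fields, and the type name ID are hypothetical and are not taken from the Orbital library, and java.util.regex stands in for whatever regular expression engine the original scanner uses.

```java
import java.util.regex.Pattern;

/** Hypothetical model of one token declaration (not the Orbital library's class). */
final class TokenDecl {
    final String type;    // grammatical type, e.g. TAG
    final String symbol;  // literal text or regular expression to match
    final String token;   // token string raised on a match, e.g. BODY
    final Pattern regex;  // non-null only for regular (indefinite) tokens

    TokenDecl(String type, String symbol, String token, boolean regular) {
        this.type = type;
        this.symbol = symbol;
        this.token = token;
        this.regex = regular ? Pattern.compile(symbol) : null;
    }

    boolean isDefinite() {
        return regex == null;
    }

    /** True if the whole input string is matched by this declaration. */
    boolean matches(String input) {
        return isDefinite() ? symbol.equals(input) : regex.matcher(input).matches();
    }

    public static void main(String[] args) {
        TokenDecl definite = new TokenDecl("WORD", "Assurance", "Assurance", false);
        TokenDecl regular = new TokenDecl("ID", "[A-Z][0-9]*", "=", true);
        System.out.println(definite.matches("Assurance")); // true
        System.out.println(regular.matches("A42"));        // true
        System.out.println(regular.matches("a42"));        // false
    }
}
```

A definite token compares the input with one fixed String, while a regular token accepts every String in the language of its expression.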
The Scanner
The scanner is a base class that consecutively reads tokens from an
input Reader or scanner. The token declarations must be provided, either
directly as an array or indirectly, written in the token declaration syntax and
read from a (File-)Reader.
This implementation scans definite tokens, which are explicitly declared
in the lexical token declaration file (like Assurance), and regular
tokens, which are specified as a Regular Expression (like [A-Z][0-9]*).
Reading from an input Reader, the scanner first tries to match the
longest definite token possible. If no token matches or no alternativeToken()
can be found for the current symbolPart, then all Regular Expression specifications
are run, where possible, through a nondeterministic automaton that matches
Regular Expressions.
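The matching order described above (longest definite token first, regular expressions as a fallback) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Orbital scanner: the class SketchScanner and its tables are hypothetical, it scans a String instead of a Reader, it leaves out alternativeToken(), and it uses the backtracking engine of java.util.regex rather than a nondeterministic automaton.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner sketch: definite tokens first, regular tokens second. */
final class SketchScanner {
    // {type, symbol, token string} for a few definite tokens
    static final String[][] DEFINITE = {
        {"TAG", "<B>", "B"}, {"ETAG", "</B>", "B"}, {"STAG", "<BR>", "BR"}
    };
    // Regular tokens: parallel arrays of types and precompiled expressions
    static final String[] REGULAR_TYPES = {"WORD", "SCHAR", "SKIP"};
    static final Pattern[] REGULAR_PATTERNS = {
        Pattern.compile("[a-zA-Z]+"), Pattern.compile("[_.,;:!]"), Pattern.compile("[ ]+")
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < input.length()) {
            String type = null;
            String text = null;
            int length = 0;
            // 1. Try the longest definite token that starts at pos.
            for (String[] d : DEFINITE) {
                if (input.startsWith(d[1], pos) && d[1].length() > length) {
                    type = d[0];
                    text = d[2];
                    length = d[1].length();
                }
            }
            // 2. Only if no definite token matched, try the regular expressions.
            if (type == null) {
                for (int i = 0; i < REGULAR_PATTERNS.length; i++) {
                    Matcher m = REGULAR_PATTERNS[i].matcher(input).region(pos, input.length());
                    if (m.lookingAt() && m.end() - pos > length) {
                        type = REGULAR_TYPES[i];
                        text = m.group();
                        length = m.end() - pos;
                    }
                }
            }
            if (type == null) {
                throw new IllegalArgumentException("no token matches at position " + pos);
            }
            // Tokens of type SKIP are consumed but not reported.
            if (!type.equals("SKIP")) {
                tokens.add(type + ":" + text);
            }
            pos += length;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [TAG:B, WORD:Hi, ETAG:B, WORD:you, SCHAR:!]
        System.out.println(scan("<B>Hi</B> you!"));
    }
}
```

Definite tokens report their declared token string (B for both `<B>` and `</B>`), whereas regular tokens report the text they actually matched.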
The Parser
A parser is capable of parsing a TokenSequence and returning the Symbols
that the tokens result in. For HTML the parser is rather simple:
it concatenates the tokens of type WORD and CIRCUM. Additionally,
it recursively starts another HTML parser when a token of type TAG is found
and finishes when the matching ETAG (if necessary) is found.
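A recursive parser of this kind might look like the sketch below. It is an assumption-laden illustration, not the original parser: the class SketchParser, the Node type, and the representation of tokens as {type, text} pairs are all hypothetical, and whitespace between words is ignored because SKIP tokens never reach the parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical recursive parser sketch over {type, text} token pairs. */
final class SketchParser {
    /** A parsed element: a tag name plus text runs (String) and nested elements (Node). */
    static final class Node {
        final String tag;
        final List<Object> children = new ArrayList<Object>();
        Node(String tag) { this.tag = tag; }
        @Override public String toString() { return tag + children; }
    }

    /** Parses tokens until the ETAG matching the given tag, or until the input ends. */
    static Node parse(String tag, Iterator<String[]> tokens) {
        Node node = new Node(tag);
        StringBuilder text = new StringBuilder();
        while (tokens.hasNext()) {
            String[] t = tokens.next();
            if (t[0].equals("WORD") || t[0].equals("CIRCUM")) {
                text.append(t[1]);                       // concatenate WORD and CIRCUM
                continue;
            }
            flush(node, text);
            if (t[0].equals("TAG")) {
                node.children.add(parse(t[1], tokens));  // recurse into the nested element
            } else if (t[0].equals("STAG")) {
                node.children.add(new Node(t[1]));       // single tag, no end tag expected
            } else if (t[0].equals("ETAG") && t[1].equals(tag)) {
                return node;                             // matching end tag closes this parser
            }
        }
        flush(node, text);                               // end of input closes open elements
        return node;
    }

    private static void flush(Node node, StringBuilder text) {
        if (text.length() > 0) {
            node.children.add(text.toString());
            text.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Tokens for <B>Gr&ouml;&szlig;e</B><BR>; the CIRCUM tokens carry the decoded characters.
        List<String[]> tokens = Arrays.asList(new String[][] {
            {"TAG", "B"}, {"WORD", "Gr"}, {"CIRCUM", "\u00f6"}, {"CIRCUM", "\u00df"},
            {"WORD", "e"}, {"ETAG", "B"}, {"STAG", "BR"}
        });
        System.out.println(parse("ROOT", tokens.iterator()));
    }
}
```

The example shows why WORD and CIRCUM tokens are concatenated: a single word containing character entities arrives at the parser as several tokens and has to be joined back into one text run.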
The Interpreter - Browser
For this simple HTML Browser, the Interpreter only prints out the set of
symbol words parsed and reacts to the enclosing Tags. For every Tag known
to this Browser, a different graphical style is applied before the text is displayed.
I know that this is not all a Browser does when displaying a page, but for a single
demonstration of a parser's possibilities it's enough, I think.
Of course, this HTML Browser respects line breaks and automatically
wraps the text onto the next line if necessary. The Browser still does not
follow links (which, by the way, is actually the most important facility a Browser should
offer).
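The automatic fitting of text to the next line can be illustrated with a greedy word wrap. The sketch below is an assumption, not the Browser's code: the class LineWrapSketch is hypothetical, and it measures width in characters, whereas a graphical browser would measure the pixel width of each word in the current font.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical greedy word wrap, measuring width in characters. */
final class LineWrapSketch {
    /** Packs words into lines of at most maxWidth characters where possible. */
    static List<String> wrap(List<String> words, int maxWidth) {
        List<String> lines = new ArrayList<String>();
        StringBuilder line = new StringBuilder();
        for (String word : words) {
            // Start a new line if the word plus a separating space would not fit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxWidth) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);  // a word longer than maxWidth gets a line of its own
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints [a simple, HTML browser]
        System.out.println(wrap(Arrays.asList("a", "simple", "HTML", "browser"), 12));
    }
}
```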
The token declaration file
For the HTML Browser, the lexical token declaration file is:
TAG|<HTML>|HTML|
ETAG|</HTML>|HTML|
TAG|<HEAD>|HEAD|
ETAG|</HEAD>|HEAD|
TAG|<BODY>|BODY|
ETAG|</BODY>|BODY|
TAG|<H1>|H1|
ETAG|</H1>|H1|
TAG|<H2>|H2|
ETAG|</H2>|H2|
TAG|<H3>|H3|
ETAG|</H3>|H3|
TAG|<B>|B|
ETAG|</B>|B|
TAG|<I>|I|
ETAG|</I>|I|
STAG|<BR>|BR|
STAG|<P>|P|
STAG|</P>|/P|
WORD|([a-zA-Z]+)|=|
SCHAR|([_.,;:!])|=|
CIRCUM|&auml;|ä|
CIRCUM|&ouml;|ö|
CIRCUM|&uuml;|ü|
CIRCUM|&szlig;|ß|
CIRCUM|&eacute;|é|
SKIP|([ ]+)|=|
SKIP|
| |
For this HTML Browser example, the types are
- TAG - represents a beginning tag like <A>.
- ETAG - represents an ending tag like </A>.
- STAG - represents single tags that don't need a matching end tag, like <BR>.
- WORD - a regular expression that represents any natural-language word
(with normal chars).
- SCHAR - represents a special character such as a full stop or an exclamation
mark.
- CIRCUM - represents the HTML circumscriptions (character entities) for Unicode
characters, like &eacute; for é.
- SKIP - a type internal to the scanner; tokens of this type
are regarded as useless and are skipped.
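Reading one line of such a declaration file could be sketched as follows. The class DeclarationLineSketch and its methods are hypothetical. The sketch also assumes, judging only from the file shown above, that a symbol wrapped in parentheses denotes a regular expression; the declaration syntax itself is not documented here.

```java
/** Hypothetical reader for declaration lines of the form type|symbol|token| */
final class DeclarationLineSketch {
    /** Splits one declaration line into its three fields: type, symbol, token string. */
    static String[] parseLine(String line) {
        // Limit -1 keeps trailing empty fields, so the closing '|' is harmless.
        String[] fields = line.split("\\|", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("malformed declaration: " + line);
        }
        return new String[] { fields[0], fields[1], fields[2] };
    }

    /** Assumption: a symbol wrapped in parentheses is a regular expression. */
    static boolean isRegular(String symbol) {
        return symbol.length() > 2 && symbol.startsWith("(") && symbol.endsWith(")");
    }

    public static void main(String[] args) {
        String[] lines = { "TAG|<BODY>|BODY|", "STAG|</P>|/P|", "WORD|([a-zA-Z]+)|=|" };
        for (String line : lines) {
            String[] f = parseLine(line);
            System.out.println(f[0] + " matches " + f[1] + " (regular: " + isRegular(f[1]) + ")");
        }
    }
}
```

Splitting on every `|` is only adequate for this file; a regular expression that itself contains `|` would need an escaping rule that the sketch does not attempt to guess.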
The Java HTML Browser
requires the Java 2 Platform or the Java Collections Framework. The application
displays the parsed HTML page in a separate window.
Download
If you want to test the simple HTML browser or the scanner & parser:
- download Simple Browser, or
- download the sources, which are part of the examples in the
Orbital library 1.0 documentation
- Note that our simple browser project has been discontinued in favor
of Synx, so the simple browser is no longer included in
Orbital library release 1.1, only in 1.0.
The Future
See introduction.