Chapter 7: Parsing STxT (*)
Parsing an STxT file is much easier than parsing files from other technologies. It may seem paradoxical, since it is actually a very powerful language, but at the same time, it is based on very simple principles.
I will explain my way of parsing a file. It may not be the best or most optimal method, but it is one way to do it. In fact, if you want to see the implementation I have made, it is available online:
This implementation has been done in Java, as it is the language I am most familiar with.
I hope that STxT is successful, and that other implementations appear very soon.
I will not go into all the details, but I would like to explain some points that require more attention.
If you do not intend to implement a parser, do not continue reading. The next chapter is much more interesting ;-)
Generic Process
Line-by-line Parsing
The parsing process can be done line by line, so we can say that, in general, we have:
while not end of file read line process line end while
During the process, it is advisable to have a list of the last nodes we have encountered according to the level, as this is crucial for correct processing.
Line Processing
The first step in line processing is the normalization of the line. A line is normalized when it is in compact (or semi-compact) form, so we need to check if it is, and if not, transform it. During normalization, comment lines are also removed.
Keep in mind that if the previous node was a text node, it will be part of that same node when it reaches a certain level. In other words, it will be continued text. It will also be part of it if it does not reach the level, but the line is completely blank, in which case it will be translated into text with a line break ([[chapter_06.html#index_9|See advanced tutorial]]).
Once we have compacted the line, processing continues independently, and all that remains is to obtain the level of the new line and distinguish between a few cases:
- Are we in the first node?
- Is the line text from a previous text node?
- Is a new node starting?
In each case, we update the state of our variables and proceed with the process.
Note: The most important thing here is to see that this is a process that can be done line by line, and the decisions to be made are relatively simple. This allows us to have a very efficient parser, which can also act as a grammar and node validator.
Validations
Validations are performed at various points in the parsing process:
- When creating a new node: When creating a new node, it is validated that its namespace can be deduced. Otherwise, it means that it could not be created in that position and would be incorrect.
- When closing a node: When we consider a node closed, it is validated. ** If it is a NODE type, it is validated that the number of children is correct. ** If it is not a NODE type, it is validated that it has the appropriate content depending on its type.
When is a node considered closed? This is an interesting point, as there are two circumstances that cause a node to be considered closed. One of them is when another node with an equal or lower level appears. The other is when the entire file has been processed and there are no more nodes to validate. At these points, the node is considered closed, and validations can begin.
Language Nodes
In the language description, we mentioned that data types have no limitations and are not tied to a specific language, so validations should only be checked using regular expressions or methods that ensure this.
We have the following node types:
- NODE
- TEXT
- URL
- NATURAL
- INTEGER
- RATIONAL
- NUMBER
- BINARY
- HEXADECIMAL
- BASE64
- BOOLEAN
For example, the regular expressions we could use to validate nodes are:
BINARY = ^(0|1|\s)+$ BOOLEAN = ^0|1$ HEXADECIMAL = ^([a-f0-9]|\s)+$ INTEGER = ^(\-|\+)?\d+$ NATURAL = ^\d+$ NUMBER = ^(\-|\+)?\d+\.\d+(e(\-|\+)?\d+)?$ RATIONAL = ^(\-|\+)?\d+\/\d+$
Grammars
Storage
Grammars are obtained from the Internet, but it is neither practical nor efficient to have to fetch definitions remotely every time. The most efficient strategy is to have a kind of grammar repository, on disk, and always fetch them from there. If not found, they would be fetched from the Internet, and that repository would be updated. It is also possible to set verification times or other strategies. The idea is that grammars do not change over time, or at least they should be backward compatible.
Initial Grammar
Keep in mind that it is not possible to create a parser without having the grammar beforehand. To parse a grammar, you need the definition of the base grammar already parsed. For this reason, there will be an initial grammar definition in the code itself.
Details to Consider
There are some details to consider in parsing:
- Case-insensitive: All nodes are considered CASE-INSENSITIVE, so the appropriate transformations must be made during the parsing process.
- Base64: With BASE64 text, line breaks must be allowed, and a standard parse of the obtained content must be performed.
- For reading lines, both UNIX and DOS formats must be considered. Therefore, both line breaks and line break + carriage return will be allowed. This is done to allow quick file edits from any environment, although the most appropriate standard is always UNIX (line break only).