This FORTH is non standard but should be familiar.
This FORTH is implemented mainly in AARCH64 assembly language (main.asm), and runs hosted under OSX on the Apple M1.
-
A few C library functions are used, the program is linked against the C library, it is possible to call C functions (from assembler), which will be useful, in order to talk to the operating system.
-
The implementation is specific to features of the ARM V8 64 bit processor, such as wordsizes.
-
FORTH primitives are implemented as assembly language functions, the compiler converts high level FORTH words into list of tokens for the token interpreter(s) to execute.
This is not a standard implementation, I am aiming to provide a very small set of practical FORTH like words that improve comfort, safety and convenience for the user of the language (such as me.)
-
Throwing in 'everything and the kitchen sink in case it may some day be useful' is counter to FORTH principles.
-
I expect to extend the ASM file as I write my own apps, I plan to test and script my apps in FORTH and assembly language, using C to connect with the OS.
-
There is no assembler defined in FORTH, the machines assembler is used.
Untyped - Storage has no type other than the size of storage cells used.
The syntax follows FORTH closely, including reverse polish notation, composition of new words (functions) by word concatenation, very similar control flow constructs etc.
Inner interpreter
The compiler compiles words to tokens, which are then executed by a simpler interpreter.
The inner interpreter:-
-
is not running at the command line in this implementation.
-
only runs when a high level word is executing, otherwise the command line interpreter and compiler are just running machine code, the interpeter and compiler are written in ASM, they are not written in FORTH.
Each high level word invokes a version of the inner interpreter to run itself, multiple different versions of the interpreter(s) exist, and they can be selected after the word has been compiled or while it is being compiled.
I am using a certain ammount of brute force and ignorance in the current design of this program, which may not scale, but which works for the moment, in the spirit of getting up and running.
The forth.forth file is loaded when the application starts.
This file should contain any high level FORTH words you want to add to your program.
It is presently set up to clear the screen, and display the words, before entering the interactive session.
Values are preferred to VARIABLES, as they are much safer to use, and simpler to access.
-
A value is created with the VALUE word.
-
A value has no type, it could be used to represent an int, float, char etc.
-
A value has only a size in bits, the base VALUE size is 64 bits.
199 VALUE myvalue
A value may be read (fetched to the stack) by calling itself, and then printed with dot.
myvalue .
A value is set with the TO word
200 TO myvalue
The default cell size for values is 8 (64 bits) each element can store a large integer value or a double float, they are untyped other than bit length.
An array of values is created with the VALUES plural word.
128 VALUES myvalues
These VALUES are allotted from pools of memory that are zero filled, so all values will initially read as 0.
To change every value.
1 FILLVALUES myvalues
Every entry in myvalues becomes 1.
To change one entry use TO while providing the new value and index
1000 7 TO myvalues
7 myvalues .
Value entry 7 in myvalues is set to 1000.
You can increment values efficiently using +TO
+TO is just like TO except it adds its argument to the VALUE.
0 VALUE counter
counter 1+ TO counter
// is the same as
1 +TO counter
// which is shorter and faster
Some special values are built in.
These are not FORTH standard words.
I believe that LOCALS are key to making FORTH words easier and safer.
LOCALS provide each word with only eight (64bit) local variables.
-
LOCALS can be treated as a special array of values of length 8 (0..7) that provide some small local memory storage for each word invocation.
-
LOCALS backing memory is implemented as a stack, allowing about 250 levels of depth.
On entry to a word LOCALS are erased, all values will be initally read as zero.
- LOCALS cease to exist and are later reused when a word ends.
Storage for WLOCALS
-
WLOCALS use the same memory as LOCALS providing word sized access (32bits) to 16 (0..15) Values (rather than 8.)
-
If it is more convenient to have 16 smaller values use WLOCALS instead of LOCALS
-
These are both just views over the same 64 bytes of memory in the local memory stack.
To recap locals are auto created between : and ;
e.g.
: t1
127 FILLVALUES WLOCALS
// filled the 16 values with 127
15 WLOCALS
14 WLOCALS
+ .
;
Should return 254.
Every high level word, normally gets its own fresh set of LOCALS when it starts.
They are not normally shareable between words.
After t1 runs; if you type 14 WLOCALS . it will be zero, the command line level has its own set of LOCALS as well.
Locals are fine to use, they are not slower than juggling the stack, and they are easier to think about.
You may have a recursive word that you do not want to eat into the LOCALS stack.
- You can declare that a word is FLAT like this.
FLAT word.
: FIB ( n -- n1 ) [ FLAT FIB ] DUP 1> IF 1- DUP 1- FIB SWAP FIB + THEN ;
This makes FIB a very, very, tiny fraction faster, since the allocation of LOCALS are not slow.
A FLAT word has no locals of its own - it sees the locals of its parent word, the parent being the word that called it, or the command line.
-
Using the standard local access can be cumbersome, the name LOCALS does not mean much.
-
FLAT words can be used to create new words for accessing the parents locals, in the simplest case this just lets you give local variables with some meaningful names.
// allocate a local to our speed.
: set-speed ( n -- ) [ FLAT set-speed ]
0 TO LOCALS ;
: speed ( -- n ) [ FLAT speed ]
0 LOCALS ;
: test
10 set-speed
speed . ;
set-speed and speed are working on the LOCALS array shared with test.
Accessor words could do a lot more, like checking that the given speed is valid etc.
Without FLAT, set-speed and speed would each read their own LOCALS a level above test and test would not work.
You can simply read LOCALS also using some predefined accessors.
These are named a .. h
Set them with n a! .. h! e.g.
10 a! a .
Prints 10.
There is a handy accessor for just counting. Provided only for a, b, c and d.
a++
The a is set to zero when the word starts, a++ adds one.
LOCALS whatever we name them use the same storage meaning some of the names overlap with each other.
The names and addresses of the locals are listed in the table.
LOCALS | WLOCALS | Accessor 64 bits | Accessor 32 bits |
---|---|---|---|
0 | 0 | a a! a++ | |
1 | |||
1 | 2 | b b! b++ | |
3 | |||
2 | 4 | c c! c++ | |
5 | |||
3 | 6 | d d! d++ | |
7 | |||
4 | 8 | e e! e++ | |
9 | |||
5 | 10 | f f! | |
11 | |||
6 | 12 | g g! | |
13 | |||
7 | 14 | h h! | x x! |
15 | y y! |
e.g. h is 7 LOCALS
h overlaps with 14 WLOCALS and 15 WLOCALS
h is also accessible as two 32bit words called x and y.
Standard FORTH tends to use >R and R> for 'extra space' to keep values. Since R is also used for control flow, that seems like a dangerous thing to do, using locals is the normal approach in this FORTH implementation.
Just use e! and e instead of >R and R> in common cases.
You can feed up to 8 (64bit) parameters from the stack into the words LOCALS with the PARAMS word.
: sq 1 PARAMS a a * ;
1 PARAMS
loads 1 argument from the stack into a.
If there are not enough parameters on the stack this will cause an error.
The LOCALS are available in several ways
For example they can be accessed with a .. h
For a word that takes lots of parameters this can be helpful.
The example above really did not need to do this :)
e.g.
: sq DUP * ;
is much nicer.
Variables are names that point to an address, they return an address not a value.
This is why they are unsafe, they rely on using words that work on raw memory addresses which is a potentially hazardous thing to do.
\ create a variable
10 VARIABLE fred
\ add to fred
20 fred +!
\ fred returns its address
fred .
4307234272
\ fetch the value from fred
fred @ .
30
\ TO is a safer way to set the value of fred.
7 TO fred
\ reading fred is dangerous
fred @ .
7
\ set the memory at fred to a value, not safe.
9 fred !
Ok
SEE fred
SEE WORD :4307234240 fred
0 :4307234272 ^VALUE [ 9 ]
8 :4305758228 VARIABLE
\ classic code to create a text buffer
0 VARIABLE buffer 256 ALLOT
\ try this instead
' my text ' STRING myText
256 STRING myStaticString
Where possible use VALUES instead of variables to avoid the direct memory read and write words, ! and @.
None the less VARIABLE does exist, it is still useful to get the address of some memory sometimes.
These are processor register (global volatile) variables named §c .. §f
The prefix symbol is found under escape and is not shifted, at least on UK keyboards.
These are volatile because they are stored in the floating point 64bit registers D8 on.
D0-D7 are freely trashed by C functions and also used by FORTH words so they are not available for general use.
The 24 remaining registers were floating around often unused, which seems like an opportunity.
Set a volatile variable with the name followed by store (!)
0 §c!
Read them just with their name.
§d
\ clear the volatiles to zero
§clear
These are the floating point registers but you can store any 64bit values in them, including any floating point values.
They are a useful way to feed values into a complex primitive function, perhaps the inner loop in something doing floating point or vector operations.
The § prefix makes them stand out when searching code.
Using these is NOT faster than using LOCALS or standard VALUES with TO.
For §c .. §f there are ++ increment add +! and floating add +! suffixes, that increment and add values to the volatiles respectively.
I will be using these words to draw some graphics later, I want to use floating point.
The available volatiles are c, d, e, f, x, y and z. If you need more of the 24 available registers in your code, uncomment them in the asm file.
A word can refer to itself in its own definition (recursion) like FIB did above.
Also a word has pointers to itself available.
There are two special variables that allow a word to see it is own address.
These are
-
CODE^ which points to the current words token code.
-
SELF^ which points to the current words dictionary header.
In a FLAT word, these usefully both point at the parent words CODE and HEADER.
An unusal way to make a word repeat itself is to do this.
CODE^ 2 - IP!
Using a flat word we can use this information to create a new control flow word.
: TAIL [ FLAT TAIL ] CODE^ 2 - IP! ;
-
This sets the instruction pointer to just above the parent words code.
-
The interpreters next step will be to start running the parent word again from the top.
SELF^ may be more useful, as a word can use it to look inside its own dictionary header.
A word can print its own name for example with
SELF^ >NAME $.
Again a new flat word could be created to print its parents name
: .name [ FLAT .name ] SELF^ >NAME $. ;
These two values (SELF^ and CODE^ ) are stored on the LOCALS stack, below the locals.
Strings are intentionally non standard.
Strings are zero-terminated. To emphasize the major differences from standard FORTH, string literals here use single quotes not normal FORTH double quotes.
A string is created with an initial text value like this.
' This is my initial value ' STRING myString
A string returns the address of its data.
myString $.
Given the address $. will print the string.
$'' STRING myEmptyString
Is how to create an initially empty string.
-
A string should not normally be mutated, each unique string exists only once in the string pool, changing one would impact all other usages everywhere in a program.
-
Strings can be compared with
$= and $ == which check if they are same and $COMPARE wich checks if one is equal, greater, or less than the other. -
The STRING word creates a new word that points to the string pool storage and gives it a name.
Just as you can create a STRING you can also create a number of strings.
10 STRINGS myStrings
Will create 10 strings.
To set the value of a string you use TO, e.g.
' string zero ' 0 TO myStrings
Ok
0 myStrings $.
string zero
0 myStrings returns the address of the 0th string.
$. is a word that prints a string.
The storage for the strings text normally lives in the string pool a STRING word just points a name at it.
Static Strings
A static string is a string that stores its own data outside of the shared strings pool.
The idea is to define these using the STRING creating word.
These are created with an additional length in characters (bytes.)
// define the UTF8 unicode monster
8 STRING _mstr 0xF0 , 0x9F , 0x91 , 0xBE , 0 ,
Creates a string that points to 8 bytes of its own storage.
These can be loaded with data using the comma operator
In this case the string is loaded with the 4 byte UF8 code for the monster symbol.
If you print it with $. you will see 👾
The quote comma operator can append text like this
12 STRING ' hello world',
// the clear line terminal escape sequence
8 STRING cln 27 , ' [2K',
// 27 is the escape code , appends.
Often a larger string needs to be built up from smaller parts
String literals can be appended using $+
Any string literal ending in '+ is auto appended to A$.
Given the address of a string,
These are not strings.
These are memory blocks outside of the strings literal pool.
A$ is storage for appending, clear A$ with CLR.A$ and then append.
B$ is storage used by the internal string functions
C$ is some free storage if you want to nake a copy of A$.
To use A$ in a regular STRING you need to intern its contents into the string pool.
e.g.
CLR.A$
' This is some text '+ ' so is this '+ A$ $intern
STRING myText
CLR.A$
' test '+ ' this '+ A$ $intern TO myText
-
As this is an interpreter it is almost always slower to use two words when one word will do.
-
It is also faster in general if each word does more, there can be as much overhead from the interpreter calling each word as there is in the word itself.
The compiler also does not optimize, so it is often up to the programmer to choose to use a faster optimal word not up to the compiler to invent them on the fly.
A good example is that 1 +
is slower than 1+
so if your word does 1 +
millions of times, this will have a performance impact.
For this reason FORTH interpreters often come with dozens of little optimized words and this is no exception. DRUP is faster than DUP DROP, a number of common word sequence have their own name.
Primitive words are written in machine code and are faster than high level words written in FORTH.
An approach in this implementation is to provide a few words for defining some of the optimized words, so you can define the words your specific program actually benefits from.
Shifting left and right
You can define words that perform left and right shifts for faster multiplication and division.
3 SHIFTSL 8*
Defines the new word 8* that SHIFTS the top of stack left 3 times, effectively multiplying by 8. SHIFTSR is the opposite word that does division.
1 ADDS 1+
Defines a fast little word for adding 1 to the top of the stack, the opposite word is SUBS.
' this is the age of the train' 23 5 $slice $.
Prints 'train'
Slicing uses a slice buffer, and the string pool for storage.
To save the result from a slice send it somewhere, such as to a STRING e.g.
$'' STRING vehicle
' this is the age of the train' 23 5 $slice TO vehicle
The B$ and A$ storage is used while appending.
When used in the interpreter, only the final result is placed in the string pool, the literals being available from the interpreted text.
When used in the compiler any literal text parts have to be stored in the string pool (since these also need somewhere to live.)
FORTH has a parameter (or data) stack; used for evaluating expressions, and passing arguments to words. This has a rich set of words that operate on it, such as SWAP, ROT, OVER, PICK etc.
There is also a return stack that is used in this implementation for some of the control flow constructs, for various loops, exposed by the R> >R words.
There is a hardly visible stack used for the local variables made available to each word.
We also have user stacks, these are extremely simple, and only allow values to be pushed or popped. There are none of the features of the main parameter stack.
Like most stacks, the last thing added is the first thing removed LIFO (last in first out)
To declare a user stack just use the STACK creation word.
\\ create a stack of file handles, of depth 18.
18 STACK file_handles
\\ add some values
0 TO file_handles
1 TO file_handles
2 TO file_handles
\\ remove and print the values
file_handles .
2
-
To take a value from a STACK just use its name, the value on the top of the stack is read.
-
To push a value use the TO word, with just the value to push.
-
The storage of a stack is initially set to all zeros.
-
A STACK resembles a VALUES array object, that maintains its own reference to the last item pushed.
Floating point words begin with f e.g. f.
Numbers containing a decimal point . are assumed to be floating point numbers.
The same parameter stack is used, which may hold 64 bit (double) floats or 64 bit (quad) integers.
A small set of floating point operations are implemented, typical comparison and math operations that the CPU directly supports are found starting with f, such as f+ and f.
e.g.
22.0 7.0 f/ f.
Most of the LOOPing words, only work inside a compiled word definition between : and ;
A loop that works anywhere is the very simple n TIMESDO word loop.
: DOT-DASH CHAR . EMIT CHAR - EMIT ;
\\ TIMESDO works in the interpreter
32 TIMESDO DOT-DASH
.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
Ok
n TIMESDO - executes the word that follows it, n times.
It is simpler than other LOOPS and less powerful, it is also faster at doing the simple thing it does.
It also works in a compiled word e.g.
\\ Also works in a compiled word.
: DASHED-LINE CR 32 TIMESDO DOT-DASH CR ;
These loops work only inside of compiled words, not at the command line.
DO LOOP counts between Finish and Start
A LOOP can be nested and has an index value accessed by depth called I, J, K.
: RECT 255 0 DO I 16 MOD 0= IF CR THEN .' *' LOOP ;
: TABLE CR 11 1 DO 11 1 DO I J * . SPACE LOOP CR LOOP ;
- I, J, K work within a word, but NOT across words.
- LEAVE is supported ONCE in a word inside of loops.
- You can not have multiple LEAVES.
- WHILE is supported only ONCE.
- WIERD mash-up combinations of the three different LOOPS do not work.
- They do in some FORTHS which is interesting.
Recusion is always an option for looping.
Just call yourself in a word, a word knows its own name.
Recursion uses space on the machine code stack and (typically) on the locals stack.
Iteration
Any repeating word can also be expressed using BEGIN .. EXIT .. AGAIN
The I/O is presently set up for the UNIX terminal.
The interpreter accepts lines from the terminal, with buffering, the input is often STDIN, although the input can sometimes be a file, as it is at startup.
The various printing words, . $. f. .' hey'
etc print to STDOUT, and they buffer for terminal efficiency.
-
KEY - reads a user key press from STDIN
-
EMIT - writes the char to STDOUT, without delay.
-
KEY really needs NOECHO to be set
- After using NOECHO use RETERM to return the terminal to its normal state.
-
KEY? returns true if a key is pending, it can be used in a LOOP like this.
-
FLUSH - flushes output.
: .keys
NOECHO
BEGIN
KEY? IF
KEY DUP DUP EMIT CHAR =
EMIT . 32 EMIT FLUSH
81 = IF RETERM EXIT THEN
THEN
100 MS
AGAIN
;
This will display key pressed and its multiple ASCII value(s).
Terminals accept a wide variety of commands.
FORTH typically only implements PAGE
to clear the screen and AT-XY
to place the cursor.
The input normally just uses UNIX getline.
You direct FORTH console to load from a file with FLOAD filename (from the ok prompt only)
You can accept a line of text from a user with ACCEPT e.g.
\\ Empty String
$'' STRING myInput
Ok
ACCEPT TO myInput
Test this then <- Typed in
Ok
myInput $.
Test this then
Exciting interactions become possible
$'' STRING yourName
: ask-name? .' What is your name ? '
CR ACCEPT TO yourName
CR .' Hello :' yourName $. ;
ask-name?
What is your name ?
Alban
Hello :Alban
-
Writing the interpreter and compiler in simple assembly language provides some benefits, such as testing the token compiled code. The interpeter is not made out of the same token compiled code being tested.
-
The interpreter in assembler, also means it is not as open to high level FORTH extensibility as it would be if it was written in FORTH.
-
The implementation misses some of the self-extending awesome powers of standard FORTH.
- The various compile time words are frozen forever (until you edit them) in the assembly language file.
Safety is a goal, which may seem strange given how unsafe FORTH is.
I want not to crash often, which is a high ambition.
FORTH functions exist to peek and poke values into addresses, which is dangerous.
To add safety the primitive words check for valid word types and bounds, when that is feasible, to prevent memory being stomped on by accident.
This does not really slow the interpreter down much, because interpreting high level code is relatively slow, and the primitives are comparatively as fast as lightning, so adding a few extra instructions to primitives to check if there is room in a string, before blatting it, and crashing, is just worth it.
I think the crashiness of FORTH is one of the things that put people off it, and made them prefer BASIC back in the day.
I really like the FORTHS that bootstrap from a dozen compiled words, and then implement everything in FORTH code, they are elegant and concise.
I loved the FIG forth that arrived as printout, that you could type into a machine code monitor in HEX and then debug the I/O for your 8 bit system.
The meta compiled self-hosting FORTHS are amazing.
I like and enjoy most of the different FORTHS out there, commercial and non commercial.
It is nice when a language is straightforward enough that anyone can write a simple version of it, and most fun of all, you can write a FORTH (and its interpreter) interactively and iteratively one feature at a time, so you always have something that works, and builds on itself over time.
This implementation is using a simple token interpreter that is mostly written in assembly language.
I have paid some attention to the performance of the inner loop, it is easy to test as you can try out different versions and time the results.
I chose to use 16bit tokens to represent words, rather than 64 bit addresses, the addresses would probably be faster, but that would be a different implementation, as lots of words are tuned for the token memory layout. 64 bit addresses felt.. wastefull ..
Forth words typically use dozens of tokens, very large words may use a few hundred.
It is a simple interpreter but FORTH is also a simple language.
There are commercial FORTH compilers that generate code that is closer in speed to optimized machine code.
There was a brief time on 8 bit systems when FORTH could claim to be faster than the dumb non optimizing C compilers available at that time, that is not the case now.
It is also not likely that simple assembler code is faster than C, although it should be smaller.
C functions can be used in FORTH primitives.
- The assembler (asm) code can call into C library code.
- e.g. add new C functions to addons.c and call them.
- Using only system calls would be limiting even in the terminal.
Writing C code is not always simpler than writing asm, C code is often faster, that is just a fact, the modern C compiler is very good.
Balanced against that, optimized C makes heavy use of all of the machines registers, so we have to preserve lots of registers that FORTH uses whenever we call into C, these are wide 64bit registers, stacking and unstacking them uses up memory bandwidth.
So as a rule we want to use C code when we need to, and when the function does significant useful work, and we want to stay in the FORTH world most of the time to avoid that call overhead.
If code is sufficiently simple, or at odds with the C compilers model of the machine, it can still be faster to write it in assembler.
Finally it is faster in terms of getting things done, to sometimes call C, and then later in the project convert that back to asm or FORTH.
The irony when writing a FORTH intepreter is that your FORTH code is relatively slow, so you may not end up writing very much FORTH.
When a few millisconds does not matter, and when words are rarely used, or are specific to a particular word and not widely shared, FORTH is fine.
It is a bad idea to write FORTH words that are frequently and widely used in a FORTH interpreter as the machine does decelerate tenfold when running them.
I view the FORTH as almost the script that controls the asm and C code.
The high level logic of an App should be written in FORTH, some of its key functions if they seem slow should be written in asm and added as primitive words.
As many of the base primitive words as possible should be written in asm.
The faster the base primitives are, the more FORTH you can reasonably write using them.
Since this is an interpreter, The less FORTH used when implementing FORTH the better the implementation will be at running FORTH.
This is not the case if your FORTH is an optimizing compiler, in that case the more of FORTH is written in FORTH, the better.
Honestly most implementations of FORTH I have seen, are not written in FORTH.
FORTH (like BASIC and LISP) is an interactive experience.
This interactivity is what makes it fun (in my experience.)
I started this in November 2021, with a reasonable recollection of how FORTH works, no experience with AARCH64 (although I do like ARM32) and a blank screen in Visual Code.
I have not based this on any particular FORTH design template, so it may have some original ideas, for good or ill, although it is essentially the same.
I wanted to get some practice in with the ARM64 so I wrote this in assembler, FORTH is allegedly written in FORTH normally, but hardly ever in practice.
First thing I did was get the outer interpreter to work, that is the interpreter that recognizes the difference between a word and a number, pushes a number to the stack and calls a word.
This gave me an interactive experience straight away, so I could test each new word I added.
I then added a bunch of standard FORTH words, and tested them.
The outer interpreter in FORTH is very simple, all it does is read words from the input, where a word is just something followed by a space, it then needs to find the word in a dictionary, check if it is a number or print an error.
The FORTH dictionary is a collection of WORDS, usually several lists of WORDS, to keep things simple I created a dictionary that was really just a fixed size array of word headers.
I decided initially to store everything in these headers, I created dictionary words that were 128 bytes in length, to give me room for (very) small compiled words. Later I added TOKENS space for words to store tokens, and the ALLOTMENT for data.
I decided to use tokens to look up words in the dictionary, a token is essentially an index into the dictionary.
The compiler is just like the interpreter but instead of recognizing and running words, it recognizes words, and stores the token for each word in the compiled word.
Right away this allows the compilation of straightline words like square
: sq DUP * ;
In order to see what the compiler was doing, I implemented SEE.
SEE sq
SEE WORD :4375017216 sq
0 :4370506944 ^TOKENS
8 :4370320908 PRIM RUN
16 : 0 ARGUMENT 2
24 : 0 PRIM COMP
32 : 0 Extra DATA 1
40 : 0 Extra DATA 2
48 :sq NAME
TOKEN COMPILED FAST
4370506944 : [ 1373] DUP
4370506946 : [ 7732] *
4370506948 : END OF LIST
In this implementation a compiled word is just a list of tokens.
The PRIM RUN for a compiled word, is the address of an interpreter that knows how to handle the list of tokens.
1373 happens to be the TOKEN for DUP at that moment.
` DUP NTH .
1373
Ok
The word NTH looks up a TOKEN given a words ADDRESS, I use backtick rather than ' because I use ' for text strings.
To begin with I compiled TOKENS right into the dictionary slot, later I allocated storage for these, and shrunk the dictionary elements to just be the current header and data elements, 64 bytes.
As SEE shows; a word in the dictionary has a PRIM RUN funtion, that is the pointer to the function that executes at run time.
This is what the outer interpreter calls, and what the compiler tokenises into high level words.
Every word also has a PRIM COMP pointer, these are primitive functions that are called by the compiler as it is compiling that word into a new definition.
The FORTH compiler is very simple and small and can be built up over time; because it calls lots of little helper words.
Although this implementation only uses native code primitive words in the compiler, the overall pattern is the same as other FORTH versions, words are expected to help compile themselves.
A computer only runs machine code, so something has to run the token 'compiled' code, everything a computer runs is machine code, and anything else is just data.
The function that runs the token code is the inner interpreter, given a list of TOKENS its only job is to find the machine code for each one, and then call it.
There are many ways to implement this, I settled on something simple:-
10: ; next token
.rept 16
LDRH W1, [X15, #2]!
CBZ W1, 90f
LSL W1, W1, #6
ADD X1, X1, X27
LDP X0, X2, [X1]
CBZ X2, 10b
BLR X2 ; with X0 as data and X1 as address
.endr
b 10b
X15 is the IP, interpreter pointer.
This inner most part of the inner interpreter turns tokens back into addresses and calls the code at the address.
I keep the start of the dictionary in X27, and this code just multiplies the TOKEN by the dictionary element size.
I decided to pass the data pointer and the address of the word in the dictionary to each word.
This allows words to look themselves up easilly, a word can see the data in its own dictionary slot and use that as it runs.
The essential simplicity of FORTH is that you can incrementally add and test the rest of the language by creating new words that compile themselves.
Note that the loop ends when it encounters 0, Zero is also the TOKEN for the word (EXIT) which is also the first word in the dictionary.
A major challenge with any programming language is that IF can not just be a function, it has to choose at runtime a path to take.
SEE IF
SEE WORD :4374851456 IF
0 : 0 ^TOKENS
8 :4370318884 PRIM RUN
16 : 0 ARGUMENT 2
24 :4370318892 PRIM COMP
32 : 0 Extra DATA 1
40 : 0 Extra DATA 2
48 :IF NAME
IF seen here has no tokens or data, it has a PRIM run and PRIM comp function.
At the OK prompt type IF.
IF
Error: use of some words not allowed outside of definitions.
Ok
What you see here is what the PRIM RUN function does, it lets you know that you can only use IF inside a colon definition.
IF also has a PRIM COMP, and that code runs when the compiler sees IF.
Example of using IF
: t1 NOECHO
BEGIN
KEY? IF
KEY DUP DUP EMIT CHAR = EMIT . 32 EMIT FLUSH
81 = IF RETERM EXIT THEN
THEN
100 MS
AGAIN
;
This code checks if a key is pressed, and if it is pressed, displays it.
There are two IF statements, one is nested inside the other.
In FORTH an IF statement looks like
f IF .. ELSE .. THEN
Where f means flag, this reads as IF the flag is true do this ELSE do that THEN carry on with the rest of the word.
SEE shows there is a lot of code
SEE t1
SEE WORD :4375034688 t1
0 :4370506952 ^TOKENS
8 :4370320908 PRIM RUN
16 : 0 ARGUMENT 2
24 : 0 PRIM COMP
32 : 0 Extra DATA 1
40 : 0 Extra DATA 2
48 :t1 NAME
TOKEN COMPILED FAST
4370506952 : [ 4015] NOECHO
4370506954 : [ 840] BEGIN
4370506956 : [ 3232] KEY?
4370506958 : [ 3] (IF)
4370506960 : [ 78] *
4370506962 : [ 3233] KEY
4370506964 : [ 1373] DUP
4370506966 : [ 1373] DUP
4370506968 : [ 1638] EMIT
4370506970 : [ 1] (LITS)
4370506972 : [ 61] *
4370506974 : [ 1638] EMIT
4370506976 : [ 7736] .
4370506978 : [ 1] (LITS)
4370506980 : [ 32] *
4370506982 : [ 1638] EMIT
4370506984 : [ 1930] FLUSH
4370506986 : [ 1] (LITS)
4370506988 : [ 81] *
4370506990 : [ 7751] =
4370506992 : [ 3] (IF)
4370506994 : [ 42] *
4370506996 : [ 5052] RETERM
4370506998 : [ 1639] EXIT
4370507000 : [ 5595] THEN
4370507002 : [ 5595] THEN
4370507004 : [ 1] (LITS)
4370507006 : [ 100] *
4370507008 : [ 3755] MS
4370507010 : [ 576] AGAIN
4370507012 : END OF LIST
What IF does at compile time is compile (IF) Token #3, followed by a number that is used to skip quickly to the matching THEN.
When the word runs (IF) checks the flag (top of stack), if it is not true it causes the instruction pointer (IP, X15) to skip to THEN.
So at compile time IF ELSE and THEN, create the logic in the word, at runtime, (IF) and (ELSE) executes the logic.
The THEN literally does nothing at all at runtime, at compile time THEN does all the work to fix up the branches for IF and ELSE.
Other than that THEN just helps us see where each IF ends.
IF essentially compiles a branch, that is based on conditionally changing the address of the next instruction to be run by the inner interpreter.
THEN works out the address for IF to use.
The code for (IF)
; if top of stack is zero branch
dzbranchz:
do_trace
dzbranchz_notrace:
LDR X1, [X16, #-8]!
CBNZ X1, 90f
; it is zero, branch forwards n tokens
80:
LDRH W0, [X15, #2] ; offset to endif
SUB W0, W0, #32 ;
SUB X0, X0, #2
ADD X15, X15, X0 ; change IP
RET
90:
ADD X15, X15, #2 ; skip offset
RET
(IF) Just checks the top of the stack and branches forward if it is zero (not true.)
It does this by changing the value of X15, which is the IP in the inner loop (above.)
(IF) reads the offset from the next token half word, and adds it to the IP, or just skips the token and continues.
There are a few other words defined that read the next token and use it as an argument.
These can be seen in the dictionary as the first 16 words, 0-15, not all are used.
; words that can take *inline* literals as arguments
makeword "(END)", dexitz, 0, 0 ; 0 - never runs
makeword "(LITS)", dlitz, dlitc, 0 ; 1
makeword "(LITL)", dlitlz, dlitlc, 0 ; 2
makeword "(IF)", dzbranchz, 0, 0 ; 3
makeword "(ELSE)", dbranchz, 0, 0 ; 4
makeword "(5)", 0, 0, 0 ; 5
makeword "(6)", 0, 0, 0 ; 6
makeword "(7)", 0, 0, 0 ; 7
makeword "(8)", 0, 0, 0 ; 8
makeword "(WHILE)", dwhilez, 0, 0 ; 9
makeword "($S)", dslitSz, 0, 0 ; 10
makeword "($L)", dslitLz, 0, 0 ; 11
makeword "(.')", dslitSzdot, 0, 0 ; 12
makeword "(LEAVE)", dleavez, 0, 0 ; 13
makeword "(14)", 0, 0, 0 ; 14
makeword "(15)", 0, 0, 0 ; 15
All the words enclosed in round brackets are used by the compiler, just like IF compiled in the (IF) token #3.
The first 16 tokens are reserved for words that take the next token after them as an argument.
All the round bracketed words are at a fixed position in the dictionary, the compiler, uses their unchanging token numbers.
The compiler itself has no idea what an IF statement, a DO .. LOOP or a BEGIN .. UNTIL loop actually are.
The compiler just knows that it needs to call any compile time functions when it finds them in a word, and the words all work together to take care of themselves.
What is the compiler?
The compiler is a loop that runs inside of the interpreter loop, it compiles in the tokens for any words with a run time action and immediately runs the words with compile time actions.
It also compiles in any literal values.
Literal values
When the compiler sees a number, it converts that into a literal value, which involves compiling in (LITS) or (LITL) depending on the size of the value, values that fit into a token (half word) slot are short, longer values are looked up in the longlits space.
A short literal
: lits 989 . ;
SEE lits
..
TOKEN COMPILED FAST
4370507016 : [ 1] (LITS)
4370507018 : [ 989] *
4370507020 : [ 7736] .
Here we can literally see the literal in the tokens.
A long literal (could be a large number, a string or a float)
Here we have a word that prints a float.
: litl 3.13459 f. ;
SEE litl
..
4370507026 : [ 2] (LITL)
4370507028 : [ 12] *: [4614240887977997050]
4370507030 : [ 1909] f.
4370507032 : END OF LIST
What is compiled into the word after (LITL) is just the index (12) in the LITERALS pool where this float is stored.
The same logic applies to large numbers, floats and strings etc.
Literals are stored in LITERALS.
12 LITERALS f.
3.13
That entry was created for the word above.
Compiling in literals is literally the only smart thing the compiler has to do by itself, the rest of the work is all defined in the words that work together to compile themselves.
There is no optimization by this compiler, the words that compile themselves can do some specialization, they can choose a more generic or a more specific function to fit the data they are working on, TO for example is a word that does this.
The compiler does nothing at all to look at expressions and implement them in an optimized way, it is as dumb as a rock.
It would be nice to later add an OPTIMIZE word, that takes a TOKEN compiled word and just translates it to the simplest possible machine code, that would also be a good way of learning some of the nitty gritty detail of the ARM instruction set.
FORTH words are stored in the dictionary.
Word names may be up to 15 characters in length.
FORTH names were originally only five characters in length, I thought eight might do, but later made the interpreter check both fields, which allows for 15 letters.
This is handy as I also decided to make words starting with _ private.
That just means they are hidden when listed.
The idea is to make words that only used to implement another word private.
You can also ALIAS words, which just means naming them with a different name.
List WORDS with WORDS, and look at the source code.
NAME >DATA2 >DATA1 >COMP >ARG2 >RUN
Move to fields within a words dictionary entry.
e.g. ` DUP >NAME $. prints the name of DUP, which is DUP
:
Defines a new high level word e.g.
: SQ DUP * ;
If the word exists : will refuse to change it.
::
Redefines an existing high level word
e.g.
:: SQ DUP * . ;
Will redefine an existing word created with : to have a new behaviour.
The new behaviour will apply to all existing words that used the redefined word.
The key point being the new word has the same TOKEN as the word that it redefined.
ADDS
7 ADDS 7+
Create a word 7+ that adds 7 to the value on the stack.
ALIAS
Names a word with an additional ALIAS name
e.g. ALIAS ten 10
ten is now 10.
Using ALIAS a single word (or number) is aliased with a new name. You can UNALIAS ten and ten will be undefined again.
ALIAS is just a magic trick in the interpeter that provides a new additional name for an existing word.
It does not create anything real in the dictionary and can not be used with a string literal for example.
CLRALIAS clears ALL ALIASes
Use .ALIAS to list all the aliases, they do not show up as WORDS in the dictionary.
A$
Short for APPEND buffer. Storage used when appending by the string builder.
ALLWORDS
Lists all words, user and compiler words.
ALLOT>
0 VARIABLE this_variable n ALLOT> this_variable
Adds n bytes of storage to a variable created earlier
ALLOT
0 VARIABLE room 200 ALLOT
Adds n bytes of storage to the LAST word created
ARRAY
n ARRAY myArray
Creates an ARRAY of size n 64bit words.
Sizes for ARRAY elements are defined by the name.
Array Creator word | Size |
---|---|
ARRAY | 64 bits |
WARRAY | 32 bits |
HWARRAY | 16 bits |
CARRAY | 8 bits |
ADDR
If you have the token number (16bits) for a word This calculates the address.
This is a similar function that is used by the inner interpreter.
ACCEPT
Gets a line from the user and interns it as a string, returning its address.
AGAIN
Last part of a never ending indefinite loop in a compiled word.
e.g. BEGIN ... AGAIN will repeat forever unless something invokes EXIT.
ABS
A maths function, applied to the top value on the stack.
AND
A logical functon applied to the top two values on the stack.
ALLWORDS
Lists all the public words, including words no one in their right mind should use.
**a++, a and a! **
A local word that increments 0 LOCALS by 1.
A simple and fast way for a word to create a counter, since a is set to zero when a word starts, and this adds one to it. A can be read with a.
The value is returned by a or 0 LOCALS.
a .. h exist as accessor words for 0 .. 7 LOCALS.
AT
Moves the cursor to a position on the terminal, great for video games, (from the 1970s.)
10 10 AT MSTR
BEGIN
The start of an indefinite loop in a compiled word.
Converts the token for a word to its address (see NTH)
B$
Short for string BUFFER.
A storage space used by STRINGs
BREAK
Sends a break signal, that should result in a low level debugger or a crash. Hazardous
CHAR
CHAR A .
Converts an ASCII letter to a number That prints 65.
CARRAY
Like ARRAY one byte wide. See ARRAY
CVALUES
Like VALUES one byte wide. See VALUES
C@
Reads one byte of memory from the address on the top of the stack. Potential hazard. Use VALUES instead.
C!
Stores one byte of memory. Potential hazard.
CLRALIAS
CLEARS all ALIAS
CONSTANT
3.14159265359 CONSTANT PI
An alias for VALUE that expresses a promise to never use TO.
See VALUE.
Nothing is immutable but lets pretend.
CPY ( src dest -- src1 dest1 )
COPY 8 bytes from src to dest and increments src and dest by 8.
Can be used to create a fast copying word.
Hazardous, only copy memory you own.
CCPY ( src dest -- src1 dest1 )
Short for CPY CPY copies 16 bytes increments src and dest by 16.
CREATE
Used to create a word header in the dictionary. Not often used. Used a lot in standard FORTH.
CR
Print a carriage return and line feed Moving the cursor down and to the left hand start of the next line.
DO
Part of a compiled definite loop.
DOWNDO
Part of a descending definite loop.
DUP
Duplicates the top value on the stack.
DDUP
DUP and then DUP again, like DUP DUP and not like 2DUP.
DROP
Drops the top value on the stack.
DDROP
DROP and then DROP again, the same as 2DROP, but I like it.
DEPTH
Returns the depth of the stack.
EXECUTE
Calls the words function, using the address at the top of the stack.
ELSE
Part of a compiled IF .. ELSE .. THEN control flow.
ENDIF
Another name for THEN
Part of a compiled IF .. ELSE .. ENDIF control flow.
EMIT
Prints the charachter that is the top value on the stack.
EXIT
Exits the current word, part of a compiled word. Also used to crash if you are in the interpreter, as you can not execute that.
The F section lots of floating point maths words.
FFIB
Calculates a FIB quickly, instead of being a benchmark.
FASTER
Used to make a word run FASTER, the default interpreter.
FLAT
Used to create LOCALS access words.
See LOCALS section.
FALSE
Not TRUE, 0 in particular.
FILLVALUES
Used to set all the elements in a VALUES to a value.
FILLARRAY
Used to set all the elements in an ARRAY to a value.
FILL
A dangerouse but fast memory block filling word. A potential hazard. Code was Copyrighted by ARM, Released as BSD licensed so at least no one will be sued when they accidentally write the same nice but still (I suspect) obvious code.
FREE
Hazardous word that frees memory given an allocated pointer as an argument.
Floating point
f<> f= f>=0 f<0 f<= f>= f< f> f. f+ f- f* f/ fsqrt fneg fabs s>f f>s
Mainly the ARM floating point instructions.
These use the same stack, there is no floating point stack, and no set of floating point stack words, I do not see the point in a separate stack for floats.
FFIB
A Machine code FIB for comparison.
FORGET
Forgets the LAST word created.
Only forgets a single word.
Handy if you just made a mistake and want to start a word over.
In standard FORTH, FORGET forgets all the words created after the word being forgotten, this is not the case in this implementation, only one word is removed by FORGET, and it is normally the LAST one you created.
The words header is removed, dynamic memory used by the word is reclaimed.
You can delete any specific word by making it LAST then forgetting it.
SELECTIT word
If you just want to change an existing word, consider redefining the word with ::
Caution
If you forget a word that other words use, they will fail, and even stranger things may happen if another word gets created in the slot it used to have.
FINAL^
The final word in the dictionary
FINDLIT
Find the literal
FILLVALUES
Fill a values
FILLARRAY
Fill an array
FILL
Fill a block of memory, hazardous.
FLUSH
Flush output to the terminal
HEAP
Allocates some MALLOCed memory and leaves the address in HEAP^
Useful to set up various essential areas during startup.
Generally useful if you need a large lump of memory.
Memory is not initialized, FILL can help with that.
HWARRAY
This creates an array of Half Words, 16 bit values. See ARRAY
HWVALUES
This creates a VALUES of Half Words, 16 bit values. See VALUES
HW!
Used to dangerously store half words in memory. A potential hazard.
HW@
Used to dangerously read half words from memory.
HW@IP
A word of limited use, that reads the token of the word under the instruction pointer, mainly useless as it typically reads itself. Hazardous
IP@
Returns the current value of the instruction pointer May be useful when debugging a word. Hazardous.
IP!
Sets the current value of the instruction pointer. May be useful when debugging a word. Hazardous. tricky.
IP+ Hazardous
IN
The file we are reading from, probably STDIN most of the time.
I
The index for the current LOOP in a compiled word.
This is an index into a relative position on the return stack.
I means the index of the current LOOP I am in.
If a LOOP is nested you can also use J to get the value of the parents LOOP.
And if you are nested again, you can use K to the value of the grandparents LOOP.
Its all relative.
IF
Part of f IF .. ELSE .. THEN in a compiled word.
: DoYou?
.' Do you like Forth '
' yes' ACCEPT $contains
IF .' Good ' ELSE .' So sorry ' THEN ;
INVERT
Inverts the bits in the value on the top of the stack.
IN
The input file
LAST
The last word created, this is the only word we can FORGET without saying Bye.
It is the word that ALLOT will try to add more data bytes to.
LOCALS
The locals array of VALUES.
When a word starts 0..7 LOCALS are all set to zero.
A LOCALS value may be stored with TO, e.g. ' A string ' TO 7 LOCALS or 3.1459 TO 6 LOCALS.
Read the value with just 7 LOCALS.
A FLAT word sees the LOCALS of its parent word.
The interpreter at level zero, also has its own LOCALS.
See LOCALS above.
LOOP
Part of the DO .. LOOP
e.g. 10 1 DO I . .' Hello' CR LOOP ;
Within a LOOP you can get the LOOP index using I.
You can EXIT from a word and also by implication from any LOOPs.
You can LEAVE a LOOP and continue where LOOP ends, but only once.
LIMITED
LIMITED word
Limits the steps a word can take, used to step through a word, while you attempt to understand how your own logic let you down again.
See the (not yet written) debugging section above.
LSP^
Only really useful during startup to point the locals stack pointer at some HEAP storage.
MOD
Maths
MS
Delay for ms
MSTR
Print the unicode monster.
NTH
Convert address to token number.
NIP
Stack operation
NOECHO
Disable terminal echo
OR
Logical operation
OVER
Stack operation
OVERSWAP
Short for OVER SWAP.
PI
A floating point constant
PAGE
Clears the terminal screen
PARAMS
Copy N params from the data stack to the LOCALS stack
e.g.
3 PARAMS
Will load the LOCALS a, b and c from the stack.
You can only take up to 8 params (Locals run from a..h)
PARAMS also checks that the stack is deep enough so this word can also help with error detection.
It is also a fast way to load up to 8 arguments into the locals when a word starts up.
PCHK
n PCHK
Stops with an error if we dont have the number of params needed on the stack.
A good check to make words safer.
PICK
A stack operation that picks the nth word from the stack.
RMARGIN
Terminal right margin
REPEAT
Part of BEGIN f WHILE .. REPEAT loop
RDEPTH
Depth of return stack
ROT
A stack operation
R>
A return stack operation
R@
A return stack operation
RP@
A return stack operation
RESET
Reset and clear the parameter and return stacks and reset the terminal.
RETERM
Return terminal to standard settings.
STRING
Create a name for a string, ' Hello ' STRING greeting
STRINGS
Create a string array, 10 STRINGS messages
SWAP
A stack operation
SHIFTSL
Creates a shifting left word
SHIFTSR
Creates a shifting right word
SPACES
Displays spaces
SPACE
Displays a space
SP@
A stack operation
SP
The stack pointer
SEE
SEE word
Displays what the compiler did to compile the word into tokens.
Shows details about a word.
SELECTIT
SELECTIT WORD
Makes this word the LAST word.
SELF^
Points to the running words dictionary slot.
SYSTEM
Issues a system command from a string.
' ls -l ' SYSTEM
Lists the files in the folder.
TO
Sets a value, e.g. 10 TO thing.
Also +TO adds to a value.
e.g. 10 +TO thing
TIMEIT
TIMEIT word, displays a words run time.
That is it displays how long a word takes to run.
TRACE
TRACE word
Sets the words interpreter to the TRACEABLE one.
Used with TRON and TROF.
.e.g TRACE WORD TRON WORD
Should display a trace of the word running.
TRUE
Not false, the same as -1
TFCOL
Changes the text colour, using terminal escape codes, colours start at 30.
This also sets the background colour and if a character flashes its zany.
TRACING?
Are we tracing, perhaps the reams of code flying past should give us a clue.
This word may not be around much longer..
TICKS
Ticks from the system timer
TIMESDO
Dumb and fast repeater, for a single word.
10 TIMESDO word
TPMS
Ticks per ms.
TPS
Ticks per second
TRON
Tracing on
TROFF
Tracing off
THEN
Ends the f .. IF ... ELSE ... THEN .. statment
TUCK
short for and faster than SWAP OVER
UPTIME
Time since the program started
UNALIAS
UNALIAS word
Will remove the alias name.
CLRALIAS removes all alias names.
UNTIL
Ends the BEGIN ... f UNTIL indefinite loop
VALUE
Create a VALUE, 10 VALUE ten
create a VALUES array, 10 VALUES myValues
VARIABLE
create a VARIABLE, 10 VARIABLE myThing
WORDS
Lists the words but not all of them.
There is a basic version of WORDS built in, so it is always available, it is redefined by the fancier optional colour version in forth.forth.
WARRAY
WVALUES
Create a VALUES array for word (32bit) length data
WLOCALS
A word (32bit) values view over LOCALs storage.
WHILE
Part of BEGIN .. f WHILE .. REPEAT loop
W!
32 bit word store
W@
32 bit word fetch
$empty?
Is the string empty
Begin / end building a string.
$=
Are two strings the same
$==
Are two strings content equal
$find ( sub str -- n )
find substring in larger string returns address where found.
' test ' ' this is a test of find' $find
prints: test of find
$compare
Is a string the same, less than or greater than another.
$contains ( sub str -- f )
if str contains sub true, else false.
$len
Find the length of a string.
$occurs ( sub str -- n )
How many times does sub occur in string.
$pos
Find the pos of char in string
$slice
Take a slice from a string
$''
The empty string, the same as 0.
$intern
Take BUFFER$ and intern it into string storage
$$
Access to string storage, not very useful, since it is sparse.
Assembler code can be fragile, a mistake in a new feature can trash essential registers and blow up some distant part of the program, that is nowhere near the new change.
It is important to test after every single minor change to the ASM code.