Asmj syntax

Last revised: 23-Nov-2003
  1. Notation

    Syntax is just a pattern that describes the strings of characters that we accept. For the sake of brevity, we take some shortcuts. For instance, some instructions require an operand, while others do not allow one; the operand is almost never optional. But our generic pattern for an operation line shows an optional operand, because, loosely speaking, an operand is "sometimes" required.

    We use a kind of Backus-Naur Form (BNF) to describe patterns. This lets us build bigger patterns from littler ones. A pattern definition begins with a word that we use as a name for the pattern, followed by the symbol "::=", which you could read as "is defined as". Following that are the elements that make up the pattern. Definitions can cover multiple lines, and are often spread out more than is really necessary just to make them more readable. A pattern ends just before a line that is either blank or begins another pattern definition (which you recognize because it contains the "::=" definition symbol).

    By way of example, we will use BNF to describe the BNF notation itself.
    	syntax ::= <definition> ...
    	definition ::= <word> "::=" <element>
    	element ::= <string>
    	         | <option>
    	         | <append>
    	         | <repeat>
    	         | <either>
    	         | <element_name>
    	         | <grouping>
    	string ::= '"' <anything_but_double_quote> ... '"'
    	         | "'" <anything_but_single_quote> ... "'"
    	option ::= "[" <element> "]"
    	append ::= <element> ( " " <element> ) ...
    	repeat ::= <element> "..."
    	either ::= <element> "|" <element>
    	element_name ::= "<" <word> ">"
    	grouping ::= "(" <element> ")"

  2. The Command Line

    Before running Asmj, your CLASSPATH variable should be set to tell java where to find the Asmj class files. Those class files are normally in a jar file such as "asmj-6809.jar", or in the two directories in which they were compiled. If you use bash, you would need something like one of these two, adjusted for wherever you did the installation and which processor you are using:

            export CLASSPATH="/usr/bill/asmj-6809.jar"
            export CLASSPATH="/usr/bill/asmj:/usr/bill/asmj/6809"

    Asmj has to be run by the Java interpreter, so the command starts with "java Asmj". Following that are the names of the assembly source code files and/or some option flags.

    	"java Asmj " <sourcefile.asm> " " <flags>

    It generates five output streams, which can mostly be directed to a file, to standard output, or turned off. By default, it generates a binary output file, and writes the listing and symbol table to standard output.

    The default name for the binary file is the same as the name of the last source file, but with a ".bin" extension instead of ".asm". (If that last source file did not end with ".asm", then the binary file default name has the same name as the last source file, but with ".bin" appended.) Of the five streams, only the binary cannot be directed to standard output.
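
    The naming rule above can be sketched in a few lines of Python. This is only an illustration, not Asmj's own code; the function name default_binary_name is hypothetical, and matching ".asm" case-sensitively is an assumption.

```python
def default_binary_name(last_source: str) -> str:
    """Derive the default binary file name from the last source file name."""
    suffix = ".asm"  # assumption: the extension is matched case-sensitively
    if last_source.endswith(suffix):
        # Replace the ".asm" extension with ".bin".
        return last_source[: -len(suffix)] + ".bin"
    # Any other name simply gets ".bin" appended.
    return last_source + ".bin"


print(default_binary_name("foo.asm"))  # foo.bin
print(default_binary_name("foo.s"))    # foo.s.bin
```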

    All of the flags that control output streams have the same format. First is a plus or minus sign, indicating that the stream is to be enabled or disabled, respectively. Next is a letter indicating which stream is being controlled. If the stream is being enabled, then the letter can be followed by an optional equals sign and then a filename. For example, "-l" disables the listing output; "+l" makes the listing go to standard output; and "+l=/tmp/listing" makes the listing go into the named file. Since an equals sign after the option letter is not treated as part of the filename, to specify any filename that actually begins with an equals sign you must put two equals signs there: the first is the optional part of the option flag, and the second begins the filename.

    Letter Stream
    b Binary
    s S-records
    l Listing
    e Error report
    t Symbol table
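
    As a rough sketch of how one output-stream flag could be taken apart, the Python below follows the description above. The names OutputFlag and parse_output_flag are hypothetical, and raising an error when a disabling flag carries a filename is an assumption about behavior that this manual does not spell out.

```python
from typing import NamedTuple, Optional

# Stream letters from the table above.
STREAMS = {"b": "binary", "s": "s-records", "l": "listing",
           "e": "error report", "t": "symbol table"}


class OutputFlag(NamedTuple):
    stream: str
    enabled: bool
    filename: Optional[str]  # None: no filename was given


def parse_output_flag(flag: str) -> Optional[OutputFlag]:
    """Return the parsed flag, or None if the word is not an output flag."""
    if len(flag) < 2 or flag[0] not in "+-" or flag[1] not in STREAMS:
        return None
    stream = STREAMS[flag[1]]
    rest = flag[2:]
    if flag[0] == "-":
        if rest:
            # Assumption: a filename on a disabling flag is rejected.
            raise ValueError("a disabled stream takes no filename: " + flag)
        return OutputFlag(stream, False, None)
    if rest.startswith("="):
        rest = rest[1:]  # the optional "=" is not part of the filename
    return OutputFlag(stream, True, rest or None)


print(parse_output_flag("+l=/tmp/listing"))
print(parse_output_flag("+l==odd"))  # the filename is "=odd"
```

    Anything that is not recognized (a source file name, or a "D" or "f" flag) comes back as None, so a caller could try the other flag forms next.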

    Symbol definitions can be given on the command line, and are put into the symbol table just as if by an "equate" pseudo-op. The format for this option flag is as follows. First is the flag "-D", followed by an optional equals sign, followed by the definition. The definition is a symbol name, optionally followed by an equals sign and a value. For example, "-Dfoo" defines the symbol foo, but specifies no value for it (so it will be given a value of zero); "-Dfoo=0x100" defines the symbol foo to have the value 256 (hexadecimal 100). Values can be given using the same syntax as for decimal or hexadecimal numeric expressions in the source language.
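
    The definition flag can be sketched the same way. The Python below is illustrative only: parse_number and parse_definition are hypothetical names, and only the plain decimal and "0x" hexadecimal forms are handled.

```python
from typing import Optional, Tuple


def parse_number(text: str) -> int:
    """Decimal digits, or hexadecimal digits after a "0x" prefix."""
    if text.startswith("0x"):
        return int(text[2:], 16)
    return int(text, 10)


def parse_definition(flag: str) -> Optional[Tuple[str, int]]:
    """Return (symbol, value), or None if the word is not a "D" flag."""
    if len(flag) < 3 or flag[0] not in "+-" or flag[1] != "D":
        return None
    body = flag[2:]
    if body.startswith("="):
        body = body[1:]  # optional "=" between the flag and the definition
    name, separator, value = body.partition("=")
    if not name:
        return None
    # A symbol given without a value is defined as zero.
    return (name, parse_number(value) if separator else 0)


print(parse_definition("-Dfoo"))        # ('foo', 0)
print(parse_definition("-Dfoo=0x100"))  # ('foo', 256)
```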

    Finally, the source file may be preceded by a "-f=" flag, which indicates that it is a filename even if it might have been mistaken for one of the other options. For example, "-f=-b=/tmp/foo" specifies that the filename is "-b=/tmp/foo", which would have been treated as a specifier for the binary stream had it not been preceded by the "-f".

    The "D" and "f" flags may be preceded by either a minus or a plus, which presently mean the same thing. It is recommended that the plus sign be used, since it is more in keeping with the semantics of the option.

    Option flags and input files can be given in any order. The entire command line is parsed before any input files are read, so the order is irrelevant except that input files are read in the order given. In particular, symbol definitions are all entered into the symbol table before the first source file is read.

    In the command, there must be spaces between words. These were omitted from the pattern for brevity.

    	output_flag ::= "+" ("b"|"s"|"l"|"t"|"e") [["="] <filename>]
    	              | "-" ("b"|"s"|"l"|"t"|"e")
    	definition ::= ("+"|"-") "D" ["="] <symbol> ["=" <value>]
    	source ::= [ ("-"|"+") "f" ["="] ] <filename>
    	flags ::= [<output_flag>|<definition>|<source>]...
    	command ::= "java" "Asmj" <flags> <source> <flags>


    	# use defaults for all output streams
    	java Asmj foo.asm
    	# write no binary, put the listing and s-records into files
    	java Asmj -b +l=foo.lis +s=foo.s19 foo.asm
    	# write no binary, send errors to standard output, no other outputs
    	java Asmj -b -l -s -t +e foo.asm
    	# use defaults for all output streams, define "precision" to be 3
    	java Asmj +Dprecision=3 foo.asm

  3. The Input File

    Asmj expects its input file to contain assembly language source code. Each line of the input file holds one operation, holds a comment, or is blank (all white space or zero length). This manual covers the syntax that is common to all supported processors (currently the 6800, 6809, and 8080). Processor-specific details are available in other files.

    1. Comment lines

      Comment lines are discarded by the assembler, but are used by the programmer to leave notes for humans to read. Blank lines are considered as comment lines.

      The syntax for comments differs between families of processors; see the documentation for the particular family that you are interested in.

    2. Operation lines

      Each operation line consists of a label word beginning with the very first character of the line, followed by one or more white-space characters, followed by the operation name, followed by one or more white-space characters, followed by the instruction's operand, followed by one or more white-space characters and a comment. The label may almost always be omitted (except with the "equ" pseudo-op), in which case the line must begin with a white-space character. The trailing comment (and the spaces immediately preceding it) is ignored if present, and may always be omitted. For some instructions, the operand may also be omitted.

      	s ::= <spaces>
      	opline ::= [<label>] <s> <op> [<s> <operand>] [<s> <comment>]

      A label can consist only of letters, digits, and underscores, and cannot begin with a digit. Labels are case-sensitive. Operation names are all made of letters and digits, and are not case-sensitive. An operation can be either a machine instruction or a "pseudo-op". The machine instructions can be grouped into a small handful of sets, where all of the instructions in a set share the same syntax. The pseudo-ops are more varied. An operand is typically made of a number with some other stuff telling what the number means. This is explained more under "Operand in memory" below. For now, just note that the number can be given in several forms.

      1. Numeric expressions

        A number can be specified in decimal, hexadecimal, ASCII, or as a symbol which was defined elsewhere to have a numeric value; or it can be a mathematical formula combining these elements.

        1. Decimal numbers

          A number is specified in decimal as a simple sequence of digits, exactly as you would expect. For instance, seventy-eight is 78 . Nothing complicated here.

          	digit ::= "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
          	decimal_number ::= <digit>...

        2. Hexadecimal numbers

          A number is specified in hexadecimal as a "0x" prefix, followed by a sequence of hexadecimal digits. A hexadecimal digit is a decimal digit or a letter from "a" to "f". (This is not case-sensitive; you could use upper-case "A" to "F" if you like.) For instance, seventy-eight is 0x4E .

          	hex_digit ::= <digit>|"A"|"B"|"C"|"D"|"E"|"F"
          	hex_number ::= "0x" <hex_digit>...

        3. ASCII strings

          A string is just a sequence of characters between matching quotes. If single-quotes are used to delimit the string, then it may contain any number of unescaped double-quotes, and vice-versa. But whichever kind of quote is used to delimit the string, that kind can occur in the string only in escaped form (see below) or else it would be mistaken for the quote that marks the end of the string. In the definition below, we gloss over the fact that the delimiting quotes must match, and that they can occur within the string only if escaped. But you know the truth.

          	quote ::= "'" | '"'
          	escape ::= <backslash> <anyCharacter>
          	character ::= <nonQuote_nonBackslash> | <escape>
          	string ::= <quote> <character>... <quote>

          Strings are used as such in a few pseudo-ops, such as the filename when including other assembly source files, or when creating constant text in memory. But strings can also be treated as numbers. Each character of the string forms one byte of the number. If the string is longer than two characters, all of them after the second are discarded whenever the string is used as a number. For instance, seventy-eight could be represented as 'N', or as '\0N', or as '\0Nthese extra bytes are discarded when making a number'.

          Between the quote-marks, the backslash character is treated specially: the backslash and the following character both are used to represent a single byte. In most cases, the byte represented is just the character that followed the backslash. But if the character after the backslash was the lower-case letter "n", then those two characters together ("\n") represent the byte for the new-line character (ASCII value 10). Other special characters can be represented this way too. Representing a character this way is called an escape, because it is a way of letting the character escape from its normal meaning. The following table shows them all.

          code value character name
          '\0' 0 null
          '\a' 7 alarm/bell
          '\b' 8 backspace
          '\f' 12 form feed
          '\n' 10 newline/linefeed
          '\r' 13 carriage return
          '\t' 9 tab
          '\v' 11 vertical tab
          '\'' 39 single quote
          '\"' 34 double quote
          '\\' 92 backslash

          In situations where a limited number of bytes are expected, such as when loading a number into a register, the same rules are followed as for any other number: the least significant byte(s) of the number are used, and the rest are discarded. This produces an odd effect: when a string of two or more characters is used as the data to load into a one-byte register, the second byte of the string is the one that is used. Here is why. Whenever a string is treated as a number, its first two characters are taken. But when a number is put into a one-byte register, its least significant byte is used. The least significant byte of the first two bytes of a string is the second byte.
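
          The conversion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Asmj's implementation: the function names are hypothetical, the argument is the text between the quotes, and a lone backslash at the very end is simply kept as a backslash.

```python
from typing import List

# Escape letters from the table above; any other escaped character
# stands for itself (for example a quote or a backslash).
ESCAPES = {"0": 0, "a": 7, "b": 8, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}


def string_bytes(body: str) -> List[int]:
    """Decode the text between the quotes into byte values."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            out.append(ESCAPES.get(nxt, ord(nxt)))
            i += 2
        else:
            out.append(ord(ch))
            i += 1
    return out


def string_as_number(body: str) -> int:
    """Use the first two bytes; the first is the more significant one."""
    value = 0
    for byte in string_bytes(body)[:2]:
        value = value * 256 + byte
    return value


def low_byte(value: int) -> int:
    """What a one-byte register receives: the least significant byte."""
    return value & 0xFF


print(string_as_number("N"))                     # 78
print(string_as_number(r"\0N"))                  # 78
print(chr(low_byte(string_as_number("ABCD"))))   # B
```

          The last line shows the odd effect: of the string "ABCD", only "AB" survives as a number, and a one-byte register then receives "B".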

        4. Processor-specific syntax

          Additional syntax for numeric constants is available for each family of processors, according to the tradition of their manufacturers. See the syntax for each specific processor for more details.

        5. Symbols

          A symbol can be defined by the "equ" pseudo-op (see below) to have a numeric value, or can be defined implicitly by being used as a label on any other operation. A symbol begins with a letter, and is followed by any sequence of letters, digits, and underscores. Any such word may be used as a symbol, but note that in places where a register name might be expected, any word that matches a register name will be treated as a register name, and not recognized as a symbol. So if you define a symbol to have the same name as a register, you will not be able to reference your symbol in those places. To be safe, avoid using register names as symbols. Symbol names are case-sensitive. Using a symbol definition, we may define the word "length" to have the value seventy-eight, and then we could use that word anyplace where we might have used 78 or 0x4E or 'N' .

          The asterisk is a special pre-defined symbol. It normally represents the address of the byte immediately following the current instruction, which is normally the address of the next instruction.

          	symbol ::= "*" | <letter> (<letter>|<digit>|"_")...

          If the address of the next instruction depends on this instruction's argument value, such as for the org pseudo-op, then using "*" creates a circular dependency: the value of "*" depends on the value of the current instruction's argument, but that value depends on the value of "*". Another pseudo-op, 'rmb' (called 'ds' in 8080-land), produces the same problem. The solution is to give "*" a special meaning in those situations: it means the address of the current instruction, rather than the next one. This special meaning allows at least two potentially-useful constructs:

          Instruction Meaning
          rmb *%4 reserve up to the nearest word-aligned address
          org *+N leave N bytes out of the object image

          Note that the decision of whether to use 6800 direct or extended addressing is not such a special situation. Although the value of the argument can have some impact on the instruction size, we do not have the same kind of circular dependency, because the difference in code size is just one byte; if it is not totally clear that direct addressing is possible, extended addressing will be used. Since using "*" in the argument of such an instruction makes it unclear whether direct addressing could work, the decision would be to use extended addressing, which then determines the instruction size and the value of "*".

        6. Formulas

          A number can be represented by a formula made of the preceding elements, the mathematical operations, and parentheses. For instance, given that we had defined the word length to mean three, we could specify seventy-eight by length*2+'B'+('m'-'a')/2 . Most of the operators of the "C" programming language are supported, listed below in order of decreasing precedence:

          operator meaning
          ! boolean negation
          ~ bitwise NOT; 1's complement
          - integer negation
          * multiplication
          / division
          % modulus; remainder after division
          + addition
          - subtraction
          << shift left
          >> shift right
          <= less than or equal
          < less than
          >= greater than or equal
          > greater than
          == equal
          != not equal
          & bitwise AND
          ^ bitwise XOR
          | bitwise OR

          The syntax for formulas is simple, since it is independent of precedence.

          	term ::= <decimal_number> | <hex_number> | <string> | <symbol> | "(" <formula> ")"
          	formula ::= <term> | (<formula> <op> <formula>)
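
          The example formula can be checked by hand or with a few lines of Python. In this sketch, ord() stands in for the value of a one-character string, "//" stands in for integer division, and the variable named length plays the role of the symbol defined as three.

```python
length = 3  # the symbol "length", assumed to be defined as three

# length*2+'B'+('m'-'a')/2
value = length * 2 + ord("B") + (ord("m") - ord("a")) // 2

# 3*2 = 6, 'B' = 66, ('m'-'a')/2 = (109-97)/2 = 6
print(value)  # 78
```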

      2. Machine Instructions

        This section is processor-dependent; see the documentation for the particular processor that you are interested in:

      3. Pseudo-ops

        Some of these vary for the different families of processors; see the documentation for the particular family of processors that you are interested in:

        Some are the same for all families of processor; these follow.

        1. end

          This marks the end of the source file, and optionally tells the address at which program execution should start. If any operand is given, it is a numeric expression for the starting address. Text in the file after this line is not read by the assembler.

        2. equ

          This is used to "equate" a symbol with a value - to define it. Its operand is evaluated, and the line's label is entered into the symbol table with that value. This is one of very few operations that require a label.

        3. include

          The "include" pseudo-op allows for reading from another file as if it were part of the specified source file. The operand is the filename in quotes. Such inclusions can be nested; the first included file can include another, and so on. Relative pathnames in an "include" are relative to the directory of the source file that contains the include. That is, if "/usr/bill/foo.asm" contains 'include "lib/bar.asm"', the actual file included is "/usr/bill/lib/bar.asm".

          Beware of any file that "include"s itself! (Can you say "occurs check"? How about "unbounded recursion"?)
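
          The rule for relative pathnames in an "include" can be sketched with Python's posixpath module. The function name resolve_include is hypothetical, and passing absolute pathnames through unchanged is an assumption.

```python
import posixpath


def resolve_include(including_file: str, operand: str) -> str:
    """Resolve an include operand against the file that contains it."""
    if posixpath.isabs(operand):
        return operand  # assumption: absolute names are used as given
    base = posixpath.dirname(including_file)
    return posixpath.normpath(posixpath.join(base, operand))


print(resolve_include("/usr/bill/foo.asm", "lib/bar.asm"))
# /usr/bill/lib/bar.asm
```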

        4. org

          The origin is the address into which subsequent operations should be put. Its argument is a numeric expression giving that address. If no "org" is given, the starting address is zero.

          Because the value of the expression determines the placement in memory of the following code, and therefore the value of the symbols defined there, the expression cannot contain forward references to those symbols; all symbols used in the expression must have values defined before the 'org' itself.

        5. macro...endm

          These pseudo-ops are used to define a "macro", which is a new mnemonic representing a series of instructions. The label on the macro line is the new mnemonic. All lines between the macro line and the endm line are scooped up together and saved for later use. Anyplace later in the program where that new mnemonic is used, it is a "call" to the macro, and all of those lines are substituted in place of the calling line.

          To add flexibility, macros are allowed to take arguments. In the macro definition, lines may contain placeholders for arguments, which consist of an ampersand followed by a number or one of a few special characters. In the call to the macro, the new mnemonic can be followed by a comma-separated list of arguments, which are substituted for the numbered placeholders in the lines that replace the call.

          The calling line consists of an optional label, spaces, the macro name, more spaces, the argument list, and then either the end of the line or some spaces followed by anything. The argument list cannot contain any white space.

          	macroDefinitionHeadingLine ::= <macroName> <spaces> "macro"
          	macroCall ::= [<label>] <spaces> <macroName> <spaces> <argList>
          	arg ::= <nonWhiteSpace_nonComma>...
          	argList ::= <arg> [ "," <arg> ]...

          The argument list is treated as raw text; the arguments need not meet any syntactic restrictions, and can contain unbalanced parentheses or quotes, illegal expressions, and so on. Of course, the code resulting from the macro call will be assembled, and at that time it must be syntactically correct to succeed. But the arguments can be embedded in the line in any way that makes the result come out right; they need not make any kind of sense outside of those lines. For instance, the following two snippets of code produce the same object code:

          length equ 3
          width equ 4
          fnord macro
              add&1 length&2)
              endm
              fnord a,*(width+7

              adda length*(width+7)

          References to arguments that were not supplied are replaced by the empty string, which is the same as if the placeholder was silently deleted. So if the argument list is too short, the result will have some missing parts. (This behavior can be exploited intentionally, of course.) Similarly, no check is made that all of the supplied arguments are used; unused arguments are simply ignored.

          Argument number zero is special; it is the label from the calling line. An ampersand followed by a comma is always discarded, but can be useful if you want a number to immediately follow a placeholder; without something extra in there, the number would be read as part of the placeholder itself. An ampersand followed by a pound-sign expands to the number of arguments actually given on the calling line. An ampersand followed by an asterisk expands to the entire comma-separated list of arguments as given on the calling line. An ampersand followed by an at-sign produces a number that is unique to this macro invocation. (This is useful if the macro needs to define its own labels, but must generate different labels each time it is invoked to avoid production of conflicting definitions.)

          Finally, an ampersand followed by any other character represents just that character. So, if you want the ampersand character to come out in the result of a macro call, it must be represented by two consecutive ampersands in the macro definition. This is particularly awkward if the macro contains a formula involving the logical AND operator "&&", which must now be represented as four consecutive ampersands.

          symbol expands to
          &, nothing
          &0 the label from the calling line
          &@ a unique number
          &# the number of arguments
          &* the whole argument list

          So the following code snippets produce the same result:

          fnord macro
              fcc "label=&0"
              fcc "unique_id=&@"
              fcc "nothing=&,"
              fcc "num_args=&#"
              fcc "all_args=&*"
              fcc "this && that"
              fcc "20th_arg=&20"
              fcc "2nd_arg+0=&2&,0"
              endm
          xyz fnord AB,CD,EF,GH

              fcc "label=xyz"
              fcc "unique_id=1"
              fcc "nothing="
              fcc "num_args=4"
              fcc "all_args=AB,CD,EF,GH"
              fcc "this & that"
              fcc "20th_arg="
              fcc "2nd_arg+0=CD0"
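
          The placeholder rules can be modelled with a short Python sketch. It is not Asmj's code; expand_line and its parameters are hypothetical names, and the unique number is simply passed in by the caller.

```python
import re
from typing import List

# An ampersand followed by a run of digits, or by any single character.
PLACEHOLDER = re.compile(r"&(?:([0-9]+)|(.))")


def expand_line(line: str, label: str, args: List[str], unique: int) -> str:
    """Substitute macro placeholders in one line of a macro body."""

    def substitute(match):
        number, other = match.group(1), match.group(2)
        if number is not None:
            n = int(number)
            if n == 0:
                return label  # &0 is the label from the calling line
            # Arguments that were not supplied become the empty string.
            return args[n - 1] if n <= len(args) else ""
        if other == ",":
            return ""               # &, expands to nothing
        if other == "#":
            return str(len(args))   # &# is the number of arguments
        if other == "*":
            return ",".join(args)   # &* is the whole argument list
        if other == "@":
            return str(unique)      # &@ is unique to this invocation
        return other                # e.g. && gives a single ampersand

    return PLACEHOLDER.sub(substitute, line)


args = ["AB", "CD", "EF", "GH"]
print(expand_line('fcc "2nd_arg+0=&2&,0"', "xyz", args, 1))
# fcc "2nd_arg+0=CD0"
```

          Because the arguments are raw text, the same function reproduces the earlier example: the line "add&1 length&2)" with the arguments "a" and "*(width+7" becomes "adda length*(width+7)".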

          Parsing of macro definitions has higher priority than conditional compilation. When a macro definition is begun, all following lines are pulled into the definition until the macro-end is found. Those lines can contain conditionals, syntax errors, or anything else; the content is not noticed until the macro is called later. So a macro definition can contain the beginning of a conditional without its end, or vice-versa; but there is no way to specify that a conditional contains just the beginning or end of a macro definition.

          Macros can be recursive, using conditional compilation to prevent infinite recursion. The depth of macro expansion is limited to 65536.

          left macro
              if &1>0
              left &1-1
              endif
              endm
              left 7

          Because the expansion of recursive macros includes a lot of uninteresting lines associated with the conditional, the output listing of a macro call will show only those lines that actually generate code, and those lines that give rise to error messages.

        6. if...elseif...else...endif

          These pseudo-ops allow for conditional compilation - skipping over sections of the source code depending on the values of symbols and such. For instance, you might need to generate extra code to handle the case that the symbol "precision" has a value greater than 1. Or you might want to include some lines of code only if the symbol "debug" is defined. There are actually five different forms of the "if" part of the conditional; each will be explained below. All of them can have an "else", and all end with an "endif" pseudo-op.

          The "if" conditional allows testing of an arbitrary numeric expression to make its decision. Its argument is a formula, as described in the section under "numeric expressions". The formula is evaluated, and the result is considered to be true if it is not zero. The "if" can have any number of "elseif"s, each of which has its own condition just like the "if" did.

                  if ::= [<label>] <spaces> "if" <spaces> <formula>
                  elseif ::= <spaces> "elseif" <spaces> <formula>

          Once again, the following code snippets produce the same result:

          howmany macro
              if &1<=0
              fcc "&1 is none"
              elseif &1<=4
              fcc "&1 is some"
              elseif &1<=10
              fcc "&1 is lots"
              else
              fcc "&1 is too many"
              endif
              endm
              howmany 0
              howmany 2
              howmany 5
              howmany 99

              fcc "0 is none"
              fcc "2 is some"
              fcc "5 is lots"
              fcc "99 is too many"
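
          The way an "if" chain picks its branch can be mimicked in Python. In this sketch each condition has already been evaluated to a number, nonzero counts as true, and the first true branch wins. The function names are hypothetical, and treating a comparison as yielding 1 or 0 (as in C) is an assumption.

```python
from typing import List, Tuple


def first_true_branch(branches: List[Tuple[int, str]], otherwise: str) -> str:
    """Return the text of the first branch whose condition is nonzero."""
    for condition, text in branches:
        if condition != 0:
            return text
    return otherwise  # the "else" part


def howmany(n: int) -> str:
    text = first_true_branch(
        [(int(n <= 0), "none"), (int(n <= 4), "some"), (int(n <= 10), "lots")],
        "too many",
    )
    return f"{n} is {text}"


for n in (0, 2, 5, 99):
    print(howmany(n))
```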

          The "ifdef" and "ifndef" conditionals test whether or not a specific symbol has been defined. The argument is just a symbol name. The "ifndef" is the negation of "ifdef" - each is true when the other would have been false. These can each have an "else", but no "elseif".

                  ifdef ::= [<label>] <spaces> ("ifdef"|"ifndef") <spaces> <symbol>

          The "ifeq" and "ifneq" conditionals test whether two strings are identical. These can each have an "else", but no "elseif". Neither of the two argument strings can contain any white spaces or commas, because they are separated by a comma and ended by a white space character. Note that the strings are merely tested for equality; they are not evaluated as formulas or symbols. To see the difference, consider the statement "ifeq 1+1,2". The first argument is a string of three characters, the second is a string of only one character; the two strings are not equal to each other.

          	ifeq ::= [<label>] <spaces> ("ifeq"|"ifneq") <spaces> <args>
          	arg ::= <nonWhiteSpace_nonComma>...
          	args ::= <arg> "," <arg>
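
          The difference between comparing text and comparing values can be seen in a two-line Python sketch. The function name ifeq_is_true is hypothetical; its argument is the operand exactly as written on the line.

```python
def ifeq_is_true(operand: str) -> bool:
    """Compare the two comma-separated arguments as raw text."""
    first, second = operand.split(",", 1)
    return first == second  # nothing is evaluated


print(ifeq_is_true("1+1,2"))      # False: "1+1" is not the text "2"
print(ifeq_is_true("left,left"))  # True
```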

          The "ifeq" pseudo-op is mostly useful within a macro, in which one or both of the strings contain (or consist of) placeholders for macro arguments. (Otherwise the two strings would both be visible on the line, and the programmer would have noticed their equality and not needed a compile-time conditional to decide that.)

          Finally, note that "if", "elseif", "ifdef", "ifndef", "ifeq", "ifneq", "else", and "endif" are all pseudo-ops, and each must appear on its own line, like any other operation.

        7. error

          The error pseudo-op forces the assembler to generate an error message of your choosing. This is useful if you detect a problem by using conditional compilation, to make it clear that something is wrong, when it might not otherwise be obvious that there is a problem, or exactly what the problem is.

        8. exitm

          The exitm pseudo-op ends a macro expansion early. It is a convenience; you could always get the same effect with some combination of conditionals. It is typically used within a conditional, with the error pseudo-op. For instance, you may check that the arguments to a macro are reasonable using 'if'. If it turns out that the argument values would cause illegal code to be generated, you could put in an exitm to prevent that. This makes the output cleaner, making it easier for the human to figure out where the real problem is, especially if used in conjunction with the error pseudo-op described above.

          shift macro
              ifneq &1,left
              ifneq &1,right
              error "shift left or right, not &1" 
              ifeq &1,left