|
|
|
|||||||||||||||
| created 01/07/09 last update 06/07/09 | author: Claude Baumann | |||||||||||||||
|
1. Introduction May be the most interesting and intriguing class of computational programs is composed of the set of recursive functions and procedures. Essentially a recursive routine can be depicted as a program or a part of a program that calls itself in order to find a solution to a certain problem. Recursive programming presents the intrinsic advantage that complex questions can be answered by relatively short self-referencing sequences. The Tower of Hanoi illustrates quite well, how easily and elegantly a problem can be solved by the means of the recursion. It also demonstrates the limits of the approach in terms of potentially exponential growth of the computation time. This page outlines a recursive solution to a problem that appears in compiling and assembling programs to micro-controller machine code. Modern computer languages that also deserve the specific area of embedded applications, present a certain portability, although special commands and instructions that apply to peripherical devices like I/O ports are not standardized at all. It appears that almost every family of micro-controller devices has its own higher language specifications. Things become very complicated, if a software-program that has been developed at high costs for a certain device shall be ported to another one. Generally the programmer has to restart from scratch in an unaccustomed programming environment, at best, being able to re-use at least some of the old libraries. And he may not even be certain that the new compiler will translate the source code to machine code with the same functionality. An intense debugging work then starts to make sure that the program behaves correctly on the new device. Every software engineer knows that the debugging process may be the most annoying and expensive. One particular problem consists in the necessity of a valuable mathematical kernel that is adapted to the timing and precision requirements of the embedded application. Let's assume that a development team is charged to produce a device that should do the real-time DFT, extract the first harmonic and react on this. The hardware section has worked out the proto-board and it's now the work of the software section to realize the related program. Two months later the main software genius falls ill and nobody is able to replace him on the given platform. But, there is another genius normally working with another family of micro-controllers, who then asks to restart the hardware development with the replacement chip. (I think this is the kind of story that should not happen, but that takes place in reality.) But, what would happen, if the software was completely independent of the underlying hardware? In that case, the proxy engineer would just change the target of the cross-compiler, and no hardware re-soldering would be necessary. Although it is a wide-spread wish, that there should be such an universally portable software, it will take some time, before it will really exist. Probably the most delicate issue in the development of this software is the necessity for each micro-controller of a impressivel large native library that will be compatible with the requirements of the compiler. The following lines try to partly answer the question, if it is possible to design a multi-target cross-compiler system and keep the relevant native library as small as the instruction set of the micro-controller. The answer obviously is: Yes, we can! (Nota bene: This page is based on an unpublished study program called "ULTIMATE ROBOLAB II", that has been written in LabVIEW by Claude Baumann in 2007.) |
||||||||||||||||
|
2. With an example: add two integers on different micro-controllers In order to dig deeper into the question, we want to present an example, where we show some Assembler code for the Renesas H8/3292 (the heart of the LEGO RCX module) and the Microchip PIC 16F88. Although both are 8bit micro-controller devices, they do not resemble at all, neither in their functionality, nor in the instruction set. Especially the integer addition differs, depending on the data size. (Nota bene: We will discuss the following on the base of generated Assembler code that we will call object code here, instead of directly comparing machine code that would not make any sense here.) Colour-code: pure data movement arithmetic/logic operation mixed data movement and operation Note: The suffix _L or _H designates the low part or the high part. For an 8-bit variable the attribute _L or _H doesn't make any sense, if it is not possible to also consider the lower or higher nibble. In the case of the 16-bit variable X16 for instance, X16_L designates the lower byte. X32_H stands for the high word of a 32-bit value, but X32_HH is the most significant byte. (A word is considered as a composite of two bytes. A double word is made of two words and four bytes.)
Important differences are the facts that the PIC allows directly adding data without carry into memory, whereas the H8 has two built-in functions "add with carry" and "simple add without carry" that are operated on Arithmetic and Logical Unit (ALU) registers alone. Besides this, the H8 allows a limited number of 16bit operations, although being an 8bit device. The PIC must emulate the "add with carry" function. The H8 has a small list of registers, but the 16F88, like most PICs, only has a single accumulator. (However, because the on-chip data files all are valid registers than can be accessed by many other instructions, they may also be considered as regular registers.) The question now is, how we can build a code generator that will be able to produce these 6 code samples with minimal effort, by only knowing the processor instruction sets and other processor characteristics. This question can be split into two basic queries: first, how can we generate code for different data sizes. And second, how can we respect the settings and specifications of two (or more different) processor types. 2.1. A divide and conquer generalization of the addition The addition is known as the simplest mathematical operation. Counting beyond "1, 2, many" is one of the greatest achievements of early humanity. But, operating an addition is much more than counting, because it involves the invention of the numerical base. We have no real idea, how young the rules for summations are, and how extraordinary the fact is that most of us can operate additions mentally. We apply the algorithm that has been spread in Europe through Adam Ries and that is originated from al-Khwa-rizmi (whose name etymologically is is the root for the word "algorithm" and who himself learned from India and old Greece). If we want to add 127 to 34, we might count 34, 35, 36, 37... 159, 160, 161, while repeating 127 times the simple "increment by one" function. Of course this inefficient method would require much time and leave us frustrated, if the numbers to operate become really big. Thanks to al-Khwa-rizmi's genius we know that we can do this with all the ease, if we consider the numbers in base 10. Obviously our representation of 34 and 127 already is exactly the good form that we need for the algorithm that can be summarized:
With this method the number of operations has drastically decreased. We use this triviality to show that for the application of the addition we:
We can do this operation in any base. Computers practically use base 2. We can easily prove that a simple addition of two base2 digits, called bits, can be described with the basic logical XOR-operation that yield the digit result and the logical AND-operation that yields the carry flag:
A device that is able to execute this operation is called the half-adder, because it does not include the previous carry bit. The next digit addition would engage two successive half adders and a supplementary OR function for the carry bits:
There are other ways to logically define the full adder. But, the exploration into deeper logic is not the subject of this article. We just want to illustrate the difference between the simple addition and the addition with carry. Let's suppose that an imaginary processor is capable of operating the full 8-bit addition, that we will denote ADDX8 and that executes the complete functionality of adding 8 bits with carry and also returns the new carry bit as the important complementary output. (Note that we need only a single carry flag, because it is a volatile information that is only relevant for the current and the next operation.) Let's also assume that the producer of the processor didn't think of including the half-adder, because he thinks that it is part of the full-adder anyway. We admit that the processor has at least two ALU registers X8 and Y8. Finally, we look into the processor's data sheet and we find that the pseudo Assembler code for ADDX8 is:
|
||||||||||||||||
| ADDX8 ( SOURCE8, DESTINATION8 ) <==> SUMX8 SOURCE8, DESTINATION8 | ||||||||||||||||
|
where SOURCE8 and DESTINATION8 are macros that will later be replaced by strings that are taken from the set { X8 , Y8 }. Because we know that there is no half-adder, we must write the minimal code for this function using ADDX8. The easiest way to do so, is to clear the carry flag before entering the operation. Let's assume that the pseudo-Assembler does this with the following code:
|
||||||||||||||||
| CLEAR ( CARRY BIT ) <==> CLC | ||||||||||||||||
|
Operating the ADDX8 function with the cleared carry flag is equivalent to simply ignoring the flag and we have a half-addition only. There is nothing that prevents us from considering the 8-bit addition in base 256. The carry flag will only be 1, if the addition exceeded 255, just like the carry had only became 1 for additions in base 10, if the single digit addition exceeded the value 9. The pseudo Assembler for the half-adder function, denoted ADD8 will look like:
|
||||||||||||||||
|
ADD8
( SOURCE8,
DESTINATION8 ) <==>
if
code (ADD8) exists then use
this code else {
CLEAR
( CARRY BIT )
ADDX8 ( SOURCE8, DESTINATION8 ) } |
||||||||||||||||
|
Now imagine that our higher level language compiler wants to call the simple addition for the value in X8 to Y8. We told the compiler to first seek in a valid translation list, if such a code exists and in that case, use this code. In the other case, it must emulate the function and the generated code will be:
|
||||||||||||||||
|
ADD8
( X8, Y8 ) <==>
{ CLC
SUMX8 X8, Y8 } |
||||||||||||||||
|
You'll say: trivial ! Yes, of course. But let's continue developing our ideas. Now imagine that the compiler wants to simply add two 16-bit registers. Again it first seeks, if a corresponding native code exists and uses it in the affirmative. We assume for a moment that the translation list has no such code. In that case, the compiler must synthesize the code from other code modules. You certainly noticed that we red-highlighted the ADDX8, ADD8 and CLEAR codes. These are part of a new intermediate macro language that will be generated from the compiler, BEFORE doing the assembly. Let us call this intermediate language "TEMPLE" (Translate to Elementary Macro Program Language Emulation). We now will define a new macro function ADD16:
|
||||||||||||||||
|
ADD16
( SOURCE16,
DESTINATION16 ) <==>
if
code (ADD16) exists then use
this code else
{
SPLIT
( SOURCE16 --> SOURCE16_H, SOURCE16_L ) |
||||||||||||||||
|
where the SPLIT function tells the compiler to divide the registers into their upper and lower parts. Hence, we can generalize the addition as a recursive divide and conquer algorithm and use the affirmative case of the native code existence as the termination condition. (Nota bene: We must assume here that the source and the destination have the same data size and that the )
|
||||||||||||||||
|
ADD
( SOURCE,
DESTINATION ) <==>
if
code (ADD) exists for DATASIZE (
DESTINATION ) then use
this code else
{
SPLIT
( SOURCE --> SOURCE_H, SOURCE_L )
SPLIT
( SOURCE --> SOURCE_H, SOURCE_L ) |
||||||||||||||||
|
If we admit that our lowest data-size is 8-bit, then we can reorganize this definition. If the compiler never finds a correct code for a certain data-size that is greater than 8-bit, it will recursively step down to the 8-bit data-size and then it will necessarily generate the code without recursive call or express an error message that it has detected a recursive call without termination. This precaution is important to prevent the compiler from getting hooked in a cycle. Thus, we can redefine:
|
||||||||||||||||
|
ADD
( SOURCE,
DESTINATION ) <==>
if
code (ADD) exists for DATASIZE (
DESTINATION ) then use
this code else {
if DATASIZE
( DESTINATION )
= 8-bit then
{
CLEAR
( CARRY BIT )
else
{ SPLIT
( SOURCE --> SOURCE_H, SOURCE_L ) if DATASIZE ( DESTINATION ) = 8-bit then ERROR 'Recursive call without termination'
else
{ SPLIT
( SOURCE --> SOURCE_H, SOURCE_L ) |
||||||||||||||||
|
TEMPLE must not necessarily be a textual language, but can be as well a graphical icon-based language. As already pointed out, we tested the TEMPLE concept in the unpublished study software that we designed in LabVIEW. The code looks impressively simple, although LabVIEW does not allow direct recursive calls. The green icon color code corresponds to the red highlighted TEMPLE keywords. However the pink function calls denote the LabVIEW indirect recursive calls of the designated function. Also inside this sub.vi TEMPLE checks if the desired native code exists. Note that we opted for a more complex and variable function DATASIZE that we will not discuss here.
Now, where is the gain? Normally you would tell the compiler the whole code for all the possible data-sizes. Let's try the experiment with the H8/3292 micro-controller. We suppose for a moment that there is no 16-bit addition in the instruction set. So, a normal compiler would somewhere need a native code for this, otherwise it could not do the 16-bit addition. The same is true for the 32-bit addition. By consequence the compiler designers must provide all this code. And writing it -bug-free- might be a hard thing. With TEMPLE, things become very easy. Let's compare the H8 native code data-base that is necessary for doing the addition in ALU registers:
|
||||||||||||||||
|
This already indicates that with the TEMPLE
idea, the compiler will need nothing but the processor instruction set to
synthesize the code for any data-size. The reader surely find the interest
for BIGNUM
problems. (Note: At this point we deliberately ignore the issue that we
might encounter due to a limited number of ALU registers.) But, the power of the TEMPLE
concept becomes more dramatically visible, if later we will analyze the code
for the multiplication.)
|
||||||||||||||||
|
2.2. Changing the device Since the H8/3292 instruction set includes an ADD16 code function, we may add this to the list. Anyway, we might fill the data-base will all the optimized codes. So, nothing prevents us from slowly and gradually improving the compiler achievement. But, once we are confronted with a new device with a different instruction set or specifications, we are not obliged to completely fill the data-base, but just add the minimum from the instruction set. This reduces the efforts of the compiler designer to a minimum, dramatically saving costs. It would be an excellent idea to allow the end-user to write his own code or use his preferred library for any of the basic function, if he likes. Since he certainly will only work on a small number of different processors, he will find the ways to somewhere grab better code in open source repertories or otherwise shared databases. If we switch to the Microchip PIC16F88, a particularity appears that TEMPLE should be able to respect. We have exactly the contrary of what we developed so far concerning the addition, because -as already pointed out- the 16F88 has no native add with carry function. In that case, the full-adder must be emulated. A simple way to do this, may be with the following code: |
||||||||||||||||
|
ADDX8_alternative
( SOURCE,
DESTINATION )
<==> {
BRANCH
(on CARRY BIT clear) --> skip next line |
||||||||||||||||
|
But how will we detect a potential cycle, in the case that the compiler (erroneously) will not find a termination condition. In fact, this should never happen, because either there is a native termination ADDX (8) or ADD (8). However the TEMPLE compiler would need a cycle survey, in order to avoid a crash, in the case of a cross-reference. The complete code then has the following appearance. Note that the potential cycle calls are blue high-lighted. Now there are two new instructions BRANCH and INC (increment, e.a. count or add 1)
|
||||||||||||||||
|
ADD
( SOURCE,
DESTINATION ) <==>
if
code (ADD) exists for DATASIZE (
DESTINATION ) then use
this code else {
if DATASIZE
( DESTINATION )
= 8-bit then
{
CLEAR
( CARRY BIT )
else
{ SPLIT
( SOURCE --> SOURCE_H, SOURCE_L ) if
DATASIZE
(
DESTINATION ) = 8-bit
then
{
BRANCH
(on CARRY BIT clear) --> skip next line
else
{ SPLIT
( SOURCE --> SOURCE_H, SOURCE_L ) INC (DESTINATION) <==> if code (INC) exists for DATASIZE ( DESTINATION ) then use the code else { if DATASIZE ( DESTINATION ) = 8-bit then ERROR 'Recursive call without termination'
else
{ SPLIT
( DESTINATION --> DESTINATION_H, DESTINATION_L ) |
||||||||||||||||
Let us have a look at the compiler handling in the case of the PIC16F88,
where we have the following native settings:
Note that the branch instruction is independent of the data-size, and the code can always be directly picked from the native data-base. Also note that for the PIC16F88, the addition needs two lines of the instruction set. If we want to do the ADD16 (X16,Y16) operation, we will get the sequence:
|
||||||||||||||||
| Perhaps the reader has noticed that for the PIC16F88, TEMPLE still uses mnemonics for the ALU registers. Strictly speaking X16_L doesn't mean anything for the PIC. The Assembler will need a clear reference in terms of a memory location. With the H8, there is an issue, because the number of operation registers is limited. Depending on the application, it could be a good idea to define the TEMPLE ALU registers as memory variables. In the case of the H8/3292, this perfectly makes sense, since the micro-controller has a fast addressable on-chip RAM memory. By this means, for instance, the bignum issue mentioned above could easily be solved. The spin-off is a certain loss in speed. However, the compiler could be completed by a code optimizer that would eliminate all unnecessary data moves, especially for volatile data. If we only consider the H8/3292 ALU registers that are part of the CPU, the native code data-base will be as short as a single line for each operation of the instruction set. But, if they were chosen from RAM, a certain number of wrapping data moves would be required. Anyway, data moves can be treated exactly the same way as ALU operations, and the related code can be synthesized with TEMPLE with very limited native code lines. | ||||||||||||||||
|
2. A recursive view on the multiplication TEMPLE doesn't exist yet. To be precise, it only exists in the unpublished experimental software that we called "ULTIMATE ROBOLAB II". The idea was born from the attempt of merging ULTIMATE ROBOLAB (UR) for the H8/3292 (RCX) with the PIC version. Both programs are very comparable and represent individual languages that descend from the ROBOLAB software. Although this programming environment is discontinued, some of the extraordinary features survive in the LabVIEW NXT toolkit. However the UR PIC version and RCX version are not compatible in many ways. In the second study-version we pushed everything to the extreme, in order to find out, how we could build a graphical compiler that deserves micro-controllers that are so different like the 16F88 and the H8. We also considered other chips AVRs, the 8051, the ARM, the dsPIC33, the Renesas R13, Parallax.... From the comparison, TEMPLE was the best intersection of all the processors. In the following we will expose the TEMPLE version of a portable multiplication (both from the data-size and the target point of view). For easy understanding, we will present the routine as a graphical flow-chart program, but it could be as well described in the textual form.
This diagram shows the multiplication that is universal for all integer data-sizes -as long as they are powers of two- and for a large list of micro-controllers. The program uses TEMPLE instructions that have not been discussed here. The code should be self-explanatory. Note the "globals" icon that gives URII the correct label number. Now you may compare the generated codes (expect them to still be buggy, because there might have been bad inputs in the native data-base).
|
||||||||||||||||
//CALCULATE Y*=X FOR 8-BIT INTEGERS
//ON THE RENESAS H8/3292
//NEEDS TEMPORARY REGISTER T8
//COMMENTS ADDED MANUALLY
//CODE GENERATED WITH UR2
MOV.B Y8,T8 //T=Y
MOV.B #0X0,Y8 //Y=0
LABEL BEGINLOOP_10002 //WHILE T8<>0
MOV.B T8,T8 //GET ZERO BIT
BNE LOOPDO_10002 //IF <>0 LOOP
BRA ENDLOOP_10002 //ELSE SKIP LOOP
LABEL LOOPDO_10002
BTST #0,T8 //TEST T8 lsb
BNE TRUE_10001 //IF <>0 DO
BRA ENDIF_10001 //OTHERWISE SKIP
//CONDITIONAL CODE
LABEL TRUE_10001
ADD.B X8,Y8 //Y+=X
LABEL ENDIF_10001
SHLL X8 //X<<1
SHLR T8 //T>>1
BRA BEGINLOOP_10002 //LOOP
LABEL ENDLOOP_10002
|
//CALCULATE Y*=X FOR 8-BIT INTEGERS //ON THE MICROCHIP PIC16F88 //NEEDS TEMPORARY REGISTER T8 //BANK SPECIFICATION MISSING //CODE GENERATED WITH UR2 MOVF Y8,W
MOVWF T8
LABEL BEGINLOOP_10002
MOVF T8,W
BTFSC STATUS,Z //JUMPS TO LOOPDO_10002
GOTO ENDLOOP_10002
LABEL LOOPDO_10002
BTFSC T8,0 //JUMPS TO TRUE_10001
GOTO ENDIF_10001
LABEL TRUE_10001
MOVF X8,W
ADDWF Y8
LABEL ENDIF_10001
BCF STATUS,C //CLEAR CARRY; UR2 DOESN'T
//KNOW THAT THIS IS OBSOLETE
RLF X8
BCF STATUS,C //IMPORTANT, OTHERWISE T8 NEVER 0
RRF T8
GOTO BEGIN_LOOP_10002
LABEL ENDLOOP_10002
|
|||||||||||||||
//CALCULATE Y*=X FOR 16-BIT INTEGERS //ON THE RENESAS H8/3292 //NEEDS TEMPORARY REGISTER T16 //Y16=R5; X16=R6; T16=R2 MOV.W Y16,T16 //NATIVE 16-BIT MOVE EXISTS
MOV.B #0X0,Y16_H //IMMEDIATE MOVE ONLY 8-BIT
MOV.B #0X0,Y16_L
LABEL BEGINLOOP_10002
MOV.B T16_L,T16_L
BNE LOOPDO_10002 //IF NON ZERO LOOP
MOV.B T16_H,T16_H //IF LOW BYTE ZERO, TEST HIGH BYTE
BNE LOOPDO_10002
BRA ENDLOOP_10002
LABEL LOOPDO_10002
BTST #0,T16_L
BNE TRUE_10001
BRA ENDIF_10001
LABEL TRUE_10001
ADD.B X16_L,Y16_L
ADDX X16_H,Y16_H
LABEL ENDIF_10001
SHLL X16_L
ROTXL X16_H //NEEDS ROTATION WITH CARRY
SHLR T16_H
ROTXR T16_L
BRA BEGINLOOP_10002
LABEL ENDLOOP_10002
|
//CALCULATE Y*=X FOR 16-BIT INTEGERS //ON THE MICROCHIP PIC16F88 //NEEDS TEMPORARY REGISTER T16 //BANK SPECIFICATION MISSING MOVF Y16_H,W
MOVWF T16_H
MOVF Y16_L,W
MOVWF T16_L
CLRF Y16_H
CLRF Y16_L
LABEL BEGINLOOP_10002
MOVF T16_L,W
BTFSS STATUS,Z //SKIP IF ZERO; NEEDS HIGH BYTE TEST
GOTO LOOPDO_10002 //LOW BYTE NON-ZERO, SO RUN LOOP
MOVF T16_H,W
BTFSC STATUS,Z //IF NON-ZERO RUN LOOP
GOTO ENDLOOP_10002
LABEL LOOPDO_10002
BTFSC T16_L,0
GOTO ENDIF_10001
LABEL TRUE_10001
MOVF X16_L,W
ADDWF Y16_L
BTFSC STATUS,C
INCF Y16_H
MOVF X16_H,W
ADDWF Y16_H
LABEL ENDIF_10001
BCF STATUS,C
RLF X16_L
RLF X16_H
BCF STATUS,C
RRF T16_H
RRF T16_L
GOTO BEGIN_LOOP_10002
LABEL ENDLOOP_10002
|
|||||||||||||||
|
Relevant links: |
||||||||||||||||