FORTRAN on the P2
wmosscrop
in Propeller 2
Well, FORTRAN IV anyway... I finally have the cpu and disk working on my IBM 1130 emulator ported from the P1.
PAGE 1
// JOB 2BAD
LOG DRIVE CART SPEC CART AVAIL PHY DRIVE
0000 2BAD 2BAD 0000
V2 M12 ACTUAL 8K CONFIG 8K
// FOR
*IOCS(1403 PRINTER)
*LIST SOURCE PROGRAM
      WRITE(5, 15)
   15 FORMAT(1X,'HELLO, WORLD')
      CALL EXIT
      END
FEATURES SUPPORTED
 IOCS
CORE REQUIREMENTS FOR
 COMMON 0 VARIABLES 0 PROGRAM 40
END OF COMPILATION
// XEQ
HELLO, WORLD
Lots of lessons learned. Mainly: don't assume that the instructions work the same as they did on the P1!
For example, SUMC works differently. On the P1 it set C if the result overflowed. On the P2 it sets it to the correct sign of the result.
I *think* this P2 code is the equivalent of the P1 SUMC instruction
sumc val1, val2 wc ' Add/subtract operand based on carry, set carry to true sign of result.
testb val1, #31 xorc ' Test if result sign and true sign are different. If so, set C due to an overflow.
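To sanity-check the two-instruction sequence above, here is a minimal Python model. It assumes only what the post states: the P2 SUMC leaves C as the true sign of the result, and the P1 SUMC set C on signed overflow. The function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def p2_sumc(d, s, c):
    """Model of P2 'SUMC D,S WC' as described in the post:
    subtract S if C is set, else add; C becomes the true sign."""
    true = to_signed(d) - to_signed(s) if c else to_signed(d) + to_signed(s)
    return true & MASK32, 1 if true < 0 else 0

def p1_style_overflow(d, s, c):
    """P2 SUMC followed by 'TESTB D,#31 XORC':
    C = true sign XOR stored sign, which is 1 only on signed overflow."""
    result, c_flag = p2_sumc(d, s, c)
    c_flag ^= (result >> 31) & 1
    return result, c_flag
```

For example, adding 1 to $7FFFFFFF wraps to $80000000: the true sign is positive but bit 31 is set, so the XOR leaves C = 1, which is the P1-style overflow indication.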
That said, there are a lot of P2 features that really help:
* The "xorc" modifier and its kin.
* GETBYTE/SETBYTE and the like. These replaced 3 lines of code with 1 in many cases. Very powerful for extracting values from structures.
* Smart pins! Synchronous and asynchronous serial transmit/receive. With the former I was able to implement a fast and compact custom SPI driver; with the latter I will eliminate the cog needed for RS-232 communications.
That said... I don't see any way to use asynchronous serial transmit/receive with slow baud rates such as 1200 at clock speeds of 180MHz. Ah well.
* DRVL, etc.
* LUT/HUB execution (once I got the hang of locating the code for execution...)
* Stacks! Stacks! Stacks!
* RDFAST/WRFAST.
Thanks, Chip, and everyone else, for all of your hard work. I know I'm just scratching the surface of what the P2 can do!
Walter
Comments
Would you mind posting those differences on the P2 Tricks and Traps thread, please?
https://forums.parallax.com/discussion/169069/p2-tricks-traps-differences-between-p1-reference-material-only
Nice job. Brings back Fortran memories and the WATFOR compiler, not that I did much Fortran.
Any chance you could replicate the byte you want to send into all four bytes of a word, apply SPLITB and transmit 32 bits at 4800 baud?
For receive, configure to receive 32 bits at 4800 baud and use MERGEB to turn that into four copies of a 1200 baud byte?
Just thinking out loud, as I'm not sure what effect that would have on the start and stop bits. You might need to fall back onto bit-bash, but you might be able to use an ISR to manage that in the background.
TJV saves an instruction to test and jump if result overflowed:
In my case the next statement was something like this, where I only want to set (and not reset) the overflow flag:
But this is still good to know, thanks!
Will do, once I get a clean example and incorporate the tjv instruction that @TonyB_ pointed out.
The P1 emulator uses files on an SD card for punched card input and line printer output. At this time I don't have that working on the P2. The input is a hack where I read the cards from LUT RAM and just output to serial via pin 62.
However, the 1130 did have a console printer/keyboard that could be used for job entry. I also support that in the P1 emulator but have yet to get it moved over to the P2.
As far as the emulator, it is a mess. I have had, and still have, issues with timing of simulated I/O operations. The P1 emulator had this worked out; with the new instructions and shorter execution paths the timing balances between cogs have to be reevaluated. Basically the I/O cogs have to maintain the device status information (particularly "device busy") long enough to allow the cpu cog to access and handle it (sometimes multiple times, sometimes no times) but not do so for so long that the emulation is delayed.
Once I get it cleaned up and working with all devices again I would be glad to post it for all to see.
Edit: solved a major timing issue. I was releasing a lock too soon, causing the overwriting of buffered data. Still need to really clean up the code.
The I/O was on shielded twisted pair and would run to 1km or more at 56 bps. IIRC a character was transmitted as a start bit, 7-bit ASCII, 2 parity bits, and a stop bit. An ack of 2 bits would be returned for success or fail, and the process continued or repeated. Up to ten devices could be on one cable (typewriter terminal, video, printer, card reader or punch, etc.).
The logic boards for this were on 3 x 12”x12” PCBs plus 1 x 12”x12” dedicated PCB for the device. This was 1970 onwards. I replaced this interface for connecting Centronics printers in the late 70’s with a 6802 micro, and later (~1982) with a tiny 2x4” PCB with a pair of 68705P3S, 2 x 14-pin gates, 2 transistors and the isolation coil. The P1 with 2 cogs could have replaced the 68705’s.
Perhaps I wasn't clear in my explanation. 4800 baud is achievable on the P2 all the way up past 300MHz.
If the byte to send was $5A and you constructed a long with $5A5A5A5A (the byte to send replicated) and then apply a SPLITB you end up with four copies of each bit in sequence:
%0000_1111_0000_1111_1111_0000_1111_0000
Sending those 32 bits at 4800 baud should look like 8 bits sent at 1200 baud to a device that is only sampling at 1200 baud.
On receipt at the P2, sampling the 1200 baud stream at 4800 baud might give 4 bits registered per bit sent: %0000_1111_0000_1111_1111_0000_1111_0000
A MERGEB would reshuffle those into bytes: %0101_1010_0101_1010_0101_1010_0101_1010 or $5A5A5A5A
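A small Python model can confirm the bit patterns quoted above. It models the two bit permutations directly instead of naming the P2 instructions, because which of SPLITB/MERGEB performs which direction should be checked against the P2 instruction documentation. The function names are made up for illustration.

```python
def spread_bytes(s):
    """Bit i of byte k moves to bit 4*i + k.
    With the same byte in all four positions, every bit appears 4 times in a row."""
    out = 0
    for k in range(4):
        for i in range(8):
            out |= ((s >> (8 * k + i)) & 1) << (4 * i + k)
    return out

def gather_bytes(s):
    """Inverse permutation: bit 4*i + k moves to bit i of byte k.
    A 4x oversampled byte comes back as four copies of that byte."""
    out = 0
    for k in range(4):
        for i in range(8):
            out |= ((s >> (4 * i + k)) & 1) << (8 * k + i)
    return out
```

Here `spread_bytes(0x5A5A5A5A)` gives $0F0FF0F0, which is the %0000_1111_0000_1111_1111_0000_1111_0000 pattern above, and `gather_bytes` turns it back into $5A5A5A5A.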
I don't know if this is practical, as I don't have the means to test it. The remaining problem I see for transmit is that the start bits wouldn't be stretched and probably be ignored as a glitch at the receiver.
Shifting the data down by two bits could allow a 3/4 start bit, and a final 1/2 data bit.
So the long to send (LSB first) would be %00_1111_0000_1111_1111_0000_1111_0000_00 - being (left to right) 0,1,0,1,1,0,1,0, start extension.
Constraining yourself to 7 bit bytes would give an extra 3 bits to extend the start bit to the correct length without jeopardizing the correct receipt of the MSB - you'd probably want to set bit 31 to bring the beginning of the stop bit forward to the correct time to avoid a framing error.
So the long to send (LSB first) would be %1_1111_0000_1111_1111_0000_1111_0000_000 - being (left to right) stop extension, 1,0,1,1,0,1,0, start extension.
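The two shifted longs above can be reproduced with a few lines of Python. This is only a model of the bit layout, with made-up helper names. Whether a real 1200 baud receiver accepts the shortened start bit is the open question raised in the post.

```python
def stretch4(value, nbits):
    """Repeat each of the low nbits bits of value four times (LSB first)."""
    out = 0
    for i in range(nbits):
        if (value >> i) & 1:
            out |= 0xF << (4 * i)
    return out

def frame_8bit(byte):
    """8 data bits shifted up by 2: two low zero bits extend the start bit,
    and the top data bit keeps only two of its four copies."""
    return (stretch4(byte, 8) << 2) & 0xFFFFFFFF

def frame_7bit(byte):
    """7 data bits shifted up by 3: three low zero bits extend the start bit,
    and bit 31 is set to bring the stop bit forward."""
    return ((stretch4(byte & 0x7F, 7) << 3) | (1 << 31)) & 0xFFFFFFFF
```

For $5A these give $3C3FC3C0 and $F87F8780, matching the binary patterns quoted for the 8-bit and 7-bit cases.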
Success in each case would depend on the de-glitch circuitry for identifying start bits and the alignment of the sampling point with respect to the beginning of the start bit.
Receiving the 1200 baud data would see the start bit spill into the data bits under 4800 baud timing. This can't be prevented, but can be managed by shifting the received long before performing the MERGEB. The MSB (received last) of an 8 bit byte is then only present in the lowest byte, but the other bytes can be used for comparison in the other bit locations to detect glitches.
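The receive side can be modelled the same way. The Python sketch below assumes the hardware consumes the first quarter of the 1200 baud start bit as its own start bit, so the received long begins with three leftover start samples. That alignment is an assumption, and the helper names are made up.

```python
def sample_1200_at_4800(byte):
    """Model the 32 bits captured at 4800 baud from one 1200 baud character:
    three leftover start-bit samples, then four samples per data bit (LSB first)."""
    samples = [0, 0, 0]
    for i in range(8):
        samples += [(byte >> i) & 1] * 4
    samples = samples[:32]  # the last three samples of the MSB do not fit
    return sum(bit << n for n, bit in enumerate(samples))

def recover(rx):
    """Shift out the start-bit residue, regroup every 4th sample into bytes,
    and compare the low 7 bits of the four copies to detect glitches."""
    shifted = rx >> 3
    copies = [0, 0, 0, 0]
    for k in range(4):
        for i in range(8):
            copies[k] |= ((shifted >> (4 * i + k)) & 1) << i
    data = copies[0]  # only the lowest copy holds the MSB
    clean = all((c & 0x7F) == (data & 0x7F) for c in copies[1:])
    return data, clean
```

With this alignment `recover(sample_1200_at_4800(0xA5))` returns `(0xA5, True)`, and flipping a single sample in one of the other copies makes the second value `False` while the lowest copy still yields the byte.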
So (untested code disclaimer), send might look like
and receive
Ah, now it makes sense as to what you were doing with transmitting 32 bit characters.
It might work, but as you point out the start/stop bits might be an issue.... I was thinking maybe they could be bit-banged on transmission, not sure about receive.
The top part of the diagram shows an original 1200 bps character with 8 data bits ($55, plus start and stop bits, of course) being received at 4800 bps, generating 32 bits of information.
Indeed, it would require some post-processing in order to be "peeled from" the residual start-bit samples that were received as data, and also to "replicate" the last received data bit, in order to produce a meaningful 4x (nibble-wise) image of the original 8-bit data. Simple (and fast) shifts and bit-replication ops could deal with those needs, enabling the recovery of the raw byte.
The bottom part illustrates the transmission of the same 8 bits, now making use of a 3600 bps timeframe and a total of 28 bits of information, showing how the start, data and stop bits would need to be extended/replicated, to meet the intended effect at P2 output pin. The replication of each original data bit into a triplet is not that hard to achieve; the same concept could be used to deal with the twits that "extend" the duration of the start and stop bits.
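As a rough check of the 28-bit count, here is a Python model of the 3600 bps transmit frame. It assumes the smart pin adds its own one-slot start and stop bits around the payload, so the payload carries two extra start slots, three slots per data bit, and two extra stop slots. The function names are made up.

```python
def build_payload(byte):
    """28 payload bits, LSB first: 2 start-extension zeros,
    each data bit three times, 2 stop-extension ones."""
    bits = [0, 0]
    for i in range(8):
        bits += [(byte >> i) & 1] * 3
    bits += [1, 1]
    return bits

def line_as_seen_at_1200(byte):
    """Add the hardware start and stop slots, then sample the middle
    of every group of three slots, as a 1200 bps receiver would."""
    line = [0] + build_payload(byte) + [1]
    return line[1::3]
```

For $55 the sampled line reads start (0), then 1,0,1,0,1,0,1,0 (LSB first), then stop (1), which is an ordinary 1200 bps 8N1 character.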
The net result of using this approach is getting rid of the bit-bashing operations that consume a cog's attention, leaving to the smart pins the burden of dealing with almost all the bother imposed by such low-speed transfer rates.
Hope it could help a bit.
Henrique
I've actually spent a bit of time looking at the 3600 baud transmission, in order to get the timing cleaner, and worked out how to do 8 bits plus parity and 2 stop bits. It looks like all of that could be achieved in approximately ten code longs, one input/result register, a mask register, and a temp register, taking around 66 clocks if everything is kept in cog ram.
Edit: 3600 baud limits P2 sysclock to around 235MHz, but as the OP was talking about 180MHz sysclock it shouldn't be a problem.
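The clock limits quoted in this thread follow from simple arithmetic if the smart pin's bit period is a 16-bit count of system clocks. That field width is an assumption here and should be checked against the smart pin documentation. A minimal Python sketch:

```python
MAX_CLOCKS_PER_BIT = 0xFFFF  # assumption: 16-bit clocks-per-bit field

def max_sysclk_hz(baud):
    """Highest system clock at which one bit period still fits the field."""
    return MAX_CLOCKS_PER_BIT * baud

def fits(sysclk_hz, baud):
    """True if sysclk_hz / baud clocks per bit fits in the 16-bit field."""
    return sysclk_hz // baud <= MAX_CLOCKS_PER_BIT
```

Under that assumption 3600 baud tops out near 235.9MHz and 4800 baud near 314.6MHz, while 1200 baud at 180MHz needs 150,000 clocks per bit and does not fit. At 3600 baud and 180MHz the divider is 50,000.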
This could be made more flexible with an init routine to patch for the desired parameters. Calculation would be required for the parity mask (BMASK #bits-1), where to place the parity bit (if used; bits *3), where to place the extra stop bit (if used; with parity: bits * 3+3), the final shift value (30 - ((bits + parity)*3)), and how many bits the smartpin needs to send ((bits + parity + extra_stop) * 3 + 2 - 1).
e.g. for 5M2 the current code would require the mask to be 5 bits, BITC to be patched to BITH and the S field to #2<<5 | 15, the REP to be patched to 5 loops, the final shift patched to 15 to get everything lined up, and WXPIN ## (CLKFREQ/3600)*$1_0000 + ((5 + 1 + 1)*3 + 2) - 1.
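The patch values for other formats can be tabulated with a short Python helper. This is a sketch of the formulas given above only (parity position, extra stop position, smart pin bit count). It assumes the bit period sits in the upper 16 bits of the WXPIN value and the bit count minus one in the low bits. The function and key names are made up.

```python
def tripled_params(data_bits, parity, extra_stop, sysclk_hz, baud=1200):
    """Patch values for sending one character at 3x the target baud rate."""
    p = 1 if parity else 0
    e = 1 if extra_stop else 0
    bit_count = (data_bits + p + e) * 3 + 2        # bits the smart pin sends
    clocks_per_bit = sysclk_hz // (baud * 3)       # e.g. CLKFREQ/3600
    return {
        "parity_pos": data_bits * 3 if parity else None,
        "extra_stop_pos": (data_bits + p) * 3 if extra_stop else None,
        "bit_count": bit_count,
        "wxpin": (clocks_per_bit << 16) | (bit_count - 1),
    }
```

For 5M2 at 180MHz this gives a parity position of 15, an extra stop position of 18, and a WXPIN value of $C350_0016 (50,000 clocks per bit, 23 bits minus 1).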
Untested code:
Something similar could be done for reception at 3600 baud for up to 8P2 using as many as 32 bits (0..1 for the tail of the start bit, 2..17-26 for data at 3 per data bit, the next 3 for parity if used, and the next 3 for the extra stop bit if used).
Thanks for working this out... I've been spending the last few days diagnosing, and finally replacing, my wife's sick computer. I'll take a deeper look at this when I come up for air.
Walter
My porting of the code from my P1 1130 emulator to the P2 ran into several issues with what appeared to be timing.
On the P2, which is so much faster, it wouldn't work consistently... or at all.
I finally figured out that my P1 code was incorrect... even though it would run every sample program I tested.
I found that I was clearing the interrupt status information at the wrong time (when the interrupt was handled) instead of when the device was "sensed" with the "reset status" bit set. The difference in timing, coupled with much faster disk access (smart pins), threw everything into confusion.
This might not have happened with other hardware but the 1130 developers sometimes relied on an interrupt not occurring before a certain number of instructions had been executed.
So in the end it was timing... just not what I thought.
Lesson learned: code that you think is fully debugged... isn't.
That can happen.
It's a well-known fact that software deteriorates over time. Many is the time I have gone back to look at code that was working perfectly when I left it, only to find new bugs have crept in while I wasn't using it.
A partly software variant on this would be to pack 3 bits per baud-bit, and now you can fit 10 bits into a 32 bit frame.
The first start bit would send/receive as 2 bits, as the HW manages one non-data start bit.
For 1200 baud from 180MHz, I make the divider 50,000 with tripled bits.
You would need more precise baud definitions than usual, as you are receiving 3x the length, but I guess this ancient stuff all works from a 1.8432MHz crystal somewhere?
None of this new-fangled calibrated RC oscillator stuff
Building/extracting the payloads is a bit-bang variant, but it does not need long delays for each bit, as the HW eventually manages that.
The P2 code that worked when compiled with PNut, once compiled with fastspin, revealed a missing wcz on a compare.
I am highly interested in what you are doing.
Still need to get the sdcard access working.