FORTRAN on the P2

wmosscrop · 2020-05-05 21:37

Well, FORTRAN IV anyway... I finally have the cpu and disk working on my IBM 1130 emulator ported from the P1.

PAGE   1

// JOB    2BAD

LOG DRIVE   CART SPEC   CART AVAIL  PHY DRIVE
  0000        2BAD        2BAD        0000

V2 M12   ACTUAL  8K  CONFIG  8K

// FOR
*IOCS(1403 PRINTER)
*LIST SOURCE PROGRAM
      WRITE(5, 15)
   15 FORMAT(1X,'HELLO, WORLD')
      CALL EXIT
      END

FEATURES SUPPORTED
 IOCS

CORE REQUIREMENTS FOR
 COMMON      0  VARIABLES      0  PROGRAM     40

END OF COMPILATION

// XEQ

HELLO, WORLD

Lots of lessons learned. Mainly: don't assume that the instructions work the same as they did on the P1!

For example, SUMC works differently. On the P1 it set C if the result overflowed. On the P2 it sets C to the correct sign of the result.
I *think* this P2 code is the equivalent of the P1 SUMC instruction:

sumc    val1, val2 wc         ' Add/subtract operand based on carry, set carry to true sign of result.
testb	val1, #31 xorc		' Test if result sign and true sign are different. If so, set C due to an overflow.
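
To make the difference concrete, here is a small Python model of the two carry conventions. It is only a sketch under the behavior described above (P1: C = overflow, P2: C = true sign of the result); the function names are made up for illustration and nothing here has been checked against real hardware.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sumc(d, s, c_in):
    """Model of SUMC D,S: D - S if C is set, else D + S (signed, 32-bit).

    Returns (result, true_sign, overflow):
      true_sign is the value the P2 is described as leaving in C,
      overflow  is the value the P1 is described as leaving in C.
    """
    if c_in:
        exact = to_signed(d) - to_signed(s)
    else:
        exact = to_signed(d) + to_signed(s)
    result = exact & MASK32
    true_sign = 1 if exact < 0 else 0
    result_sign = result >> 31
    # Same idea as TESTB val1,#31 XORC: C = true sign XOR result bit 31
    overflow = result_sign ^ true_sign
    return result, true_sign, overflow

if __name__ == "__main__":
    for d, s, c in [(0x7FFFFFFF, 1, 0), (5, 7, 1), (0x80000000, 1, 1)]:
        r, sign, ovf = sumc(d, s, c)
        print(f"d={d:08X} s={s:08X} c={c} -> result={r:08X} sign={sign} overflow={ovf}")
```

The overflow flag is 1 exactly when the wrapped 32-bit result has the wrong sign, which is the condition the TESTB/XORC pair reconstructs.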

That said, there are a lot of P2 features that really help:
* The "xorc" modifier and its kin.
* GETBYTE/SETBYTE and the like. These replaced 3 lines of code with 1 in many cases. Very powerful for extracting values from structures.
* Smart pins! Synchronous and asynchronous serial transmit/receive. With the former I was able to implement a fast and compact custom SPI driver; with the latter I will eliminate the cog needed for RS-232 communications.
  However, I don't see any way to use asynchronous serial transmit/receive with slow baud rates such as 1200 at clock speeds of 180MHz. Ah well.
* DRVL, etc.
* LUT/HUB execution (once I got the hang of locating the code for execution...)
* Stacks! Stacks! Stacks!
* RDFAST/WRFAST.

Thanks, Chip, and everyone else, for all of your hard work. I know I'm just scratching the surface of what the P2 can do!

Walter

Cluso99 · 2020-05-05 22:54

@wmosscrop,
Would you mind posting those differences on the P2 Tricks and Traps thread please.
https://forums.parallax.com/discussion/169069/p2-tricks-traps-differences-between-p1-reference-material-only

Nice job. Brings back Fortran memories and the WATFOR compiler, not that I did much Fortran.

David Betz · 2020-05-05 22:56

This sounds really cool. Does this mean you have some sort of interactive OS running on your 1130 emulator? How can I get your emulator and whatever software runs on it?

AJL · 2020-05-06 05:00

wmosscrop wrote: »

* Smart pins! Synchronous and asynchronous serial transmit/receive. With the former I was able to implement a fast and compact custom SPI driver; with the latter I will eliminate the cog needed for RS-232 communications.
That said... I don't see any way to use asynchronous serial transmit/receive with slow baud rates such as 1200 at clock speeds of 180MHz. Ah well.

Thanks, Chip, and everyone else, for all of your hard work. I know I'm just scratching the surface of what the P2 can do!

Walter

Any chance you could replicate the byte you want to send into all four bytes of a long, apply MERGEB and transmit 32 bits at 4800 baud?
For receive, configure to receive 32 bits at 4800 baud and use SPLITB to turn that into four copies of a 1200 baud byte?

Just thinking out loud, as I'm not sure what effect that would have on the start and stop bits. You might need to fall back onto bit-bash, but you might be able to use an ISR to manage that in the background.

TonyB_ · 2020-05-06 09:15

wmosscrop wrote: »
Lots of lessons learned. Mainly: don't assume that the instructions work the same as they did on the P1!

For example, SUMC works differently. On the P1 it set C if the result overflowed. On the P2 it sets it to the correct sign of the result.
I *think* this P2 code is the equivalent of the P1 SUMC instruction
	sumc    val1, val	wc	' Add/subtract operand based on carry, set carry to true sign of result.
	testb	val1, #31	xorc	' Test if result sign and true sign are different. If so, set C due to an overflow.

TJV saves an instruction to test and jump if result overflowed:

	sumc    val1, val	wc	' Add/subtract operand based on carry, set carry to true sign of result.
	tjv	val1, #overflow		' Jump to overflow if val1[31] != C

wmosscrop · 2020-05-06 13:05

I figured that Chip had already handled this change in some way; I just didn't look hard enough.

In my case the next statement was something like this, where I only want to set (and not reset) the overflow flag:

if_c      bith            state,  #overflow_bit

But this is still good to know, thanks!

wmosscrop · 2020-05-06 13:07

Cluso99 wrote: »

@wmosscrop,
Would you mind posting those differences on the P2 Tricks and Traps thread please.
https://forums.parallax.com/discussion/169069/p2-tricks-traps-differences-between-p1-reference-material-only

Nice job. Brings back Fortran memories and the WATFOR compiler, not that I did much Fortran.

Will do, once I get a clean example and incorporate the tjv instruction that @TonyB_ pointed out.

wmosscrop · 2020-05-06 13:27

David Betz wrote: »

This sounds really cool. Does this mean you have some sort of interactive OS running on your 1130 emulator? How can I get your emulator and whatever software runs on it?

It's not interactive (yet, see below).
The P1 emulator uses files on an SD card for punched card input and line printer output. I don't have that working on the P2 yet: the input is a hack where I read the cards from LUT RAM, and the output just goes to serial via pin 62.
However, the 1130 did have a console printer/keyboard that could be used for job entry. I also support that in the P1 emulator but have yet to get it moved over to the P2.
As far as the emulator, it is a mess. I have had, and still have, issues with timing of simulated I/O operations. The P1 emulator had this worked out; with the new instructions and shorter execution paths the timing balances between cogs have to be reevaluated. Basically the I/O cogs have to maintain the device status information (particularly "device busy") long enough to allow the cpu cog to access and handle it (sometimes multiple times, sometimes no times) but not do so for so long that the emulation is delayed.
Once I get it cleaned up and working with all devices again I would be glad to post it for all to see.

Edit: solved a major timing issue. I was releasing a lock too soon, causing the overwriting of buffered data. Still need to really clean up the code.

wmosscrop · 2020-05-06 16:45

AJL wrote: »

Any chance you could replicate the byte you want to send into all four bytes of a long, apply MERGEB and transmit 32 bits at 4800 baud?
For receive, configure to receive 32 bits at 4800 baud and use SPLITB to turn that into four copies of a 1200 baud byte?

Just thinking out loud, as I'm not sure what effect that would have on the start and stop bits. You might need to fall back onto bit-bash, but you might be able to use an ISR to manage that in the background.

The issue is with older devices that don't support baud rates higher than 1200 baud (or more than 8 bits per transmitted character). The fallback is, as you say, to do bit-bash.

Cluso99 · 2020-05-06 20:24

The mini I cut my teeth on didn't even use UARTs.
The I/O was on shielded twisted pair and would run to 1km or more at 56 bps. IIRC a character was transmitted as a start bit, 7-bit ASCII, 2 parity bits, and a stop bit. An ack of 2 bits would be returned for success or failure, and the process continued or repeated. Up to ten devices could be on one cable (typewriter terminal, video, printer, card reader or punch, etc.).

The logic for this was on 3 x 12"x12" boards plus 1 x 12"x12" dedicated PCB for the device. This was 1970 onwards. I replaced this interface for connecting Centronics printers in the late 70's with a 6802 micro, and later (~1982) with a tiny 2x4" PCB with a pair of 68705P3S, 2x 14-pin gates, 2 transistors and the isolation coil. The P1 with 2 cogs could have replaced the 68705s.

AJL · 2020-05-07 01:45

wmosscrop wrote: »

AJL wrote: »

Any chance you could replicate the byte you want to send into all four bytes of a long, apply MERGEB and transmit 32 bits at 4800 baud?
For receive, configure to receive 32 bits at 4800 baud and use SPLITB to turn that into four copies of a 1200 baud byte?

Just thinking out loud, as I'm not sure what effect that would have on the start and stop bits. You might need to fall back onto bit-bash, but you might be able to use an ISR to manage that in the background.

The issue is with older devices that don't support baud rates higher than 1200 baud (or more than 8 bits per transmitted character). The fallback is, as you say, to do bit-bash.

Perhaps I wasn't clear in my explanation. 4800 baud is achievable on the P2 all the way up past 300MHz.
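
For reference, a rough Python sketch of where that limit comes from. It assumes the smart pin's asynchronous bit period is a whole-clock count held in the 16-bit field X[31:16], with the bit count minus one in X[4:0]; the helper names are made up for illustration and the fractional-clock bits are ignored.

```python
BIT_PERIOD_BITS = 16  # assumed width of the whole-clock bit period field, X[31:16]

def clocks_per_bit(sysclk_hz, baud):
    """Whole system clocks in one bit time."""
    return sysclk_hz // baud

def fits(sysclk_hz, baud):
    """True if the bit period fits in the assumed 16-bit field."""
    return clocks_per_bit(sysclk_hz, baud) < (1 << BIT_PERIOD_BITS)

def max_sysclk(baud):
    """Approximate highest system clock that still fits the field."""
    return ((1 << BIT_PERIOD_BITS) - 1) * baud

def x_value(sysclk_hz, baud, bits):
    """Value for WXPIN: bit period in X[31:16], (bits - 1) in X[4:0]."""
    if not fits(sysclk_hz, baud):
        raise ValueError("bit period does not fit in 16 bits")
    return (clocks_per_bit(sysclk_hz, baud) << 16) | (bits - 1)

if __name__ == "__main__":
    for baud in (1200, 3600, 4800):
        print(baud, fits(180_000_000, baud), max_sysclk(baud))
    print(hex(x_value(180_000_000, 4800, 32)))
```

Under that assumption, 1200 baud needs 150,000 clocks per bit at 180MHz and does not fit, while 4800 baud needs 37,500 and does.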

If the byte to send was $5A and you constructed a long with $5A5A5A5A (the byte to send replicated) and then apply a MERGEB you end up with four copies of each bit in sequence:
%0000_1111_0000_1111_1111_0000_1111_0000
Sending those 32 bits at 4800 baud should look like 8 bits sent at 1200 baud to a device that is only sampling at 1200 baud.

On receipt at the P2, sampling the 1200 baud stream at 4800 baud might give 4 bits registered per bit sent: %0000_1111_0000_1111_1111_0000_1111_0000
A SPLITB would reshuffle those into bytes: %0101_1010_0101_1010_0101_1010_0101_1010 or $5A5A5A5A
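
The two bit shuffles can be modeled in a few lines of Python to check the patterns above. This is only a sketch: the bit orderings in the docstrings are my reading of the P2 instruction listing, and the function names simply mirror the instructions.

```python
MASK32 = 0xFFFFFFFF

def mergeb(d):
    """Model of MERGEB: result = {D[31],D[23],D[15],D[7], ..., D[24],D[16],D[8],D[0]}."""
    out = 0
    for i in range(32):
        # result bit i comes from byte (i % 4), bit (i // 4) of the source
        src = (i % 4) * 8 + (i // 4)
        out |= ((d >> src) & 1) << i
    return out

def splitb(d):
    """Model of SPLITB: result = {D[31],D[27],D[23],D[19], ..., D[12],D[8],D[4],D[0]}."""
    out = 0
    for i in range(32):
        # result byte (i // 8), bit (i % 8) comes from source bit (i % 8) * 4 + (i // 8)
        src = (i % 8) * 4 + (i // 8)
        out |= ((d >> src) & 1) << i
    return out

if __name__ == "__main__":
    replicated = 0x5A5A5A5A
    stretched = mergeb(replicated)
    print(f"{stretched:032b}")          # each bit of $5A repeated four times
    print(f"{splitb(stretched):08X}")   # back to the four byte copies
    print(f"{(stretched << 2) & MASK32:032b}")  # shifted up two for the start stretch
```

With identical bytes in the long, the merge places the same bit of each byte side by side, which is what produces the runs of four; the split is its inverse.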

I don't know if this is practical, as I don't have the means to test it. The remaining problem I see for transmit is that the start bits wouldn't be stretched and probably be ignored as a glitch at the receiver.
Shifting the data down by two bits could allow a 3/4 start bit, and a final 1/2 data bit.
So the long to send (LSB first) would be %00_1111_0000_1111_1111_0000_1111_0000_00 - being (left to right) 0,1,0,1,1,0,1,0, start extension.

Constraining yourself to 7 bit bytes would give an extra 3 bits to extend the start bit to the correct length without jeopardizing the correct receipt of the MSB - you'd probably want to set bit 31 to bring the beginning of the stop bit forward to the correct time to avoid a framing error.
So the long to send (LSB first) would be %1_1111_0000_1111_1111_0000_1111_0000_000 - being (left to right) stop extension, 1,0,1,1,0,1,0, start extension.

Success in each case would depend on the de-glitch circuitry for identifying start bits and the alignment of the sampling point with respect to the beginning of the start bit.

Receiving the 1200 baud data would see the start bit spill into the data bits under 4800 baud timing. This can't be prevented, but can be managed by shifting the received long before performing the SPLITB. The MSB (received last) of an 8 bit byte is then only present in the lowest byte, but the other bytes can be used for comparison in the other bit locations to detect glitches.

So (untested code disclaimer), send might look like

asynctx_init
  WRPIN async_tx, txpin
  WXPIN ##(180_000_000/4800)*$1_0000 + 32-1, txpin '32 bits sent at 4800 baud

async_send
  MOVBYTS txbyte, #%%0000
  MERGEB txbyte
  SHL txbyte, #2
send_wait  TESTP txpin WC
if_nc JMP #send_wait
  WYPIN txbyte, txpin

and receive

asyncrx_init
  WRPIN async_rx, rxpin
  WXPIN ##(180_000_000/4800)*$1_0000 + 32-1, rxpin '32 bits received at 4800 baud

async_receive
  TESTP rxpin WC
 if_nc JMP #async_receive
  RDPIN rxbyte, rxpin
  SHR rxbyte, #3  'adjust for the "extra" start bit time
  SPLITB rxbyte 'received byte in lower byte, upper byte copies available for error detection of all but MSB
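
The receive path can be checked with an idealised Python model. It assumes perfectly aligned sampling, that the smart pin consumes the first low sample as its own start bit, and that the next 32 samples arrive with the earliest sample in bit 0; the helper names are hypothetical and none of this has been run on hardware.

```python
def splitb(d):
    """Model of SPLITB: result byte j, bit k comes from source bit k * 4 + j."""
    out = 0
    for i in range(32):
        src = (i % 8) * 4 + (i // 8)
        out |= ((d >> src) & 1) << i
    return out

def sample_frame_4x(byte):
    """Idealised 8N1 frame at 1200 baud sampled at 4800: 4 samples per bit, LSB first."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return [b for b in bits for _ in range(4)]

def receive(byte):
    samples = sample_frame_4x(byte)
    # First sample is taken as the hardware start bit; the next 32 become data.
    word = 0
    for i, s in enumerate(samples[1:33]):
        word |= s << i
    word >>= 3                   # drop the three leftover start-bit samples
    return splitb(word) & 0xFF   # complete byte ends up in the low byte

if __name__ == "__main__":
    print(hex(receive(0x5A)))
```

In this model only the lowest byte carries all eight bits, because just one sample of the MSB fits into the 32-bit word.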

wmosscrop · 2020-05-07 01:54

Just found that if you are doing character set translation, like 8-bit ebcdic to 8-bit ascii, and your table is in cog memory, you can translate each character in only 2 instructions:

altgb      ebcdic_char, #table
getbyte    ascii_char
...
table
  long $<char 3 byte>_<char 2 byte>_<char 1 byte>_<char 0 byte> ' translated values, for example $33_32_31_30
  long $<char 7 byte>_<char 6 byte>_<char 5 byte>_<char 4 byte>
...
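
For anyone who wants to see the indexing spelled out, here is a Python model of the packed-table lookup: the upper bits of the character select the long and the low two bits select the byte within it. The table contents come from Python's cp037 codec purely as a stand-in (an assumption; the 1130's character set may differ), and the function names are made up for illustration.

```python
def pack_table(values):
    """Pack byte values four per 32-bit long, entry 0 in the low byte."""
    longs = []
    for i in range(0, len(values), 4):
        chunk = values[i:i + 4]
        longs.append(sum(v << (8 * n) for n, v in enumerate(chunk)))
    return longs

def lookup(table, index):
    """Model of ALTGB index,#table + GETBYTE: long index >> 2, byte index & 3."""
    return (table[index >> 2] >> (8 * (index & 3))) & 0xFF

# Stand-in EBCDIC-to-ASCII table; characters with no ASCII equivalent become '?'.
EBCDIC_TO_ASCII = bytes(range(256)).decode("cp037").encode("ascii", "replace")
TABLE = pack_table(list(EBCDIC_TO_ASCII))

if __name__ == "__main__":
    message = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # "HELLO" in cp037
    print(bytes(lookup(TABLE, c) for c in message).decode("ascii"))
```

Packing four translated characters per long is what lets the 256-entry table fit in 64 cog registers.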

wmosscrop · 2020-05-07 02:01

AJL wrote: »

Perhaps I wasn't clear in my explanation. 4800 baud is achievable on the P2 all the way up past 300MHz.

Ah, now it makes sense as to what you were doing with transmitting 32 bit characters.

It might work, but as you point out the start/stop bits might be an issue.... I was thinking maybe they could be bit-banged on transmission, not sure about receive.

Yanomani · 2020-05-08 07:31

In an attempt to illustrate the concepts depicted by AJL, I've crafted a composite timing diagram, showing how it could look.

The top part of the diagram shows an original 1200 bps, 8 data bits ($55, plus start and stop bits, of course) being received at 4800 bps, generating 32 bits of information.

It would require some post-processing to peel off the residual start-bit samples that were received as data, and also to replicate the last received data bit, in order to produce a meaningful 4x (nibble-wise) image of the original 8-bit data. Simple (and fast) shifts and bit-replication ops could deal with those needs, enabling the recovery of the raw byte.

The bottom part illustrates the transmission of the same 8 bits, now making use of a 3600 bps timeframe and a total of 28 bits of information, showing how the start, data and stop bits would need to be extended/replicated to get the intended effect at the P2 output pin. The replication of each original data bit into a triplet is not that hard to achieve; the same concept could be used to deal with the extra bits that "extend" the duration of the start and stop bits.

The net result of this approach is getting rid of the cog-consuming bit-bashing operations, leaving to the smart pins the burden of dealing with almost all the bother imposed by such low transfer rates.

Hope it could help a bit.

Henrique

AJL · 2020-05-08 09:44

Yanomani wrote: »

In an attempt to illustrate the concepts depicted by AJL, I've crafted a composite timing diagram, showing how it could look.

The top part of the diagram shows an original 1200 bps, 8 data bits ($55, plus start and stop bits, of course) being received at 4800 bps, generating 32 bits of information.

It would require some post-processing to peel off the residual start-bit samples that were received as data, and also to replicate the last received data bit, in order to produce a meaningful 4x (nibble-wise) image of the original 8-bit data. Simple (and fast) shifts and bit-replication ops could deal with those needs, enabling the recovery of the raw byte.

The bottom part illustrates the transmission of the same 8 bits, now making use of a 3600 bps timeframe and a total of 28 bits of information, showing how the start, data and stop bits would need to be extended/replicated to get the intended effect at the P2 output pin. The replication of each original data bit into a triplet is not that hard to achieve; the same concept could be used to deal with the extra bits that "extend" the duration of the start and stop bits.

The net result of this approach is getting rid of the cog-consuming bit-bashing operations, leaving to the smart pins the burden of dealing with almost all the bother imposed by such low transfer rates.

Hope it could help a bit.

Henrique

I've actually spent a bit of time looking at the 3600 baud transmission, in order to get the timing cleaner, and worked out how to do 8 bits plus parity and 2 stop bits. It looks like all of that could be achieved in approximately ten code longs, one input/result register, a mask register, and a temp register, taking around 66 clocks if everything is kept in cog ram.

Edit: 3600 baud limits P2 sysclock to around 235MHz, but as the OP was talking about 180MHz sysclock it shouldn't be a problem.

This could be made more flexible with an init routine to patch for the desired parameters. Calculation would be required for the parity mask (BMASK #bits-1), where to place the parity bit (if used; bits *3), where to place the extra stop bit (if used; with parity: bits * 3+3), the final shift value (30 - ((bits + parity)*3)), and how many bits the smartpin needs to send ((bits + parity + extra_stop) * 3 + 2 - 1).
e.g. for 5M2
the current code would require the mask to be 5 bits, BITC to be patched to BITH and the S field to #2<<5 | 15, the REP to be patched to 5 loops, the final shift patched to 15 to get everything lined up, and WXPIN ## (CLKFREQ/3600)*$1_000 + ((5 + 1 + 1)*3 + 2) - 1.

Untested code:

 ' enter with byte to send LSB first in txbyte

 ' input long format
 ' 31..8        7..0
 ' ignored    byte to send LSB first

 ' output format
 ' 31..29  28..26  25..23  22..20  19..17  16..14  13..11  10..8  7..5  4..2  1..0
 '  111      ppp                                                               00
 '  stop   parity    MSB    Bit6    Bit5    Bit4    Bit3    Bit2  Bit1  Bit0  Stretch start bit 

  ALTR temp
  AND txbyte, mask WC ' C now holds parity
  BITC txbyte, #2<<5 | 24 ' set parity in bits 24 to 26; patch instruction to BITH for mark parity, BITL for space parity, or NOP and adjust bits to send if no parity
   REP @.end, #8
  SHR temp, #1 WC  ' bit to expand to a triplet in C
  MUXC txbyte, #7  ' write triplet into register
  ROR txbyte, #3 ' move to next bit
.end
  SHR txbyte, #3 ' shift past parity; patch with NOP if no parity
  ADD txbyte, #7 ' write the extra stop bit; adjust bits to send if not desired
  SHR txbyte, #3 ' move everything into place

Something similar could be done for reception at 3600 baud for up to 8P2 using as many as 32 bits (0..1 for tail of start bit, 2..17-26 for data at 3 per data bit, next 3 for parity if used, next 3 for extra stop bit if used)
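
As a cross-check of the layout rather than a translation of the PASM, here is a Python sketch that builds the 32-bit long from the output format table above and decodes it again. Even parity and the majority vote on each triplet are my assumptions, and the function names are hypothetical.

```python
def parity(byte):
    """1 if the byte has an odd number of set bits (even-parity bit)."""
    return bin(byte & 0xFF).count("1") & 1

def encode_3x(byte):
    """Build the long: bits 1..0 start stretch, 3 bits per data bit, parity, stop."""
    word = 0                           # bits 1..0 stay 0 to stretch the start bit
    for k in range(8):
        if (byte >> k) & 1:
            word |= 0b111 << (2 + 3 * k)   # data bit k occupies bits 2+3k .. 4+3k
    if parity(byte):
        word |= 0b111 << 26            # parity triplet in bits 28..26
    word |= 0b111 << 29                # stop triplet in bits 31..29
    return word

def decode_3x(word):
    """Recover the byte by majority vote on each triplet; also check parity."""
    def vote(pos):
        return 1 if bin((word >> pos) & 0b111).count("1") >= 2 else 0
    byte = sum(vote(2 + 3 * k) << k for k in range(8))
    return byte, vote(26) == parity(byte)

if __name__ == "__main__":
    word = encode_3x(0x5A)
    print(f"{word:032b}")
    print(decode_3x(word))
    print(decode_3x(word ^ (1 << 3)))  # one corrupted sample is voted away
```

Three samples per bit leave room for a simple vote, so a single bad sample inside a triplet does not change the decoded byte.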

wmosscrop · 2020-05-09 23:20

Thanks, @AJL and @Yanomani.

Thanks for working this out... I've been spending the last few days diagnosing, and finally replacing, my wife's sick computer. I'll take a deeper look at this when I come up for air.

Walter

wmosscrop · 2020-07-11 23:01

Two months later...
My porting of the code from my P1 1130 emulator to the P2 ran into several issues with what appeared to be timing.
On the P2, which is so much faster, it wouldn't work consistently... or at all.
I finally figured out that my P1 code was incorrect... even though it would run every sample program I tested.
I found that I was clearing the interrupt status information at the wrong time (when the interrupt was handled) instead of when the device was "sensed" with the "reset status" bit set. The difference in timing, coupled with much faster disk access (smart pins), threw everything into confusion.
This might not have happened with other hardware but the 1130 developers sometimes relied on an interrupt not occurring before a certain number of instructions had been executed.
So in the end it was timing... just not what I thought.
Lesson learned: code that you think is fully debugged... isn't.

JonnyMac · 2020-07-11 23:07

I finally figured out that my P1 code was incorrect... even though it would run every sample program I tested.

Today I thought I found a bug in Spin2 and after reporting it to Chip, found the same issue -- which is not a bug -- in the P1. Apparently, the feature of my DMX transmitter that I was testing in the P2 was not properly tested by me in the P1, and never used by anyone who uses the object.

Lesson learned: code that you think is fully debugged... isn't.

That can happen.

RossH · 2020-07-12 00:31

wmosscrop wrote: »

Lesson learned: code that you think is fully debugged... isn't.

It's a well-known fact that software deteriorates over time. Many is the time I have gone back to look at code that was working perfectly when I left it, only to find new bugs have crept in while I wasn't using it.

jmg · 2020-07-12 00:55

wmosscrop wrote: »

AJL wrote: »

Perhaps I wasn't clear in my explanation. 4800 baud is achievable on the P2 all the way up past 300MHz.

Ah, now it makes sense as to what you were doing with transmitting 32 bit characters.
It might work, but as you point out the start/stop bits might be an issue.... I was thinking maybe they could be bit-banged on transmission, not sure about receive.

A partly software variant on this would be to pack 3 bits per baud-bit, so that you can fit 10 bits into a 32 bit frame.
The first start bit would send/rx as 2 bits, as the HW manages one non-data start bit.

For 1200 baud, I make the divider 50,000 with tripled bits, from 180M.
You would need more precise baud definitions than usual, as you are receiving 3x the length, but I guess this ancient stuff all works from a 1.8432 MHz xtal somewhere?
None of this new-fangled calibrated RC oscillator stuff.

Building/extracting the payloads is a bit-bang variant, but it does not need long delays for each bit, as the HW eventually manages that.

Cluso99 · 2020-07-12 01:56

P1 working code compiled with PropTool then compiled with bst or homespun reveals missing # on jmp/calls.
P2 working code compiled with pnut then compiled with fastspin reveals missing wcz on compare.

pilot0315 · 2020-07-20 17:00

The 1130 was one of the first computers I used with Fortran.
I am highly interested in what you are doing.

wmosscrop · 2020-07-20 17:04

I finally (!) have xbyte emulation working, but not optimized (instruction skipping, etc.)

Still need to get the sdcard access working.
