[Solved] Can two Propellers communicate any faster via one pin than 20 Mbps?
jac_goudsmit
Posts: 418
Hi All,
I'm working on a project with two Propellers that need to communicate as fast as possible, via a single I/O pin. One Propeller is connected to the bus of a 6502 and another Propeller is going to be used to generate video.
I've seen the High Speed Propeller-to-Propeller modules by Beau Schwabe. They use unrolled loops to transmit bits at a speed of 1 bit per 2 instructions (8 clocks) because the transmitter needs two instructions per bit: one to rotate the data and one to MUX the bit onto a pin.
I've also seen (and written) code that transmits serial data even faster (one bit per 4 clocks) by setting up a timer in NCO mode, setting the FRQx register to zero (so it doesn't count) and shifting the data through PHSx. In NCO mode, PHSx[31] is directly connected to an output pin so a single rotate instruction will shift to the next bit and put that bit on an output pin.
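To make the NCO trick concrete, here is a rough Python model of it (not PASM, and not cycle-accurate). It assumes FRQx is zero so the counter never changes PHSx on its own, treats bit 31 of PHSx as the output pin, and treats each loop iteration as one 4-clock rotate instruction. The function name `nco_transmit` is made up for this sketch.

```python
MASK32 = 0xFFFF_FFFF

def nco_transmit(word, nbits=32):
    """Model of shifting data out through PHSx in NCO mode with FRQx = 0.

    The pin is assumed to follow PHSx[31]; each iteration stands for one
    rotate-left-by-1 instruction (4 clocks), so one bit leaves per instruction.
    """
    phs = word & MASK32
    pin_levels = []
    for _ in range(nbits):
        pin_levels.append(phs >> 31)                 # pin shows bit 31
        phs = ((phs << 1) | (phs >> 31)) & MASK32    # rotate left by 1
    return pin_levels

if __name__ == "__main__":
    print(nco_transmit(0xA000_0000, 4))
```

The bits come out most significant bit first, which is why the word to send has to be left-aligned in the register before the rotates start.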
Now my questions:
1. As far as I know, to extract serial data from an input pin and make it parallel, you can only use e.g. a TEST mask, INA WC instruction followed by an RCL instruction. That's fine for the "1 bit per 2 instructions" transmitter but for the faster transmitter I'll have to run two cogs that each catch alternating bits from the input. Or does anyone know a faster way to deserialize bits with just one single instruction?
2. Second question: I think it's possible to shift data out even faster with video registers and WAITVID. Unfortunately I'm a little rusty on the video registers. Does anyone have any suggestions for code where this has already been done?
Thanks in advance!
===Jac
Comments
Yes video generator can output much higher rates, faster than a cog can transfer data from the hub
in fact. Reading is going to be the bottleneck unless you use a hardware shift reg to do serial to
parallel conversion.
BTW can you overclock? I’ve been using 104MHz for years.
For the initial setup, the clocks won't be synchronized. I basically already have the setup with the Propeller linked to the 6502 and I'm implementing the video part on a breadboard or on a USB Proto board, and it's impractical to let both Props work from the same clock source. I may use a common oscillator at a later stage if this is successful.
Nevertheless, I don't think it will be a big deal in this case if the Propellers each have their own clock. I'll probably use Beau's code that generates one or two start bits that are 1.5 times the duration of the normal bit time. On the receiving side, I can synchronize to that with a WAITPxx instruction so that the following instructions will sample the pin halfway between transitions. Given a bit time of 4 clocks, the total time for the entire word of 16 bits is 64 clocks. At the end of the last bit, the timing difference between transmitter and receiver needs to be less than 2 clocks in either direction. That's 2/64 clocks, which is about 3% in either direction. So the crystals of the two Propellers need to be within about 1.5% of 5MHz (for the worst case that one Prop is 1.5% too slow and the other is 1.5% too fast, adding up to 3%). I'm pretty sure the crystals I use have a frequency drift of less than 50ppm (0.005%) so I'm not too worried. The receiver code will compensate for clock drift with that WAITPxx instruction that synchronizes with the start bit. I know the harmonics of two almost-identical crystal oscillators can cause problems with the FCC but I'm not planning on asking them; this is a private project for now.
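The tolerance estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it takes 16 bits at 4 clocks each (64 clocks per word), assumes the receiver samples mid-bit so it can tolerate 2 clocks of slip by the last bit, and assumes the worst case is both crystals being off in opposite directions. The function names are invented for the example.

```python
def max_clock_mismatch(bits, clocks_per_bit=4, margin_clocks=2):
    """Largest relative frequency difference between the two Propellers
    that still keeps the last sample inside its bit cell."""
    total_clocks = bits * clocks_per_bit
    return margin_clocks / total_clocks

def link_is_safe(bits, crystal_ppm, clocks_per_bit=4, margin_clocks=2):
    """Worst case: one crystal is crystal_ppm fast, the other crystal_ppm slow."""
    worst_case_mismatch = 2 * crystal_ppm * 1e-6
    return worst_case_mismatch < max_clock_mismatch(bits, clocks_per_bit, margin_clocks)

if __name__ == "__main__":
    for bits in (16, 32):
        allowed = max_clock_mismatch(bits)
        print(f"{bits} bits: allowed mismatch {allowed:.2%}, "
              f"safe with 50ppm crystals: {link_is_safe(bits, 50)}")
```

With 50ppm crystals the worst-case mismatch is 0.01%, far below the allowed figure for a 16 bit or 32 bit word, provided the receiver resynchronizes on every start bit.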
The data I want to transfer comes directly from INA, not from the hub. I basically sample the address bus from the 6502 and serialize it to the second Propeller; the 6502 runs at up to 1MHz so I have about 20 Propeller instructions per 6502 clock cycle to serialize the address bus, transmit and receive it, and do a hub read or write on the secondary Prop. This is very, very tight but I think it can be done, though I will certainly need 2, maybe 4 cogs on the secondary Prop.
Converting the serial data back to parallel with external hardware is kind of the opposite of what I want. I already have the parallel bits, I'm trying to reduce the number of pins so that the secondary Prop has enough pins left over to generate 8 bit or 16 bit video.
This would not be practical; if I end up making this into a product, it will have to work with existing hardware "in the field" that has a 5MHz crystal soldered onto the board. But I do have an alternative: I can temporarily slow down or stop the 6502 if I can't get the work done on time. For now, it appears that that will not be necessary.
How many pins do you need? You could use 2 or 4 bits as a Dual or Quad SPI link?
Compatibility with any standard is irrelevant. Speed is. The transmitter basically just needs to get 16 (adjacent) pins from INA and send them to the other Propeller as fast as possible (preferably within 500ns / 40 clocks) and the receiving Propeller has to have enough time left within the remaining time of 1 microsecond total, to do a RDBYTE or WRBYTE and a WAITPxx to wait for the next 6502 clock cycle. Throwing more cogs at the problem on the receiver side will make it necessary to store part of the data in the hub with e.g. WRWORD, and reconcile the data from the two cogs using a RDWORD.
I have up to 3 pins available but I would like to do this with a single pin if possible. The speed gain of using multiple pins has to be enough to compensate for the extra time needed for the hub access instructions. It's likely that 2 cogs is the maximum because if one cog has to get the data from the other cogs, the hub instructions just get prohibitively expensive.
===Jac
BRILLIANT!
I think I understand what you mean, and that's EXACTLY what I was looking for. I had a feeling it would be possible to do a trick with a timer but I didn't think of using two timers (and a pin), and using A AND B to make sure that each bit is only counted once, and to make sure that the timer doesn't destroy existing data by adding 1 to PHSx while it has its lsb set to 1.
Let me see if I can explain the details to make sure I got it right, and to allow readers to follow along:
- Set timer A to duty mode (%00110), with FRQA set to $4000_0000, so that the timer A output pin goes HIGH once every 4 clock cycles and stays high for one clock cycle.
- Set timer B to counter mode "A AND B" (%11000), with the timer A output pin as input pin A, and the serial data receive pin as pin B. Set FRQB to 1.
- Wait for the start bit with a WAITPxx instruction
- Set PHSA to a value ($0000_0000, $4000_0000, $8000_0000, or $C000_0000) such that it goes HIGH roughly in the middle of the following incoming data pulses. Set PHSB to 0 immediately afterwards. Which value to choose depends on the length of the start bit; it can be determined exactly by reasoning, or experimentally with a logic analyzer. I'll figure it out.
- Execute "n" consecutive SHL PHSB, #1 instructions (ROR PHSB, #1 should also work). Then immediately copy PHSB to another location. It now contains "n" bits of serial data.
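The steps above can be tried out in a simplified Python model before writing any PASM. This is a sketch under loud assumptions: it is not cycle-accurate, it ignores pin sampling delays, it assumes the SHL takes effect on the first clock of each bit period, and it assumes the incoming bit is stable for the whole period. Counter A is modelled as duty mode (the strobe is the carry out of PHSA + FRQA) and counter B adds FRQB = 1 whenever the strobe and the data pin are both high. The names `to_bits` and `receive` are made up for the example.

```python
MASK32 = 0xFFFF_FFFF

def to_bits(value, nbits):
    """Bits of value, most significant first."""
    return [(value >> k) & 1 for k in range(nbits - 1, -1, -1)]

def receive(bits, phsa_start=0x8000_0000, frqa=0x4000_0000, clocks_per_bit=4):
    """Simplified model of the two-counter receiver.

    Counter A (duty mode): strobe = carry of PHSA + FRQA, once per 4 clocks.
    Counter B ("A AND B" mode, FRQB = 1): PHSB += 1 when strobe and data are high.
    """
    phsa = phsa_start
    phsb = 0
    for bit in bits:
        phsb = (phsb << 1) & MASK32          # SHL PHSB, #1 (one per bit period)
        for _ in range(clocks_per_bit):
            total = phsa + frqa
            strobe = total > MASK32          # carry out drives the strobe pin
            phsa = total & MASK32
            if strobe and bit:
                phsb = (phsb + 1) & MASK32   # sets the freshly cleared lsb
    return phsb

if __name__ == "__main__":
    print(hex(receive(to_bits(0xA55A, 16))))
```

In this model every one of the four PHSA start values gives exactly one strobe per bit period, so the choice only moves the sampling point within the bit. On real hardware that choice is what keeps the strobe away from the clock on which SHL writes PHSB.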
I think you meant to say PHSx here.
Even if the two Propellers run on two separate crystals, the difference in frequency should not be too much to stay in sync if you synchronize the receiver with the transmitter at the beginning of each transmitted (long)word. The crystals are probably 50ppm so the maximum drift per clock cycle is 100ppm in the worst case. When transferring 32 bits, this inaccuracy adds up to a maximum of 3200ppm = 0.32%. But since the bits are 4 clocks long, the REQUIRED accuracy is half of 4/32 (assuming the receiver measures in the middle of each incoming bit), so 2/32. That's 6.25% so the 0.32% worst case inaccuracy is very safely within the limits.
I want the video Propeller to be memory-mapped. It will be connected to the databus directly (serializing and deserializing the data bus would take too much time especially because it's bidirectional). Serializing the address bus will save enough pins on the second Prop to generate 8 bit video or 16 bit video or two screens using 8 bits each or one screen VGA and two screens CVBS. Maybe I'll add a third Propeller for multi-layer or something.
Thanks!!!
===Jac
A loading capacitor on the Duty pin, to delay the rise and the fall, should ensure enough setup and hold time (tsu, th) to count reliably?
The phase of the Duty pulse to shift opcode will need care, as they cannot both try to use the same sysclk. If shift is a read-modify-write opcode, it may take more than one sysclk, which narrows the choice even more.
To boost speeds you could look to send multiple words this way, if you need to. The tight Xtal tolerances mean you could sync once and then send e.g. 128 bits as four 32-bit clumps.
If you can use the same xtal signal (buffered) the relative xtal errors are no longer an issue.
That's not the issue. I'm trying to get 16 bits of data (from the INA register) from one Propeller to another Propeller, preferably within 500ns or so (looks like it will take a little longer), once every microsecond. Eventually the Propellers might use the same clock generator but I've already calculated and demonstrated that if I synchronize the receiver at the beginning of a 16 bit (or 32 bit) message, clock frequency differences that can be expected from two crystal-clocked Propellers are not going to be a problem.
The Shift operation is done by the ALU. And I suspect that the PHSx registers are implemented as latchable counters. As long as the timer is configured in such a way that the clock cycle from the timer doesn't happen at the same time as the write cycle of the SHL instruction, the entire process is predictable and will work.
There are only 16 bits to send per 6502 clock cycle (1MHz as I mentioned). The option of sending longer streams of bits to increase bandwidth is irrelevant because that's the only data I need. More data isn't available until the next 1MHz clock cycle when the 6502 puts a different address on the bus.
The algorithm above will allow me to send 16 bits in 4*16=64 Propeller clock cycles. The WAITPxx for synchronization takes 6 cycles minimum, the JMP at the end of the loop takes 4 cycles. Out of the 80 cycles available per 6502 clock cycle, that leaves about 6 cycles to do other things. Not enough for a RDBYTE or WRBYTE but I may be able to "cheat" there by finishing the work in the next 6502 clock cycle. I don't want to get into that here and now.
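The cycle budget is easy to tabulate. The sketch below simply adds up the figures quoted in this post and assumes an 80MHz system clock (5MHz crystal with the 16x PLL) against a 1MHz 6502, which gives 80 Propeller clocks per 6502 cycle. By this count the slack comes to 6 clocks. The constant and function names are made up for the example.

```python
SYSCLK_HZ = 80_000_000   # assumed: 5MHz crystal with 16x PLL
CPU_HZ = 1_000_000       # 6502 clock

def slack_clocks(bits=16, clocks_per_bit=4, waitp_clocks=6, jmp_clocks=4):
    """Propeller clocks left over in one 6502 cycle after receiving a word."""
    budget = SYSCLK_HZ // CPU_HZ                      # 80 clocks per 6502 cycle
    used = bits * clocks_per_bit + waitp_clocks + jmp_clocks
    return budget - used

if __name__ == "__main__":
    print("slack at 4 clocks per bit:", slack_clocks())
    print("slack at 2 clocks per bit:", slack_clocks(clocks_per_bit=2))
```

Halving the bit time, as in the WAITVID idea, frees 32 more clocks in this tally, which is where room for a hub access would have to come from.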
If 64 cycles turns out to be too much, I might use WAITVID in the transmitter to send the data at double the speed (1 bit per 2 clocks, 40 megabits per second!), and use two cogs on the receiver that are 2 cycles out of phase, to read the incoming data. That might be difficult. But this information is really helping me think this through a bit further.
===Jac
Did you copy and paste this from a working module? It makes no sense to me that that code shifts FRQA, not PHSA. Or am I missing something?
===Jac
Remember, every four cycles, the contents of FRQA are added to PHSA if the bit being received is on. Moving a single bit to a different position every four cycles has an effect equivalent to OR-ing the received bit into PHSA. Repeat for up to 32 bits, stop the clock and you have the received bits in PHSA, in any order you like. (Note: I think PHSA cannot be read when used as a destination, therefore the mov into serbuffer.)
Ah, I see what you're saying. By shifting the bits in FRQA, you manipulate the bits in PHSA indirectly. And the weird shifting pattern (7 right-shifts-by-1 followed by one shift-left-by-15) compensates for the bytes coming in in the wrong order.
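A small Python model shows why moving a single bit around in FRQA is equivalent to OR-ing the received bits into PHSA. It is a sketch only: it assumes one add opportunity per bit (the counter adds FRQA to PHSA when the sampled bit is 1), and it assumes FRQA starts with bit 7 set, which is a guess made for this example and is not taken from the code being discussed. The function names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF

def msb_first(byte):
    """The 8 bits of a byte, most significant first."""
    return [(byte >> k) & 1 for k in range(7, -1, -1)]

def receive_by_moving_frqa(bits, start_bit=7):
    """Accumulate bits in PHSA by relocating a single 1 bit in FRQA.

    Shift pattern from the thread: seven right shifts by 1, then one left
    shift by 15, so the second byte lands above the first one.
    """
    phsa = 0
    frqa = 1 << start_bit                    # assumed starting position
    for i, bit in enumerate(bits):
        if bit:
            phsa = (phsa + frqa) & MASK32    # counter adds FRQA when the pin is high
        if i % 8 == 7:
            frqa = (frqa << 15) & MASK32     # jump up to the next byte
        else:
            frqa >>= 1                       # walk down within the byte
    return phsa

if __name__ == "__main__":
    stream = msb_first(0x12) + msb_first(0x34)
    print(hex(receive_by_moving_frqa(stream)))
```

Because each bit position is visited only once, the addition can never carry into a neighbouring bit, so it behaves like an OR. In this model the first byte on the wire ends up in bits 7..0 and the second in bits 15..8.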
The way I had in mind was to just set FRQx to 1 and leave it, and use shift instructions to shift PHSx. Every time a 1-bit comes in and the clock output from the other clock is high (once every 4 cycles), the timer adds 1 to the value, and because I would shift PHSx left just before that, the increment-by-one operation of the timer is the same as setting the lsb of the PHSx value.
Either method will work, I'm pretty sure (and using rotate-instructions instead of shift-instructions would allow my method to store the incoming bits out of order too).
I think I already have code that uses ROR instructions to modify PHSx for outgoing serial data, so I don't think that's an issue.
===Jac
What's nifty about the above shr frqa approach, is it avoids having to worry about two operations on one register, so makes timing more tolerant.
Could you tell us the big picture? I just got your L-Star kit (though I'm still waiting on the 65C02 and memory), and the Apple 1 video was pretty weak, so I was wondering about putting better video on it. How are you looking to implement it?
For higher end video, you could implement a 3 prop system with a color in each prop. I think baggers was working on this at one point.
Also, the 65C02 runs at 1MHz, but the longer instructions, most of them, actually take 4 cycles, so you should have more than 80 cycles on the propeller side.
Excellent point! I will keep that in mind.
===Jac
David Murray, "The 8 Bit Guy", got an L-Star from me last year and he said he was thinking of building his "dream computer". It looks like that is going to be his next "big project" now he finished his Planet X for DOS game.
I get a lot of questions from people who would like to see color video and I've been thinking for a while about how to implement this. And I can only agree that it would be a great thing to have since the 1-pin video has many limitations.
The Dream Computer (currently known as the Commander 16 but that might change) is now being developed by a team that David has gathered. They created a Facebook group to discuss ideas. Unfortunately I (and others) failed to convince him/them that a Propeller-based video system would be a good idea, but that doesn't stop me from going ahead and making something for my own project.
It appears they have set their minds to the Gameduino system for video, and though I looked at the website and the Verilog code and I was pretty impressed, I think it's a big NOPE. First of all, the entire system is accessed through a SPI port which has to be bit-banged by the 65xx. That's easy to change of course but there are numerous other limitations such as a fixed resolution of 400x300 (800x600 with double size pixels). The system has sprites but it can't even generate an 80x25 text screen.
I think for such an ambitious project I would like to have a video generator that has some serious capabilities: multiple resolutions, scrolling in all directions, non-fixed fonts, tile based video as well as bitmap video, sprites... I would even consider a mode where it can generate teletext characters like "BBC mode 7". And if possible it would be great to do tricks like changing modes in the middle of the frame like the Amiga could do. (I wish I knew more details about the Amiga and BBC).
For bitmap graphics, 32K is not much, and for things such as sprites and multi-layered bitmaps it would be a good thing to let multiple Propellers work together. But I'm still working out how to do that.
For starters, I'll probably just put the standard 6 bit VGA circuit on a breadboard, or I'll use one of the old Propeller proto boards that had the Propeller dead center and a VGA connector combined with PS/2 ports. I think I still have a few of those from when they were on sale before they got discontinued. Then as a proof of concept I'll run the 640x480 VGA driver and allow the 6502 direct access to the video RAM. It should be fun to modify Woz Mon so that it writes output to video memory.
Well... Yes and no.
The Propeller in my system bitbangs the 65xx bus, and doesn't really know what the 65xx is doing. Every task basically starts at the beginning of a 1MHz clock cycle and needs to be done 1 microsecond later. I can't simply decide to deliver my data too late, because the 65xx will read the data bus at the end of the clock cycle whether it's there or not. And obviously, not having data ready when it's supposed to be ready, is a failure.
In the PIA emulator for the Apple 1, this was impossible because I had to implement side effects too: when you put a video byte on the output, an input bit changes, and when you read a character from the keyboard, another input bit changes. Doing those side effects would take too long so the code does them during the next 65xx clock cycle. It's as good as impossible that the side effect bits are going to be read in the next cycle, so I could get away with that.
But this is not an option for something like video. I don't want to have to tell programmers that they have to read the same location twice to get stored data, for example. They want to treat the video memory as regular memory, and by nature, the faster you can make your video hardware, the better. It would even be unacceptable if I would disallow an INC or DEC instruction on a byte in video memory: programmers might depend on that simple feature!
I thought about using a cache-like technique to get away with it, which would be relatively easy to implement: I would make the traffic to the video Propeller(s) one way, and use my existing Memory module to use the hub memory in the primary Prop as "shadow" memory. But that is not going to scale well: one of the purposes of adding an extra Prop is to free up hub memory on the primary Prop, and with a third Prop there would not be enough hub memory in the primary Prop to shadow the hub memories of both video Props. So that's an uninteresting option (though it could be useful in some future case).
Anyway that's enough rambling for now
===Jac
I also like the 8-Bit Guy's videos, and was a bit disappointed that he reviewed the PE6502 with a Propeller in it rather than your L-Star, which I think is superior because it is simpler. I tend to think that he got a bad impression of it there because of its Apple I emulation, which had really sucky video, so I understand what you are trying to do.
I see what you mean by the 4cycle 6502 instructions not mattering.
Considering his goals for the video, I don't know if the Propeller can really compete with an FPGA: the Propeller's main limitation is memory, while the FPGA's only limitation is the level of FPGA they choose. Right now the Gameduino is pretty weak compared to a Propeller, but that could change, especially if they are able to garner enough support to get quantity discounts.
The propeller video solution would be a lot quicker to develop, and since there is already a SIDcog completed project out there, adding that might just push the project over the top. To lower chip count, the final project could have the 65816 boot the propeller over I2C or serial which could free up some pins for performance.
I personally would like to have a system similar to his that is not limited to the 6502 but is dual Propeller based instead. The ecosystem is what really matters most, though, as the Raspberry Pi has shown.
For protoboards these are going cheap at $20 https://www.ebay.com/itm/Parallax-Propeller-Project-Board-USB-32810/332853649555?epid=1304670522&hash=item4d7f9fa093:g:o9cAAOSwmLZb0h1K
And I think they are better than the proto board with the Prop smack dab in the middle. They come with the FTDI chip too, so no Prop Plug is needed.
I am excited to see what you come up with for full SERDES with a single pin, or dual pins for bi-directional, as many projects depend on having 2 Propellers connected, and bandwidth and pins are definitely an issue.
I enjoyed reading your rambling, and I am enjoying getting back into propeller programming.
Thanks,
hinv
I think we agree on all points but let's not go off-topic any further.
Interesting! I will take a look.
===Jac