Interfacing Prop to Single Board Computer (RPi, BBB)

ags · 2014-01-08 21:11

I'm starting a project that aims to use the power of a Single Board Computer (SBC - e.g. Raspberry Pi, BeagleBone Black) to free up Prop resources currently supporting functionality like a web server, SD card reader, RTC interface, etc. and push that to the SBC. The Prop would then be dedicated to performing the real-time functions it handles so well - decoding data, formatting (bit-banging) and sending as output with precise timing (down to the last clock cycle) - something that the SBC running Linux isn't capable of doing. The idea is to have the SBC source the data (from SD card, Ethernet, USB client, etc.) and send it to the Prop using its GPIOs. I would rely on the precise timing afforded by the Prop for correct final data output timing. I'm thinking of implementing some sort of flow control so the Prop would signal when buffers need refilling; the SBC would monitor and send data as needed. Seems reasonable.

At a high level, I'm stuck figuring out what this SBC/Prop interface would be. I need to be able to stream a minimum of 5Mbps to the Prop, controlled so as to neither overflow nor starve the Propeller's relatively small (<32kB) buffer. I currently have a prototype running that is all Propeller-based. It reads data from an SD card and feeds cogs that take care of the formatting/bit-banging and timing for outputs (just about 2Mbps data rate). This works today - but I need to more than double the data throughput and don't see a path there with the current Propeller-only design.
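To put the buffer constraint in numbers, here is a small Python sketch of how long a full buffer lasts at a given output rate. The 32 KB figure is only the upper bound mentioned above (a real design could not give the whole hub RAM to buffering), and the function name is made up for illustration.

```python
def buffer_seconds(buffer_bytes: int, rate_bps: float) -> float:
    """Time for a full buffer to drain at a constant output bit rate."""
    return buffer_bytes * 8 / rate_bps


# Upper bound only: the post says the usable buffer is under 32 kB.
BUFFER_BYTES = 32 * 1024

for rate_bps in (2_000_000, 5_000_000):
    ms = buffer_seconds(BUFFER_BYTES, rate_bps) * 1000
    print(f"{rate_bps / 1e6:.0f} Mbps drains {BUFFER_BYTES} bytes in {ms:.1f} ms")
```

So even in the best case the SBC has to respond to a refill request within a few tens of milliseconds at 5Mbps, and a realistic buffer would leave much less slack.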

I'm looking for suggestions on what type of interface I should be considering between the SBC and Prop. (I'm favoring the BBB as the SBC). It seems that using GPIO (SPI, UART, I2C) will be too slow on the Prop side to sustain a 5Mbps data rate. I'm also not clear on what the implications would be on the loading of the SBC. If I end up with custom drivers (not something I've done before) or LKMs, I could bog down an otherwise powerful ARM Cortex-A8 core mostly waiting on the Propeller to signal that it's ready for more data. Clearly I have a lot of learning to do (I look forward to that).

As I said, I'm looking for suggestions about an overall architecture to pursue. Thanks in advance.

Peter Jakacki · 2014-01-08 21:23

ags wrote: »

I'm starting a project that aims to use the power of a Single Board Computer (SBC - e.g. Raspberry Pi, BeagleBone Black) to free up Prop resources currently supporting functionality like a web server, SD card reader, RTC interface, etc. and push that to the SBC. The Prop would then be dedicated to performing the real-time functions it handles so well - decoding data, formatting (bit-banging) and sending as output with precise timing (down to the last clock cycle) - something that the SBC running Linux isn't capable of doing. The idea is to have the SBC source the data (from SD card, Ethernet, USB client, etc.) and send it to the Prop using its GPIOs. I would rely on the precise timing afforded by the Prop for correct final data output timing. I'm thinking of implementing some sort of flow control so the Prop would signal when buffers need refilling; the SBC would monitor and send data as needed. Seems reasonable.

At a high level, I'm stuck figuring out what this SBC/Prop interface would be. I need to be able to stream a minimum of 5Mbps to the Prop, controlled so as to neither overflow nor starve the Propeller's relatively small (<32kB) buffer. I currently have a prototype running that is all Propeller-based. It reads data from an SD card and feeds cogs that take care of the formatting/bit-banging and timing for outputs (just about 2Mbps data rate). This works today - but I need to more than double the data throughput and don't see a path there with the current Propeller-only design.

I'm looking for suggestions on what type of interface I should be considering between the SBC and Prop. (I'm favoring the BBB as the SBC). It seems that using GPIO (SPI, UART, I2C) will be too slow on the Prop side to sustain a 5Mbps data rate. I'm also not clear on what the implications would be on the loading of the SBC. If I end up with custom drivers (not something I've done before) or LKMs, I could bog down an otherwise powerful ARM Cortex-A8 core mostly waiting on the Propeller to signal that it's ready for more data. Clearly I have a lot of learning to do (I look forward to that).

As I said, I'm looking for suggestions about an overall architecture to pursue. Thanks in advance.

Rather than elaborate more on the things I've considered,

Sounds like a lot of bother to me to go to that length as I am doing all that on a single Prop now and I still have at least 5 cogs and memory to spare! What exactly are you trying to do then?

ags · 2014-01-08 21:51

I have either a stream of "live" data (from a WizNet Ethernet module) or data stored on an SD Card. I need to take that data, decompress/reformat it and then send it out using a proprietary (not mine) protocol (that is, bit-banging) with precise timing. I'm limited by the maximum possible speed of reading one bit at a time with SPI (two instructions per bit read). I was thinking that if I could use the speed of an SBC for Ethernet and/or SD Card data, and send that to the Propeller in a parallel bus (8 bits wide, maybe) then I could increase the amount of data I'm able to process on the Propeller (I would free up 4 cogs by not running the web server or Ethernet driver or SD card driver or audio codec driver) and divide the extra data-in among double the number of cogs for output.
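A rough sketch of the ceiling behind the "two instructions per bit" remark, in Python. It assumes a Propeller 1 clocked at 80 MHz with 4 clocks per instruction, and it assumes (hypothetically) that a parallel read would also cost two instructions per sample. Storing the data and any handshaking are ignored, so these are upper bounds rather than achievable rates.

```python
CLOCK_HZ = 80_000_000   # assumed system clock
CLOCKS_PER_INSTR = 4    # assumed cost of a typical cog instruction


def max_bit_rate(instrs_per_sample: int, bits_per_sample: int) -> float:
    """Upper bound on input bit rate when each sample costs a fixed instruction count."""
    samples_per_s = CLOCK_HZ / (CLOCKS_PER_INSTR * instrs_per_sample)
    return samples_per_s * bits_per_sample


serial_bps = max_bit_rate(instrs_per_sample=2, bits_per_sample=1)
parallel_bps = max_bit_rate(instrs_per_sample=2, bits_per_sample=8)

print(f"1-bit SPI read loop: {serial_bps / 1e6:.0f} Mbps peak")
print(f"8-bit parallel read: {parallel_bps / 1e6:.0f} Mbps peak")
```

The point is only the ratio: sampling eight pins at once raises the ceiling eightfold for the same instruction count, which is the motivation for the parallel bus.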

Peter Jakacki · 2014-01-08 23:03

ags wrote: »

I have either a stream of "live" data (from a WizNet Ethernet module) or data stored on an SD Card. I need to take that data, decompress/reformat it and then send it out using a proprietary (not mine) protocol (that is, bit-banging) with precise timing. I'm limited by the maximum possible speed of reading one bit at a time with SPI (two instructions per bit read). I was thinking that if I could use the speed of an SBC for Ethernet and/or SD Card data, and send that to the Propeller in a parallel bus (8 bits wide, maybe) then I could increase the amount of data I'm able to process on the Propeller (I would free up 4 cogs by not running the web server or Ethernet driver or SD card driver or audio codec driver) and divide the extra data-in among double the number of cogs for output.

Am I missing something here? Unlike a single-core processor the Prop has 8 cores so that it can have precise timing on one, Ethernet on another, protocol on another and so on. My Ethernet and SD run on the main Tachyon console cog but even with dedicated cogs there are enough to go around. You don't really need any spare; if it takes all 8 cogs to make it work then the Prop has fulfilled its requirements. If that's not enough, just add another Prop perhaps.

ags · 2014-01-09 07:35

I have all cogs running with no spare time - every "hold" period for bit-banging output is spent loading more data into cog RAM. If I have to use more than one Propeller to drive more outputs then I will need to replicate data on multiple SD cards, as well as handle synchronization between the two different chips.

I was purposefully attempting to leave the solution space as open as possible. Perhaps the question I should ask is "what is the best way to create a high-speed (>5Mbps) interface between a Propeller and a non-RTOS device?".

It seems that I can't get the speed I need (onto the Propeller) with a single-bit-wide data line. I thought of using an 8-bit-wide data bus, but then I'm not clear if flow control (on the master side) would either have too much latency (starving the Prop buffers) or drag down the single-core master. I also thought about using a standard interface (USB, perhaps) but as that eventually is a single-bit interface to the Propeller that doesn't help. I am wondering if there is such a device as a USB client that has a large internal buffer and that handles flow control with the USB host, and provides a data bus interface to the received data. I have no idea if such a thing exists, or if it is a reasonable solution.

Edit: I have found the FTDI FT245R which is just what I was trying to describe. It provides all the timing/interface for a USB device, and an 8-bit wide FIFO to read received data. I still don't know if this is the most effective way to go, but it is a possibility. Still looking for other options, or confirmation of this as a reasonable strategy.

photomankc · 2014-01-09 10:24

The BeagleBone Black has two PRUs (Programmable Realtime Units) that could easily (for the hardware) handle that data rate. The issues are the smothering layers of Linux abstraction that you have to work through to get them working, as well as the fact that they are programmed in assembly only; there is no compiler to use to program them in C or anything. This is definitely right up the alley they were designed for though.

ags · 2014-01-09 11:29

I already have a RPi. The PRUs are the reason I have a BBB on order. I thought much the same thing as you describe. While it would be a fun project to figure out how to put them to use, I also don't need to avoid an easy (perhaps obvious to someone else) solution.

I've looked into the PRUs a bit (including the instructions (it's great that the SoC on the BBB is fully documented)). I'm not sure how to hook them into a process that will be reading data from an SD card or RAM or somewhere in userspace. I also am not clear on how to create a parallel output port (8 bits) without writing my own driver. I'm fairly sure I still need that to be able to achieve the necessary data rate to the Propeller, regardless of how I get it there. I've seen (maybe) some capes that appear to have done just that to drive 3D printers. I don't have experience writing Linux drivers, and that's when I thought of the USB interface.

David Betz · 2014-01-09 11:37

ags wrote: »

I already have a RPi. The PRUs are the reason I have a BBB on order. I thought much the same thing as you describe. While it would be a fun project to figure out how to put them to use, I also don't need to avoid an easy (perhaps obvious to someone else) solution.

I've looked into the PRUs a bit (including the instructions (it's great that the SoC on the BBB is fully documented)). I'm not sure how to hook them into a process that will be reading data from an SD card or RAM or somewhere in userspace. I also am not clear on how to create a parallel output port (8 bits) without writing my own driver. I'm fairly sure I still need that to be able to achieve the necessary data rate to the Propeller, regardless of how I get it there. I've seen (maybe) some capes that appear to have done just that to drive 3D printers. I don't have experience writing Linux drivers, and that's when I thought of the USB interface.

Do you have a link to a description of the PRU instruction set? I've done Linux drivers before. Might be fun to try writing one for a PRU if someone hasn't already done that.

Heater. · 2014-01-09 11:49

If I get the idea correctly you are not going to be writing a Linux device driver for a PRU.

Imagine you have an ARM chip running Linux and it has shared memory or DMA or whatever interface to an on chip Propeller core.

So now you have "firmware" to write for those prop cogs and perhaps a Linux device driver to load that firmware and subsequently talk to it.

In the case of the PRUs it's a big investment of time and effort into something totally non-portable. I don't see anyone going for it unless they have a market for millions of some custom gadget.

David Betz · 2014-01-09 12:08

Heater. wrote: »

If I get the idea correctly you are not going to be writing a Linux device driver for a PRU.

Imagine you have an ARM chip running Linux and it has shared memory or DMA or whatever interface to an on chip Propeller core.

So now you have "firmware" to write for those prop cogs and perhaps a Linux device driver to load that firmware and subsequently talk to it.

In the case of the PRUs it's a big investment of time and effort into something totally non-portable. I don't see anyone going for it unless they have a market for millions of some custom gadget.

I didn't mean a device driver to run on a PRU. I meant a driver to interface to a PRU. Or maybe that isn't necessary. Maybe the PRU can be mapped into user space and manipulated directly. In any case the ARM + PRU architecture is sort of like what I was hoping might someday happen with the Propeller. If we could have a Propeller n+1 chip, n>=2, that had a bunch of COGs and an ARM core then we could run the OS and apps on the "application processor" and run the time-critical stuff on the COGs. It seems like the idea they have with the PRUs so I'm interested in seeing how well it works out on the BBB.

Heater. · 2014-01-09 12:31

Point is, if I understand correctly, the PRU is a different architecture requiring its own assembler and C compiler (if there is such a thing).
So if you want to use it from the Linux side you need some special driver to at least talk to it. The rest is down to creating "firmware" to run on the PRU.
Like I said, you can imagine it as an ARM with a Propeller on the same silicon. You need a Linux device driver to fire up and communicate with the Prop, and you need a Spin/PASM/C firmware to load the COGs with.

So yes, wouldn't it be cool if one day Parallax could license Propeller cores to all those guys making ARM SoCs for Linux based embedded systems?

We could be using a "standard" Prop architecture everywhere!

David Betz · 2014-01-09 12:39

Heater. wrote: »

Point is, if I understand correctly, the PRU is a different architecture requiring its own assembler and C compiler (if there is such a thing).
So if you want to use it from the Linux side you need some special driver to at least talk to it. The rest is down to creating "firmware" to run on the PRU.
Like I said, you can imagine it as an ARM with a Propeller on the same silicon. You need a Linux device driver to fire up and communicate with the Prop, and you need a Spin/PASM/C firmware to load the COGs with.

That is exactly what I was expecting. I'll have to look up the description of the PRU to see what its instruction set looks like. I don't mind writing assembly code if the thing can do what I want. For example, maybe it would be possible to program a PRU to efficiently talk to a Propeller chip attached to the BBB. Now I'm wishing I had purchased a BBB instead of a RaspberryPi. :-)

Heater. · 2014-01-09 12:55

The BBB is an interesting device for sure.

The idea of writing a Linux device driver to talk to a PRU, for which you have written firmware to talk to a Propeller, for which you have also written code, starts to sound very cumbersome though.

David Betz · 2014-01-09 12:57

Heater. wrote: »

The BBB is an interesting device for sure.

The idea of writing a Linux device driver to talk to a PRU, for which you have written firmware to talk to a Propeller, for which you have also written code, starts to sound very cumbersome though.

Maybe but what's the alternative? Slow access to the Propeller over a serial port?

ags · 2014-01-09 13:30

@David Betz here's a link to a (pseudo) tutorial, with links to other TI resources on the PRU: http://elinux.org/ECE497_BeagleBone_PRU

@Heater it sounds like a lot of fun and learning to me. And as David says, until we have an ARM/Propeller SoC, this would be a great way to have cake and eat it too.

As said earlier, for the same purpose (sending data to a slave Propeller for bit-banging/precise timing) I was looking into using the simpler, standardized USB connectivity. That would require at least one (FTDI) chip on the slave board, and still may have too much latency to work (for me). I just saw some specs that indicate that the minimum round trip time for a USB 1.1/2.0 transfer is 1ms (that's forever). I presume on the BBB side all the polling for CTS (to the Propeller through USB) would be offloaded from the main processor - although I haven't confirmed that yet.

I was also looking into direct manipulation of the GPIO, but parallel (8 data bits + ready bit) not serial. I can't find much on that topic. I suspect it will require a Linux driver. And DT overlay. Maybe more? But more importantly, I also am concerned that it would bog down the ARM core, mostly wasting time with constant polling to see when the Propeller is ready for more data. Unless I can use interrupts for that.

I'm working from the basic assumption that if coded in PASM, one cog will max out at about 1MB/sec data throughput - that is, 8 parallel bits read in and stored in hub RAM. Plus or minus. Am I missing something there? If not, that's the goal I'll set for the sustained data rate to be provided by the BBB (with no buffer under-runs on the Propeller, with a buffer size of whatever is left of 2kB in cog RAM or maybe up to 4kB in hub RAM if necessary).
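As a sanity check on that 1MB/sec working figure, here is a Python sketch of the per-byte instruction budget and of how long the proposed buffers last. It assumes an 80 MHz Propeller 1 with 4 clocks per instruction and takes 1MB/sec as 1,000,000 bytes per second; hub access stalls are not modelled, so the real budget is tighter.

```python
CLOCK_HZ = 80_000_000   # assumed system clock
CLOCKS_PER_INSTR = 4    # assumed cost of a typical cog instruction


def instr_budget_per_byte(bytes_per_s: float) -> float:
    """Instructions one cog can spend on each byte at the given throughput."""
    return CLOCK_HZ / CLOCKS_PER_INSTR / bytes_per_s


def drain_ms(buffer_bytes: int, bytes_per_s: float) -> float:
    """How long a full buffer lasts before it runs dry, in milliseconds."""
    return buffer_bytes / bytes_per_s * 1000


RATE = 1_000_000  # 1 MB/sec working assumption from the post

print(f"budget: {instr_budget_per_byte(RATE):.0f} instructions per byte")
for size in (2048, 4096):
    print(f"{size} byte buffer lasts {drain_ms(size, RATE):.3f} ms")
```

With those assumptions a 2kB buffer lasts about 2 ms and a 4kB buffer about 4 ms, which is why a 1ms USB round trip looks uncomfortably close to the limit.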

BTW, in case anyone is reading this and taking what I'm writing as concrete fact, or even if a reader feels compelled to tell me I don't know what I'm talking about... I already know I don't know what I'm talking about. This is my first foray into the world of hacking Linux on a BBB. I'm open to learning from others, and welcome that of course.

...and one more thing: IIRC I read that the PRU instruction set is a subset of about 40 instructions shared with the Cortex-A8 core. I didn't bookmark that page and haven't been able to find it again though.

Edit: Actually, there are exactly 45. http://processors.wiki.ti.com/index.php/PRU_Assembly_Instructions

David Betz · 2014-01-09 14:00

Thanks for the link. The PRUs sound kind of nice. Each has 8k of program memory and 8k of data memory plus they share 12k of data memory. They also each have a MAC and have direct access to some GPIO pins. Haven't looked at the instruction set yet. I'll do that later. Anyway, I don't even have a BBB so I can't do anything right away anyway.

ags · 2014-01-09 16:59

David Betz wrote: »

...Anyway, I don't even have a BBB so I can't do anything right away anyway.

What do you mean? You have lots of reading and learning and planning to do. Having the actual hardware with you will just be a distraction. :-)

I'm surprised and pleased that I was able to discover the PRUSS. The more I learn about it the better it seems - until the misery of actually using them becomes real. I hope we can collaborate on this project.

altosack · 2014-01-10 07:13

I have a BBB, and even though I haven't done much with it, I've done a lot of reading on the PRUSS.

The instruction set is not shared with the ARM core; it's completely separate. It's completely deterministic, one clock for every instruction @ 200 MHz, unless you address the memory in the ARM. There is a blob device driver in Unix space that is callable from user space to communicate with the PRUSS, so you may not have to write a device driver. My experience is that there's quite a learning curve to use it from Linux, although it looks like a lot of fun to program the PRUSS itself.

Programming the PRUSS is done with a one-pass assembler, no C, and no linking, although it does have normal pre-processor directives, and you can include other files, so it can be modularized to a certain degree.

ags · 2014-01-10 07:23

@altosack do you have links to any of the learning material you used (other than the two links I posted previously)?

Thanks.

LoopyByteloose · 2014-01-10 07:24

Peter Jakacki wrote: »

Am I missing something here? Unlike a single-core processor the Prop has 8 cores so that it can have precise timing on one, Ethernet on another, protocol on another and so on. My Ethernet and SD run on the main Tachyon console cog but even with dedicated cogs there are enough to go around. You don't really need any spare; if it takes all 8 cogs to make it work then the Prop has fulfilled its requirements. If that's not enough, just add another Prop perhaps.

Well, it seems that Tachyon Forth makes the Propeller 'scary good'.

I'd like to summarize a few salient points.

A. The Propeller One (and likely the Propeller Two as well) is the ideal interface extender for any Single Board Computer or SoC.

B. And in many cases with Peter J's Tachyon Forth code, the whole project might fit on just a Propeller.

David Betz · 2014-01-10 07:35

altosack wrote: »

I have a BBB, and even though I haven't done much with it, I've done a lot of reading on the PRUSS.

The instruction set is not shared with the ARM core; it's completely separate. It's completely deterministic, one clock for every instruction @ 200 MHz, unless you address the memory in the ARM. There is a blob device driver in Unix space that is callable from user space to communicate with the PRUSS, so you may not have to write a device driver. My experience is that there's quite a learning curve to use it from Linux, although it looks like a lot of fun to program the PRUSS itself.

Programming the PRUSS is done with a one-pass assembler, no C, and no linking, although it does have normal pre-processor directives, and you can include other files, so it can be modularized to a certain degree.

Sounds quite nice. I'll have to add a BBB to my list of gadgets to acquire sometime soon. I should probably spend some time with my recently acquired RaspberryPi first though. I'm running out of space in my input queue for new devices and Chip's revision of the P2 code will be taking my time soon...

altosack · 2014-01-10 08:47

ags wrote: »

@altosack do you have links to any of the learning material you used (other than the two links I posted previously)?

Thanks.

I don't have a link; I've got it all on my computer now. If you google "am335xPruReferenceGuide.pdf", and look for other things in the same place, it should get you more than started. Beware that the newest version of the am335x spec. document (not the Pru-specific one) has the PRU stuff stripped out, and you need to get an older one (spruh73c.pdf, not 73h).

David Betz · 2014-01-10 12:47

altosack wrote: »

I don't have a link; I've got it all on my computer now. If you google "am335xPruReferenceGuide.pdf", and look for other things in the same place, it should get you more than started. Beware that the newest version of the am335x spec. document (not the Pru-specific one) has the PRU stuff stripped out, and you need to get an older one (spruh73c.pdf, not 73h).

Interesting. Did they strip out the PRU stuff because they've moved it to a separate manual or are they now trying to hide the details from users?

ags · 2014-01-16 21:50

Good question,

photomankc · 2014-01-17 07:13

ags wrote: »

Good question,

TI has been weird about them from the beginning. They are advertised as a feature of the chip but TI doesn't seem at all interested in supporting them in any way whatsoever. I think I remember one comment to the effect of not supported for 'hobby use', whatever that means. As hobbyist buyers are the only ones I know of, that sounds a lot like "we just aren't supporting them".

ags · 2014-01-17 10:09

My guess would be that they (TI) see this as differentiated for industrial/embedded use. The SoC on the BBB wasn't developed for the BBB, it was "adopted" by BeagleBoard.org. While TI may be glad for the exposure, what they need are design socket wins. This SoC (and derivatives) may have the PRUSS, but if there are limited support resources available from TI, then I can see them not wanting to spend those helping a hobbyist become productive using this rather complex (or at least new and not well-understood) subsystem, but rather working on the socket wins by helping potentially high-volume users become productive.

photomankc · 2014-01-17 13:26

Yeah that's a good point.

I know that BBB and TI have ties but it is true that this is not a TI product. That can always backfire on you though, because you never know which person you turn your nose up at may be the one that launches that new product that uses your newfangled part. It's not so much the idea that they won't build goodies for the BBB community; it's the fact that they seem to be reluctant to even share what resources exist, with a kind of "you'll shoot your eye out" attitude about it. Perhaps I am misreading the intent but I definitely got the impression they were very stand-offish about the PRUSS. I ended up deciding a prop and a serial link would fill the need easier right now.

ags · 2014-01-17 13:48

I believe that the BBB designers are actually employees of TI and have support from TI in the form of "on-the-job" time spent on the BBB work.

photomankc wrote: »

... I ended up deciding a prop and a serial link would fill the need easier right now.

...and now we've come full circle (or at least part way around). So while the PRUSS is interesting and I do want to learn more about it, I don't need to force doing everything the hard way (that seems necessary with enough frequency without me insisting on it). I was first wondering if I could use an existing interface from the BBB to the Propeller to get data from BBB to Prop for bit-banging in real time. With the speed and memory limitations of the Propeller, I have to have pretty good flow control to keep the Propeller buffers full (avoiding starvation) without overflow (avoiding data loss). With Linux not being an RTOS, how does one effectively manage the data flow? Did you accomplish this with your serial interface, or do you have the luxury of just using wait/sleep on Linux and then wake/resume, send data and repeat without issue?
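One way to reason about that flow-control question is a toy simulation. The Python sketch below is only a model under stated assumptions, not a driver: the Prop drains a fixed number of bytes per tick, raises a refill request when the buffer falls to a low-water mark, and the SBC delivers one chunk a fixed number of ticks later (standing in for Linux scheduling delay). All names and numbers are hypothetical; one tick is taken as 8 microseconds, so 5 bytes per tick corresponds to 5Mbps.

```python
def simulate(capacity, low_mark, chunk, drain_per_tick, latency_ticks, ticks):
    """Run the model and report whether the buffer underran or overflowed."""
    level = capacity      # start with a full buffer
    arrival = None        # tick at which the requested chunk will land
    lowest = level
    for t in range(ticks):
        if arrival is not None and t >= arrival:
            level += chunk
            arrival = None
            if level > capacity:
                return {"ok": False, "reason": "overflow", "lowest": lowest}
        level -= drain_per_tick
        if level < 0:
            return {"ok": False, "reason": "underrun", "lowest": 0}
        lowest = min(lowest, level)
        if level <= low_mark and arrival is None:
            arrival = t + latency_ticks   # Prop signals "ready for more"
    return {"ok": True, "reason": "", "lowest": lowest}


# 4 kB buffer, refill requested at half full, 2 kB chunks, 5 bytes per 8 us tick.
fast = simulate(4096, 2048, 2048, 5, latency_ticks=125, ticks=10_000)  # 1 ms response
slow = simulate(4096, 2048, 2048, 5, latency_ticks=625, ticks=10_000)  # 5 ms response
print("1 ms latency:", fast)
print("5 ms latency:", slow)
```

In this model the rule of thumb is that the low-water mark must exceed the drain rate multiplied by the worst-case response time; a 1 ms response survives with the numbers above, while a 5 ms stall starves the buffer.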

photomankc · 2014-01-23 05:37

ags wrote: »

Did you accomplish this with your serial interface, or do you have the luxury of just using wait/sleep on Linux and then wake/resume, send data and repeat without issue?

Sorry, didn't see your question earlier. No, my use of the Propeller is the other way around. The Propeller just does its thing as fast as it can and my Linux program spits commands at it: [SN:R:0] - Sonar, Read, Sensor 0, and the prop spits back whatever the current sonar distance reading is: [210] - 21.0 inches. The other subsystem is a clock generator. So [CK:W:0:8100] - Clock, Write, channel 0, 8.1MHz. That's a fire and forget deal. Planning to add a GPIO system as well so I can say [GP:W:0:FF].... you get the picture. It's all stuff that would be sampled at 20 to 40 Hz so nothing that has to be more than 115200 bps.
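For anyone wanting to try a similar command scheme, here is a minimal Python sketch of parsing those bracketed frames. The field layout is inferred only from the examples in the post; the class and function names are invented for illustration, and the value field is kept as text because the examples mix decimal (8100) and what looks like hex (FF).

```python
from typing import NamedTuple, Optional


class Command(NamedTuple):
    subsystem: str          # e.g. "SN", "CK", "GP"
    op: str                 # "R" for read, "W" for write
    channel: int
    value: Optional[str]    # left as text; encoding depends on the subsystem


def parse_command(frame: str) -> Command:
    """Parse a frame such as "[CK:W:0:8100]" into its fields."""
    if not (frame.startswith("[") and frame.endswith("]")):
        raise ValueError(f"missing brackets: {frame!r}")
    fields = frame[1:-1].split(":")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields: {frame!r}")
    subsystem, op, channel = fields[:3]
    if op not in ("R", "W"):
        raise ValueError(f"unknown operation: {op!r}")
    value = fields[3] if len(fields) == 4 else None
    return Command(subsystem, op, int(channel), value)


def parse_sonar_reply(frame: str) -> float:
    """Convert a reply like "[210]" (tenths of an inch) to inches."""
    return int(frame[1:-1]) / 10


print(parse_command("[SN:R:0]"))
print(parse_command("[CK:W:0:8100]"))
print(parse_sonar_reply("[210]"), "inches")
```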

With memory-mapped userland I/O I know you can get toggle rates in the MHz on the ARM boards depending on the performance of the library. The issue will always be what happens when some other task stomps on the CPU for a while and your userland toggle gets the school-bus stop sign. Kernel driver?

Heater. · 2014-01-23 06:48

I don't think a kernel driver will help you with the "school-bus stop sign" problem.

Imagine trying to write something like full duplex serial by bit-banging on the GPIO of a Pi.

Your kernel module could probably do that, provided you disabled interrupts permanently so that incoming bits and bytes are not missed and the outgoing bit timing is not delayed by being rescheduled.

But then of course you have halted the entire operating system just to do that bit banging job reliably. Even whatever program you have that might use the driver has no time to run!

photomankc · 2014-01-23 10:34

Yes I was thinking more along the lines of a clocked bus of some kind where the period of the clock itself is not timing critical. If your receiver doesn't care that bit to bit times might vary by microseconds then you could get bits over at average rates in the megabits per second. If you could spare 5 lines then you could transfer a nibble at a time and significantly up the effective bit rate and give the system more breathing room on the needed clock rate. Correct me if I'm wrong here but if it's a kernel level driver then only other kernel interrupts are going to get in the way, where a user-level app can suffer the wrath of a CPU hog application. If the kernel goes off to la-la-land your product is pretty well hosed now anyway.

Let's see.... a nibble wide bus would need to average 1.25 MHz to reach 5Mbps, so that's like 0.8 microseconds per nibble. If you can devise a way to avoid usleep or its kin to determine when to send the next nibble and clock, that seems potentially doable.
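Working that arithmetic through for a few bus widths, as a quick Python sketch (the function name is made up; the 5Mbps target is the figure from the thread):

```python
def transfer_period_us(target_bps: float, bus_width_bits: int) -> float:
    """Average time available per bus transfer, in microseconds."""
    transfers_per_s = target_bps / bus_width_bits
    return 1e6 / transfers_per_s


TARGET_BPS = 5_000_000

for width in (1, 4, 8):
    rate_mhz = TARGET_BPS / width / 1e6
    period = transfer_period_us(TARGET_BPS, width)
    print(f"{width}-bit bus: {rate_mhz:.3f} MHz average, {period:.1f} us per transfer")
```

A nibble bus at 1.25 MHz leaves 0.8 microseconds per transfer on average, and a byte-wide bus doubles that to 1.6 microseconds.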

If it has to be an async line of some kind, I don't see how you get there (5Mbps) from here. SPI can get up to a pretty high clock rate on Beagle/Pi but I found that byte to byte transmission times can vary a lot and are often so long that clock rate increases are not significantly helpful. I think at 1Mbps the delay from byte to byte was close to 1/3 of the time to transmit all the bits of each byte.
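The effect of that inter-byte gap can be estimated with a short Python sketch. It takes the observation above (a gap of about one third of a byte time at 1Mbps) and assumes, as a simplification, that the gap stays the same length when the clock is raised. The function name is hypothetical.

```python
def effective_bps(clock_hz: float, gap_s: float, bits: int = 8) -> float:
    """Throughput when every byte is followed by a fixed idle gap."""
    return bits / (bits / clock_hz + gap_s)


# Gap of roughly 1/3 of a byte time at a 1 MHz clock, i.e. about 2.67 us.
GAP_S = (8 / 1_000_000) / 3

for clock_hz in (1e6, 4e6, 16e6, 100e6):
    mbps = effective_bps(clock_hz, GAP_S) / 1e6
    print(f"clock {clock_hz / 1e6:>5.0f} MHz -> {mbps:.2f} Mbps effective")

# With a fixed gap the throughput can never exceed 8 bits per gap.
print(f"ceiling: {8 / GAP_S / 1e6:.2f} Mbps")
```

Under that assumption the effective rate is 0.75 Mbps at a 1 MHz clock and levels off below 3 Mbps however fast the clock runs, which matches the experience that raising the SPI clock alone does not get to 5Mbps.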
