Video Processing (going the other way)
Wurlitzer
Posts: 237
Anybody been working on using this great Micro to process images from a video source rather than generate them?
I have an application that would have to look at high-contrast flags oriented as 2 rows of 65, and determine one of 3 vertical positions. Each flag would have one flag-width space between them, so horizontal separation should be sufficient, and I would make sure the vertical displacement would move the image at least 5 horizontal scan lines up or down. I don't need to know the exact vertical position, just that it moved far enough to reach one of the 3 states.
IE: If (Position1 <= 15, Position2 = 15 to 19, Position3 >= 20) and the top of Flag #42 is at horizontal scan line #12, it is position 1.
I was thinking of making the flag height tall enough to allow the video processing to self-clock. In other words, no matter what vertical position the flags are in, there would be a horizontal scan line number where all flags for a row would be seen and therefore counted. In the example above, maybe scan line 16 would see all the flags.
I would like to process the image ideally at an interlaced field rate of 60/second or, worst case, a full-frame rate of 30/second. Any slower and the application would not work.
What does the brain trust think?
Comments
With 32K of RAM, there's not a lot of buffer space for something as rich as a camera image, as you are aware. If some kind of object detection could be performed as the data was input, which is what you are proposing, the buffering requirements could be reduced by a lot. This would make it all possible and practical. I think you could achieve this.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
My first vision app was for a huge sorting machine that measured the diameter of dwarf fruit tree stock at a nursery and dropped the trees into bins, based on size. It used a Z80 and captured a 256x256x6-bit frame from a B/W TV camera. The video buffer required 64K of RAM, but the processing requirements were pretty modest, and the Z80 could keep up easily.
From there I specialized in linescan imagers, which are cameras with a single row of pixels -- in my case 256x1 or 128x1. Linescan imagers are the things to use when the subjects being observed are moving, such as on a production line. That way, successive images can be acquired to fill in the second dimension missing from the sensor itself. And, as Chip suggests, processing on the fly is the way to eliminate huge buffer requirements.
These sensors were (and still are) being used in the produce-packing industry for sizing fruit and vegetables. Other applications include detecting the orientation of empty liquor bottles prior to filling (so the label gets put on the right side) and measuring the widths of boards going through an automatic saw. These apps use PIC microcontrollers -- again, nothing fancy.
One of the biggest hurdles to overcome in machine vision apps is lighting. If you can control the lighting and optimize it for the type of sensing you're doing, you've done 90% of the work and made the other 10% easier. Seriously. There are many resources on the web that discuss lighting for machine vision. One of the better introductions can be found here: dolan-jenner.com/jenner/equipment/guide.asp.
Now, how does the Propeller fit into all of this? The Propeller has some unique characteristics that make it suitable for some machine vision apps. Being able to display what it's looking at, given some sort of image input, is a big plus for debugging. Granted, 32K of RAM isn't enough to hold a frame of VGA (640x480) or even CIF (352x288). But you don't have to for a large class of useful vision apps. A lot can be accomplished either at coarser resolutions or by computing on the fly. One thing I expect the Propeller, with its multiple processors, to shine at is image data processing. But I haven't yet gotten that far in my investigations.
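To put the RAM constraint in numbers: even at 8 bits per pixel, a CIF frame is 352 x 288 = 101,376 bytes, more than three times the Propeller's 32K, while a single CIF line is only 352 bytes.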
I expect to have a lot more to say on the subject in the near future. For now, suffice it to say that the Propeller looks like a good candidate to form the foundation of a modest, but extremely useful, machine vision system.
-Phil
In my application, the flags will always be in a fixed horizontal position and have 3 possible vertical positions. The background will be solid black and the flags white.
The processing requirements would be as follows:
1. Determine the end of the vertical sync pulse.
2. Count the horizontal scan lines.
3. Determine the position within a single horizontal scan where the flags should appear. This might be hard-coded, or it would be great to be self-detecting.
4. At the expected horizontal position, determine if the video image is white for Flag(FlagPositionCounter) at one of the 3 possible scan count values. If I see white, for example, at HorzScanLine 5/6 the flag is at position 1, @ 10/11 position 2, @ 15/16 position 3. (I used 2 scan lines to account for interlaced scanning.)
5. Set a FlagArray (0-129) to a value of 1, 2, or 3 -- or, with 32 bits, this array could hold both the anticipated horizontal position and the vertical position.
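A rough sketch of the comparison step in Propeller assembly (hedged: the topline and position register names are mine, and the thresholds are just the example values from earlier):

              mov     position, #3        ' default: top edge at line 20 or below
              cmp     topline, #20 wc     ' C = (topline < 20)
if_c          mov     position, #2        ' top edge in lines 15..19
              cmp     topline, #15 wc     ' C = (topline < 15)
if_c          mov     position, #1        ' top edge above line 15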
In my application this would eliminate 260 switch contacts (other similar apps require 520), which have always been problematic regardless of switch type. The industry has tried reed, Hall effect, IR, phosphor-bronze wire, shorting bars, etc. All have failed from time to time due to different circumstances like temperature, humidity, and contact corrosion.
The ability to auto correct in software for physical changes would be a huge plus.
Once the full documentation for assembly programming is available I can start to give this some serious attention.
I'm considering doing line-oriented processing, as the previous posters have suggested, but this creates some problems.
1. Processing has to happen during a single horizontal scan line (or a small multiple thereof, if I use line buffers), if I don't want to pull lines from successive frames and risk tearing.
2. No way to do multipass processing of the image.
In either case, to do multi-cog video processing (which is really why I want the chip), I'm a wee bit concerned about RAM access latency.
Are there any reasonably speedy ways to attach some RAM to the current Propellers? Are there plans for Propellers with more shared RAM?
And where the hell's the rest of the manual?
An external flash A/D would also lighten the load significantly.
Depending upon the horizontal resolution your application requires, it is possible the internal RAM would be sufficient for line-by-line processing if it did not have to first determine where it was in the scan.
I agree on the manual issue. I cannot begin to work on this until I have a good handle on the assembly language required and, hopefully, a chart depicting the number of CPU cycles for each instruction.
I just found the section in my camera module's docs that specify how to do that. (I'm using an OV6620.) It might be workable, but the docs are a little fuzzy on how long I can hold the row charges without losing image quality, etc.
Since the camera's already generating digital output, I'm also considering building a framebuffer driven directly by its outputs, and having the processor read from that -- but that introduces a frame of latency.
Indeed; I'm working at CIF, so 288 lines of 352 pixels. At 16 bits lum/chroma, each line would take 176 longs; I can window the sensor smaller, or subsample if necessary -- but I'd like to avoid it.
If the individual cogs prove too slow or too RAM-constrained, I can always interleave the I/O functions between two of them.
I built a small simulation of real-time line processing on an architecture like the Propeller, and it seems to work. I'm not doing that much work -- frame differencing for motion detection, some simple color tracking, basically what the SX28 in the CMUcam does. Thanks to the tan and log tables in ROM, I may also be able to do my laser surface topography calculations entirely in the Propeller; they're already fixed-point and avoid division.
(Incidentally: Parallax, am I gonna have to implement Forth for this thing? We need a low-level compiled language.)
However:
This is the crux here. Without precise instruction timing information, I'm stuck with loose estimates gleaned from other chapters of the manual (22 cycles for an out-of-sync Hub instruction, 7 cycles in sync, 4 cycles for basic instructions, etc). I have no idea what the branch latencies are -- even the PICs have pipelines to stall -- so I'm writing without them.
Hell, at this point, I don't even know for sure that all instructions fit in 32 bits, so I have no idea if the routines I'm designing would fit in cog RAM.
I'm holding off buying Propeller equipment until I can get this info. I admit to having a real soft spot for weird architectures, and I'm known for tight hand-optimization, but without a good instruction set reference? I'd rather not reverse-engineer my micro.
All the info you need about the Propeller's assembly language is in this document:
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 6/14/2006 5:29:42 PM GMT
Rock on! This should be sufficient to finish my model.
Nice orthogonal instruction encoding, btw. I like the way instruction predication is handled, and several of the instructions (the MUX variants in particular) will save wear on my copy of Hacker's Delight.
From a tools perspective, of course, this document doesn't cover how to bootstrap ASM code on the chip, but I know y'all are pretty closed about that sort of thing. I'll see if I can borrow a Windows box.
Couple questions to fill the holes in that document:
1. So, short of self-modifying code, there's really no way to do indirect addressing within the Cog's local storage? (That is, indirection of the D field specifically?)
2. Are there docs on what CALL and RET expand to in the standard assembler? It strikes me as being something like
; CALL #foo
        jmpret  foo_ret, #foo
; RET
foo_ret jmp     #0              ; S field modified on CALL
...but if it were that simple, I don't see why we'd need a macro, so I'm surely missing something.
3. Any docs on the WAITVID instruction?
4. How about the cycles from COGINIT/COGSTOP to the COG actually starting/stopping? (I'm sure it's a function of both cog numbers, but as long as I can predict it, it might well come in handy.)
Just have the initial COG that runs your spin code launch a single ASM program into itself and you can be completely in assembly thereafter, with COGs launching and stopping other COGs, and the works. Once all your code is running in COGs, you can reuse the entire main RAM for whatever you want. There are no rules.
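In Spin terms, that launch-into-itself pattern might look something like this minimal sketch (the entry label is an assumed name):

PUB Main
  coginit(cogid, @entry, 0)     ' reload this very cog with the PASM image

DAT
              org     0
entry         jmp     #entry    ' placeholder: real assembly code goes here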
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
*nod*
The tool support issue I was referring to, however, is the fact that I don't own (or even have access to) a Windows box, so simply launching it from SPIN code isn't an option. Hence my desire for information on the bootstrap sequence, so I could roll a binary and program the chip from one of my Macs, or one of the Linux boxen at work.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I'll go read the VGA sources. I have no need for VGA or composite output in my project at the moment, but when I pick up a Propeller board I'll sure play with it some.
As for COGSTOP, are there any guarantees of how many cycles after COGSTOP the targeted COG stops? (Scarily enough, I'm actually thinking of using this for some state control in the I/O routine, since I can't squeeze enough cycles out of the critical path to actually include a "stop" mechanism.)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
What I find especially nice is that the buffer's address is preinitialized. This can also work for updating both the source and destination addresses simultaneously.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
Post Edited (Paul Baker) : 6/15/2006 2:03:55 PM GMT
Before :storeloop do you need to do movd :storeloop, fbufstart? Otherwise it seems that if you hit that loop multiple times the destination would always start wherever it last left off.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
OS-X: because making Unix user-friendly was easier than debugging Windows
links:
My band's website
Our album on the iTunes Music Store
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
The d_inc constant, 0x200, contains a 1 in the low-order bit of the D field. Thus, the instruction
...adds a literal 0x200 to the instruction at :storeloop, incrementing the D field.
Of course, if you did this 512 times, it'd overflow, which would convert the instruction to something along the lines of
In other words, it becomes a nop, which would be quite a surprise. Paul, I assume your buffer is under 512 words?
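Paul's listing isn't quoted here, but the loop under discussion presumably looks something like this sketch (fbufstart, sample, and count are assumed names), with the movd reset making the loop safe to re-enter:

:storeloop    mov     fbufstart, sample       ' D field preinitialized to the buffer start
              add     :storeloop, d_inc       ' advance the destination register
              djnz    count, #:storeloop
              movd    :storeloop, #fbufstart  ' reset the D field before the next pass

d_inc         long    $200                    ' a 1 in bit 9, the D field's low bit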
(Incidentally, Paul, your code is quite a bit tighter than my attempt at this same thing. I've been trying to get a read-and-fill loop under eight instructions for 16-bit data packed two per word.)
With one COG, the camera will have to be clocked at or around 13MHz (16-bit output is at clk/2, 12 cycles per looped read).
I'm concerned that writing the line out to shared RAM (5,296-5,311 cycles for CIF) looks to take longer than an entire line interval for the camera. One might need two COGs, taking turns reading lines. (The shared RAM write time could be reduced if I could pack samples two-per-long as they're being read, which would likely require COGs alternating on each sample.)
The good news is, buffering a single line with packed 16-bit samples will only take 176 longs of shared RAM.
What's really killing me here, besides the high latencies for shared RAM, are the four-cycle non-pipelined instructions. The raw power of the Propeller beats the SX28, sure, but the SX28 benefits from single-cycle I/O instructions and the traditional PIC-style four-stage pipeline.
Writing to hub memory, especially in blocks, can be quite time consuming. The biggest rub is that you need to update the source and destination address (WRLONG's destination is by pointer only, so the trick I used above doesn't work), as well as the loop instruction; this means you have 3 instructions between WRLONGs, so you end up missing the next available hub slot. If you are fetching values off the bus, it may be faster in the long run to write them directly to hub memory. Doing a "WRLONG from INA / increment hub address / DJNZ" allows you to catch every hub rotation; then another cog could go through and pick out the data and compact it. Whether the compacting cog could keep up is another question, but here is where you might use the alternating cogs for each scan line.
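In code, that inner loop might look something like this sketch (hubptr and count are assumed names):

:grab         wrlong  ina, hubptr     ' raw port sample straight to hub RAM
              add     hubptr, #4      ' advance the hub address by one long
              djnz    count, #:grab   ' 8 cycles of housekeeping, just in time
                                      ' for the next hub window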
That's what I love about the Propeller: if you are imaginative enough, you can find these cool little tricks to squeeze more performance out of it.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
Post Edited (Paul Baker) : 6/15/2006 9:51:04 PM GMT
Well yes. Considering that your code's rooted at 0x000, you'll likely have to start higher than that as well.
Yes, my current code assumes you can fit the shared buffer addresses within an immediate, so it'd have to reside entirely within the first 512 bytes of RAM. I can only do this with some quantization and similar nastiness. But, assuming one can do the code shuffling to pull this off, it's a simple add of 0x201.
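For byte-granular access, the combined bump would look something like this sketch (register names assumed; for words or longs the S-field step would be $202 or $204, since hub addresses count bytes):

:rdloop       rdbyte  0-0, #0-0           ' D and S fields patched each pass
              add     :rdloop, inc_both   ' +1 destination register, +1 hub address
              djnz    count, #:rdloop

inc_both      long    $201                ' bit 9 bumps the D field, bit 0 the S field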
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
I'm collecting the information I need to roll my own binaries and program them onto the chip. This'll mean enough SPIN to bootstrap the assembler code, and no more. In this case, the assembler can come up, jump into a routine in high RAM, and clear the lower chunk from there, possibly paging in more code in the process.
In the longer term, I hope to target a compiler to the chip, but we'll see. (I've already got an assembler, from the data Chip sent yesterday, but of course my "binaries" lack the necessary bootstrap preamble.)
Does this assembler not let you use directives to control where your code is loaded in RAM? Unfortunate. That's a pretty standard tool feature.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
Besides, it'd have to be VAR LONG res[128] -- longs are byte-addressed, so only 128 of them are addressable by a literal.
But other than that.
You should understand that we're not in love with Windows, either. It's just a fact of life for most of us. The ultimate goal of future Propeller chips is to completely stand alone so that NO box is necessary.
In the meantime, if you want to make a bootstrap loader for the Propeller, this is the only preamble you need:
Minimal Spin bootstrap code for assembly language launch
$0000: HZ HZ HZ HZ CR CS 10 00 LL LL 18 00 18 00 10 00
$0010: FF FF F9 FF FF FF F9 FF 35 37 04 35 2C -- -- --
$0020: your assembly code starts here - loaded into COG #0
elaboration:
$0000: HZ HZ HZ HZ - internal clock frequency in Hz (long)
$0004: CR          - value to be written to clock register (byte)
$0005: CS          - checksum so that all RAM bytes will sum to 0 (modulus 256)
$0006: 10 00       - 'pbase' (word) must be $0010
$0008: LL LL       - 'vbase' (word) number of longs loaded times 4
$000A: 18 00       - 'dbase' (word) above where $FFF9FFFF's get placed
$000C: 18 00       - 'pcurr' (word) points to Spin code
$000E: 10 00       - 'dcurr' (word) points to local stack
$0010: FF FF F9 FF - below local stack, must be $FFF9FFFF
$0014: FF FF F9 FF - below local stack, must be $FFF9FFFF
$0018: 35          - push #0   (long written to $0010)
$0019: 37 04       - push #$20 (long written to $0014)
$001B: 35          - push #0   (long written to $0018)
$001C: 2C          - COGINIT(0, $20, 0) - load asm code from $20+ into same COG #0
$001D: -- -- --    - filler
$0020: XX XX XX XX - 1st long of asm program to be loaded into COG #0
$0024: XX XX XX XX - 2nd long of asm program to be loaded into COG #0
$0028:             - rest of data
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 6/16/2006 6:33:53 AM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.
Chip, you rock. You've saved me numerous hours in a hex editor.
The 'vbase' value -- when you say "number of longs loaded," do you mean the longs of the machine code at 0x0020, or the overall word size of the programmed image?
And I should probably ask, since you've been so helpful: do y'all have any objections to third-party tools targeting the Propeller? It'd be non-commercial; my employer doesn't take kindly to commercial side projects.
Also, Paul:
Not necessarily. For example, in my (still simulated) OV6620 interface code, I don't have time to pack the 16-bit samples while reading from the camera. Likewise, packing them into longs before writing to main memory takes more than 8 cycles, so I miss a hub access window.
Net effect: writing using WRWORD takes twice as many writes, but uses the same total cycles.
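To put numbers on it, assuming a hub window every 16 clocks: WRLONG on every other rotation moves one long (two samples) per 32 clocks, while WRWORD on every rotation moves one word (one sample) per 16 clocks. Either way it comes out to 16 clocks per sample.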
I see your point on the WRWORD vs WRLONG.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life is one giant teacup ride.