I've shoe horned the Kalman filtering code into the my aerial robotics Atmel AVR based control board and timed it.

You can see the source code for the embedded version's vector .h, ahrs.h and ahrs.c. The last one is the actual Kalman filtering code.

With the math library the code is about 15k (including 4k of the sensor library and other things). The GNU math library does 32 bit double precision and includes the necessary trig functions. We require roughly 70 ms per call to kalman_step*, so we can only process every fourth sample of the accelerometer (20ms / sample). This is stil fast enough to do a 10 Hz update, which may be sufficient.

I can still do more performance tuning and code compaction. The runtime footprint is a bit large since the code is almost verbatim from the version that was running on my laptop.

*: It requires 285,000 instructions to emulate the floating point math and matrix inversions, etc. An 8 Mhz clock would do it in 35 ms, or the 16 Mhz Mega128 could do it in 18 ms. That would be just about a 60 Hz update rate.