2015
05.01

Hello!

Low-level toying with multiple CPUs without proper locking mechanisms is asking for trouble. I have already seen many cryptic boot logs from native AROS on RaspberryPi2 which you simply cannot decode. This happens every time more than one core tries to speak over the serial line.

The locking primitive which we have just added to AROS is a spin lock. It has no owner, so it cannot be re-entered: trying to do so results in an endless loop with no exit. The spin lock can be obtained either for reading or for writing. When the spin lock is in read mode, it can be acquired by many clients, but as long as at least one of them holds a read lock, code wanting to switch it into write mode has to wait. When the spin lock is in write mode, it gives exclusive access to exactly one caller. Until it is released again, no other code can obtain the lock at all.

So, here it goes, the spin lock:

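typedef struct {
    volatile unsigned long lock;    /* 0 = free, reader count, or 0x80000000 = write-locked */
} spinlock_t;

/* Static initializers */
#define SPINLOCK_INIT_UNLOCKED  { 0 }
#define SPINLOCK_INIT_WRITE_LOCKED  { 0x80000000 }
#define SPINLOCK_INIT_READ_LOCKED(n) { (n) }    /* n = number of readers */

/* Lock modes */
#define SPINLOCK_MODE_READ  0
#define SPINLOCK_MODE_WRITE 1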

The spin lock comes with three default initializers for those who want to place it in a defined state, e.g. in the data section. The lock uses a single 32-bit value which defines its state:

  • lock == 0 – the lock is free; anyone can lock it in either mode
  • lock > 0 (below 0x80000000) – locked in READ mode; the value counts the readers. Anyone can lock it in READ mode again (up to 2^31 times, then it wraps), but an attempt to lock it in WRITE mode will block until the lock is free.
  • lock == 0x80000000 – the lock is in WRITE mode. Further attempts to lock it in either mode will block.
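
The three states listed above can be decoded with a small helper. This is only a sketch in portable C; the names lock_state_t, lock_state and lock_readers are made up for this illustration and are not part of AROS. How values above 0x80000000 should be treated is not described here, so the sketch simply counts them as READ mode.

```c
#include <assert.h>
#include <stdint.h>

#define WRITE_LOCKED 0x80000000u   /* WRITE-mode value from the list above */

/* Hypothetical names, used only for this illustration. */
typedef enum { LOCK_FREE, LOCK_READ, LOCK_WRITE } lock_state_t;

/* Decode a raw 32-bit lock word into one of the three states. */
static inline lock_state_t lock_state(uint32_t value)
{
    if (value == 0)
        return LOCK_FREE;
    if (value == WRITE_LOCKED)
        return LOCK_WRITE;
    /* Any other value is treated as a reader count. Values above
       0x80000000 are not described in the text; this is an assumption. */
    return LOCK_READ;
}

/* Number of readers holding the lock (0 unless it is in READ mode). */
static inline uint32_t lock_readers(uint32_t value)
{
    return lock_state(value) == LOCK_READ ? value : 0;
}

/* Usage: the three initializer values decode as expected. */
static inline void lock_state_example(void)
{
    assert(lock_state(0) == LOCK_FREE);            /* SPINLOCK_INIT_UNLOCKED */
    assert(lock_state(3) == LOCK_READ);            /* SPINLOCK_INIT_READ_LOCKED(3) */
    assert(lock_state(0x80000000u) == LOCK_WRITE); /* SPINLOCK_INIT_WRITE_LOCKED */
}
```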

The code for locking and unlocking uses the LDREX and STREX instructions, which guarantee exclusive access to the addressed memory. The code also uses a nice feature of ARM processors: conditional execution of instructions. Let’s look at the code; it assumes that register r0 points to the lock:

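    mov       r3, #0x80000000   @ r3 = value meaning "write-locked"
1:  ldrex     r2, [r0]          @ r2 = current lock value, mark exclusive access
    teq       r2, #0            @ is the lock free?
    wfene                       @ no: sleep until an event or interrupt arrives
    strexeq   r2, r3, [r0]      @ yes: try to store; r2 = 0 on success, 1 on failure
    teq       r2, #0            @ did we get the lock?
    bne       1b                @ no: start over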

There is only a single loop inside. When the function finishes, the spin lock is acquired in WRITE mode. How does it work? The LDREX instruction reads the lock value into register r2 and marks exclusive access to the addressed memory. The lock value is compared against zero. If the lock value is not zero, the WFE instruction is executed (please note the “ne” suffix). It puts the CPU to sleep until either an interrupt arrives or another core sends an event. If the lock value is zero, the WFE instruction is not executed at all. The next instruction is a conditional variant of STREX. It is executed only if the lock value equals zero (the spin lock is free; note the “eq” suffix after STREX). STREX stores register r3 at the address pointed to by register r0. If the write succeeds, i.e. exclusive access was still granted, register r2 is set to 0; if the write fails, r2 contains 1. Finally, register r2 is tested against 0 and, if it is not zero, we jump back and repeat.

Please note that in the second comparison r2 can contain one of three values:

  • 0, if STREXeq was executed and succeeded,
  • 1, if STREXeq was executed and failed,
  • 0x80000000, if the lock was already acquired and our CPU went to sleep (WFEne); if the lock was held in READ mode, r2 holds the non-zero reader count instead.

The last case means that the CPU has received either an event (from another CPU core when it released a spin lock) or an interrupt. In both cases the CPU will re-attempt to acquire the lock: it wakes up, STREXeq is not executed, 0x80000000 is compared against 0x00000000 and, since they are not equal, the CPU branches back. Nice, isn’t it?
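
For readers who do not speak ARM assembly, the same retry loop can be sketched in portable C11 atomics. This is not the AROS implementation: sketch_lock_t and the sketch_* functions are hypothetical names, a weak compare-and-exchange stands in for the LDREX/STREX pair (it may fail spuriously, much like STREX returning 1), and a plain busy-wait stands in for WFE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define WRITE_LOCKED 0x80000000u

/* Hypothetical type for this sketch, not the AROS spinlock_t. */
typedef struct { _Atomic uint32_t lock; } sketch_lock_t;

/* One pass of the loop: true when the lock was taken for writing. */
static inline bool sketch_try_write(sketch_lock_t *l)
{
    uint32_t expected = 0;   /* we only proceed if the lock is free */
    /* The weak CAS stores WRITE_LOCKED only if the lock still reads 0.
       It may fail spuriously, which the caller handles by retrying. */
    return atomic_compare_exchange_weak_explicit(
        &l->lock, &expected, WRITE_LOCKED,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is ours. The assembly sleeps in WFE while the
   lock is taken; this sketch simply retries. */
static inline void sketch_lock_write(sketch_lock_t *l)
{
    while (!sketch_try_write(l))
        ;   /* busy-wait */
}

/* Release a lock held in WRITE mode. */
static inline void sketch_unlock_write(sketch_lock_t *l)
{
    assert(atomic_load_explicit(&l->lock, memory_order_relaxed) == WRITE_LOCKED);
    atomic_store_explicit(&l->lock, 0, memory_order_release);
    /* On ARM, the releasing core would additionally send an event so
       that cores sleeping in WFE wake up and retry. */
}
```

A caller would wrap its critical section in sketch_lock_write() and sketch_unlock_write(). A second sketch_try_write() on a lock that is already held returns false, which corresponds to the "r2 is not zero" branch in the assembly.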

There is one more scenario to consider. What happens if an interrupt is triggered between LDREX and STREX? Well, in that case the AROS code needs to release the exclusive access, either by issuing a CLREX instruction (ARMv7 CPUs and up) or by issuing a dummy STREX to some arbitrary memory location. The interrupted code will then fail its STREX and re-attempt to obtain the spin lock.

Now that the locks have been added and are properly used, you can turn this:

[KRN:ide27 modbces 08_dritpri fl #s veacion nam00:
0 (0147300 7ff0)
[KRN:BCMnel8]ebcurce8
nif8 1or08# 1ls @ 0x50 "ex0c.
 iKRar C
e f81a03tr: p110 .2
 41N]expansi CPlib60ry01
 3 
 815f] Core105CP2 =600001bu
libra Cor
 80e f70: 100001
41RNutility.librarl"
e @ 0e500c:ec00
0x0001889a 8RN]41o"er2 .library@
b3e10 C1 e41 Bootstrad t.re @ 0x0"
[d0c: or9 2 cpu1 ontek.reizurc12
+ KR1a1Ca8e 2 cp 01tx @ 0pr00e3eb0
eKRurcCM
08] 0fm2948_ini4_c1r 43

into this:

[KRN:BCM2708] Initialising Multicore System
[KRN:BCM2708] bcm2708_init: Copy SMP trampoline from f800074c to 00002000 (100 bytes)
[KRN:BCM2708] bcm2708_init: Patching data for trampoline at offset 80
[KRN:BCM2708] bcm2708_init: Attempting to wake core #1
[KRN:BCM2708] bcm2708_init: core #1 stack @ 0x000b4380 (sp=0x000dc370)
[KRN:BCM2708] bcm2708_init: core #1 fiq stack @ 0x000dc390 (sp=0x000dd380)
[KRN:BCM2708] bcm2708_init: core #1 tls @ 0x000dd3a0
[KRN] Core 1 Boostrapping..
[KRN] Core 1 CPSR=600001d3
[KRN] Core 1 CPSR=60000193
[KRN] Core 1 TLS @ 0x000dd3a0
[KRN] Core 1 KernelBase @ 0x000b3ec0
[KRN] Core 1 SysBase @ 0x000b3200
[KRN] Core 1 Bootstrap task @ 0x000dd3c0
[KRN] Core 1 cpu context size 2124
[KRN] Core 1 cpu ctx @ 0x000dd460
[KRN:BCM2708] bcm2708_init_core(1)
[KRN] Core 1 operational
[KRN] Core 1 waiting for interrupts
[KRN:BCM2708] bcm2708_init: Attempting to wake core #2
...
2015
04.27

All your nightly are belong to us

Yay, I’ve killed all nightly builds. Sorry 😉

That was the short version. Last weekend I was busy removing some legal hacks from the AROS sources. The hack on the schedule was the commonly used ThisTask pointer in SysBase. Now, at least in my local branch of AROS for RaspberryPi, SysBase->ThisTask points to a nirvana place where all code is either happily crashing, or dead, or both. ThisTask points to NULL :)

No, it didn’t disappear completely. The ThisTask pointer has been moved to something similar to thread-local storage (and is used there). It is local, but not local to a thread. It is local to a CPU core. On the RPi2 we use four independent local storages and each of them has its own ThisTask pointer. Don’t hold your breath, it’s not SMP yet. Far from it :) The scheduler works only on CPU#0. At least for now.

The TLS is used exclusively by kernel.resource, which knows best about the low-level part of the system. Exec has gained two new architecture-specific macros, named GET_THIS_TASK and SET_THIS_TASK(x). On all other architectures they expand to SysBase->ThisTask; on RaspberryPi they expand to TLS_GET(ThisTask) and the equivalent TLS_SET. What about the rest of the AROS code? Well, there the only sane way to get ThisTask shall be used: the FindTask(NULL) call.

And here we come to the point where I killed all the nightlies. During my ThisTask removal fun I accidentally broke one macro in the AROSTCP network stack :) It should be fixed already.
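
To illustrate the idea, here is a minimal sketch of how such per-core macros could look. Only the macro names GET_THIS_TASK, SET_THIS_TASK, TLS_GET and TLS_SET come from the text above; the tls_t layout, the argument list of TLS_SET, the core_tls array and the select_core() helper are assumptions made for this example. Real code would locate the current core's storage through a CPU register rather than a global index.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Exec's struct Task; only here to make the sketch compile. */
struct Task { const char *name; };

/* Hypothetical per-core storage block. */
typedef struct { struct Task *ThisTask; } tls_t;

#define MAX_CORES 4
static tls_t core_tls[MAX_CORES];   /* one storage block per CPU core */
static unsigned current_core;       /* assumption: simulates "which core am I" */

/* Test helper for the sketch: pretend to run on core n. */
static inline void select_core(unsigned n)
{
    assert(n < MAX_CORES);
    current_core = n;
}

/* Assumed shape of the TLS accessors. */
#define TLS_GET(field)      (core_tls[current_core].field)
#define TLS_SET(field, val) (core_tls[current_core].field = (val))

/* The two Exec macros, expanding to the TLS accessors. */
#define GET_THIS_TASK       TLS_GET(ThisTask)
#define SET_THIS_TASK(x)    TLS_SET(ThisTask, (x))
```

With this layout every core sees its own ThisTask: setting it on core 1 leaves the value seen on core 0 untouched.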

2015
04.21

Hello Core 1, hello Core 2, Core 3 – wake up!

Porting AROS to RaspberryPi is a lot of fun, I have said that already. There’s also a lot of frustration, and you know that. This time because of the 4 CPU cores…

From the very beginning I noticed that the frame buffer was relatively slow. At least not as fast as I would expect from a nearly 1 GHz machine. Well, issue noted, and ignored at first. I carried on with the AROS port and came to the point where AROS was booting into the desktop and running programs. As a simple example I added Clock to the WBStartup folder, thus making this app start automatically once the system is up. Of course I had full debug enabled on the screen console and over the serial port.

Huh, it took AROS nearly 30 seconds to boot. Not bad, but it could be better for sure. The slow redrawing of the screen was worrying me but hey, we do have the simplest graphics driver ever. No acceleration, just a simple portion of memory filled pixel by pixel (with some help from our base graphics class of course). So far so good.

Then out of curiosity I decided to take a look at an old Raspberry Pi model I have on my desk. I booted it, looked at the Clock and went mad. The old Raspberry Pi with its ARM11 CPU booted in about 20 seconds. Two thirds of the RaspberryPi2 boot time! Can’t be, I thought. The new machine cannot be that bad, can it? Had I missed some cache setup? The frame buffer can’t be cached, right? Why was the Linux frame buffer console faster?

Finally I found a forum where Bare Metal guys were discussing their great efforts to develop standalone software for RaspberryPi. Luckily for me, one of them had an issue similar to mine. He also led me to the final solution. It turned out that the CPU cores of the RaspberryPi2 are not silently sleeping and waiting for an interrupt when start.elf transfers control to the ARM CPU. No, instead they are busy looping and polling registers, anxiously waiting to start and do some useful work. As you can imagine, polling is not a very efficient technique, rather the contrary. The additional CPU cores were stealing precious bus cycles, leaving fewer for CPU#0, which was actually running the AROS code. Eureka!

There are two solutions and I have found both of them to work with AROS. The first one is to extend the config.txt file (the file which is read and parsed by VideoCore). There, one has to add the following parameter:

 arm_control=0x1000

It forces the additional CPUs to go to sleep and wait for interrupts instead of busy looping. I tested it and it really helped. After adding that line AROS really flies on that tiny computer! The frame buffer refreshes quickly, the display redraws quickly, and a few demos redraw their windows nearly immediately. Boo! Now the machine not only feels faster than the old RPi, it actually is faster.

Just letting the additional CPUs sleep is good, but not something I liked very much. Sure, start.elf does a good job, but I wanted AROS to do that job. So I started to code :) I wrote a small assembly routine, a trampoline which initializes the caches and MMU of the woken-up core. The trampoline also initializes the supervisor stack and jumps to a routine in C code. At the moment the C routine is rather simple. It checks the CPU type, enables VFP and enters an endless wait-for-interrupt loop. Ah, the C routine of course babbles to the system log to let me know it is actually working. What I got was:

[KRN] Co]e o Co eUp ani idiwir igr rrutatuots
s
0a008

Uh. Not very readable. Forgot something? Ah yes, there is no locking in our bug() function, which means all cores were fighting over the serial line. Proper locking will come later, since it has to be done right; for now I have only added some delays. This is how it looks now:

[Screenshot, 2015-04-21 21:53:40]

Please note that the “Core x up and waiting” lines are each sent to the console by a different ARM core. It’s not SMP, not even AMP. It’s just a small initialization routine. But at least it works as expected…

And with the current setup AROS really flies on the RaspberryPi 2 😀

2015
04.18

Raspberry Pi

Eons ago I was involved in several ARM-related projects. One of them was to make a Linux-hosted port of AROS for ARM devices. Those were days full of fun and joy (if everything worked well) and frustration (if everything failed). After that my engagement in AROS dropped nearly to zero. There were, of course, some exceptions like improvements in memory management (TLSF support) or improvements in x86_64 AROS. But none of them were as low-level as I wished them to be.

Since we started to use some ARM-based embedded machines for our electronics at work, I have had some fun coding for them. Not really low level, but weird enough :) All this led me to the idea of buying an ARM platform and making a native AROS for it.

Even though there are better machines available, I decided to support RaspberryPi. One of the reasons was the availability of the rPi code in the AROS repository: our great developer Nick Andrews had already started a port of AROS for these machines and made great progress with it. Another reason, a very important one, is the huge community behind Raspberry.

So, the board, the RaspberryPi 2, has been bought :)

During the last weeks Nick and I had fun bringing the AROS port back into a usable state, rewriting it and improving it in many places. Code which initially did not work with rPi2 boards at all now boots equally well (or equally badly) on both rPi and rPi2 into Wanderer, the desktop environment of AROS. The kernel of our system is loaded at the virtual address 0xf8000000. The read-only portion of the kernel is MMU-protected against writes. All caches and write buffers are enabled. Slowly all bits and pieces are being improved and we are doing our best to get USB On-The-Go up and running. Having it would allow us to actually use AROS on these nice machines.

Meanwhile, I’m completing our small EABI library for ARM CPUs so that we can build the entire AROS with the gcc5 compiler. Well, fun :)

2015
04.10

Reboot

Over two years have passed since the last entry on this page. Two years only, but it feels like eons. I think it’s time to reactivate this blog :)

So, reboot…

2012
07.06

I think I will never understand that

This morning I was reviewing a small bit of code which, surprisingly, compiled for the i386 target just fine but failed for the ARM target. As always, the first thing I thought was “Oh no! That could be a variadic function!” and I was right, again.

But this time I was really surprised. The author of the code started off just fine:

#include <stdarg.h>
[...]

char * STDARGS GetKeyWord(int value, char *def, ...)
{
    [...]
    va_list va;
    [...]
    va_start(va, def);

And then, all of a sudden, the motivation for using stdarg passes away: va is cast to a LONG * type and the varargs are handled manually. Why oh why? Why does the coder use tons of casting where he could use a simple va_arg? Why string = *((char **) args) instead of string = va_arg(va, char *)? Why advance the args pointer by hand? Where is the missing va_end? I don’t know and I think I will never understand that.
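
For comparison, this is roughly what the portable stdarg way looks like. The function below is a made-up example, not the reviewed GetKeyWord: it assumes the variadic arguments are char * strings terminated by a NULL pointer and returns the one at index value, or def if the list is shorter than that.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example: pick the value-th string (0-based) from a
   NULL-terminated variadic list, falling back to def. */
static char *pick_keyword(int value, char *def, ...)
{
    va_list va;
    char *result = def;
    char *s;
    int i = 0;

    assert(value >= 0);

    va_start(va, def);
    /* va_arg fetches each argument and advances the list itself;
       no casts and no manual pointer arithmetic are needed. */
    while ((s = va_arg(va, char *)) != NULL) {
        if (i++ == value) {
            result = s;
            break;
        }
    }
    va_end(va);   /* every va_start needs a matching va_end */

    return result;
}
```

A call such as pick_keyword(1, "none", "red", "green", "blue", (char *)NULL) returns "green". The explicit (char *) cast on the terminator matters, because a bare NULL may be passed as an int on some platforms.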