Analyze low batch size timing #538

Open
ryan-summers opened this issue May 16, 2022 · 7 comments

@ryan-summers

Analyze the timing requirements when using the DMA sample acquisition architecture for ADC/DAC operations for low batch sizes (e.g. 1 or 2).

If possible, we may want to eliminate the Peripheral data -> RAM DMA operation, as this would eliminate processing overhead in the loop. Instead, for these low batch counts, the data can just be manually transacted with the peripherals directly.

ryan-summers changed the title from "Analyze Low batch size timing" to "Analyze low batch size timing" on May 16, 2022
@ryan-summers

ryan-summers commented May 16, 2022

[Figure: DS4_QuickPrint12, capture of the USART3 RX/TX debug lines]

The above capture was taken by toggling the USART3 RX/TX lines while using a batch size of 1.

  • TX was asserted at the start of the DSP process() function call and de-asserted at the end.
  • RX was asserted immediately before getting the ADC/DAC data buffers and servicing the DBM DMA transfer. It was then de-asserted immediately inside the closure that processes those buffers.
    • The second RX pulse is caused by RX being asserted immediately before data is transferred to the ethernet livestream. It is then de-asserted immediately after the DBM DMA transfer closure completes.

As can be seen, the whole DSP process takes approximately 1.9 µs, which corresponds to a maximum sampling rate of approximately 526 kHz. Of that, servicing the DBM DMA transfers for data requires about 420 ns.

If no DBM DMA transfer servicing were required, the existing livestream / DSP routines would require 1.48 µs, which corresponds to a maximum sampling rate of ~676 kHz. However, even without DBM DMA, some small amount of time would still be required to read/write the SPI peripheral data registers, so in reality the processing time would be slightly longer than 1.48 µs and the achievable rate slightly lower.
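As a quick check of the arithmetic above, here is a minimal sketch that converts a per-sample processing time into a maximum sampling rate. The function name `max_sample_rate_khz` is made up for illustration, and it assumes a batch size of 1, so one process() call per sample.

```rust
/// Maximum sampling rate in kHz if each sample needs `process_time_ns`
/// of processing (batch size 1: one process() call per sample).
fn max_sample_rate_khz(process_time_ns: f64) -> f64 {
    // 1 / (t in ns) is a rate in GHz; multiply by 1e6 to express it in kHz.
    1.0e6 / process_time_ns
}

fn main() {
    let total_ns = 1900.0; // measured duration of process()
    let dma_ns = 420.0; // measured DBM DMA servicing time

    println!(
        "with DMA servicing:    {:.0} kHz",
        max_sample_rate_khz(total_ns)
    );
    println!(
        "without DMA servicing: {:.0} kHz",
        max_sample_rate_khz(total_ns - dma_ns)
    );
}
```

The second figure is an upper bound, since direct SPI register access would still cost some time.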

Rough breakdown of time requirements within DSP processing for a batch size of 1:

pie title Process time breakout (Batch size = 1)
    "DSP Routines": 900
    "Get DMA Buffers": 440
    "Prepare livestream": 400
    "Update Telemetry": 120
    "Exit": 20
    "Entry": 120
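
To read the rough breakdown as shares of the total, a small sketch like the following sums the chart entries and prints each as a percentage. The values are copied from the chart and assumed to be in nanoseconds; the names `BREAKDOWN_NS`, `total_ns` and `share_percent` are made up for illustration.

```rust
/// Chart entries from the breakdown above, assumed to be in nanoseconds.
const BREAKDOWN_NS: [(&str, u32); 6] = [
    ("DSP Routines", 900),
    ("Get DMA Buffers", 440),
    ("Prepare livestream", 400),
    ("Update Telemetry", 120),
    ("Exit", 20),
    ("Entry", 120),
];

/// Sum of all chart entries.
fn total_ns(parts: &[(&str, u32)]) -> u32 {
    parts.iter().map(|&(_, ns)| ns).sum()
}

/// Share of `ns` in `total`, as a percentage.
fn share_percent(ns: u32, total: u32) -> f64 {
    100.0 * f64::from(ns) / f64::from(total)
}

fn main() {
    let total = total_ns(&BREAKDOWN_NS);
    for &(name, ns) in BREAKDOWN_NS.iter() {
        println!("{:<20} {:>5} ns {:>5.1} %", name, ns, share_percent(ns, total));
    }
    println!("{:<20} {:>5} ns", "Total", total);
}
```

The entries add up to 2000 ns, slightly more than the ~1.9 µs measured for the whole call, which is consistent with the breakdown being rough.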

@jordens

jordens commented May 16, 2022

Interesting. I seem to remember much less time for DSP. ~1000 insns is a lot. Might be worthwhile to check back against

stabilizer/src/main.rs

Lines 247 to 284 in 0fd442e

#[task(binds = SPI1, resources = [spi, iir_state, iir_ch], priority = 2)]
fn spi1(c: spi1::Context) {
    #[cfg(feature = "bkpt")]
    cortex_m::asm::bkpt();
    let (spi1, spi2, spi4, spi5) = c.resources.spi;
    let iir_ch = c.resources.iir_ch;
    let iir_state = c.resources.iir_state;

    let sr = spi1.sr.read();
    if sr.eot().bit_is_set() {
        spi1.ifcr.write(|w| w.eotc().set_bit());
    }
    if sr.rxp().bit_is_set() {
        let rxdr = &spi1.rxdr as *const _ as *const u16;
        let a = unsafe { ptr::read_volatile(rxdr) };
        let x0 = f32::from(a as i16);
        let y0 = iir_ch[0].update(&mut iir_state[0], x0);
        let d = y0 as i16 as u16 ^ 0x8000;
        let txdr = &spi2.txdr as *const _ as *mut u16;
        unsafe { ptr::write_volatile(txdr, d) };
    }

    let sr = spi5.sr.read();
    if sr.eot().bit_is_set() {
        spi5.ifcr.write(|w| w.eotc().set_bit());
    }
    if sr.rxp().bit_is_set() {
        let rxdr = &spi5.rxdr as *const _ as *const u16;
        let a = unsafe { ptr::read_volatile(rxdr) };
        let x0 = f32::from(a as i16);
        let y0 = iir_ch[1].update(&mut iir_state[1], x0);
        let d = y0 as i16 as u16 ^ 0x8000;
        let txdr = &spi4.txdr as *const _ as *mut u16;
        unsafe { ptr::write_volatile(txdr, d) };
    }
    #[cfg(feature = "bkpt")]
    cortex_m::asm::bkpt();
}
(Caveat: that was old hardware, I think.)
Ah. I think the big difference in DSP load is the signal generator.
Also, do generally use nightly and the cortex-m/inline-asm feature.
I've found DWT CYCCNT to be a nicer tool for these measurements than GPIO toggling. I think it could well have less overhead.
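
A cycle-counter measurement of the kind suggested here could be structured roughly as below. This is only a host-runnable sketch: the `CycleSource` trait, `measure` helper and `MockCounter` are hypothetical names, not part of stabilizer. On the target, `now()` would presumably read the DWT cycle counter (for example via the cortex-m crate's `DWT::cycle_count()`, after the counter has been enabled), which is an assumption about the setup rather than something shown in this thread.

```rust
use std::cell::Cell;

/// Hypothetical abstraction over a free-running 32-bit cycle counter.
/// On the target this would wrap a read of DWT CYCCNT (assumption).
trait CycleSource {
    fn now(&self) -> u32;
}

/// Elapsed cycles between two readings. `wrapping_sub` keeps the result
/// correct when the 32-bit counter overflows once between the readings.
fn elapsed_cycles(start: u32, end: u32) -> u32 {
    end.wrapping_sub(start)
}

/// Run `f` and return its result together with the cycles it took.
fn measure<C: CycleSource, R>(counter: &C, f: impl FnOnce() -> R) -> (R, u32) {
    let start = counter.now();
    let result = f();
    let end = counter.now();
    (result, elapsed_cycles(start, end))
}

/// Host-side stand-in that advances by a fixed step on every read.
struct MockCounter {
    value: Cell<u32>,
    step: u32,
}

impl CycleSource for MockCounter {
    fn now(&self) -> u32 {
        let v = self.value.get();
        self.value.set(v.wrapping_add(self.step));
        v
    }
}

fn main() {
    // Start close to overflow to exercise the wrapping subtraction.
    let counter = MockCounter {
        value: Cell::new(u32::MAX - 100),
        step: 360,
    };
    let (result, cycles) = measure(&counter, || 2 + 2);
    println!("result = {}, elapsed = {} cycles", result, cycles);
}
```

Compared with GPIO toggling, the elapsed counts are available in software, so they could be reported directly instead of being read off a scope capture.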

@ryan-summers

My calculations show the DSP section taking approximately 360 insns. The rest of the overhead here comes from the various other things we've put into the DSP routine, such as telemetry, signal generation, DMA servicing, etc.
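
The conversion between measured time and instruction counts used in this discussion can be sketched as follows. It assumes a 400 MHz core clock, which is inferred from the figures quoted in the thread rather than stated in it, and it treats one cycle as one instruction, which is only a rough approximation. `ASSUMED_CORE_CLOCK_MHZ` and `ns_to_cycles` are illustrative names.

```rust
/// Assumed core clock in MHz (inferred from the quoted figures, not stated).
const ASSUMED_CORE_CLOCK_MHZ: u64 = 400;

/// Convert a duration in nanoseconds to core clock cycles.
fn ns_to_cycles(ns: u64) -> u64 {
    // cycles = t[ns] * f[MHz] / 1000
    ns * ASSUMED_CORE_CLOCK_MHZ / 1000
}

fn main() {
    // "DSP Routines" slice of the breakdown.
    println!("900 ns  -> {} cycles", ns_to_cycles(900));
    // Whole process() call.
    println!("1900 ns -> {} cycles", ns_to_cycles(1900));
}
```

Under that assumption the 900 ns "DSP Routines" slice comes to 360 cycles, and the full 1.9 µs call to 760 cycles.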

@jordens

jordens commented May 16, 2022

The 1.9 µs you measure are about 760 insns for "DSP Routines". That doesn't include DMA servicing and telemetry, right?

@jordens

jordens commented May 16, 2022

Ah. No. The 1.9 µs you call "DSP process" is not "DSP routines".

@jordens

jordens commented May 16, 2022

Isn't signal generation part of "DSP Routines" in your measurement?

@ryan-summers

"DSP Routines" is inclusive of signal generation. It is the amount of time the closure on the ADCs/DACs runs:

// Start timer
(adc0, adc1, dac0, dac1).lock(|adc0, adc1, dac0, dac1| {
    // Stop & reset timer: the elapsed time is "Get DMA Buffers"
    // ... DSP and signal generation run here ...
});
// Stop & reset timer: the elapsed time is "DSP Routines"

telemetry.latest_adcs = [adcs[0][0], adcs[1][0]];
telemetry.latest_dacs = [dacs[0][0], dacs[1][0]];
// Stop timer: the elapsed time is "Update Telemetry"

I'll try to get a full diff just to show things. I want to rework it so these calculations just get reported via telemetry instead of manually probing debug pins as well.
