Mcl86jr correctness fixes #55

Open
mx-shift wants to merge 8 commits into MicroCoreLabs:master from mx-shift:mcl86jr-fixes

@mx-shift

A variety of Verilog and microcode fixes to match documented Intel 8088 behavior. The currently released version (with microcode V4) hangs before reaching TOPBENCH's main menu. Tracking down the cause led me to build a harness that runs MCL86jr under Verilator in instruction-level lock-step with an actual Intel 8088. Differences in cycle timing or undocumented behavior were preserved. Under that harness, I've tested up to 150M instructions of TOPBENCH, MSFS2, King's Quest 1, Cartridge BASIC, the Lotus 1-2-3 cartridge, and Turbo Pascal compilation.

mx-shift added 8 commits May 9, 2026 22:40

Real 8088 HALT is documented as one T1 phase (ALE pulse + S2:S0=011
status) and does NOT wait for READY. The original BIU runs HALT
through the normal data-bus states 0x02..0x0A, including the
READY-wait at state 0x07. There is no architectural reason for a host
platform to drive READY in response to a HALT cycle -- HLT is an
internal CPU state, not a bus transaction that needs acknowledgment --
so any HLT instruction risks stalling the BIU indefinitely on a board
that doesn't happen to assert READY anyway. Once the BIU is stalled,
it stops servicing PFQ refills and the EU's next fetch wedges.

Fix: in HALT state 0x18, set s_bits=3'b011 (HALT status) and signal
biu_done immediately, jumping straight to bus-idle at 0x0B without
asserting ALE or driving an address. This matches the documented 8088
HALT semantics ("just one T1").

Behaviorally identical on platforms whose bus would have ack'd HALT
quickly anyway -- the bus-cycle states for HALT didn't write or read
anything, they just consumed clock cycles.
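
The stall described above can be illustrated with a toy Python model. This is only a sketch: the state numbers 0x02..0x0A, 0x07 and 0x18 come from this commit message, while the function name, the `ready_by_clock` callback, and the one-state-per-clock timing are assumptions made for illustration and are not taken from the RTL.

```python
def run_halt_cycle(ready_by_clock, fixed, max_clocks=64):
    """Toy model of one HALT cycle in the BIU.

    Returns the number of clocks until the BIU is back at bus-idle,
    or None if it is still stalled after max_clocks.
    ready_by_clock is a hypothetical stand-in for the READY pin:
    a callable taking the clock number and returning True/False.
    """
    if fixed:
        # Fixed path: state 0x18 drives S2:S0=011, signals biu_done,
        # and goes straight to bus-idle without sampling READY.
        return 1

    # Original path: walk the normal data-bus states 0x02..0x0A
    # (assumed here to take one clock per state).
    state = 0x02
    for clock in range(1, max_clocks + 1):
        if state == 0x07 and not ready_by_clock(clock):
            continue  # READY-wait: stay in 0x07 until READY is seen
        if state == 0x0A:
            return clock
        state += 1
    return None  # never got past the READY-wait
```

With a board that never asserts READY for HALT, `run_halt_cycle(lambda c: False, fixed=False)` returns `None`, while the fixed path completes regardless of READY.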

JMP FAR / CALL FAR / RET FAR microcode strobes the BIU twice: first an
"update CS" strobe (eu_biu_strobe=2'b11, code 3'h2), then the actual
"jump request" (eu_biu_req_code=5'h19) which atomically updates
pfq_addr_in to the new IP. Real 8088 microcode treats CS+IP as a single
update from the BIU's point of view; MCL86jr's microcode strobes them
separately, and without staging the BIU briefly sees new CS + old
pfq_addr_in. Any prefetch dispatched in that window forms an address
as (new_CS << 4) + old_pfq_addr_in -- a phantom fetch that has no
analogue on real 8088. Visible at the PCjr reset vector as a fetch at
F000:0005 (new CS=F000, pfq_addr_in was 5 from FFFF4 increment) instead
of either the real-8088 prefetch overshoot at FFFF:0005 or the clean
jump to F000:0043.

Stage the CS update. EU strobe code 3'h2 now writes
biu_register_cs_staged + sets cs_update_staged; biu_register_cs itself
stays at its old value. The matching JMP request commits both
atomically on the same clock edge:

  if (eu_biu_req_caught==1'b1 && eu_biu_req_code==5'h19) begin
      pfq_addr_in <= eu_register_r3_d;
      if (cs_update_staged) begin
          biu_register_cs  <= biu_register_cs_staged;
          cs_update_staged <= 1'b0;
      end
  end

Side effect: prefetches dispatched in the gap between CS strobe and
JMP request now see the OLD CS, producing the real-8088 prefetch
overshoot at (old_CS << 4) + old_pfq_addr_in -- the byte gets discarded
after the JMP-driven flush, just like real hardware.

Visible at the realistic 110 MHz : 4.77 MHz core_clk:CLK ratio
(matches the FPGA PLL); a smaller ratio masks the bug because the
in-flight prefetch has time to complete on the old CS before the EU
finishes JMP decode.
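
The three addresses mentioned above can be reproduced with a small Python model. It is a sketch under stated assumptions: `BiuModel` and its method names are hypothetical and only mirror the CS / pfq_addr_in handling described in this commit message, not the actual Verilog.

```python
def linear(cs, offset):
    """20-bit 8088 physical address for a segment:offset pair."""
    return ((cs << 4) + offset) & 0xFFFFF


class BiuModel:
    """Hypothetical, heavily reduced model of the BIU's CS and pfq_addr_in."""

    def __init__(self, cs, pfq_addr_in, staged):
        self.cs = cs
        self.pfq_addr_in = pfq_addr_in
        self.staged = staged      # True models the fix, False the old behavior
        self.cs_staged = None     # stands in for biu_register_cs_staged

    def strobe_update_cs(self, new_cs):
        """First strobe from the far-jump microcode ("update CS")."""
        if self.staged:
            self.cs_staged = new_cs   # held back until the jump request
        else:
            self.cs = new_cs          # old behavior: visible immediately

    def jump_request(self, new_ip):
        """Second strobe ("jump request"): commits IP and any staged CS together."""
        self.pfq_addr_in = new_ip
        if self.cs_staged is not None:
            self.cs = self.cs_staged
            self.cs_staged = None

    def prefetch_address(self):
        """Address a prefetch dispatched right now would use."""
        return linear(self.cs, self.pfq_addr_in)
```

For the PCjr reset case (CS=FFFF, pfq_addr_in=5, jump to F000:0043), the unstaged model forms its in-between prefetch address from the new CS, and the staged model forms it from the old CS; both end up at the same jump target once the jump request lands.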

Standalone Python tool that parses the `p`-line triplet format used by
Microcode_MCL86.txt and emits a Xilinx .coe
(memory_initialization_vector) file or a Verilog $readmemh .mem file.

Usage:
  python3 MCL86/Core/ucode_assembler.py assemble Microcode_MCL86.txt \
      --coe MCL86_Microcode_Xilinx_Version_<N>.coe
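
For readers unfamiliar with the output format, a Xilinx .coe file is plain text: a radix line followed by a comma-separated initialization vector terminated by a semicolon. The sketch below shows only that emission step. It does not parse the `p`-line format, and the function name and the default 32-bit word width are assumptions for illustration rather than details of ucode_assembler.py.

```python
def format_coe(words, width_bits=32):
    """Render a list of integer ROM words as Xilinx .coe text (hex radix).

    width_bits is an assumed word width; it only controls zero padding.
    """
    digits = (width_bits + 3) // 4  # hex digits needed per word
    header = "memory_initialization_radix=16;\nmemory_initialization_vector=\n"
    body = ",\n".join(f"{w:0{digits}X}" for w in words)
    return header + body + ";\n"
```

The returned string can be written directly to the path given with `--coe`.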

The original procedures stashed the segment-override prefix bit in
eu_flags bit 4 (AF) across the BIU operation. That corrupts the
architectural AF on every string-op call site -- the 8086 spec says
string moves and loads preserve flags, but every MOVS/LODS/CMPS would
clear AF (path B, no override) or force-set it to 1 (path A,
override). Manifests during
PCjr BIOS POST: after `INC DI 0x4F->0x50` sets AF=1, the next MOVSW
iteration runs STORE_SEG_PREFIX path B and clears AF.

Switch to r2 as the scratch -- it's dead during string ops since the
INC16/DEC16 flag-processing path that uses r2 isn't called between
STORE and RESTORE. (r0 was tried first but conflicted with CMPSB/CMPSW
which load r0 with the source byte at 0x0645 right before the
STORE_SEG_PREFIX call at 0x0646.)
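
The AF corruption can be shown with a few lines of Python. This is a sketch only: the function names are hypothetical, and the model reduces STORE_SEG_PREFIX to its effect on FLAGS bit 4, which is the one detail taken from this commit message.

```python
AF = 1 << 4  # auxiliary-carry flag, bit 4 of FLAGS


def inc16_af(value):
    """AF after INC: set when the low nibble carries out of bit 3."""
    return (value & 0xF) == 0xF


def stash_in_flags(flags, override):
    """Old scheme: the override bit overwrites FLAGS bit 4 (AF)."""
    return (flags | AF) if override else (flags & ~AF)


def stash_in_scratch(flags, override):
    """New scheme: FLAGS untouched, override bit kept in a scratch value
    (r2 in the microcode). Returns (flags, scratch)."""
    return flags, int(override)
```

After `INC DI` from 0x4F to 0x50, `inc16_af(0x4F)` is true and AF is set; passing those flags through `stash_in_flags(flags, False)` clears AF, whereas `stash_in_scratch` returns the flags unchanged.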

Real 8086 delays interrupts by one instruction only after MOV/POP to
SS (atomic stack reload paired with MOV SP, ...). MCL86jr's microcode
applied that delay to all segment-register loads -- POP ES / POP CS /
POP DS / MOV ES / MOV CS / MOV DS -- by routing them through the
legacy 0x0470-0x0478 ending whose final jump goes to 0x0011 (skipping
the 0x0006-0x000B interrupt-poll section of the main fetch loop). That
made the EU silently swallow any hardware interrupt that arrived
during one of these instructions.

Fix: split the common ending into two:
  * 0x0470-0x0478 (legacy): final jump to 0x0011 (skip INT). Used by
    POP SS and MOV SS to preserve the architecturally-required
    1-instruction interrupt delay.
  * 0x0FC7-0x0FCF (new): mirror with final jump to 0x0006 (with INT
    poll). POP ES/CS/DS handlers redirect their final jumps from
    0x0470 to 0x0FC7. MOV ES/CS/DS reach the new tail via 0x0D63's
    redirected jump (0x0474 -> 0x0FCB, the PFQ-wait entry of the new
    tail). MOV SS bypasses 0x0D62/0x0D63 entirely by jumping from
    0x0D6B to a new SS-specific debounce + skip-INT path at
    0x0FD0-0x0FD1.

LES/LDS (0xC4/0xC5) already jumped to 0x0004 with INT poll, so they
needed no change.
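
The before/after routing can be summarized as a tiny Python predicate. It is a sketch, not a description of the microcode itself: the instruction list comes from this commit message, while the function name and the string-based instruction naming are assumptions for illustration. It ignores IF and everything else that gates interrupt recognition.

```python
SEGMENT_LOADS = (
    "POP ES", "POP CS", "POP SS", "POP DS",
    "MOV ES", "MOV CS", "MOV SS", "MOV DS",
)


def polls_interrupts(instruction, fixed):
    """True if the instruction's ending re-enters the fetch loop through
    the interrupt-poll section, False if it skips the poll."""
    if instruction not in SEGMENT_LOADS:
        raise ValueError(f"not a segment-register load: {instruction}")
    if not fixed:
        return False                      # old: every segment load skipped the poll
    return not instruction.endswith("SS") # new: only SS loads keep the delay
```

With `fixed=True` only `POP SS` and `MOV SS` still skip the poll, which is the one-instruction delay the architecture requires.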

Real 8086 byte-2 rule: byte_2_linear = seg*16 + ((offset+1) & 0xFFFF).
The offset wraps within the segment; the linear address does not. Two
cases need to be handled correctly:

  - byte_1 at any 0x?FFFF linear address with a non-paragraph-aligned
    segment: byte_2 needs the carry past bit 15 to propagate through
    the upper 4 bits of addr_out_temp. Hit by code that does word
    reads across a 64KB linear boundary using a non-paragraph-aligned
    segment (e.g. LODSW at DS=4EFE SI=101F crosses 0x4FFFF/0x50000).
  - offset == 0xFFFF: byte_2 wraps back to seg*16 (offset 0 in same
    segment). The PCjr BIOS hits this on end-of-segment word reads
    with paragraph-aligned segments; a naive 20-bit linear+1 increment
    would land in the next paragraph instead.

Implementation: latch byte-1's offset (= eu_register_r3_d[15:0]) into
a new biu_byte1_offset register at every BIU dispatch path, then at
the byte-2 increment in state 0x0A (and the parallel SRAM fast path in
state 0x16):

    if (biu_byte1_offset == 16'hFFFF)
      addr_out_temp <= addr_out_temp - 20'h0FFFF;  // back to seg*16
    else
      addr_out_temp <= addr_out_temp + 1'b1;       // full 20-bit
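
A quick Python check that the incremental form above agrees with the byte-2 rule. This is a sketch: the two function names are hypothetical, and the 20-bit mask assumes the 8088's 1 MB physical address space.

```python
ADDR_MASK = 0xFFFFF  # 20-bit physical address bus


def byte2_reference(seg, offset):
    """Byte-2 rule as stated: the offset wraps at 16 bits, the linear sum does not."""
    return ((seg << 4) + ((offset + 1) & 0xFFFF)) & ADDR_MASK


def byte2_incremental(seg, offset):
    """Same result computed the way the BIU does it: start from the
    byte-1 address and adjust it using the latched byte-1 offset."""
    addr_out_temp = ((seg << 4) + offset) & ADDR_MASK  # byte-1 address
    if offset == 0xFFFF:
        return (addr_out_temp - 0xFFFF) & ADDR_MASK    # back to seg*16
    return (addr_out_temp + 1) & ADDR_MASK             # full 20-bit increment
```

For the LODSW example (DS=4EFE, SI=101F) both functions give 0x50000, and for an end-of-segment read at F000:FFFF both give 0xF0000.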

Real 8086 only delays interrupt recognition after STI / MOV SS / POP SS.
MCL86jr was also deferring after IRET, which broke the BIOS timer ISR's
return path: IRET pops FLAGS (with IF=1), and an 18.2 Hz tick that
lands during the return from the ISR should be taken immediately.
MCL86jr was skipping that poll, so the tick was deferred until the next
real new_instruction edge.

Two pieces needed fixing in tandem:

1. Microcode (Microcode_MCL86.txt). IRET's terminating jump went to
   0x0474 (the legacy skip-INT path used by POP/MOV SS). Redirected to
   the with-INT-poll ending at 0x0FCB.

2. EU (eu.v). intr_enable_delayed only updates on new_instruction
   edges, so even with the with-INT-poll path the IRET microcode's
   post-FLAGS-load INT poll would still see IF=0 from before the
   restore. eu.v now detects the "OR r0 into FLAGS" microcode step
   (at ROM addr 0x0532; BRAM has 1-cycle synchronous read so the
   detection is gated on eu_rom_address == 0x0533) and syncs
   intr_enable_delayed in-cycle from bit 9 of the ALU output. Must
   come BEFORE the eu_flag_i==0 short-circuit because eu_flag_i is
   the OLD value at this point.

Caught running King's Quest I on PCjr under the cosim: the BIOS timer
ISR IRETs to user code, and the next 18.2 Hz tick races the return.
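
The second piece, the stale interrupt-enable value, can be sketched in Python. Only the bit position of IF (bit 9 of FLAGS) and the "sync on FLAGS load" idea come from the text above; the function and parameter names are hypothetical, and the model ignores the microcode addresses and the BRAM read latency.

```python
IF_BIT = 1 << 9  # interrupt-enable flag, bit 9 of FLAGS


def iret_poll_takes_interrupt(if_before_iret, popped_flags,
                              intr_pending, sync_on_flags_load):
    """Would the INT poll at the end of IRET recognize a pending interrupt?"""
    # Without the fix, intr_enable_delayed still holds the value sampled
    # at the previous new_instruction edge, i.e. IF from inside the ISR.
    intr_enable_delayed = bool(if_before_iret)
    if sync_on_flags_load:
        # With the fix, it is refreshed from the FLAGS value IRET just loaded.
        intr_enable_delayed = bool(popped_flags & IF_BIT)
    return bool(intr_pending) and intr_enable_delayed
```

Inside an ISR IF is normally 0, and the FLAGS image popped by IRET has IF=1, so the poll only sees the pending tick when the sync is enabled.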

AAM (0xD4) and AAD (0xD5) microcode populated alu_last_result with
the full AX register and called All_Flags_WORD, which uses the 16-bit
domain for SF (bit 15) and ZF (whole word). Per Intel docs, AAM/AAD
set SF, ZF, PF on AL only (CF, OF, AF are undefined).

For AAM, the post-instruction AX has AH=quotient, AL=remainder. When
AL=0 but AH!=0 (e.g. AL_pre=0x50, AAM 10 -> AX=0x0800), All_Flags_WORD
saw alu_last=0x0800 != 0 and left ZF=0. Real 8086 sets ZF=1 because
AL=0. Caught by Lotus 1-2-3 cosim at op=4901806 in CS:IP F000:19CB
area; worker pushed ZF=0 (CF=1, AF=1 -- undefined and masked) while
host had ZF=1.

For AAD the body already zeroes AH, so AX AND 0xFF == AX and the
flag domain doesn't actually matter in practice -- but fix it
consistently so the SF check is on bit 7 of AL, not bit 15 of AX,
and the microcode reads idiomatically.

Both ops now AND AX with 0x00FF into alu_last_result and call
CALCULATE_FLAGS_BYTE (the byte-domain path used by OR/AND/XOR).

Assembled output ships as Version_8.coe in both MCL86/Core and
MCL86jr/FPGA/src4synth (4-word diff vs Version_7 at ROM addresses
0x2BF / 0x2C0 / 0x2D7 / 0x2D8 -- the AAM and AAD microcode entries).
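
The AAM case from the Lotus 1-2-3 trace can be replayed with a short Python sketch. The function names are hypothetical; `flags_word` and `flags_byte` only mimic the word-domain and byte-domain flag derivation described above and are not a transcription of All_Flags_WORD or CALCULATE_FLAGS_BYTE.

```python
def parity_even(value):
    """PF: set when the low byte has an even number of 1 bits."""
    return bin(value & 0xFF).count("1") % 2 == 0


def aam(al, base=10):
    """AAM: AH = AL // base, AL = AL % base. Returns the resulting AX.
    (base=0 raises ZeroDivisionError here; real hardware raises a divide error.)"""
    ah, al_new = divmod(al & 0xFF, base)
    return (ah << 8) | al_new


def flags_word(ax):
    """Old behavior: SF and ZF derived from the whole 16-bit AX."""
    return {"ZF": ax == 0, "SF": bool(ax & 0x8000), "PF": parity_even(ax)}


def flags_byte(ax):
    """New behavior: AX is masked with 0x00FF first, so flags reflect AL only."""
    al = ax & 0x00FF
    return {"ZF": al == 0, "SF": bool(al & 0x80), "PF": parity_even(al)}
```

With AL=0x50, `aam(0x50)` gives AX=0x0800; the word-domain flags leave ZF clear because AH is non-zero, while the byte-domain flags set ZF because AL is zero.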