integer multiplications on IA32 architecture.
wkwai
post Aug 6 2003, 14:24
Post #1


MPEG4 AAC developer


Group: Developer
Posts: 398
Joined: 1-June 03
Member No.: 6943



Hi,


I am used to assembly language programming on the Pentium processor generation (166-200 MHz MMX). I noticed that operations like int16 and int32 multiplications / divisions used to take as long as 20 clock cycles to complete a single instruction. However, I noticed that on a Celeron processor (using the VTune 7.0 evaluation kit from Intel's website) it takes only 1 clock cycle to execute. Could anyone verify this? In the past, we would use a combination of shift and add operations to implement integer multiplications / divisions.
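
For reference, the shift-and-add technique mentioned above can be sketched in C roughly as follows. This is only a minimal illustration: the names mul_shift_add and mul10 are made up for this example and do not come from any library, and the result is taken modulo 2^32, like the low half of a 32-bit multiply.

```c
#include <stdint.h>

/* Hypothetical helper: multiply a by b using only shifts and adds.
 * The result wraps modulo 2^32. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)      /* lowest bit of b is set: add the current multiple of a */
            result += a;
        a <<= 1;         /* next multiple of a (a * 2) */
        b >>= 1;         /* move on to the next bit of b */
    }
    return result;
}

/* Multiplying by a known constant unrolls into a fixed sequence:
 * x * 10 = x * 8 + x * 2 */
uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Whether such a sequence beats a plain multiply depends entirely on the CPU, which is what the rest of this thread is about.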


wkwai
Diocletian
post Aug 10 2003, 10:40
Post #2





Group: Members
Posts: 45
Joined: 11-October 02
Member No.: 3517



QUOTE (wkwai @ Aug 10 2003, 12:15 PM)
QUOTE (NumLOCK @ Aug 8 2003, 07:23 AM)
By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics


I think those instructions do not exist on the Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128 bit registers. That is what I noticed from the latest Intel Programmers guide.

I wonder how much performance gain a 64 bit processor has over the IA32 architecture? It seems to me that most of the internal floating point operations of the IA32 architecture are already 64 bit operations?

When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

The 64 bit FMUL instructions have nothing to do with the 64 bit IMUL instructions on IA64
or x86-64. The main advantage of a 64 bit CPU is that it can work with more memory, or with
more fragmented memory:
- you can work with more than 1.5 GB of memory per process
- you don't have to worry about fragmentation of the virtual address space
- you can map files to memory
- you can build sparse structures in memory, so that a lot of work is done
in hardware rather than in software
-----------------------------------

The rules about optimization which you find in books and in brains are typically 10 or more
years old, are COMPLETELY out of date, and are often able to deoptimize programs.

Evaluating the speed of current CPUs is easier than evaluating the speed of 10 year old CPUs,
because in modern CPUs decoding and execution are nearly completely decoupled.
This was not the case for CPUs like Pentium, Pentium MMX and AMD K5, where
predicting the calculation speed was a pain.

On a modern CPU, the execution time of code whose instructions and data are completely in
the L1 cache can be characterized by two parameters:

- Latency (the time from the input to the output register)
- Throughput (the average time per instruction when many independent instructions are executed back to back)
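
To make the difference between the two parameters concrete, here is a minimal C sketch. The function names are hypothetical and chosen only for this example. Both functions compute the same product modulo 2^32, but the first forms one long dependency chain, while the second keeps two independent chains that a pipelined multiplier can overlap. No timings are claimed here; how much the second form helps depends on the CPU and on what the compiler does with the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: every multiply needs the previous result,
 * so the loop is limited by the multiply LATENCY. */
uint32_t product_one_chain(const uint32_t *v, size_t n)
{
    uint32_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= v[i];
    return p;
}

/* Two independent chains: the multiplies into p0 and p1 do not depend on
 * each other, so the loop can get closer to the multiply THROUGHPUT. */
uint32_t product_two_chains(const uint32_t *v, size_t n)
{
    uint32_t p0 = 1, p1 = 1;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        p0 *= v[i];
        p1 *= v[i + 1];
    }
    if (i < n)           /* odd element count: fold in the last element */
        p0 *= v[i];
    return p0 * p1;      /* combine the two partial products */
}
```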

The ratio Latency/Throughput is typically an integer, which can be interpreted as the number
of execution pipelines. The execution time of the mul32 instruction:

- i386: Depending on the number of significant bits in the second operand: 6...37 clocks
- i486: Depending on the number of significant bits in the second operand: 9...40 clocks
- Pentium/Pentium MMX: 11 clocks (fixed)
- K6: 2 clocks, a 3rd clock for the upper 32 bits
- Athlon: 5 clocks (throughput: 2.5 clocks)
  but: operand in memory: 4 clocks (throughput: 2 clocks)
- Pentium II: 4 clocks (throughput: 4 clocks) (?)
- Pentium 4: 14 clocks (throughput: 5.67 clocks)
  operand in memory: 18 clocks (throughput: 6 clocks)
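
As a rough illustration of how these two figures combine, the following C sketch uses a common simplified model, which is an assumption of this example and not something measured: n independent instructions take about latency + (n - 1) * throughput clocks when everything is in L1. The function names are made up for the example.

```c
/* Simplified model (an assumption, not a measurement): n independent
 * instructions with the given latency and throughput, both in clocks. */
double estimate_clocks(double latency, double throughput, unsigned n)
{
    if (n == 0)
        return 0.0;
    return latency + (double)(n - 1) * throughput;
}

/* The Latency/Throughput ratio described above. */
double latency_throughput_ratio(double latency, double throughput)
{
    return latency / throughput;
}
```

With the Athlon figures from the list (5 and 2.5 clocks), 100 independent multiplies come out at 5 + 99 * 2.5 = 252.5 clocks under this model, and the ratio is 2.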

Pentium 4 is much slower than the Pentium II/III or the K6. Even shl doesn't help, because
it is also very slow:

- shl reg,n: 4 clocks

What is fast, though:

- add reg1, reg2: 0.5 clocks
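
Taking the two figures above at face value, a small shift can be rewritten as a chain of additions. The C sketch below only shows the idea; the function name is hypothetical, and a compiler is free to turn the additions back into a shift, so in practice this would be written directly in assembly.

```c
#include <stdint.h>

/* x << 3 expressed as three dependent additions.
 * The result wraps modulo 2^32, exactly like the shift. */
uint32_t shl3_by_adds(uint32_t x)
{
    x += x;   /* x * 2 */
    x += x;   /* x * 4 */
    x += x;   /* x * 8 */
    return x;
}
```

With the numbers quoted here, three dependent adds cost about 3 * 0.5 = 1.5 clocks, compared with 4 clocks for the shl.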

MMX on Pentium 4 is also slower than on the Pentium MMX/II/III, because there's only
ONE MMX pipeline instead of two. The Pentium 4 is clock speed optimized, not speed optimized.
A lot of these changes are to allow high clock speeds. In the first P4 stepping
there were additional serious penalties for misaligned memory accesses, which dropped the
speed down to Pentium MMX levels.


--------------------
Diocletian

Time Travel Agency
Book a journey to the Diocletian Palace. Not today!