Dell EMEA www.euro.dell.com 



Intel® Pentium® 4 Processor
December 2000

Rick Schuckle, Technology Strategist

The new Intel Pentium 4 processor is the latest generation of the Intel IA-32 architecture. Formerly code-named "Willamette," the Pentium 4 processor introduces significant architecture advances over the previous 32-bit P6 processor family consisting of the Pentium Pro, Pentium II, and Pentium III processors. This article begins with a review of the evolution of the IA-32 architecture and then discusses the software and hardware architectural changes that the Pentium 4 processor family brings to desktop, workstation, and server systems.

IA-32 Evolution

The IA-32 architecture essentially began with the 80386 processor. Although the basic set of 32-bit instructions has remained the same, there have been architecture changes and instruction-set enhancements along the way. The 80486 processor added an internal cache, instruction pipelining, and an integrated math coprocessor.

Next, the Pentium processor introduced a superscalar microarchitecture that allowed it to execute multiple instructions in parallel. In addition, Intel split the internal level 1 (L1) cache of the Pentium processor into separate instruction and data caches to improve code performance.

The Pentium Pro introduced the concept of a backside level 2 (L2) cache that removed many of the bandwidth and latency limitations of an L2 cache on the frontside bus (FSB) as seen in prior processors. In addition, the Pentium Pro featured the ability to convert complex instruction set computer (CISC) instructions into micro-ops, which were then executed on a reduced instruction set computer (RISC) core. The Pentium Pro also contained more execution units to extend its superscalar capabilities and a longer pipeline to increase the frequency above that of previous generations.

Intel released an enhanced version of the Pentium processor, the Pentium processor with MMX™ technology, in 1997. The Pentium MMX incorporated multimedia extensions (MMX) to the basic IA-32 instruction set. This allowed software developers to perform more digital signal processing (DSP)-like functions on the processor to improve graphics and sound capabilities.

The Intel Pentium II processor improved on the Pentium Pro by changing from a multichip module (MCM) to a single-edge contact cartridge (SECC). The SECC allowed Intel to move the P6 family into mass production. At this point, MMX technology was also introduced in the P6 processor family.

The Pentium III processor added Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) to the P6 family. SIMD operations allow code developers to perform identical operations on multiple pieces of data in parallel. This capability allows many iterative calculations to be performed simultaneously, reducing the overall execution time. SSE added 68 new instructions, including 45 new floating-point operations, 11 SIMD integer instructions, and 5 cache-management instructions.

Pentium 4 Processor NetBurst™ Microarchitecture

The Pentium 4 processor's "NetBurst" microarchitecture enables significant hardware and software advances over previous IA-32 processors. This new microarchitecture allows greater scalability and internal performance enhancements over the current Pentium III architecture.

Hardware Architectural Changes

The Pentium 4 processor is initially targeted for the same 0.18-micron process technology used for the Pentium III, but will be migrated to future process technologies as they become feasible. The hardware changes include:

  • Hyper Pipelined Technology
  • Advanced Dynamic Execution
  • Execution Trace Cache
  • High-performance data cache
  • L2 advanced transfer cache
  • Rapid execution engine
  • High-bandwidth 400-MHz system bus

Hyper Pipelined Technology

Hyper Pipelined Technology describes the Pentium 4 processor internal pipeline, which has been extended from the 10-stage pipeline in the Pentium III to a 20-stage pipeline shown in Figure 1. Increasing the number of pipeline stages reduces the number of gates per stage, which allows a higher core frequency and greater frequency scalability for the future. (See sidebar, "Why Can Longer Pipelines Achieve Higher Frequencies?")

Figure 1. New Pentium 4 Processor 20-Stage Pipeline

Why Can Longer Pipelines Achieve Higher Frequencies?

A typical pipeline has a fixed amount of work that is required to decode and execute an instruction. This work is performed by individual logical operations called "gates." Each logic gate consists of multiple transistors. By increasing the stages in a pipeline, fewer gates are required per stage. Because each gate requires some amount of time (delay) to provide a result, decreasing the number of gates in each stage allows the clock rate to be increased. A longer pipeline also allows more instructions to be "in flight," that is, at various stages of decode and execution in the pipeline. These benefits are offset somewhat by the overhead of additional gates required to manage the added stages.

The following rule-of-thumb calculation estimates the maximum frequency that can be achieved by a pipeline in an equivalent silicon production process:

1/(pipeline time in ns/number of stages) * 1,000 (to convert to megahertz) = maximum frequency

Accordingly, the maximum frequency achievable by a five-stage, 10-ns pipeline is:

1/(10/5) * 1,000 = 500 MHz

In contrast, a 15-stage, 12-ns pipeline can achieve:

1/(12/15) * 1,000 = 1,250 MHz or 1.25 GHz

Additional frequency gains can be achieved by changing the silicon process and/or using smaller transistors to reduce the amount of delay caused by each gate.
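
The rule-of-thumb calculation above can be sketched in a few lines of Python. The function name and parameter names are illustrative only and do not come from Intel; the sketch simply restates the formula and reproduces the two worked examples.

```python
def max_frequency_mhz(pipeline_time_ns: float, stages: int) -> float:
    """Rule-of-thumb frequency ceiling for a pipeline.

    Assumes the total pipeline time is divided evenly across the stages,
    so one clock period equals the delay of a single stage.
    """
    stage_delay_ns = pipeline_time_ns / stages
    # 1 / (delay in ns) gives gigahertz; multiply by 1,000 for megahertz.
    return 1.0 / stage_delay_ns * 1000.0


# The two examples from the sidebar: roughly 500 MHz and 1,250 MHz.
print(round(max_frequency_mhz(10, 5)))
print(round(max_frequency_mhz(12, 15)))
```

Note that the estimate ignores the per-stage management overhead mentioned earlier, so real designs would fall somewhat short of these figures.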

Increasing the number of stages can impact performance because there is additional overhead associated with the added stages. There is also a higher performance penalty if the processor incorrectly predicts a code sequence for execution. These mispredictions require the pipeline to be restarted. To mitigate performance penalties, Intel has implemented Advanced Dynamic Execution architecture, an enhanced version of the Dynamic Execution architecture introduced in the P6 family of processors. In addition, the Pentium 4 processor includes the new Execution Trace Cache.

  • Advanced Dynamic Execution: The Dynamic Execution architecture blends out-of-order speculative execution with hardware register renaming and branch prediction. Advanced Dynamic Execution extends some of the basic features found in the P6 core. Branch prediction accuracy is improved by a new algorithm that is the most advanced of Intel's currently disclosed branch prediction schemes. The design also includes a 4K branch predictor table, which is eight times larger than that included in the P6 family of processors. Intel estimates that the new branch prediction will reduce mispredictions by one-third.

    In addition, the Pentium 4 processor has improved speculative execution capabilities. The processor features deeper speculation consisting of 126 instructions in flight, with 48 loads and 24 stores. This allows the processor to look ahead 126 instructions to find tasks to perform in parallel. The increased number of instructions allows the processor to schedule available resources more efficiently and to improve execution parallelism.

  • Execution Trace Cache: Instead of a typical L1 instruction cache that contains the same information that is stored in memory, the Execution Trace Cache contains a decoded version of the instructions known as micro-ops (uops). By storing the uops in the cache, cache hits can begin execution of the instructions much sooner because they have already been decoded. In addition, branch predictions are performed before placing instructions in the cache, so that instructions in the predicted execution path can be loaded in expected execution order. This reduces any latency associated with the branch predictions. Finally, in the case of a misprediction, the recovery time is reduced when the corrected path is in the cache, because only a subset of the full 20-stage pipeline must be reloaded. The processor can retrieve the decoded instructions, rather than having to use the full pipeline to decode and execute the corrected execution path.
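
The trade-off between a deeper pipeline and better branch prediction can be illustrated with a simple cost model. This is a rough sketch, not an Intel formula: it assumes that each misprediction costs a fixed number of stall cycles, and every numeric input below (branch fraction, misprediction rates, penalties) is a hypothetical value chosen only to show the shape of the calculation.

```python
def mispredict_stall_cpi(branch_fraction: float,
                         mispredict_rate: float,
                         penalty_cycles: float) -> float:
    """Average stall cycles per instruction caused by branch mispredictions.

    Simplified model: every mispredicted branch costs `penalty_cycles`.
    """
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches.
shallow = mispredict_stall_cpi(0.20, 0.09, 10)        # shorter pipeline
deep = mispredict_stall_cpi(0.20, 0.09, 20)           # deeper pipeline, same predictor
deep_better = mispredict_stall_cpi(0.20, 0.06, 20)    # one-third fewer mispredictions
deep_short_recovery = mispredict_stall_cpi(0.20, 0.06, 12)  # partial reload only

for label, value in [("shallow", shallow), ("deep", deep),
                     ("deep + better predictor", deep_better),
                     ("deep + better predictor + shorter recovery", deep_short_recovery)]:
    print(f"{label}: {value:.3f} stall cycles per instruction")
```

Under these assumed inputs, a better predictor recovers part of the cost of the longer pipeline, and a shorter recovery path (the role the text assigns to the Execution Trace Cache) recovers more.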

L1 High-Performance Data Cache

The L1 data cache has been redesigned to increase bandwidth and reduce latency. It is an 8-KB, 4-way set-associative cache with 64-byte lines. (For an explanation of set-associative cache, see "An Overview of Cache," http://developer.intel.com/design/intarch/papers/cache6.htm.) The cache was designed with a new access algorithm to reduce the latency, and is dual ported to allow one load and one store per clock. This reduces the latency on the Pentium 4 processor to two clocks, compared to three clocks on the Pentium III.
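
To make the cache geometry concrete, the following Python sketch derives the number of sets from the figures above and splits an address into tag, set index, and byte offset. It assumes the textbook indexing scheme (low bits select the byte, the next bits select the set); the actual Pentium 4 access algorithm is not described here, and the function names are illustrative.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (total lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets


def split_address(addr: int, sets: int, line_bytes: int):
    """Split an address into (tag, set index, byte offset), textbook style."""
    offset = addr % line_bytes
    set_index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, set_index, offset


# 8-KB, 4-way, 64-byte lines: 128 lines arranged as 32 sets of 4 ways.
lines, sets = cache_geometry(8 * 1024, 4, 64)
print(lines, sets)

# An arbitrary example address, chosen only for illustration.
print(split_address(0x12345, sets, 64))
```

Any line whose address maps to a given set can be held in any of that set's four ways, which is what "4-way set-associative" means.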

L2 Advanced Transfer Cache

Introduced in the Pentium III with integrated L2 cache, the Advanced Transfer Cache is a 256-KB, 8-way set-associative cache that holds both data and instructions. The Pentium 4 processor has extended this cache architecture to include 128-byte lines, with two 64-byte pieces per line. The cache bus bandwidth is more than double that of the Pentium III; for example, the bandwidth on a 1.5-GHz processor is 48 GB/sec.
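
The 48-GB/sec figure can be checked with simple arithmetic. The sketch below assumes the cache transfers a fixed number of bytes on every core clock; the value of 32 bytes per clock is inferred from the stated figures (48 GB/sec divided by 1.5 GHz) rather than quoted from the text, and the function name is illustrative.

```python
def cache_bandwidth_gb_per_s(core_clock_ghz: float, bytes_per_clock: int) -> float:
    """Peak cache bandwidth, assuming one transfer on every core clock."""
    return core_clock_ghz * bytes_per_clock


# Inferred assumption: 32 bytes (256 bits) move per core clock.
print(cache_bandwidth_gb_per_s(1.5, 32))
```

Because the bandwidth is tied to the core clock in this model, it would scale upward as processor frequency increases.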

Rapid Execution Engine

The Rapid Execution Engine is a performance-enhanced arithmetic logic unit (ALU). The ALU has been redesigned to execute the most common integer operations in one-half of a clock cycle, allowing two instructions to be executed in a single clock cycle. This can provide significant performance increases on integer-intensive applications.

400-MHz System Bus

The new system bus on the Pentium 4 processor supports a "source-synchronous," "quad-pumped" bus. Source synchronous (also referred to as "clock-forwarded") refers to the ability to propagate the clock or strobe with the data, a requirement for exceeding 200 MHz on the frontside bus. Quad-pumped refers to the ability to transfer four data values per clock. The new system bus is designed to operate with a base clock of 100 MHz; thus, the quad pump produces 400 megatransfers/sec. This increases the bandwidth of the 64-bit system bus to 3.2 GB/sec (compared to 1.0 GB/sec on the Pentium III processor).
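
The bus bandwidth figures follow directly from the clock rate, the number of transfers per clock, and the bus width. The Python sketch below reproduces the calculation; the function name is illustrative, and the Pentium III comparison assumes a 133-MHz bus with one transfer per clock, which is not stated in the text above.

```python
def bus_bandwidth_gb_per_s(base_clock_mhz: float,
                           transfers_per_clock: int,
                           bus_width_bits: int) -> float:
    """Peak bus bandwidth in GB/sec (1 GB = 1,000 MB here)."""
    megatransfers_per_s = base_clock_mhz * transfers_per_clock
    bytes_per_transfer = bus_width_bits // 8
    return megatransfers_per_s * bytes_per_transfer / 1000


# Pentium 4: 100-MHz base clock, quad-pumped, 64-bit bus.
print(bus_bandwidth_gb_per_s(100, 4, 64))

# Pentium III (assumed 133-MHz, single-pumped, 64-bit bus): roughly 1 GB/sec.
print(bus_bandwidth_gb_per_s(133, 1, 64))
```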

Enhanced Floating Point/Multimedia Engine

This feature provides significant performance enhancements on standard floating-point operations. The engine has been extended to include a 128-bit port, with a separate floating-point move and data store port. Also included in the enhancements are new data types and instructions for supporting SSE2.

Software Architectural Changes: SSE2

SSE2 is a set of 144 new instructions that provide advanced capabilities for applications such as 3D graphics, video encoding/decoding, and speech recognition. There are six new data types and three new classes of instructions. SSE2 also includes some changes to take advantage of the advanced hardware features and new data types included in the Pentium 4 processor microarchitecture. In addition, it reuses the eight existing 128-bit extended multimedia (XMM) registers for both SSE2 and SSE operations. SSE2 is fully compatible with current IA-32 software.

New Data Types

Figure 2 presents the six new data types included in SSE2. The six new types fall into three classes: a 128-bit packed double-precision floating point, a 64-bit quadword integer, and four 128-bit integer data types. The packed floating-point type allows two IEEE 64-bit double-precision floating-point values to be packed into one double quadword. The 64-bit quadword integer type allows for both signed (i.e., negative or positive) and unsigned values. The 128-bit integer types allow two quadwords, four doublewords, eight words, or 16 bytes to be packed into one double quadword.

Figure 2. New Data Types
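
One way to picture the packed types is as different views of the same 16 bytes. The Python sketch below uses the standard struct module to reinterpret one 128-bit value as each packed layout. It is only a software illustration of the layouts, assuming little-endian byte order; the function name is hypothetical and no processor instructions are involved.

```python
import struct


def packed_views(raw: bytes) -> dict:
    """Reinterpret 16 bytes (one double quadword) as each packed layout."""
    if len(raw) != 16:
        raise ValueError("a double quadword is exactly 16 bytes")
    return {
        "2 x double-precision": struct.unpack("<2d", raw),
        "2 x quadword": struct.unpack("<2Q", raw),
        "4 x doubleword": struct.unpack("<4I", raw),
        "8 x word": struct.unpack("<8H", raw),
        "16 x byte": struct.unpack("<16B", raw),
    }


# Pack two doubles into 128 bits, then view the same bits other ways.
raw = struct.pack("<2d", 1.5, -2.0)
for name, values in packed_views(raw).items():
    print(name, values)
```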

New Instructions

The 144 new instructions fall into three categories: double-precision floating point, integer, and cache instructions.

"Double precision" refers to the use of two double words instead of one to hold a number used for calculations. This allows a larger number of bits to represent the value for more arithmetic precision. Double-precision capability is particularly useful in the floating-point operations common in advanced 3D graphics applications.

Double-Precision Floating Point

The double-precision floating-point instructions include data movement, arithmetic, comparison, conversion, logical, and shuffle instructions. Examples include:

  • MOVAPD (move aligned packed double-precision floating-point)
  • ADDPD (add packed double-precision floating-point)
  • CMPPD (compare packed double-precision floating-point)
  • CVTPS2PD (convert packed single-precision floating-point to packed double-precision floating-point)
  • ANDPD (logical AND of packed double-precision floating-point)
  • SHUFPD (shuffle packed double-precision floating-point)

Each packed instruction operates on two double-precision floating-point values held in a single 128-bit operand, so both calculations are performed in parallel.
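
The effect of a packed double-precision add such as ADDPD can be emulated in software. The Python sketch below models each 128-bit operand as 16 bytes and adds the two lanes independently. It illustrates the semantics only; real hardware performs both additions with one instruction, and the function name is simply borrowed from the mnemonic.

```python
import struct


def addpd(xmm_a: bytes, xmm_b: bytes) -> bytes:
    """Emulate a packed double-precision add on two 128-bit operands."""
    a0, a1 = struct.unpack("<2d", xmm_a)
    b0, b1 = struct.unpack("<2d", xmm_b)
    # Each lane is added independently of the other.
    return struct.pack("<2d", a0 + b0, a1 + b1)


a = struct.pack("<2d", 1.0, 2.0)
b = struct.pack("<2d", 0.5, 0.25)
print(struct.unpack("<2d", addpd(a, b)))
```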

Integer

SSE2 adds several 128-bit packed-integer instructions. The 128-bit versions of these new instructions operate on data in the XMM registers, and the 64-bit versions operate on data in the MMX registers. Example instructions include:

  • PADDQ (packed quadword add)
  • PMULUDQ (unsigned integer doubleword multiply)
  • PSHUFD (shuffle packed-integer doublewords in an XMM register)
  • PSRLDQ (shift right logical with byte granularity the contents of an XMM register)
  • MOVQ2DQ (move integer data from MMX to XMM registers)

In addition, SSE2 extends all of the 64-bit MMX and SSE integer instructions to operate on 128-bit packed-integer operands located in the XMM registers.
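
Packed-integer instructions treat each lane as an independent value, so a carry out of one lane does not spill into the next. The Python sketch below emulates a 128-bit packed quadword add in the spirit of PADDQ, modeling each operand as a 128-bit integer holding two 64-bit lanes. It is a software illustration under that assumption, not a description of the hardware implementation.

```python
MASK64 = (1 << 64) - 1


def paddq(a: int, b: int) -> int:
    """Emulate a packed quadword add on two 128-bit operands.

    Each operand holds two 64-bit lanes; each lane wraps around on
    overflow without carrying into the other lane.
    """
    lo = ((a & MASK64) + (b & MASK64)) & MASK64
    hi = (((a >> 64) & MASK64) + ((b >> 64) & MASK64)) & MASK64
    return (hi << 64) | lo


# Low lane overflows and wraps to 0; high lane is unaffected (1 + 2 = 3).
a = (1 << 64) | MASK64
b = (2 << 64) | 1
result = paddq(a, b)
print(hex(result >> 64), hex(result & MASK64))
```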

Cache

Additional cache instructions over the original SSE instructions allow more control over the caching, loading, and storing of data. Cache instructions are broken down into cache flush, cacheability control, memory ordering, and pause instructions. "Cache flush" allows the write-back and invalidation of a single cache line. "Cacheability control" instructions provide hints to the processor not to cache data that will be accessed once and not reused immediately. The related "memory-ordering" instructions provide the "fence" commands that enforce the order in which such loads and stores become visible. The "pause" instruction briefly delays execution of the next instruction, which helps the processor run spin-wait loops more efficiently.

SSE2 provides powerful extensions to the instruction set for IA-32. These new instructions allow the processor to perform operations on more data in parallel, and give the programmer more flexible control over the caching of the data that is being used. Overall, SSE2 allows software to perform better on integer and floating-point calculations that can be executed in parallel.

Conclusion

The Pentium 4 processor delivers significant performance improvements on the following types of applications and environments:

  • Workstation-class applications that rely heavily on floating-point or 3D graphics performance.
  • Multimedia and digital content creation applications such as voice recognition and video encoding/decoding.
  • Bandwidth-intensive games and other memory-intensive applications.
  • Emerging e-business applications such as 3D collaboration, data visualization, and information management.
  • Multitasking environments running multiple high-bandwidth applications and real-time background tasks such as virus checking, encryption, compression, and e-mail synchronization.

Dell recently launched Pentium 4 processor-based Dell Precision™ WorkStation and high-end Dimension™ systems, to be followed closely by OptiPlex™ systems in early 2001. These are single-processor systems that use the Intel 850 chipset and dual-channel Rambus DRAM. (Intel plans to release "Foster," a dual-processor, larger-cache version of the Pentium 4 processor designed for servers and high-end workstations, sometime in 2001.)

The architectural improvements in the Pentium 4 processor are expected to allow this Intel processor family to scale frequency and performance for many years. The addition of the SSE2 enhancements provides advanced capabilities for 3D graphics, video encoding/decoding, and speech recognition. Dell believes that the Pentium 4 processor offers a compelling feature set, and will lead the way for next-generation systems, software, and technologies to be incorporated into personal computers.

For More Information

For more information, see the following website:

  • IA-32 Intel Architecture Software Developer's Manual With Preliminary Intel Pentium 4 Processor Information Volume 1: Basic Architecture and Volume 2: Instruction Set Reference, http://developer.intel.com/design/processor/future/manuals


Information in this document is subject to change without notice.
© 2000 Dell Computer Corporation. All rights reserved.

Trademarks used in this text: Dell, the DELL logo, Dell Precision, OptiPlex, and Dimension are trademarks of Dell Computer Corporation; Intel and Pentium are registered trademarks and MMX and NetBurst are trademarks of Intel Corporation. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Computer Corporation disclaims any proprietary interest in trademarks and trade names other than its own.
