In today’s business world, with the growing importance of the Internet, more and more applications need to be
available online all the time. One obvious example is the online store application. Many companies want to keep their
online stores open 24×7, 365 days a year, so that customers everywhere, in different time zones, can come at any time
to browse products and place orders.
High Availability (HA) may also be critical for non-customer-facing applications. It is very common for IT
departments to have complex distributed applications that connect to multiple data sources, such as those that extract
and summarize sales data from online store applications to reporting systems. A common characteristic of these
applications is that any unexpected downtime could mean a huge loss of business revenue and customers. The total
loss is sometimes very hard to quantify with a dollar amount. Oracle databases are often key components of a
whole storefront ecosystem, so their availability can impact the availability of
the entire ecosystem.
The second area is the scalability of applications. As the business grows, transaction volumes can double or
triple as compared to what was scoped for the initial capacity. Moreover, over short periods, business volumes can be
very dynamic; for example, sales volumes for the holiday season can be significantly higher. An Oracle Database
should be scalable and flexible enough to easily adapt to business dynamics and able to expand for high workloads
and shrink when demand is reduced. Historically, the old Big Iron Unix servers that used to dominate the database
server market lacked the flexibility to adapt to these changes. In the last ten years, the industry standard has shifted to
x86-64 architecture running on Linux to meet the scalability and flexibility needs of growing applications. Oracle Real
Application Clusters (RAC) running on Linux on commodity x86-64 servers is a widely adopted industry-standard
solution to achieve high availability and scalability.
This chapter introduces the Oracle RAC technology and discusses how to achieve the high availability and
scalability of the Oracle database with Oracle RAC. The following topics will be covered in this chapter:
• Database High Availability and Scalability
• Oracle Real Application Clusters (RAC)
• Achieving the Benefits of Oracle RAC
• Considerations for Deploying Oracle RAC
High Availability and Scalability
This section discusses the database availability and scalability requirements and their various related factors.
What Is High Availability?
As shown in the previous example of the online store application, businesses press IT departments to provide solutions
that meet the availability requirements of business applications. Because the database is the centerpiece of most
business applications, database availability is the key to keeping all the applications available.
In most IT organizations, Service Level Agreements (SLAs) define the application availability
agreed between the business and the IT organization. An SLA can be expressed as a percentage availability or as the maximum
downtime allowed per month or per year. For example, an SLA that specifies 99.999% availability allows less than
5.26 minutes of downtime annually. Sometimes an SLA also specifies the particular time window allowed for
downtime; for example, a back-end office application database can be down between midnight and 4 a.m. on the first
Saturday of each quarter for scheduled maintenance such as hardware and software upgrades.
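The arithmetic behind such availability figures is easy to check. The following is a minimal sketch, assuming a 365-day year; the function name allowed_downtime_minutes is my own and not part of any Oracle tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for an availability SLA."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print the downtime budget for a few common availability targets.
    for sla in (99.0, 99.9, 99.99, 99.999):
        print(f"{sla}% -> {allowed_downtime_minutes(sla):.2f} minutes per year")
```

For 99.999% the result is about 5.26 minutes per year, which matches the figure quoted above.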
Since most high availability solutions require additional hardware and/or software, the cost of these solutions
can be high. Companies should determine their HA requirements based on the nature of the applications and the
cost structure. For example, some back-end office applications, such as a human resources application, may not need to
be online 24×7. For mission-critical business applications that need to be highly available, the cost of downtime
should be evaluated too; for example, how much money would be lost due to 1 hour of downtime. Then
we can compare the downtime costs with the capital costs and operational expenses associated with the design and
implementation of various levels of availability solution. This kind of comparison will help business managers and IT
departments come up with realistic SLAs that meet their real business and affordability needs and that their IT team
can deliver.
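The comparison of downtime cost against solution cost can be sketched in a few lines. Everything numeric below is hypothetical: the option names, the availability levels, the yearly solution costs, and the cost per hour of downtime are illustrative assumptions, not figures from any real deployment.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def expected_downtime_cost(availability_percent, cost_per_hour):
    """Yearly cost of downtime if the system just meets its availability target."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * cost_per_hour


def cheapest_option(options, cost_per_hour):
    """Return the (name, total yearly cost) pair with the lowest total cost."""
    totals = {
        name: solution_cost + expected_downtime_cost(availability, cost_per_hour)
        for name, (availability, solution_cost) in options.items()
    }
    best = min(totals, key=totals.get)
    return best, totals[best]


# Hypothetical options: name -> (availability %, yearly solution cost).
OPTIONS = {
    "basic": (99.0, 0),
    "standard": (99.9, 50_000),
    "premium": (99.99, 150_000),
}

if __name__ == "__main__":
    name, total = cheapest_option(OPTIONS, cost_per_hour=10_000)
    print(f"lowest total yearly cost: {name} at about {total:,.0f}")
```

The point of the sketch is that the best availability level depends on the cost of an hour of downtime: when that cost is low, the extra hardware and software may not pay for itself.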
Many business applications consist of multi-tier applications that run on multiple computers in a distributed
network. The availability of the business applications depends not only on the infrastructure that supports these
multi-tier applications, including the server hardware, storage, network, and OS, but also on each tier of the
applications, such as web servers, application servers, and database servers. In this chapter, I will focus mainly on the
availability of the database server, which is the database administrator’s responsibility.
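To see why every tier matters, consider a rough model in which the application is up only when all tiers are up and the tiers fail independently. Under that assumption the overall availability is the product of the tier availabilities. The tier names and the 99.9% values below are hypothetical.

```python
from math import prod


def series_availability(tier_availabilities):
    """Availability of a chain in which every tier must be up at the same time.

    Assumes the tiers fail independently of each other.
    """
    return prod(tier_availabilities)


# Hypothetical availability of each tier, expressed as a fraction.
TIERS = {
    "web server": 0.999,
    "application server": 0.999,
    "database server": 0.999,
    "network and storage": 0.999,
}

if __name__ == "__main__":
    overall = series_availability(TIERS.values())
    print(f"overall availability: {overall:.4%}")
```

Four tiers at 99.9% each give an overall availability below 99.9%, so the application as a whole is always less available than its weakest tier.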
Database availability also plays a critical role in application availability. We use downtime to refer to the periods
when a database is unavailable. The downtime can be either unplanned downtime or planned downtime. Unplanned
downtime occurs when system administrators or DBAs are not prepared for it; it may be caused by an unexpected event
such as hardware or software failure, human error, or even a natural disaster (losing a data center). Most unplanned
downtime can be anticipated; for example, when designing a cluster it is best to make the assumption that everything
will fail, considering that most of these clusters are commodity clusters and hence have parts which break. The key
when designing the availability of the system is to ensure that it has sufficient redundancy built into it, assuming
that every component (including the entire site) may fail. Planned downtime is usually associated with scheduled
maintenance activities such as system upgrade or migration.
Unplanned downtime of the Oracle database service can be due to data loss or server failure. The data loss may
be caused by storage medium failure, data corruption, deletion of data by human error, or even data center failure.
Data loss can be a very serious failure as it may turn out to be permanent, or could take a long time to recover from.
The solutions to data loss consist of prevention methods and recovery methods. Prevention methods include disk
mirroring by RAID (Redundant Array of Independent Disks) configurations such as RAID 1 (mirroring only) and
RAID 10 (mirroring and striping) in the storage array, or the ASM (Automatic Storage Management) disk group
redundancy setting. Chapter 5 will discuss the details of the RAID configurations and ASM configurations for Oracle
Databases. Recovery methods focus on getting the data back through database recovery from the previous database
backup, through flashback recovery, or by switching to the standby database through Data Guard failover.
Server failure is usually caused by hardware or software failure. Hardware failure can be a physical machine
component failure or a network or storage connection failure; software failure can be caused by an OS crash or by an
Oracle database instance or ASM instance failure. Usually during server failure, data in the database remains intact.
After the software or hardware issue is fixed, the database service on the failed server can be resumed after completing
database instance recovery and startup. Database service downtime due to server failure can be prevented by
providing redundant database servers so that the database service can fail over in case of primary server failure.
Network and storage connection failure can be prevented by providing redundant network and storage connections.
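The benefit of redundancy can be estimated with a simple model. This sketch assumes that the redundant components fail independently and that the service stays up as long as at least one of them is up; real failover takes some time, which the model ignores.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant components.

    The group is down only when every copy is down at the same time,
    assuming the copies fail independently.
    """
    if copies < 1:
        raise ValueError("at least one component is required")
    all_down = (1 - component_availability) ** copies
    return 1 - all_down


if __name__ == "__main__":
    # Show how each extra copy of a 99%-available server shrinks the downtime.
    for n in (1, 2, 3):
        print(f"{n} server(s): {redundant_availability(0.99, n):.6f}")
```

Under these assumptions, a second server that is itself available only 99% of the time raises the availability of the pair to 99.99%.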
Database Scalability
In the database world, it is said that one should always start with application database design, SQL query tuning, and
database instance tuning, instead of just adding new hardware. This is always true, because with a bad application database
design and bad SQL queries, adding hardware will not solve the performance problem. On the other hand,
even some well-tuned databases can run out of system capacity as workloads increase.
In this case, the database performance issue is no longer just a tuning issue. It also becomes a scalability issue.
Database scalability is about how to increase the database throughput and reduce database response time, under
increasing workloads, by adding more computing, networking, and storage resources.
The three critical system resources for database systems are CPU, memory, and storage. Different types of
database workloads may use these resources differently: some may be CPU bound or memory bound, while others
may be I/O bound. To scale the database, DBAs first need to identify the major performance bottlenecks or resource
contentions with a performance monitoring tool such as Oracle Enterprise Manager or an AWR (Automatic Workload
Repository) report. If the database is found to be I/O bound, storage needs to be scaled up. In Chapter 5, we discuss
how to scale up storage by increasing storage I/O capacity such as IOPS (I/O operations per second) and decreasing
storage response time with ASM striping and I/O load balancing on disk drives.
If the database is found to be CPU bound or memory bound, server capacity needs to be scaled up. Server
scalability can be achieved by one of the following two methods:
• Scale-up or vertical scaling: adding additional CPUs and memory to the existing server.
• Scale-out or horizontal scaling: adding additional server(s) to the database system.
The scale-up method is relatively simple. We just need to add more CPUs and memory to the server. Additional
CPUs can be recognized by the OS and the database instance. To use the additional memory, some memory settings
may need to be modified in the OS kernel, as well as the database instance initialization parameters. This option is more
useful with x86 servers as these servers are getting more CPU cores and memory (up to 80 cores and 4TB of memory per
server for the newer servers at the time of writing). The HP DL580 and DL980 and Dell R820 and R910 are examples of
these powerful x86 servers. For some servers, such as those which are based on Intel’s Sandy Bridge and Northbridge
architectures, adding more memory with the older CPUs might not always achieve the same memory performance.
One of the biggest issues with this scale-up method is that it can hit its limit when the server has already reached its
maximum CPU and memory capacity. In this case, you may have to either replace it with a more powerful server or try
the scale-out option.
The scale-out option is to add more server(s) to the database by clustering these servers so that workloads can be
distributed between them. In this way, the database can double or triple its CPU and memory resources. Compared to
the scale-up method, scale-out is more scalable as you can continue adding more servers for continuously increasing
workloads.
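The difference between the two methods can be illustrated with a small capacity check. This is an idealized sketch: it uses the 80-core figure mentioned above as the per-server ceiling and assumes that capacity grows linearly with the number of servers, which ignores any clustering overhead.

```python
import math

MAX_CORES_PER_SERVER = 80  # per-server ceiling quoted in the text


def can_scale_up(required_cores, max_cores_per_server=MAX_CORES_PER_SERVER):
    """Scale-up works only while the requirement fits inside one server."""
    return required_cores <= max_cores_per_server


def servers_needed(required_cores, cores_per_server=MAX_CORES_PER_SERVER):
    """Number of servers a scale-out cluster needs, assuming linear scaling."""
    return math.ceil(required_cores / cores_per_server)


if __name__ == "__main__":
    for cores in (64, 200):
        print(cores, "cores:",
              "scale-up possible" if can_scale_up(cores) else "scale-up exhausted",
              "/ servers needed for scale-out:", servers_needed(cores))
```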
This section discusses Oracle RAC: its architecture, infrastructure requirements, and main components.
Database Clustering Architecture
To achieve horizontal scalability or scale-out of a database, multiple database servers are grouped together to form
a cluster infrastructure. These servers are linked by a private interconnect network and work together as a single
virtual server that is capable of handling large application workloads. This cluster can be easily expanded or shrunk by
adding or removing servers from the cluster to adapt to the dynamics of the workload. This architecture is not limited
by the maximum capacity of a single server, as the vertical scalability (scale-up) method is. There are two types of
clustering architecture:
• Shared Nothing Architecture
• Shared Everything Architecture
The shared nothing architecture is built on a group of independent servers with storage attached to each server.
Each server carries a portion of the database. The workloads are also divided by this group of servers so that each
server carries a predefined workload. Although this architecture can distribute the workloads among multiple servers,
the distribution of the workloads and data among the servers is predefined. Adding or removing a single server would
require a complete redesign and redeployment of the cluster.
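To illustrate why changing the number of servers is so disruptive, here is a minimal sketch that assumes the data is partitioned by taking a numeric key modulo the number of servers. This scheme is only one possible predefined distribution, chosen for illustration; the function names are hypothetical.

```python
def owner(key: int, n_servers: int) -> int:
    """Server that stores a row under simple modulo partitioning."""
    return key % n_servers


def fraction_relocated(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose owning server changes when the cluster is resized."""
    moved = sum(1 for k in keys if owner(k, old_n) != owner(k, new_n))
    return moved / len(keys)


if __name__ == "__main__":
    keys = range(10_000)
    share = fraction_relocated(keys, old_n=4, new_n=5)
    print(f"rows that must move when growing from 4 to 5 servers: {share:.0%}")
```

With this scheme, growing from four to five servers forces most of the rows to move to a different server, which is why the change amounts to a redeployment of the data.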
For those applications where each node only needs to access a part of the database, with very careful partitioning
of the database and workloads, this shared nothing architecture may work. If the data partition is not completely in
sync with the application workload distribution on the server nodes, some nodes may need to access data stored in
other nodes. In this case, database performance will suffer. Shared nothing architecture also doesn’t work well with
the large set of database applications, such as OLTP (online transaction processing), that need to access the entire
database; for these applications, the architecture requires frequent data redistribution across the nodes. Shared
nothing also doesn’t provide high availability. Since each partition is dedicated to a piece of the data and workload
which is not duplicated by any other server, each server can be a single point of failure. In case of the failure of any
server, the data and workload it carries cannot be taken over by the other servers in the cluster.
In the shared everything architecture, each server in the cluster is connected to a shared storage where the database
files are stored. The cluster can be either active-passive or active-active. In the active-passive cluster architecture, at any given
time, only one server is actively accessing the database files and handling workloads; the second one is passive and in
standby. In the case of active server failure, the second server picks up the access to the database files and becomes the
active server, and user connections to the database also get failed over to the second server. This active-passive cluster
provides only availability, not scalability, as at any given time only one server is handling the workloads.
Examples of this type of cluster database include Microsoft SQL Server Cluster, Oracle Fail Safe, and Oracle
RAC One Node. Oracle RAC One Node, introduced in Oracle Database 11.2, allows a single-instance database to
fail over to the other node in case of node failure. Since Oracle RAC One Node is based on the same Grid
Infrastructure as Oracle RAC Database, it can be converted from one node to the active-active Oracle RAC Database
with a couple of srvctl commands. Chapter 14 will discuss the details of Oracle RAC One Node.
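The active-passive behavior described above can be modeled in a few lines. This is a toy model for illustration only; the class and node names are hypothetical, and it does not represent how Oracle Clusterware or any other product implements failover.

```python
class ActivePassiveCluster:
    """Toy model: one server handles all workloads, the other waits in standby."""

    def __init__(self, active: str, standby: str):
        self.active = active
        self.standby = standby

    def serving_nodes(self):
        """Only the active server ever handles workloads."""
        return [self.active]

    def fail_active(self) -> str:
        """Simulate failure of the active server; the standby takes over."""
        if self.standby is None:
            raise RuntimeError("no standby server left to take over")
        failed = self.active
        self.active, self.standby = self.standby, None
        return failed


if __name__ == "__main__":
    cluster = ActivePassiveCluster(active="node1", standby="node2")
    print("serving before failure:", cluster.serving_nodes())
    cluster.fail_active()
    print("serving after failure:", cluster.serving_nodes())
```

Before and after the failover exactly one node serves the workload, which is why this design adds availability but no scalability.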
In the active-active cluster architecture, all the servers in the cluster can actively access the database files and
handle workloads simultaneously. All database workloads are evenly distributed to all the servers. In case of one or
more server failures, the database connections and workloads on the failed servers get failed over to the rest of the
surviving servers. This active-active architecture implements database server virtualization by providing users with
a virtual database service. How many actual physical database servers are behind the virtual database service, and
how the workloads get distributed to these physical servers, is transparent to users. To make this architecture scalable,
adding or removing physical servers from the cluster is also transparent to users. Oracle RAC is the classic example of
the active-active shared everything database architecture
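As a contrast with the active-passive case, the following toy model spreads sessions across all nodes and moves the sessions of a failed node to the survivors. The round-robin placement and the least-loaded failover rule are assumptions made for this sketch; they are not a description of how Oracle RAC balances connections.

```python
def distribute(sessions, nodes):
    """Assign sessions to nodes in round-robin order."""
    assignment = {node: [] for node in nodes}
    for i, session in enumerate(sessions):
        assignment[nodes[i % len(nodes)]].append(session)
    return assignment


def fail_over(assignment, failed_node):
    """Move the failed node's sessions to the least-loaded surviving nodes."""
    survivors = [node for node in assignment if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node can take over the sessions")
    result = {node: list(assignment[node]) for node in survivors}
    for session in assignment[failed_node]:
        target = min(survivors, key=lambda node: len(result[node]))
        result[target].append(session)
    return result


if __name__ == "__main__":
    before = distribute(list(range(9)), ["node1", "node2", "node3"])
    after = fail_over(before, "node2")
    print({node: len(s) for node, s in before.items()})
    print({node: len(s) for node, s in after.items()})
```

All nodes serve sessions at the same time, and after a failure every session is still served by one of the surviving nodes.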