Tuesday, October 16, 2012

Java development 2.0: Introducing Kilim - An actor framework for Java concurrency

 "Debugging nondeterministic defects in multithreaded applications has to be one of the most painful and frustrating activities known to software developers. So, like a lot of people, I've been caught up in the excitement about concurrent programming with functional languages like Erlang and Scala.
Both Scala and Erlang employ the actor model, rather than threads, for concurrent programming. Innovations around the actor model aren't limited to just languages; the actor model is also accessible on Java-based actor frameworks like Kilim.
Kilim's approach to the actor model is intuitive, and as you'll soon see, the library makes building concurrent applications a breeze."



In 2005, Herb Sutter wrote a now famous article entitled, "The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software". In it, he tore apart the misplaced belief that Moore's Law would continue to unleash higher and higher CPU clock speeds.
Sutter predicted the end of the "free lunch" that had come from increasing the performance of software applications by piggybacking on ever-faster chips. Instead, he said that obtaining an appreciable increase in application performance would require taking advantage of multicore chip architectures.
As it turns out, he was right. Chip manufacturers have hit a hard limit, with chip speeds stabilizing at around 3.5 GHz for some years now. Moore's Law is alive and well in the multicore arena with manufacturers increasing the number of cores on chips at an ever-increasing rate.

Sutter also noted that concurrent programming is how developers will take advantage of multicore architectures. But, he added, "we desperately need a higher-level programming model for concurrency than languages offer today." The basic programming model of languages like the Java language is thread based. While multithreaded applications aren't terribly hard to write, writing them correctly is a challenge: what's difficult about concurrent programming is thinking in terms of concurrency with threads. Alternate concurrency models have arisen along these lines, and one that is particularly interesting, and gaining mindshare in the Java community, is the actor model.



The actor model is a different way of modeling concurrent processes. Rather than threads interacting via shared memory with locks, the actor model leverages "actors" that pass asynchronous messages using mailboxes. A mailbox, in this case, is just like one in real life — messages can be stored and retrieved for processing by other actors. Rather than sharing variables in memory, the mailbox effectively separates distinct processes from each other.
Actors act as separate and distinct entities that don't share memory for communication. In fact, actors can only communicate via mailboxes. There are no locks or synchronized blocks in the actor model, so the issues that arise from them — like deadlocks and the nefarious lost-update problem — simply don't come up. What's more, actors are intended to work concurrently and not in some sequenced manner. As such, actors are much safer (locks and synchronization aren't necessary) and the actor model itself handles coordination issues. In essence, the actor model makes concurrent programming easier.
The actor model is by no means new; it's been around for quite some time. Some languages, like Erlang and Scala, base their concurrency model on actors as opposed to threads. In fact, Erlang's success in enterprise situations (Erlang was created by Ericsson and has a rich history in the telecom world) has arguably made the actor model more popular and visible, and that has helped it become a viable option for other languages. Erlang is a shining example of the actor model's safer approach to concurrency.
Unfortunately, the actor model isn't built into the Java platform, but it is available to us in a variety of forms. The JVM's openness to alternate languages means that you can leverage actors via a Java-platform language like Scala or Groovy (see Resources to learn about Groovy's actor library, GPars). Also, you can try one of the Java-based libraries that enable the actor model, like Kilim.
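As a rough sketch of the model (using only standard java.util.concurrent classes rather than Kilim's own task and mailbox API), an actor can be approximated as a thread that owns a private mailbox and communicates with the world only through it:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A toy "actor": a thread that owns a private mailbox and reacts only to
// messages received from it. No shared mutable state, no locks in user code.
class EchoActor extends Thread {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();

    // Other actors interact with this actor only through its mailbox.
    void send(String message) throws InterruptedException {
        mailbox.put(message);
    }

    @Override
    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // block until a message arrives
                if ("stop".equals(message)) break;
                System.out.println("received: " + message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        EchoActor actor = new EchoActor();
        actor.start();
        actor.send("hello");
        actor.send("stop");
        actor.join();
    }
}

Because each actor's state is touched only by the thread draining its mailbox, coordination falls out of the message flow rather than out of locks.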

Sunday, October 14, 2012

Java theory and practice: Introduction to nonblocking algorithms

Java theory and practice: Introduction to nonblocking algorithms:

Summary:  Java™ 5.0 makes it possible for the first time to develop nonblocking algorithms in the Java language, and the java.util.concurrent package uses this capability extensively. Nonblocking algorithms are concurrent algorithms that derive their thread safety not from locks, but from low-level atomic hardware primitives such as compare-and-swap. Nonblocking algorithms can be extremely difficult to design and implement, but they can offer better throughput and greater resistance to liveness problems such as deadlock and priority inversion. In this installment of Java theory and practice, concurrency guru Brian Goetz illustrates how several of the simpler nonblocking algorithms work.
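As a taste of the style the article covers, here is a minimal lock-free counter built on compare-and-swap; it simply retries when another thread has raced in between:

import java.util.concurrent.atomic.AtomicInteger;

// Lock-free counter: thread safety comes from the CAS retry loop,
// not from any lock.
class NonblockingCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        int current;
        do {
            current = value.get();                            // read current value
        } while (!value.compareAndSet(current, current + 1)); // retry on contention
        return current + 1;
    }

    public int get() { return value.get(); }
}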

Java concurrency


1. Learn Java concurrency basics

Threads and processes are the basic units of execution in concurrent Java programming. Every process has at least one thread, and all of the threads in a process share its resources. Understand the benefits of threads and why it's essential to use them safely.

2. Master high-level Java concurrency utilities

Learn how to use the thread-safe, well-tested, high-performance concurrent building blocks in the java.util.concurrent package, introduced in Java SE 5. And find out how to avoid both common and lesser-known concurrency pitfalls.

3. Test and analyze your concurrent code

Take advantage of tools developed by IBM researchers for testing, debugging, and analyzing concurrent Java applications.

4. Explore alternate concurrency models

In response to advances in multicore processor hardware, approaches to writing concurrent applications for the Java platform are diversifying. Concurrency support in two alternate languages for the JVM — Scala and Clojure — eschews the thread model. Learn about the actor and agent concurrency models in those languages, and about third-party Java and Groovy libraries that implement those models. And learn more about fork/join, a multicore-friendly concurrency enhancement in Java SE 7.

Multicore CPUs and the concurrency changes they bring - a brief note from IBM developerworks

Multicore CPUs and the concurrency changes they bring:

Summary:  Multicore chip architectures have brought little improvement to individual core performance. This continuing trend shifts the burden of maximizing the use of hardware resources to the developers of operating systems, programming languages, and applications. Many in the application-development community rely on thread-based concurrent programming to implement application parallelism. This article explains why thread-based programming is not the best approach for application parallelism in the multicore era.


Saturday, October 13, 2012

Normalizing Your Database: Third Normal Form (3NF)

Normalizing Your Database: Third Normal Form (3NF):


There are two basic requirements for a database to be in third normal form:
  • Already meet the requirements of both 1NF and 2NF
  • Remove columns that are not fully dependent upon the primary key.
Imagine that we have a table of widget orders that contains the following attributes:
  • Order Number
  • Customer Number
  • Unit Price
  • Quantity
  • Total
Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF.
Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:
  • Order Number
  • Customer Number
  • Unit Price
  • Quantity
Now our table is in 3NF. But, you might ask, what about the total? This is a derived field and it's best not to store it in the database at all. We can simply compute it "on the fly" when performing database queries. For example, we might have previously used this query to retrieve order numbers and totals:
 SELECT OrderNumber, Total
 FROM WidgetOrders
 
We can now use the following query:
 SELECT OrderNumber, UnitPrice * Quantity AS Total
 FROM WidgetOrders
 
to achieve the same results without violating normalization rules.

Database Normalization: First, Second, and Third Normal Forms | Andrew Rollins

Database Normalization: First, Second, and Third Normal Forms | Andrew Rollins: "This is a complex way of saying that if a column isn’t intrinsically related to the entire primary key, then you should break out the primary key into different tables."


Reduce finalizer usage


IBM Debugging OutOfMemory Errors and Performance Problems


Reduce finalizer usage

Use the smallest number of finalizers possible, and ensure that they run as quickly as possible
Introduction
Finalization of an object occurs when the object is no longer in use, but before it is removed from the heap. Only the Garbage Collector can determine whether an object is no longer in use and is eligible for collection. When the Garbage Collector runs, it finds the unreachable objects that have a finalizer method and adds them to the finalizer queue. Because methods cannot be run during the stop-the-world phase of garbage collection, the removal of objects with finalizers is delayed until the garbage collection cycle after the finalizer method has been run. Where possible, this removal occurs in the next garbage collection cycle and collection is delayed only by a single cycle. If this removal does not occur, a backlog of objects requiring finalization can build up.
Ensuring objects are finalized before the next garbage collection cycle
A single thread is responsible for running the finalizer method of all objects with a finalizer method, and therefore this activity occurs serially. To ensure that finalization occurs before the next garbage collection cycle, follow these guidelines:
Use finalizers only where necessary
By reducing the number of finalizers, all finalization is more likely to occur between garbage collection cycles. Where possible, do cleanup synchronously without the use of finalizers.
Ensure that finalizers complete quickly
Finalizers must run as quickly as possible and avoid taking locks or blocking on external resources or connections.
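A minimal sketch of the synchronous-cleanup advice: release the resource explicitly and deterministically rather than waiting for the collector to run a finalizer:

import java.io.FileInputStream;
import java.io.IOException;

// Preferred alternative to a finalizer: clean up in a finally block
// (or, on Java 7+, try-with-resources), so the resource is released
// immediately and the GC has no finalization work to do.
class ExplicitCleanup {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream(args[0]);
        try {
            System.out.println("first byte: " + in.read());
        } finally {
            in.close(); // synchronous cleanup; no finalizer, no GC delay
        }
    }
}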


Friday, October 12, 2012

[concurrency-interest] LinkedBlockingQueue extract and memory "waste"


[concurrency-interest] LinkedBlockingQueue extract and memory "waste":

A (possible) performance bug in the JDK's LinkedBlockingQueue implementation that leads to a memory issue.
Nulling out head.next pointers while extracting from the LinkedBlockingQueue helps resolve the issue.
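Here is a hedged, illustrative sketch of the idea behind the fix (single-threaded for clarity, and not the actual JDK source): clearing the links out of a dequeued node prevents a stale reference to one consumed node from keeping the whole chain of later nodes reachable:

// Tiny linked queue with a dummy header node, shown only to illustrate
// why nulling the outgoing link of a dequeued node matters for GC.
class TinyLinkedQueue<E> {
    private static final class Node<E> {
        E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    private Node<E> head = new Node<E>(null); // dummy header node
    private Node<E> tail = head;

    void offer(E item) {
        Node<E> node = new Node<E>(item);
        tail.next = node;
        tail = node;
    }

    E poll() {
        Node<E> first = head.next;
        if (first == null) return null;
        head.next = null;   // unlink the old header so it cannot pin 'first'
        head = first;       // 'first' becomes the new dummy header
        E item = first.item;
        first.item = null;  // drop the reference to the extracted element
        return item;
    }
}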

The Concurrency-interest Archives - Doug Lea et al

The Concurrency-interest Archives:

Concurrency-interest -- Discussion list for JSR-166

This is a Java concurrency discussion forum where experts, including the authors of the Java platform's concurrency features (Doug Lea et al.), contribute their ideas and explanations on core and systems-level concurrency topics.

Thursday, October 11, 2012

java.util.concurrent (Java Platform SE 6) - Concurrency Package API Summary


This page provides a quick reference summary to the Java Concurrency API.



Package java.util.concurrent Description


Utility classes commonly useful in concurrent programming. This package includes a few small standardized extensible frameworks, as well as some classes that provide useful functionality and are otherwise tedious or difficult to implement. Here are brief descriptions of the main components. See also the locks and atomic packages.

Executors

Interfaces. Executor is a simple standardized interface for defining custom thread-like subsystems, including thread pools, asynchronous IO, and lightweight task frameworks. Depending on which concrete Executor class is being used, tasks may execute in a newly created thread, an existing task-execution thread, or the thread calling execute(), and may execute sequentially or concurrently. ExecutorService provides a more complete asynchronous task execution framework. An ExecutorService manages queuing and scheduling of tasks, and allows controlled shutdown. The ScheduledExecutorService subinterface and associated interfaces add support for delayed and periodic task execution. ExecutorServices provide methods arranging asynchronous execution of any function expressed as a Callable, the result-bearing analog of Runnable. A Future returns the results of a function, allows determination of whether execution has completed, and provides a means to cancel execution. A RunnableFuture is a Future that possesses a run method that, upon execution, sets its results.

Implementations. Classes ThreadPoolExecutor and ScheduledThreadPoolExecutor provide tunable, flexible thread pools. The Executors class provides factory methods for the most common kinds and configurations of Executors, as well as a few utility methods for using them. Other utilities based on Executors include the concrete class FutureTask, providing a common extensible implementation of Futures, and ExecutorCompletionService, which assists in coordinating the processing of groups of asynchronous tasks.
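A minimal usage sketch of the interfaces described above: submit a Callable to a pool, collect its result through a Future, then shut the pool down:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Submit a result-bearing task to a thread pool and retrieve its value.
public class ExecutorExample {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Future<Integer> future = pool.submit(new Callable<Integer>() {
            public Integer call() {
                return 6 * 7; // runs on a pool thread
            }
        });
        System.out.println("result: " + future.get()); // blocks until done
        pool.shutdown(); // controlled shutdown, as described above
    }
}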

Queues

The java.util.concurrent ConcurrentLinkedQueue class supplies an efficient scalable thread-safe non-blocking FIFO queue. Five implementations in java.util.concurrent support the extended BlockingQueue interface, which defines blocking versions of put and take: LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, and DelayQueue. The different classes cover the most common usage contexts for producer-consumer, messaging, parallel tasking, and related concurrent designs. The BlockingDeque interface extends BlockingQueue to support both FIFO and LIFO (stack-based) operations. Class LinkedBlockingDeque provides an implementation.
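A short producer-consumer sketch over one of these implementations; it also uses the timed poll described under Timing below:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Classic producer-consumer over a bounded blocking queue.
public class ProducerConsumer {
    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(10);

        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) queue.put(i); // blocks if full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        // poll with a timeout returns null if nothing arrives in time.
        Integer item;
        while ((item = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
            System.out.println("consumed: " + item);
        }
        producer.join();
    }
}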

Timing

The TimeUnit class provides multiple granularities (including nanoseconds) for specifying and controlling time-out based operations. Most classes in the package contain operations based on time-outs in addition to indefinite waits. In all cases that time-outs are used, the time-out specifies the minimum time that the method should wait before indicating that it timed-out. Implementations make a "best effort" to detect time-outs as soon as possible after they occur. However, an indefinite amount of time may elapse between a time-out being detected and a thread actually executing again after that time-out. All methods that accept timeout parameters treat values less than or equal to zero to mean not to wait at all. To wait "forever", you can use a value of Long.MAX_VALUE.

Synchronizers

Four classes aid common special-purpose synchronization idioms. Semaphore is a classic concurrency tool. CountDownLatch is a very simple yet very common utility for blocking until a given number of signals, events, or conditions hold. A CyclicBarrier is a resettable multiway synchronization point useful in some styles of parallel programming. An Exchanger allows two threads to exchange objects at a rendezvous point, and is useful in several pipeline designs.
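For example, a CountDownLatch lets one thread block until a fixed number of completion signals have arrived:

import java.util.concurrent.CountDownLatch;

// Block the main thread until N worker threads have signaled completion.
public class LatchExample {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 3;
        final CountDownLatch done = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    System.out.println("worker " + id + " finished");
                    done.countDown(); // signal one completion
                }
            }).start();
        }

        done.await(); // waits until the count reaches zero
        System.out.println("all workers done");
    }
}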

Concurrent Collections

Besides Queues, this package supplies Collection implementations designed for use in multithreaded contexts: ConcurrentHashMap, ConcurrentSkipListMap, ConcurrentSkipListSet, CopyOnWriteArrayList, and CopyOnWriteArraySet. When many threads are expected to access a given collection, a ConcurrentHashMap is normally preferable to a synchronized HashMap, and a ConcurrentSkipListMap is normally preferable to a synchronized TreeMap. A CopyOnWriteArrayList is preferable to a synchronized ArrayList when the expected number of reads and traversals greatly outnumber the number of updates to a list. The "Concurrent" prefix used with some classes in this package is a shorthand indicating several differences from similar "synchronized" classes. For example, java.util.Hashtable and Collections.synchronizedMap(new HashMap()) are synchronized, but ConcurrentHashMap is "concurrent". A concurrent collection is thread-safe, but not governed by a single exclusion lock. In the particular case of ConcurrentHashMap, it safely permits any number of concurrent reads as well as a tunable number of concurrent writes. "Synchronized" classes can be useful when you need to prevent all access to a collection via a single lock, at the expense of poorer scalability. In other cases in which multiple threads are expected to access a common collection, "concurrent" versions are normally preferable. And unsynchronized collections are preferable when either collections are unshared, or are accessible only when holding other locks.
Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators provide weakly consistent rather than fast-fail traversal. A weakly consistent iterator is thread-safe, but does not necessarily freeze the collection while iterating, so it may (or may not) reflect any updates since the iterator was created.
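A small illustration of the difference: ConcurrentMap's atomic check-then-act methods (putIfAbsent, replace) let callers update safely without any external lock:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Thread-safe map updates without a single exclusion lock.
public class ConcurrentMapExample {
    public static void main(String[] args) {
        ConcurrentMap<String, Integer> hits = new ConcurrentHashMap<String, Integer>();

        // putIfAbsent is one atomic check-then-act, so two threads racing
        // on the same key cannot both install an initial value.
        hits.putIfAbsent("home", 0);

        // Atomic read-modify-write in CAS style: retry on contention.
        for (;;) {
            Integer old = hits.get("home");
            if (hits.replace("home", old, old + 1)) break;
        }
        System.out.println(hits.get("home")); // 1
    }
}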

Memory Consistency Properties

Chapter 17 of the Java Language Specification defines the happens-before relation on memory operations such as reads and writes of shared variables. The results of a write by one thread are guaranteed to be visible to a read by another thread only if the write operation happens-before the read operation. The synchronized and volatile constructs, as well as the Thread.start() and Thread.join() methods, can form happens-before relationships. In particular:
  • Each action in a thread happens-before every action in that thread that comes later in the program's order.
  • An unlock (synchronized block or method exit) of a monitor happens-before every subsequent lock (synchronized block or method entry) of that same monitor. And because the happens-before relation is transitive, all actions of a thread prior to unlocking happen-before all actions subsequent to any thread locking that monitor.
  • A write to a volatile field happens-before every subsequent read of that same field. Writes and reads of volatile fields have similar memory consistency effects as entering and exiting monitors, but do not entail mutual exclusion locking.
  • A call to start on a thread happens-before any action in the started thread.
  • All actions in a thread happen-before any other thread successfully returns from a join on that thread.
The methods of all classes in java.util.concurrent and its subpackages extend these guarantees to higher-level synchronization. In particular:
  • Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
  • Actions in a thread prior to the submission of a Runnable to an Executor happen-before its execution begins. Similarly for Callables submitted to an ExecutorService.
  • Actions taken by the asynchronous computation represented by a Future happen-before actions subsequent to the retrieval of the result via Future.get() in another thread.
  • Actions prior to "releasing" synchronizer methods such as Lock.unlock, Semaphore.release, and CountDownLatch.countDown happen-before actions subsequent to a successful "acquiring" method such as Lock.lock, Semaphore.acquire, Condition.await, and CountDownLatch.await on the same synchronizer object in another thread.
  • For each pair of threads that successfully exchange objects via an Exchanger, actions prior to the exchange() in each thread happen-before those subsequent to the corresponding exchange() in another thread.
  • Actions prior to calling CyclicBarrier.await happen-before actions performed by the barrier action, and actions performed by the barrier action happen-before actions subsequent to a successful return from the corresponding await in other threads.
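As a small illustration of the volatile rule above (class and field names are mine), a volatile flag can safely publish ordinary writes made before it:

// The write to 'ready' happens-before a read that sees it as true, so the
// reader is then guaranteed to see data = 42 even though 'data' itself
// is not volatile.
public class VolatileFlag {
    private static int data;
    private static volatile boolean ready;

    public static void main(String[] args) {
        new Thread(new Runnable() {
            public void run() {
                while (!ready) { }        // spin until the flag is published
                System.out.println(data); // guaranteed to print 42
            }
        }).start();

        data = 42;     // ordinary write...
        ready = true;  // ...published by the volatile write
    }
}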

Wednesday, October 3, 2012

An approach to Software Design



  • Visualize your system (from inside)
  • Identify different (disparate) entities in your system
  • Group objects that are inherently common (from a functional or real-world aspect)
  • Identify inherent relationships in those entities: use tools such as is-a, has-a, uses, composition, aggregation, generalization, realization, cardinality
  • Identify key data structures to model the entities
  • Identify the interaction model between these entities to achieve the end goal of the system
  • Identify tasks and ways you can break them down
  • Identify task relationships
  • Identify workers or threads and model their interaction
  • Choose appropriate thread synchronizers in each situation (each has its own way of defining a thread interaction mechanism)
  • Identify business rules
  • Identify actors in your system
  • Identify and abstract common design patterns
  • Stick to SOLID principles for design (highly cohesive, loosely coupled)
  • Ask questions and gather constraints, which are the most essential limiting factors in shaping your system
  • Visualize your system (from outside)
  • Identify the above things using your system as one component and how it interacts or interfaces with its outside world
  • Identify performance goals of your system (latency/throughput)
  • Design for performance (scalable; identify performance/scalability bottlenecks): liveness issues, memory leaks, resource leak issues
  • Don't over design
  • Start small and let your design evolve
  • Keep it flexible and simple (use the SOLID principles of design)
     

Tuesday, October 2, 2012

SOLID (Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion)

SOLID (object-oriented design) - Wikipedia, the free encyclopedia:

In computer programming, SOLID (Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion) is a mnemonic acronym introduced by Michael Feathers for the "first five principles" identified by Robert C. Martin[1][2] in the early 2000s[3] that stands for five basic principles of object-oriented programming and design. The principles, when applied together, intend to make it more likely that a programmer will create a system that is easy to maintain and extend over time.[3] The principles of SOLID are guidelines that can be applied while working on software to remove code smells by causing the programmer to refactor the software's source code until it is both legible and extensible. It is typically used with test-driven development, and is part of an overall strategy of agile and adaptive programming.[3][4]


Single responsibility principle



In object-oriented programming, the single responsibility principle states that every class should have a single responsibility, and that responsibility should be entirely encapsulated by the class. All its services should be narrowly aligned with that responsibility.
The term was introduced by Robert C. Martin in an article by the same name as part of his Principles of Object Oriented Design,[1] made popular by his book Agile Software Development, Principles, Patterns, and Practices[2]. Martin described it as being based on the principle of cohesion, as described by Tom DeMarco in his book Structured Analysis and Systems Specification[3].
Martin defines a responsibility as a reason to change, and concludes that a class or module should have one, and only one, reason to change. As an example, consider a module that compiles and prints a report. Such a module can be changed for two reasons. First, the content of the report can change. Second, the format of the report can change. These two things change for very different causes; one substantive, and one cosmetic. The single responsibility principle says that these two aspects of the problem are really two separate responsibilities, and should therefore be in separate classes or modules. It would be a bad design to couple two things that change for different reasons at different times.
The reason it is important to keep a class focused on a single concern is that it makes the class more robust. Continuing with the foregoing example, if there is a change to the report compilation process, there is greater danger that the printing code will break if it is part of the same class.
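A minimal sketch of that separation, with illustrative class names:

// The report example from above, split along its two reasons to change:
// content (compilation) and format (printing).
class Report {
    private final String body;
    Report(String body) { this.body = body; }
    String body() { return body; }
}

class ReportCompiler {            // changes when the content changes
    Report compile(String rawData) {
        return new Report("Summary of: " + rawData);
    }
}

class ReportPrinter {             // changes when the format changes
    void print(Report report) {
        System.out.println("=== REPORT ===");
        System.out.println(report.body());
    }
}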



Open/closed principle



In object-oriented programming, the open/closed principle states "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification";[1] that is, such an entity can allow its behaviour to be modified without altering its source code. This is especially valuable in a production environment, where changes to source code may necessitate code reviews, unit tests, and other such procedures to qualify it for use in a product: code obeying the principle doesn't change when it is extended, and therefore needs no such effort.
The name Open/Closed Principle has been used in two ways. Both ways use inheritance to resolve the apparent dilemma, but the goals, techniques, and results are different.
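A minimal sketch with illustrative names (not from the quoted text): new behavior arrives as a new subtype, and the code that uses the abstraction never changes:

// Open for extension, closed for modification.
interface Shape {
    double area();
}

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

// Adding Square later requires no change to Circle or to totalArea().
class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

class Areas {
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) total += s.area();
        return total;
    }
}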




Liskov substitution principle


My comments: 
Substitutability is a principle in object-oriented programming. It states that, in a computer program, if S is a subtype of T, then objects of type T may be replaced with objects of type S (i.e., objects of type S may be substituted for objects of type T) without altering any of the desirable properties of that program (correctness, task performed, etc.). More formally, the Liskov substitution principle (LSP) is a particular definition of a subtyping relation, called (strong) behavioral subtyping, that was initially introduced by Barbara Liskov in a 1987 conference keynote address entitled Data abstraction and hierarchy. It is a semantic rather than merely syntactic relation because it intends to guarantee semantic interoperability of types in a hierarchy, object types in particular. Liskov and Jeannette Wing formulated the principle succinctly in a 1994 paper as follows:
Let q(x) be a property provable about objects x of type T. Then q(y) should be provable for objects y of type S where S is a subtype of T.
In the same paper, Liskov and Wing detailed their notion of behavioral subtyping in an extension of Hoare logic, which bears a certain resemblance to Bertrand Meyer's Design by Contract in that it considers the interaction of subtyping with pre- and postconditions.
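The classic Rectangle/Square illustration (a standard example, not from the cited papers) shows how a syntactic subtype can fail this test:

// Square "is-a" Rectangle syntactically, but it breaks a property clients
// can prove about Rectangle, so it is not a behavioral subtype.
class Rectangle {
    protected int width, height;
    void setWidth(int w)  { width = w; }
    void setHeight(int h) { height = h; }
    int area() { return width * height; }
}

class BadSquare extends Rectangle {
    @Override void setWidth(int w)  { width = w; height = w; }
    @Override void setHeight(int h) { width = h; height = h; }
}

class Client {
    // Provable for Rectangle: after setWidth(5); setHeight(4); area() == 20.
    // Passing a BadSquare yields 16 and violates the expectation.
    static int resize(Rectangle r) {
        r.setWidth(5);
        r.setHeight(4);
        return r.area();
    }
}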

Interface segregation principle



My comments: 
  •  Break complex interfaces into fundamentally simpler ones that are more cohesive.
  • Clients programmed to those interfaces will also be smaller and more focused, so the entire system becomes more modular and highly cohesive, which further strengthens the Single Responsibility principle.
  • This makes the software pieces easily pluggable (in or out) and maintainable.


The interface-segregation principle (ISP) is one of the five SOLID principles of object-oriented design,[1] and is similar to the High Cohesion Principle of GRASP.[2] It is a software development principle used for clean development and intends to make software easy to change. ISP keeps a system decoupled and thus easier to refactor, change, and redeploy. ISP splits interfaces which are very large into smaller and more specific ones so that clients will only have to know about the methods that are of interest to them. In a nutshell, no client should be forced to depend on methods it does not use.[1] Such shrunken interfaces are also called role interfaces.
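A minimal sketch with illustrative names: a fat device interface split into role interfaces, so a print-only client depends only on printing:

// Role interfaces keep each client coupled only to what it uses.
interface Printer {
    void print(String document);
}

interface PageScanner {
    String scan();
}

// A multifunction device implements both roles...
class MultiFunctionDevice implements Printer, PageScanner {
    public void print(String document) { System.out.println("printing: " + document); }
    public String scan() { return "scanned page"; }
}

// ...but a client that only prints depends only on Printer.
class PrintClient {
    private final Printer printer;
    PrintClient(Printer printer) { this.printer = printer; }
    void run() { printer.print("hello"); }
}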


Dependency inversion principle



My comments:
  • This is similar to the inversion-of-control design pattern (as implemented by Spring)
  • Spring is a glue framework: it provides specific implementations (classes) in cohesive, functionality-based packages and lets your application be divided into similarly cohesive implementations and packages, with the dependencies among them specified declaratively. It grants equal standing to all the different (and cohesive) components of the application, by virtue of which components lose the distinction of being "high-level" or "low-level"; all components (services/beans) are treated equally in terms of providing services within an application.

In object-oriented programming, the dependency inversion principle refers to a specific form of decoupling where conventional dependency relationships established from high-level, policy-setting modules to low-level, dependency modules are inverted (i.e. reversed) for the purpose of rendering high-level modules independent of the low-level module implementation details. The principle states:
A. High-level modules should not depend on low-level modules. Both should depend on abstractions.
B. Abstractions should not depend upon details. Details should depend upon abstractions.
The principle inverts the way some people may think about object-oriented design, as both high- and low-level objects depend on the same abstraction.[1]
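A minimal sketch with illustrative names; the wiring is done by hand here, where a container like Spring would do it declaratively:

// Both the high-level policy and the low-level detail depend on the same
// abstraction; the concrete implementation is injected from outside.
interface MessageSender {                      // the abstraction
    void send(String message);
}

class EmailSender implements MessageSender {   // low-level detail
    public void send(String message) {
        System.out.println("emailing: " + message);
    }
}

class Notifier {                               // high-level policy
    private final MessageSender sender;
    Notifier(MessageSender sender) {           // dependency injected
        this.sender = sender;
    }
    void notifyUser(String message) {
        sender.send(message);                  // no compile-time tie to EmailSender
    }
}

class Wiring {
    public static void main(String[] args) {
        new Notifier(new EmailSender()).notifyUser("order shipped");
    }
}

Swapping EmailSender for another MessageSender implementation requires no change to Notifier, which is exactly the decoupling the principle describes.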



Design considerations for Distributed Caching on the internet - white paper


Abstract
In this paper, we describe the design and implementation of an integrated architecture for cache systems that scale to hundreds or thousands of caches with thousands to millions of users. Rather than simply try to maximize hit rates, we take an end-to-end approach to improving response time by also considering hit times and miss times. We begin by studying several Internet caches and workloads, and we derive three core design principles for large scale distributed caches: (1) minimize the number of hops to locate and access data on both hits and misses, (2) share data among many users and scale to many caches, and (3) cache data close to clients. Our strategies for addressing these issues are built around a scalable, high-performance data-location service that tracks where objects are replicated. We describe how to construct such a service and how to use this service to provide direct access to remote data and push-based data replication. We evaluate our system through trace-driven simulation and find that these strategies together provide response time speedups of 1.27 to 2.43 compared to a traditional three-level cache hierarchy for a range of trace workloads and simulated environments.


Hierarchical caching has been examined in the context of file systems [5, 37, 20]. Muntz and Honeyman [32] concluded that the additional hops in such a system often more than offset improvements in hit rate, and characterized the extra level of cache as a "delay server." We reach similar conclusions in the context of Internet caching, leading to our design principle of minimizing the number of hops on a hit or miss.

Several researchers have proposed improving the scalability of a data hierarchy by splitting responsibilities according to a hash function [12, 44]. This approach may work well for distributing load across a set of caches that are near one another and near their clients, but in larger systems where clients are closer to some caches than others, the hash function will prevent the system from exploiting locality.

Several studies have examined push caching and prefetching in the context of web workloads [22, 23, 34]. These systems all used more elaborate history information to predict future references than the algorithm we examine. Because large, shared caches do a good job at satisfying references to popular objects, we explore prefetching strategies that will work well for the remaining large number of objects about whose access patterns little is known. Kroeger et al. [27] examined the limits of performance for caching and prefetching. They found that the rate of change of data and the rate of accesses to new data and new servers limits achievable performance.
Conclusions
Although caching is increasingly used in the Internet to reduce network traffic and the load on web servers, it has been less successful in reducing response time observed by clients. We examine several environments and workloads and conclude that this may be because traditional hierarchical caches violate several basic design principles for distributed caching on the Internet. To address these problems, we have constructed a hint hierarchy that supports direct access and push. Overall, our techniques provide speedups of 1.27 to 2.43 compared to a traditional cache hierarchy.




A little summarizing performance note: JDK 7 fork/join, parallelism, performance and scaling (horizontal/vertical), liveness issues


Before I start on the actual article: I got side-tracked by what I was reading about thrashing and where it occurs.

Thrashing (computer science) - Wikipedia, the free encyclopedia: (in main memory)
 "Thrashing lowers CPU utilization, sometimes driving it toward zero."

Cache thrashing
Thrashing in network communication: the silly window syndrome, where the payload of each packet becomes so small that the overhead of processing the packet header grows disproportionately (exceeding the payload itself), so the time spent on useful work, such as processing the payload, becomes very inefficient.

The network I/O unit stays busy shuffling these tiny packets, again leaving the computer doing less useful work.

You might also like to see the following links with respect to Thrashing:


http://en.wikipedia.org/wiki/Working_set

http://en.wikipedia.org/wiki/Translation_lookaside_buffer

To attack thrashing, the best long-term solution is to add more memory; otherwise, the following measures free up enough memory for a running process so that it doesn't thrash:

  • Increase the amount of RAM in the computer (generally the best long-term solution).
  • Decrease the number of programs being run on the computer.
  • Replace programs that are memory-heavy with equivalents that use less memory.
  • Improve spatial locality by replacing loops like:
    int m[256][256];
    int i, k;

    /* Column-major traversal: consecutive accesses jump 256 ints apart,
       touching a different cache line (and potentially page) each time. */
    for (i = 0; i <= 255; i++) {
      for (k = 0; k <= 255; k++) {
        m[k][i] = something();
      }
    }

    with

    int m[256][256];
    int i, k;

    /* Row-major traversal: consecutive accesses are adjacent in memory,
       so cache lines and pages are reused. */
    for (i = 0; i <= 255; i++) {
      for (k = 0; k <= 255; k++) {
        m[i][k] = something();
      }
    }


Herewith, I'd like to mention a small note (I have a bigger article hidden somewhere on my computer which I will try to pull out someday and add to this blog) that relates to the actual title of this post:


Real-life comparison:
Consider "good" software developers (those who can build scalable software) who are used more for production support than for developing better software and scaling the productivity of the organization. It is always better to use resources efficiently (do more with less), and the best way to do that is to run them at maximum efficiency; in other words, deploy your good resources in places where their real talent is used to the fullest. This talent/skill utilization scheme, when perfected, can give an organization maximum efficiency (and subsequently maximum capacity) at current costs.

Only once you have maximum performance should you go for scaling (horizontally).

Don't lock your resources into doing non-critical work, but do dedicate resources to critical work. If a single resource takes on multiple critical activities, it won't be efficient: the heavy context switching (of his mental environment, in the human case) eventually results in sub-par performance. This analogy carries over to multithreaded concurrent programs and how well they are written to utilize the resources available to them. One unseen, or rather intangible, resource is time, which is often taken for granted, mistaken for performance, or blissfully forgotten.

Increasing contention results in low CPU utilization and heavy context switching, and the liveness problems that come with it (livelock/deadlock/starvation) leave the CPU doing less of what it should be doing more of. Programs that are not well written for concurrency are not easy to scale either: adding more resources (such as CPU or memory) won't increase performance, because it does not solve the contention problem.

Before scaling, attack liveness and memory-leak issues so that your system performs efficiently (at maximum efficiency) with its current resources. Take into consideration good divide-and-rule techniques anywhere and everywhere you can deploy them: distributed caches, distributed memory, state partitioning, database partitioning, transaction isolation levels (more granularity = more concurrency).
In other words, design for performance when writing a new system (define your performance goals). When improving an existing system, solve the problems above first to get whatever gains you can (remove liveness problems, partition data, and so on); scaling (adding resources) should be the last measure for improving performance. And always scale horizontally: it's cheap and flexible, and you can easily scale up or down if you have varying (or seasonal) load on your system. Vertical scaling has a huge price, locks you in, and doesn't leverage your application design; only the hardware is doing the work of giving you more performance, so it also becomes costly. It is not easy to scale down, and not flexible.

JDK 7 supplements the java.util.concurrent package with a fork/join API that further leverages the concurrency capabilities of your Java application by providing work-stealing: you break heavy, long-running tasks down (recursively) to a smallest practical unit that the thread pool works on, then join the results recursively, bottom-up, and return the final result. This allows more concurrency because it lets the framework keep breaking tasks down.

In a non-work-stealing multithreaded environment, some threads in your thread pool may sit idle if there are no new tasks to run while the busy threads grind through substantially long-running tasks. If CPU is available for those idle threads, there is more scope for parallelism, provided you can break the long-running tasks down in the first place.

Fork/join's purpose is just that. It helps you do more (in parallel) by allowing you to break down your tasks so that idle threads can "steal" the pending work (relative to the non-fork/join multithreaded world).
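A minimal sketch using JDK 7's ForkJoinPool and RecursiveTask (the task and threshold here are illustrative): recursively split a sum until the pieces are small, compute them directly, and join the partial results bottom-up:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Idle workers steal the forked subtasks, keeping all cores busy.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {           // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                            // run the left half asynchronously
        return right.compute() + left.join();   // compute right, then join left
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(pool.invoke(new SumTask(data, 0, data.length)));
    }
}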

