Repository Management with Nexus

2.4. What is a Repository?

Maven developers are familiar with the concept of a repository: a collection of binary software artifacts and metadata stored in a defined directory structure which is used by clients such as Apache Ivy to retrieve binaries during a build process. In the case of the Maven repository, the primary type of binary artifact is a JAR file containing Java bytecode, but there is no limit to what type of artifact can be stored in a Maven repository. For example, one could just as easily deploy documentation archives, source archives, Flash libraries and applications, or Ruby libraries to a Maven repository. A Maven repository provides a platform for the storage, retrieval, and management of binary software artifacts and metadata.

In Maven, every software artifact is described by an XML document called a Project Object Model (POM). This POM contains information that describes a project and lists a project’s dependencies — the binary software artifacts which a given component depends upon for successful compilation or execution.

When Maven downloads a dependency from a repository, it also downloads that dependency’s POM. Given a dependency’s POM, Maven can then download any other libraries which are required by that dependency. The ability to automatically calculate a project’s dependencies and transitive dependencies is made possible by the standard and structure set by the Maven repository.

Maven and other tools, such as Ivy which interact with a repository to search for binary software artifacts, model the projects they manage and retrieve software artifacts on-demand from a repository. When you download and install Maven without any customization, Maven will retrieve artifacts from the Central Repository which serves millions of Maven users every single day. While you can configure Maven to retrieve binary software artifacts from a collection of mirrors, the best practice is to install Nexus and use it to proxy and cache the contents of Central on your own network.

In addition to Central, there are a number of major organizations, such as Red Hat, Oracle, and Codehaus which maintain separate repositories.

While this might seem like a simple, obvious mechanism for distributing artifacts, the Java platform existed for several years before the Maven project created a formal attempt at the first repository for Java artifacts. Until the advent of the Maven repository in 2002, a project’s dependencies were gathered in a manual, ad-hoc process and were often distributed with a project’s source code. As applications grew more and more complex, and as software teams developed a need for more complex dependency management capabilities for larger enterprise applications, Maven’s ability to automatically retrieve dependencies and model dependencies between components became an essential part of software development.

2.4.1. Release and Snapshot Repositories

A repository stores two types of artifacts: releases and snapshots. Release repositories are for stable, static release artifacts. Snapshot repositories are frequently updated repositories that store binary software artifacts from projects under constant development.

While it is possible to create a repository which serves both release and snapshot artifacts, repositories are usually segmented into release or snapshot repositories serving different consumers and maintaining different standards and procedures for deploying artifacts. Much like the difference between a production network and a staging network, a release repository is considered a production network and a snapshot repository is more like a development or a testing network. While there is a higher level of procedure and ceremony associated with deploying to a release repository, snapshot artifacts can be deployed and changed frequently without regard for stability and repeatability concerns.

The two types of artifacts managed by a repository manager are:

Release
A release artifact is an artifact which was created by a specific, versioned release. For example, consider the 1.2.0 release of the commons-lang library stored in the Maven Central repository. This release artifact, commons-lang-1.2.0.jar, and the associated POM, commons-lang-1.2.0.pom, are static objects which will never change in the Maven Central repository. Released artifacts are considered to be solid, stable, and perpetual in order to guarantee that builds which depend upon them are repeatable over time. The released JAR artifact is associated with a PGP signature, an MD5 and SHA checksum which can be used to verify both the authenticity and integrity of the binary software artifact.
Snapshot
Snapshot artifacts are artifacts generated during the development of a software project. A Snapshot artifact has both a version number such as "1.3.0" or "1.3" and a timestamp in its name. For example, a snapshot artifact for commons-lang 1.3.0 might have the name commons-lang-1.3.0-20090314.182342-1.jar the associated POM, MD5 and SHA hashes would also have a similar name. To facilitate collaboration during the development of software components, Maven and other clients that know how to consume snapshot artifacts from a repository also know how to interrogate the metadata associated with a Snapshot artifact to retrieve the latest version of a Snapshot dependency from a repository.

A project under active development produces snapshot artifacts that change over time. A release is comprised of artifacts which will remain unchanged over time.

2.4.2. Repository Coordinates

Repositories and tools like Maven know about a set of coordinates, including the following components: groupId, artifactId, version, and packaging. This set of coordinates is often referred to as a GAV coordinate, which is short for Group, Artifact, Version coordinate. The GAV coordinate standard is the foundation for Maven’s ability to manage dependencies. Four elements of this coordinate system are described below:

groupId
A group identifier groups a set of artifacts into a logical group. Groups are often designed to reflect the organization under which a particular software component is being produced. For example, software components being produced by the Maven project at the Apache Software Foundation are available under the groupId org.apache.maven.
artifactId
An artifact is an identifier for a software component. An artifact can represent an application or a library; for example, if you were creating a simple web application your project might have the artifactId "simple-webapp", and if you were creating a simple library, your artifact might be "simple-library". The combination of groupId and artifactId must be unique for a project.
version
The version of a project follows the established convention of Major, Minor, and Point release versions. For example, if your simple-library artifact has a Major release version of 1, a minor release version of 2, and point release version of 3, your version would be 1.2.3. Versions can also have alphanumeric qualifiers which are often used to denote release status. An example of such a qualifier would be a version like "1.2.3-BETA" where BETA signals a stage of testing meaningful to consumers of a software component.
packaging
Maven was initially created to handle JAR files, but a Maven repository is completely agnostic about the type of artifact it is managing. Packaging can be anything that describes any binary software format including ZIP, SWC, SWF, NAR, WAR, EAR, SAR.

2.4.3. Addressing Resources in a Repository

Tools designed to interact Maven repositories translate artifact coordinates into a URL which corresponds to a location in a Maven repository. If a tool such as Maven is looking for version 1.2.0 of the commons-lang JAR in the group org.apache.commons, this request is translated into:

<repoURL>/org/apache/commons/commons-lang/1.2.0/commons-lang-1.2.0.jar

Maven would also download the corresponding POM for commons-lang 1.2.0 from:

<repoURL>/org/apache/commons/commons-lang/1.2.0/commons-lang-1.2.0.pom

This POM may contain references to other dependencies which would then be retrieved from the same repository using the same URL patterns.

2.4.4. The Central Repository

The most useful Maven repository is the Central Repository. The Central Repository is the largest repository for Java-based components and the default repository built into Apache Maven. Statistics about the size of the Central Repository are available at http://search.maven.org/#stats. You can look at the Central Repository as an example of how Maven repositories operate and how they are assembled. Here are some of the properties of release repositories such as the Central Repository:

Artifact Metadata
All software artifacts added to the Central Repository require proper metadata, including a Project Object Model (POM) for each artifact which describes the artifact itself and any dependencies that software artifact might have.
Release Stability
Once published to the Central Repository, an artifact and the metadata describing that artifact never change. This property of release repositories guarantees that projects which depend on releases will be repeatable and stable over time. While new software artifacts are being published every day, once an artifact is assigned a release number on the Central Repository, there is a strict policy against modifying the contents of a software artifact after a release.
Repository Mirrors
The Central Repository is a public resource, and it is currently used by the millions of developers who have adopted Maven and other build tools that understand how to interact with the Maven repository structure. There are a series of mirrors for the Central Repository which are constantly synchronized. Users are encouraged to query for project metadata and cryptographic hashes and they are encouraged to retrieve the actual software artifacts from one of Central’s many mirrors. Tools like Nexus are designed to retrieve metadata from the Central Repository and artifact binaries from mirrors.
Artifact Security
The Central Repository contains cryptographic hashes and PGP signatures, which can be used to verify the authenticity and integrity of software artifacts served from Central or one of the many mirrors of Central and supports connection to Central in a secure manner via HTTP.