Conception of a UNIX system

Pierre Pronchery


Table of Contents
  Abstract
  Introduction
  1. The UNIX system
    A description of UNIX
      History of the system
      The philosophy
    Technical aspects
      The kernel
      The filesystem hierarchy
      The system software
      User applications
  2. Conception of a UNIX system: Defora GNU/Linux
    Creation of the base system
      The software choice
      Preparation of the system
    Administration of the system
      Configuration
      Software management
    Distribution of the system
      Using the software manager
      Without any initial system
  3. Development of the software manager
    Description of the application
      Aim of pkgr
      Features
      Development background
      Code architecture
    About the packages
      The packages format
      Package repositories
      Packaging an application
    Operations supported
      Packages extraction
      Packages probing
      Packages installation
      Packages uninstallation
      Installed packages listing
      Files search
  Conclusion
  Bibliography
  A. Packaging documentation
  B. Packaging script
List of Tables
  1-1. Contents of the root directory
  1-2. Contents of the /var directory
  1-3. Contents of the /usr directory
  3-1. System targets
  3-2. Hardware targets
List of Figures
  3-1. pkgr class hierarchy diagram
List of Examples
  1-1. Counting users
  1-2. Compiling and installing from source
  1-3. Installing from self-configuring source
  1-4. Installing a ported software
  1-5. Installing a package
  3-1. pkgr help screen
  3-2. Some package installation examples
  3-3. Sample /etc/pkgr/sources file
  3-4. pkgr package files extraction
  3-5. pkgr packages probing
  3-6. pkgr packages installation
  3-7. pkgr packages uninstallation
  3-8. pkgr installed packages listing
  3-9. pkgr installed files search

Abstract

Conception of a UNIX system.

The project presents the UNIX operating system, including its history and philosophy. The most important technical aspects are also discussed. All of this forms the first part of this report, but the main intention is to create an actual system, assembled from existing tools.

This system, called Defora, is a complete UNIX distribution, using GNU/Linux software as its base. The conception process, from the compilation of the essential tools to the creation of a distribution medium, is described in the second part of the project.

The goal was a production-level distribution. This implies complying with UNIX conventions, and giving users an easy and efficient way to manage software. This is why a software manager has been written for the project; its development and usage details are described in the third part of this report.


Introduction

Computing is a young technology, in which everything tends to evolve very fast. Computer hardware is now orders of magnitude more powerful than it was thirty years ago, while personal computers now fit in a pocket. Interaction with users has also evolved many times since the beginning: from printers to 3D graphics, by way of text-based, graphical and remote interfaces, for example.

There is at least one technology, one of the most important, which is still used today, thirty years after its creation: the UNIX operating system. From a small group of researchers to the explosion of the internet, this system still has the potential to contribute to the future of computing.

The UNIX story itself has had its share of passions and conflicts. This is probably its main strength: over the years, people have formed communities to share their passion for the system, and to keep improving it. This is particularly true today, as a free and open UNIX system is bringing together millions of people, from universities, companies, or simply hobbyists: GNU/Linux.

I joined this project myself over three years ago, as a user and more recently as an application developer. I had always been interested in computers, but using this system raised my curiosity: I wanted to know how and why it had been built this way.

The UNIX system.

The conception of UNIX is bound to its history, which is therefore the first point to be dealt with. But more than just an operating system, an entire philosophy of computing was born with it. The technical aspects of the system, such as the kernel, the filesystem, the system software and the user applications, follow.

Conception of a UNIX system: Defora GNU/Linux.

The work already done around UNIX, and the flexibility of the system, make it possible today for a single person to create such a system. Its creation, administration and distribution are described in this part.

Development of the software manager.

In the typical UNIX spirit, a piece of software is itself a file, installable on the system. Such software files are called "packages". They are the key to an efficient UNIX system, which is why a packaging system has been created for the above system: "pkgr". It is fully described, including the packages themselves and the different operations supported.


Chapter 1. The UNIX system

A description of UNIX

History of the system

UNIX is a specification of an operating system, born around 1970 at Bell Laboratories. It was largely inspired by another project, run in the 1960s by an association of a research institute, MIT, and two companies, Bell and General Electric (GE). At that time, each vendor and each type of hardware had its own operating system, and these systems were unable to interoperate.

This project was called Multics, for "Multiplexed Information and Computing Service". Its goal was to perform interactive tasks for many users at a time, in a convenient way. However, it turned out to be too expensive, and Bell Labs withdrew from the project.

A group of Multics users at Bell Labs did not lose hope, and started another effort to produce such a system: Ken Thompson, Dennis Ritchie, Doug McIlroy, and J. F. Ossanna. In 1969, they wrote down their informal design of an operating system and passed these notes on to the other researchers: they already contained concepts still used today, like the "inodes" for filesystems.

In 1970 the system was implemented on a PDP-7 computer, and the name "UNIX" was suggested, as a reference to "Multics". In 1971 UNIX was used at production level on a PDP-11, and some of these computers were even sold with UNIX running on them. First written in assembly language, UNIX was then rewritten using two new languages: B, interpreted, and C, compiled. Nowadays most UNIX implementations are still written in C.

In 1976, a member of the Bell Labs team, Ken Thompson, taught the system at the University of California, Berkeley. The "Berkeley Software Distribution", or "BSD", was born. Its improvements to the system became very well known, and a group was created to continue the development in common, the "Computer Systems Research Group" (CSRG). Many organizations joined it: academic, military, and some commercial firms too.

Over the 1980s, as its popularity increased, more and more versions of UNIX were developed and sold. The leader, AT&T, joined forces with Sun Microsystems to bring their work on UNIX together. Other companies, afraid of the commercial potential of this alliance, created the "Open Software Foundation" (OSF) in 1988, an effort towards an "open" specification of UNIX, which several other companies also joined. The first "UNIX war" took place, with the OSF against AT&T and its "UNIX International".

In 1991, AT&T sold shares of its UNIX business to 11 other companies. Novell Corporation acquired the rights to UNIX in 1993, followed by the Santa Cruz Operation in 1995. Nowadays many commercial distributions of UNIX are available, such as Solaris from Sun Microsystems, HP-UX from Hewlett-Packard, AIX from IBM, or Tru64 from Compaq. Some free implementations of UNIX are also available, such as the BSDs (FreeBSD, NetBSD and OpenBSD), GNU/Linux, or the Hurd.

UNIX and C had a major influence on computing over the last three decades, and are still widely used today.


The philosophy

During the creation of the system, important conceptual ideas arose about its development and use. It became obvious that the system should be composed of a set of small tools working together. Every program should do one particular thing only, and do it well. To allow transparent interaction between programs, all the data they exchange takes the form of text streams.

With this concept of combining simple tools, it became easy to perform complex tasks with a "chain" of tools. Users can do what they want with little thought, often at their first attempt.

Example 1-1. Counting users

who gives the list of the users currently logged in, while wc counts the words, characters or lines of its input. To obtain the number of users logged in:

khorben@pinge:~$ who | wc -l
     10

This is one correct way, though not the only one; the uptime command confirms the result:

khorben@pinge:~$ uptime 
  8:35pm  up  10:35,  10 users,  load average: 0.00, 0.02, 0.00

Another idea concerns the interaction with the end user, who should be shielded from the implementation details of a program, and warned only when something has gone wrong.

In UNIX, every piece of information and every device is represented as a file. For example, a printer is a file: writing data to the printer file will print it; directories are files, containing other files; kernel settings can be read, and possibly modified, via virtual files; ...
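As a minimal sketch of this idea, the following shell lines use two device files found on practically every UNIX system, /dev/null and /dev/zero, exactly as if they were regular files. The helper name count_zero_bytes is an assumption made for this example only.

```shell
#!/bin/sh
# Devices are read and written with the same tools as regular files.

# Writing to the null device discards the data, like writing to any file.
echo "discarded" > /dev/null

# count_zero_bytes N: read N bytes from the zero device and count them.
# (Hypothetical helper, named for this example only.)
count_zero_bytes() {
    dd if=/dev/zero bs=1 count="$1" 2>/dev/null | wc -c | tr -d ' '
}

count_zero_bytes 16   # prints 16
```

The same redirections and tools (echo, dd, wc) would work unchanged on a regular file, which is the whole point of the convention.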

Last but not least, documentation. Every command comes with adequate help, including a list of its known bugs. This honesty probably helped the users to feel confident with the system.


Technical aspects

The kernel

This is the most important part of an operating system: the layer between the computer hardware and the system software. In UNIX it is usually kept quite small, and two different kinds exist.

  • monolithic kernels: the kernel is a single program, handling all the necessary tasks.

  • microkernels: the kernel itself is minimal, and is completed by multiple modules, each addressing a specific task.

While monolithic kernels are the most common and the easiest to write, microkernels tend to be preferred for new projects, because they are more flexible and portable.


The filesystem hierarchy

A reference guide on this subject is being developed, the "Filesystem Hierarchy Standard", available at http://www.pathname.com/fhs. The various implementations of UNIX differ from it in places, but not in important ways.

The root directory.

In UNIX one directory is the parent of all the others, and is called the "root" directory. It typically contains these subdirectories:

Table 1-1. Contents of the root directory

Directory  Usage
bin        Contains the essential binaries, usable by any user.
boot       Contains the files necessary for the boot loader (including the kernel).
dev        Contains special files, most of which are access points to hardware devices or system entities (terminals, network protocol drivers, ...).
etc        System setup files.
home       Contains the users' directories, often called "home directories".
lib        Contains the essential libraries and shared objects, like the libc and kernel modules.
mnt        Contains the additional filesystems, for example mounted from removable media (floppies, CD-ROMs, ...), users' drives, network drives, ...
opt        Contains additional programs, typically in a /opt/vendor_or_software/{bin,lib,...} hierarchy.
proc       Specific to Linux, although other UNIX systems may have a similar directory. It gives access to the kernel and its tuning, via files.
sbin       Contains the essential binaries, intended for the system administrator.
tmp        Provides a writable space for the temporary files of user applications.
usr        Contains another, similar hierarchy, detailed below.
var        Contains variable data, such as mail, web pages, printer spools, databases, ... (detailed below).

Purpose of such a hierarchy.

This separation between categories of files is deliberate. It allows great flexibility when combining several hard drives in a single system, when using network resources, or even for backup purposes.

All the essential data is in /bin, /boot, /dev, /etc, /lib and /sbin. These directories should be on the same device, since together they allow other devices to be mounted, on /usr for example.

The users' files from /home can be on another device, or shared across a network: this separation makes it very easy.

The files from /etc and /var are most of the time specific to a given system, but it is also possible to share them, which allows sharing mail or web content, and immediate setup updates, across any number of computers. The /var directory contains specific subdirectories as well:

Table 1-2. Contents of the /var directory

Directory  Usage
lib        Variable information from common software.
lock       Lock files, shared between applications to avoid some file access conflicts.
log        System and application logfiles, providing information about their operations.
mail       Contains the users' mail.
run        Run-time information for processes.
spool      Transient data, awaiting further processing, such as the printer queue or outgoing mail.
tmp        Provides a temporary space like /tmp, except that its contents may "survive" a certain number of system reboots.

The /usr directory is the easiest to share between systems, because it can be read-only. It contains all the usual applications and shared files, which makes it the biggest directory on most systems. Moreover, as it comes with the distribution, its contents are quick and easy to reproduce when installing a system: backups are generally not needed for this directory.

It typically contains the following directories:

Table 1-3. Contents of the /usr directory

Directory  Usage
bin        Binaries for user applications.
doc        Various shared documentation, should be in share/doc instead.
games      Binaries for game applications.
include    C/C++ header files, for development or compilation of applications.
info       Documentation formatted for the info program, should be in share/info instead.
lib        Library files for user applications.
libexec    Binaries for user applications, but not suitable for invocation from the command line (for example, sftp-server, which is usually bound to an SSH2 connection).
local      Contains another hierarchy like this one, but for programs not coming from the initial distribution. The /opt directory might be preferred for these.
man        Documentation formatted for the man program, should be in share/man instead.
sbin       Binaries for user applications, intended for the system administrator.
share      Contains all other kinds of shared files, such as static databases, pictures, documentation, ...
X11R6      Historical location of the files from the X graphical server.

Applications access.

Another very important benefit of this hierarchy is the easy access to the various application files:

  • list of all available programs.

    They are all present in the above binary directories (*/bin/*, */sbin/*).

  • shared libraries.

    They are all in the above library directories, which makes library resolution fast, at program linking or execution.
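As a minimal sketch of the first point, the following shell function lists the programs available in a set of binary directories. The function name list_programs is an assumption made for this example; it is not a tool of the system.

```shell
#!/bin/sh
# list_programs DIR...: print the names of the executable files found
# directly in the given directories, sorted and without duplicates.
# (Hypothetical helper, named for this example only.)
list_programs() {
    for dir in "$@"; do
        # Silently skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        for file in "$dir"/*; do
            # Keep regular files (or links to them) that are executable.
            if [ -f "$file" ] && [ -x "$file" ]; then
                printf '%s\n' "${file##*/}"
            fi
        done
    done | sort -u
}

# Count the programs available in the usual binary directories.
list_programs /bin /sbin /usr/bin /usr/sbin | wc -l
```

Because every program lives in one of a few well-known directories, no registry or database is needed to answer this question.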


The system software

Direct kernel calls are not a convenient way to use the system, for programmers and end users alike: a set of tools is provided for both kinds of users.

For the developers.

A base library, the "libc", provides a development environment for developers. Written in C, it only supports C, but this is enough to compile a multi-language compiler. The system calls and the libc library calls are all documented in the same way as the programs. From this point, developers can compile and use the software and languages of their choice, as the system and library calls mostly follow the UNIX standards.

For end users.

The end users simply use the applications written by these developers, whatever programming language they were originally written in. There are even systems where a single installation of a program can speak each user's language, provided the translation has been written. Of course both text-based and graphical environments are available.

Thousands of applications are available on almost all UNIX systems. Some sites are well known for indexing them, such as "Freshmeat" (http://freshmeat.net).


User applications

Over time, UNIX applications have come to use similar methods for their compilation and installation from source. However, not everyone wants to compile every piece of software they need, as this can take a long time, and become complex if errors occur during the process. This is the role of the many UNIX distributions available today, and many of them have their own, simplified way to install software.

From source.

This also applies to binary files: here the word "source" can be read as "directly from the author or vendor" rather than as "source code".

The simplest way for developers is to use the famous make program. It allows the compilation and installation with only two commands:

Example 1-2. Compiling and installing from source

$ make
[...] [and if everything went right:]
$ make install
[...]

Originally written for C programs, this program supports many different languages, or has equivalents for them. However, to increase the portability of a program, or simply because a more complex system is needed, many programs use software like the GNU project's autotools, which automatically create the necessary files for the given platform. The commands to invoke are typically the following:

Example 1-3. Installing from self-configuring source

$ ./configure && make && make install

Not all software uses these methods; in any case, documentation is present in the README or INSTALL files, by tacit convention.

From the distributions.

Two different systems are mainly used by distributions: ports and packages.

A "port" is a script, possibly with a set of patches, which automatically runs all the commands needed to compile a given piece of software. It may even be able to generate binary packages of the software, avoiding its recompilation on other similar platforms. Ports are generally quite easy to write, can be updated quickly, and don't waste too much disk space.

Installing software from a port is generally a bit like the first case of installation from source:

Example 1-4. Installing a ported software

$ cd /var/ports/software
$ make
[...]
$ make install
[...]

Packages are binary distributions of software, directly installable on the target platform. They need a manager though, to perform maintenance operations such as the installation and removal of software, possibly with additional features. They may be quite complex to create, but they are the simplest way to manage software on a system: typically only one command is necessary to fetch and install any packaged software.

This is often the first distribution-dependent command one will need on a new system:

Example 1-5. Installing a package

# RPM-based distributions:
$ rpm -i <rpm_file>
# Debian-based distributions:
$ dpkg -i <deb_file>
$ apt-get install <software>
# Defora distribution:
$ pkgr -i <pkg_file_or_software>
# BSD packages (created from the ports):
$ pkg_add <package_file>

Chapter 2. Conception of a UNIX system: Defora GNU/Linux

This chapter deals with the whole process of the creation of a UNIX distribution. Existing software has been used for the system itself. Some tools have been written for the project though, such as the software manager described in chapter 3.


Creation of the base system

The software choice

The kernel.

This choice was obvious: Linux. There are many reasons, the most important being:

  • It's free.

    Not only does it cost nothing, but full access to the source code is provided. This is the main reason for its popularity, which in turn led to its other advantages. Even if at least one of the BSD flavours could have become this famous instead, the strength of Linux is certainly in its license: it requires any distributed modification of the kernel to be open source as well, so that it always benefits the community.

  • It works.

    Its stability and extensibility have been proven. It has been ported to many architectures, and it has device drivers for most available peripherals.

  • It's easy.

    Linux-based distributions are now very accessible, and still allow an advanced user to completely tune their system. Such possibilities, like recompiling the kernel, are certainly easier to grasp on a Linux system than on the possible alternatives. Moreover, the only alternatives offering a comparable set of possibilities are FreeBSD and OpenBSD. And my own experience so far is mainly with Linux-based systems.

  • It's rich.

    The kernel has a huge list of features, but would not be of much use without software running on top of it. Thanks to a community of millions of volunteers, and to many companies, Linux has become a platform of choice for server, development, workstation and desktop uses.

The libc and system tools.

As with the kernel, the choice is quite obvious, particularly with the Linux kernel: the GNU libc, called glibc. It comes with almost every Linux distribution, and is thus a major part of the success of GNU/Linux systems. Whenever Linux is said to be stable, it is also thanks to the glibc. The GNU project also provides a free implementation of all the common basic UNIX tools, which are of course designed to run with the GNU libc, and are often the best versions available. These tools are free in the same way as the Linux kernel, the GNU project being the origin of the license used, the GPL (General Public License, available from http://www.gnu.org/licenses/gpl.html).

Additional tools.

The system would not be complete without a way to install and manage it. The most important tool defining a UNIX distribution is its software manager. That's why one has been written for this project: its philosophy and use are presented in the administration section of this chapter, while its conception details are in chapter 3.


Preparation of the system

This step was inspired by the "Linux From Scratch" guide. It is written by a community of GNU/Linux users who assemble their own systems themselves; it also mentions the known problems with software compilation and installation, which was very helpful at times.

Creation of a nested compilation farm.

Of course a prior system is needed in order to compile the final system. But the compiled programs have to be linked against the libraries of the final system, not those of the initial system. That's why an intermediate system is needed.

However, when starting an intermediate system, there are no system libraries yet (and indeed no software at all). That's why the programs starting this system have to be compiled statically, which means that every executable file produced contains the code of every function it calls, even those that would usually be shared (at least those from the libc). This intermediate system consequently consumes more space, but little software needs to be compiled there.

The software to compile at this stage is basically the libc, the compiler, a shell, and some essential tools and low-level libraries such as make (compilation helper) or gzip (compression tool and library).

Creation of the new system base.

When every tool needed to compile the future system base has been statically compiled and installed, the creation of the new system may start. The first software to compile is the libc, and then the C compiler. However there is still a linking problem, because the default libraries used for compiling and linking are those present in / and /usr, and not those from the intermediate system. To solve this a special technique is used, called chroot. It consists in spawning a program, typically an interactive shell, in a jailed environment where the / directory is actually bound to another directory. In our case we need to launch a shell with our base system directory as the fake /. This technique is often used to increase the security of networked processes: if they are compromised, the damage can only affect the files present in their environment.
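The chroot step can be sketched as follows. This is only an illustration under stated assumptions: the function name enter_base_system, the DRY_RUN variable and the /mnt/defora path are invented for this example and do not come from the project. The chroot command itself needs root privileges, so the sketch can also just print what it would run.

```shell
#!/bin/sh
# enter_base_system ROOT [COMMAND...]: run COMMAND (default: /bin/bash)
# with ROOT as its root directory. The real chroot call requires root
# privileges; with DRY_RUN=1 the command is only printed.
# (Hypothetical helper and variable, named for this example only.)
enter_base_system() {
    root=$1
    shift
    # Default to an interactive shell when no command is given.
    [ $# -gt 0 ] || set -- /bin/bash
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo chroot "$root" "$@"
    else
        chroot "$root" "$@"
    fi
}

# /mnt/defora is an assumed location for the base system directory.
DRY_RUN=1 enter_base_system /mnt/defora
```

Inside the resulting shell, /lib and /usr/lib are those of the new system, so the compiler and linker pick up the right libraries without any special option.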

From this point the other programs and libraries are compiled and installed one by one, in an order depending on their respective needs. For instance, bash (shell) may be installed immediately, to test the new system, but it can benefit from the curses library, which should then be compiled beforehand (of course this can also be done afterward).

This example clearly illustrates the need for dependency tracking when the system has to be distributed in binary format (and even in source format). This is a good reason to use a software management system.
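To make the ordering problem concrete, here is a small sketch using tsort, a standard UNIX tool that performs a topological sort. The dependency pairs below are illustrative assumptions based on the bash and curses example above, not the actual dependency data of the distribution, and nothing here implies that the software manager works this way.

```shell
#!/bin/sh
# build_order: print one valid build order for a few packages.
# Each input line "A B" states that A must be built before B.
# (Hypothetical helper; the pairs are illustrative only.)
build_order() {
    tsort <<'EOF'
glibc ncurses
ncurses bash
glibc bash
EOF
}

build_order
```

With these three constraints only one order is possible: glibc, then ncurses, then bash. With hundreds of packages, keeping such an order right by hand quickly becomes impractical.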


Administration of the system

Configuration

In an attempt to present the system setup in a convenient way for the user, setup files have been placed according to this simple rule:

  • A specific program or task needs one file:

    the file is directly placed in /etc.

  • Otherwise:

    the files are placed in a subdirectory of /etc.
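The rule is simple enough to be written as a one-branch function. This is a sketch only: config_location is a name invented for this example, and "inittab" is an assumed example of a single-file configuration.

```shell
#!/bin/sh
# config_location NAME COUNT: print where the setup files of the program
# or task NAME belong, given the number of files it needs.
# (Hypothetical helper, named for this example only.)
config_location() {
    name=$1
    count=$2
    if [ "$count" -eq 1 ]; then
        # A single file goes directly in /etc.
        echo "/etc/$name"
    else
        # Several files go in a dedicated subdirectory of /etc.
        echo "/etc/$name/"
    fi
}

config_location inittab 1   # prints /etc/inittab
config_location pkgr 2      # prints /etc/pkgr/
```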

This rule has been followed where possible, and some programs had to be patched to respect it. Most original software packages use the GNU autotools, which are very flexible in this respect, but some required symbolic links to other parts of the system, like /var/lib subdirectories.

Where scripts had to be written, for example the init scripts (for system initialization), consistency was the main concern. These scripts are in /etc/init/rc.d, installed by their respective packages. Unfortunately the attempt to share code between them (which would enforce consistency) has not been very successful, because of the complexity of some services.

More generally, some software needed adjustments, for example in order to respect the Filesystem Hierarchy Standard. Here too the "Linux From Scratch" guide was very helpful.


Software management

This is certainly the main role of any UNIX distribution: providing software to users. It is the most frequent operation performed by system administrators, so it has to be easy, fast, and efficient. Almost every distribution has its own system, and I had my own idea of how things should be done, so Defora needed its own.

The idea is simply to allow the installation or uninstallation of multiple packages at the same time, without the need to handle package files directly. This requires taking dependencies between packages into account, as well as the concept of remote repositories. Of course the usual database and package information operations have to be supported: package probing, listing of installed packages, file search.

These operations should ideally require very little effort: the software manager should accept a simple syntax, and offer short but sufficient help. This is not completely the case in many well-known distributions, so I hope this work will be easier to grasp.


Distribution of the system

Using the software manager

As it is able to simply uncompress packages, the software manager alone is able to create a base system of the distribution. The following software is needed:

bash, bzip2, gcc, glibc, libtools, ncurses, pkgr and tar.

From this point it is possible to chroot inside this minimal system, just like when it was still being compiled, and install the rest of the desired software. Depending on the needs, these should be installed: a kernel image, filesystem utilities (e2fsprogs, fileutils, util-linux, ...), a bootloader (lilo), a text editor (vim), system initialization (sysvinit).

If installed on a bootable device, this new system can be run as the main one, and is completely self-manageable and reproducible. This method required a UNIX system previously installed on the machine, but the next solution does not.


Without any initial system

The only way to install the system without one already installed is to boot the computer from a removable device containing one. Nowadays computers can boot from floppy disks, CD-ROM drives (actually through floppy disk emulation), ZIP drives, and even USB drives for the most recent.

Defora on a CD-ROM.

The creation of such a system is not trivial, but one has been built for this project. A Defora system has been created with the above method, to be burned on a CD. It has been set up to launch a graphical session automatically, from which a special... shell script can be invoked to install the system interactively.

A particular configuration.

The first difficulty is to set up the system so that it needs as little disk space as possible, because every temporary file has to be stored in main memory. Moreover, one cannot boot directly from a CD-ROM drive, because it is then seen as a floppy disk drive. So a special bootable disk had to be prepared.

The bootable image.

A specially tuned kernel is booted from the floppy disk: it has to be small, because the floppy also hosts a minimal system image, loaded in main memory as the root filesystem, in order to continue the process. This image can reasonably include only one executable file, which then has to load the necessary drivers, search every drive for the installation CD, and mount it. The kernel is then told to keep its root filesystem in main memory, so that a writable space is available, and to launch the initialization sequence from the CD.

Burning the CD.

To burn the bootable CD, the floppy disk image has to be dumped, and placed in the CD-ROM file tree as a regular file. Then the CD-ROM image can be created on disk, or directly burned to a recordable CD, using the "El Torito" standard to specify the wanted floppy image as the emulated boot device.
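These two steps can be sketched with dd and mkisofs, whose -b option names the El Torito boot image and -c the boot catalog. The file names (cdroot, boot/boot.img, defora.iso), the floppy device /dev/fd0 and the helper functions are assumptions made for this example; the text does not state which commands were actually used.

```shell
#!/bin/sh
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
# (Hypothetical helper, named for this example only.)
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# make_boot_cd FLOPPY TREE ISO: dump the boot floppy into the CD tree,
# then build a bootable ISO image from that tree.
make_boot_cd() {
    floppy=$1
    tree=$2
    iso=$3
    # Dump the 1.44 MB floppy as a regular file inside the CD tree.
    run dd "if=$floppy" "of=$tree/boot/boot.img" bs=1024 count=1440 || return 1
    # -b: El Torito boot image, relative to the tree (floppy emulation
    # is used for a 1.44 MB image); -c: boot catalog created by mkisofs.
    run mkisofs -r -b boot/boot.img -c boot/boot.catalog -o "$iso" "$tree"
}

# Print the commands instead of running them.
DRY_RUN=1 make_boot_cd /dev/fd0 cdroot defora.iso
```

The resulting image can then be written to a recordable CD with any burning tool.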

Defora GNU/Linux is then a self-reproducing UNIX system.


Chapter 3. Development of the software manager

In this chapter the application is referred to as pkgr, which is its preferred executable name.


Description of the application

Aim of pkgr

This software aims to be the main (and if possible the only) way to manage software on the system. Of the various approaches presented earlier, pkgr uses packages. This approach has been preferred because it is closer to the UNIX philosophy: a piece of software is simply a file. Consequently, pkgr has to handle these files, the packages, whose format has to be defined, in order to perform the required operations. The package creation process was inspired by the ports system though, for simplicity and flexibility.

Beyond software management, and in keeping with the Defora principles, pkgr is intended to be as easy to use as possible. As it is a text-based application, the command-line arguments have been chosen with this in mind. To achieve this it uses a personal library called libtools, and more precisely its "Parser" class, to keep and enforce consistency with the rest of the system. Here is the output of the program help screen:

Example 3-1. pkgr help screen

Usage: pkgr [argument [option]] ...
Choose one operation from below:
  -i, --install         install one or more packages
  -u, --uninstall       uninstall one or more packages
  -l, --list            list known packages
  -p, --probe           probe one or more packages
  -e, --extract         extract one or more packages
  -f, --files           files search
Common options:
  -h, --help            display contextual help
  -v, --verbose         increase verbosity
  -V, --version         display program version

A graphical version of the program is not planned, but would be feasible.


Features

Of course pkgr supports the basic functions of a software manager. Below is a list of the basic, extra and planned features of the software.

Every software manager using packages should handle at least these:

  • package installation: extracts a package into the system, making it directly usable.

  • package uninstallation: removes all the files of a given installed package, so that the system gets back to the state it would have been in without it.

  • listing of installed packages: lists all the packages that are currently installed on the system.

  • probing of an installed package: gives information about an installed package, such as its file list.

  • probing of a package file: gives information about a package file, such as its file list.

More features can be proposed, and pkgr currently supports the following:

  • dependencies : gives information about the other packages that a given package requires.

  • package files extraction : simply extracts the files of a package.

  • files search : searches the package files database for matching strings.

  • package repositories : groups of packages listed in a special file, which allows a much simpler installation process; they are described below in 3.2.2.

  • configuration and site-specific files handling : recognition of site-specific files, which can be left in place even at uninstallation, so that a package that is installed again does not need to be set up again.

Some other features are planned, such as:

  • package installation over network : automatic downloading of packages placed on a distant repository.

  • full dependencies tracking : better handling of dependencies.

  • alternative programs handling : allowing several installed packages to implement the same command, with a setting for the default one.


Development background

The development environment is of course the Defora operating system, which has been fully described in the previous chapter. This software only depends on one external program, tar (multiple files archiver), and two external libraries, which are libtools (personal library, as said in 3.1.1), and libbz2 (file compression library).

Languages used.

This package manager does not consist only of a binary executable: scripting has been used wherever possible and appropriate. In the end, two programming languages have been used: C++ and bash.

The binary program development.

The pkgr executable has been developed in C++, for two main reasons: efficiency and object orientation. This use of C++ is particular, however: it does not use the STL (Standard Template Library) at all, so it is essentially object-oriented C code. This raises a few points:

  • low-level programming: the C library is very close to the basic services of the kernel; it gives a better feeling for what a program actually performs, with the drawback of possible code obscurity.

  • syntax facilities: C++ has many improvements over C, from variable declarations to stricter code checking (compiling C code with a C++ compiler helps to track down more problems).

  • code reuse and readability: here too C++ is better than C, because object orientation can make C++ code faster to write and easier to understand than its C equivalent.

  • the STL: this standard C++ library has not been used in this project for personal reasons; although I would like to learn it soon, I first had to improve my basic C++ programming skills; moreover, I particularly like low-level programming.

Also, to ease the compilation of the program, two Makefiles have been written. Installing the program then falls into the "make install" category, as seen in 1.2.4.

The packaging scripts development.

When it came to creating the package files themselves, the entire process could be performed by running a few programs, so a shell scripting language was the obvious choice. I chose bash because it is quite simple and widely used (I use it personally). Two script files are mainly used:

  • the first one, present in every package as defora/pkgr, contains the commands needed to install the given software in a particular directory.

  • the second one, /usr/lib/pkgr/functions, is called by the first at various steps of the packaging process, as it contains the operations common to all packages.


Code architecture

The program class hierarchy is very simple. There is one class per data entity: the command handler "Pkgr", the package "Package", and the database "Database".

Figure 3-1. pkgr class hierarchy diagram

pkgr acts as an engine managing the relationship between packages and databases. The local and remote package repositories are all Databases containing Packages, while command-line arguments are treated as virtual Packages (using Package data structure states), and package filenames as Package files. The different Package types are distinguished by the setting of their "filename" and "source" attributes.

  • "filename" and "source" are not set. The package is a virtual one, its only possible origin is a user request.

  • "filename" is set and "source" is not. The package structure has been initialized from a package file, its source does not matter.

  • "source" is set. The package matches a particular repository, and its "filename", if not set, can be guessed.
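The three cases above can be summarised as a small decision function. This is only a sketch, written in shell rather than in the C++ of pkgr itself; the function name package_kind and the labels it prints are hypothetical and only serve the illustration.

```shell
# Hypothetical helper: classify a package from its two attributes.
# $1 = "filename" attribute, $2 = "source" attribute (empty = not set).
package_kind() {
    if [ -n "$2" ]; then
        echo "repository"   # "source" is set: matches a repository entry
    elif [ -n "$1" ]; then
        echo "file"         # only "filename" is set: read from a package file
    else
        echo "virtual"      # neither is set: comes from a user request
    fi
}

package_kind "" ""                          # prints: virtual
package_kind "pkgr_0.4.8-0_li686.pkg" ""    # prints: file
package_kind "" "file:/pool2"               # prints: repository
```

The "source" test comes first because, as said above, a package with a source may or may not have its filename set.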


About the packages

The packages format

Compression.

The packages are actually bzip2-compressed files. bzip2 is a free data compressor. It favours compression ratio over speed, which is a good trade-off for a packaging system, since one is not likely to install gigabytes of software every day.

The files.

This compressed file actually has two parts: a plain text header, followed by a tar archive. tar is a well-known UNIX tool used to pack files together. It preserves their ownership and permissions, so that files extracted from the tar archive have the same attributes as before.

The header.

The most important part of designing the packaging system is choosing which data is needed to handle packages efficiently, and where to store it. I have chosen to keep each package's data inside the package itself. Every package contains the following information:

  • Name: name of the package.

  • Version: original version name of the package.

  • Revision: revision number of the package for the distribution; it is useful when the package has to be modified while the original archive does not have a new version number. The full version of a package is then "<version>-<revision>".

  • Architecture: architecture the package has been built for, including the kernel name (this allows pkgr to be used for multiple system types at the same time, like Linux and OpenBSD).

  • Size: contains the uncompressed size of the archive, in kilobytes.

  • Depends: comma-separated list of the package names (and optionally versions) this package depends on.

  • Provides: comma-separated list of the package names (and optionally versions) that this package also provides when it is installed.

  • Description: a one line description of the package content; it may be followed by a multi-line longer description, ending at the next empty line.
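As an illustration of how such a header can be read, here is a minimal sketch that extracts one field from a header given on standard input. It assumes that the header consists of "Field: value" lines and ends at the first empty line, which is my reading of the description above rather than a documented rule; the helper name header_field is hypothetical.

```shell
# Hypothetical helper: print the value of one header field.
# The decompressed header is read from standard input; it is ASSUMED
# here that the header ends at the first empty line.
header_field() {
    awk -v key="$1" '
        /^$/ { exit }
        index($0, key ":") == 1 {
            value = substr($0, length(key) + 2)
            sub(/^ +/, "", value)
            print value
            exit
        }
    '
}

header='Name: pkgr
Version: 0.4.8
Revision: 0
Architecture: li686
Size: 308
Depends:
Provides:
Description: Software packages manager'

version=$(printf '%s\n' "$header" | header_field Version)
revision=$(printf '%s\n' "$header" | header_field Revision)
echo "full version: $version-$revision"    # prints: full version: 0.4.8-0
```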

About the architectures.

As just said, packages of the same name and version can be built for different architectures: most packages contain binary files, such as libraries and programs, which are only usable on the platform they were built for. Moreover, packages built for the same hardware architecture but for different kernels or base libraries are also incompatible with each other. That is why the architecture field actually contains two pieces of information: the first letter defines the system target, and the rest defines the hardware target. Some possibilities are listed in these tables:

Table 3-1. System targets

Abbreviation  System
h             GNU Hurd with GNU libraries
l             Linux with GNU libraries

Table 3-2. Hardware targets

Architecture  Details
i386          Intel 80386 compatible processors
i486          Intel 80486 compatible processors
i586          Intel 80586 compatible processors
i686          Intel 80686 compatible processors
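Splitting an architecture field into its two parts is straightforward. The sketch below uses only the system letters listed in Table 3-1; the function name split_arch and its output format are hypothetical.

```shell
# Hypothetical helper: decode an architecture field such as "li686".
split_arch() {
    hardware=${1#?}              # everything after the first letter
    system=${1%"$hardware"}      # the first letter alone
    case $system in
        h) system_name="GNU Hurd with GNU libraries" ;;
        l) system_name="Linux with GNU libraries" ;;
        *) system_name="unknown system target" ;;
    esac
    printf '%s: %s, %s\n' "$1" "$system_name" "$hardware"
}

split_arch li686    # prints: li686: Linux with GNU libraries, i686
split_arch hi386    # prints: hi386: GNU Hurd with GNU libraries, i386
```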

Package files naming.

Another important convention in pkgr concerns package file names. Package files have to be named according to their content: "<name>_<version>-<revision>_<architecture>.pkg". The program checks this, which avoids potential errors. Beyond that, the same notation can be used at installation time to ask for a particular version, revision or architecture of a package. The installation process is fully described later, in section 3.3.3, but this possibility is worth an example:

Example 3-2. Some package installation examples

(The "-i" flag switches to installation mode)

Recognizing the extension and the valid name, pkgr looks directly for the given filename.

$ pkgr -i package1_1.02-0_li686.pkg

Installs the remote, most recent available version of package1.

$ pkgr -i package1

Installs the remote, latest revision of package1 version 1.02.

$ pkgr -i package1_1.02

Installs the remote, 1.02-2 version.

$ pkgr -i package1_1.02-2

Installs the most recent version of package1, forcing the li386 architecture one.

$ pkgr -i package1__li386
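The way such arguments decompose can be sketched with plain shell parameter expansion. This is not the parsing code of pkgr, only an illustration of the naming convention; the function name parse_pkg_spec is hypothetical, and it assumes that neither the name nor the version contains an underscore and that the version contains no hyphen.

```shell
# Hypothetical helper: split "<name>_<version>-<revision>_<architecture>.pkg"
# (or a shortened form of it) into its components.
parse_pkg_spec() {
    spec=${1%.pkg}                      # drop the extension, if any
    name=${spec%%_*}                    # up to the first underscore
    rest=
    case $spec in *_*) rest=${spec#*_} ;; esac
    verrev=${rest%%_*}                  # "<version>-<revision>", may be empty
    arch=
    case $rest in *_*) arch=${rest#*_} ;; esac
    version=${verrev%%-*}
    revision=
    case $verrev in *-*) revision=${verrev#*-} ;; esac
    printf 'name=%s version=%s revision=%s arch=%s\n' \
        "$name" "$version" "$revision" "$arch"
}

parse_pkg_spec package1_1.02-0_li686.pkg
# prints: name=package1 version=1.02 revision=0 arch=li686
parse_pkg_spec package1__li386
# prints: name=package1 version= revision= arch=li386
```

An empty component simply means "not requested", which is what lets the most recent matching package be chosen.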

Package repositories

As mentioned in the example above, remote package installation is possible. This section explains how such repositories are prepared and used. The local packages database also has some points in common with them.

Local repository: installed packages database.

Information about the installed packages is needed. The files list for a given package is stored in the corresponding /var/lib/pkgr/<package>.files file, but the most important part is the installed packages database. It basically contains the headers of the installed packages in a single file, /var/lib/pkgr/status. One extra field is kept there:

  • MD5: an MD5 hash of the installed package file. Its creation and use are not yet implemented; once they are, the user will be able to compare this field with the official values and determine whether the installed package has been modified. This addresses a security issue, and will help avoid having different packages with the same version (for example if a maintainer forgets to increment the revision number while updating a package).

Remote package repositories.

It is possible to define remote package repository sources in the setup file /etc/pkgr/sources. This way, invoking just a package name for installation selects the most recent package and installs its file automatically.

A repository consists of a definition file, called Packages, which lists all the available packages, and of subdirectories containing the packages, each named after the original source. This way the actual package filename can be generated automatically, using only the definition file and the sources file. This file simply contains the path to the repository, for example:

Example 3-3. Sample /etc/pkgr/sources file

# this is a comment line
#empty lines are allowed too

file:/pool2
#

The "file" protocol (direct file access) is the only one implemented at the moment, but at least "http" should be added, which would then allow installation over networks.
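Reading the sources file can be sketched as follows. The function name list_sources is hypothetical, and the sketch assumes that the Packages definition file sits directly at the root of each repository path, which is not stated explicitly above.

```shell
# Hypothetical helper: print the definition file of every repository
# listed in a sources file ($1), skipping comments and empty lines.
list_sources() {
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;
        esac
        proto=${line%%:*}               # e.g. "file"
        path=${line#*:}                 # e.g. "/pool2"
        case $proto in
            file) printf '%s\n' "$path/Packages" ;;   # ASSUMED location
            *)    echo "unsupported protocol: $proto" >&2 ;;
        esac
    done < "$1"
}

# With the sample file above, this would print /pool2/Packages.
if [ -r /etc/pkgr/sources ]; then
    list_sources /etc/pkgr/sources
fi
```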

The repository definition file uses the same format as the local packages database. However the possibility to store the files list for part or all of the packages is planned and partly implemented. This will allow a user to search transparently for files inside packages he hasn't even installed or downloaded.


Packaging an application

Documentation (in HTML) has been written as a reference for potential package maintainers. It is available as Appendix A of this report, and contains the steps needed to actually build a package. In fact, packaging an application is very simple, because it is very close to the way one would install any piece of software.

The packaging system has deliberately been kept as simple as possible, with flexibility and safety in mind. Given these criteria, and the format of the final package, an obvious solution was to use a shell script. Once the sources are uncompressed and prepared, creating the package takes a single command, which runs a bash shell script:

$ sh defora/pkgr

To keep things as simple as possible, the default script file contains only 7 effective lines. Three of them are pkgr internal calls; all the others are exactly the commands one would use to compile and install the software from source on one's own system, the only difference being that the installation directory has to be a particular defora/ subdirectory. The files are then automatically checked for consistency, and a separate package (called "<name>-dev") is automatically generated, containing the files required only by developers.

To store the additional pkgr file, and any additional patches, every package revision has its own patch file in the repository. It uses the unified output format of the GNU diff and patch programs, which respectively create the patches and apply them to the archives. These tools are not too difficult to handle, and a script has even been written to perform this task automatically. It is included as Appendix B.


Operations supported

The extract and probe operations are presented first, because package installation needs both of them.


Packages extraction

Only valid package filenames can be requested for extraction. First, pkgr uses the libbz2 library to open the file: it is the shared library provided by the bzip2 compression program. Then the header is read and checked (though not yet against the filename). The next step is to "untar" the embedded archive: the program forks and invokes the tar program, which is sent the archive for extraction.

In verbose mode, this operation prints the names of the files as they are being extracted (actually tar does it, with its verbose flag).
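The same idea can be sketched with standard tools in place of libbz2 and fork(). The helper below is hypothetical and is not how pkgr is implemented: it reads an already decompressed package on standard input, skips the header, and hands the remainder to tar. It assumes that the header ends at the first empty line.

```shell
# Hypothetical helper: hand the tar part of a package to tar.
# Standard input must be the already decompressed package; the header
# is ASSUMED to end at the first empty line. Arguments go to tar.
pkg_payload() {
    # read consumes the header one line at a time and stops at the
    # empty line, leaving the rest of the piped stream for tar.
    while IFS= read -r line && [ -n "$line" ]; do
        :
    done
    tar "$@"
}

# Usage (not run here):
#   bzip2 -dc pkgr_0.4.8-0_li686.pkg | pkg_payload -xvf -
```

Passing "-tvf -" instead of "-xvf -" would list the files without extracting them, which is what probing needs in verbose mode.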

Example 3-4. pkgr package files extraction

khorben@pinge:/pool2/pkgr$ ls
patch-pkgr_0.4.8-0.gz  pkgr-0.4.8.tar.gz  pkgr_0.4.8-0_li686.pkg
khorben@pinge:/pool2/pkgr$ time pkgr -e pkgr_0.4.8-0_li686.pkg 
Extracting "pkgr": done

real    0m0.578s
user    0m0.380s
sys     0m0.200s
khorben@pinge:/pool2/pkgr$ du -hs usr/
304k    usr
khorben@pinge:/pool2/pkgr$ pkgr -e --verbose pkgr_0.4.8-0_li686.pkg 
Extracting "pkgr": 
usr/
usr/bin/
usr/bin/pkgr
usr/lib/
usr/lib/pkgr/
usr/lib/pkgr/functions
usr/share/
usr/share/doc/
usr/share/doc/pkgr/
usr/share/doc/pkgr/AUTHORS
usr/share/doc/pkgr/BUGS
usr/share/doc/pkgr/COPYING
usr/share/doc/pkgr/pkgr.sample
usr/share/doc/pkgr/README
usr/share/doc/pkgr/TODO

Packages probing

If the argument given is a filename, pkgr opens the file, still using the libbz2 library. The header fields are read and printed on standard output. Otherwise, if the argument designates a valid installed package, its database entry is printed on screen. Probing of a remote repository package is not yet implemented, though it would be trivial.

In verbose mode, the package files lists are also printed. For a filename, tar is launched the same way as for an extraction, but lists the files instead of extracting them. Otherwise the corresponding files list from /var/lib/pkgr/ is printed.

Example 3-5. pkgr packages probing

khorben@pinge:/pool2/pkgr$ pkgr -p pkgr pkgr_0.4.8-0_li686.pkg 
Reading installed packages information: done
Package "pkgr" not found.

Name: pkgr
Version: 0.4.8
Revision: 0
Architecture: li686
Size: 308
Depends: 
Provides: 
Description: Software packages manager

Packages installation

This is the most complex operation to perform. Not everything has been implemented so far, but it works. For example, some package version comparisons are still loose, and updating a package does not remove the files that are no longer used, if any.

Database match.

Each given argument is analyzed: if it is a filename, the file is probed; if it is supposed to be known from a database, the repository databases are loaded and the most appropriate entry is selected. Otherwise the operation ends with an error, because at least one requested package could not be identified.

Dependencies tracking.

For every requested package, a check is performed to ensure its dependencies can be matched. For this it looks first at the other packages given on command line, and then in all package repositories known. Every necessary package is added to the list of those to install.
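The principle can be sketched as a work queue: each package pulls its dependencies into the list until nothing new appears. The sketch below is a simplification and not the pkgr algorithm: it uses a single hypothetical flat file with lines of the form "name: dep1, dep2" instead of the command line and the repositories, and it ignores versions. The function names are hypothetical too.

```shell
# Hypothetical flat database ($1): one line per package, "name: dep1, dep2".

# Succeeds if package $2 has an entry in the database.
pkg_known() {
    awk -F: -v n="$2" '$1 == n { found = 1 } END { exit !found }' "$1"
}

# Prints the direct dependencies of package $2, separated by spaces.
deps_of() {
    awk -F: -v n="$2" '$1 == n { print $2 }' "$1" | tr ',' ' '
}

# Prints every package to install for the requested ones ($2...).
resolve_deps() {
    deps_db=$1; shift
    result=
    while [ $# -gt 0 ]; do
        pkg=$1; shift
        case " $result " in *" $pkg "*) continue ;; esac   # already listed
        if ! pkg_known "$deps_db" "$pkg"; then
            echo "Dependency \"$pkg\" is unknown" >&2
            return 1
        fi
        result="$result $pkg"
        # Queue the dependencies of this package after the pending ones.
        set -- "$@" $(deps_of "$deps_db" "$pkg")
    done
    echo $result
}

db=$(mktemp)
printf 'pcre:\npcre-dev: pcre\n' > "$db"
resolve_deps "$db" pcre-dev     # prints: pcre-dev pcre
rm -f "$db"
```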

Current installed packages comparison.

When all dependencies are matched and installable, each package is checked to see whether it is already installed on the system. If so, the user is warned and the installation of this package is cancelled. Finally, the user is asked for a global confirmation if any package not initially requested is to be installed because of dependencies.

Actual installation.

Every requested package is then extracted. As this is done as the root user, in the system root directory /, the package files are directly usable by any user. The data structure containing the installed packages information is updated every time a package is extracted, but it is written to disk only if necessary, at the end of the whole operation, which is much more efficient. The other important information about installed packages is their files list: it is captured from the tar output and written to the appropriate place.

Verbose mode.

When asked to be verbose, pkgr also prints the list of files as they're uncompressed.

Example 3-6. pkgr packages installation

root@pinge:/pool2/pkgr# pkgr -i pkgr_0.4.8-0_li686.pkg 
Installing "pkgr": /bin/tar: usr/bin/pkgr: Could not create file: Text file busy
done
Updating installed packages database: done
khorben@pinge:/pool2/pkgr$ pkgr -p pkgr 
Reading installed packages information: done
Name: pkgr
Version: 0.4.8
Revision: 0
Architecture: li686
Size: 308
MD5: 00000000000000000000000000000000
Depends: 
Provides: 
Description: Software packages manager
root@pinge:/pool2/pcre# pkgr -i pcre-dev_4.1-0_li686.pkg 
Dependency "pcre" is unknown
Operation failed.
root@pinge:/pool2/pcre# pkgr -i pcre-dev_4.1-0_li686.pkg pcre_4.1-0_li686.pkg 
Installing "pcre-dev": done
Installing "pcre": done
file: file.cpp, line 148: fclose: Bad file descriptor
Updating installed packages database: done

This has illustrated a known problem at the same time: tar cannot overwrite files that are currently in use (here the pkgr binary itself, already installed from source), which is problematic in our case. Consequently, updates of the system libc, the C/C++ compiler libraries, the libtools library, tar, bzip2 and pkgr itself have to be done manually.

It also shows that MD5 checks are not implemented yet: the hash is not even computed (the corresponding code exists in another personal program, makepasswd, but has not been included yet).

A minor bug is also visible here, maybe inherited from the bzip2 library: the second and subsequent calls to fclose() on bzipped files fail, but without affecting normal operation.


Packages uninstallation

This operation is quite complex too. It expects already installed packages as arguments, and checks that they are. First the installed packages database is loaded. Every installed package is checked to see whether it depends on a package to be uninstalled; if so, it is added to the uninstallation list, and the user is warned and asked for confirmation.

The uninstallation process then removes all files listed as installed, for each package to uninstall. However, those located in /etc and /var are not removed, so that they are used again if the package is reinstalled. Reinstallation may actually overwrite them, so the user still has to back them up beforehand, although an automatic backup of these files is planned. Finally, every time a package is removed the data structure for installed packages is updated in main memory, and written to disk only if necessary at the end of the whole process.

A note about directories: they have to be removed in the reverse order of their creation. For this they are queued in memory, and removed at the end of each package removal.

In verbose mode, this operation lists the files as they are being removed.
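The removal order described above can be sketched as follows. The function name remove_listed_files is hypothetical, as is its first argument, a root prefix added so that the sketch can be tried safely outside of /. It assumes a files list with one absolute path per line, directories ending with a slash, as in the verbose output shown in the example.

```shell
# Hypothetical helper: remove the files of one package.
# $1 = root prefix (empty for the real system), $2 = files list.
remove_listed_files() {
    root=$1 list=$2
    dirs=
    while IFS= read -r path; do
        case $path in
            /etc/*|/var/*) continue ;;      # site-specific: left in place
            */) dirs="$path
$dirs" ;;                                   # queue, last created first
            *)  rm -f "$root$path" ;;
        esac
    done < "$list"
    # Directories go last, in reverse order of creation; rmdir refuses
    # to remove those still holding files of other packages.
    printf '%s' "$dirs" | while IFS= read -r dir; do
        rmdir "$root$dir" 2>/dev/null || true
    done
}

# Usage (not run here):
#   remove_listed_files "" /var/lib/pkgr/pkgr.files
```

Because rmdir only removes empty directories, shared ones such as /usr/bin/ survive, which matches the example where only the pkgr-specific directories disappear.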

Example 3-7. pkgr packages uninstallation

root@pinge:/pool2/pkgr# pkgr -v --uninstall pkgr
Resolving dependencies: done
Uninstalling package "pkgr": 
/usr/bin/pkgr
/usr/lib/pkgr/functions
/usr/share/doc/pkgr/AUTHORS
/usr/share/doc/pkgr/BUGS
/usr/share/doc/pkgr/COPYING
/usr/share/doc/pkgr/pkgr.sample
/usr/share/doc/pkgr/README
/usr/share/doc/pkgr/TODO
/usr/share/doc/pkgr/
/usr/lib/pkgr/
Updating installed packages database: done
root@pinge:/pool2/pkgr# pkgr -V
-bash: /usr/bin/pkgr: No such file or directory
root@pinge:/pool2/pkgr# cp -Rd usr/* /usr/
root@pinge:/pool2/pkgr# pkgr --version
pkgr 0.4.8
root@pinge:/pool2/pcre# pkgr -u pcre
The following extra packages will also be uninstalled:
 pcre-dev
Are you sure you want to continue [y/N]? y
Uninstalling package "pcre": done
Uninstalling package "pcre-dev": done
Updating installed packages database: done

Installed packages listing

This operation is very simple: once the installed packages database has been read, every entry is printed on screen, with the information reduced to package name, version, revision, and short description.

Verbose mode has no effect here.
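A comparable listing can be produced from the status file with awk. This is a sketch under the assumption that entries in the database are separated by empty lines; the function name list_installed is hypothetical, and the column widths are simply copied from the example output.

```shell
# Hypothetical helper: summarise a status database ($1).
# ASSUMPTION: entries are separated by empty lines.
list_installed() {
    awk '
        function row() {
            printf "%-16s|%-10s|%s\n", name, version "-" revision, desc
            name = version = revision = desc = ""
        }
        BEGIN {
            printf "%-16s|%-10s|%s\n", "Package name", "version", "description"
            print "---------------------------------------"
        }
        /^Name: /        { name = substr($0, 7) }
        /^Version: /     { version = substr($0, 10) }
        /^Revision: /    { revision = substr($0, 11) }
        /^Description: / { desc = substr($0, 14) }
        /^$/ && name != "" { row() }          # an empty line closes an entry
        END { if (name != "") row() }         # last entry, if not closed
    ' "$1"
}

# Usage (not run here):
#   list_installed /var/lib/pkgr/status
```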

Example 3-8. pkgr installed packages listing

khorben@pinge:~$ pkgr -l | head -n 6
Package name    |version   |description
---------------------------------------
sdlnet-dev      |1.2.5-0   |SDL net (development files)
sdlnet          |1.2.5-0   |SDL net
man-pages       |1.54-0    |Linux man pages
openssl-dev     |0.9.7a-0  |OpenSSL (development files)

Files search

This operation lists the installed filenames matching one or more strings. All it does is read every files list in /var/lib/pkgr/ and print the filenames matching the given strings. It is equivalent to the common UNIX grep command, just easier for a user to remember.

Verbose mode has no effect here.
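The grep equivalent mentioned above can be sketched like this. The function name files_search is hypothetical; unlike pkgr, the sketch takes the directory holding the files lists as its first argument and accepts a single search string.

```shell
# Hypothetical helper: search the files lists found in directory $1
# for the fixed string $2, printing "package: path" for each match.
files_search() {
    dir=$1 pattern=$2
    for list in "$dir"/*.files; do
        [ -f "$list" ] || continue
        pkg=$(basename "$list" .files)
        # -F takes the pattern literally, so "c++" needs no escaping.
        grep -F -- "$pattern" "$list" | while IFS= read -r path; do
            printf '%s: %s\n' "$pkg" "$path"
        done
    done
}

# Usage (not run here):
#   files_search /var/lib/pkgr c++
```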

Example 3-9. pkgr installed files search

khorben@pinge:~$ pkgr -f c++
binutils: /usr/man/man1/c++filt.1
binutils: /usr/bin/c++filt
qt: /usr/share/doc/qt/html/qd-editpreferencesc++.png
qt: /usr/share/doc/qt/html/qd-projectsettingsc++tabdialog.png
gnome-mime-data: /usr/share/pixmaps/document-icons/gnome-text-x-c++.png
gnome-icon-theme: /usr/share/icons/gnome/48x48/mimetypes/gnome-mime-text-x-c++.png

Conclusion

The incredible longevity of the UNIX operating system(s) is certainly not due to its technical concepts alone. These have indeed proven to be efficient, flexible and reliable over the last three decades. But the main strength of the system is obviously its users.

However, the typical UNIX users really are communities of users. Through its history, conception and resulting philosophy, the system has turned into a social movement. Engineers, scientists, teachers, students, administrators, or just hobbyists (and hopefully some commercial vendors) have shared their passion for the system throughout these years.

This passion has not always been peaceful: UNIX history is full of battles between ideas or software projects, whether commercial or not. The consequences did not necessarily hinder the progress of the system. On the contrary, these disputes not only raised the need for interoperability standards, but also continuously led developers to innovate, following new hardware and user needs.

Consequently, one can say for UNIX distributions, or even UNIX related software projects, that there are at least two different ways to deal with every possible need. From just reading a file, to the concurrent use of thousands of computers at a time, UNIX has multiple solutions. UNIX is a system of choice, and is full of choices.

And I feel passionate about the system too. Beyond the scope of the project, designing the Defora GNU/Linux operating system has also been a hobby for me. Completing every important feature of the software manager has been a satisfaction, and despite the few minor bugs left it is usable at a production level.

To end this report, I want to say a few words about the tools I used to write it. The DocBook system, referenced in the bibliography, generates documents from SGML or XML definitions in a dozen output formats. With a little effort, it can be used for publications such as books, manual pages or articles; a famous book publisher, O'Reilly, uses it in production. It is the result of combining a dozen different pieces of software, all open source, free, and running in a UNIX environment.


Bibliography

Web sites

Bell Labs, The Creation of the UNIX Operating System: http://www.bell-labs.com/history/unix/.

freestandards.org, Filesystem Hierarchy Standard: http://www.pathname.com/fhs.

Gerard Beekmans, Linux From Scratch (version 4.0): http://www.linuxfromscratch.org/view/4.0/.

BLFS Development Team (version CVS), Beyond Linux From Scratch: http://beyond.linuxfromscratch.org/view/cvs/.

Free Software Foundation, the GNU Project: http://www.gnu.org.

The Freshmeat team, freshmeat.net: http://freshmeat.net.

Many authors, Linux manual pages: http://www.win.tue.nl/~aeb/linux/man/.

Many authors, The Linux Documentation Project: http://www.tldp.org/.

Norman Walsh, Leonard Muellner, and Bob Stayton, DocBook: The Definitive Guide (version 2.0.8): http://www.docbook.org/tdg/en/html/docbook.html.


Appendix A. Packaging documentation

Packaging

It is recommended that you get the latest source version of pkgr (>= 0.4.0 at the moment). It contains the most recent version of the required files.

1. Uncompress original sources

$ tar -xzvf <name>-<version>.tar.gz

You are of course advised to read the supplied documentation, to see how the program should be compiled and installed. However, most programs include an auto-configuration script called "configure": if that is the case, half of the work is possibly already done.

2. Apply naming conventions (if necessary)

It is required that the original tarball is in gzipped tar format, and that it uncompresses its files into a subdirectory called "<name>-<version>". For example, you may have to do the following:

$ mv <name> <name>-<version>
$ tar -czvf <name>-<version>.tar.gz <name>-<version>

3. Patch the sources (if necessary)

If you really have to patch the original sources, here's how you would generally do that:

$ gunzip -c <patch_name> | patch -p0

However be sure you have understood the general distribution policy about patches. Well, it's yet to be written. So here's a brief:

4. Add the packaging information to the uncompressed sources

This is certainly necessary, as this distribution isn't quite as famous as Debian.

First, create the "defora" sub-directory:

$ mkdir <name>-<version>/defora

Then place the "defora/pkgr" file, which is executed to create the package. You may want to start with the default "pkgr.sample" file from the "pkgr" tarball:

$ cp <path_to_the_file>/pkgr.sample <name>-<version>/defora/pkgr

Then modify it as required.

Moreover, if you have additional modifications to apply to the original sources, that's the right moment for this.

5. Create the Defora packaging patch

The patch revision number (line "REVISION=0" in defora/pkgr) starts at 0. Increment it by one every time you update your patch after a public release. Then use the following commands:

$ mv <name>-<version> <name>-<version>.pkg
$ tar -xzvf <name>-<version>.tar.gz
$ diff -Naur <name>-<version> <name>-<version>.pkg | gzip \
    > patch-<name>_<version>-<revision>.gz

6. Build the program

In a perfect world, you would simply have to do this:

$ rm -fr <name>-<version>.pkg
$ gunzip -c patch-<name>_<version>-<revision>.gz | patch -p0
$ cd <name>-<version>
$ sh defora/pkgr

However keep these in mind:

Because the packages quality determines the distribution's quality.

7. Create the package

If everything looks good, it's time to create the package. Log in as root and execute "defora/pkgr" again; the package will be created. Note that even if this should not be necessary, it may be better to remove the source directory, and uncompress and patch it again.


Appendix B. Packaging script

#!/bin/bash



#check syntax
if [ $# -ne 2 ]; then
	echo "Usage: $0 <name> <version>"
	exit 1
fi


NAME=$1
VERSION=$2
REVISION="0"

echo "Please enter description for this package:"
read DESCRIPTION

#uncompress source
if [ ! -f "$NAME-$VERSION.tar.gz" ]; then
	echo "ERROR! Couldn't find file \"$NAME-$VERSION.tar.gz\"."
	exit 1
fi
rm -fr "$NAME-$VERSION" &&
tar xzvf "$NAME-$VERSION.tar.gz" || exit 1

#create defora subdirectory
if [ ! -d "$NAME-$VERSION" ]; then
	echo "ERROR! Couldn't find directory \"$NAME-$VERSION\""
	exit 1
fi
mkdir "$NAME-$VERSION/defora" &&
cat > $NAME-$VERSION/defora/pkgr << EOF
#!/bin/bash


#package info
NAME=$NAME
VERSION=$VERSION
REVISION=$REVISION
DESCRIPTION="$DESCRIPTION"

PROVIDES=""


#internals
. /usr/lib/pkgr/functions
pkgr_init

#clean
make distclean

#configure
./configure --host=\$HOST --prefix=/usr &&

#build
make &&

#install
make DESTDIR=\$PWD/defora/\${NAME}-\${VERSION} install &&

#package
pkgr_build
EOF

#apply changes
echo "Please apply further changes now and press enter when finished."
read

#create patch
mv $NAME-$VERSION $NAME-$VERSION.pkg &&
tar xzvf $NAME-$VERSION.tar.gz &&
diff -Naur $NAME-$VERSION $NAME-$VERSION.pkg | gzip \
    > patch-${NAME}_${VERSION}-$REVISION.gz &&
rm -fr $NAME-$VERSION.pkg &&

#apply patch
gunzip -c patch-${NAME}_${VERSION}-$REVISION.gz | patch -p0 &&

#build package
echo "Do you want to try to build the package? (CTRL+C to abort)"
read
cd $NAME-$VERSION &&
sh defora/pkgr || exit 0

#install package
if [ $UID -ne 0 ]; then
	exit 0
fi
echo "Do you want to install the newly created package? (CTRL+C to abort)"
echo "Please note that you should check it before!"
read
cd ..
#FIXME ARCH
#install the first matching package file, if any
for pkg in ${NAME}_${VERSION}-${REVISION}_*.pkg; do
	[ -f "$pkg" ] || break
	pkgr -i "$pkg"
	break
done

exit 0