A large number of kernel function are time-driven, as opposed to even-driven.
The hardware provides a system timer
that the kernel uses to gauge the passing of time. The system timer goes off at a preprogrammed frequency, called the tick rate
. When the system timer goes off, it issues an interrupt that the kernel handles via a special interrupt handler. The period is called a tick
and is equal to 1/(tick rate)
seconds.
Some of the work executed periodically by the timer interrupt:
- Updating the system uptime
- Updating the time of day
- On an SMP system, ensuring that the scheduler runqueues are balanced and, if not, balancing them
- Running any dynamic timers that have expired
- Updating resource usage and processor time statistics
The frequency of the system timer is programmed on system boot based on a static preprocessor define, HZ
.
The kernel defines the value in <asm/param.h>
. The tick rate has a frequency of HZ hertz and a period of 1/HZ seconds.
- Kernel timers execute with finer resolution and increased accuracy.
- System calls such as
poll()
andselect()
that optionally employ a timeout value execute with improved precision. - Measurements, such as resource usage or the system uptime, are recorded with a finer resolution.
- Process preemption occurs more accurately.
A higher tick rate implies more frequent timer interrupts, which implies higher overhead, because the processor must spend more time executing the timer interrupt handler.
The global variable jiffies
holds the number of ticks that have occurred since the system booted. The kernel initializes jiffies
to a special initial value, causing the variable to overflow more often, catching bugs. When the actual value of jiffies
is sought, this “offset” is first subtracted.
The jiffies
variable has always been an unsigned long
, and therefore 32 bits in size on 32-bit architectures and 64-bits on 64-bit architectures.
jiffies is defined as an unsigned long in <linux/jiffies.h>
:
extern unsigned long volatile jiffies;
extern u64 jiffies_64;
The ld(1)
script used to link the main kernel image then overlays
the jiffies variable over the start of the jiffies_64
variable:
jiffies = jiffies_64;
Thus, jiffies
is the lower 32 bits of the full 64-bit jiffies_64
variable. Because most code uses jiffies
simply to measure elapses in time, most code cares about only the lower 32 bits. The time management code uses the entire 64 bits, however, and thus prevents overflow of the full 64-bit value.
The jiffies variable, experiences overflow when its value is increased beyond its maximum storage limit. When the tick count is equal to this maximum and it is incremented, it wraps around to zero.
The kernel provides four macros for comparing tick counts that correctly handle wraparound in the tick count. They are in <linux/jiffies.h>
. Listed here are
simplified versions of the macros:
#define time_after(unknown, known) ((long)(known) - (long)(unknown) < 0)
#define time_before(unknown, known) ((long)(unknown) - (long)(known) < 0)
#define time_after_eq(unknown, known) ((long)(unknown) - (long)(known) >= 0)
#define time_before_eq(unknown, known) ((long)(known) - (long)(unknown) >= 0)
The unknown parameter is typically jiffies and the known parameter is the value against which you want to compare.
The kernel defined USER_HZ
, which is the HZ
value that user-space expects. The function jiffies_to_clock_t()
, defined in kernel/time.c
, is then used to scale a tick count in terms of HZ
to a tick count in terms of USER_HZ
. The function jiffies_64_to_clock_t()
is provided to couvert a 64-bit jiffies value from HZ
to USER_HZ
units.
The real-time clock (RTC) provides a nonvolatile device for storing the system time. Its primary importance is only during boot, when the xtime
variable is initialized.
The system timer is to provide a mechanism for driving an interrupt at a periodic rate. On x86, the primary system timer is the programmable interrupt timer (PIT).
The timer interrupt is broken into two pieces: an architecture-dependent and an architecture-independent routine.
The architecture-dependent routine is registered as the interrupt handler for the system timer and, thus, runs when the timer interrupt hits.
Most handlers perform at least the following work:
- Obtain the
xtime_lock
lock, which protects access tojiffies_64
and the wall time value,xtime
. - Acknowledge or reset the system timer as required.
- Periodically save the updated wall time to the real time clock.
- Call the architecture-independent timer routine,
tick_periodic()
.
The architecture-independent routine, tick_periodic()
, performs much more work:
- Increment the
jiffies_64
count by one. - Update resource usages, such as consumed system and user time, for the currently running process.
- Run any dynamic timers that have expired.
- Execute
scheduler_tick()
. - Update the wall time, which is stored in
xtime
. - Calculate the infamous load average.
static void tick_periodic(int cpu)
{
if (tick_do_timer_cpu == cpu) {
write_seqlock(&xtime_lock);
/* Keep track of the next tick event */
tick_next_period = ktime_add(tick_next_period, tick_period);
do_timer(1);
write_sequnlock(&xtime_lock);
}
update_process_times(user_mode(get_irq_regs()));
profile_tick(CPU_PROFILING);
}
do_timer()
is responsible for actually performing the increment to jiffies_64
:
void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
calc_global_load();
}
The function update_wall_time()
updates the wall time in accordance with the elapsed ticks, whereas calc_global_load()
updates the system’s load average statistics.
When do_timer()
ultimately returns, update_process_times()
is invoked to update various statistics that a tick has elapsed, noting via user_tick
whether it occurred in user-space or kernel-space:
void update_process_times(int user_tick)
{
struct task_struct *p = current;
int cpu = smp_processor_id();
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
printk_tick();
scheduler_tick();
run_posix_cpu_timers(p);
}
The account_process_tick()
function does the actual updating of the process’s times:
void account_process_tick(struct task_struct *p, int user_tick)
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
struct rq *rq = this_rq();
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
one_jiffy_scaled);
else
account_idle_time(cputime_one_jiffy);
}
Next, the run_local_timers()
function marks a softirq to handle the execution of any expired timers.
Finally, the scheduler_tick()
function decrements the currently running process’s timeslice and sets need_resched
if needed.
The current time of day (the wall time) is defined in kernel/time/timekeeping.c
:
struct timespec xtime;
The timespec data structure is defined in <linux/time.h>
as:
struct timespec {
__kernel_time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
Reading or writing the xtime
variable requires the xtime_lock
lock, which is not a normal spinlock but a seqlock.
write_seqlock(&xtime_lock);
/* update xtime ... */
write_sequnlock(&xtime_lock);
Reading xtime
requires the use of the read_seqbegin()
and read_seqretry()
functions:
unsigned long seq;
do {
unsigned long lost;
seq = read_seqbegin(&xtime_lock);
usec = timer->get_offset();
lost = jiffies - wall_jiffies;
if (lost)
usec += lost * (1000000 / HZ);
sec = xtime.tv_sec;
usec += (xtime.tv_nsec / 1000);
} while (read_seqretry(&xtime_lock, seq));
This loop repeats until the reader is assured that it read the data without an intervening write. If the timer interrupt occurred and updated xtime during the loop, the returned sequence number is invalid and the loop repeats.
The primary user-space interface for retrieving the wall time is gettimeofday()
, which is implemented as sys_gettimeofday()
in kernel/time.c
:
asmlinkage long sys_gettimeofday(struct timeval *tv, struct timezone *tz)
{
if (likely(tv)) {
struct timeval ktv;
do_gettimeofday(&ktv);
if (copy_to_user(tv, &ktv, sizeof(ktv)))
return -EFAULT;
}
if (unlikely(tz)) {
if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
return -EFAULT;
}
return 0;
}
Timers is a tool for delaying work a specified amount of time.
Timers are represented by struct timer_list
, which is defined in <linux/timer.h>
:
struct timer_list {
struct list_head entry; /* entry in linked list of timers */
unsigned long expires; /* expiration value, in jiffies */
void (*function)(unsigned long); /* the timer handler function */
unsigned long data; /* lone argument to the handler */
struct tvec_t_base_s *base; /* internal timer field, do not touch */
};
The kernel provides a family of timer-related interfaces to make timer management easy. Everything is declared in <linux/timer.h>
. Most of the actual implementation is in kernel/timer.c
.
The first step in creating a timer is defining it:
struct timer_list my_timer;
Next, the timer’s internal values must be initialized. This is done via a helper function and must be done prior to calling any timer management functions on the timer:
init_timer(&my_timer);
Now you fill out the remaining values as required:
my_timer.expires = jiffies + delay; /* timer expires in delay ticks */
my_timer.data = 0; /* zero is passed to the timer handler */
my_timer.function = my_function; /* function to run when timer expires */
As you can see from the timer_list definition
, the function must match this prototype:
void my_timer_function(unsigned long data);
The data parameter enables you to register multiple timers with the same handler, and differentiate between them via the argument. If you do not need the argument, you can simply pass zero.
Finally, you activate the timer:
add_timer(&my_timer);
Typically, timers are run fairly close to their expiration; however, they might be delayed until the first timer tick after their expiration. Consequently, timers cannot be used to implement any sort of hard real-time processing.
The kernel implements a function, mod_timer()
, which changes the expiration of a given timer:
mod_timer(&my_timer, jiffies + new_delay); /* new expiration */
The mod_timer()
function can operate on timers that are initialized but not active, too. If the timer is inactive, mod_timer()
activates it. The function returns zero if the timer were inactive and one if the timer were active. In either case, upon return from mod_timer()
, the timer is activated and set to the new expiration.
If you need to deactivate a timer prior to its expiration, use the del_timer()
function:
del_timer(&my_timer);
The function works on both active and inactive timers. If the timer is already inactive, the function returns zero; otherwise, the function returns one. Note that you do not need to call this for timers that have expired because they are automatically deactivated.
A potential race condition that must be guarded against exists when deleting timers. When del_timer()
returns, it guarantees only that the timer is no longer active. On a multiprocessing machine, however, the timer handler might already be executing on another processor. To deactivate the timer and wait until a potentially executing handler for the timer exits, use del_timer_sync()
:
del_timer_sync(&my_timer);
Unlike del_timer()
, del_timer_sync()
cannot be used from interrupt context. In almost all cases, you should use del_timer_sync()
over del_timer()
.
After the timer interrupt, the timer interrupt handler runs update_process_times()
, which calls run_local_timers()
:
void run_local_timers(void)
{
hrtimer_run_queues();
raise_softirq(TIMER_SOFTIRQ); /* raise the timer softirq */
softlockup_tick();
}
The TIMER_SOFTIRQ
softirq is handled by run_timer_softirq()
. This function runs all the expired timers on the current processor.
The kernel partitions timers into five groups based on their expiration value for efficience.
Busy looping
onlyy used when the time you want to delay is some integer multiple of the tick rate on precision is not important.
unsigned long timeout = jiffies + 10; /* ten ticks */
while (time_before(jiffies, timeout))
;
The loop continues until jiffies is larger than delay , which occurs only after 10 clock ticks have passed.
A better solution would be to reschedule your process to allow the processor to accomplish other work while your code waits:
unsigned long delay = jiffies + 5*HZ;
while (time_before(jiffies, delay))
cond_resched();
The call to cond_resched()
schedules a new process, but only if need_resched is set. Note that because this approach invokes the scheduler, you can only make use of it from process context.
The kernel provides three functions for microsecond, nanosecond, and millisecond delays, defined in <linux/delay.h>
and <asm/delay.h>
, which do not use jiffies
:
void udelay(unsigned long usecs)
void ndelay(unsigned long nsecs)
void mdelay(unsigned long msecs)
A more optimal method of delaying execution is to use schedule_timeout()
. This call puts your task to sleep until at least the specified time has elapsed.
/* set task’s state to interruptible sleep */
set_current_state(TASK_INTERRUPTIBLE);
/* take a nap and wake up in “s” seconds */
schedule_timeout(s * HZ);
If the code does not want to process signals, you can use TASK_UNINTERRUPTIBLE
instead. The task must be in one of these two states before schedule_timeout()
is called or else the task will not go to sleep.
signed long schedule_timeout(signed long timeout)
{
timer_t timer;
unsigned long expire;
switch (timeout)
{
case MAX_SCHEDULE_TIMEOUT:
schedule();
goto out;
default:
if (timeout < 0)
{
printk(KERN_ERR “schedule_timeout: wrong timeout “
“value %lx from %p\n”, timeout,
__builtin_return_address(0));
current->state = TASK_RUNNING;
goto out;
}
}
expire = timeout + jiffies;
init_timer(&timer);
timer.expires = expire;
timer.data = (unsigned long) current;
timer.function = process_timeout;
add_timer(&timer);
schedule();
del_timer_sync(&timer);
timeout = expire - jiffies;
out:
return timeout < 0 ? 0 : timeout;
}
When the timer expires, it runs process_timeout()
:
void process_timeout(unsigned long data)
{
wake_up_process((task_t *) data);
}
This function puts the task in the TASK_RUNNING
state and places it back on the runqueue.
Sometimes it is desirable to wait for a specific event or wait for a specified time to elapse. In those cases, code might simply call schedule_timeout()
instead of schedule() after placing itself on a wait queue.