_ RU.LINUX (2:5077/15.22) ___________________________________________ RU.LINUX _
From : Boris Tobotras 2:5020/400 26 Feb 98 14:56:18
Subj : /proc/sys/{vm,kernel} documented
________________________________________________________________________________
From: Boris Tobotras <tobotras@jet.msk.su>
Documentation for /proc/sys/*/* version 0.1
(c) 1998, Rik van Riel <H.H.vanRiel@fys.ruu.nl>
'Why', I hear you ask, 'would anyone even _want_ documentation
for them sysctl files? If anybody really needs it, it's all in
the source...'
Well, this documentation is written because some people either
don't know they need to tweak something, or because they don't
have the time or knowledge to read the source code.
Furthermore, the programmers who built sysctl have built it to
be actually used, not just for the fun of programming it :-)
Legal blurb:
As usual, there are two main things to consider:
1. you get what you pay for
2. it's free
The consequences are that I won't guarantee the correctness of
this document, and if you come to me complaining about how you
screwed up your system because of wrong documentation, I won't
feel sorry for you. I might even laugh at you...
But of course, if you _do_ manage to screw up your system using
only the sysctl options used in this file, I'd like to hear of
it. Not only to have a great laugh, but also to make sure that
you're the last RTFMing person to screw up.
In short, e-mail your suggestions, corrections and / or horror
stories to: <H.H.vanRiel@fys.ruu.nl>
Rik van Riel.
Introduction:
Sysctl is a means of configuring certain aspects of the kernel
at run-time, and the /proc/sys/ directory is there so that you
don't even need special tools to do it!
In fact, there are only four things needed to use these config
facilities:
- a running Linux system
- root access
- common sense (this is especially hard to come by these days)
- knowledge of what all those values mean
As a quick 'ls /proc/sys' will show, the directory consists of
several (arch-dependent?) subdirs. Each subdir is mainly about
one part of the kernel, so you can do configuration on a piece
by piece basis, or just some 'thematic frobbing'.
The subdirs are about:
  debug/    <empty>
  fs/       specific filesystems
            binfmt_misc <linux/Documentation/binfmt_misc.txt>
  kernel/   global kernel info / tuning
            open file / inode tuning
            miscellaneous stuff
  net/      networking stuff, for documentation look in:
            <linux/Documentation/networking/>
  proc/     <empty>
  vm/       memory management tuning
            buffer and cache management
These are the subdirs I have on my system. There might be more
or other subdirs in another setup. If you see another dir, I'd
really like to hear about it :-)
Documentation for /proc/sys/kernel/* version 0.1
(c) 1998, Rik van Riel <H.H.vanRiel@fys.ruu.nl>
For general info and legal blurb, please look in README.
This file contains documentation for the sysctl files in
/proc/sys/kernel/ and is valid for Linux kernel version 2.1.
The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
kernel. Since some of the files _can_ be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.
Currently, these files are in /proc/sys/kernel:
- ctrl-alt-del
- dentry-state
- domainname
- file-max
- file-nr
- hostname
- inode-max
- inode-nr
- inode-state
- osrelease
- ostype
- panic
- printk
- securelevel
- version
ctrl-alt-del:
When the value in this file is 0, ctrl-alt-del is trapped and
sent to the init(1) program to handle a graceful restart.
When, however, the value is > 0, Linux' reaction to a Vulcan
Nerve Pinch (tm) will be an immediate reboot, without even
syncing its dirty buffers.
Note: when a program (like dosemu) has the keyboard in 'raw'
mode, the ctrl-alt-del is intercepted by the program before it
ever reaches the kernel tty layer, and it's up to the program
to decide what to do with it.
dentry-state:
From linux/fs/dentry.c:
--------------------------------------------------------------
struct {
        int nr_dentry;
        int nr_unused;
        int age_limit;          /* age in seconds */
        int want_pages;         /* pages requested by system */
        int dummy[2];
} dentry_stat = {0, 0, 45, 0,};
--------------------------------------------------------------
Dentries are dynamically allocated and deallocated, and
nr_dentry seems to be 0 all the time. Hence it's safe to
assume that only nr_unused, age_limit and want_pages are
used. Nr_unused seems to be exactly what its name says.
Age_limit is the age in seconds after which dcache entries
can be reclaimed when memory is short, and want_pages is
nonzero when shrink_dcache_pages() has been called and the
dcache isn't pruned yet.
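To see how the numbers in dentry-state line up with the struct
above, here is a minimal Python sketch that splits the file
contents into named fields. The helper name parse_dentry_state
and the sample line are made up for illustration; only the
field order comes from the struct.

```python
# Field order taken from the dentry_stat struct quoted above.
DENTRY_FIELDS = ("nr_dentry", "nr_unused", "age_limit", "want_pages")


def parse_dentry_state(text):
    """Map the contents of dentry-state onto the named struct members.

    The two trailing dummy values are dropped.
    """
    values = [int(v) for v in text.split()]
    if len(values) < len(DENTRY_FIELDS):
        raise ValueError("expected at least 4 integers, got %d" % len(values))
    # zip() stops at the shorter sequence, so the dummies fall away.
    return dict(zip(DENTRY_FIELDS, values))


# Hypothetical sample contents, not taken from a real system.
sample = "0 1200 45 0 0 0"
state = parse_dentry_state(sample)
print(state["nr_unused"], state["age_limit"])
```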
domainname & hostname:
These files can be controlled to set the domainname and
hostname of your box. For the classic darkstar.frop.org
a simple:
# echo "darkstar" > /proc/sys/kernel/hostname
# echo "frop.org" > /proc/sys/kernel/domainname
would suffice to set your hostname and domainname.
file-max & file-nr:
The kernel allocates filehandles dynamically, but as yet it
doesn't free them again...
The value in file-max denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.
The three values in file-nr denote the number of allocated
file handles, the number of used file handles and the maximum
number of file handles. When the allocated filehandles come
close to the maximum, but the number of actually used ones is
far behind, you've encountered a peak in your filehandle usage
and you don't need to increase the maximum.
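The rule of thumb above can be written down as a small check.
This is only a sketch: the function names and the 90% threshold
are illustrative assumptions, not anything the kernel defines.

```python
def parse_file_nr(text):
    """Return (allocated, used, maximum) from the contents of file-nr."""
    allocated, used, maximum = (int(v) for v in text.split())
    return allocated, used, maximum


def should_raise_file_max(text, threshold=0.9):
    """True when the handles actually in use approach the maximum.

    A high 'allocated' count alone only records an earlier peak,
    so the decision looks at 'used'. The threshold is an assumed
    cut-off chosen for this example.
    """
    allocated, used, maximum = parse_file_nr(text)
    return used >= threshold * maximum


# Hypothetical samples: a past peak, and a genuinely busy system.
print(should_raise_file_max("1000 200 1024"))
print(should_raise_file_max("1024 1000 1024"))
```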
inode-max, inode-nr & inode-state:
As with filehandles, the kernel allocates the inode structures
dynamically, but can't free them yet...
The value in inode-max denotes the maximum number of inode
handlers. This value should be 3-4 times larger than the value
in file-max, since stdin, stdout and network sockets also
need an inode struct to handle them. When you regularly run
out of inodes, you need to increase this value.
The file inode-nr contains the first two items from
inode-state, so we'll skip to that file...
Inode-state contains three actual numbers and four dummies.
The actual numbers are, in order of appearance, nr_inodes,
nr_free_inodes and preshrink.
Nr_inodes stands for the number of inodes the system has
allocated. This can be slightly more than inode-max because
Linux allocates them one pageful at a time.
Nr_free_inodes represents the number of free inodes (?) and
preshrink is nonzero when the nr_inodes > inode-max and the
system needs to prune the inode list instead of allocating
more.
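As a sketch of how the three files relate, the Python below
pulls the meaningful values out of inode-state and derives what
inode-nr would show. The helper names and the sample numbers
are hypothetical.

```python
def parse_inode_state(text):
    """Return the three meaningful values of inode-state as a dict.

    The four trailing dummies are ignored.
    """
    values = [int(v) for v in text.split()]
    if len(values) < 3:
        raise ValueError("expected at least 3 integers")
    nr_inodes, nr_free_inodes, preshrink = values[:3]
    return {
        "nr_inodes": nr_inodes,
        "nr_free_inodes": nr_free_inodes,
        "preshrink": preshrink,
    }


def inode_nr(text):
    """inode-nr shows the first two items of inode-state."""
    state = parse_inode_state(text)
    return state["nr_inodes"], state["nr_free_inodes"]


def over_inode_max(state, inode_max):
    """True when more inodes are allocated than inode-max allows,
    the condition under which preshrink is described as nonzero."""
    return state["nr_inodes"] > inode_max


# Hypothetical sample: three real values followed by four dummies.
sample = "3100 250 1 0 0 0 0"
state = parse_inode_state(sample)
print(inode_nr(sample), over_inode_max(state, inode_max=3072))
```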
osrelease, ostype & version:
# cat osrelease
2.1.88
# cat ostype
Linux
# cat version
#5 Wed Feb 25 21:49:24 MET 1998
The files osrelease and ostype should be clear enough. Version
needs a little more clarification however. The '#5' means that
this is the fifth kernel built from this source base and the
date behind it indicates the time the kernel was built.
The only way to tune these values is to rebuild the kernel :-)
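If you want the build number and build date separately, the
version string is easy to take apart. A minimal Python sketch,
assuming the '#N date' layout shown above (parse_version is a
made-up helper name):

```python
import re


def parse_version(text):
    """Split '#5 Wed Feb 25 ...' into (build_number, build_date)."""
    match = re.match(r"#(\d+)\s+(.*)", text.strip())
    if match is None:
        raise ValueError("unexpected version format: %r" % text)
    return int(match.group(1)), match.group(2)


build, built_on = parse_version("#5 Wed Feb 25 21:49:24 MET 1998")
print(build, built_on)
```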
panic:
The value in this file represents the number of seconds the
kernel waits before rebooting on a panic. When you use the
software watchdog, the recommended setting is 60.
printk:
The four values in printk denote: console_loglevel,
default_message_loglevel, minimum_console_loglevel and
default_console_loglevel respectively.
These values have influence on printk() behaviour when
printing / logging error messages. See 'man 2 syslog'
for more info on the different loglevels.
- console_loglevel: messages with a higher priority than
this will be printed to the console
- default_message_loglevel: messages without an explicit priority
will be printed with this priority
- minimum_console_loglevel: minimum (highest) value to which
console_loglevel can be set
- default_console_loglevel: default value for console_loglevel
Note: a quick look in linux/kernel/printk.c will reveal that
these variables aren't put inside a structure, so their order
in-core isn't formally guaranteed and garbage values _might_
occur when the compiler changes. (???)
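The console rule is easy to get backwards, because a higher
priority means a lower number. The Python sketch below names
the four values and applies the rule; the helper names and the
sample line "6 4 1 7" are assumptions made for the example.

```python
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four numbers in the printk file onto their names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(PRINTK_FIELDS):
        raise ValueError("expected 4 integers, got %d" % len(values))
    return dict(zip(PRINTK_FIELDS, values))


def reaches_console(levels, msg_level=None):
    """True if a message of msg_level is printed to the console.

    Lower numbers mean higher priority. A message without an
    explicit priority gets default_message_loglevel.
    """
    if msg_level is None:
        msg_level = levels["default_message_loglevel"]
    return msg_level < levels["console_loglevel"]


# Hypothetical sample contents of /proc/sys/kernel/printk.
levels = parse_printk("6 4 1 7")
print(reaches_console(levels, 3), reaches_console(levels, 7))
```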
securelevel:
When the value in this file is nonzero, root is prohibited
from:
- changing the immutable and append-only flags on files
- changing sysctl things (limited ???)
real-root-dev: (CONFIG_INITRD only)
This file is used to configure the real root device when using
an initial ramdisk to configure the system before switching to
the 'real' root device. See linux/Documentation/initrd.txt for
more info.
reboot-cmd: (Sparc only)
??? This seems to be a way to give an argument to the Sparc
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
Documentation for /proc/sys/vm/* version 0.1
(c) 1998, Rik van Riel <H.H.vanRiel@fys.ruu.nl>
For general info and legal blurb, please look in README.
This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.1.
The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
one of the files (bdflush) also has a little influence on disk
usage.
Currently, these files are in /proc/sys/vm:
- bdflush
- freepages
- overcommit_memory
- swapctl
- swapout_interval
bdflush:
This file controls the operation of the bdflush kernel
daemon. The struct behind it can be found in
linux/fs/buffer.c. It currently contains 9 integer values,
of which 6 are actually used by the kernel.
From linux/fs/buffer.c:
--------------------------------------------------------------
union bdflush_param{
        struct {
                int nfract;     /* Percentage of buffer cache dirty to
                                   activate bdflush */
                int ndirty;     /* Maximum number of dirty blocks to
                                   write out per wake-cycle */
                int nrefill;    /* Number of clean buffers to try to
                                   obtain each time we call refill */
                int nref_dirt;  /* Dirty buffer threshold for activating
                                   bdflush when trying to refill buffers. */
                int dummy1;     /* unused */
                int age_buffer; /* Time for normal buffer to age before
                                   we flush it */
                int age_super;  /* Time for superblock to age before we
                                   flush it */
                int dummy2;     /* unused */
                int dummy3;     /* unused */
        } b_un;
        unsigned int data[N_PARAM];
} bdf_prm = {{40, 500, 64, 256, 15, 30*HZ, 5*HZ, 1884, 2}};
--------------------------------------------------------------
The first parameter (nfract) governs the maximum percentage of
dirty buffers in the buffer cache. Dirty means that the contents
of the buffer still have to be written to disk (as opposed
to a clean buffer, which can just be forgotten about).
Setting this to a high value means that Linux can delay disk
writes for a long time, but it also means that it will have
to do a lot of I/O at once when memory becomes short. A low
value will spread out disk I/O more evenly.
The second parameter (ndirty) gives the maximum number of
dirty buffers that bdflush can write to the disk at one time.
A high value will mean delayed, bursty I/O, while a small
value can lead to memory shortage when bdflush isn't woken
up often enough...
The third parameter (nrefill) is the number of buffers that
bdflush will add to the list of free buffers when
refill_freelist() is called. It is necessary to allocate free
buffers beforehand, since the buffers often are of a different
size than memory pages and some bookkeeping needs to be done
beforehand. The higher the number, the more memory will be
wasted and the less often refill_freelist() will need to run.
When refill_freelist() comes across more than nref_dirt dirty
buffers, it will wake up bdflush.
Finally, the age_buffer and age_super parameters govern the
maximum time Linux waits before writing out a dirty buffer
to disk. The value is expressed in jiffies (clock ticks); there
are 100 jiffies per second, except on Alpha machines (1024).
Age_buffer is the maximum age for data blocks, while
age_super is for filesystem metadata.
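Since the two age values are stored in jiffies, it helps to
convert them before reasoning about them. The Python sketch
below names the nine values and does the conversion. The helper
names are hypothetical, and the sample line is simply the
defaults from bdf_prm with HZ taken as 100.

```python
# Field order taken from the b_un struct quoted above.
BDFLUSH_FIELDS = (
    "nfract", "ndirty", "nrefill", "nref_dirt", "dummy1",
    "age_buffer", "age_super", "dummy2", "dummy3",
)


def parse_bdflush(text):
    """Map the nine numbers in the bdflush file onto their names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError("expected 9 integers, got %d" % len(values))
    return dict(zip(BDFLUSH_FIELDS, values))


def jiffies_to_seconds(jiffies, hz=100):
    """Convert clock ticks to seconds; pass hz=1024 for Alpha."""
    return jiffies / hz


# Defaults from bdf_prm, assuming HZ = 100 (30*HZ and 5*HZ).
params = parse_bdflush("40 500 64 256 15 3000 500 1884 2")
print(jiffies_to_seconds(params["age_buffer"]))
print(jiffies_to_seconds(params["age_super"]))
```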
freepages:
This file contains three values: min_free_pages, free_pages_low
and free_pages_high in order.
These numbers are used by the VM subsystem to keep a reasonable
number of pages on the free page list, so that programs can
allocate new pages without having to wait for the system to
free used pages first. The actual freeing of pages is done
by kswapd, a kernel daemon.
min_free_pages  -- when the number of free pages reaches this
                   level, only the kernel can allocate memory,
                   and only for _critical_ tasks
free_pages_low  -- when the number of free pages drops below
                   this level, kswapd is woken up immediately
free_pages_high -- this is kswapd's target; when more than
                   free_pages_high pages are free, kswapd will
                   stop swapping.
When the number of free pages is between free_pages_low and
free_pages_high, and kswapd hasn't run for swapout_interval
jiffies, then kswapd is woken up too. See swapout_interval
for more info.
When free memory is always low on your system, and kswapd has
trouble keeping up with allocations, you might want to
increase these values, especially free_pages_high and perhaps
free_pages_low. I've found that a 1:2:4 relation for these
values tends to work rather well in a heavily loaded system.
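The wake-up rules and the 1:2:4 relation can be sketched in a
few lines of Python. This is a reading of the description
above, not the kernel's code: the function names are made up,
and treating a free page count exactly equal to a threshold as
"between" is an assumption.

```python
def suggest_freepages(min_free_pages):
    """Return (min_free_pages, free_pages_low, free_pages_high)
    in the 1:2:4 relation."""
    return min_free_pages, 2 * min_free_pages, 4 * min_free_pages


def kswapd_should_wake(nr_free, low, high,
                       jiffies_since_run, swapout_interval):
    """Decide whether kswapd gets woken up.

    - below free_pages_low: wake immediately
    - above free_pages_high: target reached, no wake-up
    - in between: wake only if kswapd hasn't run for
      swapout_interval jiffies
    """
    if nr_free < low:
        return True
    if nr_free > high:
        return False
    return jiffies_since_run >= swapout_interval


# Hypothetical numbers: min_free_pages of 48, interval of 25 jiffies.
minimum, low, high = suggest_freepages(48)
print(kswapd_should_wake(50, low, high, 0, 25))
print(kswapd_should_wake(120, low, high, 10, 25))
```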
overcommit_memory:
This file contains only one value. The following algorithm
is used to decide if there's enough memory. If the value
of overcommit_memory > 0, then there's always enough
memory :-). This is a useful feature, since programs often
malloc() huge amounts of memory 'just in case', while they
only use a small part of it. Leaving this value at 0 will
lead to the failure of such a huge malloc(), when in fact
the system has enough memory for the program to run...
On the other hand, enabling this feature can cause you to
run out of memory and thrash the system to death, so large
and/or important servers will want to set this value to 0.
From linux/mm/mmap.c:
--------------------------------------------------------------
static inline int vm_enough_memory(long pages)
{
        /* Stupid algorithm to decide if we have enough memory: while
         * simple, it hopefully works in most obvious cases.. Easy to
         * fool it, but this should catch most mistakes.
         */
        long freepages;

        /* Sometimes we want to use more memory than we have. */
        if (sysctl_overcommit_memory)
                return 1;
        [...]
--------------------------------------------------------------
swapctl:
This file contains no fewer than 16 variables, of which about
half are actually used :-) In the listing below, the unused
variables are marked as such.
All of these values are used by kswapd, and the usage can be
found in linux/mm/vmscan.c.
From linux/include/linux/swapctl.h:
--------------------------------------------------------------
typedef struct swap_control_v5
{
        unsigned int    sc_max_page_age;
        unsigned int    sc_page_advance;
        unsigned int    sc_page_decline;
        unsigned int    sc_page_initial_age;
        unsigned int    sc_max_buff_age;        /* unused */
        unsigned int    sc_buff_advance;        /* unused */
        unsigned int    sc_buff_decline;        /* unused */
        unsigned int    sc_buff_initial_age;    /* unused */
        unsigned int    sc_age_cluster_fract;
        unsigned int    sc_age_cluster_min;
        unsigned int    sc_pageout_weight;
        unsigned int    sc_bufferout_weight;
        unsigned int    sc_buffer_grace;        /* unused */
        unsigned int    sc_nr_buffs_to_free;    /* unused */
        unsigned int    sc_nr_pages_to_free;    /* unused */
        enum RCL_POLICY sc_policy;              /* RCL_PERSIST hardcoded */
} swap_control_v5;
--------------------------------------------------------------
The first four variables are used to keep track of Linux'
page aging. Page aging is a bookkeeping method to keep track
of which pages of memory are used often, and which pages can
be swapped out without consequences.
When a page is swapped in, it starts at sc_page_initial_age
(default 3) and when the page is scanned by kswapd, its age
is adjusted according to the following scheme:
- if the page was used since the last time we scanned, its
  age is increased by sc_page_advance (default 3) up to a
  maximum of sc_max_page_age (default 20)
- else (it wasn't used) its age is decreased by sc_page_decline
  (default 1)
And when a page reaches age 0, it's ready to be swapped out.
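The aging scheme is small enough to simulate. The Python
sketch below uses the defaults quoted above; the function names
are made up, and clamping the age at 0 is an assumption, since
the text only says a page at age 0 is ready to go.

```python
def age_page(age, referenced, advance=3, decline=1, max_age=20):
    """Return the page's new age after one kswapd scan."""
    if referenced:
        # Used since the last scan: age goes up, capped at max_age.
        return min(age + advance, max_age)
    # Not used: age goes down, never below 0 (assumed clamp).
    return max(age - decline, 0)


def simulate(history, initial_age=3):
    """Apply one scan per entry in history (True means the page
    was used since the previous scan); return the age after each."""
    ages = []
    age = initial_age
    for referenced in history:
        age = age_page(age, referenced)
        ages.append(age)
    return ages


# A page that is touched once after swap-in and then left alone.
print(simulate([True, False, False, False, False, False, False]))
```

A freshly swapped-in page that is never touched again reaches
age 0 after three scans, which is what the defaults 3 and 1
imply.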
The variables sc_age_cluster_fract through sc_bufferout_weight
have to do with the amount of scanning kswapd is doing on
each call to try_to_swap_out().
sc_age_cluster_fract is used to calculate how many pages from
a process are to be scanned by kswapd. The formula used is
sc_age_cluster_fract/1024 * RSS, so if you want kswapd to scan
the whole process, sc_age_cluster_fract needs to have a value
of 1024. The minimum number of pages kswapd will scan is
represented by sc_age_cluster_min, this is done so kswapd will
also scan small processes.
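Worked through in Python, the formula looks like this. The
function name and the sample values are hypothetical, and
combining the two settings with max() is one reading of the
description, not a copy of the code in vmscan.c.

```python
def pages_to_scan(rss, age_cluster_fract, age_cluster_min):
    """Number of pages of a process that kswapd would scan.

    rss is the resident set size in pages. The fraction is
    age_cluster_fract/1024, with age_cluster_min as a floor so
    that small processes get scanned too.
    """
    by_fraction = age_cluster_fract * rss // 1024
    return max(by_fraction, age_cluster_min)


# Hypothetical settings, not the kernel defaults.
print(pages_to_scan(4096, 1024, 32))   # whole process
print(pages_to_scan(4096, 256, 32))    # a quarter of it
print(pages_to_scan(10, 256, 32))      # small process: floor applies
```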
The values of sc_pageout_weight and sc_bufferout_weight are
used to control how many tries kswapd will make in order
to swap out one page / buffer. As with sc_age_cluster_fract,
the actual value is calculated by several more or less complex
formulae and the default value is good for every purpose.
swapout_interval:
The single value in this file controls the amount of time
between successive wakeups of kswapd when nr_free_pages is
between free_pages_low and free_pages_high. The default value
of HZ/4 is usually right, but when kswapd can't keep up with
the number of allocations in your system, you might want to
decrease this number.