TC-HFSC(7) - phpMan

TC-HFSC(7)                           Linux                          TC-HFSC(7)
NAME
       tc-hfcs - Hierarchical Fair Service Curve
HISTORY & INTRODUCTION
       HFSC  (Hierarchical  Fair Service Curve) is a network packet scheduling
       algorithm that was first presented at SIGCOMM'97. Developed as  a  part
       of ALTQ (ALTernative Queuing) on NetBSD, found its way quickly to other
       BSD systems, and then a few years ago became part of the linux  kernel.
       Still,  it's  not the most popular scheduling algorithm - especially if
       compared to HTB - and it's not well documented for  the  enduser.  This
       introduction aims to explain how HFSC works without using too much math
       (although some math it will be inevitable).
       In short HFSC aims to:
           1)  guarantee precise bandwidth and delay allocation for  all  leaf
               classes (realtime criterion)
           2)  allocate  excess bandwidth fairly as specified by class hierar-
               chy (linkshare & upperlimit criterion)
           3)  minimize any discrepancy between  the  service  curve  and  the
               actual amount of service provided during linksharing
       The  main  "selling" point of HFSC is feature (1), which is achieved by
       using nonlinear service curves (more about what it actually is  later).
       This  is particularly useful in VoIP or games, where not only a guaran-
       tee of consistent bandwidth is important, but also limiting the initial
       delay  of  a  data  stream.  Note that it matters only for leaf classes
       (where the actual queues are) - thus class hierarchy is ignored in  the
       realtime case.
       Feature  (2) is well, obvious - any algorithm featuring class hierarchy
       (such as HTB or CBQ) strives to achieve  that.  HFSC  does  that  well,
       although  you  might end with unusual situations, if you define service
       curves carelessly - see section CORNER CASES for examples.
       Feature (3) is mentioned due to the nature of the problem. There may be
       situations  where  it's either not possible to guarantee service of all
       curves at the same time, and/or it's impossible to do so  fairly.  Both
       will  be  explained later. Note that this is mainly related to interior
       (aka aggregate) classes, as the  leafs  are  already  handled  by  (1).
       Still,  it's perfectly possible to create a leaf class without realtime
       service, and in such a case the caveats will naturally extend  to  leaf
       classes as well.
ABBREVIATIONS
       For the remaining part of the document, we'll use following shortcuts:
           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve
BASICS OF HFSC
       To  understand how HFSC works, we must first introduce a service curve.
       Overall, it's a nondecreasing function of some time unit, returning the
       amount of service (an allowed or allocated amount of bandwidth) at some
       specific point in time. The purpose  of  it  should  be  subconsciously
       obvious:  if  a  class was allowed to transfer not less than the amount
       specified by its service curve, then the service curve is not violated.
       Still, we need more elaborate criterion than just the  above  (although
       in the most generic case it can be reduced to it). The criterion has to
       take two things into account:
           o   idling periods
           o   the ability to "look back", so if during current active  period
               the  service  curve  is  violated,  maybe  it isn't if we count
               excess bandwidth received during earlier active period(s)
       Let's define the criterion as follows:
           (1) For each t1, there must exist t0 in set B, so S(t1-t0) <= w(t0,t1)
       Here 'w' denotes the amount of service received during some time period
       between  t0  and  t1.  B is a set of all times, where a session becomes
       active after idling period (further denoted as 'becoming  backlogged').
       For a clearer picture, imagine two situations:
           a)  our  session  was  active during two periods, with a small time
               gap between them
           b)  as in (a), but with a larger gap
       Consider (a): if the service received during both  periods  meets  (1),
       then  all  is well. But what if it doesn't do so during the 2nd period?
       If the amount of service received during the 1st period is larger  than
       the  service curve, then it might compensate for smaller service during
       the 2nd period and the gap - if the gap is small enough.
       If the gap is larger (b) - then it's less likely to happen (unless  the
       excess  bandwidth  allocated  during  the  1st  part was really large).
       Still, the larger the gap - the less interesting is  what  happened  in
       the  past  (e.g.  10 minutes ago) - what matters is the current traffic
       that just started.
       From HFSC's perspective, more interesting is  answering  the  following
       question: when should we start transferring packets, so a service curve
       of a class is not violated. Or rephrasing it: How much  X()  amount  of
       service should a session receive by time t, so the service curve is not
       violated. Function X() defined as below is the basic building block  of
       HFSC, used in: eligible, deadline, virtual-time and fit-time curves. Of
       course, X() is based on equation (1) and is defined recursively:
           o   At the 1st backlogged period beginning function X  is  initial-
               ized to generic service curve assigned to a class
           o   At any subsequent backlogged period, X() is:
               min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
               ...  where  t0  denotes the beginning of the current backlogged
               period.
       HFSC uses either linear, or two-piece linear service curves. In case of
       linear  or  two-piece  linear  convex  functions  (first slope < second
       slope), min() in X's definition reduces to the  2nd  argument.  But  in
       case  of  two-piece  concave  functions, the 1st argument might quickly
       become lesser for some t>=t0. Note, that for  some  backlogged  period,
       X()  is  defined  only  from  that  period's  beginning. We also define
       X^(-1)(w) as smallest t>=t0, for which X(t) = w. We have to  define  it
       this way, as X() is usually not an injection.
       The above generic X() can be one of the following:
           E() In realtime criterion, selects packets eligible for sending. If
               none are eligible, HFSC will use linkshare criterion.  Eligible
               time  'et'  is  calculated  with  reference to packets' heads (
               et = E^(-1)(w) ). It's based on RT service curve, but  in  case
               of a convex curve, uses its 2nd slope only.
           D() In  realtime  criterion,  selects the most suitable packet from
               the ones chosen by E(). Deadline time 'dt' corresponds to pack-
               ets'  tails  (dt  = D^(-1)(w+l), where 'l' is packet's length).
               Based on RT service curve.
           V() In linkshare criterion, arbitrates which packet to  send  next.
               Note  that  V()  is  function of a virtual time - see LINKSHARE
               CRITERION section for details. Virtual time 'vt' corresponds to
               packets' heads (vt = V^(-1)(w)). Based on LS service curve.
           F() An  extension  to  linkshare  criterion, used to limit at which
               speed linkshare criterion is allowed to dequeue. Fit-time  'ft'
               corresponds  to  packets' heads as well (ft = F^(-1)(w)). Based
               on UL service curve.
       Be sure to make clean distinction between session's RT, LS and UL  ser-
       vice curves and the above "utility" functions.
REALTIME CRITERION
       RT  criterion  ignores class hierarchy and guarantees precise bandwidth
       and delay allocation. We say that a packet  is  eligible  for  sending,
       when  the  current  real  time  is  later than the eligible time of the
       packet. From all eligible packets, the one most suited for  sending  is
       the  one  with the shortest deadline time. This sounds simple, but con-
       sider the following example:
       Interface 10Mbit, two  classes,  both  with  two-piece  linear  service
       curves:
           o   1st  class  - 2Mbit for 100ms, then 7Mbit (convex - 1st slope <
               2nd slope)
           o   2nd class - 7Mbit for 100ms, then 2Mbit (concave - 1st slope  >
               2nd slope)
       Assume  for  a  moment,  that we only use D() for both finding eligible
       packets, and choosing the most fitting one, thus eligible time would be
       computed   as   D^(-1)(w)  and  deadline  time  would  be  computed  as
       D^(-1)(w+l). If the 2nd class starts sending packets 1 second after the
       1st class, it's of course impossible to guarantee 14Mbit, as the inter-
       face capability is only 10Mbit.  The only workaround in  this  scenario
       is  to  allow the 1st class to send the packets earlier that would nor-
       mally be allowed. That's where separate E() comes to help. Putting  all
       the math aside (see HFSC paper for details), E() for RT concave service
       curve is just like D(), but for the RT convex service curve - it's con-
       structed using only RT service curve's 2nd slope (in our example
        7Mbit).
       The  effect of such E() - packets will be sent earlier, and at the same
       time D() will be updated - so the current deadline time calculated from
       it  will  be  later.  Thus,  when  the 2nd class starts sending packets
       later, both the 1st and the 2nd class will be  eligible,  but  the  2nd
       session's  deadline  time  will be smaller and its packets will be sent
       first. When the 1st class becomes idle at some  later  point,  the  2nd
       class  will be able to "buffer" up again for later active period of the
       1st class.
       A short remark - in a situation, where the total  amount  of  bandwidth
       available  on the interface is larger than the allocated total realtime
       parts (imagine a 10 Mbit interface,  but  1Mbit/2Mbit  and  2Mbit/1Mbit
       classes),  the  sole  speed of the interface could suffice to guarantee
       the times.
       Important part of RT criterion is that apart from updating its D()  and
       E(),  also V() used by LS criterion is updated. Generally the RT crite-
       rion is secondary to LS one, and used only if there's a risk of violat-
       ing  precise realtime requirements. Still, the "participation" in band-
       width distributed by LS criterion is there, so V() has  to  be  updated
       along  the way. LS criterion can than properly compensate for non-ideal
       fair sharing situation, caused by RT scheduling. If you use UL  service
       curve its F() will be updated as well (UL service curve is an extension
       to LS one - see UPPERLIMIT CRITERION section).
       Anyway - careless specification of LS and RT service curves can lead to
       potentially  undesired situations (see CORNER CASES for examples). This
       wasn't the case in HFSC paper where LS and RT service  curves  couldn't
       be specified separately.
LINKSHARING CRITERION
       LS  criterion's  task is to distribute bandwidth according to specified
       class hierarchy. Contrary to  RT  criterion,  there're  no  comparisons
       between  current  real  time  and  virtual time - the decision is based
       solely on direct comparison of virtual times of all active subclasses -
       the  one  with  the  smallest vt wins and gets scheduled. One immediate
       conclusion from this fact is that absolute values don't matter  -  only
       ratios  between  them (so for example, two children classes with simple
       linear 1Mbit service curves will get the same treatment from LS  crite-
       rion's  perspective,  as  if they were 5Mbit). The other conclusion is,
       that in perfectly fluid system with linear curves,  all  virtual  times
       across whole class hierarchy would be equal.
       Why is VC defined in term of virtual time (and what is it)?
       Imagine  an  example:  class A with two children - A1 and A2, both with
       let's say 10Mbit SCs. If A2 is idle, A1 receives all the bandwidth of A
       (and  update its V() in the process). When A2 becomes active, A1's vir-
       tual time is already far later than A2's one. Considering the  type  of
       decision made by LS criterion, A1 would become idle for a long time. We
       can workaround this situation by adjusting virtual time  of  the  class
       becoming  active  -  we do that by getting such time "up to date". HFSC
       uses a mean of the smallest and the biggest virtual time  of  currently
       active children fit for sending. As it's not real time anymore (exclud-
       ing trivial case of situation where all classes become  active  at  the
       same time, and never become idle), it's called virtual time.
       Such  approach  has  its price though. The problem is analogous to what
       was presented in previous section and is  caused  by  non-linearity  of
       service curves:
       1)  either  it's  impossible  to  guarantee  service curves and satisfy
           fairness during certain time periods:
           Recall the example from RT section, slightly modified  (with  3Mbit
           slopes instead of 2Mbit ones):
           o   1st  class  - 3Mbit for 100ms, then 7Mbit (convex - 1st slope <
               2nd slope)
           o   2nd class - 7Mbit for 100ms, then 3Mbit (concave - 1st slope  >
               2nd slope)
           They  sum up nicely to 10Mbit - the interface's capacity. But if we
           wanted to only use LS for guarantees and fairness - it simply won't
           work.  In  LS  context,  only V() is used for making decision which
           class to schedule. If the 2nd class becomes active when the 1st one
           is in its second slope, the fairness will be preserved - ratio will
           be 1:1 (7Mbit:7Mbit), but LS itself is of course unable to  guaran-
           tee  the absolute values themselves - as it would have to go beyond
           of what the interface is capable of.
       2)  and/or it's impossible to guarantee service curves of  all  classes
           at the same time [fairly or not]:
           This  is  similar to the above case, but a bit more subtle. We will
           consider two subtrees, arbitrated by their common (root here)  par-
           ent:
           R (root) - 10Mbit
           A  - 7Mbit, then 3Mbit
           A1 - 5Mbit, then 2Mbit
           A2 - 2Mbit, then 1Mbit
           B  - 3Mbit, then 7Mbit
           R arbitrates between left subtree (A) and right (B). Assume that A2
           and B are constantly backlogged, and at some later point A1 becomes
           backlogged (when all other classes are in their 2nd linear part).
           What  happens now? B (choice made by R) will always get 7 Mbit as R
           is only (obviously) concerned with the  ratio  between  its  direct
           children.  Thus  A  subtree gets 3Mbit, but its children would want
           (at the point when A1 became backlogged) 5Mbit + 1Mbit.  That's  of
           course impossible, as they can only get 3Mbit due to interface lim-
           itation.
           In the left subtree - we have  the  same  situation  as  previously
           (fair split between A1 and A2, but violated guarantees), but in the
           whole tree - there's no fairness (B got 7Mbit, but A1 and  A2  have
           to fit together in 3Mbit) and there's no guarantees for all classes
           (only B got what it wanted). Even if we violated fairness in the  A
           subtree and set A2's service curve to 0, A1 would still not get the
           required bandwidth.
UPPERLIMIT CRITERION
       UL criterion is an extensions to LS one, that permits  sending  packets
       only  if  current real time is later than fit-time ('ft'). So the modi-
       fied LS criterion becomes: choose the smallest virtual  time  from  all
       active  children,  such  that  fit-time < current real time also holds.
       Fit-time is calculated from F(), which is based on UL service curve. As
       you  can  see,  its  role is kinda similar to E() used in RT criterion.
       Also, for obvious reasons - you can't specify UL service curve  without
       LS one.
       The  main purpose of the UL service curve is to limit HFSC to bandwidth
       available on the upstream router (think  adsl  home  modem/router,  and
       linux server as NAT/firewall/etc. with 100Mbit+ connection to mentioned
       modem/router).  Typically, it's used to create a single class  directly
       under  root, setting a linear UL service curve to available bandwidth -
       and then creating your class structure from that  class  downwards.  Of
       course,  you're  free  to add a UL service curve (linear or not) to any
       class with LS criterion.
       An important part about the UL service curve is that whenever  at  some
       point  in  time  a  class  doesn't  qualify  for linksharing due to its
       fit-time, the next time it does qualify it will update its virtual time
       to  the  smallest virtual time of all active children fit for linkshar-
       ing. This way, one of the main things the LS criterion tries to achieve
       -  equality  of all virtual times across whole hierarchy - is preserved
       (in perfectly fluid system with only linear curves, all  virtual  times
       would be equal).
       Without  that,  'vt'  would  lag  behind other virtual times, and could
       cause problems. Consider an interface with a capacity  of  10Mbit,  and
       the  following  leaf  classes  (just  in case you're skipping this text
       quickly - this example shows behavior that doesn't happen):
       A - ls 5.0Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit, ul 2.5Mbit
       If B was idle, while A and C were constantly backlogged, A and C  would
       normally  (as far as LS criterion is concerned) divide bandwidth in 2:1
       ratio. But due to UL service curve  in  place,  C  would  get  at  most
       2.5Mbit,  and  A  would get the remaining 7.5Mbit. The longer the back-
       logged period, the more the virtual times of A and C would drift apart.
       If  B  became  backlogged at some later point in time, its virtual time
       would be set to (A's vt + C's vt)/2, thus blocking A from  sending  any
       traffic until B's virtual time catches up with A.
SEPARATE LS / RT SCs
       Another  difference  from the original HFSC paper is that RT and LS SCs
       can be specified separately. Moreover, leaf classes are allowed to have
       only  either  RT  SC  or  LS SC. For interior classes, only LS SCs make
       sense: any RT SC will be ignored.
CORNER CASES
       Separate service curves for LS and RT  criteria  can  lead  to  certain
       traps  that come from "fighting" between ideal linksharing and enforced
       realtime guarantees. Those situations didn't  exist  in  original  HFSC
       paper,  where  specifying  separate LS / RT service curves was not dis-
       cussed.
       Consider an interface with a 10Mbit capacity, with the  following  leaf
       classes:
       A - ls 5.0Mbit, rt 8Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit
       Imagine  A and C are constantly backlogged. As B is idle, A and C would
       divide bandwidth in 2:1 ratio, considering LS service curve (so in the-
       ory  -  6.66 and 3.33). Alas RT criterion takes priority, so A will get
       8Mbit and LS will be able to compensate class C for only 2 Mbit -  this
       will cause discrepancy between virtual times of A and C.
       Assume  this  situation lasts for a long time with no idle periods, and
       suddenly B  becomes  active.  B's  virtual  time  will  be  updated  to
       (A's  vt + C's vt)/2, effectively landing in the middle between A's and
       C's virtual time. The effect - B, having no RT guarantees, will be pun-
       ished  and  will  not  be  allowed  to  transfer until C's virtual time
       catches up.
       If the interface had a higher capacity, for example 100Mbit, this exam-
       ple would behave perfectly fine though.
       Let's  look  a  bit closer at the above example - it "cleverly" invali-
       dates one of the basic things LS criterion tries to achieve -  equality
       of  all  virtual  times across class hierarchy. Leaf classes without RT
       service curves are literally left to their own fate (governed by messed
       up virtual times).
       Also,  it doesn't make much sense. Class A will always be guaranteed up
       to 8Mbit, and this is more than any absolute bandwidth that could  hap-
       pen  from  its  LS  criterion  (excluding  trivial case of only A being
       active). If the bandwidth taken by A is  smaller  than  absolute  value
       from  LS  criterion,  the unused part will be automatically assigned to
       other active classes (as A has idling periods in such case).  The  only
       "advantage"  is,  that even in case of low bandwidth on average, bursts
       would be handled at the speed defined by RT criterion. Still, if  extra
       speed is needed (e.g. due to latency), non linear service curves should
       be used in such case.
       In the other words: the LS criterion is meaningless in the above  exam-
       ple.
       You  can  quickly "workaround" it by making sure each leaf class has RT
       service curve assigned (thus guaranteeing all of  them  will  get  some
       bandwidth), but it doesn't make it any more valid.
       Keep in mind - if you use nonlinear curves and irregularities explained
       above happen only in the first segment, then there's little wrong  with
       "overusing" RT curve a bit:
       A - ls 5.0Mbit, rt 9Mbit/30ms, then 1Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit
       Here,  the  vt of A will "spike" in the initial period, but then A will
       never get more than 1Mbit until B & C catch up. Then everything will be
       back to normal.
LINUX AND TIMER RESOLUTION
       In  certain  situations, the scheduler can throttle itself and setup so
       called watchdog to wakeup dequeue function at some time later. In  case
       of  HFSC it happens when for example no packet is eligible for schedul-
       ing, and UL service curve is used to limit the speed at which LS crite-
       rion  is  allowed to dequeue packets. It's called throttling, and accu-
       racy of it is dependent on how the kernel is compiled.
       There're 3 important options in modern kernels, as far as timers' reso-
       lution  goes:  'tickless  system',  'high resolution timer support' and
       'timer frequency'.
       If you have 'tickless system' enabled, then the  timer  interrupt  will
       trigger  as  slowly  as  possible,  but each time a scheduler throttles
       itself (or any other part of the kernel  needs  better  accuracy),  the
       rate  will  be  increased  as  needed / possible. The ceiling is either
       'timer frequency' if 'high resolution timer support' is  not  available
       or  not  compiled  in, or it's hardware dependent and can go far beyond
       the highest 'timer frequency' setting available.
       If 'tickless system' is not enabled, the timer will trigger at a  fixed
       rate  specified  by  'timer  frequency' - regardless if high resolution
       timers are or aren't available.
       This is important to keep those settings in mind, as in scenario  like:
       no tickless, no HR timers, frequency set to 100hz - throttling accuracy
       would be at 10ms. It doesn't automatically mean you would be limited to
       ~0.8Mbit/s (assuming packets at ~1KB) - as long as your queues are pre-
       pared to cover for timer inaccuracy. Of course, in case of e.g. locally
       generated  UDP  traffic  -  appropriate  socket size is needed as well.
       Short  example  to  make  it  more  understandable   (assume   hardcore
       anti-schedule settings - HZ=100, no HR timers, no tickless):
       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10Mbit
       Assuming  packet  of  ~1KB size and HZ=100, that averages to ~0.8Mbit -
       anything beyond it (e.g. the above example with specified rate over 10x
       larger) will require appropriate queuing and cause bursts every ~10 ms.
       As you can imagine, any HFSC's RT guarantees will be seriously  invali-
       dated  by that.  Aforementioned example is mainly important if you deal
       with old hardware - as is particularly popular for home server  chores.
       Even then, you can easily set HZ=1000 and have very accurate scheduling
       for typical adsl speeds.
       Anything modern (apic or even hpet msi based timers  +  'tickless  sys-
       tem')  will  provide  enough  accuracy for superb 1Gbit scheduling. For
       example, on one of my cheap dual-core AMD boards I have  the  following
       settings:
       tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 300mbit
       And a simple:
       nc -u dst.host.com 54321 </dev/zero
       nc -l -p 54321 >/dev/null
       ...will yield the following effects over a period of ~10 seconds (taken
       from /proc/interrupts):
       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)
       That's roughly 31000/s. Now compare it with HZ=1000 setting. The  obvi-
       ous  drawback  of it is that cpu load can be rather high with servicing
       that many timer interrupts. The example with 300Mbit RT  service  curve
       on  1Gbit link is particularly ugly, as it requires a lot of throttling
       with minuscule delays.
       Also note that it's just an example showing the capabilities of current
       hardware.   The  above  example (essentially a 300Mbit TBF emulator) is
       pointless on an internal interface to begin with: you will pretty  much
       always  want  a  regular LS service curve there, and in such a scenario
       HFSC simply doesn't throttle at all.
       300Mbit RT service curve (selected columns from mpstat -P ALL 1):
       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27
       So, in the rare case you need those  speeds  with  only  a  RT  service
       curve, or with a UL service curve: remember the drawbacks.
CAVEAT: RANDOM ONLINE EXAMPLES
       For reasons unknown (though well guessed), many examples you can google
       love to overuse UL criterion and stuff it in every node possible.  This
       makes no sense and works against what HFSC tries to do (and does pretty
       damn well). Use UL where it makes sense: on the uppermost node to match
       upstream router's uplink capacity. Or in special cases, such as testing
       (limit certain subtree to some speed), or customers that must never get
       more  than  certain speed. In the last case you can usually achieve the
       same by just using a RT criterion without LS+UL on leaf nodes.
       As for the router case - remember it's good  to  differentiate  between
       "traffic  to  router"  (remote console, web config, etc.) and "outgoing
       traffic", so for example:
       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50Mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2Mbit ul m2 2Mbit
       ... so "internet" tree under 1:1 and "router itself" as 1:999
LAYER2 ADAPTATION
       Please refer to tc-stab(8)
SEE ALSO
       tc(8), tc-hfsc(8), tc-stab(8)
       Please direct bugreports and patches to: <netdev AT vger.org>
AUTHOR
       Manpage created by Michal Soltys (soltys AT ziu.info)
iproute2                        31 October 2011                     TC-HFSC(7)