1 Generic SCSI target mid-level for Linux (SCST)
2 ==============================================
4 SCST is designed to provide a unified, consistent interface between SCSI
5 target drivers and the Linux kernel and to simplify target driver development
6 as much as possible. A detailed description of SCST's features and internals
7 can be found in the "Generic SCSI Target Middle Level for Linux" document on
8 SCST's Internet page http://scst.sourceforge.net.
10 SCST supports the following I/O modes:
12 * Pass-through mode with a one-to-many relationship, i.e. multiple
13 initiators can connect to the exported pass-through devices, for
14 the following SCSI device types: disks (type 0), tapes (type 1),
15 processors (type 3), CDROMs (type 5), MO disks (type 7), medium
16 changers (type 8) and RAID controllers (type 0xC)
18 * FILEIO mode, which allows files on file systems or block
19 devices to be used as virtual, remotely available SCSI disks or CDROMs
20 with the benefits of the Linux page cache
22 * BLOCKIO mode, which performs direct block IO with a block device,
23 bypassing the page cache for all operations. This mode works ideally with
24 high-end storage HBAs and for applications that either do not need
25 caching between application and disk or need the large block throughput
28 * User space mode using the scst_user device handler, which allows virtual
29 SCSI devices to be implemented in user space in the SCST
32 * "Performance" device handlers, which provide, in pseudo pass-through
33 mode, a way to measure performance directly without the overhead of
34 actually transferring data from/to the underlying SCSI device
36 In addition, SCST supports advanced per-initiator access and device
37 visibility management, so different initiators can see different sets
38 of devices with different access permissions. See below for details.
43 To see your devices remotely, you need to add them to at least the "Default"
44 security group (see below for how). By default, no local devices are visible
45 remotely. There must be a LUN 0 in each security group, i.e. LU
46 numbering must not start from, e.g., 1.
48 It is highly recommended to use the scstadmin utility for configuring
49 devices and security groups.
51 If you experience problems while loading or running the modules, check your
52 kernel logs (or run the dmesg command for the few most recent messages).
54 IMPORTANT: Without the appropriate device handler loaded, the corresponding
55 ========= devices will be invisible to remote initiators. This can leave holes
56 in the LUN addressing, so automatic device scanning by the remote SCSI
57 mid-level may not notice the devices. In that case you will have
58 to add them manually on the initiator via
59 'echo "- - -" >/sys/class/scsi_host/hostX/scan',
60 where X is the host number.
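
When the host number is not known in advance, the manual rescan above can
be applied to every SCSI host on the initiator. The following is only a
sketch: the function name rescan_scsi_hosts and the SYSFS_ROOT variable are
made up for this example (SYSFS_ROOT exists only so that the function can be
tried on a scratch directory instead of the real /sys).

```shell
# Root of sysfs; overridable only so the function can be tried on a scratch tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Ask every local SCSI host to rescan all channels, target IDs and LUNs.
# "- - -" is the wildcard triple (channel, id, lun) used in the text above.
rescan_scsi_hosts() {
    for scan in "$SYSFS_ROOT"/class/scsi_host/host*/scan; do
        [ -e "$scan" ] || continue    # the glob matched nothing
        echo "- - -" > "$scan"
    done
}
```

Run "rescan_scsi_hosts" as root on the initiator after the missing device
handler has been loaded on the target.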
62 IMPORTANT: Running target and initiator on the same host is
63 ========= supported, except in the following 2 cases: swap over a target
64 exported device and a writable mmap over a file from a target
65 exported device. The latter means you can't mount a file
66 system over a target exported device. In other words, you can
67 freely use any sg, sd, st, etc. devices imported from the target
68 on the same host, but you can't mount file systems or put
69 swap on them. This is a limitation of the Linux memory/cache
70 manager, because in this case an OOM deadlock is possible: the system
71 needs some memory -> it decides to clear some cache -> the cache
72 needs to write to the target exported device -> the initiator sends
73 a request to the target -> the target needs memory -> the system needs
74 even more memory -> deadlock.
76 IMPORTANT: In the current version simultaneous access to local SCSI devices
77 ========= via standard high-level SCSI drivers (sd, st, sg, etc.) and
78 SCST's target drivers is unsupported. This is especially
79 important for commands executed via sg and st that change
80 the state of devices and their parameters, because that could
81 lead to data corruption. If any such command is issued, at
82 least the related device handler(s) must be restarted. For block
83 devices READ/WRITE commands using direct disk handler look to
89 Device specific drivers (device handlers) are plugins for SCST, which
90 help SCST to analyze incoming requests and determine parameters
91 specific to various types of devices. If an appropriate device handler
92 for a SCSI device type isn't loaded, SCST doesn't know how to handle
93 devices of this type, so they will be invisible to remote initiators
94 (more precisely, a "LUN not supported" sense code will be returned).
96 In addition to device handlers for real devices, there are VDISK, user
97 space and "performance" device handlers.
99 The VDISK device handler works over files on file systems and turns
100 them into virtual, remotely available SCSI disks or CDROMs. In addition, it
101 can work directly over a block device, e.g. a local IDE or SCSI disk
102 or even a disk partition, where there is no file system overhead. Compared
103 to sending SCSI commands directly to the SCSI mid-level via
104 scsi_do_req()/scsi_execute_async(), using block devices has the advantage
105 that data are transferred via the system cache, so it is possible to fully
106 benefit from caching and read-ahead performed by Linux's VM subsystem. The
107 only disadvantage is that in FILEIO mode there is superfluous data
108 copying between the cache and SCST's buffers. This issue is going to be
109 addressed in the next release. Virtual CDROMs are useful for remote
110 installation. See below for details on how to set up and use the VDISK device
113 The SCST user space device handler provides an interface between SCST and
114 user space, which allows pure user space devices to be created. The
115 simplest example of where this is useful is writing a
116 VTL: with scst_user it can be written purely in user space. Another
117 is processing of the passed data that is too sophisticated for kernel
118 space, like encrypting the data or making snapshots.
120 "Performance" device handlers for disks, MO disks and tapes skip
121 (pretend to execute) all READ and WRITE operations in their exec() method
122 and thus provide a way to measure link performance directly without the
123 overhead of actually transferring data from/to the underlying SCSI device.
125 NOTE: Since "perf" device handlers don't touch the commands' data buffer
126 ==== on READ operations, it is returned to remote initiators as it
127 was allocated, without even being zeroed. Thus, "perf" device
128 handlers pose a security risk, so use them with caution.
133 The following compilation options can be changed using
134 your favorite kernel configuration Makefile target, e.g. "make xconfig":
136 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
137 including some logging. Makes the driver considerably bigger and slower
138 and produces a large amount of log data.
140 - CONFIG_SCST_TRACING - if defined, turns on the ability to log events. Makes the
141 driver considerably bigger and leads to some performance loss.
143 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
146 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the initiator
147 supplied expected data transfer length and direction are used only for
148 verification purposes, to return an error or warn if one of them
149 is invalid. The values decoded locally from the SCSI command are
150 used instead. This is necessary for security reasons, because otherwise a
151 faulty initiator can crash the target by supplying an invalid value in one
152 of those parameters. This is especially important in
153 pass-through mode. If CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator
154 supplied expected data transfer length and direction override
155 the locally decoded values. This might be necessary if the internal SCST
156 command translation table doesn't contain a SCSI command that is
157 used in your environment. You can tell this is the case if you see messages like
158 "Unknown opcode XX for YY. Should you update scst_scsi_op_table?" in
159 your kernel log and your initiator returns an error. Please also report
160 those messages to the SCST mailing list
161 scst-devel@lists.sourceforge.net. Note that not all SCSI transports
162 support supplying expected values.
164 - CONFIG_SCST_DEBUG_TM - if defined, turns on task management function
165 debugging, in which some of the commands on LUN 0 in the default access
166 control group are delayed for about 60 sec., making the remote
167 initiator send TM functions, e.g. ABORT TASK and TARGET RESET. Also
168 define the CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you
169 want the device to eventually become completely unresponsive; otherwise
170 it will keep cycling through the ABORT and RESET code. Needs CONFIG_SCST_DEBUG
173 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to
174 the underlying SCSI device synchronously, one after another. This makes task
175 management more reliable, at the cost of some performance penalty. This
176 is mostly relevant for stateful SCSI devices like tapes, where the
177 result of a command's execution depends on device settings defined
178 by previous commands. Disk and RAID devices are stateless in most
179 cases. The current SCSI core in Linux doesn't allow all commands to be
180 aborted reliably if they were sent asynchronously to a stateful device.
181 Turned off by default; turn it on if you use stateful device(s) and
182 need as much error recovery reliability as possible. As a side
183 effect, no kernel patching is necessary for pass-through device
184 handlers (scst_disk, etc.)
186 - CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ - if defined, pass-through
187 commands may be submitted to real SCSI devices via the SCSI
188 middle layer using the scsi_execute_async() function from soft IRQ
189 context (tasklets). This used to be the default, but currently it
190 seems the SCSI middle layer expects only thread context on
191 the IO submit path, so it is now disabled by default. Enabling it
192 will decrease the number of context switches and improve performance. It
193 is more or less safe: in the worst case, if in your configuration the
194 SCSI middle layer really doesn't expect SIRQ context in the
195 scsi_execute_async() function, you will get a warning message in the
198 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data
199 buffers. Undefining it (default) considerably improves performance
200 and eases CPU load, but could create a security hole (information
201 leakage), so enable it, if you have strict security requirements.
203 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined,
204 when the TASK MANAGEMENT function ABORT TASK tries to abort a
205 command that has already finished, the remote initiator that sent the
206 ABORT TASK request will receive a TASK NOT EXIST (or ABORT FAILED)
207 response for the ABORT TASK request. This is the more logical response,
208 since the command has finished and the attempt to abort it failed, but
209 some initiators, particularly the VMware iSCSI initiator, interpret a TASK
210 NOT EXIST response as a sign that the target has gone crazy and try to RESET
211 it, then sometimes go crazy themselves. So, this option is disabled by
214 - CONFIG_SCST_MEASURE_LATENCY - if defined, provides the average command
215 processing latency in the /proc/scsi_tgt/latency file. You can clear already
216 measured results by writing 0 to this file. Note that you need a
217 non-preemptible kernel to get correct results.
219 HIGHMEM kernel configurations are fully supported, but not recommended
220 for performance reasons, except for scst_user, where they are not
221 supported, because this module deals with user supplied memory in a
222 zero-copy manner. If you need to use it, consider changing the VMSPLIT option
223 or using a 64-bit system configuration instead.
225 To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
226 following variables in "make menuconfig":
228 - General setup->Configure standard kernel features (for small systems): ON
230 - General setup->Prompt for development and/or incomplete code/drivers: ON
232 - Processor type and features->High Memory Support: OFF
234 - Processor type and features->Memory split: according to the amount of
235 memory you have. If it is less than 800MB, you need not touch this
241 Module scst supports the following parameters:
243 - scst_threads - allows to set count of SCST's threads. By default it
246 - scst_max_cmd_mem - sets the maximum amount of memory, in MB, that SCST
247 commands may consume for data buffers at any given time. By
248 default it is approximately TotalMem/4.
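
To judge whether the default scst_max_cmd_mem suits a given machine, the
TotalMem/4 figure can be estimated from /proc/meminfo. This is a sketch
only: the function name is invented for this example, and it assumes that
"TotalMem" corresponds to the MemTotal line of /proc/meminfo (reported in kB).

```shell
# Print MemTotal/4 in MB, i.e. the approximate default of scst_max_cmd_mem.
# The meminfo path is a parameter only so a sample file can be used instead.
default_max_cmd_mem_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 4; exit }' "${1:-/proc/meminfo}"
}
```

Both parameters are passed at module load time in the usual way, for
example "modprobe scst scst_threads=4 scst_max_cmd_mem=512" (the values 4
and 512 are arbitrary illustrations, not recommendations).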
250 SCST "/proc" commands
251 ---------------------
253 For communication with user space programs SCST provides a proc-based
254 interface in the "/proc/scsi_tgt" directory. It contains the following
257 - "help" file, which provides online help for SCST commands
259 - "scsi_tgt" file, which on read provides information about the devices
260 served by SCST and their dev handlers. On write it supports the following
263 * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
264 to the device with host:channel:id:lun. The recommended way to find out
265 H:C:I:L numbers is to use the lsscsi utility.
267 - "sessions" file, which lists currently connected initiators (open sessions)
269 - "sgv" file, which provides some statistics about the block sizes of
270 commands coming from remote initiators and how effective sgv_pool is in
271 serving those allocations from the cache, i.e. without memory
272 allocation requests to the kernel. "Size" is the command's data
273 size rounded up to a power of 2, "Hit" is the number of
274 allocations served from the cache, "Total" is the total number of allocations.
276 - "threads" file, which allows to read and set number of SCST's threads
278 - "version" file, which shows version of SCST
280 - "trace_level" file, which allows the trace (logging) level for SCST to be
281 read and set. See the "help" file for the list of trace levels. If you want
282 to enable logging options that produce a lot of events, like "debug",
283 then to avoid losing logged events you should also:
285 * Increase the CONFIG_LOG_BUF_SHIFT variable in your kernel's .config
286 to a much bigger value, then recompile the kernel. For example, I use 25,
287 but to use it I needed to modify the maximum allowed value for
288 CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig.
290 * Change your /etc/syslog.conf, or the config file of your favorite
291 logging program, to store kernel logs in an async manner. For example,
292 I added the line "kern.info -/var/log/kernel" to my rsyslog.conf
293 and added "kern.none" to the line for /var/log/messages, so I had:
294 "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"
296 Each dev handler has its own subdirectory. Most dev handlers have only two
297 files in this subdirectory: "trace_level" and "type". The first one is
298 similar to the main SCST "trace_level" file, the latter shows the SCSI type
299 number of this handler as well as a text description.
301 For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
302 will assign device handler "dev_disk" to real device sitting on host 1,
303 channel 0, ID 1, LUN 0.
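
Since lsscsi is the recommended way to find H:C:I:L numbers, the lookup can
be scripted. The helper below is a sketch with an invented name (hcil_of);
it assumes the usual lsscsi output, in which every line starts with the
address in square brackets, e.g. "[1:0:1:0]    disk    ...".

```shell
# Read lsscsi output on stdin and print the H:C:I:L address of every line
# that contains the fixed string given as $1 (e.g. a model name or /dev node).
hcil_of() {
    grep -F -- "$1" |
        sed -n 's/^\[\([0-9]*:[0-9]*:[0-9]*:[0-9]*\)\].*/\1/p'
}
```

For example, "lsscsi | hcil_of /dev/sdb" prints the address to use in
"assign H:C:I:L dev_disk"; check that exactly one line was printed before
writing it to /proc/scsi_tgt/scsi_tgt.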
305 Access and devices visibility management (LUN masking)
306 ------------------------------------------------------
308 Access and device visibility management allows an initiator or
309 group of initiators to see different devices with different LUNs
310 and with the necessary access permissions.
312 SCST supports two modes of access control:
314 1. Target-oriented. In this mode you define for each target the devices and
315 their LUNs that are accessible to all initiators connected to that
316 target. This is the regular access control mode, which people usually have
317 in mind when thinking about access control in general. For instance, in IET
318 this is the only supported mode. In this mode you should create a security group
319 with name "Default_TARGET_NAME", where "TARGET_NAME" is the name of the
320 target, like "Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz"
321 for target "iqn.2007-05.com.example:storage.disk1.sys1.xyz". Then you
322 should add to it all LUNs available from that target.
324 2. Initiator-oriented. In this mode you define which devices and
325 LUNs are accessible to each initiator. In this mode you should create
326 a separate security group for each set of one or more initiators that
327 should have access to the same set of devices with the same LUNs, then add
328 to it the available devices and the names of the allowed initiator(s).
330 Both modes can be used simultaneously. In this case initiator-oriented
331 mode has higher priority than target-oriented.
333 When a target driver registers itself in SCST core, it tells SCST core
334 its name. Then, when there is a new connection from a remote initiator,
335 the target driver registers this connection in SCST core and tells it
336 the name of the remote initiator. Then SCST core finds the corresponding
337 devices for it using the following algorithm:
339 1. It searches through all defined groups trying to find group
340 containing the initiator name. If it succeeds, the found group is used.
342 2. Otherwise, it searches through all groups trying to find group with
343 name "Default_TARGET_NAME". If it succeeds, the found group is used.
345 3. Otherwise, the group with name "Default" is used. This group is
346 always defined, but empty by default.
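
The three lookup steps above can be modelled in shell to predict which
group a session will land in. This is only an illustrative model, not SCST
code: the function name session_group and the SCST_PROC variable are made
up, it assumes that reading a group's "names" file yields one name or
pattern per line, and it approximates the DOS-type patterns with shell
globbing (which also treats "[" specially). The order in which groups are
searched is an assumption as well.

```shell
# Overridable only so the function can be tried on a scratch directory.
SCST_PROC="${SCST_PROC:-/proc/scsi_tgt}"

# Print the security group a new session would be assigned to.
#   $1 - initiator name, $2 - target name
session_group() {
    # Step 1: a group whose "names" file contains a matching name/pattern.
    for dir in "$SCST_PROC"/groups/*/; do
        [ -r "${dir}names" ] || continue
        while IFS= read -r pattern; do
            [ -n "$pattern" ] || continue
            case "$1" in
                # $pattern is unquoted on purpose: '*' and '?' are wildcards.
                $pattern) basename "$dir"; return 0 ;;
            esac
        done < "${dir}names"
    done
    # Step 2: the per-target group "Default_TARGET_NAME", if defined.
    if [ -d "$SCST_PROC/groups/Default_$2" ]; then
        echo "Default_$2"
        return 0
    fi
    # Step 3: the always-present "Default" group.
    echo "Default"
}
```

The kernel log remains the authoritative source for the group actually
chosen; the sketch is meant for checking a configuration before initiators
connect.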
348 You can find the names of both target and initiator in the kernel log,
349 where SCST reports the group to which each session is assigned.
351 In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
352 subdirectory. In it there are files "devices" and "names". File
353 "devices" lists devices and their LUNs in the group, file "names" lists
354 names of initiators that are allowed to access devices in this group.
356 To configure access and device visibility management, SCST accepts the
357 following commands written to files under /proc/scsi_tgt:
359 - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"
361 - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"
363 - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
364 device with host:channel:id:lun with LUN "lun" in group "GROUP". Optionally,
365 the device can be marked as read only. The recommended way to find out
366 H:C:I:L numbers is to use the lsscsi utility.
368 - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes device with
369 host:channel:id:lun from group "GROUP". The recommended way to find out
370 H:C:I:L numbers is to use the lsscsi utility.
372 - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
373 device with virtual name "V_NAME" with LUN "lun" in group "GROUP".
374 Optionally, the device could be marked as read only.
376 - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes device with
377 virtual name "V_NAME" from group "GROUP"
379 - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of devices
382 - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to group
383 "GROUP". For NAME you can use simple DOS-type patterns containing
384 '*' and '?' symbols. '*' matches any sequence of symbols, '?'
385 matches any single symbol. For instance, "blah.xxx" will match
388 - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME" from group
391 - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of names
396 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
397 add real device sitting on host 1, channel 0, ID 1, LUN 0 to "Default"
400 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
401 add virtual VDISK device with name "disk1" to "Default" group
404 - "echo "add 21:*:e0:?b:83:*" >/proc/scsi_tgt/groups/LAB1/names" will
405 add a pattern, which matches WWNs of Fibre Channel ports from LAB1.
407 Suppose you need an iSCSI target with name
408 "iqn.2007-05.com.example:storage.disk1.sys1.xyz" (defined in
409 iscsi-scst.conf), which should export virtual device "dev1" with LUN 0
410 and virtual device "dev2" with LUN 1, but the initiator with name
411 "iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only
412 virtual device "dev2" with LUN 0. To achieve that you should do the
415 # echo "add_group Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/proc/scsi_tgt/scsi_tgt
416 # echo "add dev1 0" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices
417 # echo "add dev2 1" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices
419 # echo "add_group spec_ini" >/proc/scsi_tgt/scsi_tgt
420 # echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" >/proc/scsi_tgt/groups/spec_ini/names
421 # echo "add dev2 0" >/proc/scsi_tgt/groups/spec_ini/devices
423 It is highly recommended to use the scstadmin utility instead of the
424 low-level interface described in this section.
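
If the low-level interface has to be scripted anyway, small wrappers keep
the group, device and name commands in one place and make it possible to
stop at the first failed write. The function names and the SCST_PROC
variable below are invented for this sketch; only the command strings and
file locations come from this section.

```shell
# Overridable only so the helpers can be tried on a scratch directory.
SCST_PROC="${SCST_PROC:-/proc/scsi_tgt}"

# scst_add_group GROUP
scst_add_group() { echo "add_group $1" > "$SCST_PROC/scsi_tgt"; }

# scst_add_dev GROUP DEVICE LUN  (DEVICE is H:C:I:L or a virtual name)
scst_add_dev() { echo "add $2 $3" > "$SCST_PROC/groups/$1/devices"; }

# scst_add_name GROUP INITIATOR_NAME_OR_PATTERN
scst_add_name() { echo "add $2" > "$SCST_PROC/groups/$1/names"; }
```

With these, the initiator-specific part of the example above becomes
"scst_add_group spec_ini && scst_add_name spec_ini
iqn.2007-05.com.example:storage.disk1.spec_ini.xyz && scst_add_dev spec_ini
dev2 0", so a failed step prevents the following ones from running.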
429 There must be a LUN 0 in each security group, i.e. LU numbering must not
430 start from, e.g., 1.
435 All access control must be fully configured BEFORE loading the
436 corresponding target driver! When you load a target driver or enable
437 target mode in it, as for the qla2x00t driver, it will immediately start
438 accepting new connections, hence creating new sessions, and those new
439 sessions will be assigned to security groups according to the
440 *currently* configured access control settings. For instance, to the
441 "Default" group, instead of "HOST004" as you may need, because "HOST004"
442 doesn't exist yet. So, you must configure all the security groups before
443 new connections from the initiators are created, i.e. before target
446 Access controls can be altered after the target driver has been loaded as
447 long as the target session doesn't exist yet. Even if the
448 session already exists, changes are still possible, but they won't be
449 reflected on the initiator side.
451 So, the safest choice is to configure all the access control before any
452 target driver is loaded and then only add new devices to new groups for new
453 initiators or add new devices to old groups, but not alter existing
459 After loading, the VDISK device handler creates subdirectories "vdisk" and
460 "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:
462 - "trace_level" and "type" files as described for other dev handlers
464 - "help" file, which provides online help for VDISK commands
466 - "vdisk"/"vcdrom" files, which on read provide information about
467 currently open device files. On write they support the following
470 * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
471 device "NAME" with block size "BLOCK_SIZE" bytes with flags
472 "FLAGS". "PATH" could be empty only for VDISK CDROM. "BLOCK_SIZE"
473 and "FLAGS" are valid only for disk VDISK. The block size must be
474 power of 2 and >= 512 bytes. Default is 512. Possible flags:
476 - WRITE_THROUGH - write back caching disabled. Note that this option
477 makes sense only if you also *manually* disable the write-back cache
478 in *all* your backstorage devices and make sure it's actually
479 disabled, since many devices are known to lie about this mode to
480 get better benchmark results.
482 - READ_ONLY - read only
484 - O_DIRECT - both read and write caching disabled. This mode
485 isn't currently fully implemented, you should use user space
486 fileio_tgt program in O_DIRECT mode instead (see below).
488 - NULLIO - in this mode no real IO will be done, but success will be
489 returned. Intended to be used for performance measurements in the same
490 way as "*_perf" handlers.
492 - NV_CACHE - enables "non-volatile cache" mode. In this mode it is
493 assumed that the target has a GOOD UPS with the ability to cleanly
494 shut down the target in case of power failure and that it is
495 free of software/hardware bugs, i.e. all data from the target's
496 cache are guaranteed sooner or later to reach the media. Hence
497 all operations that synchronize data with the media, like
498 SYNCHRONIZE_CACHE, are ignored in order to gain more
499 performance. Also in this mode the target reports to initiators that
500 the corresponding device has a write-through cache, to disable all
501 write-back cache workarounds used by initiators. Use with
502 extreme caution, since in this mode after a crash of the target
503 journaled file systems don't guarantee consistency after
504 journal recovery, therefore a manual fsck MUST be run. Note that
505 since the journal barrier protection (see the "IMPORTANT"
506 note below) is usually turned off, enabling NV_CACHE could change nothing
507 from the data protection point of view, since no operations
508 synchronizing data with the media will come from the
509 initiator. This option overrides WRITE_THROUGH.
511 - BLOCKIO - enables block mode, which will perform direct block
512 IO with a block device, bypassing page-cache for all operations.
513 This mode works ideally with high-end storage HBAs and for
514 applications that either do not need caching between application
515 and disk or need the large block throughput. See also below.
517 - REMOVABLE - with this flag set the device is reported to remote
518 initiators as removable.
520 * "close NAME" - closes device "NAME".
522 * "resync_size NAME" - refreshes size of device "NAME". Intended to be
523 used after device resize.
525 * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.
527 By default, if neither BLOCKIO, nor NULLIO option is supplied, FILEIO
530 For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
531 will open file /vdisks/disk1 as virtual FILEIO disk with name "disk1".
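
Because a wrong block size can be destructive (see the CAUTION below), it
may be worth validating it before issuing "open". The sketch below uses
invented names (valid_block_size, vdisk_open, SCST_PROC) and passes FLAGS
through verbatim as a single argument, since this section does not state
how several flags are combined.

```shell
# Overridable only so the helper can be tried on a scratch directory.
SCST_PROC="${SCST_PROC:-/proc/scsi_tgt}"

# Succeed only if $1 is a decimal number, a power of 2 and at least 512.
valid_block_size() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 512 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# vdisk_open NAME PATH [BLOCK_SIZE] [FLAGS]
vdisk_open() {
    bs="${3:-512}"    # 512 is the documented default
    valid_block_size "$bs" || { echo "bad block size: $bs" >&2; return 1; }
    echo "open $1 $2 $bs${4:+ $4}" > "$SCST_PROC/vdisk/vdisk"
}
```

For example, "vdisk_open disk1 /vdisks/disk1 4096 BLOCKIO" writes
"open disk1 /vdisks/disk1 4096 BLOCKIO", while a block size of 1000 is
rejected before anything is written.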
533 CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
534 ======== ever try to export and then mount it (even accidentally) with another
535 block size. Otherwise you can *instantly* damage it pretty
536 badly, as well as all your data on it. Messages on the initiator
537 like "attempt to access beyond end of device" are the sign of
540 Moreover, if you want to compare how well different block sizes
541 work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
542 **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
543 other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
544 AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block
545 sizes isn't like switching between FILEIO and BLOCKIO, after
546 changing block size all previously written with another block
547 size data MUST BE ERASED. Otherwise you will have a full set of
548 very weird behaviors, because blocks addressing will be
549 changed, but initiators in most cases will not have a
550 possibility to detect that old addresses written on the device
551 in, e.g., partition table, don't refer anymore to what they are
554 IMPORTANT: By default, for performance reasons, VDISK FILEIO devices use a write
555 ========= back caching policy. This is generally safe for the consistency of
556 journaled file systems lying over them, but
557 your unsaved cached data will be lost in case of
558 power/hardware/software failure, so you must supply your
559 target server with some kind of UPS or disable write back
560 caching using the WRITE_THROUGH flag. You should also note that
561 file system journaling over devices with write back caching
562 enabled works reliably *ONLY* if the order of journal writes
563 is guaranteed or it uses some kind of data protection
564 barriers (i.e. after writing journal data some kind of
565 synchronization with media operations is used). Otherwise,
566 because of possible reordering in the cache, even after a
567 successful journal rollback you very much risk losing your
568 data on the FS. Currently, the Linux IO subsystem guarantees
569 the order of write operations only when data protection
570 barriers are used. Some info about it from the XFS point of view can
571 be found at http://oss.sgi.com/projects/xfs/faq.html#wcache.
572 On Linux initiators, for the EXT3 and ReiserFS file systems
573 barrier protection can be turned on using the "barrier=1" and
574 "barrier=flush" mount options respectively. Note that
575 usually it's turned off by default (see http://lwn.net/Articles/283161).
576 You can check if it's turned on or off by looking in /proc/mounts.
577 Windows and, AFAIK, other UNIXes don't need any special
578 explicit options and do the necessary barrier actions on
579 write-back caching devices by default. Also note
580 that on some real-life workloads write through caching might
581 perform better than write back caching with barrier
582 protection turned on.
583 You should also realize that Linux doesn't provide a
584 guarantee that after sync()/fsync() all written data have really
585 hit permanent storage; they can still be in the cache of your
586 backstorage device and be lost on a power failure. Thus,
587 even with write-through cache mode, you still need a good UPS
588 to protect yourself from data loss (note, data loss, not
589 file system integrity corruption).
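
The /proc/mounts check mentioned above can be automated on a Linux
initiator. The function below is a sketch with an invented name; it looks
only for the two option values named in this note ("barrier=1" and
"barrier=flush") and treats anything else, including a missing option, as
"not enabled", which is an assumption based on the default described above.

```shell
# barrier_enabled DEVICE [MOUNTS_FILE]
# Succeed if the file system mounted from DEVICE lists barrier=1 or
# barrier=flush among its mount options (4th field of /proc/mounts).
barrier_enabled() {
    opts=$(awk -v dev="$1" '$1 == dev { print $4; exit }' "${2:-/proc/mounts}")
    case ",$opts," in
        *,barrier=1,*|*,barrier=flush,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: "barrier_enabled /dev/sdb1 || echo 'barriers are off on
/dev/sdb1'", where /dev/sdb1 stands for a device imported from the target.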
591 IMPORTANT: Some disk and partition table management utilities don't support
592 ========= block sizes >512 bytes, so make sure that your favorite one
593 supports them. Currently only cfdisk is known to work only with
594 512-byte blocks; other utilities like fdisk on Linux or the
595 standard disk manager on Windows have proven to work well with
596 non-512-byte blocks. Note that if you export a disk file or
597 device with a block size different from the one with which
598 it was already partitioned, you could get various weird
599 things like utilities hanging or other unexpected behavior.
600 Hence, to be sure, zero the exported file or device before
601 the first access to it from the remote initiator with another
602 block size. On a Windows initiator make sure you "Set Signature"
603 in the disk manager on the drive imported from the target
604 before doing any other partitioning on it. After you have
605 successfully mounted a file system over a non-512-byte block
606 size device, the block size stops mattering: any program will
607 work with files on such a file system.
612 This module works best for these types of scenarios:
614 1) Data that are not aligned to 4K sector boundaries and <4K block sizes
615 are used, which is normally found in virtualization environments where
616 operating systems start partitions on odd sectors (Windows and its
619 2) Large block data transfers normally found in database loads/dumps and
622 3) Advanced relational database systems that perform their own caching
623 which prefer or demand direct IO access and, because of the nature of
624 their data access, can actually see worse performance with
625 non-discriminate caching.
627 4) Multiple layers of targets where the secondary and above layers need
628 to have a consistent view of the primary targets in order to preserve
629 data integrity, which a page cache backed IO type might not provide
632 Also it has an advantage over FILEIO that it doesn't copy data between
633 the system cache and the commands data buffers, so it saves a
634 considerable amount of CPU power and memory bandwidth.
636 IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent between
637 ========= them, if you try to use a device in both those modes simultaneously,
638 you will almost instantly corrupt your data on that device.
643 In pass-through mode (i.e. using the pass-through device handlers
644 scst_disk, scst_tape, etc.) SCSI commands coming from remote initiators
645 are passed to the local SCSI hardware on the target as is, without any
646 modifications. Like any other hardware, the local SCSI hardware cannot
647 handle commands whose amount of data and/or number of segments in the
648 scatter-gather array exceeds certain values. Therefore, when using
649 pass-through mode you should note that the maximum number of
650 segments and the maximum amount of transferred data for each SCSI command on
651 devices on initiators cannot be bigger than the corresponding values of
652 the corresponding SCSI devices on the target. Otherwise you will see
653 symptoms like small transfers working well but large ones stalling, and
654 messages like "Unable to complete command due to SG IO count
655 limitation" printed in the kernel logs.
You can't control the limit on scatter-gather segments from user
space, but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the same
or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb of
the corresponding devices on the target.
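As a rough illustration of this rule, the following Python sketch picks the
value to write to max_sectors_kb on an initiator so that it never exceeds the
target's max_hw_sectors_kb. The function names, the device name sdb and the
numbers are assumptions made for illustration; only the two sysfs attributes
come from the text above.

```python
def sysfs_queue_attr(device, attr):
    """Build the sysfs path of a block queue attribute, e.g. for 'sdb'."""
    return "/sys/block/{}/queue/{}".format(device, attr)


def read_kb(path):
    """Read an integer KB value from a sysfs attribute file."""
    with open(path) as f:
        return int(f.read().strip())


def clamp_max_sectors_kb(initiator_kb, target_hw_kb):
    """Return the initiator setting, lowered if it exceeds the target limit."""
    return min(initiator_kb, target_hw_kb)


if __name__ == "__main__":
    # Hypothetical numbers: the target reports 128 in max_hw_sectors_kb,
    # while the initiator currently has 512 in max_sectors_kb.
    print(sysfs_queue_attr("sdb", "max_sectors_kb"))
    print(clamp_max_sectors_kb(512, 128))
```

The target value has to be read on the target itself (for example with
read_kb) and carried over to the initiator by whatever means you prefer.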
For non-block devices SCSI commands are usually generated directly by
applications, so if you experience stalls on large transfers, you should
check the documentation for your application on how to limit the transfer
668 Another way to solve this issue is to build SG entries with more than 1
669 page each. See the following patch as an example:
670 http://scst.sf.net/sgv_big_order_alloc.diff
672 User space mode using scst_user dev handler
673 -------------------------------------------
The user space program fileio_tgt uses the interface of the scst_user dev
handler and lets you see how it works in various modes. Fileio_tgt provides
mostly the same functionality as the scst_vdisk handler, with the most
noticeable difference being that it supports O_DIRECT mode. O_DIRECT mode is
basically the same as BLOCKIO, but also supports files, so for some
loads it could be significantly faster than regular FILEIO access.
Everything said about BLOCKIO above applies to O_DIRECT as well. See
fileio_tgt's README file for more details.
687 Before doing any performance measurements note that:
I. Performance results depend heavily on your type of load,
so it is crucial that you choose the access mode (FILEIO, BLOCKIO,
O_DIRECT, pass-through) which suits your needs best.
693 II. In order to get the maximum performance you should:
697 - Disable in Makefile CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS,
698 CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY
700 - For pass-through devices enable
701 CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ.
703 2. For target drivers:
705 - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
708 3. For device handlers, including VDISK:
710 - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.
IMPORTANT: Some of the above compilation options are enabled by default in the SCST SVN,
========= i.e. the development version of SCST is currently optimized for
development and bug hunting rather than for performance.
If you use an SCST version taken directly from the SVN repository, you can
set the above options, except CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ,
to the needed values using the debug2perf target of the root Makefile.
721 4. For other target and initiator software parts:
723 - Don't enable debug/hacking features in the kernel, i.e. use them as
- The default kernel read-ahead and queuing settings are optimized
for locally attached disks, therefore they are not optimal for disks
attached remotely (SCSI target case), which can sometimes lead to
unexpectedly low throughput. You should increase the read-ahead size to at
least 512KB or even more on all initiators and the target.
You should also limit the maximum amount of sectors per SCSI command on
all initiators. This tuning is also recommended on targets with large
read-ahead values. To do it on Linux, run:
echo 64 > /sys/block/sdX/queue/max_sectors_kb
where, instead of X, you specify the letter of your device imported from the target,
741 To increase read-ahead size on Linux, run:
743 blockdev --setra N /dev/sdX
745 where N is a read-ahead number in 512-byte sectors and X is a device
Note: you need to set the read-ahead setting for device sdX again after
you have changed the maximum amount of sectors per SCSI command for that
Note2: you need to restart SCST after you have changed read-ahead settings
- You may need to increase the number of requests that the OS on the initiator
sends to the target device. To do it on Linux initiators, run
echo 64 > /sys/block/sdX/queue/nr_requests
760 where X is a device letter like above.
You may also experiment with other parameters in the /sys/block/sdX
directory; they also affect performance. If you find the best values,
please share them with us.
- On the target use the CFQ IO scheduler. In most cases it has a performance
advantage over other IO schedulers, sometimes a huge one (2+ times
higher aggregate throughput).
770 - It is recommended to turn the kernel preemption off, i.e. set
771 the kernel preemption model to "No Forced Preemption (Server)".
773 - Looks like XFS is the best filesystem on the target to store device
774 files, because it allows considerably better linear write throughput,
777 5. For hardware on target.
- Make sure that your target hardware (e.g. target FC or network card)
and underlying IO hardware (e.g. an IO card, like SATA, SCSI or RAID, to
which your disks are connected) don't share the same PCI bus. You can
check it using the lspci utility. They have to work in parallel, so it
is better if they don't compete for the bus. The problem is not
only the bandwidth, which they have to share, but also the
interaction between the cards during that competition. This is very
important, because in some cases, if the target and backend storage
controllers share the same PCI bus, performance can be 5-10 times
lower than expected. Moreover, some motherboards (from
Supermicro in particular) have serious stability issues if there are
several high speed devices on the same bus working in parallel. If
you have no choice but PCI bus sharing, set in the BIOS PCI latency
6. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option will
give you the best performance. But when using it, make sure you use a good
UPS that is able to shut down the target on a power failure.
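The block queue tuning from item 4 above can be sketched in Python as follows.
This is only an illustration: the function names and the device name sdb are
hypothetical, while the values 64, 512KB and 64 are the ones mentioned in the
text. The sketch only builds the command lines; it does not run them.

```python
def kb_to_sectors(kb, sector_size=512):
    """Convert kilobytes to a number of sectors (512-byte by default)."""
    return kb * 1024 // sector_size


def tuning_commands(dev, max_sectors_kb=64, read_ahead_kb=512, nr_requests=64):
    """Return the shell commands for one device.

    Read-ahead is set after max_sectors_kb, because changing
    max_sectors_kb requires setting read-ahead again.
    """
    queue = "/sys/block/{}/queue".format(dev)
    return [
        "echo {} > {}/max_sectors_kb".format(max_sectors_kb, queue),
        "blockdev --setra {} /dev/{}".format(kb_to_sectors(read_ahead_kb), dev),
        "echo {} > {}/nr_requests".format(nr_requests, queue),
    ]


if __name__ == "__main__":
    for command in tuning_commands("sdb"):
        print(command)
```

Since blockdev --setra takes its argument in 512-byte sectors, a 512KB
read-ahead corresponds to N = 1024.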
You can find baseline performance numbers in these measurements:
http://lkml.org/lkml/2009/3/30/283.
IMPORTANT: If you use some versions of Windows (at least W2K) on the initiator,
========= you can't get good write performance for VDISK FILEIO devices with
the default 512-byte block size. You could get about 10% of the
expected performance. This is because of the partition alignment, which
is (simplifying) incompatible with how the Linux page cache
works, so for each write the corresponding block must be read
first. Use a 4096-byte block size for VDISK devices and you
will get the expected write performance. Actually, any OS on
the initiators, not only Windows, will benefit from a block size of
max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block size
of the underlying FS on which the device file is located, or 0
if a device node is used. Both values are taken on the target.
See also the important notes above about setting block sizes >512 bytes
for VDISK FILEIO devices.
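The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule can be written down as a
small Python sketch, meant to be run on the target. The function names are
hypothetical, and using the f_bsize field of statvfs as the file system block
size is an assumption of this sketch, not something SCST prescribes.

```python
import os


def recommended_block_size(page_size, fs_block_size):
    """max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS); pass 0 for a device node."""
    return max(page_size, fs_block_size)


def block_size_for(backing_path, is_device_node=False):
    """Apply the rule to a backing file or device node on this machine."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    # Assumption: f_bsize stands in for the block size of the underlying FS.
    fs_block = 0 if is_device_node else os.statvfs(backing_path).f_bsize
    return recommended_block_size(page_size, fs_block)
```

For example, with a 4096-byte page size and a device node as the backing
store, the rule gives max(4096, 0) = 4096 bytes.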
In some cases, for instance when working with SSD devices that consume 100%
of a single CPU for data transfers in their internal threads, it may be
necessary to assign dedicated CPUs to those threads using the Linux CPU
affinity facilities in order to maximize IOPS. No IRQ processing should be
done on those CPUs. Check that using /proc/interrupts. See the taskset
command and Documentation/IRQ-affinity.txt in your kernel's source tree
for how to assign CPU affinity to tasks and IRQs.
The reason for that is that processing of incoming commands in SIRQ
context might be done on the same CPUs as the SSD devices' threads doing data
transfers. As a result, those threads won't receive all the processing
power of those CPUs and will perform worse.
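A minimal Python sketch of this check is shown below. It compares two
snapshots of /proc/interrupts and reports the CPUs that handled no numbered
(device) IRQs in between, then shows how a thread could be pinned to such a
CPU. The function names are hypothetical, and counting only the numbered IRQ
rows (ignoring rows such as LOC or RES) is a simplification chosen for this
sketch.

```python
import os


def per_cpu_irq_totals(text):
    """Sum the numbered IRQ rows of /proc/interrupts per CPU."""
    lines = text.strip().splitlines()
    cpus = [int(name[3:]) for name in lines[0].split()]  # "CPU3" -> 3
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        # Skip empty lines and non-numbered rows such as LOC, RES or ERR.
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        counts = fields[1:1 + len(cpus)]
        if len(counts) != len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def cpus_without_new_irqs(before, after):
    """Return the CPUs whose device IRQ totals did not change."""
    old = per_cpu_irq_totals(before)
    new = per_cpu_irq_totals(after)
    return sorted(cpu for cpu in old if new.get(cpu) == old[cpu])


def pin_task(pid, cpu):
    """Bind the task with the given PID or TID to one CPU (Linux only)."""
    os.sched_setaffinity(pid, {cpu})
```

On a live system the two snapshots would be read from /proc/interrupts a few
seconds apart under load; the result is equivalent to what taskset does for
a single task.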
831 Work if target's backstorage or link is too slow
832 ------------------------------------------------
Under high I/O load, when your target's backstorage gets overloaded, or
when working over a slow link between initiator and target that
can't serve all the queued commands on time, you can experience I/O
stalls or see abort or reset messages in the kernel log.
First, consider the case of a target backstorage that is too slow. On some
seek intensive workloads even fast disks or RAIDs, which are able to serve
a continuous data stream at 500+ MB/s, can be as slow as 0.3 MB/s.
Another possible cause for that can be MD/LVM/RAID on your target, as in
http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well).
Thus, in such situations processing of one or more commands simply takes
too long, hence the initiator decides that they are stuck on the target
and tries to recover. In particular, it is known that the default number
of simultaneously queued commands (48) is sometimes too high if you do
intensive writes from VMware on a target disk which uses LVM in
snapshot mode. In this case a value like 16 or even 8-10, depending on your
backstorage speed, could be more appropriate.
Unfortunately, SCST currently lacks dynamic I/O flow control, in which the
queue depth on the target is dynamically decreased/increased based on
how slow/fast the backstorage is compared to the target link. So,
there are 6 possible actions which you can take to work around or fix this
859 1. Ignore incoming task management (TM) commands. It's fine if there are
860 not too many of them, so average performance isn't hurt and the
861 corresponding device isn't getting put offline, i.e. if the backstorage
2. Decrease /sys/block/sdX/device/queue_depth on the initiator if
it is Linux (see below how) and/or the SCST_MAX_TGT_DEV_COMMANDS constant
in the scst_priv.h file until you stop seeing incoming TM commands.
The ISCSI-SCST driver also has its own iSCSI specific parameter for that,
To decrease the device queue depth on Linux initiators you can run the command:
# echo Y >/sys/block/sdX/device/queue_depth
where Y is the new number of simultaneously queued commands and X is your
imported device letter, like 'a' for the sda device. There are no special
limitations on the value of Y: it can be any value from 1 to the possible
maximum (usually 32), so start by dividing the current value by 2, i.e. set
16 if /sys/block/sdX/device/queue_depth contains 32.
880 3. Increase the corresponding timeout on the initiator. For Linux it is
/sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can
be done automatically by a udev rule. For instance, the following
rule will increase it to 300 seconds:
886 SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300"
888 By default, this timeout is 30 or 60 seconds, depending on your distribution.
890 4. Try to avoid such seek intensive workloads.
892 5. Increase speed of the target's backstorage.
6. Implement dynamic I/O flow control in SCST. This would be the ultimate
solution. See the "Dynamic I/O flow control" section on the
http://scst.sourceforge.net/contributing.html page for possible
Next, consider the case of a too slow link between initiator and target,
when the initiator tries to simultaneously push N commands to the target
over it. In this case the time to serve those commands, i.e. to send or
receive data for them over the link, can be longer than the timeout for any
single command, hence one or more commands in the tail of the queue cannot be
served within the timeout, so the initiator will decide that
they are stuck on the target and will try to recover.
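The effect can be estimated with simple arithmetic. The Python sketch below
assumes that the link is the only bottleneck and ignores protocol overhead and
backstorage latency; the function names and the example numbers (1 MiB
commands over a 1 MiB/s link) are hypothetical.

```python
def time_to_drain_queue(queue_depth, bytes_per_command, link_bytes_per_sec):
    """Seconds until the last of queue_depth queued commands is transferred."""
    return queue_depth * bytes_per_command / link_bytes_per_sec


def max_queue_depth_for_timeout(timeout_sec, bytes_per_command, link_bytes_per_sec):
    """Largest queue depth whose tail command still fits into the timeout."""
    return max(1, int(timeout_sec * link_bytes_per_sec // bytes_per_command))


if __name__ == "__main__":
    mib = 2 ** 20
    print(time_to_drain_queue(48, mib, mib))
    print(max_queue_depth_for_timeout(30, mib, mib))
```

With these example numbers, 48 queued commands need 48 seconds to cross the
link, which is more than a 30 second timeout, while a queue depth of 30 or
less would fit.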
To work around or fix this issue in this case you can use ways 1, 2, 3, 6
above or (7): increase the speed of the link between target and initiator.
But for some initiator implementations there might be cases for WRITE
commands when the target has no way to detect the issue, so dynamic I/O flow
control will not be able to help. In those cases you may also need to
either decrease the queue depth (way 2) or increase
the corresponding timeout (way 3) on the initiator(s).
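Way 2 (start by halving the queue depth and continue until incoming TM
commands stop) can be sketched in Python as follows. The function names and
the device name are hypothetical, and writing to sysfs needs root on a real
Linux initiator.

```python
def queue_depth_path(dev):
    """sysfs attribute holding the queue depth of a SCSI block device."""
    return "/sys/block/{}/device/queue_depth".format(dev)


def next_queue_depth(current):
    """Halve the queue depth, never going below 1."""
    return max(1, current // 2)


def halving_schedule(start):
    """Successive values to try while TM commands keep arriving."""
    steps = []
    depth = start
    while depth > 1:
        depth = next_queue_depth(depth)
        steps.append(depth)
    return steps


def set_queue_depth(dev, depth):
    """Write the new depth to sysfs, like 'echo Y > .../queue_depth'."""
    with open(queue_depth_path(dev), "w") as f:
        f.write(str(depth))
```

Starting from 32, halving_schedule(32) suggests trying 16 first and then
lower values only if the problem persists.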
Note that logged messages about QUEUE_FULL status are quite different
by nature. This is normal operation, just SCSI flow control in action.
Simply don't enable the "mgmt_minor" logging level, or, alternatively, if
you are confident in the worst case performance of your back-end storage
or initiator-target link, you can increase SCST_MAX_TGT_DEV_COMMANDS in
scst_priv.h to 64. Usually initiators don't try to push more commands on
928 * Mark Buechler <mark.buechler@gmail.com> for a lot of useful
929 suggestions, bug reports and help in debugging.
931 * Ming Zhang <mingz@ele.uri.edu> for fixes and comments.
933 * Nathaniel Clark <nate@misrule.us> for fixes and comments.
935 * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
938 * Hu Gang <hugang@soulinfo.com> for the original version of the
941 * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
942 of the LSI target driver.
944 * Ross S. W. Walker <rswwalker@hotmail.com> for the original block IO
945 code and Vu Pham <huongvp@yahoo.com> who updated it for the VDISK dev
948 * Michael G. Byrnes <michael.byrnes@hp.com> for fixes.
950 * Alessandro Premoli <a.premoli@andxor.it> for fixes
952 * Nathan Bullock <nbullock@yottayotta.com> for fixes.
954 * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.
956 * Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports.
958 * Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with
961 * Bart Van Assche <bart.vanassche@gmail.com> for a lot of help
963 Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net