Generic SCSI target mid-level for Linux (SCST)
==============================================

Version 1.0.1, X XXXX 2008
--------------------------

SCST is designed to provide a unified, consistent interface between SCSI
target drivers and the Linux kernel and to simplify target driver
development as much as possible. A detailed description of SCST's
features and internals can be found in the "Generic SCSI Target Middle
Level for Linux" document on SCST's Internet page
http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. when multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC)

 * FILEIO mode, which allows files on file systems or block devices to
   be used as virtual, remotely available SCSI disks or CDROMs with
   the benefits of the Linux page cache

 * BLOCKIO mode, which performs direct block IO with a block device,
   bypassing the page cache for all operations. This mode works ideally
   with high-end storage HBAs and for applications that either do not
   need caching between application and disk or need large block
   throughput

 * User space mode using the scst_user device handler, which allows
   SCST virtual SCSI devices to be implemented in user space

 * "Performance" device handlers, which provide, in pseudo pass-through
   mode, a way to measure performance directly without the overhead of
   actually transferring data from/to the underlying SCSI device

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.

Only vanilla kernels from kernel.org are supported, but SCST should work
on vendors' kernels if you manage to compile it on them. The main
problem with vendors' kernels is that they often contain patches that
will appear only in the next version of the vanilla kernel, so it is
quite hard to track such changes. Thus, if during compilation for some
vendor kernel your compiler complains about redefinition of some
symbol, you should either switch to a vanilla kernel or adjust the
"#if LINUX_VERSION_CODE" statement that corresponds to that symbol.

First, make sure that the link "/lib/modules/`your_kernel_version`/build"
points to the source code for your currently running kernel.

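The check above can be sketched as a small shell function. This is only
an illustration, not part of SCST: the function name kernel_build_dir
and the MODULES_ROOT override are hypothetical and exist so that the
check can be tried outside /lib/modules.

```sh
# Sketch: resolve the kernel build directory for a given kernel version.
# MODULES_ROOT is a hypothetical override for testing; the real location
# is /lib/modules.
kernel_build_dir() {
    kver="${1:-$(uname -r)}"
    link="${MODULES_ROOT:-/lib/modules}/$kver/build"
    # -d follows symlinks, so a dangling link fails here as well.
    if [ ! -d "$link" ]; then
        echo "missing or dangling: $link" >&2
        return 1
    fi
    # Print the physical directory the link resolves to.
    (cd "$link" && pwd -P)
}
```

Calling kernel_build_dir without arguments checks the running kernel;
compare the printed directory with the location of your kernel sources.
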
Then, since in the mainstream kernels scsi_do_req()/scsi_execute_async()
work in LIFO order instead of the expected and required FIFO, SCST needs
the new functions scsi_do_req_fifo()/scsi_execute_async_fifo() to be
added to the kernel. The patch scst_exec_req_fifo.patch from the
"kernel" directory does that. If it doesn't apply to your kernel, apply
it manually; it only adds one of those functions and nothing more. You
can skip patching the kernel if you don't need pass-through support or
if CONFIG_SCST_STRICT_SERIALIZING is defined during compilation (see the
description below).

Then, to get the maximum performance, you should apply the
export_alloc_io_context patch. This patch simply makes the
alloc_io_context() function available to modules, not only to code
built into the kernel.

To compile SCST, type 'make scst'. It will build SCST itself and its
device handlers. To install them, type 'make scst_install'. The driver
modules will be installed in '/lib/modules/`your_kernel_version`/extra'.
In addition, scst.h and scst_debug.h, as well as Module.symvers or
Modules.symvers, will be copied to '/usr/local/include/scst'. The first
file contains all of SCST's public data definitions, which are used by
target drivers. The other ones support debug message logging and the
build process.

Then you can load any module by typing 'modprobe module_name'. The
module names are:

 - scst_disk - device handler for disks (type 0)
 - scst_tape - device handler for tapes (type 1)
 - scst_processor - device handler for processors (type 3)
 - scst_cdrom - device handler for CDROMs (type 5)
 - scst_modisk - device handler for MO disks (type 7)
 - scst_changer - device handler for medium changers (type 8)
 - scst_raid - device handler for storage array controllers (e.g. RAID)
   (type C)
 - scst_vdisk - device handler for virtual disks (file, device or ISO CD
   image)
 - scst_user - user space device handler

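A minimal sketch of loading the core module and a chosen set of
handlers is shown below. The helper name load_scst_modules and the
DRY_RUN switch are hypothetical; by default the function only prints
the modprobe commands, so nothing is loaded unless you opt in.

```sh
# Sketch: print (or run) modprobe for scst and the given handlers.
# DRY_RUN is a hypothetical switch; it defaults to 1 (print only).
load_scst_modules() {
    for mod in scst "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "modprobe $mod"
        else
            modprobe "$mod" || return 1
        fi
    done
}
```

For example, "load_scst_modules scst_disk scst_vdisk" prints the three
commands; run it as root with DRY_RUN=0 to actually load the modules.
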
Then, to see your devices remotely, you need to add them to at least the
"Default" security group (see below how). By default, no local devices
are seen remotely. There must be a LUN 0 in each security group, i.e. LU
numbering must not start from, e.g., 1.

It is highly recommended to use the scstadmin utility for configuring
devices and security groups.

If you experience problems while loading or running the modules, check
your kernel logs (or run the dmesg command for the most recent messages).

IMPORTANT: Without the appropriate device handler loaded, the
=========  corresponding devices will be invisible to remote
           initiators, which can lead to holes in the LUN addressing,
           so automatic device scanning by the remote SCSI mid-level
           might not notice the devices. In that case you will have to
           add them manually via
           'echo "- - -" >/sys/class/scsi_host/hostX/scan',
           where X is the host number.

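If the host number is not known, the manual rescan can be applied to
every SCSI host on the initiator. The following is a sketch under
assumptions: the function name rescan_scsi_hosts and the SYSFS_ROOT
override are hypothetical, and the function must run as root.

```sh
# Sketch: write "- - -" to the scan file of every SCSI host.
# SYSFS_ROOT is a hypothetical override for testing; normally /sys.
rescan_scsi_hosts() {
    sysfs="${SYSFS_ROOT:-/sys}"
    for scan in "$sysfs"/class/scsi_host/host*/scan; do
        # Skip the unexpanded glob when no hosts exist.
        [ -e "$scan" ] || continue
        echo "- - -" > "$scan" || return 1
        echo "rescanned ${scan%/scan}"
    done
}
```
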
IMPORTANT: Running the target and an initiator on the same host isn't
=========  supported. This is a limitation of the Linux memory/cache
           manager, because in this case an OOM deadlock can occur:
           system needs some memory -> it decides to clear some cache
           -> cache needs to write to a target-exported device ->
           initiator sends a request to the target -> target needs
           memory.

IMPORTANT: In the current version, simultaneous access to local SCSI
=========  devices via the standard high-level SCSI drivers (sd, st, sg,
           etc.) and SCST's target drivers is unsupported. This is
           especially important for commands executed via sg and st
           that change the state of devices and their parameters,
           because that could lead to data corruption. If any such
           command is executed, at least the related device handler(s)
           must be restarted. For block devices, READ/WRITE commands
           using the direct disk handler appear to be safe.

To uninstall, type 'make scst_uninstall'.

Device specific drivers (device handlers) are plugins for SCST that
help SCST analyze incoming requests and determine parameters specific
to various types of devices. If an appropriate device handler for a
SCSI device type isn't loaded, SCST doesn't know how to handle devices
of this type, so they will be invisible to remote initiators (more
precisely, the "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are the VDISK,
user space and "performance" device handlers.

The VDISK device handler works over files on file systems and turns
them into virtual, remotely available SCSI disks or CDROMs. In
addition, it can work directly over a block device, e.g. a local IDE or
SCSI disk or even a disk partition, where there is no file system
overhead. Compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), using block devices
has the advantage that data are transferred via the system cache, so it
is possible to fully benefit from the caching and read-ahead performed
by Linux's VM subsystem. The only disadvantage is that in FILEIO mode
there is superfluous data copying between the cache and SCST's buffers.
This issue is going to be addressed in the next release. Virtual CDROMs
are useful for remote installation. See below for details on how to set
up and use the VDISK device handler.

The SCST user space device handler provides an interface between SCST
and user space, which allows pure user space devices to be created. The
simplest example of where one would want it is writing a VTL: with
scst_user it can be written purely in user space. Another is when the
passed data need processing that is too sophisticated for kernel space,
like encrypting them or making snapshots.

"Performance" device handlers for disks, MO disks and tapes skip
(pretend to execute) all READ and WRITE operations in their exec()
method and thus provide a way to measure link performance directly,
without the overhead of actually transferring data from/to the
underlying SCSI device.

NOTE: Since "perf" device handlers don't touch the commands' data
====  buffer on READ operations, it is returned to remote initiators as
      it was allocated, without even being zeroed. Thus, "perf" device
      handlers pose a security risk, so use them with caution.

The following compilation options can be commented in or out in the
Makefile:

 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
   including some logging. Makes the driver considerably bigger and
   slower and produces a large amount of log data.

 - CONFIG_SCST_TRACING - if defined, turns on the ability to log events.
   Makes the driver considerably bigger and leads to some performance
   loss.

 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks.

 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the
   initiator-supplied expected data transfer length and direction will
   be used only for verification, to return an error or warn if one of
   them is invalid. The values decoded locally from the SCSI command
   will be used instead. This is necessary for security reasons,
   because otherwise a faulty initiator can crash the target by
   supplying an invalid value in one of those parameters. This is
   especially important in pass-through mode. If
   CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator-supplied
   expected data transfer length and direction will override the
   locally decoded values. This might be necessary if the internal SCST
   command translation table doesn't contain a SCSI command that is
   used in your environment. You can tell that this is the case if you
   have messages like "Unknown opcode XX for YY. Should you update
   scst_scsi_op_table?" in your kernel log and your initiator returns
   an error. Please also report those messages to the SCST mailing list
   scst-devel@lists.sourceforge.net. Note that not all SCSI transports
   support supplying expected values.

 - CONFIG_SCST_DEBUG_TM - if defined, turns on debugging of task
   management functions: some of the commands on LUN 0 in the default
   access control group will be delayed for about 60 sec., making the
   remote initiator send TM functions, e.g. ABORT TASK and TARGET RESET.
   Also define the CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile
   if you want the device to eventually become completely unresponsive;
   otherwise it will keep cycling through the ABORT and RESET code.
   Requires CONFIG_SCST_DEBUG.

 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all
   commands to the underlying SCSI device synchronously, one after
   another. This makes task management more reliable, at the cost of
   some performance. It is mostly relevant for stateful SCSI devices
   like tapes, where the result of a command's execution depends on
   device settings defined by previous commands. Disk and RAID devices
   are stateless in most cases. The current SCSI core in Linux doesn't
   allow all commands to be aborted reliably if they were sent
   asynchronously to a stateful device. Turned off by default; turn it
   on if you use stateful device(s) and need as much error recovery
   reliability as possible. As a side effect, no kernel patching is
   necessary.

 - CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ - if defined,
   pass-through commands may be submitted to real SCSI devices via the
   SCSI middle layer using the scsi_execute_async() function from soft
   IRQ context (tasklets). This used to be the default, but the SCSI
   middle layer now seems to expect only thread context on the IO
   submit path, so it is disabled by default. Enabling it will decrease
   the number of context switches and improve performance. It is more
   or less safe: in the worst case, if in your configuration the SCSI
   middle layer really doesn't expect SIRQ context in the
   scsi_execute_async() function, you will get a warning message in the
   kernel log.

 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated
   data buffers. Undefining it (default) considerably improves
   performance and eases CPU load, but could create a security hole
   (information leakage), so enable it if you have strict security
   requirements.

 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if
   defined, when the TASK MANAGEMENT function ABORT TASK tries to abort
   a command that has already finished, the remote initiator that sent
   the ABORT TASK request will receive a TASK NOT EXIST (or ABORT
   FAILED) response. This is the more logical response, since the
   attempt to abort an already finished command has failed, but some
   initiators, particularly the VMware iSCSI initiator, treat a TASK NOT
   EXIST response as a sign that the target has gone crazy and try to
   RESET it, and then sometimes go crazy themselves. So, this option is
   disabled by default.

 - CONFIG_SCST_MEASURE_LATENCY - if defined, provides the average
   command processing latency in the /proc/scsi_tgt/latency file. You
   can clear the already measured results by writing 0 to this file.
   Note that you need a non-preemptible kernel to get correct results.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user supplied memory in a
zero-copy manner. If you need to use it, consider changing the VMSPLIT
option or using a 64-bit system configuration instead.

To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small systems): ON

 - General setup->Prompt for development and/or incomplete code/drivers: ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount of
   memory you have. If it is less than 800MB, you don't need to touch
   this option.

Module scst supports the following parameters:

 - scst_threads - sets the number of SCST's threads. By default it

 - scst_max_cmd_mem - sets the maximum amount of memory, in MB, that
   SCST commands may consume for data buffers at any given time. By
   default it is approximately TotalMem/4.

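To get a feeling for the scst_max_cmd_mem default, the TotalMem/4
figure can be estimated from /proc/meminfo. This is only a sketch: the
function name default_max_cmd_mem_mb is hypothetical, it assumes that
TotalMem corresponds to the MemTotal field, and since the default is
described as approximate, the value SCST actually picks may differ.

```sh
# Sketch: estimate TotalMem/4 in MB from a meminfo-style file.
# Assumes the MemTotal line is reported in kB, as in /proc/meminfo.
default_max_cmd_mem_mb() {
    meminfo="${1:-/proc/meminfo}"
    # kB -> MB, then take a quarter; %d truncates to a whole number.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 4 }' "$meminfo"
}
```

Both parameters are passed on the modprobe command line, for instance
'modprobe scst scst_threads=4 scst_max_cmd_mem=256' (the values here
are purely illustrative).
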
SCST "/proc" commands
---------------------

For communication with user space programs SCST provides a proc-based
interface in the "/proc/scsi_tgt" directory. It contains the following
entries:

 - "help" file, which provides online help for SCST commands

 - "scsi_tgt" file, which on read provides information about the devices
   served by SCST and their dev handlers. On write it supports the
   following commands:

     * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
       to the device with host:channel:id:lun

 - "sessions" file, which lists currently connected initiators (open
   sessions)

 - "sgv" file, which provides some statistics about the block sizes of
   commands coming from remote initiators and how effective sgv_pool is
   at serving those allocations from the cache, i.e. without memory
   allocation requests to the kernel. "Size" is the command's data size
   rounded up to a power of 2, "Hit" is the number of allocations
   served from the cache, and "Total" is the total number of
   allocations.

 - "threads" file, which allows reading and setting the number of SCST's
   threads

 - "version" file, which shows the version of SCST

 - "trace_level" file, which allows reading and setting the trace
   (logging) level for SCST. See the "help" file for the list of trace
   levels. If you want to enable logging options that produce a lot of
   events, like "debug", then to avoid losing logged events you should
   also:

     * Increase the CONFIG_LOG_BUF_SHIFT variable in your kernel's
       .config to a much bigger value, then recompile the kernel. For
       example, I use 25, but to use it I needed to modify the maximum
       allowed value for CONFIG_LOG_BUF_SHIFT in the corresponding
       Kconfig.

     * Change /etc/syslog.conf, or the config file of your favorite
       logging program, to store kernel logs asynchronously. For
       example, I added the line "kern.info -/var/log/kernel" to my
       rsyslog.conf and added "kern.none" to the line for
       /var/log/messages, so I had:
       "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"

Each dev handler has its own subdirectory. Most dev handlers have only
two files in this subdirectory: "trace_level" and "type". The first one
is similar to the main SCST "trace_level" file; the latter shows the
SCSI type number of this handler as well as a text description.

For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
will assign device handler "dev_disk" to the real device sitting on
host 1, channel 0, ID 1, LUN 0.

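The assign command can be wrapped in a small helper that rejects a
malformed address before anything is written. This is a sketch, not an
SCST tool: the function name scst_assign and the SCST_PROC override
(which defaults to /proc/scsi_tgt) are hypothetical.

```sh
# Sketch: validate H:C:I:L and send "assign" to the scsi_tgt file.
# SCST_PROC is a hypothetical override used only for testing.
scst_assign() {
    hcil="$1"
    handler="$2"
    # Require exactly four colon-separated numeric fields.
    if ! printf '%s\n' "$hcil" | grep -Eq '^[0-9]+:[0-9]+:[0-9]+:[0-9]+$'; then
        echo "expected H:C:I:L, got: $hcil" >&2
        return 1
    fi
    [ -n "$handler" ] || { echo "missing handler name" >&2; return 1; }
    echo "assign $hcil $handler" > "${SCST_PROC:-/proc/scsi_tgt}/scsi_tgt"
}
```

With this helper, "scst_assign 1:0:1:0 dev_disk" is equivalent to the
echo command shown above.
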
Access and devices visibility management (LUN masking)
------------------------------------------------------

Access and devices visibility management allows an initiator or a
group of initiators to have different views of LUs/LUNs (security
groups), each with appropriate access permissions. It is highly
recommended to use the scstadmin utility for that purpose instead of the
low level interface described in this section.

An initiator is represented as an SCST session. The session is bound to
a security group at registration time by the character "name" parameter
of the registration function, which is provided by the target driver
based on its internal authentication. For example, for FC the "name"
could be a WWN or just a loop ID. For iSCSI it could be the iSCSI login
credentials or the iSCSI initiator name. Each security group has a set
of names assigned to it by the system administrator. A session is bound
to the security group that contains the provided name. If no such group
is found, the session is bound to either the "Default_target_name" or
the "Default" group, depending on whether "Default_target_name" exists.
In "Default_target_name", "target_name" means the name of the target.

In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
subdirectory. It contains the files "devices" and "names". The file
"devices" lists all devices and their LUNs in the group; the file
"names" lists all names that should be bound to this group.

To configure access and devices visibility management, SCST provides the
following commands, which are written to files under /proc/scsi_tgt:

 - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"

 - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"

 - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices
   adds the device with host:channel:id:lun as LUN "lun" in group
   "GROUP". Optionally, the device can be marked as read only.

 - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes the
   device with host:channel:id:lun from group "GROUP"

 - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices
   adds the device with virtual name "V_NAME" as LUN "lun" in group
   "GROUP". Optionally, the device can be marked as read only.

 - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes the
   device with virtual name "V_NAME" from group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of
   devices in group "GROUP"

 - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to
   group "GROUP"

 - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME"
   from group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of names
   in group "GROUP"

There must be a LUN 0 in each security group, i.e. LU numbering must not
start from, e.g., 1.

Examples:

 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
   add the real device sitting on host 1, channel 0, ID 1, LUN 0 to the
   "Default" group as LUN 0.

 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
   add the virtual VDISK device with name "disk1" to the "Default" group
   as LUN 1.

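The commands above can be combined into one step that creates a group,
binds an initiator name to it and exports a device as the mandatory
LUN 0. The sketch below is only an illustration: the helper names
scst_write and scst_export_to_group, the SCST_PROC override and the
initiator name in the usage note are all hypothetical.

```sh
# Sketch: write one command to a file under the SCST proc directory.
# $1 = path relative to that directory, $2 = command text.
# SCST_PROC is a hypothetical override used only for testing.
scst_write() {
    echo "$2" > "${SCST_PROC:-/proc/scsi_tgt}/$1"
}

# Sketch: create GROUP, bind initiator NAME to it, export DEV as LUN 0.
# DEV is either H:C:I:L or a virtual device name.
scst_export_to_group() {
    group="$1"
    name="$2"
    dev="$3"
    scst_write scsi_tgt "add_group $group" &&
    scst_write "groups/$group/names" "add $name" &&
    scst_write "groups/$group/devices" "add $dev 0"
}
```

For instance, "scst_export_to_group G1 iqn.2008-01.com.example:host1
disk1" makes the virtual device "disk1" visible as LUN 0 to sessions
registered under that name.
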
After loading, the VDISK device handler creates the subdirectories
"vdisk" and "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:

 - "trace_level" and "type" files as described for other dev handlers

 - "help" file, which provides online help for VDISK commands

 - "vdisk"/"vcdrom" files, which on read provide information about the
   currently open device files. On write they support the following
   commands:

     * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
       device "NAME" with a block size of "BLOCK_SIZE" bytes and flags
       "FLAGS". "PATH" may be empty only for VDISK CDROM. "BLOCK_SIZE"
       and "FLAGS" are valid only for disk VDISK. The block size must be
       a power of 2 and >= 512 bytes. Default is 512. Possible flags:

         - WRITE_THROUGH - write back caching disabled. Note that this
           option makes sense only if you also *manually* disable the
           write-back cache in *all* your backstorage devices and make
           sure it's actually disabled, since many devices are known to
           lie about this mode to get better benchmark results.

         - READ_ONLY - read only

         - O_DIRECT - both read and write caching disabled. This mode
           isn't currently fully implemented; you should use the user
           space fileio_tgt program in O_DIRECT mode instead (see
           below).

         - NULLIO - in this mode no real IO will be done, but success
           will be returned. Intended to be used for performance
           measurements in the same way as the "*_perf" handlers.

         - NV_CACHE - enables "non-volatile cache" mode. In this mode it
           is assumed that the target has a GOOD UPS, able to cleanly
           shut down the target in case of power failure, and that it
           is free of software/hardware bugs, i.e. all data from the
           target's cache are guaranteed to reach the media sooner or
           later. Hence all operations that synchronize data with the
           media, like SYNCHRONIZE_CACHE, are ignored in order to gain
           performance. Also, in this mode the target reports to
           initiators that the corresponding device has a write-through
           cache, to disable all write-back cache workarounds used by
           initiators. Use with extreme caution, since in this mode
           journaled file systems don't guarantee consistency after
           journal recovery following a crash of the target, so a
           manual fsck MUST be run. Note that since the journal barrier
           protection (see the "IMPORTANT" note below) is usually
           turned off, enabling NV_CACHE might change nothing from the
           data protection point of view, since no operations
           synchronizing data with the media will come from the
           initiator. This option overrides WRITE_THROUGH.

         - BLOCKIO - enables block mode, which performs direct block IO
           with a block device, bypassing the page cache for all
           operations. This mode works ideally with high-end storage
           HBAs and for applications that either do not need caching
           between application and disk or need large block throughput.
           See also below.

         - REMOVABLE - with this flag set the device is reported to
           remote initiators as removable.

     * "close NAME" - closes device "NAME".

     * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.

By default, if neither the BLOCKIO nor the NULLIO option is supplied,
FILEIO mode is used.

For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
will open the file /vdisks/disk1 as a virtual FILEIO disk with the name
"disk1".

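Since the block size must be a power of 2 and at least 512 bytes, it
can be worth checking it before sending the open command. The following
is a sketch under assumptions: the names is_valid_block_size and
vdisk_open and the SCST_PROC override are hypothetical, and flags are
left out for brevity.

```sh
# Sketch: succeed only if $1 is a power of 2 and >= 512.
is_valid_block_size() {
    bs="$1"
    # Reject empty or non-numeric input.
    case "$bs" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$bs" -ge 512 ] || return 1
    # Halve while even; a power of 2 ends up at exactly 1.
    while [ $((bs % 2)) -eq 0 ]; do
        bs=$((bs / 2))
    done
    [ "$bs" -eq 1 ]
}

# Sketch: open PATH as VDISK device NAME with the given block size.
# SCST_PROC is a hypothetical override used only for testing.
vdisk_open() {
    name="$1"
    path="$2"
    bsize="${3:-512}"
    if ! is_valid_block_size "$bsize"; then
        echo "invalid block size: $bsize" >&2
        return 1
    fi
    echo "open $name $path $bsize" > "${SCST_PROC:-/proc/scsi_tgt}/vdisk/vdisk"
}
```

For example, "vdisk_open disk1 /vdisks/disk1 4096" sends
"open disk1 /vdisks/disk1 4096", while a block size of 1000 is refused
before anything is written.
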
CAUTION: If you partitioned/formatted your device with block size X,
======== *NEVER* ever try to export and then mount it (even
         accidentally) with another block size. Otherwise you can
         *instantly* damage it pretty badly, as well as all your data on
         it. Messages on the initiator like "attempt to access beyond
         end of device" are a sign of such damage.

         Moreover, if you want to compare how well different block sizes
         work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
         **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
         other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
         AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block
         sizes isn't like switching between FILEIO and BLOCKIO: after
         changing the block size, all data previously written with
         another block size MUST BE ERASED. Otherwise you will get a
         full set of very weird behaviors, because block addressing
         will have changed, but initiators in most cases will have no
         way to detect that old addresses written on the device in,
         e.g., the partition table no longer refer to what they were
         intended to refer to.

IMPORTANT: By default, for performance reasons, VDISK FILEIO devices use
=========  a write back caching policy. This is generally safe for the
           consistency of journaled file systems lying over them, but
           your unsaved cached data will be lost in case of a
           power/hardware/software failure, so you must supply your
           target server with some kind of UPS or disable write back
           caching using the WRITE_THROUGH flag. You should also note
           that file system journaling over devices with write back
           caching enabled works reliably *ONLY* if the order of
           journal writes is guaranteed or if it uses some kind of data
           protection barriers (i.e. after writing journal data, some
           kind of synchronization with the media is performed).
           Otherwise, because of possible reordering in the cache, even
           after a successful journal rollback you very much risk
           losing your data on the FS. Currently, the Linux IO
           subsystem guarantees the order of write operations only when
           data protection barriers are used. Some info about this from
           the XFS point of view can be found at
           http://oss.sgi.com/projects/xfs/faq.html#wcache. On Linux
           initiators, barrier protection for the EXT3 and ReiserFS
           file systems can be turned on using the "barrier=1" and
           "barrier=flush" mount options, respectively. Note that it is
           usually turned off by default, and the status of barrier
           usage isn't reported anywhere in the system logs, nor is
           there a way to find it out on a mounted file system (at
           least no known one). Windows and, AFAIK, other UNIXes don't
           need any special explicit options and perform the necessary
           barrier actions on write-back caching devices by default.
           Also note that on some real-life workloads write through
           caching might perform better than write back caching with
           barrier protection turned on.

           You should also realize that Linux doesn't guarantee that
           after sync()/fsync() all written data have really hit
           permanent storage; they can still be in the cache of your
           backstorage device and be lost on a power failure. Thus,
           even with write-through cache mode, you still need a good
           UPS to protect yourself from data loss (note: data loss, not
           file system integrity corruption).

IMPORTANT: Some disk and partition table management utilities don't
=========  support block sizes >512 bytes, so make sure that your
           favorite one supports them. Currently only cfdisk is known
           to work only with 512-byte blocks; other utilities like
           fdisk on Linux or the standard disk manager on Windows have
           been shown to work well with non-512-byte blocks. Note that
           if you export a disk file or device with a block size
           different from the one with which it was already
           partitioned, you could get various weird effects, such as
           utilities hanging or other unexpected behavior. Hence, to be
           sure, zero the exported file or device before the first
           access to it from a remote initiator with another block
           size. On a Windows initiator, make sure you "Set Signature"
           in the disk manager on the drive imported from the target
           before doing any other partitioning on it. After you have
           successfully mounted a file system over a device with a
           non-512-byte block size, the block size stops mattering: any
           program will work with files on such a file system.

This module works best for these types of scenarios:

1) Data that are not aligned to 4K sector boundaries and <4K block sizes
   are used, which is normally found in virtualization environments
   where operating systems start partitions on odd sectors (Windows and
   it's

2) Large block data transfers normally found in database loads/dumps and

3) Advanced relational database systems that perform their own caching
   and prefer or demand direct IO access and, because of the nature of
   their data access, can actually see worse performance with
   indiscriminate caching.

4) Multiple layers of targets where the secondary and above layers need
   to have a consistent view of the primary targets in order to preserve
   data integrity, which a page cache backed IO type might not provide

It also has the advantage over FILEIO that it doesn't copy data between
the system cache and the commands' data buffers, so it saves a
considerable amount of CPU power and memory bandwidth.

IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent
=========  with each other, if you try to use a device in both of those
           modes simultaneously, you will almost instantly corrupt your
           data on that device.

In the pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc.) SCSI commands coming from remote initiators
are passed to the local SCSI hardware on the target as is, without any
modifications. Like any other hardware, the local SCSI hardware cannot
handle commands whose amount of data and/or number of segments in the
scatter-gather array exceeds certain values. Therefore, when using the
pass-through mode, note that the maximum number of segments and the
maximum amount of transferred data for each SCSI command on devices on
the initiators must not be bigger than the corresponding values of the
corresponding SCSI devices on the target. Otherwise you will see
symptoms such as small transfers working well but large ones stalling,
and messages like "Unable to complete command due to SG IO count
limitation" printed in the kernel logs.

You can't control the limit on scatter-gather segments from user space,
but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the
same or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb
for the corresponding devices on the target.

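On an initiator, this adjustment can be sketched as follows. The
function name clamp_max_sectors_kb and the SYSFS_ROOT override are
hypothetical; the second argument is the max_hw_sectors_kb value that
you read beforehand on the target for the corresponding device.

```sh
# Sketch: lower max_sectors_kb of an initiator block device so that it
# does not exceed the target's max_hw_sectors_kb ($2, read on the target).
# SYSFS_ROOT is a hypothetical override for testing; normally /sys.
clamp_max_sectors_kb() {
    dev="$1"
    target_hw_kb="$2"
    q="${SYSFS_ROOT:-/sys}/block/$dev/queue"
    cur=$(cat "$q/max_sectors_kb") || return 1
    # Only ever lower the limit, never raise it.
    if [ "$cur" -gt "$target_hw_kb" ]; then
        echo "$target_hw_kb" > "$q/max_sectors_kb" || return 1
    fi
    # Report the value now in effect.
    cat "$q/max_sectors_kb"
}
```

For example, if the target reports 128 for its device, running
"clamp_max_sectors_kb sdb 128" as root on the initiator leaves sdb with
a limit of at most 128 (the device name is illustrative).
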
For non-block devices SCSI commands are usually generated directly by
applications, so if you experience stalls on large transfers, check
the documentation for your application on how to limit the transfer
size.

Another way to solve this issue is to build SG entries with more than 1
page each. See the following patch as an example:
http://scst.sf.net/sgv_big_order_alloc.diff

User space mode using scst_user dev handler
-------------------------------------------

The user space program fileio_tgt uses the interface of the scst_user
dev handler and lets you see how it works in various modes. Fileio_tgt
provides mostly the same functionality as the scst_vdisk handler, with
the most noticeable difference being that it supports O_DIRECT mode.
O_DIRECT mode is basically the same as BLOCKIO, but it also supports
files, so for some loads it could be significantly faster than regular
FILEIO access. Everything said above about BLOCKIO applies to O_DIRECT
as well. See fileio_tgt's README file for more details.

652 Before doing any performance measurements note that:
654 I. Performance results depend heavily on your type of load,
655 so it is crucial that you choose the access mode (FILEIO, BLOCKIO,
656 O_DIRECT, pass-through) that suits your needs best.
658 II. In order to get the maximum performance you should:
662 - Disable in Makefile CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS,
663 CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY
665 - For pass-through devices enable
666 CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ.
668 2. For target drivers:
670 - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
673 3. For device handlers, including VDISK:
675 - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.
677 - If your initiator(s) use dedicated virtual SCSI devices exported from
678 the target and have as much memory as the target or more, it is
679 recommended to use the O_DIRECT option (currently it is
680 available only with the fileio_tgt user space program) or BLOCKIO. With
681 them you could have up to 100% increase in throughput.
683 IMPORTANT: Some of the compilation options are enabled by default, i.e. SCST
684 ========= is currently optimized for development and bug hunting rather
685 than for performance.
687 If you use an SCST version taken directly from the SVN repository, you can
688 set the above options, except CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ,
689 using the debug2perf Makefile target.
691 4. For other target and initiator software parts:
693 - Don't enable debug/hacking features in the kernel, i.e. use them as
696 - The default kernel read-ahead and queuing settings are optimized
697 for locally attached disks, therefore they are not optimal if the disks
698 are attached remotely (the SCSI target case), which sometimes could lead to
699 unexpectedly low throughput. You should increase the read-ahead size to at
700 least 512KB or even more on all initiators and the target.
702 You should also limit the maximum amount of sectors per SCSI command
703 on all initiators. To do it on Linux initiators, run:
705 echo "64" > /sys/block/sdX/queue/max_sectors_kb
707 where X is the letter of your device imported from the target,
710 To increase read-ahead size on Linux, run:
712 blockdev --setra N /dev/sdX
714 where N is a read-ahead number in 512-byte sectors and X is a device
717 Note: you need to set the read-ahead setting for device sdX again after
718 you have changed the maximum amount of sectors per SCSI command for that
721 - You may need to increase the number of requests that the OS on the
722 initiator sends to the target device. To do it on Linux initiators, run:
724 echo "64" > /sys/block/sdX/queue/nr_requests
726 where X is a device letter like above.
728 You may also experiment with other parameters in the /sys/block/sdX
729 directory; they also affect performance. If you find the best values,
730 please share them with us.
732 - On the target use the CFQ IO scheduler. In most cases it has a performance
733 advantage over other IO schedulers, sometimes a huge one (2+ times
734 aggregate throughput increase).
736 - It is recommended to turn the kernel preemption off, i.e. set
737 the kernel preemption model to "No Forced Preemption (Server)".
739 - XFS appears to be the best filesystem on the target to store device
740 files, because it allows considerably better linear write throughput,
743 5. For hardware on target.
745 - Make sure that your target hardware (e.g. target FC or network card)
746 and underlying IO hardware (e.g. an IO card, like SATA, SCSI or RAID, to
747 which your disks are connected) don't share the same PCI bus. You can
748 check it using the lspci utility. They have to work in parallel, so it
749 will be better if they don't compete for the bus. The problem is not
750 only in the bandwidth, which they have to share, but also in the
751 interaction between cards during that competition. This is very
752 important, because in some cases if target and backend storage
753 controllers share the same PCI bus, it could lead to performance 5-10
754 times lower than expected. Moreover, some motherboards (by
755 Supermicro, particularly) have serious stability issues if there are
756 several high speed devices on the same bus working in parallel. If
757 you have no choice but PCI bus sharing, set in the BIOS PCI latency
760 6. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option will
761 provide you the best performance. But when using it, make sure you use a good
762 UPS with the ability to shut down the target on power failure.
764 IMPORTANT: If you use some versions of Windows (at least W2K) on the
765 ========= initiator, you can't get good write performance for VDISK FILEIO
766 devices with the default 512-byte block size. You could get about 10%
767 of the expected one. This is because of the partition alignment, which
768 is (simplifying) incompatible with how the Linux page cache
769 works, so for each write the corresponding block must be read
770 first. Use a 4096-byte block size for VDISK devices and you
771 will have the expected write performance. Actually, any OS on
772 initiators, not only Windows, will benefit from block size
773 max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
774 is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block size
775 of the underlying FS on which the device file is located, or 0
776 if a device node is used. Both values are from the target.
777 See also the important notes above about setting block sizes >512 bytes
778 for VDISK FILEIO devices.
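The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule can be evaluated on the
target with a short sketch like the one below. It assumes GNU coreutils stat
(the -f and -c %S options) and getconf are available. The helper names max_of
and recommended_block_size are hypothetical and not part of SCST.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of SCST.

# Print the larger of two non-negative integers.
max_of() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# recommended_block_size BACKING_PATH
# Prints max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) for the backing file,
# taking the FS block size as 0 when BACKING_PATH is a device node.
recommended_block_size() {
    page_size=$(getconf PAGESIZE) || return 1
    if [ -b "$1" ] || [ -c "$1" ]; then
        fs_block=0
    else
        # Assumption: GNU stat, where -f queries the file system and %S
        # prints its fundamental block size.
        fs_block=$(stat -f -c %S "$1") || return 1
    fi
    max_of "$page_size" "$fs_block"
}
```

Run it on the target against the file or device node that backs the virtual
device. The printed value is only a candidate for the VDISK block size; the
notes about block sizes larger than 512 bytes still apply.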
780 What if target's backstorage is too slow
781 ----------------------------------------
783 If under high load you experience I/O stalls or see abort or reset
784 messages in the kernel log on the target, then your backstorage is too slow
785 compared with your target link speed and the number of simultaneously
786 queued commands. On some seek-intensive workloads even fast disks or
787 RAIDs, which are able to serve a continuous data stream at 500+ MB/s,
788 can be as slow as 0.3 MB/s. Another possible cause for that can be
789 MD/LVM/RAID on your target as in http://lkml.org/lkml/2008/2/27/96
790 (check the whole thread as well).
792 Thus, in such situations processing of one or more commands simply takes
793 too long, hence the initiator decides that they are stuck on the target
794 and tries to recover. In particular, it is known that the default number
795 of simultaneously queued commands (48) is sometimes too high if you do
796 intensive writes from VMware on a target disk that uses LVM in
797 snapshot mode. In this case a value like 16 or even 8-10, depending on your
798 backstorage speed, could be more appropriate.
800 Unfortunately, SCST currently lacks dynamic I/O flow control, in which the
801 queue depth on the target is dynamically decreased/increased based on
802 how slow/fast the backstorage is compared to the target link. So,
803 there are only 5 possible actions that you can take to work around or fix
806 1. Ignore incoming task management (TM) commands. It's fine if there are
807 not too many of them, so average performance isn't hurt and the
808 corresponding device isn't put offline, i.e. if the backstorage isn't
811 2. Decrease /sys/block/sdX/device/queue_depth on the initiator, if
812 it runs Linux (see below how), and/or the SCST_MAX_TGT_DEV_COMMANDS constant
813 in the scst_priv.h file until you stop seeing incoming TM commands.
814 The iSCSI-SCST driver also has its own iSCSI-specific parameter for that.
816 3. Try to avoid such seek intensive workloads.
818 4. Increase speed of the target's backstorage.
820 5. Implement dynamic I/O flow control in SCST. See the "Dynamic I/O flow
821 control" section on the http://scst.sourceforge.net/contributing.html page
822 for a possible idea of how to do it.
824 To decrease the device queue depth on Linux initiators, run the command:
826 # echo Y >/sys/block/sdX/device/queue_depth
828 where Y is the new number of simultaneously queued commands and X is your
829 imported device letter, like 'a' for the sda device. There are no special
830 limitations on the Y value; it can be any value from 1 to the possible maximum
831 (usually 32), so start by dividing the current value by 2, i.e. set
832 16 if /sys/block/sdX/device/queue_depth contains 32.
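The halving step can be scripted roughly as shown below. This is a sketch
only: the helper names half_depth and halve_queue_depth and the SYSFS_ROOT
override are hypothetical and not part of SCST, and the script must run as
root on the initiator to write the real sysfs attribute.

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_ROOT are hypothetical, not part of SCST.
# SYSFS_ROOT can be pointed at a scratch directory to try the logic safely.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print half of a queue depth (integer division), never going below 1.
half_depth() {
    half=$(( $1 / 2 ))
    if [ "$half" -lt 1 ]; then half=1; fi
    echo "$half"
}

# halve_queue_depth DEVICE
# Halves /sys/block/DEVICE/device/queue_depth and prints the new value.
halve_queue_depth() {
    knob="$SYSFS_ROOT/block/$1/device/queue_depth"
    current=$(cat "$knob") || return 1
    new=$(half_depth "$current")
    echo "$new" > "$knob" || return 1
    echo "$new"
}
```

Calling "halve_queue_depth sdb" repeatedly, with a test run in between, is one
way to search for a depth at which incoming TM commands stop appearing.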
834 Note that logged messages about QUEUE_FULL status are quite different
835 by nature. This is normal operation, just SCSI flow control in action.
836 Simply don't enable the "mgmt_minor" logging level, or, alternatively, if
837 you are confident in the worst case performance of your back-end
838 storage, you can increase SCST_MAX_TGT_DEV_COMMANDS in scst_priv.h to
839 64. Usually initiators don't try to push more commands on the target.
846 * Mark Buechler <mark.buechler@gmail.com> for a lot of useful
847 suggestions, bug reports and help in debugging.
849 * Ming Zhang <mingz@ele.uri.edu> for fixes and comments.
851 * Nathaniel Clark <nate@misrule.us> for fixes and comments.
853 * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
856 * Hu Gang <hugang@soulinfo.com> for the original version of the
859 * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
860 of the LSI target driver.
862 * Ross S. W. Walker <rswwalker@hotmail.com> for the original block IO
863 code and Vu Pham <huongvp@yahoo.com> who updated it for the VDISK dev
866 * Michael G. Byrnes <michael.byrnes@hp.com> for fixes.
868 * Alessandro Premoli <a.premoli@andxor.it> for fixes.
870 * Nathan Bullock <nbullock@yottayotta.com> for fixes.
872 * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.
874 * Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports.
876 * Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with
879 * Bart Van Assche <bart.vanassche@gmail.com> for a lot of help
881 Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net