4 Reproduce (from qp_test, with opensm on sw269):
5 sw269: objchk_wnet_AMD64\amd64\qp_test.exe --daemon -p=1 --tcp_port=19017 --seed=1353948926
6 sw270: objchk_wnet_AMD64\amd64\qp_test.exe --ip=10.4.12.69 -p=1 --ts=RC --tcp_port=19017 --seed=1353948926 --mtu=2048 --qp=5 -i=5 --poll=sebit CLIENT RR 4 7642
8 - srq_test: i've excluded XRC part in poll_one, which is starnge. It seems like unbalanced change with kernel.
9 For some reson QP is qualified like XRC, which cause poll error.
11 - qp_test --ts=UD fails with pkt_size > 1024. Has to work with <=2048
12 qp_test.exe --ip=10.4.12.70 --ts=UD --mtu=256 --qp=1 --thread=1 --poll=block --oust=1 -i=2 [--grh] CLIENT SRI 1 1100
14 - qp_test atomic operations fails
15 qp_test.exe --ip=10.4.12.70 --ts=RC --mtu=256 --qp=1 --thread=1 --poll=block --oust=1 -i=10 CLIENT CS 1 8
16 mlx4: local QP operation err (QPN 26004a, WQE index 0, vendor syndrome 70, opcode = 5e)
17 \data_operation.c , 55 DATA_ERR : thread 0: wr_id: 0x064x
18 \data_operation.c , 56 DATA_ERR : thread 0: status: IB_WCS_LOCAL_OP_ERR
20 - qp_test with several threads fails (not the very simple run)
21 qp_test.exe --ip=10.4.12.70 --ts=RC --mtu=1024 --qp=5 --thread=5 --poll=block --oust=12 -i=10000 CLIENT SRI 4 32048 SERVER RW 8
23 \data_operation.c , 55 DATA_ERR : thread 3: wr_id: 0x7014064x
24 \data_operation.c , 56 DATA_ERR : thread 3: status: IB_WCS_WR_FLUSHED_ERR
25 \data_operation.c , 275 MISC_ERR : thread 3: failed to read completions: IB_ERROR
26 \qp_test.c , 70 HCA_ERR : Got QP async event. Type: IB_AE_SQ_ERROR(0x1), vendor_specific: 0x064x
28 - WSD freezes on long sends (packets lost ???). Do several times. (Tried with free version.)
29 SdpConnect.exe client 11.4.12.70 2222 0 1 0 2 2 1 40000 2000
30 SdpConnect.exe pingpong 11.4.12.70 2223 2000 4
32 - DDR: opensm can't do the work: active side is left in ARMED state, passive - in INITIALIZED.
33 (maybe, out of debug messages on the passive side)
37 mlx4_bus!mlx4_ib_db_free+0xad
38 mlx4_bus!mlx4_ib_destroy_cq+0x1c2
39 mlx4_hca!ib_destroy_cq+0x46
40 mlx4_hca!mlnx_destroy_cq+0x1d
42 ibbus!async_destroy_cb+0xb3
43 ibbus!__cl_async_proc_worker+0x61
44 ibbus!__cl_thread_pool_routine+0x41
45 ibbus!__thread_callback+0x28
47 - Driver unload on the passive side, while opensm is running, gets stuck !
48 Seems like because of the not released MADs:
49 [AL]print_al_obj() !ERROR!: AL object fffffade6c23b900(AL_OBJ_TYPE_H_MAD_SVC), parent: fffffade6cb4bd10 ref_cnt: 8
50 Maybe, it is an old problem !! Check with MTHCA.
52 - Crash on MAD send completion callback at driver download.
53 Check it by download driver while mad sending.
55 mlxburn -d mt25408_pci_cr0 -fw c:\tmp\mlx4\FW\966\fw-25408-rel.mlx -conf c:\tmp\mlx4\fw\MHGH28-XTC_A1.ini
59 - inverted LID in QP1 trafic of IBAL;
60 - ib_send_bw.exe -a -c UC
62 - ib_read_bw gets stuck
63 mlx4: local QP operation err (QPN 0a004a, WQE index 0, vendor syndrome 70, opcode = 5e)
64 Completion wth error at client:
65 Failed status 2: wr_id 3 syndrom 0x70
66 - crash on query_qp after create qp failed.
67 Seams like an IBAL problem - it calls mlnx_query_qp with h_qp = 0
70 qp_test.exe --ip=10.4.12.70 [--ts=UC] --mtu=256 --qp=1 --thread=1 --poll=block --oust=1 -i=2 [--grh] CLIENT SRI 1 256
71 create_qp for some reason failes with invalid_max_sge ??!
75 1. ALIGN_PAGE in Linux means aliign to the next page, in Windows - to the previous;
76 2. IBAL uses modify_av to change SA_SVC AV slid. When not used, it sends SA requests with LID = 1 in host order, as it put it to itself by default.
77 3. There is no way to convey CQ completion event to user space. So mlx4_cq_event, incrementing arm_sn, can't be called.
78 One needs to add an arrangement to make kernel code increment this variable.
83 +0. add support for livefish.
84 1. review eq.c - all interrupt handling stuff;
86 3. look through all the // TODO
87 4. Check all the MACROS: whether there are more ones, that have different meaning in Windows and Linux;
91 0. Add propagation of vendor_specific error code to IBAL callbacks like in MTHCA
93 2. Work with DMA addresses and not with physical ones (a very serious change!);
99 1.1 Command go-bit timeout is set to CMD_WAIT_USECS. Not clear the influence.
100 1.2 Radix tree is implemented via COMPLIB calls, that seems like none used. It can be slow and even buggy.
101 1.3 Different things can be done in built_in assempbler (e.g., bit operations like fls()).
103 2. WPP support disabled. Problems:
104 - at compilation phase;
105 - Linux prints in MLX4_BUS are not supported by WPP (mlx4_err, mlx4_dbg, dev_err, dev_info ...);
106 - check WPP guids - both in MLX4_BUS and MLX4_HCA - are redefined and not copyed from the previous code.
108 3. (General) Several places are marked with '// TODO:'. They have to be reviewed from time to time, especially - upon problems.
112 1. PAGE_ALIGN has different meaning in Linux and Windows;
113 2. There is no way to convey CQ completion event to user space. So mlx4_cq_event, incrementing arm_sn, can't be called.
114 One needs to add an arrangement to make kernel code increment this variable.
115 3. IBAL uses modify_av to change SA_SVC AV slid, set by default to LITTLE_ENDIAN 1. The problem is seen as inverted LID on the sniffer.
116 The solution - implement modify_av.