@@ -0,0 +1,88 @@
+BIOS
+====
+ The important options in the BIOS:
+ - IOMMU (Intel VT-d) - enables hardware translation between physical and bus addresses
+ - No Snoop - disables hardware cache coherency between DMA and CPU
+ - Max Payload (MMIO Size) - the maximal (useful) payload for the PCIe protocol
+ - Above 4G Decoding - this seems to allow bus addresses wider than 32 bits
+ - Memory performance - frequency, channel interleaving, and the hardware prefetcher all affect memory performance
+
+
+IOMMU
+=====
+ - As many PCI devices can only address 32-bit memory, some address translation
+   mechanism is required for DMA operation (it also helps with security, limiting
+   PCI devices to the allowed address range only). There are several methods to
+   achieve this; a driver-side sketch follows this list.
+  * Linux provides so-called bounce buffers (SWIOTLB). This is just a small memory
+    buffer in the lower 4 GB of memory. The DMA is actually performed into this
+    buffer and the data is then copied to the appropriate location. One problem
+    with SWIOTLB is that it does not guarantee 4K-aligned addresses when mapping
+    memory pages (in order to use the space optimally). This is properly supported
+    by neither NWLDMA nor IPEDMA.
+  * Alternatively, a hardware IOMMU can be used, which provides hardware address
+    translation between physical and bus addresses. To use it, we need to enable
+    the technology both in the BIOS and in the kernel.
+   + The Intel VT-d or AMD-Vi (AMD IOMMU) virtualization technology has to be enabled
+   + Intel IOMMU is enabled with the "intel_iommu=on" kernel parameter (the alternative is to build the kernel with CONFIG_INTEL_IOMMU_DEFAULT_ON)
+   + Checking: dmesg | grep -e IOMMU -e DMAR -e PCI-DMA
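+
+ Which mechanism is actually used depends on the DMA mask the driver declares. A
+ minimal driver-side sketch, assuming a hypothetical PCI driver (the DMA API
+ calls themselves are the standard kernel ones):
+
+     #include <linux/dma-mapping.h>
+     #include <linux/pci.h>
+
+     /* Declare that the device can only generate 32-bit bus addresses. With an
+      * IOMMU enabled, the kernel remaps buffers in hardware so the device sees
+      * them below 4 GB; without one, buffers above 4 GB have to go through the
+      * SWIOTLB bounce buffers. */
+     static int example_set_dma_mask(struct pci_dev *pdev)
+     {
+             return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+     }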
+
+DMA Cache Coherency
+===================
+ The DMA API distinguishes two types of memory: coherent and non-coherent.
+ - For coherent memory, the hardware takes care of cache consistency. This is
+   often achieved by snooping (No Snoop should be disabled in the BIOS).
+   Alternatively, the same effect can be achieved by using non-cached memory.
+   There are architectures with 100% cache-coherent memory and others where only
+   part of the memory is kept cache coherent. For such architectures, the coherent
+   memory can be allocated with
+       dma_alloc_coherent(...) / dma_alloc_attrs(...)
+   (see the allocation sketch below).
+  * However, the coherent memory can be slow (especially on large SMP systems).
+    Also, the minimal allocation unit may be restricted to a page. Therefore, it
+    is useful to consolidate consistent mappings into groups.
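+
+ An allocation sketch (the helper names and the 64 KB size are just for
+ illustration; dma_alloc_coherent()/dma_free_coherent() are the standard kernel
+ API):
+
+     #include <linux/dma-mapping.h>
+     #include <linux/pci.h>
+
+     /* Allocate a 64 KB coherent buffer: *bus_addr receives the address the
+      * device should use for DMA, the return value is the CPU virtual address.
+      * No explicit cache synchronization is needed afterwards. */
+     static void *example_alloc_coherent(struct pci_dev *pdev, dma_addr_t *bus_addr)
+     {
+             return dma_alloc_coherent(&pdev->dev, 64 * 1024, bus_addr, GFP_KERNEL);
+     }
+
+     /* Release the buffer again, passing the same size and both addresses. */
+     static void example_free_coherent(struct pci_dev *pdev, void *vaddr, dma_addr_t bus_addr)
+     {
+             dma_free_coherent(&pdev->dev, 64 * 1024, vaddr, bus_addr);
+     }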
+
+ - On the other hand, it is possible to allocate streaming DMA memory, which is
+   synchronized using
+       pci_dma_sync_single_for_device / pci_dma_sync_single_for_cpu
+   (see the sketch below).
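+
+ A sketch of the streaming pattern, using the generic DMA API equivalents of the
+ pci_* wrappers above (the function is hypothetical; buf must be DMA-capable
+ memory, e.g. from kmalloc):
+
+     #include <linux/dma-mapping.h>
+
+     static int example_streaming_read(struct device *dev, void *buf, size_t size)
+     {
+             /* Map the buffer; ownership passes to the device. */
+             dma_addr_t bus_addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
+             if (dma_mapping_error(dev, bus_addr))
+                     return -ENOMEM;
+
+             /* ... program the device with bus_addr and wait for the DMA ... */
+
+             /* Take ownership back: caches are invalidated as needed so the CPU
+              * sees the data written by the device. */
+             dma_sync_single_for_cpu(dev, bus_addr, size, DMA_FROM_DEVICE);
+             /* ... process buf; before the next transfer, ownership would be
+              * handed back with dma_sync_single_for_device() ... */
+
+             dma_unmap_single(dev, bus_addr, size, DMA_FROM_DEVICE);
+             return 0;
+     }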
+
+ - It may happen that all memory is coherent anyway and we do not need to call
+   these two functions. Currently, this seems not to be required on x86_64, which
+   may indicate that snooping is performed for all available memory. On the other
+   hand, it may just be that nothing happened to be cached so far and we have been
+   lucky.
+
+
+PCIe Payload
+============
+ - A kind of MTU for the PCIe protocol. The higher the value, the lower the
+   slowdown caused by protocol headers while streaming large amounts of data. The
+   current values can be checked with 'lspci -vv'. For each device, there are two
+   values:
+  * MaxPayload under DevCap indicates the MaxPayload supported by the device
+  * MaxPayload under DevCtl indicates the MaxPayload negotiated between the device
+    and the chipset. The negotiated MaxPayload is the minimal value across all the
+    infrastructure between the device and the chipset. Normally, it is limited by
+    the MaxPayload supported by the PCIe root port of the chipset. Most systems
+    are currently restricted to 256 bytes.
+   (A small user-space reader for these two fields follows.)
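+
+ The same two fields can also be read directly from PCI config space. A sketch
+ using pciutils (libpci); it assumes the libpci development headers, root
+ privileges, and compilation with -lpci:
+
+     #include <stdio.h>
+     #include <stdint.h>
+     #include <pci/pci.h>
+
+     int main(void)
+     {
+             struct pci_access *pacc = pci_alloc();
+             pci_init(pacc);
+             pci_scan_bus(pacc);
+
+             for (struct pci_dev *dev = pacc->devices; dev; dev = dev->next) {
+                     /* Walk the capability list to find the PCI Express capability. */
+                     if (!(pci_read_word(dev, PCI_STATUS) & PCI_STATUS_CAP_LIST))
+                             continue;
+                     uint8_t pos = pci_read_byte(dev, PCI_CAPABILITY_LIST) & ~3;
+                     while (pos && pci_read_byte(dev, pos + PCI_CAP_LIST_ID) != PCI_CAP_ID_EXP)
+                             pos = pci_read_byte(dev, pos + PCI_CAP_LIST_NEXT) & ~3;
+                     if (!pos)
+                             continue;
+
+                     uint32_t devcap = pci_read_long(dev, pos + PCI_EXP_DEVCAP);
+                     uint16_t devctl = pci_read_word(dev, pos + PCI_EXP_DEVCTL);
+                     /* Both fields encode the payload size as (128 << n) bytes. */
+                     printf("%04x:%02x:%02x.%d  supported %d bytes, negotiated %d bytes\n",
+                            dev->domain, dev->bus, dev->dev, dev->func,
+                            128 << (devcap & PCI_EXP_DEVCAP_PAYLOAD),
+                            128 << ((devctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5));
+             }
+
+             pci_cleanup(pacc);
+             return 0;
+     }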
+
+
+Memory Performance
+==================
+ - Memory performance is quite critical as we currently triple the PCIe bandwidth:
+   the DMA writes to memory, we read that memory (it is not in the cache), and we
+   write it to the destination.
+ - The most important point is to enable Channel Interleaving (otherwise a
+   single-channel copy will be performed). On the other hand, Rank Interleaving
+   does not matter much.
+ - On some motherboards (ASRock X79, for instance), when the memory speed is set
+   manually, the interleaving is switched off while in AUTO mode. So it is safer
+   to switch interleaving on manually.
+ - Hardware prefetching helps a little bit and should be turned on.
+ - A faster memory frequency helps. As we are streaming, I guess this is more
+   important than even slightly higher CAS & RAS latencies, but I have not checked.
+ - Memory bank conflicts may sometimes significantly harm performance. A bank
+   conflict happens if we read and write from/to different rows of the same bank
+   (there could also be a conflict with the DMA operation). I don't have a good
+   idea how to prevent this at the moment.
+ - The most efficient memcpy implementation depends on the CPU generation. For the
+   latest models, AVX seems to be the most efficient. Filling all AVX registers
+   before writing increases performance. It also helps quite a lot if multiple
+   pages are copied in parallel (still, first we read from multiple pages and then
+   write to multiple pages; see ssebench and the sketch after this list).
+ - Usage of HugePages makes performance more stable. Using page-locked memory does
+   not help at all.
+ - This will still give about 10 - 15 GB/s at most. On multiprocessor systems only
+   about 5 GB/s, because of performance penalties due to snooping. Therefore,
+   copying with multiple threads is preferable.
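+
+ A user-space sketch of such a copy loop (assumptions: x86_64 with AVX, 32-byte
+ aligned buffers, a size that is a multiple of 128 bytes, hugepages reserved via
+ /proc/sys/vm/nr_hugepages; the function names are hypothetical, the intrinsics
+ and mmap flags are standard):
+
+     #include <stddef.h>
+     #include <sys/mman.h>
+     #include <immintrin.h>
+
+     /* Allocate a buffer backed by HugePages; hugepage-backed mappings are
+      * always well aligned for the AVX loads below. */
+     static void *alloc_huge(size_t size)
+     {
+             void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+             return (buf == MAP_FAILED) ? NULL : buf;
+     }
+
+     /* Copy n bytes with 256-bit AVX loads/stores, filling four registers
+      * before writing them out (compile with -mavx). */
+     static void avx_copy(char *dst, const char *src, size_t n)
+     {
+             for (size_t i = 0; i < n; i += 128) {
+                     __m256i r0 = _mm256_load_si256((const __m256i *)(src + i));
+                     __m256i r1 = _mm256_load_si256((const __m256i *)(src + i + 32));
+                     __m256i r2 = _mm256_load_si256((const __m256i *)(src + i + 64));
+                     __m256i r3 = _mm256_load_si256((const __m256i *)(src + i + 96));
+                     _mm256_store_si256((__m256i *)(dst + i), r0);
+                     _mm256_store_si256((__m256i *)(dst + i + 32), r1);
+                     _mm256_store_si256((__m256i *)(dst + i + 64), r2);
+                     _mm256_store_si256((__m256i *)(dst + i + 96), r3);
+             }
+     }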