Monday, November 30, 2009
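Some thoughts on optimization
Optimization happens at several levels: 1) design, 2) the C implementation, 3) the assembly implementation. Level 3 is mostly handled by the compiler. The programmer has little control there beyond choosing an optimization strategy, such as GCC's -O1, -O2 and -O3, or Visual Studio's choice between favoring speed and favoring size.
Before optimizing, be clear about the goal. Saving space and improving speed cannot always be achieved together, and sometimes they conflict. So decide up front whether space or speed matters more.
Second, identify which parts of the program are the least efficient and actually need optimizing. I think the 80/20 rule applies in most cases: 20% of the code accounts for 80% of the space or time. So concentrate on that 20% and do not fuss over the rest. Similarly, a programmer should know which operations waste time or space. Time-consuming operations include I/O, system calls, and unnecessary printf calls. An example of wasted space is a very large array of which only a few elements are ever used. To find the least efficient parts of the code, use a profiler.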
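Finally, one or two examples of optimizing at the design stage. Example 1: suppose your program has to process a file many times. There are two approaches. Approach 1: open the file every time it needs processing and close it when done. Approach 2: open the file only once and keep its contents in memory, operate on the in-memory data from then on, and write it back to the file at the end if needed. Approach 1 saves space but wastes time. Approach 2 uses some extra space but saves time. Which one to pick depends on how large the file is and how many times it is processed.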
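The second approach (read the file once, then work on the copy in memory) could be sketched roughly as follows. The function name `load_file` and the 4096-byte starting capacity are assumptions made for this illustration; error handling is kept minimal.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the whole file at `path` into a heap buffer.
 * Returns NULL on failure. On success the byte count is stored in *len
 * and the caller owns the buffer and must free() it. */
char *load_file(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t cap = 4096;   /* hypothetical starting capacity */
    size_t used = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    for (;;) {
        size_t n = fread(buf + used, 1, cap - used, fp);
        used += n;
        if (n == 0)
            break;                      /* end of file or read error */
        if (used == cap) {              /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);
    *len = used;
    return buf;
}
```

Every later pass then works on the returned buffer instead of reopening the file. The cost is that the whole file stays resident in memory, which is exactly the space/time trade-off described above.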
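Example 2: suppose you have a timer that does some work at a fixed interval. The interval is then an important parameter. The difference between firing every 1 second and every 10 seconds is obvious. As long as the design requirements are still met, the longer the interval, the less CPU time is used.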
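A back-of-the-envelope version of the timer example, assuming each wakeup costs a fixed amount of CPU time. The function names and the fixed per-wakeup cost are hypothetical, and real handlers rarely have a constant cost.

```c
#include <assert.h>

/* Number of times a timer fires during `period_s` seconds
 * when it fires once every `interval_s` seconds. */
unsigned long timer_wakeups(unsigned long period_s, unsigned long interval_s)
{
    assert(interval_s > 0);
    return period_s / interval_s;
}

/* Total CPU time (microseconds) spent in the handler over `period_s`
 * seconds, assuming a fixed cost of `cost_us` per wakeup. */
unsigned long timer_cpu_us(unsigned long period_s, unsigned long interval_s,
                           unsigned long cost_us)
{
    return timer_wakeups(period_s, interval_s) * cost_us;
}
```

Under this model, going from a 1-second to a 10-second interval cuts both the number of wakeups and the handler's CPU time by a factor of ten.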
In short: when designing, first be clear about the goal, then optimize within the limits that goal allows.
Wednesday, November 25, 2009
Wednesday, November 18, 2009
Monday, November 16, 2009
Notes on the Intel e1000 NIC transmit path
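I recently got curious about how the Intel 1G NIC transmits packets and read through some of the code. These are my notes.
References:
1. The Intel 82547 NIC developer's manual. Manuals for other Intel NICs can probably be downloaded online as well. http://linux.chinaunix.net/bbs/thread-1142051-1-2.html
2. The Linux e1000 NIC driver. http://lxr.linux.no/#linux+v2.6.30/drivers/net/e1000/e1000_main.c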
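Transmit path:
1. Linux calls the NIC driver's start_xmit() function. In e1000 the corresponding function is e1000_xmit_frame.
2. e1000_xmit_frame in turn calls e1000_tx_queue(adapter, tx_ring, tx_flags, count). The tx_queue here is the queue of transmit descriptors.
3. After checking a few parameters, e1000_tx_queue finally calls writel(i, hw->hw_addr + tx_ring->tdt). The tdt in tx_ring->tdt stands for TX descriptor tail. According to the NIC manual, once the descriptor tail is written, the NIC fetches the descriptors on its own and sends the packet out.
The main fields of a descriptor are the address pointer and the length. The former is the starting physical address of the packet to send, the latter is the packet's length. With these two, the hardware can read the packet via DMA and transmit it. Most other NICs use a descriptor structure too.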
TSO:
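The Intel e1000 is a fairly complex, feature-rich NIC. The older RTL8139, by contrast, is much simpler. The early RTL8139 had very few features: it just took the packets the OS handed it and put them on the wire. Its top speed was, as far as I remember, only 10 Mbit/s or 100 Mbit/s. As technology moved on, the Intel e1000 gained more features. An obvious one is TCP segmentation offload (TSO for short, an abbreviation that shows up often in the driver code).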
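First, what is TSO? Networking is layered: TCP sits in the middle, with the IP and Ethernet protocols below it (at different layers). TCP may want to send a large block of data, say 2 KB, but Ethernet may not be able to carry that in one frame. Say Ethernet only carries about 1.5 KB per frame. How do you send the 2 KB of TCP data? The simple way is to split it in two, roughly 1.5 KB first and 0.5 KB second (ignoring headers). This process is called TCP segmentation. And what does offloading mean? Its literal meaning is something like "unloading". Here it can be read as "moving down". Down to where? Since software (the OS) is usually said to run "on top of" the hardware, "down" means down into the hardware (the NIC). So TSO means doing TCP segmentation on the NIC. This work used to be done by the OS. Now that the NIC hardware can do it, the OS has less to do, and a hardware implementation is generally faster as well. This is one of the features that helps the Intel e1000 keep up with 1 Gbit/s.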
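The splitting itself is simple arithmetic. A minimal sketch, with hypothetical helper names `segment_count` and `segment_len`, of how a payload is cut into segments of at most `mss` bytes (the work that TSO moves from the OS to the NIC):

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed to carry `payload_len` bytes when each
 * segment holds at most `mss` bytes of payload (rounds up). */
size_t segment_count(size_t payload_len, size_t mss)
{
    assert(mss > 0);
    return (payload_len + mss - 1) / mss;
}

/* Payload length of segment number `index` (0-based).
 * Returns 0 if `index` is past the last segment. */
size_t segment_len(size_t payload_len, size_t mss, size_t index)
{
    assert(mss > 0);
    size_t start = index * mss;
    if (start >= payload_len)
        return 0;
    size_t rest = payload_len - start;
    return rest < mss ? rest : mss;
}
```

For example, assuming a 1500-byte MTU with 20-byte IP and 20-byte TCP headers (so an MSS of 1460 bytes), a 2048-byte payload becomes two segments of 1460 and 588 bytes. Each segment also needs its own headers and checksums, which the hardware generates when TSO is in use.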
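Another difference between the Intel e1000 and the RTL8139 is how the outgoing packet (skb) is handled. The 8139 driver first allocates a block of DMA-capable memory with pci_alloc_consistent (in 2.6.18; this had changed again by 2.6.29), then calls skb_copy_and_csum_dev to copy the data passed down by the OS into that DMA-capable memory. The copy takes time and hurts efficiency. The Intel e1000 takes a different approach: before e1000_tx_queue, it calls e1000_tx_map(). The main job of this function is to set up a DMA-capable address for the data already in the skb, so no memory copy is needed. Setting up a DMA address seems to be fairly fast (my guess), so efficiency should improve as well.
From the Intel manual: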
3.5.2 Transmission Process
The transmission process for regular (non-TCP Segmentation packets) involves:
• The protocol stack receives from an application a block of data that is to be transmitted.
• The protocol stack calculates the number of packets required to transmit this block based on the MTU size of the media and required packet headers.
• For each packet of the data block:
• Ethernet, IP and TCP/UDP headers are prepared by the stack.
• The stack interfaces with the software device driver and commands the driver to send the individual packet.
• The driver gets the frame and interfaces with the hardware.
• The hardware reads the packet from host memory (via DMA transfers).
• The driver returns ownership of the packet to the operating system when the hardware has completed the DMA transfer of the frame (indicated by an interrupt).
The transmission process for the Ethernet controller TCP segmentation offload implementation involves:
• The protocol stack receives from an application a block of data that is to be transmitted.
• The stack interfaces to the software device driver and passes the block down with the appropriate header information.
• The software device driver sets up the interface to the hardware (via descriptors) for the TCP Segmentation context.
• The hardware transfers the packet data and performs the Ethernet packet segmentation and transmission based on offset and payload length parameters in the TCP/IP context descriptor including:
— Packet encapsulation
— Header generation & field updates including IP and TCP/UDP checksum generation
• The driver returns ownership of the block of data to the operating system when the hardware has completed the DMA transfer of the entire data block (indicated by an interrupt).
TIPS:
1) The e1000 supports three kinds of descriptor. Several descriptors can make up one packet. On the RTL8139, one descriptor corresponds to one Ethernet packet.
2) A brief look at a few key data structures:
struct e1000_tx_ring {
        /* pointer to the descriptor ring memory */
        void *desc;
        /* physical address of the descriptor ring */
        dma_addr_t dma;
        /* length of descriptor ring in bytes */
        unsigned int size;
        /* number of descriptors in the ring */
        unsigned int count;
        /* next descriptor to associate a buffer with */
        unsigned int next_to_use;
        /* next descriptor to check for DD status bit */
        unsigned int next_to_clean;
        /* array of buffer information structs */
        struct e1000_buffer *buffer_info;

        spinlock_t tx_lock;
        uint16_t tdh;
        uint16_t tdt;
        boolean_t last_tx_tso;
};

/* wrapper around a pointer to a socket buffer,
 * so a DMA handle can be stored along with the buffer */
struct e1000_buffer {
        struct sk_buff *skb;
        dma_addr_t dma;
        unsigned long time_stamp;
        uint16_t length;
        uint16_t next_to_watch;
};
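3) The e1000 supports up to 64K TX descriptors, and it has RX descriptors as well. The RTL8139, by contrast, supports only 4 TX descriptors and has no RX descriptors. A typical e1000 packet is made up of several descriptors (4 on average).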
Another version:
http://linux.chinaunix.net/bbs/thread-1144212-1-3.html
Wednesday, November 11, 2009
Friday, November 6, 2009
Thursday, November 5, 2009
Using qemu to find out physical address of a given virtual address for Xen
Environment: Xen 3.3 (32-bit, PAE) installed in a QEMU virtual machine.
Input: 0xc0100000 (a virtual address in the domain 0 kernel)
Output: the physical address that 0xc0100000 maps to.
Process:
1. Use the "info registers" command in the QEMU monitor to get cr3. Here cr3 is 0x29cd00. This is the physical base address of the page directory pointer table (PDPT).
2. Take the top two bits of the virtual address; they are the index into the PDPT.
For 0xc0100000, the highest hex digit is 0xc, which is 1100(b). So the index is 11(b) = 3.
3. One PDPT entry is 64 bits long (Intel CPU manual 3A, 3.8.5) = 8 bytes. 3*8 = 24(d) = 0x18.
4. cr3+0x18 holds the entry that points to the page directory.
cr3+0x18 = 0x0029cd00+0x18 = 0x0029cd18.
xp /20hx 0x0029cd18 gives 0x390b6001. The low bits are flags, so the base address of the page directory is 0x390b6000.
5. Bits 21 to 29 of the virtual address are the index into the page directory.
For 0xc0100000, the top 16 bits are 1100,0000,0001,0000(b). Bits 29 down to 21 are 00,0000,000(b), that is, 0. So the index into the page directory is 0.
6. xp /20hx 0x390b6000 (the low bits of an entry hold flags; just ignore them for now).
The output is 0x3dbca067.
7. Bits 20 to 12 of the virtual address are the index into the page table. (For 2MB pages this is different.)
That is 1,0000,0000(b), which is 0x100. Since each entry is 8 bytes (64 bits), the offset of the entry within the page table is 0x100*8 = 0x800.
8. The low bits of 0x3dbca067 are flags. Just ignore 0x067 for now. The physical address of the page table entry is
0x3dbca000 + 0x800 = 0x3dbca800.
xp /20hx 0x3dbca800. The output is 0x3d100063. Again, the low bits are flags. So the final result is 0x3d100000. (I skipped the offset within the page; for 0xc0100000 the low 12 bits are 0.)
Note: To verify that the two addresses really point to the same data, use the "x" and "xp" commands in the QEMU monitor to show their contents, e.g. x /20hx 0xc0100000 and xp /20hx 0x3d100000. The output should be the same.
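The index arithmetic in the steps above can be written down as a few helpers. This is a sketch for 32-bit PAE with 4 KB pages only; the names `pae_split`, `pae_entry_addr` and `pae_entry_base` are made up for this example, and the memory reads themselves are still done by hand with "xp" in the monitor.

```c
#include <assert.h>
#include <stdint.h>

/* The four fields of a 32-bit virtual address under PAE (4 KB pages). */
struct pae_index {
    unsigned int pdpt;    /* bits 31-30: index into the PDPT */
    unsigned int pd;      /* bits 29-21: index into the page directory */
    unsigned int pt;      /* bits 20-12: index into the page table */
    unsigned int offset;  /* bits 11-0:  offset within the page */
};

struct pae_index pae_split(uint32_t va)
{
    struct pae_index ix;
    ix.pdpt = (va >> 30) & 0x3;
    ix.pd = (va >> 21) & 0x1ff;
    ix.pt = (va >> 12) & 0x1ff;
    ix.offset = va & 0xfff;
    return ix;
}

/* Physical address of entry `index` in a table starting at `table_base`.
 * Every PAE entry is 8 bytes long. */
uint64_t pae_entry_addr(uint64_t table_base, unsigned int index)
{
    assert(index < 512);
    return table_base + (uint64_t)index * 8;
}

/* Strip the flag bits from an entry, keeping address bits 12 to 51. */
uint64_t pae_entry_base(uint64_t entry)
{
    return entry & 0x000ffffffffff000ULL;
}
```

With the values from this walk, pae_split(0xc0100000) yields indices 3, 0 and 0x100 with offset 0; pae_entry_addr(0x29cd00, 3) is 0x29cd18, and pae_entry_addr(pae_entry_base(0x3dbca067), 0x100) is 0x3dbca800, matching steps 4 and 8.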
Reference:
Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A
Tuesday, November 3, 2009
ACPI resources
http://en.wikipedia.org/wiki/ACPI
http://rdist.root.org/2008/10/17/all-about-acpi/
http://www.usenix.org/events/usenix02/tech/freenix/full_papers/watanabe/watanabe_html/index.html
http://www.columbia.edu/~ariel/acpi/acpi_howto.txt
http://www.acpica.org/documentation/faq.php