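Introduction page: http://effbot.org/zone/celementtree.htm
I used to rely on minidom, which ships with the Python standard library, and wrapped it in a few helper functions of my own. minidom is fine for reading configuration files, but it performs poorly on large files; in my experience it feels about on par with a wrapper around Microsoft's XMLDocument. Both are DOM implementations, so they are bound to be less efficient: loading the whole file and building the DOM tree cost both time and memory.
A while ago I learned from a mailing list that cElementTree can give a huge performance improvement for XML processing. Since our XML file wrapper needs to be reworked anyway, I started by reading up on cElementTree. cElementTree is the C implementation of ElementTree; for the latter, see: http://www-900.ibm.com/developerWorks/cn/xml/x-matters/part28/index.shtml.
The following is excerpted from the official site: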
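As a rough illustration of the difference in API, here is a minimal sketch that reads the same small document with minidom and with ElementTree. It assumes a modern Python 3, where `xml.etree.ElementTree` is part of the standard library and picks up the C accelerator automatically when it is available (the separate `cElementTree` module was removed in Python 3.9). The sample XML, the element names `config` and `item`, and both helper functions are made up for this example.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are invented.
SAMPLE = (
    '<config>'
    '<item name="host">localhost</item>'
    '<item name="port">8080</item>'
    '</config>'
)

def read_with_minidom(text):
    """Collect name -> value pairs by walking the DOM tree."""
    doc = xml.dom.minidom.parseString(text)
    result = {}
    for node in doc.getElementsByTagName("item"):
        # The text lives in a separate child Text node in the DOM model.
        result[node.getAttribute("name")] = node.firstChild.data
    return result

def read_with_elementtree(text):
    """Collect the same pairs using the ElementTree API."""
    root = ET.fromstring(text)
    # Each element carries its attributes and leading text directly.
    return {item.get("name"): item.text for item in root.iter("item")}
```

Both functions return the same dictionary for this input; the ElementTree version simply needs fewer steps because text and attributes hang directly off each element.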
------------------------------------
Benchmarks
Here are some benchmark figures, using a number of popular XML toolkits to parse a 3405k document-style XML file, from disk to memory:
| library | time | space | notes |
|---|---|---|---|
| xml.dom.minidom (Python 2.1) | 6.3 s | 80000k | (1) |
| gnosis.objectify | 2.0 s | 22000k | (5) |
| xml.dom.minidom (Python 2.4) | 1.4 s | 53000k | (1) |
| ElementTree 1.2 | 1.6 s | 14500k | |
| ElementTree 1.2.4/1.3 | 1.1 s | 14500k | |
| cDomlette (C extension) | 0.540 s | 20500k | (1) |
| PyRXPU (C extension) | 0.175 s | 10850k | (2) |
| lxml.etree (C extension) | (4) | (4) | (3) |
| libxml2 (C extension) | 0.098 s | 16000k | (3) |
| readlines (read as utf-8) | 0.093 s | 8850k | |
| cElementTree (C extension) | 0.047 s | 4900k | |
| readlines (read as ascii) | 0.032 s | 5050k | |
The figures may of course vary somewhat depending on version, compiler, and platform. The above was measured with Python 2.4, using prebuilt Windows installers (as published by the maintainers) for all C extensions. If you want further details about the tests, drop me a line.
Several other toolkits were tested, but failed to parse the test file (which uses both non-ASCII characters and namespaces). One very slow toolkit was removed after threats from its lead developer.
Toolkits that parse namespaces but don't handle them properly are included, though (see notes 2 and 5, below).
For comparison, here are some benchmarks for event-based parsers (using the same file as above, and enough dummy handlers to be able to handle complete elements and their character data contents):
| library | time | throughput |
|---|---|---|
| xml.sax (Python 2.1) | 0.330 s | 10300 k/s |
| xml.sax (Python 2.4) | 0.292 s | 11700 k/s |
| xml.parsers.expat | 0.184 s | 18500 k/s |
| cElementTree XMLParser | 0.124 s | 27500 k/s |
| sgmlop | 0.092 s | 37000 k/s |
| cElementTree iterparse | 0.071 s | 48000 k/s |
Note 1) For these toolkits, the looping variant of my benchmark behaves very badly, resulting in unexpected memory growth and wildly varying parsing times (typically 150-300% of the values in the table). Strategic use of gc.collect() will usually make things better. Be careful.
Note 2) Even with namespace handling enabled, PyRXPU returns namespace prefixes instead of namespace URI:s, which makes it pretty much useless for namespace-aware XML processing. I've included it anyway, since it's often put forth as the fastest XML parser you can get for Python.
Note 3) Tests on other platforms indicate that libxml2 is closer to cElementTree than this benchmark indicates. This is most likely a compiler-related issue (I'm using "official" Windows binaries for this benchmark, but so will most other users).
Note 4) There are no Windows binaries for lxml.etree yet, but it uses libxml2's parser and object model, so the timings for this test should be very close to those for libxml2.
Note 5) An undocumented function (config_nspace_sep) must be called to enable namespace parsing. With that in place, the library parses the file without problems, but the resulting data structure depends on the namespace prefixes used in the file, rather than the namespace URI:s (also see note 2).
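The `iterparse` row in the event-based table refers to incremental parsing, where elements are handed to the caller as they are completed instead of after the whole tree has been built. A minimal sketch of that style follows, assuming a modern Python 3 where `iterparse` is available from `xml.etree.ElementTree`; the helper `count_records`, the tag name `entry`, and the sample data are hypothetical and not taken from the benchmark.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag):
    """Stream through `source` and count completed elements named `tag`."""
    count = 0
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Discard the element's text and children once it has been handled.
            # The emptied element stays attached to its parent, so memory use
            # is reduced rather than eliminated.
            elem.clear()
    return count

# Hypothetical in-memory sample; a real use would pass a file name or file object.
SAMPLE = b"<log><entry>a</entry><entry>b</entry><entry>c</entry></log>"
total = count_records(io.BytesIO(SAMPLE), "entry")
```

For large files, this pattern avoids holding the full content of every record in memory at once, which is the main cost of the DOM approach described at the top of this post.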