source: proto/CSV/csv2xml/Report.txt @ 4177

Last change on this file since 4177 was 2611, checked in by linmengl, 7 years ago

finish test with PERF_SEC

Report on csv2xml project:

Hi, Rob

I implemented the csv2xml function in Bit Stream, and its usage is

csv [-tab] [infile] [outfile]

If -tab is given, the tab character is used as the delimiter; otherwise the comma is used.
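The effect of the -tab switch can be illustrated with a short Python sketch. This is not the bit-stream implementation; it only mirrors the delimiter choice. The function name csv_to_xml, the <root> wrapper element, and the use of Python's csv module are assumptions made for the example.

```python
import csv
import io
from xml.sax.saxutils import escape


def csv_to_xml(text, use_tab=False):
    """Convert CSV text to a <root><row><col>..</col></row></root> string.

    use_tab mirrors the -tab switch: tab delimiter when True, comma otherwise.
    """
    delimiter = "\t" if use_tab else ","
    # newline="" leaves line endings untouched so the csv module sees them as-is.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    parts = ["<root>"]
    for record in reader:
        # Escape &, < and > so the column text stays valid XML.
        cols = "".join("<col>%s</col>" % escape(field) for field in record)
        parts.append("<row>%s</row>" % cols)
    parts.append("</root>")
    return "".join(parts)
```

Calling csv_to_xml("a\tb\n", use_tab=True) corresponds to running the tool with -tab, while the default call splits on commas.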

I found 8 small real CSV files through Google and tested them against the standard csv2xml in xmlsh. The generated XML files are the same except for the following two differences:

    1. For inputs like
    "Price
    With late fee"
    my program treats the whole quoted text as the content of a single column, which means an end-of-line inside quotes does not end the row, while xmlsh treats it as the end of a column and the start of a new row. I think mine is better.

    2. My program generates <row><col></col></row> for blank lines, while the standard code generates <row/> instead. I think both are OK.
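Both differences can be made concrete with a small quote-aware row splitter in Python. It is a character-by-character state machine written only for illustration, not the Parabix or xmlsh code; the name split_rows is hypothetical, and the sketch assumes "\n" line endings and double-quote quoting.

```python
def split_rows(text, delimiter=","):
    """Split CSV text into rows of columns, keeping newlines inside quotes."""
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if text[i + 1:i + 2] == '"':   # doubled quote: literal quote
                    field.append('"')
                    i += 1
                else:
                    in_quotes = False          # closing quote
            else:
                field.append(ch)               # a '\n' here stays in the column
        elif ch == '"':
            in_quotes = True
        elif ch == delimiter:
            row.append("".join(field))
            field = []
        elif ch == "\n":
            row.append("".join(field))         # end of the last column in the row
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
        i += 1
    if field or row:                           # final row without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows
```

With this splitter the quoted "Price / With late fee" text comes back as one column containing a newline, and a blank line comes back as a row holding one empty column, which is the analogue of <row><col></col></row>.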

I also found 2 big real CSV files and generated 2 big files with a lot of duplicated rows. The details are listed in the Appendix. I used the shell command "time" to record performance, because I could not find a more precise way to measure the standard code. So the running times below include I/O time and other OS events. Only the running times of the large CSV files are useful, and in the Appendix they show that the Parabix version is roughly 1.7 to 3.7 times faster.
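A whole-process measurement of this kind can also be scripted. The Python sketch below is one possible way to do it, not the procedure used for this report: it times a command from start to exit, so like "time" it includes process startup, I/O and other OS effects. The helper name time_command and the choice of keeping the best of several runs are assumptions.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of argv."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is never timed.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Example: time the Python interpreter starting up and exiting.
startup = time_command([sys.executable, "-c", "pass"])
```

To time a converter, pass its command line as a list, for example ["./csv", "infile.csv", "outfile.xml"], where the paths are placeholders.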

Appendix:

FileName         (Size)    Std      My

2006scores       (1.8MB)
salary_data      (387.2KB)
eso_eagle_awards (14.1KB)  0.006s
sticker-price    (214.7KB) 0.031s
FN               (20.3KB)  0.007s
TLTD_holdings    (94.2KB)  0.021s
GUNR_holdings    (10.5KB)  0.004s
scaledwps        (2.0MB)   0.213s

L2_2012-01       (103.2MB) 17.397s  4.760s
L2_2012-02       (130.5MB) 20.273s  5.848s
gen1000          (28MB)    7.320s   3.424s
gen10000         (280MB)   59.132s  34.122s

(Large CSV files are from http://openurl.ac.uk/doc/data/thedata.html)
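The speedups implied by the four large-file rows above can be checked with a few lines of Python. The timings are copied from the table; the dictionary and variable names exist only for this sketch.

```python
# (Std seconds, My seconds) for the large files, copied from the Appendix table.
timings = {
    "L2_2012-01": (17.397, 4.760),
    "L2_2012-02": (20.273, 5.848),
    "gen1000": (7.320, 3.424),
    "gen10000": (59.132, 34.122),
}

# Speedup = standard time divided by my time, rounded to two decimals.
speedups = {name: round(std / my, 2) for name, (std, my) in timings.items()}

for name, factor in speedups.items():
    print("%-12s %.2fx" % (name, factor))
```

This gives 3.65x and 3.47x for the two real files and 2.14x and 1.73x for the two generated files.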

Appendix I:
Change all strings into direct output

FileName         (Size)    Std      My

L2_2012-01       (103.2MB) 16.761s  0.560s
L2_2012-02       (130.5MB) 20.117s  0.812s
gen1000          (28MB)    7.480s   0.564s
gen10000         (280MB)   59.472s  6.116s

L2_2012-01       (103.2MB) 16.761s  0.868s
L2_2012-02       (130.5MB) 20.117s  1.240s
gen1000          (28MB)    7.480s   0.824s
gen10000         (280MB)   59.472s  9.077s