Copyright © 2024 Ashok P. Nadkarni. All rights reserved.
1. Introduction
The tclcsv extension for Tcl provides a fast and flexible means of reading and writing text files in Comma Separated Value (CSV) format.
Tcllib also has a package, csv, that is capable of reading and writing CSV files. It has the advantage of being a pure Tcl package, but it is much slower, which matters only for larger files. It is also a little less flexible in terms of input syntax.
The extension requires Tcl version 8.6 or later.
2. Downloads and installation
Prebuilt 32- and 64-bit Windows binaries for Tcl 8.6 and 9.0 are available
from https://sourceforge.net/projects/tclcsv/files/. Unzip the distribution
into a directory that is included in your Tcl auto_path.
For Unix-like systems, a TEA-compliant source distribution can be downloaded from the same location.
3. General usage
To use the extension, load it with package require:
package require tclcsv
→ 2.4.3
3.1. Reading data
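The package provides two ways to read CSV data from a channel:
- The csv_read command parses the data and returns the rows as a single list.
- The reader command creates a command object that parses the data incrementally and returns rows on request.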
Both forms take various options that indicate the specific CSV dialect as well as options that limit which rows of the data are returned.
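Although these commands read only from channels, string data can be parsed as well: load a package such as tcl::chan::string from tcllib and then create channels from string data as shown in the examples in this documentation.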
3.3. CSV dialects
The exact form of CSV data can vary. CSV "dialects" may differ in terms of delimiter character, the use of quotes, treatment of leading whitespace, header lines and so on. The dialect command returns appropriate values for options to be passed to the csv_read and reader commands to handle well-known dialects such as Excel.
In addition, the tclcsv package provides convenience commands that are primarily intended for interactive use when the dialect used for the CSV data is not known.
- The sniff command uses heuristics to determine the format of the CSV data and returns a list of appropriate options required for parsing it with the csv_read or reader commands.
- The sniff_header command uses heuristics to determine the types of the columns in the CSV data and whether the data is prefixed with a header line.
3.3.1. Interactive dialect configuration
When the exact dialect of a file is not known, the package also provides the dialectpicker Tk widget to help both the programmer and the end user select the correct options. The widget allows the user to set various parameters, such as file encoding and delimiter, while previewing the first few lines. These settings can be programmatically retrieved and passed to one of the CSV read commands to correctly parse the data.
4. Command reference
All commands are located in the tclcsv namespace.
4.1. Commands
4.1.1. csv_read ?OPTIONS? CHANNEL
The command reads data from the specified channel (which must be in blocking mode) and returns a Tcl list, each element of which is a list corresponding to one row of the CSV data. The caller should have appropriately positioned the channel read pointer and configured its encoding before calling this command.
The command will normally read all data from the channel until EOF is encountered and return the corresponding rows. The following options modify this behaviour:
The following options collectively specify the dialect of the CSV data.
The command does not require that all rows have the same number of fields. If required, the caller has to check that all returned rows have the same number of elements.
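The check mentioned above could be written along the following lines. This is only a sketch: it assumes the default dialect uses a comma as the delimiter, and it uses the tcl::chan::string package from tcllib purely to build a channel from sample data.

```tcl
package require tclcsv
package require tcl::chan::string  ;# from tcllib; only used to make a channel from a string

# Sample data in which the second row is one field short.
set fd [tcl::chan::string "a,b,c\nd,e\nf,g,h\n"]
set rows [tclcsv::csv_read $fd]
close $fd

# Collect the distinct field counts across all returned rows.
set widths [lsort -unique -integer [lmap row $rows {llength $row}]]
if {[llength $widths] > 1} {
    puts "Rows have differing field counts: $widths"
}
```

Whether a ragged file should be rejected, padded or truncated is application specific, which is presumably why the command leaves the decision to the caller.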
4.1.2. csv_write ?OPTIONS? CHANNEL ROWS
The command writes ROWS to the specified channel CHANNEL. ROWS must be a list each of whose elements is a sublist corresponding to a single record. The caller should have appropriately positioned the channel write pointer and configured its encoding before calling this command.
The CSV dialect used for writing is controlled through the options in the table below.
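As an illustration, the following sketch writes three records to a file using the default options. The file name cities.csv and the sample data are hypothetical and chosen only for the example.

```tcl
package require tclcsv

# Each element of rows is a sublist holding the fields of one record.
set rows {
    {City Longitude Latitude}
    {{New York} 40.7127 74.0059}
    {London 51.5072 0.1275}
}

set fd [open cities.csv w]           ;# hypothetical output file
chan configure $fd -encoding utf-8   ;# encoding is the caller's responsibility
tclcsv::csv_write $fd $rows
close $fd
```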
4.1.3. dialect NAME ?DIRECTION?
Returns the appropriate values for options for the CSV dialect NAME. Currently, NAME must be excel or excel-tab, which correspond to the CSV formats supported by Excel. The former uses commas and the latter tabs.
If DIRECTION is read (default), the returned options are suitable for passing to csv_read and reader. If DIRECTION is write, the options are suitable for csv_write.
% tclcsv::dialect excel
→ -delimiter , -quote {"} -doublequote 1 -skipleadingspace 0
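The returned options are typically expanded directly into a read command. The sketch below, which uses tcl::chan::string from tcllib only to supply sample data, parses a line containing an Excel-style doubled quote.

```tcl
package require tclcsv
package require tcl::chan::string  ;# from tcllib; only used to make a channel from a string

# The middle field contains embedded quotes, written Excel-style as "".
set fd [tcl::chan::string "a,\"b \"\"quoted\"\"\",c\n"]

# Expand the dialect options into the csv_read call.
set rows [tclcsv::csv_read {*}[tclcsv::dialect excel] $fd]
close $fd

puts [lindex $rows 0 1]
```

For writing, the same pattern should apply with tclcsv::dialect excel write and csv_write.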
4.1.4. reader SUBCOMMAND ?OPTIONS?
This command takes one of the two forms shown below.
reader create CMDNAME ?OPTIONS? CHANNEL
reader new ?OPTIONS? CHANNEL
Each form creates a command object that will incrementally parse CSV data from the specified channel (which must be in blocking mode). The caller should have appropriately positioned the channel read pointer and configured its encoding before calling this command.
The reader create command allows the caller to specify the name of this command object whereas reader new will generate a new unique name. Both return the name of the created command.
Options are as detailed for the csv_read command with the exception of the -nrows option, which is not relevant for this interface.
The methods supported by the reader command objects are detailed below.
READER destroy
Destroys the READER command object. Note that closing the attached channel is the caller’s responsibility.
READER eof
Returns 1 if there are no more rows and 0 otherwise.
READER next ?COUNT?
Returns one or more rows. If COUNT is not specified, the return value is a list corresponding to a single row. If COUNT is specified, the return value is a list of up to COUNT sublists each of which corresponds to a row. Fewer than COUNT rows may be returned if that many are not available.
Note that READER next is not the same as READER next 1. The former returns a single row, the latter returns a list containing a single row.
When no more rows are available, the method returns an empty list. This is not distinguishable from an empty line in the CSV input if the -skipblanklines option was specified as false. The eof method may be used to distinguish the two cases.
The following is an example of parsing using reader objects.
% set fd [tcl::chan::string { \
r0c0, r0c1, r0c2
r1c0, r1c1, r1c2
r2c0, r2c1, r2c2
r3c0, r3c1, r3c2
}]
→ rc1
% set reader [tclcsv::reader new -skipleadingspace 1 $fd]
→ ::tclcsv::reader1
% $reader next
→ r0c0 r0c1 r0c2
% $reader next 1 (1)
→ {r1c0 r1c1 r1c2}
% $reader next 2
→ {r2c0 r2c1 r2c2} {r3c0 r3c1 r3c2}
% $reader next
% $reader eof
→ 1
% $reader destroy
% close $fd
(1) Note the difference in return value from the previous command.
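When the number of rows is not known in advance, a reader is usually driven from a loop. The following sketch combines next and eof as described above so that an empty row is not mistaken for the end of the data; the sample data and the count variable are only illustrative.

```tcl
package require tclcsv
package require tcl::chan::string  ;# from tcllib; only used to make a channel from a string

set fd [tcl::chan::string "r0c0,r0c1\nr1c0,r1c1\nr2c0,r2c1\n"]
set reader [tclcsv::reader new $fd]

set count 0
while {1} {
    set row [$reader next]
    # An empty list means end of data only if eof also reports true.
    if {[llength $row] == 0 && [$reader eof]} break
    incr count
    puts "row $count: [join $row |]"
}

$reader destroy
close $fd   ;# closing the channel remains the caller's job
```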
4.1.5. sniff ?-delimiters DELIMITERS? CHANNEL
Attempts to guess the format of the data in the channel and returns a list of appropriate options to be passed to csv_read. The command uses heuristics that may not work for all files and as such is intended for interactive use.
The channel must be seekable and the command always returns the channel in the same position it was in when the command was called. This is true for both normal returns as well as exceptions.
% set fd [tcl::chan::string { \
r0c0, r0c1, r0c2
r1c0, r1c1, r1c2
r2c0, r2c1, r2c2
}]
→ rc2
% set opts [tclcsv::sniff $fd]
→ -delimiter , -skipleadingspace 1
% tclcsv::csv_read {*}$opts $fd
→ {r0c0 r0c1 r0c2} {r1c0 r1c1 r1c2} {r2c0 r2c1 r2c2}
% close $fd
% set fd [tcl::chan::string { \
'r0;c0';'r0c1';'r0c2'
'r1c0'; 'r1c1'; 'r1c2'
'r2c0'; 'r2c1'; 'r2c2'
}]
→ rc3
% set opts [tclcsv::sniff $fd]
→ -delimiter {;} -skipleadingspace 1 -quote '
% tclcsv::csv_read {*}$opts $fd
→ {{r0;c0} r0c1 r0c2} {r1c0 r1c1 r1c2} {r2c0 r2c1 r2c2}
% close $fd
4.1.6. sniff_header ?OPTIONS? CHANNEL
Attempts to guess whether the CSV data contained in the channel includes a header. It also attempts to guess the type of the data in each column of the CSV file. OPTIONS specify the CSV dialect of the data. See CSV read format options.
If the data includes a header, the command returns a list with two elements, the first of which is a list containing the deduced type of each column, and the second element being a list containing the header fields for each column. If the command deduces that the data does not contain a header, the returned list does not contain the second element.
The deduced type of each column is one of integer, real or string. Note that the integer type check treats each value as a decimal string. Hexadecimal values are therefore treated as strings, and values like 08 (invalid octal) are accepted as valid integer values.
The command uses heuristics that may not work for all files and as such is intended for interactive use.
The channel must be seekable and the command always returns the channel in the same position it was in when the command was called. This is true for both normal returns as well as exceptions.
The following examples show the return values with or without a header being present.
% set fd [tcl::chan::string { \
City, Longitude, Latitude
New York, 40.7127, 74.0059
London, 51.5072, 0.1275
}]
→ rc4
% tclcsv::sniff_header $fd
→ {string real real} {{ City} { Longitude} { Latitude}}
% close $fd
%
% set fd [tcl::chan::string { \
New York, 40.7127, 74.0059
London, 51.5072, 0.1275
}]
→ rc5
% tclcsv::sniff_header $fd
→ {string real real}
% close $fd
Note that when a header is present, you can use the -skiplines option to csv_read to skip the header.
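One way to act on the result is sketched below. It skips the header with -startline 1, the option that the dialectpicker widget is documented as returning for this purpose, rather than -skiplines; the sample data is illustrative, and the code handles both outcomes because the header detection is heuristic.

```tcl
package require tclcsv
package require tcl::chan::string  ;# from tcllib; only used to make a channel from a string

set fd [tcl::chan::string "City,Longitude,Latitude\nNew York,40.7127,74.0059\nLondon,51.5072,0.1275\n"]

# sniff_header leaves the channel position unchanged, so it is safe to read afterwards.
set info [tclcsv::sniff_header $fd]
lassign $info types header

if {[llength $info] == 2} {
    # A header was detected: start reading at the second line.
    set rows [tclcsv::csv_read -startline 1 $fd]
} else {
    set rows [tclcsv::csv_read $fd]
}
close $fd

puts "Column types: $types"
puts "Data rows: [llength $rows]"
```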
5. Widget reference
The package provides a single widget, dialectpicker, for configuring the dialect settings used to parse CSV data.
5.1. Widgets
5.1.1. dialectpicker WIDGET ?OPTIONS? DATASOURCE
The dialectpicker widget allows interactive configuration of the dialect settings for parsing CSV data from the specified file or channel. The widget presents controls for the various settings and permits the user to modify them and inspect the results of parsing the CSV data using the configured settings.
WIDGET should be the Tk window path for the widget. This is also the return value of the command.
DATASOURCE should be either the path to a file or the name of the channel from which the CSV data is to be read. In the case of a channel, the configuration of the channel, including the seek position and encoding, is restored to its original state when the widget is destroyed.
In addition to Tk, the widget requires the snit package, available as part of tcllib, to be installed.
An example invocation is shown below.
[Figure: dialectpicker widget]
The top half of the widget contains the various settings related to parsing of CSV data. The bottom half displays a preview table which is updated as these settings are modified by the user.
On creation, the widget sets the initial values by sniffing the channel. These can be overridden by specifying options to dialectpicker when the widget is created. These options are
The current settings for the widget can be retrieved through two methods, encoding and dialect.
WIDGET encoding
Returns the character encoding name currently selected in the widget.
WIDGET dialect
Returns a dictionary of CSV dialect options with values as set in the widget. The dictionary contains all the options shown in dialectpicker options. If the user has deselected any of the Included checkboxes for any column in the preview pane, the dictionary also includes an -includedfields option specifying the subset of fields to be read from the data. If the Header is present checkbox is selected, the dictionary includes a -startline 1 option indicating the first line should be skipped when reading data.
5.1.2. Example
The following code uses the widget::dialog dialog widget from the tklib package to read CSV data using user-selected settings.
package require tclcsv
package require widget::dialog
widget::dialog .dlg -type okcancel
tclcsv::dialectpicker .dlg.csv qb.csv
.dlg setwidget .dlg.csv
set response [.dlg display] (1)
if {$response eq "ok"} {
    set fd [open qb.csv]
    set encoding [.dlg.csv encoding]
    chan configure $fd -encoding $encoding (2)
    set opts [.dlg.csv dialect]
    set rows [tclcsv::csv_read {*}$opts $fd]
    close $fd
}
destroy .dlg
(1) The user response will be "ok" or "cancel".
(2) Note that the encoding has to be set explicitly prior to calling csv_read.
6. Source code
The source code is available from its repository at https://github.com/apnadkarni/tcl-csv.
7. Reporting bugs
Report any bugs at https://github.com/apnadkarni/tcl-csv/issues.
8. License
See the file license.terms in the distribution or in the src directory in the source repository.
9. Acknowledgements
The core of the CSV parsing code is adapted from the CSV parser implemented by the Python pandas library.
The hashing code is from attractivechaos.
10. Version history
- Tcl 9 support.
- Added csv_write for writing.
- 40% faster parsing.
- Modify dialectpicker to accept either a file path or a channel.
- Latent support for the tarray package.
- Added dialectpicker widget.
- Added options -includefields and -excludefields.
- Tweaks to sniff_header to improve type and header heuristics.