REBOL

REBOL3 - Parse (Discussion of PARSE dialect [web-public])

Most recent messages (300 max) are listed first.

# User Message Date
6087EndoThank you, I don't need an exact regexp library, but it would be nice to have some regexp functionality.5-Jan 11:47
6086RebolekEndo, I will try to find the newest version and let you know. But do not expect it to translate every regular expression.5-Jan 10:18
6085EndoDoes anyone know where I can find rebolek's R2E2 - REBOL Regular Expressions Engine? I think this link is dead: http://bolek.techno.cz/reb/regex.r

I saw it on http://www.rebol.org/documentation.r?script=regset.r

5-Jan 9:42
6084EndoI've been working with SQL Server for a long time; if there is anything I can help with or test for you, feel free to ask.20-Dec 18:24
6083EndoThe biggest problem would be the different datatypes for different versions of SQL Server, if there is no good documentation for the native format. But BCP does the job quite well. I CALL it when necessary and use FIND to check its output for errors. There are XML format files as well, which are easier to understand but have no functional differences from non-XML format files.20-Dec 18:23
6082BrianHI figure it might be worth it (for me at some point) to do some test exports in native format in order to reverse-engineer the format, then write some code to generate that format ourselves. I have to do a lot of work with SQL Server, so it seems inevitable that such a tool will be useful at some point, or at least the knowledge gained in the process of writing it.20-Dec 18:17
6081EndoNative format works well if you export from one SQL Server and import into another.20-Dec 18:09
6080EndoIt uses a format file, which is very strict, but there is no way to set a quote char for fields.20-Dec 18:09
6079BrianHHave you looked into the native type formatting of bcp? It might be easier to make a more precise data file that way.20-Dec 18:05
6078BrianHNote that that was a first-round mockup of the R3 version, Endo. If you want to make an R2 version, download the latest script and edit it similarly.20-Dec 18:03
6077EndoThanks BrianH20-Dec 17:58
6076HenrikThanks.20-Dec 16:54
6075BrianHUpdated, Henrik.20-Dec 16:45
6074BrianHWeirdly enough, = and =? return true in that case in R2, but only == returns false; false is what I would expect for =? at least.20-Dec 16:40
6073BrianHNope, that's a bug in the R2 version only. Change this: :x = #"^(22)" [{""""}] to this: :x == #"^(22)" [{""""}]

Another incompatibility between R2 and R3 that I forgot :( I'll update the script on REBOL.org.

20-Dec 16:30
6072HenrikIs this related to what you wrote above?

>> to-csv [34]
== {""""}

20-Dec 16:22
6071BrianHBe careful, if you don't quote string values then the character set of your values can't include cr, lf or your delimiter. It requires so many changes that it would be more efficient to add new formatter functions to the associated FUNCT/with object, then duplicate the code in TO-CSV that calls the formatter. Like this:

to-csv: funct/with [
    "Convert a block of values to a CSV-formatted line in a string."
    data [block!] "Block of values"
    /with "Specify field delimiter (preferably char, or length of 1)"
    delimiter [char! string! binary!] {Default ","}
    ; Empty delimiter, " or CR or LF may lead to corrupt data
    /no-quote "Don't quote values (limits the characters supported)"
] [
    output: make block! 2 * length? data
    delimiter: either with [to-string delimiter] [","]
    either no-quote [
        unless empty? data [append output format-field-nq first+ data]
        foreach x data [append append output delimiter format-field-nq :x]
    ] [
        unless empty? data [append output format-field first+ data]
        foreach x data [append append output delimiter format-field :x]
    ]
    to-string output
] [
    format-field: func [x [any-type!] /local qr] [
        ; Parse rule to put double-quotes around a string, escaping any inside
        qr: [return [insert {"} any [change {"} {""} | skip] insert {"}]]
        case [
            none? :x [""]
            any-string? :x [parse copy x qr]
            :x = #"^(22)" [{""""}]
            char? :x [ajoin [{"} x {"}]]
            money? :x [find/tail form x "$"]
            scalar? :x [form x]
            date? :x [to-iso-date x]
            any [any-word? :x binary? :x any-path? :x] [parse to-string :x qr]
            'else [cause-error 'script 'expect-set reduce [
                [any-string! any-word! any-path! binary! scalar! date!] type? :x
            ]]
        ]
    ]
    format-field-nq: func [x [any-type!]] [
        case [
            none? :x [""]
            any-string? :x [x]
            money? :x [find/tail form x "$"]
            scalar? :x [form x]
            date? :x [to-iso-date x]
            any [any-word? :x binary? :x any-path? :x] [to-string :x]
            'else [cause-error 'script 'expect-set reduce [
                [any-string! any-word! any-path! binary! scalar! date!] type? :x
            ]]
        ]
    ]
]

If you want to add error checking to make sure the data won't be corrupted, you'll have to pass in the delimiter to format-field-nq and trigger an error if it, cr or lf are found in the field data.

20-Dec 16:12
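
The error check described in message 6071 could look roughly like the sketch below. The name unsafe-field? is hypothetical and not part of csv-tools.r; it only shows the test that format-field-nq would need before emitting an unquoted string, assuming the delimiter has already been converted to a string.

```rebol
REBOL []

; Hypothetical helper, not part of csv-tools.r.
; Returns true when an unquoted field would corrupt the CSV line,
; i.e. when it contains the delimiter, CR or LF.
unsafe-field?: func [
    "True if the field contains the delimiter, CR or LF."
    field [string!]
    delimiter [string!]
] [
    found? any [
        find field delimiter
        find field cr
        find field lf
    ]
]

print unsafe-field? "plain" ","        ; false
print unsafe-field? "a,b" ","          ; true
print unsafe-field? "two^/lines" ","   ; true
```

A caller would trigger an error instead of returning the field whenever this check is true.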
6070EndoI'm using it to prepare data to bulk insert into a SQL Server table using the BCP command line tool. I need to make some changes, like a /no-quote option to not quote string values, because there is no option in BCP to tell it that my data has quoted string values.20-Dec 13:03
6069BrianHAdded a TO-CSV /with delimiter option, in case commas aren't your thing. It only specifies the field delimiter, not the record delimiter, since TO-CSV only makes CSV lines, not whole files.20-Dec 7:19
6068GrahamCYeah, generally math is faster than using logic. An old Forth trick.20-Dec 5:30
6067BrianHUpdated on REBOL.org to use new method.19-Dec 18:48
6066BrianHTwice the speed using your method :)19-Dec 18:42
6065GrahamCand the outcome was?19-Dec 18:40
6064BrianHIt's worth timing. I'll try both, in R2 and R3.19-Dec 3:43
6063GrahamCe.g.

next form 100 + date/month

19-Dec 1:21
6062GrahamCdunno if it's faster, but to left pad days and months I add 100 to the value, FORM it, and then do a NEXT, i.e. regarding your p0 function19-Dec 1:20
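
A minimal sketch of the padding trick GrahamC describes, wrapped in a function. The name pad2 is hypothetical, and the trick only works for integers from 0 through 99, which is enough for days and months.

```rebol
REBOL []

; pad2 is a hypothetical name. Adding 100 forces a three-character
; string such as "103"; NEXT then skips the leading "1".
pad2: func [n [integer!]] [next form 100 + n]

print pad2 3     ; 03
print pad2 12    ; 12
```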
6061HenrikThanks18-Dec 17:18
6060BrianHTO-ISO-DATE fixed on REBOL.org18-Dec 17:10
6059BrianHHaving to put an explicit conversion from blocks, parens, objects, maps, errors, function types, structs, routines and handles, reminds you that you would need to explicitly convert them back when you LOAD-CSV. Or more often, triggers valuable errors that tell you that unexpected data made it in to your output.18-Dec 17:05
6058BrianHAs for that TO-ISO-DATE behavior, yes, it's a bug. Surprised I didn't know that you can't use /hour, /minute and /second on date! values with times in them in R2. It can be fixed by changing the date/hour to date/time/hour, etc. I'll update the script on REBOL.org.18-Dec 16:45
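
The fix mentioned in message 6058 can be tried at the console: going through /time reaches the hour and minute. This is a small sketch using the date from Henrik's report, not the updated script itself.

```rebol
REBOL []

d: 18-Dec-2011/14:57:11

; Go through the time! value rather than using date/hour directly,
; which R2 2.7.8 rejects with "Invalid path value: hour".
print d/time/hour     ; 14
print d/time/minute   ; 57
```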
6057BrianHYeah, blocks for cells are so far outside the data model of everything else that uses CSV files that TO-CSV was written to assume that you forgot to put an explicit translation to a string or binary in there (MOLD, FORM, TO-BINARY), or more likely that the block got in there by accident. Same goes for functions and a few other types.18-Dec 16:38
6056HenrikAlso it seems that TO-CSV does not like blocks for cells.18-Dec 16:08
6055HenrikBrianH, testing csv-tools.r now.

Is this a bug?:

>> to-iso-date 18-Dec-2011/14:57:11
** Script Error: Invalid path value: hour
** Where: ajoin
** Near: p0 date/hour 2 ":" p0
>> system/version
== 2.7.8.3.1

18-Dec 13:59
6054SunandaCrude maybe, yet looks effective -- thanks!8-Dec 12:48
6053PeterWoodVery crudely adding an additional space if the last character is space:

>> s: " a "
== " a "
>> if #" " = last s [append s " " ]
== " a  "
>> parse/all s " "
== ["" "a" ""]

8-Dec 12:27
6052SunandaDebugging some live code here .... I wasn't expecting 'parse to drop the last space in the second case here:

parse/all " a" " "
== ["" "a"]
parse/all " a " " "
== ["" "a"]

So after the parse, it seems that " a" = " a "

Any thoughts on a quick work around? Thanks!

8-Dec 12:13
6051EndoSELECT ... INTO Chart.gif Nice addition to SQL :)8-Dec 11:47
6050PekrBrianH: one of my guys returned from some MS training, and he pointed me to LogParser. It seems even some guys at MS are kind of dialecting :-) It looks like SQL, and you can query logs, and do some nice stuff around them ....

http://technet.microsoft.com/en-us/library/ee692659.aspx

8-Dec 11:07
6049BrianHThe RFC is fairly loose and incomplete documentation of the observed behavior of most CSV handling tools. Excel's behavior is the real de facto standard, for better or worse.7-Dec 15:07
6048BrianHI considered making a /strict option to make it trigger errors in that case, but then reread the RFC and checked the behavior again, and realized that no one took the spec that strictly. Most tools either behave exactly the same as my LOAD-CSV (because that's how Excel behaves), or completely fail when there are any quotes in the file, like PARSE data "," and PARSE/all data ",".7-Dec 15:04
6047BrianHThe values are only considered to be surrounded by quotes if those quotes are directly next to the commas; otherwise, the quotes are data. In the case you give above, according to the spec the quotes in the data should not be allowed - they are bad syntax. However, since the spec in the RFC doesn't define what to do in the case of data that doesn't match the spec, I decided to match the error fallback behavior of Excel, the most widely used CSV handler. Most of the other tools I've tried match the same behavior.7-Dec 15:00
6046ChristianEDo you consider LOAD-CSV { " a " , " b " , " c " } yielding [[{ " a " } { " b " } { " c " }]] to be on spec? It says that spaces are part of a field's value, yet it states that fields may be enclosed in double quotes. I'd rather expected [[" a " " b " " c "]] as a result. The way it is, LOAD-CSV in such cases parses unescaped double quotes as part of the value, IMHO that's not conforming with the spec.7-Dec 9:26
6045BrianHJust tweaked it to add any-path! support to the after parameter in the R3 version, since R3 supports SET any-path!.6-Dec 20:42
6044BrianHThis pass-by-word convention is a little too C-like for my tastes. If only we had multivalue return without overhead, like Lua and Go.6-Dec 19:55
6043BrianHI was a little concerned about making /part take two parameters, since it doesn't anywhere else, but the only time you need that continuation value is when you do /part, and you almost always need it then. Oh well, I hope it isn't too confusing :)6-Dec 19:53
6042GreggThanks for posting that Brian!6-Dec 19:46
6041BrianHhttp://www.rebol.org/view-script.r?script=csv-tools.r updated, with the new LOAD-CSV /part option.

The LOAD-CSV /part option takes two parameters:
- count: The maximum number of decoded lines you want returned.
- after: A word that will be set to the position of the data after the decoded portion, or none.

If you are loading from a file or url then the entire data is read, and after is set to a position in the read data. If you are converting from binary then in R2 after is set to an offset of an as-string alias of the binary, and in R3 after is set to an offset of the original binary. R3 does binary conversion on a per-value basis to avoid having to allocate a huge chunk of memory for a temporary, and R2 just does string aliasing for the same reason. Be careful to expect that if you are passing the value assigned to after to anything other than LOAD-CSV (which can handle it either way).

6-Dec 19:17
6040BrianHLOAD has the same problem on R2 and R3. The continuation returned would be an offset reference to the entire data in the file, at the position after the part parsed.5-Dec 23:09
6039HenrikI need to go to bed. If there are more questions, I'll be back tomorrow.5-Dec 23:09
6038HenrikThat's fine by me, as I read the file into memory once due to the need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV.5-Dec 23:09
6037BrianHThe main problem with /part is that the current code reads the whole file into memory before parsing, and the parsing itself has minuscule overhead compared to the file overhead. Really doing /part properly might require incremental file reading, to the extent that that works (how well does it work for the http scheme on R3?).5-Dec 23:07
6036BrianHThe latter makes chaining of the data to other functions easier, but requires a variable to hold the continuation; however, you usually use a variable for that anyway. The former makes it easier to chain both values (and looks nicer to R2 fans), but the only function you normally chain both values to is SET, so that's of limited value.5-Dec 23:01
6035Henrikoutput: load-csv/into/next data output 'data

5-Dec 22:58
6034Henriksecond one looks ok5-Dec 22:58
6033BrianHSorry, that first one was:

set [output data] load-csv/into/next data output

5-Dec 22:56
6032BrianHWhich do you prefer as a /next style?

set [output data] load-csv/into data output

or

output: load-csv/into/next data output 'data

5-Dec 22:55
6031Henrik(better response time, when the user abuses import adjustment buttons)5-Dec 22:52
6030HenrikWell, sure, but I like to have complete control over things like that, so I usually settle for showing only the first 100 lines.5-Dec 22:52
6029BrianHFunny, for my purposes it has to get over 100000 lines before it starts to be large :)5-Dec 22:50
6028HenrikI don't really need anything but having the ability to parse the first 100 lines of a file and doing that many times, so I don't care so much about continuation. This is for real-time previews of large CSV files (> 10000 lines).5-Dec 22:49
6027BrianHYes, that is a possibility, but not there yet. Resuming would be a problem because you'd have to either save a continuation position or reparse. Maybe something like LOAD/next would work here, preferably like the way R3's LOAD/next was before it was removed in favor of TRANSCODE/next. Making the /into option work with /next and /part would be interesting.5-Dec 22:47
6026Henrik"since you can't just READ/lines a CSV file" - yes, mine does that, and that's no good.5-Dec 22:47
6025Henrikcan it be told to stop parsing after N lines instead? as far as I can tell from the source: (output: insert/only output line), it could do that.5-Dec 22:42
6024BrianHIt doesn't do conversion from string (or even from binary with LOAD-CSV/binary). This doesn't have a /part option but that is a good idea, especially since you can't just READ/lines a CSV file because it treats newlines differently depending on whether the value is in quotes or not. If you want to load incrementally (and can break up the lines yourself, for now) then LOAD-CSV supports the standard /into option.5-Dec 22:37
6023HenrikWell, now, Brian, this looks very convenient. :-) I happen to be needing a better CSV parser, than the one I have here, but it needs to not convert cell values away from string, and I also need to parse partially, or N number of lines. Is this possible with this one?5-Dec 22:32
6022BrianHNonetheless, this LOAD-CSV even handles multichar field delimiter options; in R2 that requires some interesting PARSE tricks :)5-Dec 22:16
6021BrianHMaking the end-of-line delimiter an option turned out to be really tricky, too tricky to be worth it. The code and time overhead from just processing the option itself was pretty significant. It would be a better idea to make that kind of thing into a separate function which requires the delimiters to be specified, or a generator that takes a set of delimiters and generates a function to handle that specific set.5-Dec 22:13
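
The generator idea in message 6021 might be sketched as below, assuming R2. The name make-splitter is hypothetical, and the generated function only splits one unquoted line with PARSE/all, so it illustrates the pattern of building a function for one fixed delimiter and is not a CSV loader.

```rebol
REBOL []

; make-splitter is a hypothetical name. COMPOSE bakes the delimiter
; into the body, so the generated function does no option handling
; each time it is called.
make-splitter: func [delimiter [char!]] [
    func [line [string!]] compose [parse/all line (to-string delimiter)]
]

split-semi: make-splitter #";"
probe split-semi "a;b;c"   ; ["a" "b" "c"]
```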
6020BrianHFull version with other CSV functions posted here: http://www.rebol.org/view-script.r?script=csv-tools.r

5-Dec 22:02
6019BrianHThe one above misses one of the Excel-like bad data handling patterns. Plus, I've added a few features, like multi-load, more option error checking , and R3 versions. I'll post them on REBOL.org today.4-Dec 20:19
6018GreggThanks for posting Brian. I second Steeve's suggestion, though I'll snag it here for testing.4-Dec 20:14
6017SteeveDon't forget to post your script on rebol.org when finished :-)3-Dec 19:47
6016BrianHThe R3 version will be simpler and faster because of the PARSE changes and better binary handling. However, url handling might be trickier because READ/string is ignored by all schemes at the moment.3-Dec 19:08
6015BrianH>> load-csv {^M^/" a""", a""^Ma^/^/}
== [[""] [{ a"} { a""}] ["a"] [""]]
>> load-csv/binary to-binary {^M^/" a""", a""^Ma^/^/}
== [[#{}] [#{206122} #{20612222}] [#{61}] [#{}]]

3-Dec 19:04
6014BrianHHere's the R2 version, though I haven't promoted the emitter to an option yet:

load-csv: funct [
    "Load and parse CSV-style delimited data. Returns a block of blocks."
    [catch]
    source [file! url! string! binary!]
    /binary "Don't convert the data to string (if it isn't already)"
    /with "Use another delimiter than comma"
    delimiter [char! string! binary!]
    /into "Insert into a given block, rather than make a new one"
    output [block!] "Block returned at position after the insert"
] [
    ; Read the source if necessary
    if any [file? source url? source] [throw-on-error [
        source: either binary [read/binary source] [read source]
    ]]
    unless binary [source: as-string source] ; No line conversion
    ; Use either a string or binary value emitter
    emit: either binary? source [:as-binary] [:as-string]
    ; Set up the delimiter
    unless with [delimiter: #","]
    valchar: remove/part charset [#"^(00)" - #"^(FF)"] join crlf delimiter
    ; Prep output and local vars
    unless into [output: make block! 1]
    line: [] val: make string! 0
    ; Parse rules
    value: [
        ; Value surrounded in quotes
        {"} (clear val) x: to {"} y: (insert/part tail val x y)
        any [{"} x: {"} to {"} y: (insert/part tail val x y)]
        {"} (insert tail line emit copy val) |
        ; Raw value
        x: any valchar y: (insert tail line emit copy/part x y)
    ]
    ; as-string because R2 doesn't parse binary that well
    parse/all as-string source [any [
        end break |
        (line: make block! length? line)
        value any ["," value] [crlf | cr | lf | end]
        (output: insert/only output line)
    ]]
    also either into [output] [head output]
        (source: output: line: val: x: y: none) ; Free the locals
]

All my tests pass, though they're not comprehensive; maybe you'll come up with more. Should I add support for making the row delimiter an option too?

3-Dec 19:03
6013BrianHThere's an ad-hoc de facto standard, but it's pretty widely supported. I admit, the binary support came as a bit of a surprise :)3-Dec 18:51
6012GreggAs far as standards compliance, I didn't know there was a single standard. ;-)3-Dec 18:47
6011BrianHBecause of R2's crappy binary parsing (yes, you can put binary data in CSV files) I used an emitter function in the R2 version. This could easily be exported to an option, to let you provide your own emitter function which does whatever conversion you want.3-Dec 18:07
6010KajSounds like a job for a dialect that specifies what is supposed to be in the columns3-Dec 17:13
6009BrianHI'm putting LOAD-CSV in the %rebol.r of my dbtools, treating it like a mezzanine. That's why I need R2 and R3 versions, because they use the same %rebol.r with mostly the same functions. My version is a little more forgiving than the RFC above, allowing quotes to appear in non-quoted values. I'm making sure that it is exactly as forgiving on load as Excel, Access and SQL Server, resulting in exactly the same data, spaces and all, because my REBOL scripts at work are drop-in replacements for office automation processes. If anything, I don't want the loader to do value conversion because those other tools have been a bit too presumptuous about that, converting things to numbers that weren't meant to be. It's better to do the conversion explicitly, based on what you know is supposed to go in that column.3-Dec 16:44
6008BrianHI figure that dealing with malformed data, or even converting the strings to other values, is best done post-process. Might as well take advantage of modifiable blocks.3-Dec 16:40
6007Ashley"it doesn't work if it trims the values." - that may not be the standard, but when you come across values like:

1, 2, 3

the intent is quite clear (they're numbers) ... if we retained the leading spaces then we'd be treating these values (erroneously) as strings. There's a lot of malformed CSV out there! ;)

3-Dec 8:59
6006BrianHIt needs to handle "" escaping too, but only in the case where values are quoted. Anyway, I have the function mostly done. I'll polish it up tomorrow.3-Dec 8:15
6005BrianHBut it doesn't assure that "a , b" -> ["a " " b"]. It doesn't work if it trims the values.3-Dec 8:12
6004AshleyActually, 4) above is easily solved by adding an additional switch case:

#" " [all [not s poke data i #"^""]]

This will ensure "a , b" -> ["a" "b"]

3-Dec 5:49
6003BrianHI'm working on a fully standards-compliant full-file LOAD-CSV - actually two, one for R2 and one for R3. Need them both for work. For now I'm reading the entire file into memory before parsing it, but I hope to eventually make the reading incremental so there's more room in memory for the results.3-Dec 0:31
6002BrianHIndividual values should not be trimmed if you want the loader to be CSV compatible. However, since TRIM is modifying you can post-process the values pretty quickly if you like.3-Dec 0:18
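
A small sketch of the post-processing BrianH mentions: because TRIM modifies its argument in place, the loaded block of blocks can be cleaned up without building a new one. The sample rows are made up for illustration.

```rebol
REBOL []

; Sample data shaped like LOAD-CSV output (a block of line blocks).
rows: [["1" " 2" " 3"] ["a " " b"]]

; TRIM changes each string in place, so no new block is needed.
foreach row rows [foreach value row [trim value]]

probe rows   ; [["1" "2" "3"] ["a" "b"]]
```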
6001Ashleyload-csv fails to deal with these 3 simple (and for me, common) cases:

1,"a
b"
2,"a""b"
3,

>> load-csv %test.csv
== [["1" "a"] [{b"}] ["2" "a" "b"] ["3"]]

I've reverted to an in situ brute force approach:

c: make function! [data /local s] [
    all [find data "|" exit]
    s: false
    repeat i length? trim data [
        switch pick data i [
            #"^"" [s: complement s]
            #"," [all [not s poke data i #"|"]]
            #"^/" [all [s poke data i #" "]]
        ]
    ]
    remove-each char data [char = #"^""]
    all [#"|" = last data insert tail data #"|"] ; only required if we're going to parse the data
    parse/all data "|^/"
]

which has 4 minor limitations:
1) the data can't contain the delimiter you're going to use ("|" in my case)
2) it replaces quoted returns with another character (" " in my code)
3) it removes all quote (") characters (to allow SQLite .import and parse/all to function correctly)
4) individual values are not trimmed (e.g. "a ,b" -> ["a " "b"])

If you can live with these limitations then the big benefit is that you can omit the last two lines and have a string that is import friendly for SQLite (or SQL Server) ... this is especially important when dealing with large (100MB+) CSV files! ;)

2-Dec 23:40
6000GreggI did head down the path of trying to handle all the things REBOL does wrong with quoted fields and such, but I have always found a way to avoid dealing with it.2-Dec 21:29
5999Greggload-csv: func [
    "Load and parse a delimited text file."
    source [file! string!]
    /with delimiter
    /local lines
][
    if not with [delimiter: ","]
    lines: either file? source [read/lines source] [parse/all source "^/"]
    remove-each line lines [empty? line]
    if empty? lines [return copy []]
    head forall lines [
        change/only lines parse/all first lines delimiter
    ]
]

2-Dec 21:25
5998GreggArgh. Shouldn't just post the first one I find. Ignore that. It doesn't handle file!.2-Dec 21:24
5997Greggload-csv: func [
    "Parse newline delimited CSV records"
    input [file! string!]
    /local p1 p2 lines
] [
    lines: collect line [
        parse input [
            some [p1: [to newline | to end] p2: (line: copy/part p1 p2) skip]
        ]
    ]
    collect/only rec [
        foreach line lines [
            if not empty? line [rec: parse/all line ","]
        ]
    ]
]

2-Dec 21:21
5996BrianHFor the purposes of discussion I'll put the CSV data inside {}, so you can see the ends, and the results in a block of line blocks.

This: { "a" } should result in this: [[{ "a" }]]

This: { "a^/b" } should result in this: [[{ "a}] [{b" }]]

This: {"a b"} should result in this: [[{a b}]]

This: {"a ""b"" c"} should result in this: [[{a "b" c}]]

This: {a ""b"" c} should result in this: [[{a ""b"" c}]]

This: {"a", "b"} should result in this: [["a" { "b"}]]

2-Dec 16:31
5995BrianHCSV is not supposed to be forgiving of spaces around commas. Even the "" escaping to get a " character in the middle of a " surrounded value is supposed to be turned off when the comma, beginning of line, or end of line have spaces next to them.2-Dec 16:18
5994BrianHMy func handles 100% of the CSV standard - http://tools.ietf.org/html/rfc4180 - at least for a single line. To really parse CSV you need a full-file parser, because you have to consider that newlines in values surrounded by quotes are counted as part of the value, but if the value is not surrounded completely by quotes (including leading and trailing spaces) then newlines are treated as record separators.2-Dec 16:13
5993BrianHIf there is a space after the comma and before the ", the " is part of the value. The " character is only used as a delimiter if it is directly next to the comma.2-Dec 16:08
5992EndoThese are also a bit strange:

>> parse-csv {"a", "b"}
== ["a" { "b"}]
>> parse-csv { "a" ,"b"}
== [{ "a" } "b"]
>> parse-csv {"a" ,"b"}
== ["a"]

2-Dec 9:52
5991AshleyAlso this case:

{"a,b" ,"c,d"} ; space *before* comma

This case

"a, b"

can be dealt with by replacing "keep any" with "keep trim any" ... but Brian's func handles 95% of the real-life test cases I've thrown at it so far, so a big thanks from me.

2-Dec 9:38
5990EndoBrianH: I tested the CSV parsing (R2 version); there is just a little problem with a space between comma and quote:

parse-csv: func [a][ use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]]

parse-csv {"a,b", "c,d"} ;there is a space after the comma
== ["a,b" { "c} {d"}] ;wrong result.

I know it is a problem on CSV input, but I think you can easily fix it and then parse-csv function will be perfect.

2-Dec 9:27
5989PeterWoodBrian - it may be here - http://snippets.dzone.com/posts/show/1281

2-Dec 7:37
5988BrianHI copied Ashley's example data into a file and checked against several commercial CSV loaders, including Excel and Access. Same results as the parsers above.2-Dec 6:50
5987BrianHThat operation would be a great thing to add to the R3 Parse Proposals :)2-Dec 6:24
5986BrianHI'm sure that the proposed PARSE for Topaz would allow the rule to be even smaller than the R3 version, because it includes COLLECT [KEEP] as PARSE operations.2-Dec 6:22
5985BrianHHere's a version that works in R3, tested against your example code:

>> a: deline read clipboard://
== {a, b ,"c","d1
d2",a ""quote"",",",}
>> use [x] [collect [parse/all a [some [[{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all x {""} {"}) | copy x [to "," | to end] (keep x)] ["," | end]]]]]
== ["a" " b " "c" "d1^/d2" {a ""quote""} "," ""]

But it didn't work in R2, leading to an endless loop. So here's the version refactored for R2 that also works in R3:

>> use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]
== ["a" " b " "c" "d1^/d2" {a ""quote""} "," ""]

Note that if you get the b like "b" then it isn't CSV compatible, nor is it if you escape the {""} in values that aren't themselves escaped by quotes. However, you aren't supposed to allow newlines in values that aren't surrounded by quotes, so you can't do READ/lines and parse line by line, you have to parse the whole file.

2-Dec 6:13
5984BrianHGregg, could you post your LOAD-CSV ?2-Dec 5:25
5983BrianHEspecially since I forgot that APPEND isn't native in R2 :(2-Dec 5:21
5982BrianHHere's the R2 version of TO-CSV and TO-ISO-DATE (Excel compatible):

to-iso-date: funct/with [
    "Convert a date to ISO format (Excel-compatible subset)"
    date [date!] /utc "Convert zoned time to UTC time"
] [
    if utc [date: date + date/zone date/zone: none] ; Excel doesn't support the Z suffix
    either date/time [ajoin [
        p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2 " " ; or T
        p0 date/hour 2 ":" p0 date/minute 2 ":" p0 date/second 2 ; or offsets
    ]] [ajoin [
        p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2
    ]]
] [
    p0: func [what len] [ ; Function to left-pad a value with 0
        head insert/dup what: form :what "0" len - length? what
    ]
]

to-csv: funct/with [
    "Convert a block of values to a CSV-formatted line in a string."
    [catch]
    data [block!] "Block of values"
] [
    output: make block! 2 * length? data
    unless empty? data [append output format-field first+ data]
    foreach x data [append append output "," format-field get/any 'x]
    to-string output
] [
    format-field: func [x [any-type!]] [case [
        none? get/any 'x [""]
        any-string? get/any 'x [ajoin [{"} replace/all copy x {"} {""} {"}]]
        get/any 'x = #"^"" [{""""}]
        char? get/any 'x [ajoin [{"} x {"}]]
        scalar? get/any 'x [form x]
        date? get/any 'x [to-iso-date x]
        any [any-word? get/any 'x any-path? get/any 'x binary? get/any 'x] [
            ajoin [{"} replace/all to-string :x {"} {""} {"}]
        ]
        'else [throw-error 'script 'invalid-arg get/any 'x]
    ]]
]

There is likely a faster way to do these. I have R3 variants of these too.

2-Dec 5:18
5981BrianHI use a TO-CSV function that does type-specific value formatting. The dates in particular, to be Excel-compatible. Was about to make a LOAD-CSV function - haven't needed it yet.2-Dec 5:01
5980GreggAshley, not sure exactly what you're after. I use simple LOAD-CSV and BUILD-DLM-STR funcs to convert each direction.1-Dec 23:11
5979EndoGeomol: It would be nice if trim/with supported charsets. I would also love to have "trace/parse", just like trace/net, which would give info about parse steps instead of all the trace output. Hmm, I should add this to the wish list, I think :)1-Dec 13:14
5978AshleyAnyone written anything to parse csv into an import-friendly stream?

Something like:

a, b ,"c","d1
d2",a ""quote"",",",

a|b|c|d1^/d2|a "quote"|,|

(I'm trying to load CSV files dumped from Excel into SQLite and SQL Server ... these changes will be in the next version of my SQLite driver)

1-Dec 11:46
5977EndoDoc: Thank you. I tried to do it that way (advancing the series position) but couldn't. I may add some more things, so I want to do it with parse instead of other ways. And I want to learn parse more :) Thanks to all!1-Dec 10:49
5976EndoStrange: I tried to remove the whole part in one go, but it's slower than the other:

aaa: [t: "abc56def7" parse/all t [some [x: some non-digit y: (remove/part x y) :x | skip]] head t]
bbb: [t: "abc56def7" parse/all t [some [x: non-digit (remove x) :x | skip]] head t]
>> benchmark2 aaa bbb ;(executes block 10'000'000 times.)
Execution time for the #1 job: 0:00:11.719
Execution time for the #2 job: 0:00:11.265
#1 is slower than #2 by factor ~ 1.04030181979583

1-Dec 10:46
5975DockimbelEndo: in your first attempt, your second rule in SOME block is not making the input advance when the end of the string is reached because (remove "") == "", so it enters an infinite loop. A simple fix could be:

t: "abc56xyz" parse/all t [any [digit (prin "d") | x: skip (prin "." remove x) :x]]

(remember to correctly reset the input cursor when modifying the parsed series)

As others have suggested, there are more optimal ways to achieve this trimming.

1-Dec 10:41
5974EndoOh, I think there is no need for "back":

t: "abc56xyz" parse/all t [some [x: non-digit (remove x) :x | skip]] head t

1-Dec 10:34
5973EndoIt depends on the input, but if it's a long text with many multiple chars to insert/remove your way will be faster. Thanks1-Dec 10:32
5972Gabrielenote that copying the whole thing is probably faster than removing multiple times. also, doing several chars at once instead of one at a time is faster.1-Dec 10:29
5971Endoa bit more clear:

t: "abc56xyz" parse/all t [some [x: non-digit (x: back remove x) :x | skip]] head t

1-Dec 10:29
5970EndoI just did the same thing:

t: "abc56xyz" parse/all t [some [x: non-digit (prin first x remove x x: back x) :x | skip]] head t

1-Dec 10:28
5969Gabriele(mm, not sure why the copy/past was messed up. i hope you get the idea anyway.)1-Dec 10:28
5968Gabriele>> s: "abc56xyz" == "abc56xyz" >> digit: charset "1234567890" == make bitset! #{ 000000000000FF03000000000000000000000000000000000000000000000000 } >> non-digit: complement digit == make bitset! #{ FFFFFFFFFFFF00FCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF } >> parse/all s [(o: copy "") any [mk1: some digit mk2: (insert/part tail o mk1 mk2) | some non-digit]] o == "56"1-Dec 10:27
5967EndoNice way, thank you. But still curious about how to do it with parse.1-Dec 10:23
5966GeomolAlternative not using parse:

>> t: "abc56xyz"
== "abc56xyz"
>> non-digit: ""
== ""
>> for c #"a" #"z" 1 [append non-digit c]
== "abcdefghijklmnopqrstuvwxyz"
>> for c #"A" #"Z" 1 [append non-digit c]
== {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ}
>> trim/with t non-digit
== "56"

1-Dec 10:18
5965EndoI want to keep the digits and remove all the rest: t: "abc56xyz" parse/all t [some [digit (prin "d") | x: (prin "." remove x)]] print head t This does the work but never finishes. If I add a "skip" to the second part the result is "b56y". How do I do it?1-Dec 10:15
5964LadislavHmm, to not complicate matters and hoping that it is the simpler variant I modified the CASE/NO-CASE proposal to use the

CASE RULE

and

NO-CASE RULE

syntax, since it really looks simpler to implement than the other possible alternatives.

15-Nov 21:18
5963GreggI like the idea of a CASE option. There haven't been many times I've needed it, but a few. Other things are higher on my priority list for R3, but I wouldn't complain if this made its way in there.15-Nov 17:48
5962BrianHBacktracking often happens within blocks too, but yes, that does limit the scope of the problems caused (it doesn't eliminate the problem, it just limits its scope). Mode operations also don't interact well with flow control operations like OPT, NOT and AND. What would NOT CASE mean if CASE has effect on subsequent code without being tied to it? As a comparison, NOT CASE "a" has a much clearer meaning.15-Nov 17:36
5961BrianHO(n) isn't bad if n is small, especially compared to other parts of the process. Most of my apps are bound by database or filesystem speed.15-Nov 17:29
5960Ladislav(which is exactly the case of the #localize-on / -off directives as well)15-Nov 6:59
5959LadislavRegarding CASE and backtracking: it is not a problem when the effect of the keyword is limited to the nearest enclosing block.15-Nov 6:59
5958LadislavAnyway, I am happy this does not influence my code15-Nov 6:50
5957Ladislav"I need CHANGE too, and the full version with the value you're changing to be an expression in a paren" - this changing during parsing is known to be O(n), i.e. highly inefficient. For any serious code it is a disaster15-Nov 6:49
5956BrianH(bad English grammar day)15-Nov 1:42
5955BrianHLadislav, multitasking and recursion is not the same thing as backtracking. We already have backtracking bugs, we don't need to mandate more.15-Nov 1:40
5954BrianHI would definitely not make that choice. I need CHANGE too, and the full version with the value you're changing to be an expression in a paren - the last part of the proposal that isn't implemented yet. That's at the top of my list.15-Nov 1:38
5953LadislavRegarding a KEEP keyword: may be a reasonable addition. I surely prefer KEEP, when choosing between KEEP and CHANGE.15-Nov 0:43
5952LadislavBTW, the limitation of CASE to just the next rule is not exactly necessary. I would like to point you e.g. to the description of the #localize-on #localize-off user-defined directive pair, which is defined so, that it will not have any problem with multitasking or recursion, yet the directives are not limited to just the subsequent value. (Robert plans to publish the source code and the documentation soon)15-Nov 0:41
5951BrianHWhat do you think of the KEEP operation from Topaz? A good idea, or out of scope for PARSE?14-Nov 21:55
5950LadislavWill have a look, and, will also use one ticket to let Carl know.14-Nov 21:45
5949BrianHIt's especially important to document the denied proposals, since the reasons for their denial would be instructive.14-Nov 21:45
5948BrianHarticle -> page14-Nov 21:44
5947BrianHWe really should go over that article and note which of the proposals was implemented, in which version, and which were denied and why.14-Nov 21:43
5946BrianHSure :)14-Nov 21:42
5945LadislavOK, so, do you think I should put the CASE proposal (mentioning your variant) to the article?14-Nov 21:42
5944BrianHI liked it at the time, at least the bounded modifier version, but of the unimplemented proposals it's not my highest priority.14-Nov 21:41
5943LadislavBut, CASE should be a simpler case ;-)14-Nov 21:40
5942LadislavWell, I am not pushing for it.14-Nov 21:40
5941BrianHThe biggest of which is that it hasn't been implemented yet :(14-Nov 21:39
5940LadislavHmm, REVERSE has more issues, I think14-Nov 21:38
5939BrianHOK, cool. You have to be careful with the "mode" term though. That tripped up some of the last round of parse proposals, such as REVERSE.14-Nov 21:38
5938Ladislav"OK, but you wouldn't need NO-CASE to end a CASE." - What I did propose was just the existence of such keywords, the exact implementation should be the one that is the simplest to implement, which may well be the one you mention.14-Nov 21:35
5937BrianHOK, but you wouldn't need NO-CASE to end a CASE. It would be another modifier, not a mode. Modes like that don't work with backtracking very well. So it would be like this:

case ["a" no-case "b" "c"]

not like this:

case "a" no-case "b" case "c" no-case

The two directives would be implemented as flags, like NOT.

14-Nov 21:31
5936Ladislav"How about a CASE operation that applies to the next rule, which could be a block? No NO-CASE operation required" - that is an error, even in that case you *would* need NO-CASE14-Nov 20:46
5935BrianHYou'd miss the /into option for incremental collecting and preallocation, but at least you wouldn't need to BIND/copy your rules.14-Nov 17:56
5934BrianHWhile we're at it, the KEEP operation from Topaz would be useful. I use PARSE wrapped in COLLECT, calling KEEP in parens, quite a bit.14-Nov 17:40
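A rough sketch of the pattern BrianH describes, assuming a REBOL 2 build that ships the COLLECT mezzanine and a DIGIT charset defined as shown. KEEP is called from a paren inside the rule; it is bound here only because the rule block is written inline in the COLLECT body, which is the binding chore a native KEEP keyword would remove.

```rebol
digit: charset "0123456789"

; collect every run of digits found in the input
runs: collect [
    parse/all "abc56def7" [
        any [
            copy num some digit (keep num)  ; KEEP is the function COLLECT supplies
            | skip                          ; everything else is ignored
        ]
    ]
]

probe runs   ; should print ["56" "7"]
```

Rules defined outside the COLLECT body would not see KEEP without extra binding, which matches the BIND/copy remark in the following message.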
5933BrianHIt would be a modifier, like OPT or 1.14-Nov 17:36
5932BrianHHow about a CASE operation that applies to the next rule, which could be a block? No NO-CASE operation required, and better to integrate with backtracking.14-Nov 17:34
5931Ladislav(for switching to case-sensitive mode, and e.g. a NO-CASE for switching to case-insensitive mode)14-Nov 16:00
5930LadislavI think, that it should not be overly complicated to achieve the goal e.g. by using a CASE keyword in PARSE.14-Nov 15:59
5929LadislavAnother Parse discussion subject:

It looked to me like a good idea to be able in one Parse pass to sometimes match some strings in a case-sensitive way and other strings in a case-insensitive way. This is not possible using the /CASE refinement, since the refinement makes all comparisons case sensitive, or if not used, all comparisons are case insensitive. Wouldn't it be good to be able to adjust the comparison sensitivity on-the-fly during parsing?

14-Nov 15:57
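Without such a keyword, the only workaround is at the character level: run the whole parse with /CASE and spell out the case-insensitive parts as explicit alternatives. A small sketch, assuming REBOL 2 string parsing; the rule name any-case-b is made up for illustration.

```rebol
; under /CASE every literal is compared case sensitively,
; so the insensitive part has to list both spellings
any-case-b: ["b" | "B"]

probe parse/case "aBc" ["a" any-case-b "c"]   ; true: "a" and "c" are lower case
probe parse/case "ABc" ["a" any-case-b "c"]   ; false: "A" does not match "a"
```

This gets unwieldy for anything longer than a character or two, which is the gap the proposal addresses.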
5928LadislavSorry for not continuing with it, Sunanda, but when I gave it a second thought, it did not look like a possible speed-up could be worth the source code complication.14-Nov 15:53
5927SunandaYes please!2-Nov 0:04
5926LadislavAre you still interested?1-Nov 23:57
5925LadislavBTW, I think, that there is a possible optimization not using the charset you mention1-Nov 23:55
5924Ladislavyes, right, that is an issue1-Nov 23:54
5923SunandaYes, it is doable with map! -- but, as I said, awkward.

Another issue (or perhaps just an unfixed bug) is the lack of case sensitivity with map!:

select/case make map! ["A" true] "a"
== true

The current work-around is to use binary rather than string data:

select make map! reduce [to-binary "A" true] to-binary "a"
== none

1-Nov 23:54
5922LadislavAnother solution is to use a sorted block and a binary search, which should be about the same speed as hash1-Nov 23:48
5921LadislavI think it is OK that way1-Nov 23:44
5920Ladislavso, keys are the entities, and the value is either true (for an entity) or none1-Nov 23:42
5919Ladislavas follows:

entities-map: make map! []
foreach entity entities-block [entities-map/:entity: true]

1-Nov 23:41
5918SunandaThat's true, but map! is a bit awkward for just looking up an item in a list.....Map! is optimised for retrieving a value associated with a key.1-Nov 23:40
5917Ladislav"Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash!" - no problem, in R3 you can use map!1-Nov 23:35
5916SunandaMy test data was heavily weighted towards the live conditions I expect to encounter (average text length 2000. Most texts are unlikely to have more than 1 named entity).

All three scripts produced the same results -- so top marks for meeting the spec!

Under my test conditions, Ladislav was fastest, followed by Geomol, followed by Peter.

Other test conditions changed those rankings....So nothing is absolute.

Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash!

Thanks for contributing these solutions -- I've enjoyed looking at your code and marvelling at the different approaches REBOL makes possible.

1-Nov 20:24
5915SunandaI've put aside looking at the powermezz for now, and simply decided to use one of the three case-specific solutions offered here.

I made some tweaks to ensure the comparisons I was making were fair (and met a previously unstated condition):
-- each in a func
-- each works case sensitively (as previously unstated)
-- use the complete entity set as defined by the W3C
-- changed Ladislav's Charset as some named entities have digits in their names
-- moved Peter's set-up of his entity list out of the function and into one-off init code.

It's been a fun hour of twiddling other people's code.....If you want your modified code -- please just ask.

Timing results next .....

1-Nov 20:15
5914SunandaWow -- thanks Gabriele. For me, your powermezz is a much overlooked gem.

I fear I have, in effect, badly implemented chunks of your functionality over the past few months while I've worked on an application that takes unconstrained text and constrains it to look okay in a web page and when printed via LaTeX.

I should have read the documentation first!

1-Nov 18:51
5913GabrieleSunanda, note that this is already available in the text encoding module: http://www.rebol.it/power-mezz/mezz/text-encoding.html1-Nov 10:47
5912LadislavRegarding the optimizations:

- my code is optimized for the case when there are many entities (hash! search, as Andreas suggested as well). When the number of entities is small, this optimization does not help
- my code is optimized for the case when the TEXT is large (append is much faster than in place insert); for small texts this optimization does not help

1-Nov 10:32
5911Ladislav(not that it cannot be applied, but, it is not efficient, in my opinion)1-Nov 10:27
5910Ladislav'The "skip-it" technique could also be applied to Ladislav's code.' - I do not think so1-Nov 10:26
5909PeterWoodAlso I feel using skip could be very slow if the text contains a lot of "non-matching text". The "skip-it" technique could also be applied to Ladislav's code.1-Nov 4:49
5908PeterWoodThat should read head text at the end of the function.1-Nov 4:45
5907PeterWoodPerhaps building a parse rule from the list of entities may be faster if there is a lot of text to process:

This assumes the entities are provided as strings in a block.

escape-amps: func [
    text [string!]
    entities [block!]
][
    skip-it: complement charset [#"&"]
    entity: copy []
    foreach ent entities [
        insert entity compose [(ent) |]
    ]
    head remove back tail entity
    parse/all text [
        any [
            entity
            | "&" pos: (insert pos "amp;" pos: skip pos 4) :pos
            | some skip-it
        ]
    ]
    head tex t
]

1-Nov 4:44
5906LadislavThis alternative does not use the COPY call, so, it has to be faster:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
    text [string!]
    entities [hash!]
    /local result pos1 pos2 pos3
][
    result: copy ""
    parse/all text [
        pos1:
        any [
            ; find the next amp
            thru #"&" pos2:
            [
                ; entity check
                some alpha pos3: #";" (
                    ; entity candidate
                    unless find entities copy/part pos2 pos3 [
                        ; not an entity
                        insert insert/part tail result pos1 pos2 "amp;"
                        pos1: pos2
                    ]
                )
                | (
                    ; not an entity
                    insert insert/part tail result pos1 pos2 "amp;"
                    pos1: pos2
                )
            ]
            | (insert tail result pos1) end skip ; no amp found
        ]
    ]
    result
]

1-Nov 1:02
5905AndreasTwo suggestions:

- store your named entities as a hash! (order of magnitude speedup for FIND)

- if you have loooong "words", restrict Ladislav's `some alpha` to the maximum length of a valid entity

1-Nov 0:44
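Both suggestions can be sketched in a few lines, assuming REBOL 2 (hash! does not exist in R3) and the short sample list from Sunanda's question. The bound 6 is simply the longest name in that sample; a real entity list would need its own maximum. ENTITY-NAME and KNOWN are illustrative names only.

```rebol
; hash! gives fast FIND lookups (REBOL 2 only)
named-entities: make hash! ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]

alpha: charset [#"a" - #"z" #"A" - #"Z"]

; a bounded repeat in place of SOME ALPHA, so a very long run of
; letters after "&" is rejected after at most 6 characters
entity-name: [1 6 alpha]

parse/all "crarr;" [copy name entity-name #";"]
known: found? find named-entities name
probe known
```

With a bound in place, a long word following a stray "&" costs a handful of character tests instead of a scan to the end of the word.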
5904SunandaThanks Ladislav and Geomol. Both your solutions work with my test data -- that's always a good sign :)

I'll do some timing tests with large entity lists ..... But I won't be able to do that for 24 hours.

Other approaches still welcome!

31-Oct 22:31
5903GeomolPekr, yeah, probably because I left out the /all refinement. Makes sense.31-Oct 22:02
5902LadislavWith TEXT defined:

>> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"

31-Oct 21:49
5901LadislavThis is how it works:

>> probe escape-amps text named-entities
{To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
== {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}

31-Oct 21:49
5900LadislavErr: pos3 should be added as a local31-Oct 21:46
5899Ladislav(= inefficient)31-Oct 21:44
5898Ladislav(in place inserts are too slow)31-Oct 21:43
5897LadislavI guess, that this should be efficient:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
    text [string!]
    entities [hash!]
    /local result pos1 pos2
][
    result: copy ""
    parse/all text [
        pos1:
        any [
            ; find the next amp
            thru #"&" pos2:
            [
                ; entity check
                some alpha pos3: #";" (
                    ; entity candidate
                    unless find entities copy/part pos2 pos3 [
                        ; not an entity
                        insert insert tail result copy/part pos1 pos2 "amp;"
                        pos1: pos2
                    ]
                )
                | (
                    ; not an entity
                    insert insert tail result copy/part pos1 pos2 "amp;"
                    pos1: pos2
                )
            ]
            | (insert tail result pos1) end skip ; no amp found
        ]
    ]
    result
]

31-Oct 21:43
5896PekrGeomol - your code basically works, no? Just use parse/all:

>> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
== false
>> s
== {To send, press the &larr; arrow &amp; then press &crarr;.}

31-Oct 21:07
5895LadislavYes, OK, I just wanted to know31-Oct 21:02
5894Sunandaecceptable ==> acceptable31-Oct 21:01
5893SunandaLadislav -- if it is not in the list, then I'd like it escaped, please. Think of it as a whitelist of ecceptable named entities. All others are suspect :)31-Oct 21:01
5892SunandaThe aim --- Basically, yes, Petr.31-Oct 20:59
5891Ladislav'I want to escape every "&" in the text, unless it is part of a named entity' - just to make sure: if the entity is not in the ENTITIES list, like e.g. &quot; and it is encountered in the given TEXT, what exactly should happen?31-Oct 20:59
5890Pekralso remember - parse does not count spaces in. You are better in using parse/all31-Oct 20:58
5889Pekrnot fluent with html escaping, what's the aim? To replace stand-alone #"&" with "&amp"?31-Oct 20:58
5888SunandaThanks for the quick contributions, geomol. I see a different result too -- a space between the "&" and the "amp"31-Oct 20:58
5887GeomolThat's strange. My 2nd suggestion gives a different result:

ne: ["larr;" | "crarr;"]
s: "To send, press the &larr; arrow & then press &crarr;."
parse s [
    any [
        thru #"&" [ne | mark: (insert mark "amp;")]
    ]
]
s

== {To send, press the &larr; arrow & amp;then press &crarr;.}

Seems like a bug, or am I just tired?

31-Oct 20:34
5886GeomolIt may be faster to drop the & from the entities and change the rule to:

any [thru #"&" [ne | mark: (insert mark "amp;")]]

31-Oct 20:31
5885Geomolne: ["&larr;" | "&crarr;"] ; and the rest of the named entities
s: "To send, press the &larr; arrow & then press &crarr;."
parse s [
    any [
        to #"&" [ne | skip mark: (insert mark "amp;")]
    ]
]
s

== {To send, press the &larr; arrow &amp; then press &crarr;.}

31-Oct 20:26
5884SunandaCan anyone gift me an efficient R2 'parse solution for this problem (I am assuming 'parse will out-perform any other approach):

SET UP
I have a huge list of HTML named character entities, eg (a very short example):
named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr" ] ;; etc
And I have some text that may contain some named entities, eg:
text: "To send, press the &larr; arrow & then press &crarr;."

PROBLEM
I want to escape every "&" in the text, unless it is part of a named entity, eg (assuming a function called escape-amps):
probe escape-amps text entities
== "To send, press the &larr; arrow &amp; then press &crarr;."

TO MAKE IT EASY....
You can assume a different set up for the named-entities block if you want; eg, this may be better for you:
named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;" ] ;; etc

Any help on this would be much appreciated!

31-Oct 18:39
5883onetomParse (YC S11): A Heroku For Mobile Apps. Great name for a startup... http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/4-Aug 20:36
5882SunandaDone, thanks.18-Jun 14:42
5881Steeveyep ;-)18-Jun 14:38
5880SunandaWant me to post it for you?18-Jun 14:38
5879Steevecan't post the response18-Jun 14:31
5878Steeveonly the second string is checked. Should be: ['apple some [and string! into ["a" some "b" ]]]18-Jun 14:30
5877SunandaQuestion on string and block parsing: http://stackoverflow.com/questions/639253318-Jun 14:12
5876Maximeh, didn't know it didn't ! yeah that sucks.15-May 6:35
5875onetomit should also honor line breaks within strings then15-May 6:08
5874Maximrule, when given as a string is used to specify the CSV separator.15-May 4:09
5873Maximparse/all string none actually is a CSV loader. it's not a split function. I always found this dumb, but it's the way Carl implemented it.15-May 4:08
5872TomcAlthough generally happy with the default parse separators, I find it negligent not to permit overriding them. And, as Max finds, block parsing is a rarity when working with real-world data streams.15-May 2:39
5871onetom>> parse/all {"asd qwe" zxc} none == ["asd qwe" " zxc"]

>> parse/all {"asd qwe" zxc} " " == ["asd qwe" "zxc"]

it's nice, but it also means there is no plain "split-by-a-character" function in rebol, which is just as annoying as missing a join-by-a-character

14-May 2:51
5870onetomit would make sense to settle w some CSV parser, but not as a default behaviour. i was already surprised that parse handles double quotes too...14-May 2:49
5869onetomthis is exactly the reason why CSV was a really fucked up idea. commas are there in sentences and multivalued fields, not just numbers. i always use TSV.14-May 2:48
5868GeomolBut it should be possible to take care of those numbers with commas, and ignore all other commas, I think. As we don't ever write 42, but always something like 42,00 if it's a decimal. So if 42, is seen, it can just be read as integer 42 and ignore the comma (if using load/all for example).13-May 13:48
5867GeomolI almost agree. Here we use comma as decimal point. A few countries do that. So all data with money amounts have numbers with comma as decimal point here.13-May 13:45
5866Maximso a comma would be an exact alias for a space, when its not within a string.13-May 13:43
5865Maximyes, I always thought that commas should be removed of decimals, and simply ignored when loaded.

in mechanical data, commas are never used for decimals. because apps need to load it back and all software accept that dots are for decimals and commas for separating lists. why should REBOL try to be different, its just alienating itself from all the data it could gobble up effortlessly.

13-May 13:42
5864GeomolAnd without space, comma should maybe split the text? Like: >> load/all "hello,world!" == [hello world!]13-May 13:39
5863GeomolSo do you suggest, load/all "hello, world!" should return [hello world!] ? (Notice no comma.)13-May 13:38
5862GeomolI have wondered sometimes, what effects it would have, if such commas were just ignored. We need commas in numbers, but maybe commas could just be ignored beside that.13-May 13:32
5861Maximmore like:

parse load/all "hello, world!" [2 word!]

13-May 13:30
5860Maximits happened often yes. less lately, since I'm dealing more with XML and less with raw data.13-May 13:30
5859GeomolDo you mean, you want to be able to parse like this?

>> parse [hello, world!] [2 word!]

13-May 13:28
5858Maximits because I do A LOT more parsing on strings than on blocks.... one of the reasons is that Carl won't allow us to ignore commas in string data. so the vast majority of data which could be read directly by rebol is incompatible.

this is still one of my pet peeves in rebol. trying to be pure, sometimes, just isn't useful in real life. PARSE is about managing external data, I hate the fact that PARSE isn't trying to be friendly with the vast majority of data out there.

13-May 13:25
5857GeomolMaxim, you asked for a function version of string parse. Was that because of situations like this?13-May 6:04
5856Maximyes it should. :-(13-May 1:24
5855onetom>> parse/all "/docs/rfq/" "/" == ["" "docs" "rfq"]

shouldn't this be either ["docs" "rfq"] or ["" "docs" "rfq" ""] for the sake of consistency?

13-May 1:22
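For comparison, a plain split-by-a-character function with the consistent behaviour asked for here can be sketched with PARSE itself. This assumes REBOL 2 string parsing; split-by is a hypothetical name, not an existing REBOL function. The `any [field copy ""]` guard is there because R2's COPY yields none for an empty match.

```rebol
split-by: func [
    "Hypothetical helper: split STR at every CH, keeping empty fields"
    str [string!] ch [char!]
    /local out field
][
    out: copy []
    parse/all str [
        ; one field per delimiter found
        any [copy field to ch skip (append out any [field copy ""])]
        ; whatever follows the last delimiter, possibly nothing
        copy field to end (append out any [field copy ""])
    ]
    out
]

probe split-by "/docs/rfq/" #"/"
```

For "/docs/rfq/" this should give ["" "docs" "rfq" ""], the second of the two consistent results suggested above; quotes get no special treatment.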
5854BrianHIf you're going to make a better parse, it might be good to take into account the efforts that have already started to improve it in R3. The R3 improvements need a little work in some cases, but the thought that went into the process is quite valuable.

[set end ...] or [copy end ...]: In R3, using any PARSE keyword (not just 'end) in a rule for other reasons triggers an error.
>> parse [a] [set end skip]
** Script error: PARSE - command cannot be used as variable: end

[any end] or [some end]: What Ladislav said.

[opt end]: The point of the combination is [opt [end (do something)]]. [opt anything] is no more useless than [opt end]. Don't exclude something that has no effect just for that reason. Remember, [none] has no effect as well, but it's still valuable for making rules more readable.

4-May 17:22
5853LadislavI mean SOME and ANY4-May 8:48
5852LadislavAs to the WHILE keyword: some people may never use it, being content with SOME and AND as they work in R34-May 8:48
5851Geomolok :)4-May 8:44
5850LadislavThis is much simpler than your exception:

- actually working, your exception does not
- not slowing down parsing

4-May 8:43
5849GeomolI try to keep it simple.4-May 8:42
5848GeomolHere: http://www.rebol.com/r3/docs/concepts/parsing-summary.html#section-11

"Input position must change". And the solution was to invent a new keyword, WHILE. Hm...

4-May 8:42
5847LadislavYou should rather look up how the "infinite loop problem" when using ANY and SOME was solved4-May 8:34
5846LadislavWhat you suggest is just a bunch of exceptions in the behaviour, which is always bad4-May 8:33
5845Ladislav"[opt end] ...I suggest to make it produce an error."

- not reasonable, the rule *is* legitimate, as you noted

4-May 8:30
5844Ladislav[any end] and [some end]: As we don't have warnings, I suggest these to produce errors.

- it is impossible to trigger errors every time an infinite loop is encountered
- this case has been discussed and the solution was found already

4-May 8:27
5843GeomolThese are just suggestions to make a better PARSE. I've learnt, it's a good idea to not allow most combinations of keywords in R2 parse. Another example:

>> parse [] [opt into ['a]]
== false
>> bparse [] [opt into ['a]]
** User Error: Invalid argument: into

The PARSE result is wrong, as I see it. My BPARSE produces an error. Better?

4-May 7:12
5842Geomol[any end] and [some end]
As we don't have warnings, I suggest these to produce errors. They can produce endless loops, and that should be pointed out in the docs, if they don't produce errors.

[opt end]
Yes, it's legit, but what's the point of this combination? At best, the programmer knows what she does, and the combination will do nothing else than slow the program down. At worst, the programmer misinterprets this combination, and since it doesn't produce an error or anything, it's a source of confusion. I suggest to make it produce an error.

[into end]
Produces an error today, so fine.

[set end ...] and [copy end ...]
I wasn't thinking of [set var end], but about setting a var named end to something, like [set end integer!]. The problem with this is that now the var, end, can be used and looks exactly like the keyword, end, maybe leading to confusion. But after a second thought, maybe this being allowed is ok.

[thru end]
Making this produce an error will solve the problem with the confusion around what this combination means. And in the first place, it's a bad way to produce a 'fail' rule (in R2, in R3 it has the value true, and parsing continues). It's slow compared to e.g. [end skip].

4-May 6:57
5841BrianHSo you want to allow COPY, SET and OPT. Warn about THRU (because of the bug), ANY and SOME, because of R3 compatibility. Trigger an error for INTO if its argument rule isn't a block or a word referring to a block, but nothing special if that rule is END.2-May 18:20
5840BrianH[set var end] sets the var to none; [copy var end] sets to none in R2, the empty string/block in R3; [thru end] doesn't match, so it should just get a warning in case the rules were written to expect that; [opt end] is definitely legit; perhaps [any end] and [some end] should get warnings for R2, but keep in mind that rules like [any [end]] and [some [end]] are much more common, have the same effect, and are more difficult to detect; [into end] properly triggers an error in R2 and R3 because the end is not in a block, while [into [end]] is legit and safe.2-May 18:14
5839GeomolMaybe it would be a good idea to make all these combination trigger an invalid argument error?

any end
some end
opt end
into end
set end ...
copy end ...
thru end

and then only let to end be valid.

2-May 9:43
5838GeomolIn parse, NONE is a keyword unless it comes after TO or THRU, then it's looked up.

>> parse [#[none!]] [none] ; as a keyword
== false
>> parse [#[none!]] [thru none] ; looked up
== true

Same behaviour in R2 and R3.

2-May 6:43
5837GeomolIt can't be stopped using PARSE, it seems.2-May 6:12
5836LadislavNobody should expect an infinite cycle to stop.2-May 6:11
5835LadislavYes, but that is OK, it is just an infinite cycle2-May 6:11
5834GeomolWith bparse, this hangs:

bparse [a b c] [some [none]]

but it can be stopped by hitting <Esc>.

2-May 6:10
5833Ladislav(which does not look good as well)2-May 6:10
5832LadislavNevertheless, I messed it up. The NONE rule probably cannot fail, but it can consume some input.2-May 6:10
5831LadislavThat is not related2-May 6:09
5830GeomolMaybe the last section here: http://en.wikibooks.org/wiki/REBOL_Programming/Language_Features/Parse/Parse_expressions#Troubleshooting2-May 6:07
5829LadislavNevermind, I do not remember. The NONE rule is described in the wikibook, so it can be found in there, I guess.2-May 6:06
5828GeomolCan't remember. Give me an example.2-May 6:03
5827LadislavBTW (looks a bit unlucky to me), do you know that in REBOL the NONE rule can fail?2-May 6:02
5826GeomolFrom your idioms it can also be seen, that OPT can be dropped easily.2-May 6:00
5825LadislavAnother variant that comes to mind is

empty: quote ()

2-May 5:53
5824LadislavFor strings, the

empty: ""

should work as well, but it does not.

2-May 5:52
5823GeomolIt could be interesting to create an absolutely minimal PARSE function, that can handle all we expect from such a function but with as little code as possible (as few keywords as possible).2-May 5:51
5822LadislavHmm, as it looks, we could do without the empty string, we could use the rule like:

empty: []

2-May 5:49
5821GeomolIs the "empty string rule" covered by putting a | without anything after it? Like in:

>> parse [] ['a |]
== true
>> parse [] ['a | none]
== true

2-May 5:48
5820LadislavYou can find something in the Wikipedia:

http://en.wikipedia.org/wiki/Parsing_expression_grammar

http://en.wikipedia.org/wiki/Top-down_parsing_language

2-May 5:48
5819GeomolOk, what is a good source of information to read about parsing in general? The Top Down Parsing Language family etc.?2-May 5:46
5818LadislavThe "empty string rule" (represented by the NONE keyword in REBOL) is absolutely necessary to have. All other members of the Top Down Parsing Language family have it as well.2-May 5:44
5817GeomolYes, and that should work in all cases, if the b rule is found, complex or not. And this will return true, if b is END, because END is a repeatable rule (you can't go past it with SKIP).

NONE is also repeatable, and if you look in the code, I have to take care of this too separately. This means we can't parse none of datatype none! by using the NONE keyword, but we can using a datatype:

>> parse reduce [none] [none]
== false
>> parse reduce [none] [none!]
== true

So it raises the question, if the NONE keyword should be there? What are the consequences, if we drop NONE as a keyword? And are there other repeatable rules beside END and NONE? In R2 or R3.

2-May 5:40
5816LadislavBut, the recursive description:

a: [b | skip a]

is quite natural.

2-May 5:22
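A small sketch of that recursive formulation, assuming REBOL 2 string parsing. The names A and B follow Ladislav's message; DIGIT is an added helper, and B is deliberately a compound rule (a run of two digits) rather than a plain literal.

```rebol
digit: charset "0123456789"
b: [2 digit]            ; a compound rule
a: [b | skip a]         ; match B here, or skip one item and try again

probe parse/all "xy42" [a]            ; true: A consumes up to and including "42"
probe parse/all "xy42z" [a to end]    ; true: A stops right after "42"
probe parse/all "xyz" [a]             ; false: SKIP fails at the tail, so A fails
```

The recursion goes one level deeper for every skipped item, so on long inputs an iterative form may be the safer choice.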
5815Geomolyeah :)2-May 5:21
5814Ladislav"because it does in many cases" - should rather be "because THRU is so limited, that it is unable to handle many cases"2-May 5:20
5813GeomolIn R2 parsing a block:

>> parse ["abc"] [to "abc" skip]
== true
>> parse ["abc"] [thru "abc"]
== true

I know, it's different when parsing a string instead of a block. My comparison of [thru rule] to the alternatives was meant as a loose comparison, not to be taken literally. So it's easy to think of THRU to work this way, because it does in many cases, therefore the confusion.

2-May 5:16
5812LadislavGeomol:

[to rule skip]

does not mean the same as

[thru rule]

, as can be demonstrated when comparing the behaviour of

thru rule

for

rule = "abc"

It is quite a surprise for me, that you don't see the difference.

1-May 21:50
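The difference shows up as soon as the input is a string and the rule is longer than one character. A minimal illustration, assuming REBOL 2 with parse/all:

```rebol
; string input, rule = "abc"
probe parse/all "abc" [thru "abc"]       ; true: THRU moves past the whole match
probe parse/all "abc" [to "abc" skip]    ; false: TO stops before "a", SKIP eats one character, "bc" is left
```

With block input, as in Geomol's example, "abc" is a single item, so SKIP happens to consume all of it and the two forms agree.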
5811BrianHWe're probably fine with the wording we got. Though strangely enough, | is the ELSE of the IF operation. ELSE is a more descriptive name for | than OR in general.1-May 21:10
5810Geomol& and ! maybe?1-May 21:08
5809BrianHWe used up that luck though when we called the lookahead-match operation AND, and the lookahead-non-match operation NOT.1-May 21:06
5808Geomolparse [a b c] ['aAND'bAND'cEND] hmm, yeah, you've got a point.1-May 21:03
5807BrianHConsidering that the space character is the closest thing to AND if | is OR, we should consider ourselves to have gotten off lucky :)1-May 21:01
5806GeomolRight, I just wondered, now that REBOL calls e.g. floats decimals etc., with many attempts to make the language more humane.1-May 20:59
5805BrianHParsing tradition. And it's not really OR, it's backtracking alternation.1-May 20:55
5804GeomolWhen programming it, I also wondered, why the or keyword is | and not OR. Do you know the reason?1-May 20:52
5803Geomolyes1-May 20:50
5802BrianH(or the context equivalent of modules for R2)1-May 20:49
5801BrianHFor the mezzanine version, two functions might be better, though they can share code in the same module. Maybe just have one exported word for a dispatch function though.1-May 20:48
5800Maximnah, it would just use up another word. there is no ambiguity in the case of parse, as lets say ADD. where the same datatype may mean two things.1-May 20:47
5799Maximand I always prefix my rules to have them stand out from keywords.1-May 20:46
5798GeomolI in general very much like the idea, that many rebol functions can take different datatypes and work anyway. But I was thinking, if parsing blocks and parsing strings is so different, that it should be two functions?1-May 20:46
5797BrianHMost people tend to not use 'skip as a variable anyways, because of the SKIP function.1-May 20:46
5796Geomolok1-May 20:45
5795BrianHThat doesn't work with string parsing.1-May 20:45
5794GeomolHaving skip as a keyword means you can't use that word as a variable.1-May 20:44
5793MaximI'd drop any-type! :-)1-May 20:44
5792GeomolI think about downgrading. :-) You know, keep it simple. Like dropping SKIP as it's the same as any-type! etc. If I want SKIP, I can just define it then: skip: :any-type!1-May 20:43
5791Maxim(to red or R3 parse, depending on how you see "upgrade" ;-)1-May 20:43
5790Maximbah, I'd just stick with R3 parsing for Red. it'll be a good incentive for some to upgrade.1-May 20:41
5789BrianHIt would also be useful to have an R3-compatible PARSE for R2. And both for Red.1-May 20:41
5788GeomolSome day probably. Let's see, how it goes with bparse first.1-May 20:40
