problem extracting quoted strings from a tsv format fileRSS

6 replies

Last post May 24, 2018 07:29 PM by carehart

  • problem extracting quoted strings from a tsv format file

    Oct 07, 2008 12:19 PM|carehart|LINK

    Folks, I'm a long-time fan (and indeed evangelist) of Log Parser. I tell people about it all the time, but I've been stumped today and would appreciate any help from another pair of eyes.

    I process several different logs from a vendor whose columns are space-separated, so I use the TSV format with -iseperator:space. That works great, except that one of the logs has a field which tracks a string (within one of the columns), so the vendor has chosen to use a double-quote around that string. Unfortunately, that breaks log parser's recognition of the value in that column. Here's an example of a line that might be in the file:

    col1vala col2val "col3 valuea"
    col1valb col2valb "col3 valueb"

    And I could use: "select field1, field2" and get things ok. but if I add field3, the values shown for that column for both records is just:

    "col3

    Because it saw the space within the string as being the space separator.

    Of course, I want as the answer instead:

    "col3 valuea"
    "col3 valueb"

    or better:

    col3 valuea
    col3 valueb

    Is there any way to do this? I saw an earlier note (from 2004) where someone wrote that TSV files are not supposed to have quoted values. OK, but I can't get the vendor to change that, and what they did made at least a little sense. I suppose if they changed to using a tab to separate the columns instead, I'd be ok.

    But short of getting them to do that, can anyone think of a solution? Thanks in advance.

    /charlie
  • Re: problem extracting quoted strings from a tsv format file

    Oct 07, 2008 03:03 PM|robmcm|LINK

    Using a tab character as a separator is probably your best option, and it also has the added advantage of allowing you to open the TSV file in Excel if you need to. ;-]

    Robert McMurray [MSFT]
  • Re: problem extracting quoted strings from a tsv format file

    Oct 07, 2008 03:34 PM|KristoferG|LINK

    Hi!

    I cannot figure out any other way than either ask the vendor to change the format, or change it yourself (in one or another way). Personally, i would just write a piece of code to change the space separator to a comma, and then use the CSV input format with the iDQuotes parameter.

     

    Kristofer Gafvert

    http://www.it-notebook.org
  • Re: problem extracting quoted strings from a tsv format file

    Oct 07, 2008 03:47 PM|carehart|LINK

    Just to be clear, while I recognize that getting the vendor to change the file to use tabs (or commas) would be one solution, as I noted, that won't be trivial.

    I just want to keep this open in case anyone else may think of another solution, short of them doing that, and for my current files (and those of others using this software). Thanks.

    /charlie
  • Re: problem extracting quoted strings from a tsv format file

    Oct 08, 2008 05:16 AM|yellowdog.dave|LINK

    Sorry about all the editing and reposting to you, it seems that in Chrome the formatting gets a bit poked, so went back to FF - not that you IIS people would care ;-)

    Hi Care,

    We would need to understand how consistent this behaviour is in the log and what you want to  do with the results to give you fully workable answers.

    The following method will work in the instance that you have described and could work globally provided your log file consistently has 2 words separated by a space inside the quotes or no additional trailing fields;

    logparser "select field1,field2,strcat(replace_chr(field3,'\"',''),replace_chr(field4,'\"','')) as newfield from quotes.tsv" -iseparator:space -headerRow:OFF

    This gives you the result below;
    field1   field2   newfield
    -------- -------- ----------
    col1vala col2val  col3valuea
    col1valb col2valb col3valueb

    However, although more longwinded, this would be the best way to get it all out consistently - in this instance, I added a fourth field after your quoted field

    This statement;

    logparser "select extract_token(text,0,' ') as field1,extract_token(text,1,' ') as field2,extract_token(text,1,'\"') as field3,extract_token(text,2,'\"') as field4 from quotes.tsv" -i:textline

    Gives these results;

    field1   field2   field3      field4
    -------- -------- ----------- ---------
    col1vala col2val  col3 valuea  col4val
    col1valb col2valb col3 valueb  col4vala

    I hope that this helps, if you can give a sample of real data I will work it further for you.

    Cheers, Dave

  • Re: problem extracting quoted strings from a tsv format file

    Apr 02, 2010 07:45 PM|carehart|LINK

    Wow, Dave, if you ever read this, I owe you an apology. You responded to my query above a couple of years ago but I either wasn't notified or never noticed, and I'm really sorry.

    As for your answer, well, it would work for many, I'm sure, so I'm very glad you shared it, but there are two reasons neither work for me. For the first one, sadly, I *do* have fields that follow the one in question. I just didn't think to mention that in my example. For the second one, sadly, I have other fields that precede the one in question which may or may not also have quoted values, so I can't rely on finding the given pair as the "first" or "second", and so on. So I am still stuck.

    Still, perhaps others have benefited from your suggestion.

    If anyone is interested in resolving what I've riased, here's a better example of what I have and want (the quoted string "I want this"):

    col1vala col2val "I want this" "col4val"
    col1valb "" "I want this" "col4val"

    Stumper. :-)

    /charlie

    /charlie
  • Re: problem extracting quoted strings from a tsv format file

    May 24, 2018 07:29 PM|carehart|LINK

    I'd like to bump my question from 2008. There was never a solution (though a few kind folks tried back then.) Perhaps someone seeing it anew may have a suggestion. (I simply had to stop using LP against this file, but the need has come up again, and if I could it would be otherwise perfect for the analysis I must do.)

    carehart

    Folks, I'm a long-time fan (and indeed evangelist) of Log Parser. I tell people about it all the time, but I've been stumped today and would appreciate any help from another pair of eyes.

    I process several different logs from a vendor whose columns are space-separated, so I use the TSV format with -iseperator:space. That works great, except that one of the logs has a field which tracks a string (within one of the columns), so the vendor has chosen to use a double-quote around that string. Unfortunately, that breaks log parser's recognition of the value in that column. Here's an example of a line that might be in the file:

    col1vala col2val "col3 valuea"
    col1valb col2valb "col3 valueb"

    And I could use: "select field1, field2" and get things ok. but if I add field3, the values shown for that column for both records is just:

    "col3

    Because it saw the space within the string as being the space separator.

    Of course, I want as the answer instead:

    "col3 valuea"
    "col3 valueb"

    or better:

    col3 valuea
    col3 valueb

    Is there any way to do this? I saw an earlier note (from 2004) where someone wrote that TSV files are not supposed to have quoted values. OK, but I can't get the vendor to change that, and what they did made at least a little sense. I suppose if they changed to using a tab to separate the columns instead, I'd be ok.

    But short of getting them to do that, can anyone think of a solution? Thanks in advance.

    /charlie