Previous Next

Thread: Subquery in WHERE clause

Last post 08-22-2008 2:35 PM by deathwagon. 3 replies.

Average Rating Rate It (5)

RSS

Page 1 of 1 (4 items)

Sort Posts:

  • 08-20-2008, 4:38 PM

    Subquery in WHERE clause

    Hi all,

    I would like to know which directories on my servers have more than 100 files or more than 50MB worth of files and of the directories that exceed either of those parameters I want to know which extension types (and how many files of each and sum(size) of those files) are in the directory. 

    The below query kinda works however it only returns paths where a single file type has more than 100 files or has a sum(size) or greater than 52428800 Bytes (50MB). So if I have a directory (eg Z:PUB) that uses 1GB and 500MB is doc files, 451MB is xls files and 49MB is txt files it will only return results on the doc and xls files and I will have no idea what other file types are in the directory. I need to base my size and file total criteria off of the directory as a whole not the specific file types within.

    This will tell me that Z:\PUB contains 500MB of doc files and 451MB of xls files however it will return nothing about the 49MB of txt files.  

    SELECT count(*) as Files, TO_LOWERCASE(EXTRACT_PATH(Path)) as Path, TO_LOWERCASE(EXTRACT_EXTENSION(Path)) as Extension,
    MIN(LastWriteTime) as MinLastModified, MAX(LastWriteTime) as MaxLastModified,
    DIV(DIV(MUL(1.0, SUM(Size)), 1024), 1024) as SizeMB, DIV(MUL(1.0, SUM(Size)), 1024) as SizeKB, SUM(Size) as SizeBytes, SYSTEM_TIMESTAMP() as LastUpdate
    INTO Filecount
    FROM %source%
    GROUP BY TO_LOWERCASE(EXTRACT_PATH(Path)), TO_LOWERCASE(EXTRACT_EXTENSION(Path))
    HAVING Sizebytes > 52428800
    OR Files > 100

    Adding an additional SELECT statement in my WHERE clause would get me what I want as I could apply my first select statement only to directories that exceed the criteria defined in my where clause and still keep my group by extension type information as well. Unfortunately when I attempt this I am running into syntax errors.

    Here's the query using a subquery in my where clause. Variables are passed from a Perl script and are the same in both select statements.

    SELECT count(*) as Files, TO_LOWERCASE(EXTRACT_PATH(Path)) as Path, TO_LOWERCASE(EXTRACT_EXTENSION(Path)) as Extension,
    MIN(LastWriteTime) as MinLastModified, MAX(LastWriteTime) as MaxLastModified,
    DIV(DIV(MUL(1.0, SUM(Size)), 1024), 1024) as SizeMB, DIV(MUL(1.0, SUM(Size)), 1024) as SizeKB, SUM(Size) as SizeBytes, SYSTEM_TIMESTAMP() as LastUpdate
    INTO Filecount
    FROM %source%
    WHERE (SELECT COUNT(*) FROM %source% WHERE Size > 0 GROUP BY TO_LOWERCASE(EXTRACT_PATH(Path))) > 100
    GROUP BY TO_LOWERCASE(EXTRACT_PATH(Path)), TO_LOWERCASE(EXTRACT_EXTENSION(Path))

     This returns the follow error when I run it

    "Syntax Error: <term2>: expecting matching closing parenthesis instead of token 'COUNT(*)'

    I really don't have the option of running the job twice to parse through my first pass as the job will already take a long time. I also cannot remove the file size or file number criteria (though the numbers being used here are just examples).

    Any help would be greatly appreciated.

    Thanks!

  • 08-21-2008, 9:05 AM In reply to

    • yellowdog.dave
    • Top 50 Contributor
    • Joined on 07-18-2008, 7:17 AM
    • Johannesburg, South Africa
    • Posts 83

    Re: Subquery in WHERE clause

    Hi Death, 

    I have had similar frustrations with sub query scenarios. Basically, and this is as far as I can ascertain, they won't work.

    Although I identify with your sentiment of not wanting to do a second pass, what I have done to workaround this limitation in my scenarios is to pump the data directly into an ODBC source, such as an M$ Access database table. I then have standard queries that I quickly and simply apply to the resulting subset to get what I need. 

    This does have a residual benefit, as it can also provide users with a quick and easy way of finding things that they are looking for without having to be technically savvy.

    Sorry that the news is not good. 

    Cheers, Dave 

     

    Yes, dear
  • 08-21-2008, 9:57 AM In reply to

    • yellowdog.dave
    • Top 50 Contributor
    • Joined on 07-18-2008, 7:17 AM
    • Johannesburg, South Africa
    • Posts 83

    Re: Subquery in WHERE clause

    Just had another thought and did a little experiment, I took the output of a query and piped it into another query. This worked and on my system, and appeared to not have a huge impact on run time. Although it will take you some jiggery-pokery to extract the fields you are interested in, it may be the way to achieve what you want to do in one pass.

    Here are my queries and results (on it must be said a very klunky system), but they could be a little skewed by caching;

    C:\projects\logparser\>logparser "select path,name,size into zub.csv from c:\*" -i:fs

    Statistics:
    -----------
    Elements processed: 226114
    Elements output:    226114
    Execution time:     286.69 seconds (00:04:46.69)


    C:\projects\logparser\>logparser "select path,name,size from c:\*.*" -i:fs -q|logparser "select * into zub.csv from stdin"
    WARNING: Input format not specified - using TEXTLINE input format.

    Statistics:
    -----------
    Elements processed: 226114
    Elements output:    226114
    Execution time:     283.75 seconds (00:04:43.75)

    I don't know if you think this may be worth a try or not, possibly to make your EXTRACT_TOKEN easier, you may want to select a separator literal such as a |  or ~ in the first query.

    Let us know if it helps. 

     

    Yes, dear
  • 08-22-2008, 2:35 PM In reply to

    Re: Subquery in WHERE clause

    Dave,

    Thanks for the response! I appreciate the input and am grateful of having had something to try unfortunately though this doesn't work for my specific needs. The problem seemed to be that I need to aggregate my data in the second set based on the first set which it appears isn't fully created - at least that is the best way I can explain the errors I was receiving.

     I ended up dumping the first set of results into a csv and then running my second query against that. That worked as well as I should have expected it however I ran into the same problem due to the type of data I was trying to put together just not working together - one set being distinct and the other aggregate. Had I thought more about that rather than a way around it I could have saved myself a lot of time. ;) I do appreciate your input though and have thought of a few other instances that your suggestion will likely come in handy. I'm impressed by the processing times.

    Ultimately for my purposes I ended up running two queries. The first being the aggregate file numbers and size and the second being the distinct file extension types within those directories. These write to two different tables and I can join the two on directory to get what I was after. The distinct file extension query runs pretty fast so overall this isn't adding too much to my total processing time... at least in testing.

    Thanks again!

Page 1 of 1 (4 items)
Page view counter