Difference between revisions of "Dusql"


Revision as of 22:44, 27 November 2019

Template:Needs Update To update when it is working on Gadi + add link to it

dusql is a disk usage analysis tool developed by CMS to help deal with data on our storage areas at NCI.

dusql is installed in the 'unstable' CMS conda environment. To use it, run:

module use /g/data/hh5/public/modules
module load conda/analysis3-unstable

dusql --help

dusql queries a file list stored in a database rather than scanning the filesystem directly, which reduces the load on the filesystem. This database is updated nightly, just like short_files_report, so it won't notice changed or deleted files until the next day.

We only scan the CLEX projects w35, w40, w42, w48, w97, v45 and hh5, and only members of these projects can access the database.
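
A quick way to see whether any of your projects are covered is sketched below. It assumes that project membership shows up as Unix group membership (the --group examples further down suggest project names double as group names); the helper name is_scanned_project is made up for illustration and is not part of dusql.

```shell
# Projects scanned by dusql, as listed above.
scanned_projects="w35 w40 w42 w48 w97 v45 hh5"

# Hypothetical helper: succeeds if the given name is one of the scanned projects.
is_scanned_project() {
    case " $scanned_projects " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print each of your Unix groups that is a scanned project
# (assumption: project membership is reflected in the output of `id -Gn`).
for group in $(id -Gn); do
    if is_scanned_project "$group"; then
        echo "$group"
    fi
done
```

If nothing is printed, you are probably not a member of a scanned project and will not be able to query the database.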

Commands

ncdu: Finding Files Interactively

The simplest way to find files is to use the interactive viewer, dusql ncdu. This is a basic text interface that shows how many files match a given condition in each directory.

Say you want to find big files in your /short directory. You might run dusql ncdu /short/$PROJECT/$USER --size=10gb to find all the files larger than 10 GB.

du: Summarising a Directory

dusql du works the same way as ncdu: it shows the total size and file count of files matching some constraint under a directory, but rather than opening the text interface it prints a summary for each directory to the screen. You can also give it multiple directories, e.g. to find files under the current directory older than 3 years:

$ dusql du * --mtime=-3y | sort -hr
304.99GB,     6624 files, um-ostia
  4.76GB,      223 files, wrf-era
  3.41GB,     1003 files, access-cm2-ukca
  1.94GB,       98 files, mpas
919.57MB,        1 files, nu-wrf_v8-wrf371-lis71rp7.tgz

It's helpful to pipe the output of dusql du to sort -hr as shown above to order the paths by size, or sort -nr -k 2 to sort by file count.
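
To see what the two sort invocations do without querying the database, the sketch below replays the sample output shown above through each of them. The function sample_du_output is a stand-in for a real dusql du call and is not part of dusql; the sort options are those of GNU sort.

```shell
# Stand-in for `dusql du * --mtime=-3y`: the sample output from above.
sample_du_output() {
cat <<'EOF'
304.99GB,     6624 files, um-ostia
  4.76GB,      223 files, wrf-era
  3.41GB,     1003 files, access-cm2-ukca
  1.94GB,       98 files, mpas
919.57MB,        1 files, nu-wrf_v8-wrf371-lis71rp7.tgz
EOF
}

# Largest total size first: -h compares human-readable sizes (MB < GB).
sample_du_output | sort -hr

echo "---"

# Largest file count first: -k 2 sorts on the second whitespace-separated field.
# Order: um-ostia, access-cm2-ukca, wrf-era, mpas, nu-wrf_v8-wrf371-lis71rp7.tgz
sample_du_output | sort -nr -k 2
```

Sorting by count is useful when the problem is the number of files rather than their total size.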

find: Listing Individual Files

dusql find prints the paths of all matching files. It can be helpful if there are just a few files you're trying to track down:

$ dusql find . --mtime=-7y | head
/short/w35/saw562/scratch/spherepack3.2/Makefile
/short/w35/saw562/scratch/wrf-era/FILE:2006-03-02_18
/short/w35/saw562/scratch/wrf-era/SST:2006-03-03_12
/short/w35/saw562/scratch/wrf-era/FILE:2006-03-01_00
/short/w35/saw562/scratch/wrf-era/SST:2006-03-01_12

Filters

All the dusql commands accept a common set of filters. If a file doesn't match the filter, its size isn't included in the totals reported by ncdu and du:

  • --user=USER Only matches a file if it is owned by username USER. Use --user=-USER to only match if the file is not owned by USER.
  • --group=GROUP Only matches a file if it is owned by group GROUP. Use --group=-GROUP to only match if the file is not owned by GROUP.
  • --mtime=TIME Only matches a file if it was last modified after TIME. Use --mtime=-TIME to only match if the file was last modified before TIME. TIME may be:
    • A year, e.g. 2015
    • A date, e.g. 20150326
    • A time delta readable by Pandas, e.g. 1y6m
  • --size=SIZE Only matches a file if it is larger than SIZE. Use --size=-SIZE to only match if the file is smaller than SIZE. SIZE accepts standard units, e.g. 10gb. If units aren't specified, the size is assumed to be in bytes.
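
If you prefer an absolute cutoff date to a time delta, one can be generated rather than typed by hand. The sketch below assumes the date form shown above (20150326) is YYYYMMDD and that GNU date is available; it only prints the resulting dusql command so you can check it before running it.

```shell
# Build a YYYYMMDD cutoff 18 months in the past (GNU date syntax).
cutoff=$(date -d '18 months ago' +%Y%m%d)

# Print the command instead of running it. The leading '-' asks for files
# dated before the cutoff, as described for --mtime=-TIME above.
echo "dusql find . --mtime=-$cutoff"
```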

Things to Search For

  • Files in your /short space not in the proper group: dusql ncdu /short/$PROJECT/$USER --group=-$PROJECT
  • Files in your /short space older than 1 year: dusql ncdu /short/$PROJECT/$USER --mtime=-1y (note that in some circumstances the file age can be inaccurate, e.g. if it came from a tar file)
  • Files in your /g/data space larger than 10 GB: dusql ncdu /g/data/$PROJECT/$USER --size=10gb