I'm setting this thread to "SOLVED" now.
WINE has been fixed by removing it, and the script I added in the previous post now works fully to extract all of the files within Cache_Data. The one thing still missing is a description of the script + how to set up less to auto-show the Brotli-compressed files, so I'll put that in the next post.
Explanation + info on setting up getCC, the ChromeCache decrypt script:
Install Perl if necessary
(the script makes use of switch, which is included by default since version 5.10 but is also available from CPAN)
Place the script where you will
Make it executable
(chmod 700 — this also sets +x, for the owner only)
Set the values of $IN & $OUT
(lines 41 + 42; be careful to check permissions, particularly for $OUT)
Run the script from a command-prompt
(there are often tens of thousands of files to decrypt, so there is zero terminal output if there are no errors)
Install brotli
(sudo apt install brotli)
(this is to facilitate viewing compressed text files)
(I run Chimaera & it is available as standard; the whole sequence is sketched below)
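Consolidated, the setup looks something like this (a sketch only; the file locations are examples, not requirements):
# one-time setup for getCC
sudo apt install brotli    # for viewing Brotli-compressed text files
cp getCC ~/.getCC          # place the script where you will
chmod 700 ~/.getCC         # make it executable (owner-only)
editor ~/.getCC            # set $IN & $OUT (lines 41 + 42)
~/.getCC                   # run it; silence means no errors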
All lines beginning with a # are comments.
Lines 137 - 148 are all commented. It was exploratory code to determine if there was a common offset to the beginning of the cached file. There *was* indeed such an offset ($diff). This was important as not all files contained magic, and the start-of-file varied in ways that I could not decrypt.
The Chrome Cache_Data dir contains data-files, each of which holds the data + HTTP header from a single HTTP file delivered by a server during a Chrome/Chromium browser session.
HTTP files consist of an HTTP header + data.
The Cache_Data files have the file-data near the top of the file, then the HTTP header & then a bunch of other stuff. Here is a *very* small gif-file to make the point (look for 'GIF89a', the gif magic-marker, at 0xCA in the hex-dump below). Notice how the gif is just 43 bytes, yet the cache-file that contains it is over 4 KB:
$ la ~/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0
-rw------- 1 alexk alexk 4389 Feb 24 02:31 /home/alexk/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0
$ la ~/Personal/ChromeCache/Files/fff822c2bb27d828_0.gif
-rw-r--r-- 1 alexk alexk 43 Feb 24 02:31 /home/alexk/Personal/ChromeCache/Files/fff822c2bb27d828_0.gif
$ hexdump ~/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0 -C | head -31
00000000 30 5c 72 a7 1b 6d fb fc 05 00 00 00 b2 00 00 00 |0\r..m..........|
00000010 23 84 68 3b 00 00 00 00 31 2f 30 2f 5f 64 6b 5f |#.h;....1/0/_dk_|
00000020 68 74 74 70 73 3a 2f 2f 61 6d 61 7a 6f 6e 2e 63 |https://amazon.c|
00000030 6f 2e 75 6b 20 68 74 74 70 73 3a 2f 2f 61 6d 61 |o.uk https://ama|
00000040 7a 6f 6e 2e 63 6f 2e 75 6b 20 68 74 74 70 73 3a |zon.co.uk https:|
00000050 2f 2f 61 61 78 2d 65 75 2e 61 6d 61 7a 6f 6e 2e |//aax-eu.amazon.|
00000060 63 6f 2e 75 6b 2f 65 2f 6c 6f 69 2f 69 6d 70 3f |co.uk/e/loi/imp?|
00000070 62 3d 4a 48 4f 6b 41 4c 63 55 4e 66 59 35 4f 61 |b=JHOkALcUNfY5Oa|
00000080 54 5f 5a 31 61 39 4c 32 67 41 41 41 47 47 67 55 |T_Z1a9L2gAAAGGgU|
00000090 4b 4d 77 67 4d 41 41 41 48 32 41 51 42 4f 4c 30 |KMwgMAAAH2AQBOL0|
000000a0 45 67 49 43 41 67 49 43 41 67 49 43 41 67 49 43 |EgICAgICAgICAgIC|
000000b0 42 4f 4c 30 45 67 49 43 41 67 49 43 41 67 49 43 |BOL0EgICAgICAgIC|
000000c0 41 67 49 43 41 2d 55 71 38 45 47 49 46 38 39 61 |AgICA-Uq8EGIF89a|
000000d0 01 00 01 00 f0 00 00 00 00 00 00 00 00 21 f9 04 |.............!..|
000000e0 01 00 00 00 00 2c 00 00 00 00 01 00 01 00 00 02 |.....,..........|
000000f0 02 44 01 00 3b d8 41 0d 97 45 6f fa f4 01 00 00 |.D..;.A..Eo.....|
00000100 00 ab bd 8a cb 2b 00 00 00 00 00 00 00 dc 0f 00 |.....+..........|
00000110 00 03 0d 45 02 86 fc 8d 34 ff 53 2f 00 e7 d9 8e |...E....4.S/....|
00000120 34 ff 53 2f 00 bd 00 00 00 48 54 54 50 2f 31 2e |4.S/.....HTTP/1.|
00000130 31 20 32 30 30 20 4f 4b 00 53 65 72 76 65 72 3a |1 200 OK.Server:|
00000140 20 53 65 72 76 65 72 00 44 61 74 65 3a 20 46 72 | Server.Date: Fr|
00000150 69 2c 20 32 34 20 46 65 62 20 32 30 32 33 20 30 |i, 24 Feb 2023 0|
00000160 32 3a 33 31 3a 30 38 20 47 4d 54 00 43 6f 6e 74 |2:31:08 GMT.Cont|
00000170 65 6e 74 2d 54 79 70 65 3a 20 69 6d 61 67 65 2f |ent-Type: image/|
00000180 67 69 66 00 43 6f 6e 74 65 6e 74 2d 4c 65 6e 67 |gif.Content-Leng|
00000190 74 68 3a 20 34 33 00 78 2d 61 6d 7a 2d 72 69 64 |th: 43.x-amz-rid|
000001a0 3a 20 42 37 35 4d 32 37 57 4e 38 38 32 54 59 4d |: B75M27WN882TYM|
000001b0 45 56 32 4e 46 48 00 56 61 72 79 3a 20 43 6f 6e |EV2NFH.Vary: Con|
000001c0 74 65 6e 74 2d 54 79 70 65 2c 41 63 63 65 70 74 |tent-Type,Accept|
000001d0 2d 45 6e 63 6f 64 69 6e 67 2c 55 73 65 72 2d 41 |-Encoding,User-A|
000001e0 67 65 6e 74 00 00 00 00 00 03 00 00 00 0d 07 00 |gent............|
$ hexdump fff822c2bb27d828_0.gif -C
00000000 47 49 46 38 39 61 01 00 01 00 f0 00 00 00 00 00 |GIF89a..........|
00000010 00 00 00 21 f9 04 01 00 00 00 00 2c 00 00 00 00 |...!.......,....|
00000020 01 00 01 00 00 02 02 44 01 00 3b |.......D..;|
0000002b
So, in the Cache file:
hex 0xCA: file-data begins ('GIF89a')
hex 0x129: HTTP header begins ('HTTP/1.1 200 OK')
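To make that concrete, the 43-byte gif can be carved out by hand with dd, run from within the Cache_Data dir — this is exactly the call that the script below automates:
# 202 == 0xCA (start of 'GIF89a'); 43 == the Content-Length given in the header
dd if=fff822c2bb27d828_0 of=out.gif skip=202 count=43 iflag=skip_bytes,count_bytes status=none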
Amongst other things, the HTTP header can give the type of the file (Content-Type), the length of the file (Content-Length), the delivery Date & the Encoding (the type of compression).
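All of those fields are visible in the hex-dump above; rendered with the null separators as line-breaks, the embedded header reads:
HTTP/1.1 200 OK
Server: Server
Date: Fri, 24 Feb 2023 02:31:08 GMT
Content-Type: image/gif
Content-Length: 43
x-amz-rid: B75M27WN882TYMEV2NFH
Vary: Content-Type,Accept-Encoding,User-Agent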
Every sensible Internet server compresses most of the files that it delivers, particularly text-files. At the moment getCC only detects gzip & brotli compression:
gzip: shown as 'file.txt.gz'
brotli: shown as 'file.txt.br'
If viewed from a terminal with less file.txt.gz, the gzip-file will be auto-decompressed & shown as plain text within the less-screen. That will NOT work for Brotli files unless you take the following steps:
My version of Bash uses ~/.bashrc as its initialisation script. The following code within ~/.bashrc enables less to auto-decode a wealth of different compressions (though not Brotli) in conjunction with lesspipe:
# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"
Take the following steps to add Brotli to all the other auto-decoded compressions:
Install Brotli
Save the script below as "~/.lessfilter"
Make it executable
#!/bin/sh
# ~/.lessfilter
# 2023-03-11 add brotli to all other encodings for less
case "$1" in
*.br)
brotli -dc "$1"
;;
*)
# We don't handle this format.
exit 1
esac
# No further processing by lesspipe necessary
exit 0
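Once ~/.lessfilter is saved & made executable, lesspipe finds & uses it automatically, and a Brotli file extracted by getCC can then be viewed directly (the file name here is just an example):
chmod +x ~/.lessfilter
less ~/Personal/ChromeCache/Files/somefile.html.br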
Last edited by alexkemp (2023-03-11 17:49:13)
getCC, the ChromeCache decrypt script:
I needed to know whether the script needed to activate on HTTP Status Codes other than just 200, so I did some calculations:
$ la ~/.cache/chromium/Default/Cache/Cache_Data/* | wc -l
22525
$ strings ~/.cache/chromium/Default/Cache/Cache_Data/* | fgrep "HTTP/1.1" | sort | uniq -c
strings: Warning: '/home/alexk/.cache/chromium/Default/Cache/Cache_Data/index-dir' is a directory
14055 HTTP/1.1 200
1 HTTP/1.1 200 200
564 HTTP/1.1 200 OK
7490 HTTP/1.1 204
45 HTTP/1.1 204 No Content
5 HTTP/1.1 206
42 HTTP/1.1 301
15 HTTP/1.1 301 Moved Permanently
236 HTTP/1.1 302
1 HTTP/1.1 302 Found
1 HTTP/1.1 303 See Other
1 HTTP/1.1 307
2 HTTP/1.1 400
2 HTTP/1.1 403
84 HTTP/1.1 404
5 HTTP/1.1 404 Not Found
1 HTTP/1.1 410
11 HTTP/1.1 500
Sums:
65% 14,620 HTTP 200 OK
33% 7,535 HTTP 204 No Content
0% 5 HTTP 206 Partial Content
0% 57 HTTP 301 Moved Permanently
1% 237 HTTP 302 Found
0% 1 HTTP 303 See Other
0% 1 HTTP 307 Temporary Redirect
0% 2 HTTP 400 Bad Request
0% 2 HTTP 403 Forbidden
0% 89 HTTP 404 Not Found
0% 1 HTTP 410 Gone
0% 11 HTTP 500 Internal Server Error
Ah well, that's OK then. The script can stick with Status 200, no problem. There is a small chance that 203 Non-Authoritative Information may be involved (responses via a proxy, although it never features in my accesses), but I'm happy to consider that chance remote.
All of the 22 thousand files in the current cache were from servers reporting themselves to be version 1.1. HTTP/0.9 & HTTP/1.0 are now considered obsolete (I bet that some still exist). Both HTTP/2 & HTTP/3 are now supposed to be a thing, although no server reported either version in my accesses. However, I obviously need to modify the Perl regex to accept such possibilities, and that will come with the next post.
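In sketch form, the broadened match looks like this — it is the same pattern that the code in the next post uses, anchored on the two null bytes that precede the response header in every cache file:
# any HTTP version + status 200 or 203
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
my $HVER = "$1"; # http version, eg '1.1'
my $HSTA = "$2"; # http status, '200' or '203'
my $HTTP = "HTTP/$HVER $HSTA";
}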
getCC, the ChromeCache decrypt script:
Added the ability to decode any HTTP version + status 200 or 203 files.
Testing results:
$ ~/Personal/.getCC
image/webp: 000420fedcafe6ff_0; $TLS=.webp; $HPOS=5185; $END=5185; $HVER=1.1; $HSTA=200; $HTTP=HTTP/1.1 200; $MOD=Fri, 03 Mar 2023 20:27:56 GMT;
$ cd ~/Personal/ChromeCache/Files
$ time ~/Personal/.getCC
real 1m28.431s
user 0m53.738s
sys 0m34.723s
I had noticed that all HTTP/1.1 server responses were preceded by two null bytes in the cache files:
00001430 fc 9b 54 2f 00 a4 ec 1b fc 9b 54 2f 00 57 02 00 |..T/......T/.W..|
00001440 00 48 54 54 50 2f 31 2e 31 20 32 30 30 00 61 63 |.HTTP/1.1 200.ac|
00002800 02 72 ab 33 c9 16 55 2f 00 8d 9c 35 c9 16 55 2f |.r.3..U/...5..U/|
00002810 00 56 02 00 00 48 54 54 50 2f 31 2e 31 20 32 30 |.V...HTTP/1.1 20|
00003740 ea 4c 55 2f 00 48 7c d5 ea 4c 55 2f 00 55 02 00 |.LU/.H|..LU/.U..|
00003750 00 48 54 54 50 2f 31 2e 31 20 32 30 30 00 61 63 |.HTTP/1.1 200.ac|
I used that fact to guarantee that the HTTP string being indexed was the correct one, + updated $HTTP to contain the correct strings.
Here is the latest code:
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (F_OFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $F_OFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# save algorithm:
# 1) only save HTTP 200 files ($END)
# 2) try first to set file beginning ($BEG) from magic bytes
# 3) if (2) fails, set $BEG from $LEN; if no length, then ignore file
# 4) extract section $BEG to $END from $IN file into $OUT dir
# 5) touch file to conform with http header date
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# Global CONSTANTS
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; # Chromium cache folder.
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place for extracted files
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $F_OFF= 52; # Offset of HTTP-begin from magic-begin (BEG) + LEN
opendir( my $d, "$IN") or die "Cannot open directory $IN: $!\n"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "000420fedcafe6ff_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $magic = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $LEN = -1; # content-length
open my $fhi, '<:raw', "$IN/$f" or die $!;
read( $fhi, my $cache_buffer, -s "$IN/$f" );
close( $fhi ) or die "could not close $IN/$f: $!";
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) { # escaped dot: match a literal '.' in the version
$HPOS = $-[0] + 2;
$HVER = "$1";
$HSTA = "$2";
$HTTP = "HTTP/$HVER $HSTA";
}
$END = index( $cache_buffer, "$HTTP", $HPOS); # Check for presence of HTTP 200|203 header (paranoia coding)
if( $END > -1 ) { #+(and therefore std header fields for successful access)
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
$LEN = $1;
if( !$LEN ) { $LEN = -1; } # yes, some pages have Content-Length:0
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium did it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
given( $MIME ) {
when ('application/font-woff' ) { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; }
when ('application/x-javascript'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('application/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; }
when ('binary/octet-stream') { $magic = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $magic = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $magic = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $magic = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $magic = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $magic = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $magic = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $magic = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; }
when ('image/vnd.microsoft.icon'){ $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $magic = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('text/css') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; }
when ('text/fragment+html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; }
when ('text/html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; }
when ('text/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('text/plain') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; }
when ('video/mp4') { $magic = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $magic = ''; $OFF = 0; $TLS = ''; }
}
if( $magic ) {
if( $magic eq 'GIF87a') { # account for gif + jpeg multiple $magic
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'GIF89a';
$BEG = index( $cache_buffer, "$magic" );
}
} elsif( $magic eq 'JFIF') {
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'Exif';
$TLS = 'jpeg';
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
$BEG = index( $cache_buffer, "$magic" );
}
}
}
$BEG = index( $cache_buffer, "$magic" );
}
# # trying to decode where each file begins (determine common offsets)
# if( $LEN < 1 && $BEG > -1 ) { }
# if( $BEG > -1 && $LEN > -1 ) {
# # at this point $BEG - $OFF == start of magic
# # $END == start of $HTTP
# # $LEN == length of content from header
# my $mbeg = $BEG - $OFF; my $mhex = sprintf("0x%X", $mbeg);
# my $hbeg = $END - $LEN; my $hhex = sprintf("0x%X", $hbeg);
# my $diff = $hbeg - $mbeg;
# my $dhex = sprintf("0x%X", $diff);
# print "$MIME: $f; \$END/\$LEN=$END / $LEN; \$mbeg=$mbeg / $mhex; \$hbeg=$hbeg / $hhex; \$diff=$diff / $dhex; \n";
# }
if( $BEG > -1 ) {
$BEG -= $OFF;
if( $LEN < 1 ) { $LEN = $END - $BEG - $F_OFF; } # v rare, but happens
} elsif( $LEN > -1 ) { $BEG = $END - $LEN - $F_OFF; } # no magic (text, xml + brotli files)
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # account for different compression-encodings
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# print the files out
if( $BEG > -1 && $LEN > -1 ) {
`dd if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
if( $MOD ) { `touch "$OUT/$f$TLS" -d "$MOD"`; }
# print "$MIME: $f; \$TLS=$TLS; \$HPOS=$HPOS; \$END=$END; \$HVER=$HVER; \$HSTA=$HSTA; \$HTTP=$HTTP; \$MOD=$MOD; \n";
}
} # if( $END > -1 ) # other pages are most likely to be HTTP 204 No Content
}
This should be the last code update for now (below).
It is tested as well as I can manage in a short time. ~64% of the cache files are HTTP 200, with most of the rest being 204 No Content. A number of the 200 OK files also have Content-Length:0 (js files for search-results in many cases). The script is written so that no attempt is made to extract no-content files.
The final search was for Content-Encoding: (compression before delivery). My main source was the documentation for the latest Apache modules, which showed that only gzip & brotli are currently used. The statement was that "deflate is not supported", whilst compress was not even mentioned.
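The script below marks such files with a .gz/.br suffix; until a later version decompresses them automatically, they can be unpacked by hand (the file names are examples):
gunzip file.txt.gz       # replaces file.txt.gz with file.txt (mtime is kept)
brotli -d file.txt.br    # writes file.txt; add -j to also remove the .br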
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (F_OFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $F_OFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# save algorithm:
# 1) only save HTTP 200 files ($END)
# 2) try first to set file beginning ($BEG) from magic bytes
# 3) if (2) fails, set $BEG from $LEN; if no length, then ignore file
# 4) extract section $BEG to $END from $IN file into $OUT dir
# 5) touch file to conform with http header date
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# Global CONSTANTS
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; # Chromium cache folder.
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place for extracted files
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $F_OFF= 52; # Offset of HTTP-begin from magic-begin (BEG) + LEN
opendir( my $d, "$IN") or die "Cannot open directory $IN: $!\n"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "000420fedcafe6ff_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $magic = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $LEN = -1; # content-length
open my $fhi, '<:raw', "$IN/$f" or die $!;
read( $fhi, my $cache_buffer, -s "$IN/$f" );
close( $fhi ) or die "could not close $IN/$f: $!";
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
$HPOS = $-[0] + 2;
$HVER = "$1";
$HSTA = "$2";
$HTTP = "HTTP/$HVER $HSTA";
}
$END = index( $cache_buffer, "$HTTP", $HPOS); # Check for presence of HTTP 200|203 header (paranoia coding)
if( $END > -1 ) { #+(and therefore std header fields for successful access)
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
$LEN = $1;
if( !$LEN ) { $LEN = -1; } # yes, some pages have Content-Length:0
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium did it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
# easy to mix up mime/media-types & encodings (compression schemes) here
# Content-Type == mime-type: refers to the type of file that is being transferred
# Content-Encoding == compression scheme: refers to the type of compression used during transfer
# so, a text file (js txt xml, etc) with gzip magic will be a gzipped text-file (eg file.xml.gz)
# only gzip (+ brotli) encodings are supported; deflate is not supported, compress is not even mentioned
# see https://httpd.apache.org/docs/current/mod/mod_deflate.html
# see https://www.iana.org/assignments/media-types/media-types.xhtml
given( $MIME ) {
when ('application/font-woff' ) { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/manifest+json'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/x-javascript'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('binary/octet-stream') { $magic = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $magic = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $magic = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $magic = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $magic = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $magic = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $magic = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $magic = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; } # magic for gzip encoding
when ('image/vnd.microsoft.icon'){ $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $magic = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('image/x-icon') { $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('text/css') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; } # magic for gzip encoding
when ('text/fragment+html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; } # magic for gzip encoding
when ('text/html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; } # magic for gzip encoding
when ('text/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('text/plain') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; } # magic for gzip encoding
when ('text/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('video/mp4') { $magic = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $magic = ''; $OFF = 0; $TLS = ''; }
}
if( $magic ) {
if( $magic eq 'GIF87a') { # account for gif + jpeg multiple $magic
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'GIF89a';
$BEG = index( $cache_buffer, "$magic" );
}
} elsif( $magic eq 'JFIF') {
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'Exif';
$TLS = 'jpeg';
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
$BEG = index( $cache_buffer, "$magic" );
}
}
}
$BEG = index( $cache_buffer, "$magic" );
}
# fix $BEG + $LEN
if( $BEG > -1 ) {
$BEG -= $OFF;
if( $LEN < 1 ) { $LEN = $END - $BEG - $F_OFF; } # v rare, but happens
} elsif( $LEN > -1 ) { $BEG = $END - $LEN - $F_OFF; } # no magic (text + brotli files)
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# print the files out
if( $BEG > -1 && $LEN > -1 ) {
`dd if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
if( $MOD ) { `touch "$OUT/$f$TLS" -d "$MOD"`; }
# print "$MIME: $f; \$TLS=$TLS; \$BEG=$BEG; \$LEN=$LEN; \$END=$END; \$MOD=$MOD; \n";
} # lots of Content-Length:0 files
} # if( $END > -1 ) # other pages mostly HTTP 204 No Content
}
It is said that the connection between Rats & Bulldogs lies in the construction of their jawbones: once they bite, neither can release their teeth until the jaws clamp together (due to a ratchet mechanism that joins the upper & lower jawbone). I sympathise with both species; my mind has a similar mechanism.
I finally spotted how to determine the precise length of the embedded URL within each cache (simple) Entry file. It is now possible to collate all urls, data-lengths, etc. That finally opens the possibility of providing url + file listing, search, selection + individual extraction. However, that will all have to wait for later. For now, it is a simple utility that extracts all cached files (or just one file) into a single directory (listing below).
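In outline, this is a condensed sketch of what the full script below does with unpack ($entry_file stands for any Cache_Data file):
# each simple-cache Entry file starts with a 24-byte little-endian header:
# magic(8) version(4) key_length(4) key_hash(4) padding(4)
open my $fh, '<:raw', $entry_file or die $!;
read $fh, my $hdr, 24;
my ($magic, $ver, $key_length, $key_hash, $pad) = unpack 'Q< L< L< L< L<', $hdr;
# the URL (the cache key) occupies $key_length bytes from byte 24;
# the cached content begins immediately after, at byte 24 + $key_length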
There is a commented-out print-line almost at the bottom of the script. It can produce a listing of all files for you. The following from a terminal will do that (comment out the $DD lines & uncomment the print line first):
~/Personal/.getCC > temp.txt; sort -n temp.txt > mime.txt;
The Cache contains all kinds of corrupted files. There are lines in the script to try to catch those; the notices go to STDERR so they will not corrupt your mime.txt.
Note that there has been a radical reset of almost all code, which creates some disjuncture between the current code & the earlier BugFix comments. $MAGIC is still in the code but is unused now.
If I cannot stop myself producing a file browser then I shall place the code into GitHub, so that this thread can finally sleep.
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-21: Finally found location of URL-length
# (& thus how to find start of content for all files)
# 2023-03-16: bugfix: Account for Content-Encoding invalidating file-magic
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (FOFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $FOFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# Global CONSTANTS
my $UNBROT= "/usr/bin/brotli -d"; # change to your location
my $DD = "/bin/dd"; # - ditto -
my $GUNZIP= "/bin/gunzip"; # - ditto -
my $TOUCH = "/usr/bin/touch"; # - ditto -
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data"; # Chromium cache folder
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place to extract files to
my $FOFF = 52; # Offset of HTTP-begin from magic-eof (BEG) + LEN
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $MEOF = "\x{d8}\x{41}\x{0d}\x{97}\x{45}\x{6f}\x{fa}\x{f4}"; # Magic End bits (last 8 bytes of every simple cache Entry file data record)
my $MENT = "\x{30}\x{5c}\x{72}\x{a7}\x{1b}\x{6d}\x{fb}\x{fc}"; # Magic Start bits (1st 8 bytes of every simple cache Entry file data record)
my $MURL = "_dk_"; # Magic Start for URL (url follows within cache Entry file data record)
# save algorithm:
# 1) $URL/@URL: find $key_length from header
# 2) $BEG;$END;$LEN: obtain data start+end (from $key_length + $MEOF)
# 3) only save HTTP 200 files ($HTTP)
# 4) $HTTP;$BROTLI;$GZIP;$MIME;$MOD;$TLS: obtain http header fields (from $MEOF + $FOFF)
# 5) extract section $BEG to $END from $IN file into $OUT dir
# 6) $MOD: touch file to conform with http header date
# 7) $BROTLI;$GZIP: decompress gzip/brotli files
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# chromium cache in 2023 is a "simple cache"
# see https://www.chromium.org/developers/design-documents/network-stack/disk-cache/very-simple-backend/
# see https://chromium.googlesource.com/chromium/src/+/HEAD/net/disk_cache/simple/simple_entry_format.h
# see https://github.com/JimmXinu/FanFicFare/blob/main/fanficfare/browsercache/browsercache_simple.py
# start-of-record magic-marker == 30 5c 72 a7 1b 6d fb fc
# end-of-record magic-marker == d8 41 0d 97 45 6f fa f4
# (data ends immediately before eor)
# (http header starts 44 bytes after eor, and thus 44+8=52 bytes (\x34) after end-of-data)
# (eor also ends file; 16 bytes then follow to actual end-of-file)
# from FFF: (finally found url-length location)
# cache Entry-file header = struct.Struct('<QLLLL') [little-endian | 8-byte | 4-byte | 4-byte | 4-byte | 4-byte)
# (magic, version, key_length, key_hash, padding) = shformat.unpack(data)
# Parse Chrome Cache File; see https://github.com/JimmXinu/FanFicFare/blob/main/fanficfare/browsercache/chromagnon/cacheParse.py
opendir( my $d, "$IN") or die "Cannot open directory $IN"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "be75a13d44e548da_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $LEN = -1; # content-length
my $MAGIC = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $URL = ""; # url within cache Entry file
my @URL = (); # same url as an array
my $UPOS = -1; # position of url start in Entry file
open my $fh, '<:raw', "$IN/$f" or die "Cannot open file $IN/$f";
# 1 Obtain url length then url
# $key_length starts from byte 24 (\x18), normally begins with an 8-byte string '1/0/_dk_', then stretches to the end of the URL sequence
# the std 8-byte string indicates that two streams (1 + 0) are included within the file
# the request-url sequence is 2 x (normally-identical) base urls then the full request url, each separated by a single space
# data supplied to request url begins immediately after the url, and ends immediately before the $MEOF magic-marker
# http response headers begin 44 bytes after the end of $MEOF, starting with HTTP Status string at $HPOS
# none of the "std" response headers can be *expected* to exist, though most do
# all sorts of stuff exists after initial response header bundle, many of which I do not understand
#+ including content-servers such as amazon, certificates, proxy-servers, others
# this second stream (for std 2-stream files) ends with another $MEOF 16 bytes (\x10) before eof
# eg1: "1/0/_dk_https://bbc.co.uk https://bbc.co.uk https://static.files.bbci.co.uk/core/bundle-service-bar.003e5ecd332a5558802c.js"
# \x18 ^ ^ $UPOS (=32 =\x20) ($key_length =123 =\x7b; note: 24+123 =147 =\x93) \x93 ^
# eg2: "d8410d97 456ffaf4 01000000 24be2bf3 8d010000000000005814000003654702 acd8b17d9a552f00b8a4b27d9a552f00 40040000 HTTP/1.1 200"
# \x220 ^ \x228 ^ \x230 ^ \x240 ^ \x250 ^ ^ $HPOS (=596 =\x254)
my $bytes_read = read $fh, my $bytes, 24;
die "Got $bytes_read but expected 24" unless $bytes_read == 24;
my ($magic, $version, $key_length, $key_hash, $padding) = unpack 'a8 a4 a4 a4 a4', $bytes;
if( unpack('Q', $magic ) ne unpack('Q', $MENT )) {
$magic = unpack('H16', $magic );
$MENT = unpack('H16', $MENT );
die "'$IN/$f' is not a cache entry file, wrong magic number\n (got '$magic' not '$MENT')";
}
seek( $fh, 0, 0 ); # return to start of file
read( $fh, my $cache_buffer, -s "$IN/$f" ); # put whole file in $cache_buffer
close( $fh ) or die "could not close $IN/$f";
# Obtain url
if( $cache_buffer =~ /$MURL/ ) {
$UPOS = $-[0] + 4; # url begins immediately *after* marker string
$key_length=unpack('L', $key_length );
$key_hash =unpack('H16', $key_hash );
$URL = substr( $cache_buffer, $UPOS, $key_length - ($UPOS - 24));
@URL = split(' ', $URL );
}
# 2 Obtain data start+end
$BEG = $key_length + 24;
$END = index( $cache_buffer, "$MEOF", $BEG);
if( $END < 1 ) {
print STDERR "'$IN/$f': error finding end of data at $0 line:". __LINE__ ."\n";
next; # immediately skips up to foreach() + increments $f
} else {
if( $BEG == $END ) { # yes, some pages have Content-Length:0
$LEN = -1;
} else {
$LEN = $END - $BEG;
}
}
# 3 Only extract from HTTP 200|203
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
$HPOS = $-[0] + 2;
if( $HPOS != $END + $FOFF) {
print STDERR "'$IN/$f': error finding start of http at $0 line:". __LINE__ ."\n";
next; # immediately skips up to foreach() + increments $f
}
$HVER = "$1"; # http version; always HTTP/1.1 for me
$HSTA = "$2"; # http status; we are only interested in 200 or 203
$HTTP = "HTTP/$HVER $HSTA";
# 4 Obtain http header fields
if( $LEN > 0 ) { # yes, some pages have Content-Length:0
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
if( $1 != $LEN ) {
print STDERR "'$IN/$f': data-length \$LEN=$LEN differs from http Content-Length=$1 at $0 line:". __LINE__ ."\n";
}
if( !$1 ) { print STDERR "'$IN/$f': len=0 at $0 line:". __LINE__ ."\n"; }
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium did it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
} else { next; } # if( $LEN > 0 )
# easy to mix up mime/media-types & encodings (compression schemes) here
# Content-Type == mime-type: refers to the type of file that is being transferred
# Content-Encoding == compression scheme: refers to the type of compression used during transfer
# so, a text file (js txt xml, etc) with gzip magic will be a gzipped text-file (eg file.xml.gz)
# only gzip (+ brotli) encodings are supported; deflate is not supported, compress is not even mentioned
# see https://httpd.apache.org/docs/current/mod/mod_deflate.html
# see https://www.iana.org/assignments/media-types/media-types.xhtml
given( $MIME ) {
when ('application/font-woff' ) { $MAGIC = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $MAGIC = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/manifest+json'){ $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/x-javascript'){ $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('binary/octet-stream') { $MAGIC = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $MAGIC = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $MAGIC = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $MAGIC = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $MAGIC = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $MAGIC = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $MAGIC = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $MAGIC = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $MAGIC = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $MAGIC = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; } # magic for gzip encoding
when ('image/vnd.microsoft.icon'){ $MAGIC = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $MAGIC = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('image/x-icon') { $MAGIC = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('text/css') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; } # magic for gzip encoding
when ('text/fragment+html') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; } # magic for gzip encoding
when ('text/html') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; } # magic for gzip encoding
when ('text/javascript') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('text/plain') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; } # magic for gzip encoding
when ('text/xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('video/mp4') { $MAGIC = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $MAGIC = ''; $OFF = 0; $TLS = ''; }
}
# gzip encoding overrides file magic (is earlier in file-stream)
# brotli encoding overrides file magic (there is none)
if( $GZIP ) { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; } elsif( $BROTLI ) { $MAGIC = ""; $OFF = 0; }
if( $MAGIC ) {
if( $MAGIC eq 'GIF87a') { # account for gif + jpeg multiple $MAGIC
if( index( $cache_buffer, "$MAGIC" ) < 0 ) { # index() returns -1 when not found
$MAGIC = 'GIF89a';
}
} elsif( $MAGIC eq 'JFIF') {
if( index( $cache_buffer, "$MAGIC" ) < 0 ) {
$MAGIC = 'Exif';
$TLS = 'jpeg';
if( index( $cache_buffer, "$MAGIC" ) < 0 ) {
$MAGIC = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
}
}
}
}
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# 5 print the files out
if( $BEG > -1 && $LEN > -1 ) {
`$DD if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
# 6 set the date to last-modified
if( $MOD ) { `$TOUCH "$OUT/$f$TLS" -d "$MOD"`; }
# 7 decompress if necessary
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { # decompressed; .gz/.br suffix removed
`$GUNZIP "$OUT/$f$TLS"`; # original file removed; date retained
} else {
`$UNBROT -j "$OUT/$f$TLS"`;
}
}
} # lots of Content-Length:0 files
# print "$MIME; $URL[0]; $f; \$key_length=$key_length; \$key_hash=$key_hash; \$BEG=$BEG; \$END=$END; \$LEN=$LEN; \$TLS=$TLS \n";
} # if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) # other pages mostly HTTP 204 No Content
}
Thursday update: small improvement to comments
Last edited by alexkemp (2023-03-23 13:27:27)
or perhaps place it somewhere that is not owned by microsoft.
Hi Ralph
If you have any suggestions I'll investigate them. However, I'm used to GitHub now & it is free. Of course, that *is* what was said about MSIE…
or perhaps place it somewhere that is not owned by microsoft.
I get it if people want to play very old games, etc. using microsoft, but who does this by either A: exposing anything to the internet, or B: running it as their main system?
Also, it's interesting that windows is still an industry standard.
I wonder why bosses still want it done that way. That sounds like a good way to do security… if you like the security of being protected by an OS that seems like it's made by someone on the zombie drug, bath salts. The same goes for having windows as an industry standard.
Yeah, I'm guessing that when Microsoft bought GitHub, they thought that people would show a "convenient mix of lazy and stupid" and pretend to themselves that Microsoft, by keeping GitHub services much the same (at least initially), supports FOSS.
I agree on both points. BTW, I have more patience with someone who just doesn't know of other alternatives than with someone who, while not needing windows, still chooses to use it as their main operating system when they aren't ignorant.
That is a special kind of awful, the kind that continues a perpetual, unnecessary cycle, for nothing.
Last edited by zapper (2023-03-29 19:45:25)
I've uploaded 2 scripts to GitHub:
In that repository, getCC is a Perl script that extracts all accessible files from the Chromium cache. Once they are all extracted, browseCC is a Bash script that accesses text-files dropped into the extract dir & uses YAD to display summaries & specifics on those files, including thumbnails for image files.