A one-line script is run every day under my Chimaera 4 system:
sudo apt update && sudo apt install -f && sudo apt upgrade
No problems until today, when it reports that 3 of the 4 available packages will be held back:
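For reference, a minimal sketch of what such a script might look like as a file (the path ~/.local/sbin/update matches the prompt below; the set -e line is my own addition, to stop at the first failing step):
#!/bin/sh
# minimal daily-update script; abort on the first failure
set -e
sudo apt update
sudo apt install -f
sudo apt upgrade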
$ ~/.local/sbin/update
Hit:1 https://josm.openstreetmap.de/apt alldist InRelease
Hit:2 http://deb.devuan.org/merged chimaera InRelease
Get:3 http://deb.devuan.org/merged chimaera-security InRelease [26.2 kB]
Get:4 http://deb.devuan.org/merged chimaera-updates InRelease [26.1 kB]
Get:5 http://deb.devuan.org/merged chimaera-proposed-updates InRelease [26.6 kB]
Hit:6 http://deb.devuan.org/merged chimaera-backports InRelease
Fetched 78.9 kB in 3s (28.9 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
4 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
chromium chromium-common chromium-sandbox
The following packages will be upgraded:
josm-latest
1 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 15.7 MB of archives.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] n
Abort.
Cue frantic DDG searches.
A 12-year-old question has good suggestions:
$ sudo apt-get --with-new-pkgs upgrade chromium chromium-common chromium-sandbox josm-latest
[sudo] password for alexk:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
chromium-l10n : Depends: chromium (< 113.0.5672.126-1~deb11u1.1~) but 114.0.5735.90-2~deb11u1 is to be installed
E: Broken packages
Next:
$ sudo apt-get install chromium chromium-common chromium-sandbox josm-latest
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
chromium-l10n chromium-shell chromium-driver
The following packages will be REMOVED:
chromium-l10n
The following packages will be upgraded:
chromium chromium-common chromium-sandbox josm-latest
4 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
Need to get 89.3 MB of archives.
After this operation, 80.6 MB disk space will be freed.
Do you want to continue? [Y/n] n
Abort.
Nope; not ready for it to be removed yet.
These are the problematic packages waiting to be upgraded:
$ apt list --upgradable
Listing... Done
chromium-common/stable-security 114.0.5735.90-2~deb11u1 amd64 [upgradable from: 113.0.5672.126-1~deb11u1]
chromium-sandbox/stable-security 114.0.5735.90-2~deb11u1 amd64 [upgradable from: 113.0.5672.126-1~deb11u1]
chromium/stable-security 114.0.5735.90-2~deb11u1 amd64 [upgradable from: 113.0.5672.126-1~deb11u1]
josm-latest/unknown 1.5.svn18746 all [upgradable from: 1.5.svn18744]
So, after that blizzard of code, what to do? Is it likely that chromium-l10n will finally get updated if I am patient, or is there another solution?
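One stop-gap while waiting is to hold the chromium packages explicitly, so that the daily script stops offering them; a minimal sketch using apt-mark (package names as above):
sudo apt-mark hold chromium chromium-common chromium-sandbox
apt-mark showhold    # confirm the holds
sudo apt-mark unhold chromium chromium-common chromium-sandbox    # once chromium-l10n catches up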
Setting the immutable bit in extended attributes should also work - chattr +i [filename] as root.
Works perfectly. Perhaps a touch too perfectly for some scenarios, but perfect for stopping ALL deletions.
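For completeness, a minimal round-trip on a file you want to pin, with lsattr to verify (the path is an example):
sudo chattr +i /path/to/file    # set the immutable flag
lsattr /path/to/file            # an 'i' should now appear among the flags
sudo chattr -i /path/to/file    # clear it again before any legitimate change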
~$ man chattr
NAME
chattr - change file attributes on a Linux file system
…
ATTRIBUTES
…
i A file with the 'i' attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this
file, most of the file's metadata can not be modified, and the file can not be opened in write mode. Only the supe‐
ruser or a process possessing the CAP_LINUX_IMMUTABLE capability can set or clear this attribute.
~$ mkdir TMP
~$ cd TMP
~/TMP$ echo "a" > tmp.txt
~/TMP$ la tmp.txt
-rw-r--r-- 1 alexk alexk 2 Jun 3 00:04 tmp.txt
~/TMP$ sudo chown root.root tmp.txt
[sudo] password for alexk:
~/TMP$ sudo chattr +i tmp.txt
~/TMP$ la tmp.txt
-rw-r--r-- 1 root root 2 Jun 3 00:04 tmp.txt
~/TMP$ chmod 0700 tmp.txt
chmod: changing permissions of 'tmp.txt': Operation not permitted
~/TMP$ sudo chmod 0700 tmp.txt
chmod: changing permissions of 'tmp.txt': Operation not permitted
~/TMP$ rm tmp.txt
rm: cannot remove 'tmp.txt': Operation not permitted
~/TMP$ sudo chattr -i tmp.txt
~/TMP$ sudo chmod 0700 tmp.txt
~/TMP$ la tmp.txt
-rwx------ 1 root root 2 Jun 3 00:04 tmp.txt
~/TMP$ rm tmp.txt
rm: remove write-protected regular file 'tmp.txt'? y
~/TMP$ cd -
/home/alexk
~$ rmdir TMP
I was astonished that an ordinary user can delete a file owned by root, but yes they can (if the dir is owned by the user):
$ echo "a" > tmp.txt
$ la tmp.txt
-rw-r--r-- 1 alexk alexk 2 Jun 2 17:48 tmp.txt
$ chmod 0700 tmp.txt
$ la tmp.txt
-rwx------ 1 alexk alexk 2 Jun 2 17:48 tmp.txt
$ sudo chown root.root tmp.txt
[sudo] password for alexk:
$ la tmp.txt
-rwx------ 1 root root 2 Jun 2 17:48 tmp.txt
$ rm tmp.txt
rm: remove write-protected regular file 'tmp.txt'? y
$ la tmp.txt
ls: cannot access 'tmp.txt': No such file or directory
I've uploaded 2 scripts to GitHub:
In that repository, getCC is a Perl script that extracts all accessible files from the Chromium cache. Once they are all extracted, browseCC is a Bash script that reads the text files dropped into the extraction dir and uses YAD to display summaries & specifics of those files, including thumbnails for image files.
DVD & Blu-Ray discs are problematic under Linux for many reasons, and special steps need to be taken before Linux apps can play them.
I thought that it might be useful to expand this post to serve as a reference, since at first I could not remember myself all the steps I took.
encrypted with "css" and decryption libraries are needed to play them … don't think this will ever make it into debian officially.
Wiki: libdvdcss: Content Scramble System (CSS) software decryption library for accessing DVDs
VLC: Videolan (VLC): help on installing libdvdcss
HowTo Geek: Play DVDs and Blu-rays on Linux
CDs were first issued in 1982 and mostly held music files. The digital files on the media were NOT scrambled & quickly everyone learned to ‘rip’ the music from the disc and load it onto other electronic players. That became a nightmare for the Music Industry when the Internet got going in the 1990s.
DVDs were first released in 1996 and were designed to store larger files than Music required. The industry was determined NOT to make the same mistake as with CDs, and introduced both Regions (to try to restrict distribution regionally) and encryption (CSS) to restrict playing to officially-endorsed machines. The media became popular for both entertainment- (films, games) and software-distribution. CSS was brain-dead and easy to decrypt, but the combo of both endemic disc-players that contained zero hardware-decryption + industry-sponsored law (DMCA (Digital Millennium Copyright Act)), threatening fines and/or incarceration, was a nightmare for Linux users.
Blu-ray originated in 2002 and became a standard in 2008. It was designed to store very much larger digital files (required by larger domestic TV screens). In every way it is DVD with knobs on.
The modern advice from both Windows & Linux is “Install VLC”.
VLC states "VLC media player binaries are distributed with the libdvdcss library included" (that library is the one that decrypts the DVD stream). That should mean that, once VLC is installed under Devuan, other apps on the system should also be able to access the same library and decrypt the same streams. However, life is often not perfect, so here is what to do if glitches appear (it looks like the libdvdcss binary may come pre-installed under Windows but not Linux):–
With the more ancient Debian/Ubuntu 15.04 it was sudo apt-get install libdvdread4 followed by sudo /usr/share/doc/libdvdread4/install-css.sh. My Chimaera still has /usr/share/doc/libdvdread4/README.css within it. However, the current advice is a little different:–
sudo apt install libdvd-pkg
sudo dpkg-reconfigure libdvd-pkg
The second step will download, build and install the latest libdvdcss source on your machine. At that point *all* libdvdcss-aware apps on your system (such as MPlayer, MPV and HandBrake) will be able to play DVDs.
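To confirm that the build step actually produced and registered the shared library, a quick check (assuming it was installed to a standard library path; ldconfig lives in /sbin):
/sbin/ldconfig -p | grep dvdcss
# expect an entry such as libdvdcss.so.2 pointing into /usr/lib/...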
$ apt search libdvd-pkg
Sorting... Done
Full Text Search... Done
libdvd-pkg/stable,now 1.4.2-1-1 all [installed]
DVD-Video playing library - installer
$ apt search libdvdcss
Sorting... Done
Full Text Search... Done
…
libdvdcss-dev/now 1.4.2-1~local amd64 [installed,local]
library for accessing encrypted DVDs - development files
libdvdcss2/now 1.4.2-1~local amd64 [installed,local]
library for accessing encrypted DVDs
libdvdcss2-dbgsym/now 1.4.2-1~local amd64 [installed,local]
debug symbols for libdvdcss2
libdvdread4/now 6.0.1-1 amd64 [installed,local]
library for reading DVDs
libdvdread8/stable,now 6.1.1-2 amd64 [installed,automatic]
library for reading DVDs
Blu-ray presents yet more problems. Once again, refer to How to Play DVDs and Blu-rays on Linux:
sudo apt-get install vlc libaacs0 libbluray-bdj libbluray1
mkdir -p ~/.config/aacs/
cd ~/.config/aacs/ && wget http://vlc-bluray.whoknowsmy.name/files/KEYDB.cfg
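A quick sanity check that the key database landed where libaacs expects it (path as above):
ls -lh ~/.config/aacs/KEYDB.cfg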
$ apt search libbluray
Sorting... Done
Full Text Search... Done
libbluray-bdj/stable,now 1:1.2.1-4+deb11u2 all [installed]
Blu-ray Disc Java support library (BD-J library)
libbluray-bin/stable 1:1.2.1-4+deb11u2 amd64
Blu-ray disc playback support library (tools)
libbluray-dev/stable 1:1.2.1-4+deb11u2 amd64
Blu-ray disc playback support library (development files)
libbluray-doc/stable 1:1.2.1-4+deb11u2 all
Blu-ray disc playback support library (documentation)
libbluray2/stable,now 1:1.2.1-4+deb11u2 amd64 [installed]
Blu-ray disc playback support library (shared library)
All the above works fine in my Chimaera system.
Updated:
May 23: Added a fuller narrative on installing libdvdcss.
Added info on the command (libdvd-pkg) to install libdvdcss in order to be able to play DVDs.
I'm using an up-to-date Chimaera & VLC Media Player gives me the best results. The one thing that I still need to settle is the interactive DVD menu, which barely works.
(there are a staggering number of VLC-Plugins; perhaps one of those will fix the bad dvd-menu):
$ apt search vlc
Sorting... Done
Full Text Search... Done
…
vlc/stable,stable-security,now 3.0.18-0+deb11u1 amd64 [installed]
multimedia player and streamer
vlc-bin/stable,stable-security,now 3.0.18-0+deb11u1 amd64 [installed]
binaries from VLC
vlc-data/stable,stable-security,now 3.0.18-0+deb11u1 all [installed,automatic]
common data for VLC
You have not included the full command for hexdump to get the correct result (it is hexdump -C). However, to make life simple for yourself, use the shortcut hd (I only discovered just now that it is also hexdump).
If you look at this Wiki page for File Signatures you will find that the magic signature for ttf is 00 01 00 00 00. Therefore:
$ hd /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf | head
00000000 00 01 00 00 00 11 01 00 00 04 00 10 43 4f 4c 52 |............COLR|
00000010 9b 59 1b ad 00 13 22 f8 00 02 51 6c 43 50 41 4c |.Y...."...QlCPAL|
00000020 99 fd 98 85 00 15 74 64 00 00 0f ea 46 46 54 4d |......td....FFTM|
00000030 94 98 f7 41 00 15 84 50 00 00 00 1c 47 44 45 46 |...A...P....GDEF|
00000040 00 27 34 59 00 15 84 6c 00 00 00 1e 47 53 55 42 |.'4Y...l....GSUB|
00000050 71 92 ee f7 00 15 84 8c 00 00 6c 80 4f 53 2f 32 |q.........l.OS/2|
00000060 34 c3 0a 83 00 00 01 98 00 00 00 60 63 6d 61 70 |4..........`cmap|
00000070 25 3d b7 6c 00 00 6a a8 00 00 0b fa 63 76 74 20 |%=.l..j.....cvt |
00000080 00 11 01 44 00 00 76 a4 00 00 00 04 67 61 73 70 |...D..v.....gasp|
00000090  ff ff 00 03 00 13 22 f0  00 00 00 08 67 6c 79 66  |......".....glyf|
and you will spot the magic signature starting at '00000000'. If *your* ttf files do not start with this signature (which AHA6.ttf does not), then they will not be recognised by the system as TTF font files. My best guess, then, is that something in the way you are processing them is mangling them.
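If you have a whole directory of suspect fonts, here is a minimal sketch that flags any .ttf whose first four bytes are not 00 01 00 00 (note: genuine OpenType/CFF fonts start with 'OTTO' instead, so treat hits as leads rather than proof):
find /usr/share/fonts/truetype -name '*.ttf' | while read -r f; do
  sig=$(head -c 4 "$f" | od -An -tx1 | tr -d ' ')
  [ "$sig" = "00010000" ] || echo "suspect: $f (starts $sig)"
done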
There is a helpful Step-by-Step install TTF fonts on Linux page if you need help on that.
Update: spell-check
Please use [ code ] ... [/ code] tags to quote terminal-results (as below) (keeps the text in this window small). You would also be well advised to use your computer as an ordinary user rather than the root user. Switch to the root user only when you have some extensive admin to do that actually *requires* root to work (updating the system is such a requirement).
# file /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf
/usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf: TrueType Font data, 17 tables, 1st "COLR", 12 names, Macintosh, type 1 string
That is a proper ttf font …
# file /usr/share/fonts/truetype/aha/AHA6.ttf
/usr/share/fonts/truetype/aha/AHA6.ttf: data
… and that is most unlikely to be such a font. Here is one way to check:–
$ strings /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf | wc -l
28319
(28,319 lines of text-strings - that is what a ttf font should look like)
Here is what the top of the file looks like (you can also change the 'head' to 'less' (no quotes) to examine the whole file):–
$ strings /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf | head
COLR
QlCPAL
FFTM
GDEF
GSUBq
OS/24
`cmap%=
cvt
gasp
glyf
Now an examination using a hex-viewer; look at the first 12 bytes to see the magic-signature for a ttf font (ALL ttf-fonts should look something like this in the 1st 12 bytes):
$ hexdump -C /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf | head
00000000 00 01 00 00 00 11 01 00 00 04 00 10 43 4f 4c 52 |............COLR|
00000010 9b 59 1b ad 00 13 22 f8 00 02 51 6c 43 50 41 4c |.Y...."...QlCPAL|
00000020 99 fd 98 85 00 15 74 64 00 00 0f ea 46 46 54 4d |......td....FFTM|
00000030 94 98 f7 41 00 15 84 50 00 00 00 1c 47 44 45 46 |...A...P....GDEF|
00000040 00 27 34 59 00 15 84 6c 00 00 00 1e 47 53 55 42 |.'4Y...l....GSUB|
00000050 71 92 ee f7 00 15 84 8c 00 00 6c 80 4f 53 2f 32 |q.........l.OS/2|
00000060 34 c3 0a 83 00 00 01 98 00 00 00 60 63 6d 61 70 |4..........`cmap|
00000070 25 3d b7 6c 00 00 6a a8 00 00 0b fa 63 76 74 20 |%=.l..j.....cvt |
00000080 00 11 01 44 00 00 76 a4 00 00 00 04 67 61 73 70 |...D..v.....gasp|
00000090  ff ff 00 03 00 13 22 f0  00 00 00 08 67 6c 79 66  |......".....glyf|
You can now examine both TwemojiMozilla.ttf & AHA6.ttf on your own machine to discover whether these rogue fonts are actually fonts at all.
(if you do not yet have hexdump then install bsdextrautils (part of chimaera-stable)):–
$ apt search hexdump
Sorting... Done
Full Text Search... Done
bsdextrautils/stable,now 2.36.1-8+devuan2 amd64 [installed]
extra utilities from 4.4BSD-Lite
in fontforge ".._ttf is not a known format
The use of an underscore ("_") concerns me there. In general Linux uses mime (or file) to discover what a file actually is, rather than the Window$-inspired three-letter-suffix convention.
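In that spirit, you can also ask for the MIME type directly (the --mime-type flag is standard in file):
file --mime-type /usr/share/fonts/truetype/aha/AHA6.ttf
# a genuine TTF typically reports font/sfnt (application/x-font-ttf on older versions of file); plain 'data' comes back as application/octet-stream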
I do not have fontforge installed.
Possibly one way to begin to diagnose your situation is from the command-line. If you have mlocate installed (to quickly locate files) and either FireFox and/or Thunderbird installed then you will be able to locate this specific TTF font:
$ file /usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf
/usr/lib/firefox-esr/fonts/TwemojiMozilla.ttf: TrueType Font data, 17 tables, 1st "COLR", 12 names, Macintosh, type 1 string
That should help you begin to discover exactly what your system thinks any particular ttf file is.
One final other common source is mscorefonts and/or fonts-liberation:
$ apt search fonts-liberation
Sorting... Done
Full Text Search... Done
fonts-liberation/stable,now 1:1.07.4-11 all [installed]
Fonts with the same metrics as Times, Arial and Courier
fonts-liberation2/stable,now 2.1.3-1 all [installed]
Fonts with the same metrics as Times, Arial and Courier (v2)
ttf-mscorefonts-installer/stable,now 3.8 all [installed]
Installer for Microsoft TrueType core fonts
$ file /usr/share/fonts/truetype/msttcorefonts/Arial.ttf
/usr/share/fonts/truetype/msttcorefonts/Arial.ttf: TrueType Font data, digitally signed, 23 tables, 1st "DSIG", 70 names, Unicode, Typeface \251 The Monotype Corporation plc. Data \251 The Monotype Corporation plc/Type Solution
HTH
Hi amc252.
Every electronic component within your computer has a driver associated with it that allows that component to "play along". Monitors are no different from anything else, so your first search should be for a Chimaera driver for the digital TV.
The miracle of modern electronic equipment was made far easier with the introduction of PnP ("Plug 'n' Play"). That relies on a digital connection & various subsystems, and is what allows something like a monitor to be plugged in, auto-detected by the computer, recognised, and its driver auto-located via the internet, auto-downloaded & auto-installed. Now, a USB connection is certainly digital, but it is used more for modems or HDDs and little used for connecting monitors; HDMI connections are the standard for that.
Check your computer: does it have an HDMI port?
Check your digital monitor: does it have an HDMI port?
If the answers to the two questions above are both "yes" then you may be in business very quickly; just make sure that both ports are switched "on" in the setup for both machines, and that your computer is connected to the internet via an Ethernet port before you make the HDMI connection (Ethernet is 'old school' & thus causes few problems compared with WLAN).
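Once connected, a quick way to see which video outputs the machine actually exposes (requires a running X session):
xrandr --query | grep -w connected
# lists each live output (e.g. HDMI-1) together with the detected resolutions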
If the above is not possible and you are determined to go ahead with RCA and such-like, then bring your will up to date so that afterwards others can understand the reasons for your suicide.
Good luck.
I've got two drives that are USB-connected HDD:
Seagate 2TB portable
(this is formatted using standard Linux utilities to (so-called) FAT64 (HPFS/NTFS/exFAT: max 2TB))
WD (Western Digital) 4TB portable ("My Passport")
(this is in native M$ format ("Microsoft basic data") and I cannot find a Linux utility that can format and/or repair it in its current state)
The advantage of the former is that it is ubiquitous across many different OSes. As an example, my ancient Samsung TV can read & play movies from (1), but not (2).
The advantage of (2) is that it can store above the 2TB threshold; it is astonishing that I may need it to. I left the disk in its supplied format since that can be read by more OSes than a native Linux format.
As long as you have a 64-bit CPU then (as I understand it) either disk can be read up to the maximum the CPU supports (which I cannot recall as I sit here, but much, much more than 4TB).
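A minimal sketch to see how each drive is actually laid out (standard lsblk columns; PTTYPE shows gpt vs dos, which is what decides the 2TB question, since MBR ('dos') partition tables top out around 2TB while GPT goes far beyond):
lsblk -o NAME,SIZE,FSTYPE,PTTYPE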
OK. Thanks to admin (although the OP has been further edited).
Well, now that you have edited it, it reads "daedalus" although when first posted it read "deadalus".
I was attempting both to be light-hearted in my response and to warn other folks not to blindly copy your [ code ]'ed config, since it contained a spelling mistake. Also, I personally only ever 'code' actual results without editing them, so that others can trust that what I code is actually what I got as a result.
You need to address your remarks on the absence of non-free to fsmithred, since he states that yes, that does work whereas security & updates will not. I have no personal experience to be able to comment.
For those that realise that daedalus is *not* dead, do not copy brday's code.
The Devuan package information is here, and the sole Default configuration shown for daedalus is as follows on that page:
deb http://deb.devuan.org/merged daedalus main
@brday:
If your code was copied from the terminal then you likely have a reason (speling), otherwise non-free & contrib are not available for deadalus, though they may be for daedalus.
a new Chimaera live iso from two days ago
Possibly due to a recent kernel upgrade to 6.1.12-1 (available from backports):
$ uname -a
Linux ng3 6.1.0-0.deb11.5-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.12-1~bpo11+1 (2023-03-05) x86_64 GNU/Linux
$ la /boot/initrd* /boot/vmlinuz*
-rw-r--r-- 1 root root 68062772 Mar 28 22:35 /boot/initrd.img-6.0.0-0.deb11.6-amd64
-rw-r--r-- 1 root root 68913767 Mar 31 17:35 /boot/initrd.img-6.1.0-0.deb11.5-amd64
-rw-r--r-- 1 root root 7730784 Dec 19 14:14 /boot/vmlinuz-6.0.0-0.deb11.6-amd64
-rw-r--r-- 1 root root 7866720 Mar 5 18:27 /boot/vmlinuz-6.1.0-0.deb11.5-amd64
My actual update was March 31 (I update daily):
$ la -clt /boot/initrd* /boot/vmlinuz*
-rw-r--r-- 1 root root 68913767 Mar 31 17:35 /boot/initrd.img-6.1.0-0.deb11.5-amd64
-rw-r--r-- 1 root root 7866720 Mar 31 17:35 /boot/vmlinuz-6.1.0-0.deb11.5-amd64
-rw-r--r-- 1 root root 68062772 Mar 28 22:35 /boot/initrd.img-6.0.0-0.deb11.6-amd64
-rw-r--r-- 1 root root 7730784 Jan 4 10:14 /boot/vmlinuz-6.0.0-0.deb11.6-amd64
Hi Ralph
If you have any suggestions I'll investigate them. However, I'm used to GitHub now & it is free. Of course, that *is* what was said about MSIE…
It is said that the connection between Rats & Bulldogs lies in the construction of their jawbones: once they bite, neither can release their teeth until the jaws clamp together (due to a ratchet mechanism joining the upper & lower jawbones). I sympathise with both species; my mind has a similar mechanism.
I finally spotted how to determine the precise length of the embedded URL within each cache (simple) Entry file. It is now possible to collate all urls, data-lengths, etc. That finally opens the possibility of providing url + file listing, search, selection + individual extraction. However, that will all have to wait for later. For now, it is a simple utility that extracts all cached files (or just one file) into a single directory (listing below).
There is a commented-out print-line almost at the bottom of the script. It can produce a listing of all files for you. The following from a terminal will do that (comment out the $DD lines & uncomment the PRINT line first):
~/Personal/.getCC > temp.txt; sort -n temp.txt > mime.txt;
The Cache contains all kinds of corrupted files. There are lines in the script to try to catch those; the notices go to STDERR so it will not corrupt your mime.txt.
Note that there has been a radical reset of almost all code, which creates some disjuncture between current code & earlier BugFix comments. $magic is still in the code but is unused now.
If I cannot stop myself producing a file browser then I shall place the code into GitHub, so that this thread can finally sleep.
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-21: Finally found location of URL-length
# (& thus how to find start of content for all files)
# 2023-03-16: bugfix: Account for Content-Encoding invalidating file-magic
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (FOFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $FOFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# Global CONSTANTS
my $UNBROT= "/usr/bin/brotli -d"; # change to your location
my $DD = "/bin/dd"; # - ditto -
my $GUNZIP= "/bin/gunzip"; # - ditto -
my $TOUCH = "/usr/bin/touch"; # - ditto -
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data"; # Chromium cache folder
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place to extract files to
my $FOFF = 52; # Offset of HTTP-begin from magic-eof (BEG) + LEN
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $MEOF = "\x{d8}\x{41}\x{0d}\x{97}\x{45}\x{6f}\x{fa}\x{f4}"; # Magic End bits (last 8 bytes of every simple cache Entry file data record)
my $MENT = "\x{30}\x{5c}\x{72}\x{a7}\x{1b}\x{6d}\x{fb}\x{fc}"; # Magic Start bits (1st 8 bytes of every simple cache Entry file data record)
my $MURL = "_dk_"; # Magic Start for URL (url follows within cache Entry file data record)
# save algorithm:
# 1) $URL/@URL: find $key_length from header
# 2) $BEG;$END;$LEN: obtain data start+end (from $key_length + $MEOF)
# 3) only save HTTP 200 files ($HTTP)
# 4) $HTTP;$BROTLI;$GZIP;$MIME;$MOD;$TLS: obtain http header fields (from $MEOF + $FOFF)
# 5) extract section $BEG to $END from $IN file into $OUT dir
# 6) $MOD: touch file to conform with http header date
# 7) $BROTLI;$GZIP: decompress gzip/brotli files
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# chromium cache in 2023 is a "simple cache"
# see https://www.chromium.org/developers/design-documents/network-stack/disk-cache/very-simple-backend/
# see https://chromium.googlesource.com/chromium/src/+/HEAD/net/disk_cache/simple/simple_entry_format.h
# see https://github.com/JimmXinu/FanFicFare/blob/main/fanficfare/browsercache/browsercache_simple.py
# start-of-record magic-marker == 30 5c 72 a7 1b 6d fb fc
# end-of-record magic-marker == d8 41 0d 97 45 6f fa f4
# (data ends immediately before eor)
# (http header starts 44 bytes after eor, and thus 44+8=52 bytes (\x34) after end-of-data)
# (eor also ends file; 16 bytes then follow to actual end-of-file)
# from FFF: (finally found url-length location)
# cache Entry-file header = struct.Struct('<QLLLL') [little-endian | 8-byte | 4-byte | 4-byte | 4-byte | 4-byte)
# (magic, version, key_length, key_hash, padding) = shformat.unpack(data)
# Parse Chrome Cache File; see https://github.com/JimmXinu/FanFicFare/blob/main/fanficfare/browsercache/chromagnon/cacheParse.py
opendir( my $d, "$IN") or die "Cannot open directory $IN"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "be75a13d44e548da_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $LEN = -1; # content-length
my $MAGIC = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $URL = ""; # url within cache Entry file
my @URL = (); # same url as an array
my $UPOS = ""; # position of url start in Entry file
open my $fh, '<:raw', "$IN/$f" or die "Cannot open file $IN/$f";
# 1 Obtain url length then url
# $key_length starts from byte 24 (\x18), normally begins with an 8-byte string '1/0/_dk_', then stretches to the end of the URL sequence
# the std 8-byte string indicates that two streams (1 + 0) are included within the file
# the request-url sequence is 2 x (normally-identical) base urls then the full request url, each separated by a single space
# data supplied to request url begins immediately after the url, and ends immediately before the $MEOF magic-marker
# http response headers begin 44 bytes after the end of $MEOF, starting with HTTP Status string at $HPOS
# none of the "std" response headers can be *expected* to exist, though most do
# all sorts of stuff exists after initial response header bundle, many of which I do not understand
#+ including content-servers such as amazon, certificates, proxy-servers, others
# this second stream (for std 2-stream files) ends with another $MEOF 16 bytes (\x10) before eof
# eg1: "1/0/_dk_https://bbc.co.uk https://bbc.co.uk https://static.files.bbci.co.uk/core/bundle-service-bar.003e5ecd332a5558802c.js"
# \x18 ^ ^ $UPOS (=32 =\x20) ($key_length =123 =\x7b; note: 24+123 =147 =\x93) \x93 ^
# eg2: "d8410d97 456ffaf4 01000000 24be2bf3 8d010000000000005814000003654702 acd8b17d9a552f00b8a4b27d9a552f00 40040000 HTTP/1.1 200"
# \x220 ^ \x228 ^ \x230 ^ \x240 ^ \x250 ^ ^ $HPOS (=596 =\x254)
my $bytes_read = read $fh, my $bytes, 24;
die "Got $bytes_read but expected 24" unless $bytes_read == 24;
my ($magic, $version, $key_length, $key_hash, $padding) = unpack 'a8 a4 a4 a4 a4', $bytes;
if( unpack('Q', $magic ) ne unpack('Q', $MENT )) {
$magic = unpack('H16', $magic );
$MENT = unpack('H16', $MENT );
die "'$IN/$f' is not a cache entry file, wrong magic number\n (got '$magic' not '$MENT')";
}
seek( $fh, 0, 0 ); # return to start of file
read( $fh, my $cache_buffer, -s "$IN/$f" ); # put whole file in $cache_buffer
close( $fh ) or die "could not close $IN/$f";
# Obtain url
if( $cache_buffer =~ /$MURL/ ) {
$UPOS = $-[0] + 4; # url begins immediately *after* marker string
$key_length=unpack('L', $key_length );
$key_hash =unpack('H16', $key_hash );
$URL = substr( $cache_buffer, $UPOS, $key_length - ($UPOS - 24));
@URL = split(' ', $URL );
}
# 2 Obtain data start+end
$BEG = $key_length + 24;
$END = index( $cache_buffer, "$MEOF", $BEG);
if( $END < 1 ) {
print STDERR "'$IN/$f': error finding end of data at $0 line:". __LINE__ ."\n";
next; # immediately skips up to foreach() + increments $f
} else {
if( $BEG == $END ) { # yes, some pages have Content-Length:0
$LEN = -1;
} else {
$LEN = $END - $BEG;
}
}
# 3 Only extract from HTTP 200|203
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
$HPOS = $-[0] + 2;
if( $HPOS != $END + $FOFF) {
print STDERR "'$IN/$f': error finding start of http at $0 line:". __LINE__ ."\n";
next; # immediately skips up to foreach() + increments $f
}
$HVER = "$1"; # http version; always HTTP/1.1 for me
$HSTA = "$2"; # http status; we are only interested in 200 or 203
$HTTP = "HTTP/$HVER $HSTA";
# 4 Obtain http header fields
if( $LEN > 0 ) { # yes, some pages have Content-Length:0
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
if( $1 != $LEN ) {
print STDERR "'$IN/$f': data-length \$LEN=$LEN differs from http Content-Length=$1 at $0 line:". __LINE__ ."\n";
}
if( !$1 ) { print STDERR "'$IN/$f': len=0 at $0 line:". __LINE__ ."\n"; }
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) {# did the page not want to be cached? (Chromium cached it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
} else { next; } # if( $LEN > 0 )
# easy to mixup mime/media-types & encoding (compression schemes) here
# Content-Type == mime-type refers to the type of file that is being transferred
# Content-Encoding == compression scheme refers to the type of compression used during transfer
# so, a text file (js txt xml, etc) with gzip magic will be a gzipped-textfile (eg file.xml.gz)
# only gzip (+ brotli) encodings are supported; deflate is not supported, compress is not even mentioned
# see https://httpd.apache.org/docs/current/mod/mod_deflate.html
# see https://www.iana.org/assignments/media-types/media-types.xhtml
given( $MIME ) {
when ('application/font-woff' ) { $MAGIC = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $MAGIC = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/manifest+json'){ $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/x-javascript'){ $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('binary/octet-stream') { $MAGIC = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $MAGIC = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $MAGIC = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $MAGIC = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $MAGIC = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $MAGIC = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $MAGIC = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $MAGIC = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $MAGIC = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $MAGIC = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; } # magic for gzip encoding
when ('image/vnd.microsoft.icon'){ $MAGIC = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $MAGIC = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('image/x-icon') { $MAGIC = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('text/css') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; } # magic for gzip encoding
when ('text/fragment+html') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; } # magic for gzip encoding
when ('text/html') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; } # magic for gzip encoding
when ('text/javascript') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('text/plain') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; } # magic for gzip encoding
when ('text/xml') { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('video/mp4') { $MAGIC = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $MAGIC = ''; $OFF = 0; $TLS = ''; }
}
# gzip encoding overrides file magic (is earlier in file-stream)
# brotli encoding overrides file magic (there is none)
if( $GZIP ) { $MAGIC = "\x{1f}\x{8b}\x{08}"; $OFF = 0; } elsif( $BROTLI ) { $MAGIC = ""; $OFF = 0; }
if( $MAGIC ) {
if( $MAGIC eq 'GIF87a') { # account for gif + jpeg multiple $MAGIC
if( index( $cache_buffer, $MAGIC ) < 0 ) { # index() returns -1 when not found
$MAGIC = 'GIF89a';
}
} elsif( $MAGIC eq 'JFIF') {
if( index( $cache_buffer, $MAGIC ) < 0 ) {
$MAGIC = 'Exif';
$TLS = 'jpeg';
if( index( $cache_buffer, $MAGIC ) < 0 ) {
$MAGIC = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
}
}
}
}
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# 5 print the files out
if( $BEG > -1 && $LEN > -1 ) {
`$DD if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
# 6 set the date to last-modified
if( $MOD ) { `$TOUCH "$OUT/$f$TLS" -d "$MOD"`; }
# 7 decompress if necessary
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { # decompressed; .gz/.br suffix removed
`$GUNZIP "$OUT/$f$TLS"`; # original file removed; date retained
} else {
`$UNBROT -j "$OUT/$f$TLS"`;
}
}
} # lots of Content-Length:0 files
# print "$MIME; $URL[0]; $f; \$key_length=$key_length; \$key_hash=$key_hash; \$BEG=$BEG; \$END=$END; \$LEN=$LEN; \$TLS=$TLS \n";
} # if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d.\d*)\s(200|203)/i ) # other pages mostly HTTP 204 No Content
}
Thursday update: small improvement to comments
This should be the last code update for now (below).
It is tested as well as I can manage in a short time. ~64% of cache are HTTP 200, with most of the rest being 204 No Content. A number of the 200 OK files are also Content-Length:0 (js files for search-results in many cases). The script is written so that no attempt is made to extract no-content files.
The final search was for Content-Encoding: (compression before delivery). My main source was the latest Apache modules documentation, which showed that only gzip & brotli are currently used. The statement was that "deflate is not supported", whilst compress was not even mentioned.
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (F_OFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $F_OFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# save algorithm:
# 1) only save HTTP 200 files ($END)
# 2) try first to set file beginning ($BEG) from magic bytes
# 3) if (2) fails, set $BEG from $LEN; if no length, then ignore file
# 4) extract section $BEG to $END from $IN file into $OUT dir
# 5) touch file to conform with http header date
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# Global CONSTANTS
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; # Chromium cache folder.
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place for extracted files
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $F_OFF= 52; # Offset of HTTP-begin from magic-begin (BEG) + LEN
opendir( my $d, "$IN") or die "Cannot open directory $IN: $!\n"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "000420fedcafe6ff_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $magic = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $LEN = -1; # content-length
open my $fhi, '<:raw', "$IN/$f" or die $!;
read( $fhi, my $cache_buffer, -s "$IN/$f" );
close( $fhi ) or die "could not close $IN/$f: $!";
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
$HPOS = $-[0] + 2;
$HVER = "$1";
$HSTA = "$2";
$HTTP = "HTTP/$HVER $HSTA";
}
$END = index( $cache_buffer, "$HTTP", $HPOS); # Check for presence of HTTP 200|203 header (paranoia coding)
if( $END > -1 ) { #+(and therefore std header fields for successful access)
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
$LEN = $1;
if( !$LEN ) { $LEN = -1; } # yes, some pages have Content-Length:0
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium cached it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
# easy to mixup mime/media-types & encoding (compression schemes) here
# Content-Type == mime-type refers to the type of file that is being transferred
# Content-Encoding == compression scheme refers to the type of compression used during transfer
# so, a text file (js txt xml, etc) with gzip magic will be a gzipped-textfile (eg file.xml.gz)
# only gzip (+ brotli) encodings are supported; deflate is not supported, compress is not even mentioned
# see https://httpd.apache.org/docs/current/mod/mod_deflate.html
# see https://www.iana.org/assignments/media-types/media-types.xhtml
given( $MIME ) {
when ('application/font-woff' ) { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/manifest+json'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; } # magic for gzip encoding
when ('application/x-javascript'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('binary/octet-stream') { $magic = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $magic = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $magic = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $magic = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $magic = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $magic = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $magic = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $magic = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; } # magic for gzip encoding
when ('image/vnd.microsoft.icon'){ $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $magic = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('image/x-icon') { $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('text/css') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; } # magic for gzip encoding
when ('text/fragment+html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; } # magic for gzip encoding
when ('text/html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; } # magic for gzip encoding
when ('text/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('text/plain') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; } # magic for gzip encoding
when ('text/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; } # magic for gzip encoding
when ('video/mp4') { $magic = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $magic = ''; $OFF = 0; $TLS = ''; }
}
if( $magic ) {
if( $magic eq 'GIF87a') { # account for gif + jpeg multiple $magic
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'GIF89a';
$BEG = index( $cache_buffer, "$magic" );
}
} elsif( $magic eq 'JFIF') {
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'Exif';
$TLS = 'jpeg';
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
$BEG = index( $cache_buffer, "$magic" );
}
}
} else { # all other magics: a single search
$BEG = index( $cache_buffer, "$magic" );
}
}
# fix $BEG + $LEN
if( $BEG > -1 ) {
$BEG -= $OFF;
if( $LEN < 1 ) { $LEN = $END - $BEG - $F_OFF; } # v rare, but happens
} elsif( $LEN > -1 ) { $BEG = $END - $LEN - $F_OFF; } # no magic (text + brotli files)
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # compression-encoding
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# print the files out
if( $BEG > -1 && $LEN > -1 ) {
`dd if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
if( $MOD ) { `touch "$OUT/$f$TLS" -d "$MOD"`; }
# print "$MIME: $f; \$TLS=$TLS; \$BEG=$BEG; \$LEN=$LEN; \$END=$END; \$MOD=$MOD; \n";
} # lots of Content-Length:0 files
} # if( $END > -1 ) # other pages mostly HTTP 204 No Content
}
getCC, the ChromeCache decrypt script:
Added the ability to decode any HTTP version + status 200 or 203 files.
Testing results:
$ ~/Personal/.getCC
image/webp: 000420fedcafe6ff_0; $TLS=.webp; $HPOS=5185; $END=5185; $HVER=1.1; $HSTA=200; $HTTP=HTTP/1.1 200; $MOD=Fri, 03 Mar 2023 20:27:56 GMT;
$ cd ~/Personal/ChromeCache/Files
$ time ~/Personal/.getCC
real 1m28.431s
user 0m53.738s
sys 0m34.723s
I had noticed that all HTTP/1.1 server responses were preceded by two null bytes in the cache files:
00001430 fc 9b 54 2f 00 a4 ec 1b fc 9b 54 2f 00 57 02 00 |..T/......T/.W..|
00001440 00 48 54 54 50 2f 31 2e 31 20 32 30 30 00 61 63 |.HTTP/1.1 200.ac|
00002800 02 72 ab 33 c9 16 55 2f 00 8d 9c 35 c9 16 55 2f |.r.3..U/...5..U/|
00002810 00 56 02 00 00 48 54 54 50 2f 31 2e 31 20 32 30 |.V...HTTP/1.1 20|
00003740 ea 4c 55 2f 00 48 7c d5 ea 4c 55 2f 00 55 02 00 |.LU/.H|..LU/.U..|
00003750 00 48 54 54 50 2f 31 2e 31 20 32 30 30 00 61 63 |.HTTP/1.1 200.ac|
I used that fact to guarantee that the HTTP string being indexed for was the correct one, and updated $HTTP to contain the correct strings.
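You can confirm the same pattern on your own cache with GNU grep (-P allows the NUL escapes, -a treats the binary file as text, -c counts matching lines; file name as in the test above):
grep -caP '\x00\x00HTTP/1\.1 (200|203)' ~/.cache/chromium/Default/Cache/Cache_Data/000420fedcafe6ff_0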
Here is the latest code:
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through Chromium/Chrome 'Cache_Data/' dir
#+ & extract all http-delivered files stored within those data-files
# 2023-03-12: Account for multiple http version + 200|203 status
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (F_OFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $F_OFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# save algorithm:
# 1) only save HTTP 200 files ($END)
# 2) try first to set file beginning ($BEG) from magic bytes
# 3) if (2) fails, set $BEG from $LEN; if no length, then ignore file
# 4) extract section $BEG to $END from $IN file into $OUT dir
# 5) touch file to conform with http header date
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# Global CONSTANTS
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; # Chromium cache folder.
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place for extracted files
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $F_OFF= 52; # Offset of HTTP-begin from magic-begin (BEG) + LEN
opendir( my $d, "$IN") or die "Cannot open directory $IN: $!\n"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "000420fedcafe6ff_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $HPOS = -1; # 'HTTP' string begins (bytes)
my $HSTA = -1; # 'HTTP' status string (only interested in '200' or '203')
my $HVER = ''; # 'HTTP' version string (eg '1.1')
my $magic = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $LEN = -1; # content-length
open my $fhi, '<:raw', "$IN/$f" or die $!;
read( $fhi, my $cache_buffer, -s "$IN/$f" );
close( $fhi ) or die "could not close $IN/$f: $!";
if( $cache_buffer =~ /\x{00}\x{00}HTTP\/(\d\.\d*)\s(200|203)/i ) {
$HPOS = $-[0] + 2;
$HVER = "$1";
$HSTA = "$2";
$HTTP = "HTTP/$HVER $HSTA";
}
$END = index( $cache_buffer, "$HTTP", $HPOS); # Check for presence of HTTP 200|203 header (paranoia coding)
if( $END > -1 ) { #+(and therefore std header fields for successful access)
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
$LEN = $1;
if( !$LEN ) { $LEN = -1; } # yes, some pages have Content-Length:0
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium cached it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
given( $MIME ) {
when ('application/font-woff' ) { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; }
when ('application/x-javascript'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('application/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; }
when ('binary/octet-stream') { $magic = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $magic = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $magic = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $magic = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $magic = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $magic = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $magic = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $magic = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; }
when ('image/vnd.microsoft.icon'){ $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $magic = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('text/css') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; }
when ('text/fragment+html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; }
when ('text/html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; }
when ('text/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('text/plain') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; }
when ('video/mp4') { $magic = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $magic = ''; $OFF = 0; $TLS = ''; }
}
if( $magic ) {
if( $magic eq 'GIF87a') { # account for gif + jpeg multiple $magic
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'GIF89a';
$BEG = index( $cache_buffer, "$magic" );
}
} elsif( $magic eq 'JFIF') {
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'Exif';
$TLS = 'jpeg';
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
$BEG = index( $cache_buffer, "$magic" );
}
}
} else { # all other magics: a single search
$BEG = index( $cache_buffer, "$magic" );
}
}
# # trying to decode where each file begins (determine common offsets)
# if( $LEN < 1 && $BEG > -1 ) { }
# if( $BEG > -1 && $LEN > -1 ) {
# # at this point $BEG - $OFF == start of magic
# # $END == start of $HTTP
# # $LEN == length of content from header
# my $mbeg = $BEG - $OFF; my $mhex = sprintf("0x%X", $mbeg);
# my $hbeg = $END - $LEN; my $hhex = sprintf("0x%X", $hbeg);
# my $diff = $hbeg - $mbeg;
# my $dhex = sprintf("0x%X", $diff);
# print "$MIME: $f; \$END/\$LEN=$END / $LEN; \$mbeg=$mbeg / $mhex; \$hbeg=$hbeg / $hhex; \$diff=$diff / $dhex; \n";
# }
if( $BEG > -1 ) {
$BEG -= $OFF;
if( $LEN < 1 ) { $LEN = $END - $BEG - $F_OFF; } # v rare, but happens
} elsif( $LEN > -1 ) { $BEG = $END - $LEN - $F_OFF; } # no magic (text, xml + brotli files)
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # account for different compression-encodings
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# print the files out
if( $BEG > -1 && $LEN > -1 ) {
`dd if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
if( $MOD ) { `touch "$OUT/$f$TLS" -d "$MOD"`; }
# print "$MIME: $f; \$TLS=$TLS; \$HPOS=$HPOS; \$END=$END; \$HVER=$HVER; \$HSTA=$HSTA; \$HTTP=$HTTP; \$MOD=$MOD; \n";
}
} # if( $END > -1 ) # other pages are most likely to be HTTP 204 No Content
}
getCC, the ChromeCache decrypt script:
I needed to know whether the script needed to activate on HTTP status codes other than just 200, so I did some calculations:
$ la ~/.cache/chromium/Default/Cache/Cache_Data/* | wc -l
22525
$ strings ~/.cache/chromium/Default/Cache/Cache_Data/* | fgrep "HTTP/1.1" | sort | uniq -c
strings: Warning: '/home/alexk/.cache/chromium/Default/Cache/Cache_Data/index-dir' is a directory
14055 HTTP/1.1 200
1 HTTP/1.1 200 200
564 HTTP/1.1 200 OK
7490 HTTP/1.1 204
45 HTTP/1.1 204 No Content
5 HTTP/1.1 206
42 HTTP/1.1 301
15 HTTP/1.1 301 Moved Permanently
236 HTTP/1.1 302
1 HTTP/1.1 302 Found
1 HTTP/1.1 303 See Other
1 HTTP/1.1 307
2 HTTP/1.1 400
2 HTTP/1.1 403
84 HTTP/1.1 404
5 HTTP/1.1 404 Not Found
1 HTTP/1.1 410
11 HTTP/1.1 500
Sums:
65% 14,620 HTTP 200 OK
33% 7,535 HTTP 204 No Content
0% 5 HTTP 206 Partial Content
0% 57 HTTP 301 Moved Permanently
1% 237 HTTP 302 Found
0% 1 HTTP 303 See Other
0% 1 HTTP 307 Temporary Redirect
0% 2 HTTP 400 Bad Request
0% 2 HTTP 403 Forbidden
0% 89 HTTP 404 Not Found
0% 1 HTTP 410 Gone
0% 11 HTTP 500 Internal Server Error
Ah well, that's OK then. The script can stick with status 200, no problem. There is a small chance that 203 Non-Authoritative Information may be involved (responses from a proxy, though it never features in my accesses), but I'm happy to consider that chance remote.
All of the 22 thousand files in the current cache were from servers reporting themselves to be version 1.1. HTTP/0.9 & HTTP/1.0 are now considered obsolete (I bet that some servers still exist). Both HTTP/2 & HTTP/3 are now supposed to be a thing, although no server reported either version in my accesses. However, I obviously need to modify the Perl regex to accept such possibilities, and that will come with the next post.
Explanation + info on setting up getCC, the ChromeCache decrypt script:
Install Perl if necessary
(the script makes use of switch, which was included by default from version 5.10, but is also available from CPAN)
Place the script where you will
Make executable
(chmod +x; chmod 700)
Set the values of $IN & $OUT
(lines 41 + 42; be careful to check permissions, particularly for $OUT)
Run the command from a command-prompt
(there are often tens of thousands of files decrypted, so there is zero terminal output if there are no errors)
Install brotli
(sudo apt install brotli)
(this is to facilitate viewing text files)
(I run Chimaera & it is available as standard)
All lines beginning with a # are comments.
Lines 137 - 148 are all commented out. It was exploratory code to determine whether there was a common offset to the beginning of the cached file. There *was* indeed such an offset ($diff). This was important, as not all files contained magic, and the start-of-file varied in ways that I could not decrypt.
The Chrome Cache_Data dir contains data-files, each of which holds the data + http-header from a single HTTP file delivered by a server during a Chrome/Chromium browser session.
HTTP files consist of an HTTP header + data.
The Cache_Data files have the file-data near the top of the file, then the HTTP header & then a bunch of other stuff. Here is a *very* small gif-file to make the point (look for 'GIF89a', the gif magic-marker, at hex offset CA in the hex-dump below). Notice how the gif is just 43 bytes, yet the cache-file that contains it is 4,389 bytes:
$ la ~/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0
-rw------- 1 alexk alexk 4389 Feb 24 02:31 /home/alexk/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0
$ la ~/Personal/ChromeCache/Files/fff822c2bb27d828_0.gif
-rw-r--r-- 1 alexk alexk 43 Feb 24 02:31 /home/alexk/Personal/ChromeCache/Files/fff822c2bb27d828_0.gif
$ hexdump ~/.cache/chromium/Default/Cache/Cache_Data/fff822c2bb27d828_0 -C | head -31
00000000 30 5c 72 a7 1b 6d fb fc 05 00 00 00 b2 00 00 00 |0\r..m..........|
00000010 23 84 68 3b 00 00 00 00 31 2f 30 2f 5f 64 6b 5f |#.h;....1/0/_dk_|
00000020 68 74 74 70 73 3a 2f 2f 61 6d 61 7a 6f 6e 2e 63 |https://amazon.c|
00000030 6f 2e 75 6b 20 68 74 74 70 73 3a 2f 2f 61 6d 61 |o.uk https://ama|
00000040 7a 6f 6e 2e 63 6f 2e 75 6b 20 68 74 74 70 73 3a |zon.co.uk https:|
00000050 2f 2f 61 61 78 2d 65 75 2e 61 6d 61 7a 6f 6e 2e |//aax-eu.amazon.|
00000060 63 6f 2e 75 6b 2f 65 2f 6c 6f 69 2f 69 6d 70 3f |co.uk/e/loi/imp?|
00000070 62 3d 4a 48 4f 6b 41 4c 63 55 4e 66 59 35 4f 61 |b=JHOkALcUNfY5Oa|
00000080 54 5f 5a 31 61 39 4c 32 67 41 41 41 47 47 67 55 |T_Z1a9L2gAAAGGgU|
00000090 4b 4d 77 67 4d 41 41 41 48 32 41 51 42 4f 4c 30 |KMwgMAAAH2AQBOL0|
000000a0 45 67 49 43 41 67 49 43 41 67 49 43 41 67 49 43 |EgICAgICAgICAgIC|
000000b0 42 4f 4c 30 45 67 49 43 41 67 49 43 41 67 49 43 |BOL0EgICAgICAgIC|
000000c0 41 67 49 43 41 2d 55 71 38 45 47 49 46 38 39 61 |AgICA-Uq8EGIF89a|
000000d0 01 00 01 00 f0 00 00 00 00 00 00 00 00 21 f9 04 |.............!..|
000000e0 01 00 00 00 00 2c 00 00 00 00 01 00 01 00 00 02 |.....,..........|
000000f0 02 44 01 00 3b d8 41 0d 97 45 6f fa f4 01 00 00 |.D..;.A..Eo.....|
00000100 00 ab bd 8a cb 2b 00 00 00 00 00 00 00 dc 0f 00 |.....+..........|
00000110 00 03 0d 45 02 86 fc 8d 34 ff 53 2f 00 e7 d9 8e |...E....4.S/....|
00000120 34 ff 53 2f 00 bd 00 00 00 48 54 54 50 2f 31 2e |4.S/.....HTTP/1.|
00000130 31 20 32 30 30 20 4f 4b 00 53 65 72 76 65 72 3a |1 200 OK.Server:|
00000140 20 53 65 72 76 65 72 00 44 61 74 65 3a 20 46 72 | Server.Date: Fr|
00000150 69 2c 20 32 34 20 46 65 62 20 32 30 32 33 20 30 |i, 24 Feb 2023 0|
00000160 32 3a 33 31 3a 30 38 20 47 4d 54 00 43 6f 6e 74 |2:31:08 GMT.Cont|
00000170 65 6e 74 2d 54 79 70 65 3a 20 69 6d 61 67 65 2f |ent-Type: image/|
00000180 67 69 66 00 43 6f 6e 74 65 6e 74 2d 4c 65 6e 67 |gif.Content-Leng|
00000190 74 68 3a 20 34 33 00 78 2d 61 6d 7a 2d 72 69 64 |th: 43.x-amz-rid|
000001a0 3a 20 42 37 35 4d 32 37 57 4e 38 38 32 54 59 4d |: B75M27WN882TYM|
000001b0 45 56 32 4e 46 48 00 56 61 72 79 3a 20 43 6f 6e |EV2NFH.Vary: Con|
000001c0 74 65 6e 74 2d 54 79 70 65 2c 41 63 63 65 70 74 |tent-Type,Accept|
000001d0 2d 45 6e 63 6f 64 69 6e 67 2c 55 73 65 72 2d 41 |-Encoding,User-A|
000001e0 67 65 6e 74 00 00 00 00 00 03 00 00 00 0d 07 00 |gent............|
$ hexdump fff822c2bb27d828_0.gif -C
00000000 47 49 46 38 39 61 01 00 01 00 f0 00 00 00 00 00 |GIF89a..........|
00000010 00 00 00 21 f9 04 01 00 00 00 00 2c 00 00 00 00 |...!.......,....|
00000020 01 00 01 00 00 02 02 44 01 00 3b |.......D..;|
0000002b
So, in the Cache file:
hex CA: filedata begins ('GIF89a')
hex 129: http header begins ('HTTP/1.1 200 OK')
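Those two offsets also confirm the fixed gap that getCC relies upon ($F_OFF): the gif data ends at hex CA + 43 bytes = hex F5, and hex 129 - hex F5 = hex 34 = 52 bytes, the constant $diff between content-end & HTTP-begin.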
Amongst other things, the HTTP header can give the Type of the file, its length, the delivery Date & the Encoding (type of compression).
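Picking those fields out of the embedded header in the hexdump above (they are NUL-separated within the cache-file):
HTTP/1.1 200 OK
Server: Server
Date: Fri, 24 Feb 2023 02:31:08 GMT
Content-Type: image/gif
Content-Length: 43
x-amz-rid: B75M27WN882TYMEV2NFH
Vary: Content-Type,Accept-Encoding,User-Agent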
Every sensible Internet server compresses most of the files that it delivers, particularly text-files. At the moment getCC only detects gzip & brotli compression:-
gzip: shown as 'file.txt.gz'
brotli: shown as 'file.txt.br'
If viewed from a terminal with less file.txt.gz, the gzip-file will be auto-decompressed & shown as plain text within the less-screen. That will NOT work the same for Brotli files unless you take the steps below.
My version of BASH uses ~/.bashrc as a shell-script to initialise itself. The following code within ~/.bashrc enables less to auto-decode a wealth of different compressions (though not Brotli) in conjunction with LESSPIPE:-
# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"
Take the following steps to add Brotli to all the other auto-decoded compressions:
Install Brotli
Save the script below as "~/.lessfilter"
Make it executable
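(chmod +x ~/.lessfilter; once that is done, any .br file that getCC produces should display as transparently as the .gz ones, e.g. less ~/Personal/ChromeCache/Files/0f0ce6df8548452e_0.css.br; that filename is hypothetical)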
#!/bin/sh
# ~/.lessfilter
# 2023-03-11 add brotli to all other encodings for less
case "$1" in
*.br)
brotli -dc "$1"
;;
*)
# We don't handle this format.
exit 1
;;
esac
# No further processing by lesspipe necessary
exit 0
I'm setting this thread to "SOLVED" now.
WINE has been fixed by removing it, and the script I added in the previous post now works fully to extract all of the files within Cache_Data. The one thing that is missing is a description of the script + how to set up less to auto-show the compressed Brotli files, so I'll put that in the next post.
I'm simply astonished that so few people (seemingly just one) have produced a Chrome cache viewer.
There *is* another on GitHub. It was a little heavyweight for me, so I spent a week learning PERL whilst writing a script to extract all the Chrome-cached files into a directory. ~100 lines. Below for your elucidation:
4pm update: +20 lines to fix ~2000 bad files
5pm update: added Brotli compression encoding; still not sure if that works ok
Mar 8 update: Brotli now works; ~150 active lines (+ ~10 debug lines commented out)
#!/usr/bin/perl
# get Chrome Cache
# suggestion: save as ~/.getCC; chmod +x; chmod 700
# A PERL script to iterate through the Chromium/Chrome 'Cache_Data/'
#+extract all http-delivered files stored within those data-files
# 2023-03-08: bugfix: COUNT removed; LEN used instead
# + (F_OFF used for BEG, not COUNT)
# + brotli now works
# + (no magic for brotli (a mistake imo))
# 2023-03-07: bugfix: corrected miss on most magic files (my bad)
# + excluded compound header fields to eliminate wrong values
# added $F_OFF (diff between HTTP-begin ($END - $LEN) & magic-begin ($BEG))
# + (*every* file with both $BEG & $LEN has diff == x34) (h-begin is bigger)
# + thus if no magic but LEN then BEG = END - LEN - 52
# + if magic but no LEN then LEN = END - BEG - 52 (yes, this *does* happen)
# 2023-03-05: bugfix: coded to exclude 711 zero-length files
# + account for multiple-same-value $mime (fixes ~1000 gif + jpg files)
# + added 'Content-Encoding:br' Brotli compression
# + (you may need 'sudo apt install brotli' to view those files)
use strict;
use warnings;
use autodie;
use experimental qw( switch );
# save algorithm:
# 1) only save HTTP 200 files ($END)
# 2) try first to set file beginning ($BEG) from magic bytes
# 3) if (2) fails, set $BEG from $LEN; if no length, then ignore file
# 4) extract section $BEG to $END from $IN file into $OUT dir
# 5) touch file to conform with http header date
# Stats 2023-03-06:
# 10978 HTTP 200 from 23594 files in Cache_Data
# 6 do NOT contain a MIME field
# 10979 files saved to disk (real 1m23.219s)
# Global CONSTANTS
my $IN = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; # Chromium cache folder.
my $OUT = "/home/alexk/Personal/ChromeCache/Files/"; # Place for extracted files
my $HTTP = "HTTP/1.1 200"; # '200 OK' not in all files
my $F_OFF= 52; # Fixed gap (hex 34) between content-end ($BEG + $LEN) & HTTP-begin ($END)
opendir( my $d, "$IN") or die "Cannot open directory $IN: $!\n"; # Open cache dir
my @list
= grep {
!/^\.\.?$/ # miss /. + /.. files
&& -f "$IN/$_" # is a file (not dir, etc)
} readdir( $d );
closedir( $d );
foreach my $f (@list) { # Iterate through each cached data-file
# my $f = "0f0ce6df8548452e_0";
# section variables
my $BEG = -1; # Extract begins (bytes)
my $BROTLI = 0; # brotli encoding (0/1)
my $END = -1; # Extract ends (bytes)
my $GZIP = 0; # gzip encoding (0/1)
my $magic = '';
my $MIME = ""; # content-type
my $MOD = ""; # last-modified
my $OFF = -1; # Offset of magic from file beginning
my $TLS = ""; # TLS==Three Letter Suffix
my $LEN = -1; # content-length
open my $fhi, '<:raw', "$IN/$f" or die $!;
read( $fhi, my $cache_buffer, -s "$IN/$f" );
close( $fhi ) or die "could not close $IN/$f: $!";
$END = index( $cache_buffer, "$HTTP"); # Check for presence of HTTP 200 OK header
if( $END > -1 ) { #+(and therefore std header fields)
if( $cache_buffer =~ /\x00Content-Encoding:\s*br/i ) { $BROTLI = 1; }
if( $cache_buffer =~ /\x00Content-Encoding:\s*gzip/i ) { $GZIP = 1; }
if( $cache_buffer =~ /\x00Content-Length:\s*(\d+)/i ) {
$LEN = $1;
if( !$LEN ) { $LEN = -1; } # yes, some pages have Content-Length:0
}
if( $cache_buffer =~ /\x00Last-Modified:\s*([ A-Za-z0-9,:]+)/i ) {
$MOD = $1; # some web servers ignore case + introduce spaces!
} else {
if( $cache_buffer =~ /\x00Date:\s*([ A-Za-z0-9,:]+)/i ) { # did the page not want to be cached? (Chromium did it anyway!)
$MOD = $1; # (all pages should have a date (or a Date))
}
}
if( $cache_buffer =~ /\x00Content-Type:\s*([a-z-]+\/[a-z0-9.+-]+)/i ) {
$MIME = $1;
} # variable $1 NOT reset on failed match (v stupid)
given( $MIME ) {
when ('application/font-woff' ) { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('application/font-woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('application/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; } # magic for gzip encoding
when ('application/json') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'json'; }
when ('application/x-javascript'){ $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('application/xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'xml'; }
when ('binary/octet-stream') { $magic = "GIF89a"; $OFF = 0; $TLS = 'gif'; }
when ('font/ttf') { $magic = "\x{00}\x{01}\x{00}\x{00}\x{00}"; $OFF = 0; $TLS = 'ttf'; }
when ('font/woff') { $magic = 'wOFF'; $OFF = 0; $TLS = 'woff'; }
when ('font/woff2') { $magic = 'wOF2'; $OFF = 0; $TLS = 'woff2'; }
when ('image/gif') { $magic = 'GIF87a'; $OFF = 0; $TLS = 'gif'; }
# when ('image/gif') { $magic = 'GIF89a'; $OFF = 0; $TLS = 'gif'; }
when ('image/jpeg') { $magic = 'JFIF'; $OFF = 6; $TLS = 'jpg'; }
# when ('image/jpeg') { $magic = 'Exif'; $OFF = 6; $TLS = 'jpeg'; }
# when ('image/jpeg') { $magic = "\x{ff}\x{d8}\x{ff}\x{e0}"; $OFF = 6; $TLS = 'jpg'; }
when ('image/png') { $magic = "\x{89}PNG"; $OFF = 0; $TLS = 'png'; }
when ('image/svg+xml') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'svg'; }
when ('image/vnd.microsoft.icon'){ $magic = "\x{00}\x{00}\x{01}\x{00}"; $OFF = 0; $TLS = 'ico'; }
when ('image/webp') { $magic = 'RIFF'; $OFF = 0; $TLS = 'webp'; }
when ('text/css') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'css'; }
when ('text/fragment+html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'htm'; }
when ('text/html') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'html'; }
when ('text/javascript') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'js'; }
when ('text/plain') { $magic = "\x{1f}\x{8b}\x{08}"; $OFF = 0; $TLS = 'txt'; }
when ('video/mp4') { $magic = 'ftypisom'; $OFF = 4; $TLS = 'mp4'; } # most unlikely
default { $magic = ''; $OFF = 0; $TLS = ''; }
}
if( $magic ) {
if( $magic eq 'GIF87a') { # account for gif + jpeg multiple $magic
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'GIF89a';
$BEG = index( $cache_buffer, "$magic" );
}
} elsif( $magic eq 'JFIF') {
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = 'Exif';
$TLS = 'jpeg';
$BEG = index( $cache_buffer, "$magic" );
if( $BEG < 0 ) {
$magic = "\x{ff}\x{d8}\x{ff}\x{e0}";
$TLS = 'jpg';
$BEG = index( $cache_buffer, "$magic" );
}
}
} else { # single-magic MIMEs: one straight search
$BEG = index( $cache_buffer, "$magic" );
}
}
# # trying to decode where each file begins (determine common offsets)
# if( $LEN < 1 && $BEG > -1 ) { }
# if( $BEG > -1 && $LEN > -1 ) {
# # at this point $BEG - $OFF == start of magic
# # $END == start of $HTTP
# # $LEN == length of content from header
# my $mbeg = $BEG - $OFF; my $mhex = sprintf("0x%X", $mbeg);
# my $hbeg = $END - $LEN; my $hhex = sprintf("0x%X", $hbeg);
# my $diff = $hbeg - $mbeg;
# my $dhex = sprintf("0x%X", $diff);
# print "$MIME: $f; \$END/\$LEN=$END / $LEN; \$mbeg=$mbeg / $mhex; \$hbeg=$hbeg / $hhex; \$diff=$diff / $dhex; \n";
# }
if( $BEG > -1 ) {
$BEG -= $OFF;
if( $LEN < 1 ) { $LEN = $END - $BEG - $F_OFF; } # v rare, but happens
} elsif( $LEN > -1 ) { $BEG = $END - $LEN - $F_OFF; } # no magic (text, xml + brotli files)
# suffixes (holy m$)
if( $TLS ) {
$TLS = ".$TLS";
if( $GZIP || $BROTLI ) { # account for different compression-encodings
if( $GZIP ) { $TLS = "$TLS.gz"; } else { $TLS = "$TLS.br"; }
}
}
# print the files out
if( $BEG > -1 && $LEN > -1 ) {
`dd if="$IN/$f" of="$OUT/$f$TLS" skip=$BEG count=$LEN iflag=skip_bytes,count_bytes status=none`;
if( $MOD ) { `touch "$OUT/$f$TLS" -d "$MOD"`; }
# print "$MIME: $f; \$TLS=$TLS; \$END=$END; \$BEG=$BEG; \$LEN=$LEN; \$MOD=$MOD; \n";
}
} # if( $END > -1 ) # other pages are most likely to be HTTP 204 No Content
}
Mark Hindley in the bug report was able to get to the gates of success in installing wine32 on a vanilla chimaera, and has therefore fingered backports as the reason for the error on my system. That log-file reported a terrifyingly-large number of i386 packages to install as helpers to wine32.
I would like to give public thanks to Mark for his help so far, but I'm going to remove all traces of Wine & the i386 architecture from my system.
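For anyone following along, the removal should go something along these lines (a sketch, not the exact commands I ran; review what apt proposes before confirming):
$ dpkg -l | awk '/:i386/ { print $2 }' # list the foreign-arch packages first
$ sudo apt purge $(dpkg -l | awk '/:i386/ { print $2 }')
$ sudo dpkg --remove-architecture i386 # refuses to run until no i386 packages remain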
BeginnerForever at this StackOverflow page has a PHP script which, after just a couple of tweaks, will extract all JPEG + PNG files from the Chromium/Chrome cache dir into another dir. Fast & very impressive.
There now follows my small update to that script. I've added a section for GIF files (those were extracted, but at first did not work as image files; see the substr() length bug explained below):
#!/usr/bin/php
<?php
// getCC (get Chrome Cache)
// suggestion: save as ~/.getCC; chmod +x; chmod 700
$dir = "/home/alexk/.cache/chromium/Default/Cache/Cache_Data/"; // Chromium cache folder.
$ppl = "/home/alexk/Personal/ChromeCache/Files/"; // Place for extracted files
// $END = "HTTP/1.1 200 OK"; // Search in cache-file (works, yet not in some files)
$END = "HTTP/1.1 200"; // Search in cache-file (works, and IS in all files)
$FTL = ""; // Filetype lowercase
$FTU = ""; // Filetype uppercase
$MOFF = 0; // Offset of magic from file beginning
$list = scandir( $dir );
foreach( $list as $filename ) {
if( is_file( $dir.$filename )) {
$content = file_get_contents( $dir.$filename );
if( strstr( $content, 'JFIF')) {
$FTL = "jpg";
$FTU = "JPEG";
$MOFF = 6;
echo( $filename." $FTU \n");
$start = ( strpos( $content, "JFIF", 0 ) - $MOFF );
$end = strpos( $content, $END, 0 );
$content = substr( $content, $start, $end - $start ); // length = bytes between magic-begin & header-begin (substr takes a length, not an end offset)
$length = strlen( $content );
$wholenm = $ppl.$filename.".$FTL";
file_put_contents( $wholenm, $content );
// echo( "Saving :".$wholenm." \n");
echo( "start : $start \n");
echo( "end : $end \n");
$diff = $end - $start;
echo( "length: $length (s/b $diff)\n");
}
elseif( strstr( $content, "\211PNG")) {
$FTL = "png";
$FTU = "PNG";
$MOFF = 1;
echo( $filename." $FTU \n");
$start = ( strpos( $content, "$FTU", 0 ) - $MOFF );
$end = strpos( $content, $END, 0 );
$content = substr( $content, $start, $end - $start ); // length, not an end offset
$length = strlen( $content );
$wholenm = $ppl.$filename.".$FTL";
file_put_contents( $wholenm, $content );
// echo( "Saving :".$wholenm." \n");
echo( "start : $start \n");
echo( "end : $end \n");
$diff = $end - $start;
echo( "length: $length (s/b $diff)\n");
}
elseif( strstr( $content, "GIF89a")) {
$FTL = "gif";
$FTU = "GIF";
$MOFF = 0;
echo( $filename." $FTU \n");
$start = ( strpos( $content, "GIF89a", 0 ) - $MOFF );
$end = strpos( $content, $END, 0 );
$newc = substr( $content, $start, $end - $start ); // length, not an end offset
$length = strlen( $newc );
$wholenm = $ppl.$filename.".$FTL";
file_put_contents( $wholenm, $newc );
echo( "Saving :".$wholenm." \n");
echo( "start : $start \n");
echo( "end : $end \n");
$diff = $end - $start;
echo( "length: $length (s/b $diff)\n");
}
else {
echo( $filename." UNKNOWN \n");
}
}
}
?>
There were a couple of strange occurrences that at first I could not explain, so I added some echo lines to try to debug them. I'm going to rewrite the script in BASH which, hopefully, will be more reliable. If so, I will not need WINE (hooray!).
Line 8 has $END = "HTTP/1.1 200"; & each section has $end = strpos( $content, $END, 0 );. Originally the search-string was "HTTP/1.1 200 OK" (line 7), & I discovered that some files do not contain an "OK" in the cache-file, yet the image was still found (not by grep) & extracted. The likely explanation: strpos() returns false when the needle is missing, false numifies to 0 in arithmetic, so the substr() length went negative; a negative length just drops that many bytes from the end of the string, so the file was written out (with trailing junk) regardless.
The file content is concatenated within the Cache_Data file immediately before the $END string. At first none of the extracted files was the length that it should have been, for a related reason: the third argument to substr() is a length, not an end offset, so passing $end - $MOFF (or plain $end in the GIF section) swept up everything almost to the HTTP header. JPEG + PNG viewers do not seem to mind the trailing junk, but GIF files refuse to play. The listing above now carries the corrected calls ($end - $start). I put some echo lines into the script to show what on earth was going on.
Here is the very end of the text output from the original (pre-correction) script, to give some sense of the difficulty:
ffa41e3d8b4e0cf9_0 PNG
start : 150
end : 14212
length: 14211 (s/b 14062)
ffa78518232ea9f2_0 PNG
start : 170
end : 1417
length: 1416 (s/b 1247)
ffad48f3aefb6cd7_0 GIF
Saving :/home/alexk/Personal/ChromeCache/Files/ffad48f3aefb6cd7_0.gif
start : 1089
end : 1183
length: 1183 (s/b 94)
ffba1f5387a04a08_0 JPEG
start : 166
end : 972
length: 966 (s/b 806)
ffbf8448256da635_0 UNKNOWN
ffc1ebd8d62551b6_0 GIF
Saving :/home/alexk/Personal/ChromeCache/Files/ffc1ebd8d62551b6_0.gif
start : 193
end : 288
length: 288 (s/b 95)
ffc2019c23af2000_0 UNKNOWN
ffc239239bc4e4a9_0 JPEG
start : 195
end : 1920
length: 1914 (s/b 1725)
ffc57d9b41cebadd_0 UNKNOWN
ffcbd7258d6a0aea_0 UNKNOWN
ffda4d6b8e2937fd_0 UNKNOWN
ffdac4bf770719a1_0 UNKNOWN
ffde560cb8ad0eaf_0 UNKNOWN
fff42f6de6d58540_0 UNKNOWN
fff530252c03d813_0 UNKNOWN
fff55afc8b58e35f_0 UNKNOWN
fff822c2bb27d828_0 GIF
Saving :/home/alexk/Personal/ChromeCache/Files/fff822c2bb27d828_0.gif
start : 202
end : 297
length: 297 (s/b 95)
index UNKNOWN
PHP seems to be unworkable for me, so I'm going to switch to BASH.