One hack of a perfect (as in jack of all trades) backup solution for Ubuntu Linux (remote, flexible, instant restore, automated, reliable)

This is a work in progress (and most likely will always be so)!

Here is what I have been working on and looking to equip myself with. I wanted to keep working on my daily stuff without any hassle, just as I always have and with whatever changes may come. But at the same time I needed to be sure that in a situation where I need older versions of my files — be it due to a system or hard disk breakdown, a file deleted erroneously, or changes that need to be undone — they are no more than one command away. In short, I wanted a time machine for my files that just works™ — also in at least five years' time. When recreating older versions I want to be able to focus on what to restore, not how. And with previous backups I have had, for example, corrupted archive files or unreadable part files/CDs too many times (even one is too many), or I've had issues because the file format was too old (mostly proprietary formats).

Here is what I’ve been looking for feature-wise generally:

  • no expenses money-wise
  • robust
  • using only small and freely available tools — the more system core utils the better
  • version control
  • snapshot system
  • remote storage
  • private, i.e. secure data transmission over network and reliably encrypted storage
  • suitable for mobility, independent of how I’m connected
  • simple yet flexible usage

for daily backups:

  • automation using cron
  • no need for interaction
  • easy and flexible declaration of files or folders to omit from backup

and for restoring data:

  • just works™ (see above)
  • fast and easy lookup of which versions are available, ideally via a GUI like Timeline with filter options
  • at the very best some sort of offline functionality, e.g. caching of the most likely (whatever that means) required older versions

Alternative (or partial) solutions I have come across along the way

  • Sun's Z file system (ZFS): Haven't had enough time to get it working with Ubuntu Linux (not packaged because of license issues, only working via FUSE so far). Needs its own partition setup and is thus laborious. Not sure about the networking/mobility demands, e.g. a remote snapshot location, nor about ease of use.
  • Subversion together with svk: Easy and flexible to use and automate, version control per se, distributed and offline operations (svk). Contra: recovery relies on the Subversion software, i.e. no plain cp or mv. The basic idea is to work on a copy (checkout before you start) and have daily automated commits. Should need no interaction since I'm the only one working with my "backup projects". See this lengthy description.
  • Coda file system: distributed file system with caching. Haven't had enough time to try it out.
  • rsnapshot: Has remote support (ssh, rsync), automation, rotation. Relies on the file system's hard links within the backup folder hierarchy for unchanged files and runs as root only (system-wide conf file, ssh configuration issues, ssh keys, …). A workaround could be to use a specific group.
  • sshfs: FUSE add-on to use remote directories transparently via ssh.
  • cron'd bash backup script using tar and gzip; daily incrementals and a monthly full "snapshot", similar to logrotate (see the sketch after this list).
  • grsync: GNOME GUI for rsync, optimized for (incremental) backups
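The cron'd tar/gzip idea from that last script item is simple enough to sketch. The following is only a minimal illustration under assumptions of my own (GNU tar installed, placeholder source and destination paths, run once a day from cron), not a finished backup script:

```python
#!/usr/bin/env python3
# Minimal sketch of the cron'd tar/gzip idea from the list above.
# Assumptions of mine, not part of the original post: GNU tar is installed,
# the paths below are placeholders, and the script runs daily from cron,
# e.g.  15 3 * * *  /usr/local/bin/daily_backup.py
import datetime
import subprocess
from pathlib import Path

SOURCE = Path.home() / "Documents"      # what to back up (placeholder)
DEST = Path("/var/backups/daily")       # where the archives go (placeholder)
SNAR = DEST / "state.snar"              # GNU tar's incremental state file

def run_backup() -> None:
    DEST.mkdir(parents=True, exist_ok=True)
    today = datetime.date.today()
    # On the first of the month, drop the state file to force a fresh
    # full "snapshot"; every other day produces an incremental archive.
    if today.day == 1 and SNAR.exists():
        SNAR.unlink()
    archive = DEST / f"backup-{today.isoformat()}.tar.gz"
    subprocess.run(
        ["tar", f"--listed-incremental={SNAR}", "-czf", str(archive), str(SOURCE)],
        check=True,
    )

if __name__ == "__main__":
    run_backup()
```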

Update 10/2009: A few weeks ago I stumbled upon Back In Time, which has astonishingly many of the properties I expect from a perfect backup solution. It is based on the FlyBack project and TimeVault. There is a — for some people maybe a little lengthy — video on blip.tv that shows how to install and use it and how straightforward the GUI is.

Will Drupal 7 help the Semantic Web actually and finally come to life?

If you’ve been working with Drupal 6 (D6) and have taken a little time viewing

  • A. Dries' talk (mp4 video file) at Drupalcon Boston this year — especially the bit about focusing on data, exports and imports, delivering in different formats, reusing data — quoting him: "so that no single party owns the data"
  • B. strolling through D6 modules already in development that use RDF in some way or another, especially the Exhibit demos

then you really start thinking about the next step of the web. Taking into account the goals set for the upcoming release (especially usability, WYSIWYG and media handling) on the one hand, and Drupal's already extremely flexible and extensible nature through heaps of modules on the other, I really can believe it will be a "killer" as mentioned by Dries. And thereby it would be the first to take the semantic web seriously on a large scale.

So I propose, on a more abstract level, that web 3.0 (or whatever the buzzword might become) will depend on departing from the "data in tables" perspective and moving towards "data in graphs".
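To make that "tables vs. graphs" point a bit more concrete, here is a tiny, purely illustrative Python sketch (the URIs and field names are made up, not taken from any Drupal module): the same fact stored once as a flat row and once as subject/predicate/object triples that can simply be merged with triples from anywhere else.

```python
# The same piece of information, once as a "row in a table" ...
row = {"id": 42, "name": "Edgar Allan Poe", "born": 1809, "occupation": "writer"}

# ... and once as a graph of (subject, predicate, object) triples.
# The URIs are placeholders; in RDF they would point to shared vocabulary,
# so triples from different sites can simply be unioned into one bigger graph.
POE = "http://example.org/person/poe"
triples = {
    (POE, "http://xmlns.com/foaf/0.1/name", "Edgar Allan Poe"),
    (POE, "http://example.org/vocab/born", "1809"),
    (POE, "http://example.org/vocab/occupation", "writer"),
}

# Merging data from another source is just a set union, no schema migration:
more_triples = {(POE, "http://example.org/vocab/died", "1849")}
graph = triples | more_triples
print(len(graph), "triples about", POE)
```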

The really critical thing about this, though, could quickly become that people will find it even harder to keep track of the truth, or of the sources of information for that matter. Just think about our media world as it already is, where journalists/bloggers (name them as you wish) largely copy and paste rather than actually investigate. One just has to become even more aware, I guess, which is not all that bad!

On a side note: I'm only disappointed that there was no mention of a couple of small but really effective modules: Teleport, WYMeditor, Live or LiveSearch. But I guess they are just too specific; even though WYMeditor working properly and broadly would really give Drupal an extra wow-effect that is not just shiny bling.

Update: I just finished watching the video (I wrote this article about halfway through) and what do I see? Berners-Lee being cited saying: "From the World Wide Web (WWW) to the Giant Global Graph (GGG)". And what did I say? ;)

Have you ever looked at a vector graphic to watch the time fly by?

Well, here you can do so right now (you need Opera or another browser with an SVG viewer; FF3 doesn't do it — it's a Scalable Vector Graphic).

… Dammit, why doesn't WordPress allow .svg? Well, then you will have to follow the link above (I was going to show it in this post).

Chris Pirillo lifts up the web — They call it Gnomepal…

Sit back, watch the video

and read more and more or more.

I'm not quite sure yet what I think of it, but putting the "Can it really be realised/implemented?" question aside, his idea really sounds stunning! I'll definitely keep an eye on it and would love to see those ideas come to life!

By the way, the Drupal module it obviously started with is Activity Stream.

Side note: What do I tag this with? Where is WordPress's option for the infinity-all-possible-ever-thought-of tag?

Mashups for last.fm

Here are some more or less useful mashups (and mashup lists) I stumbled upon that seem to be nice:

last.fm's events function together with fusecal could be very useful for bands' websites, e.g. to give visitors an easy way to be notified only of concerts around their home town.

About the fuss on rootkits and whether or not one can detect them

In the last couple of days I have been reading and hearing about rootkits and the panic that comes with them. Mainly on German forums and sites, although also e.g. on Joanna Rutkowska's blog (author of the "famous" Bluepill hijacker technique). And it got me thinking. But first let me summarize what I understand the fuss is all about.

A rootkit is some sort of malware. Depending on whom you ask or consult, it is a piece of software running on someone's computer — preferably one with an Internet connection — without the user or even the administrator knowing. I read the definition itself so that this program does not have to hide itself in memory and/or on the hard drive from detection software, but it may (regardless of the, to my knowledge and despite Joanna's work, unanswered question of whether it can potentially do so at all). Rootkits, like any other malware, have to be transferred to the target computer in some way or another and are — the hiding ones just like any other — detectable in this non-executed state (via a digital signature, for example). Once primed, i.e. executed, the code becomes a process in the computer's memory and tries to hide itself with various methods in memory and on hard drives (potentially also in the MBR or even the BIOS, but as far as I have read/heard none have been reported so far).

Another trait of rootkits is that they most often start with a small subset of code/features/routines and, once residing in memory, recruit more and more features via the computer's net link through a so-called back door. The back door part is why the differentiation from Trojan Horses is blurry. I'd say the Trojan Horse technique is only one of many features of such a rootkit, but that doesn't make it a Trojan Horse, since it's not all it can do.

Another of the many possible features, first shown by the aforementioned Bluepill, is to become a hypervisor (think of it as a sandbox for OSs) like Xen (virtual machines like VMware or Qemu/VirtualBox work differently). The fancy bit about Bluepill's method is that, while it is active, the OS's kernel is virtualized, i.e. it goes from being the host OS to being a guest OS; the Microsoft Vista kernel in this case. It's done by forcing kernel parts to be swapped out to pagefile.sys, which are then modified on disk — no Vista kernel protection there — and loaded back into memory. Let me point out: on the fly, no reboot, BIOS or MBR modification necessary! That means the malware runs below the OS or, phrased the other way around, the real OS runs on top of the malware.

From Darkreading:

The new Blue Pill comes with support for so-called “nested” hypervisors (think Blue Pill within a Blue Pill), and uses an architecture similar to that of the open-source Xen 3 virtual machine technology. It comes with “on the fly” loading and unloading features, as well as more features for avoiding detection, such as hibernating and temporarily uninstalling the hypervisor when Blue Pill detects that a tool is about to detect it.

Let me add: this utilizes the Pacifica specification of AMD's newer processors, which have virtualisation technology (VT) built in. It was just started on AMD processors, but there are also implementations for Intel processors with similar techniques.

Having said all that, I started thinking about how it could still be possible to detect one, and what the remarkable bits are here. Let me also point out that I am by no means an expert on anti-virus, rootkits, hypervisors or any of that. I just know some basic, though advanced, computer issues, how things basically work, about TCP/IP and Linux OS basics. And I claim to have common sense :)

Ideas mentioned elsewhere to counter the issue, and my comments on them:

OK, now there is one point that is not technical at all: how do I detect something that can hide (let's presume it can) from a running system if I don't see it and wouldn't get alerted by any detection software? Imagine working on your computer and thinking: "Am I infected? Let's check and boot into this detection LiveCD [see below]… checking… Good, not infected, so reboot into the work system… keep on working… Oh, and now? Infected now?… LiveCD check… reboot to work, since still not infected… work a little… Hah, now is the time, I could be infected now… reboot…."

The rumour is that it would be easy to detect malware hiding on active systems when the system is dormant, i.e. not booted, e.g. from some LiveCD. That's one point I could believe to be true to some extent, since I guess the malware has to have saved itself at system shutdown to some place on the hard drive, in the BIOS (the graphics card's one, too), or in some kind of non-volatile memory and, more importantly, cannot actively defend, i.e. hide, itself. But as with any malware detection by signature, the signature of such "saved state hiding malware" has to be known, which might be hard since it's easy for malware to change its "saved state form" and thereby its signature. And also, is it handy and operable in real life to shut down, e.g., servers "only" to check potentially infected systems (again, assuming all the while that it's not possible to detect them while the system is active)?
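Purely to illustrate the detection-by-signature idea on a dormant system, here is a minimal Python sketch; the hash value and the mount point are placeholders of my own, not real malware signatures. It walks a file system mounted read-only (say, from a LiveCD) and flags files whose SHA-256 digest matches a list of known-bad hashes.

```python
import hashlib
from pathlib import Path

# Placeholder "signatures": in reality these would come from an AV vendor's
# database of hashes of known malware samples.
KNOWN_BAD_SHA256 = {
    "0" * 64,  # dummy value standing in for a real malware hash
}

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(mountpoint: Path) -> None:
    # Walk the dormant system's disk, mounted read-only from the LiveCD.
    for p in mountpoint.rglob("*"):
        if p.is_file() and not p.is_symlink():
            try:
                if sha256_of(p) in KNOWN_BAD_SHA256:
                    print("suspicious file:", p)
            except OSError:
                pass  # unreadable file, skip

if __name__ == "__main__":
    scan(Path("/mnt/suspect_root"))  # placeholder mount point
```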

If it's possible to only have one hypervisor (which I don't know right now), then wouldn't it be easy to just check whether a hypervisor is present or can be enabled? If it cannot be enabled because one is already present but unknown to the system: suspicious. Matasano's virtualized rootkit detector is most likely about even more than that (from Hacker Smackdown, June 28th, 2007):

Ptacek, Lawson, and Ferrie contend that virtualization-based malware is actually easier to detect than a normal (non-virtualized) rootkit because basically by definition it leaves a trail, introducing changes in the system’s CPU clock, for instance. And the malware would have to be bug-free to truly emulate a system, anyway, Ptacek argues. “The problem with virtualized rootkits is… They have to present the illusion they are talking to real hardware and that’s not an easy task,” he says. “In order to do that, you have to write a bug-free program whose job it is to emulate bugs. And we don’t know how to write bug-free programs.”

One very simple (that's why I liked it!) detection method described in a German forum was to simultaneously do a port scan from outside and ask the system "from the inside" for open ports (see the sketch below). Most likely the malware will show an open port to the outside (it wants to receive data there) but will hide this port from the system running the malware.
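A minimal sketch of that inside-versus-outside comparison, under assumptions of my own: the externally observed ports below are placeholders you would fill in from, e.g., an nmap run on a second machine, and the inside view is read from /proc/net/tcp on Linux. A kernel-level rootkit could of course falsify that file as well, so this only catches the simpler cases.

```python
from pathlib import Path

# Ports an *external* scan (run from another machine) reported as open.
# Placeholder values; fill in what the outside scan actually showed.
EXTERNALLY_OPEN = {22, 80, 31337}

def listening_ports_inside() -> set[int]:
    """Parse /proc/net/tcp and /proc/net/tcp6 for sockets in LISTEN state (0A)."""
    ports = set()
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if not path.exists():
            continue
        for line in path.read_text().splitlines()[1:]:
            fields = line.split()
            local, state = fields[1], fields[3]
            if state == "0A":                      # LISTEN
                ports.add(int(local.rsplit(":", 1)[1], 16))
    return ports

if __name__ == "__main__":
    hidden = EXTERNALLY_OPEN - listening_ports_inside()
    if hidden:
        print("open from outside but invisible inside:", sorted(hidden))
    else:
        print("inside and outside views agree")
```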

Ideas I haven't read about so far, or that are not directly related to rootkits but rather to malware in general, yet are still fairly new:

  • As a basic approach, (operating) systems have to be transparent (the best I know of being open source) so that experts can know what's going on inside and users can trust "their" system. This is no new argument, I assume.
  • Digital signature (public/private key) handling in the kernel for processes, similar to what I believe Vista does, but holistic and, again, transparent. The idea is similar to the one Debian (and other distributions since) have been using with their repositories and the dpkg/apt system for years now, only applied within the computer itself. SecureApt, as it's called, uses MD5 checksums (switching to SHA-1 as MD5 is broken) to uniquely and securely identify software packages retrieved from Internet repositories and to verify that the data (read: byte stream) is unchanged on its way from the maintainer to the user's computer. On top of that, SecureApt uses OpenPGP (with GPG) private keys to sign the repositories' release summaries and public keys to verify the signature, i.e. to decide whether a repo is trusted or not (a minimal sketch of these two checks follows right after this list). Why not take this one step further into the kernel itself and have a kernel module implementing the SecureApt idea, but for processes (instances of programs from those repositories)? Though I guess with quantum computers approaching, this prevention method most likely won't hold for long anyway.
  • Security systems (German) like AppArmor or, even better, SELinux should be used more widely to better protect systems from so-called 0-day attacks and the like, and thereby limit the distribution of malware. These two, of course, do nothing to improve detection on compromised systems; they only help prevent infection.
  • Don't buy VT-capable processors if you don't need to. This, as with all security issues, will not work on a wide scale since it's more convenient to benefit from a supposedly up to 95% performance enhancement for so-called paravirtualized guest systems (more precisely, domUs). At least if you need to run unmodified OSs like MS Windows. If you can modify the domUs, e.g. Linux, you can get the same performance with, e.g., Xen. Let me point out that, unlike a fully virtualized guest operating system, with paravirtualization the domU does know it is being virtualized and can, among other things, access hardware directly.
  • Another idea on how to become suspicious of possible infections involves a second system with a net link to the computers at risk. I'd call it a watch server or pass-through server. Maybe it could just be your firewall of choice. The idea is to watch the traffic from and to computers in your network just like a firewall does, but to watch for and learn some sort of network traffic signatures or patterns. This way you get a (statistical) profile of the typical traffic of individual systems, independent of the applications running, user behaviour or the like. Just plain network traffic. This, of course, has to be done while one is certain there are no infections in the network. If one can guarantee this, it could be possible after this learning phase to detect suspicious traffic.
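To make the SecureApt analogy from the list above a bit more tangible, here is a minimal, hedged Python sketch of the two checks it rests on: comparing a downloaded file's checksum against the value listed in a signed release summary, and verifying that summary's OpenPGP signature by shelling out to gpg. The file names, keyring path and digest are placeholders, and real apt of course does considerably more.

```python
import hashlib
import subprocess
from pathlib import Path

def checksum_matches(package: Path, expected_sha256: str) -> bool:
    # Does the downloaded package match the digest listed in the signed
    # Release/Packages files?
    digest = hashlib.sha256(package.read_bytes()).hexdigest()
    return digest == expected_sha256.lower()

def release_signature_valid(release: Path, signature: Path, keyring: Path) -> bool:
    # Verify the detached OpenPGP signature of the release summary with gpg.
    result = subprocess.run(
        ["gpg", "--no-default-keyring", "--keyring", str(keyring),
         "--verify", str(signature), str(release)],
        capture_output=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    # Placeholder paths and digest, purely for illustration.
    trusted = release_signature_valid(
        Path("Release"), Path("Release.gpg"),
        Path("/usr/share/keyrings/example-archive.gpg"),
    )
    intact = checksum_matches(Path("somepackage.deb"), "00" * 32)
    print("trusted and intact" if trusted and intact else "do not install")
```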

Maybe everything said here is not new at all to others. But one thing I reckon will hold true: after all, it will always be a game of cat and mouse, since the bad guys will try to detect methods like those mentioned here in order to hide themselves, and the good guys will always try to be smarter. The most interesting part about self-hiding malware, I find, is that malware is turning the tables now (well, not entirely): with conventional viruses it was about evolving new techniques unknown to the anti-virus guys; now it's (partly) malware becoming a virus-detection detector.

And one other thing once again became clear to me: the need for researchers to "do bad things", i.e. to develop, test, execute, publish and do whatever else is necessary with malware of whatever kind in order to come up with an antidote! Unfortunately there are movements under way in Germany (German, heise, 06.07.2007 14:23) and, as I understand it, in other parts of the world, too, to prohibit this.

Happy hacking ;)

Update 2007/10/11:

On Slashdot there has been a note on "VM-Based Rootkits Proved Easily Detectable", pointing out a paper by researchers from Stanford, CMU, VMware, and XenSource, "Compatibility Is Not Transparency: VMM Detection Myths and Realities" (pdf). Unfortunately, I haven't had the time to read it yet.


Ever thought about a system to keep your data synchronized on the move?

Over the years I have often thought how great it would be if you came home from a conference/trip with your laptop and just plugged it into the charger. Within seconds you'd be able to work in the more comfy desktop environment — meaning a larger screen and nicer HIDs — on your automatically synchronized data, without any intervention. And your wife (or any "co-user" of parts of your data) could view the photos you saved on your laptop right away…

Until now I've never actually come across a setup that would work anywhere near this scenario, at least not without an installation process that would take weeks, that is. I came across things like Coda, Subversion, rsync, unionfs and heaps more. Recently I came across an article on Slashdot where someone posted something very similar. Further down, among the suggestions by others, there was a hint pointing to a new approach called Dropbox. They say

imagine the best aspects of rsync, trac, subversion, but easy to use

A screencast is available, too.

How music similarity measures could be used

Lately I've been thinking about the possibilities of Songbird (see screencast below) collecting music distributed under a commons licence or the like from blogs. Wouldn't it be great to search for music precisely that way? You could say something like "search for music that sounds like what I'm listening to right now". That means the method I might be working on should be efficient enough to calculate the relevant vectors for different pieces of music quickly. Otherwise there would have to be some central service doing fingerprinting again for all the tracks around. But, as said before, that's not what I'm after. Rather something to get a mutual measure between two tracks, pairwise.
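Just to illustrate what I mean by a pairwise measure (this is not the method itself, only a sketch with made-up feature values): given some feature vector per track, say averaged spectral descriptors, the "sounds like what I'm listening to right now" query reduces to comparing vectors, e.g. by cosine similarity.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Mutual similarity of two tracks' feature vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up feature vectors (e.g. averaged spectral descriptors per track).
now_playing = [0.8, 0.1, 0.3, 0.5]
candidates = {
    "blog_track_1.ogg": [0.7, 0.2, 0.4, 0.5],
    "blog_track_2.ogg": [0.1, 0.9, 0.8, 0.0],
}

# "Search for music that sounds like what I'm listening to right now":
ranked = sorted(candidates,
                key=lambda t: cosine_similarity(now_playing, candidates[t]),
                reverse=True)
print(ranked)
```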

As you can see in the screencast, Songbird can already find related music by using diverse search services. But, as far as I can judge right now, that's user- or expert-based. That means the music has to be known well enough already for there to be data on what a newly found song is like. And it's also text-based.

Google alternatives

Recently I was listening to a podcast by the German radio station Bayern 5. One topic, among others, was Google alternatives. I'd just like to list them here, each with a short personal description and an example search (searching for Edgar Allan Poe).

  1. hakia.com — a semantic search engine, where one can ask complete questions or phrases and hakia highlights what it thinks is the answer in each search result. You get a table-of-contents-and-summary kind of view when you, for example, search for a person.
  2. seakport.com — "Founded in December 2003 the company is headquartered in Munich but truly European" (from their company site). Seems fast and has the option of preview-loading the search results.
  3. exalead.com — a French one. Slower, has screenshots as previews (doesn't actually load the sites) and seems to use Ajax, lets you narrow your search or get more general info about your search, features a categories|keywords kind of cloud. I cannot say anything about the search quality, but the UI quality is very promising.
  4. pagebull.com — shows resulting pages as images with the search term highlighted
  5. searchcristal.com — actually a meta search engine, showing its results as a 2D cluster with distances. Interesting, but too slow for daily use (I couldn't figure out how to get an example search link).
  6. cartoo.com — Unfortunately it wouldn't load in my Opera browser via cartoo.com, and was only quite slow at first in FF. Via searchportal.information.com it does load in Opera, too. The strategy seems to be narrowing the search by showing the top results.
  7. chacha.com — shows "related searches" to narrow the search and, as a plus, gives the opportunity to call upon a human guide to help narrow it further. It integrates blinkx.com's video wall.
  8. suchfibel.de — a list of and heaps of info on search engines (German only).

First Impression on Musicovery

Even though I'm into last.fm, if any (of those), I do understand that if some Pandora maniac cheats on them, the alternative has to be worth a glimpse. So, it was Musicovery.com that also impressed me … at first. And I'll admit it was mostly because of the blinky-blinky. But there's more to it than just attention-grabbing effects. From an HMI conceptual point of view Musicovery have really made an effort. It is easy to start listening to what you want — without any or, if you really want to, very little reading. In one word: I'd call it intuitive.

You are presented with those, and only those, selections you need to make and combine (where you can) to make your choice distinct enough to gather the right songs. The other direction of "communication", machine to human, also has some promising approaches, like the "neighbourhood map" and colours for genres. One can even drag (move) that map around. The playlist is shown as a path through the graph of audio tracks.

But then, of course, the hacker in me came to the surface and I had to test that stuff. After a few clicks on the "dark" mood I was presented with Shakira's "Objection". Sure, no accounting for taste, but I wouldn't call "Objection" a dark-mood song. And the Black Eyed Peas' "Shut Up" was still to come… I don't know about you; I couldn't keep my feet still while listening, and there was absolutely no "I hate the world" or "Where is my gun to get a rampage going" (just being sarcastic here). While the "energetic" direction worked fine for a while, "dark" more and more seems to be a bad label.

To conclude: Musicovery.com nevertheless sounds very promising. I'd really like to know the "music selection techniques" behind it, though, since the more I listen, the less the tracks picked for a selected mood satisfy me, just like with the other lot.

Edit: I just caught myself letting my imagination drift away: wouldn't it be possible to have, in a few years' time, some HMI so that one brachiates through a playlist just like the one displayed at Musicovery, but as some sort of hologram, or only imaginary (not directly visible), more like that Wii stuff? So if you want to fast-forward to a track on the playlist (displayed in some sort of 3D neighbourhood map/grid as a ball, e.g.) you grab it and drag it to the middle of the cube, or punch it to play it, pet it to have information displayed about it, …

