Web proxy systems, mainly in Java
After my previous entry entitled Structured graphics, diagramming, graphs and networks in Java one month ago, here comes this new entry about web proxies, starting my research from the article Open Source Personal Proxy Servers Written In Java on Manageability.
Generic proxy systems
IBM Research Web Intermediaries (WBI):
Aiming to produce a more powerful and flexible web, we have developed the concept of intermediaries. Intermediaries are computational entities that can be positioned anywhere along an information stream and are programmed to tailor, customize, personalize, or otherwise enhance data as they flow along the stream.
A caching web proxy is a simple example of an HTTP intermediary. Intermediary-based programming is particularly useful for adding functionality to a system when the data producer (e.g., server or database) or the data consumer (e.g., browser) cannot be modified.
Web Intermediaries (WBI, pronounced "webby") is an architecture and framework for creating intermediary applications on the web. WBI is a programmable web proxy and web server. We are now making available the WBI Development Kit for building web intermediary applications within the WBI framework, using Java APIs. Many types of applications can be built with WBI; you can also download some plugins.
One key intermediary application is the transformation of information from one form to another, a process called transcoding. In fact, the WBI Development Kit now provides the same plugin APIs as IBM WebSphere Transcoding Publisher. Applications developed with WBI version 4.5 can be used with the Transcoding Publisher product (with a few exceptions), as WBI constitutes the backbone on which transcoding operations run.
Other examples of intermediary applications include:
* personalizing the web
* password & privacy management
* awareness and interactivity with other web users
* injecting knowledge from "advisors" into a user's web browsing
* filtering the web for kids
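The intermediary idea above (computational stages positioned along an information stream, each tailoring the data that flows through) can be sketched as a chain of document transformations. This is a minimal illustration in plain Java, not WBI's actual plugin API; the `Intermediary` and `IntermediaryChain` names are invented for the example.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Each intermediary sits on the stream and transforms the document
// passing through it (strip ads, annotate, transcode, ...).
interface Intermediary extends UnaryOperator<String> {}

class IntermediaryChain {
    private final List<Intermediary> stages;

    IntermediaryChain(List<Intermediary> stages) {
        this.stages = stages;
    }

    // Run the document through every intermediary, in order.
    String process(String document) {
        String current = document;
        for (Intermediary stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

A caching proxy, an ad filter or a transcoder would each be one stage in such a chain, which is exactly why intermediary-based programming helps when neither the server nor the browser can be modified.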
WBI has an interesting and entertaining history.
[quite old: the last update of The WBI Development Kit's tech page was made on “March 25, 2004”, but the downloadable files are older (June 2000) – alphaWorks License]
WebFountain is another project from IBM Research (UIMA: The Unstructured Information Management Architecture Project) dealing with the problem of searching data that is not fully structured:
WebFountain is a set of research technologies that collect, store and analyze massive amounts of unstructured and semi-structured text. It is built on an open, extensible platform that enables the discovery of trends, patterns and relationships from data. Again, papers can be found in the Publications section. IBM alphaWorks has a Semantics research topic area.
AT&T Mobile Network (AMN - formerly known as iMobile) is a project that addresses the research issues in building mobile service platforms. AMN currently consists of three editions: Standard Edition (SE), Enterprise Edition (EE), and Micro Edition (ME).
AMN SE was built by extending iProxy, a programmable proxy. The proxy maintains user and device profiles, accesses and processes internet resources on behalf of the user, keeps track of user interactions, and performs content transformations according to the device and user profiles. The user accesses internet services through a variety of wireless devices and protocols (cell phones with SMS, WAP phones, PDAs, AOL Instant Messenger, Telnet, email, HTTP, etc.).
The main research issues in AMN include
* Authentication: How do the proxy and associated services authenticate the user?
* Profile Management: How does the proxy maintain the user and device profiles? How do the profiles affect the services?
* Transcoding Service: How does the proxy map various formats (HTML, XML, WML, Text, GIF, JPEG, etc.) from one form to another?
* Personalized Services: How can new services be created by taking advantage of the user access logs and location/mobility information?
* Deployment: How should the proxy be deployed? On the server side, on the network, on the client side, or should we use a mixed approach?
[not active – Apache Software License]
There are somewhat interesting references in the description of the project.
Pluxy is a modular Web proxy which can receive a dynamic set of services. Pluxy provides the infrastructure to download services, to execute them and to make them collaborate. Pluxy comes with a set of basic services like collaborative HTTP request processing, GUI management and distributed services. Three Pluxy applications are introduced: a collaborative filtering service for the Web, an extended caching system and a tool to know about document changes.
[dead: last modifications of the webpages were made from March to April, 1999]
Two papers written by Olivier Dedieu from the INRIA's SOR Action are available, with the same title: Pluxy : un proxy Web dynamiquement extensible (INRIA Research Report RR-3417, May 1998) and Pluxy : un proxy Web dynamiquement extensible (1998 NoTeRe colloquium, October 20-23, 1998).
The extensible Retrieval Annotation Caching Engine (eRACE) is a middleware system designed to support the development and provision of intermediary services on Internet. eRACE is a modular, programmable and distributed proxy infrastructure that collects information from heterogeneous Internet sources and protocols according to end-user requests and eRACE profiles registered within the infrastructure.
Collected information is stored in a software cache for further processing, personalized dissemination to subscribed users, and wide-area dissemination on the wireline or wireless Internet. eRACE supports personalization by enabling the registration, maintenance and management of personal profiles that represent the interests of individual users. Furthermore, the structure of eRACE allows the customization of its service provision according to information-access modes (pull or push), client-proxy communication (wireline or wireless; email, HTTP, WAP), and client-device capabilities (PC, PDA, mobile phone, thin clients). Finally, eRACE supports the ubiquitous provision of services by decoupling information retrieval, storage and filtering from content publishing and distribution.
eRACE can easily incorporate mechanisms for providing subscribed users with differentiated service-levels at the middleware level. This is achieved by the translation of user requests and eRACE profiles into “eRACE requests” tagged with QoS information. These requests are scheduled for execution by an eRACE scheduler, which can make scheduling decisions based on the QoS tags.
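The QoS-tagged scheduling described above could be sketched along these lines; `TaggedRequest` and `QosScheduler` are hypothetical names invented for the illustration, not eRACE's real classes.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A request carrying a QoS tag, as produced by translating a user
// request and its eRACE profile.
record TaggedRequest(String url, int qosPriority) {}

class QosScheduler {
    // Higher qosPriority values are dispatched first.
    private final PriorityQueue<TaggedRequest> queue = new PriorityQueue<>(
            Comparator.comparingInt(TaggedRequest::qosPriority).reversed());

    void submit(TaggedRequest request) {
        queue.add(request);
    }

    // Returns the highest-priority pending request, or null when idle.
    TaggedRequest next() {
        return queue.poll();
    }
}
```

A real scheduler would of course also weigh deadlines, fairness and resource load, but the core idea (dispatch order driven by the QoS tag rather than arrival order) is the same.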
Performance scalability is an important consideration for eRACE given the expanding numbers of WWW users, the huge increase of information sources available on the Web, and the need to provide robust service. To this end, the performance-critical components of eRACE are designed to support multithreading and distributed operation, so that they can be easily deployed on a cluster of workstations.
The eRACE system consists of protocol-specific proxies, like WebRACE, mailRACE, newsRACE and dbRACE, that gather information from the World-Wide Web, POP3 email-accounts, USENET NNTP-news, and Web-database queries, respectively. At the core of eRACE lies a user-driven, high-performance and distributed crawler, filtering processor and object cache, written entirely in Java. Moreover, the system employs Java-based mobile agents to enhance the distribution of loads and its adaptability to changing network-traffic conditions.
[old: no paper since 2002]
The Papers section contains many publications, from Marios Dikaiakos (Department of Computer Science, University of Cyprus) and Demetris Zeinalipour (now Department of Computer Science & Engineering, University of California, Riverside) among others. For example: Intermediary Infrastructures for the WWW.
The Platform for Information Applications (PIA) is an open source framework for rapidly developing flexible, dynamic, and easy to maintain information browser-based applications. Such applications are created without programming and can be maintained by users and office administrators.
This framework has been used to build a broad range of applications, including a "workflow" web server that handles all of the purchase authorizations, time cards, and other (ex-)paperwork at Ricoh Innovations, Inc.
The PIA does this by separating an application into a core processing engine (a shared software engine, akin to a web server) and a task-specific collection of "active" XML pages, which specify not only the content but also the behavior of the application (XML, the eXtensible Markup Language, is a W3C standard). So one document, by itself, can include other documents (or pieces of them), iterate over lists, make decisions, calculate, search/substitute text, and in general do almost anything a traditional "CGI script" or document processing program would do.
Application developers can extend the basic set of HTML and PIA elements ("tags") by defining new ones in terms of the existing ones. As a result, a PIA application can be customized simply by editing a single, familiar-looking XML page... in contrast to conventional Web applications, where even a simple change (like adding an input field) might require finding and fixing Perl CGI scripts or recompiling Java classes in several directories.
[very old: “2.1.6 built Tue Apr 3 11:57:21 PDT 2001”]
The concept of “transcoding services” is implemented on the server side in Apache Cocoon:
Apache Cocoon is a web development framework built around the concepts of separation of concerns and component-based web development.
Cocoon implements these concepts around the notion of 'component pipelines', each component on the pipeline specializing on a particular operation. This makes it possible to use a Lego(tm)-like approach in building web solutions, hooking together components into pipelines without any required programming.
Cocoon is "web glue for your web application development needs". It is a glue that keeps concerns separate and allows parallel evolution of all aspects of a web application, improving development pace and reducing the chance of conflicts.
[active – Apache Software License]
Back in 1995, the paper that founded a big part of the concept: Application-Specific Proxy Servers as HTTP Stream Transducers.
The Internet facilitates the development of networked services at the application level that both offload origin servers and improve the user experience. Web proxies, for example, are commonly deployed to provide services such as Web caching, virus scanning, and request filtering. Lack of standardized mechanisms to trace and to control such intermediaries causes problems with respect to failure detection, data integrity, privacy, and security.
The OPES Working Group has previously developed an architectural framework to authorize, invoke, and trace such application-level services for HTTP. The framework follows a one-party consent model, which requires that each service be authorized explicitly by at least one of the application-layer endpoints. It further requires that OPES services are reversible by mutual agreement of the application endpoints.
Muffin
* Written entirely in Java. Requires JDK 1.1.
* Runs on Unix, Windows 95/NT, and Macintosh.
* Freely available under the GNU General Public License.
* Support for HTTP/0.9, HTTP/1.0, HTTP/1.1, and SSL (https).
* Graphical user interface and command-line interface.
* Remote admin interface using HTML forms.
* Includes several filters which can remove cookies, kill GIF animations, remove advertisements, and more.
* View all HTTP headers to aid in CGI development and debugging.
* Users can write their own filters in Java using the provided filter interfaces.
[very old: last file release on SourceForge.net on “April 4, 2000” – GPL]
Old but simple. The thesis slides describing Muffin can be read on the website.
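Muffin's actual filter interfaces are not shown here, but a user-written filter of the cookie-removing kind could look roughly like this; the `HeaderFilter` interface below is invented for illustration and is not Muffin's real API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A hypothetical filter interface: take the request/response headers,
// return a (possibly modified) copy.
interface HeaderFilter {
    Map<String, String> filter(Map<String, String> headers);
}

// Strips Cookie and Set-Cookie headers, in the spirit of Muffin's
// cookie-removal filter.
class CookieRemover implements HeaderFilter {
    public Map<String, String> filter(Map<String, String> headers) {
        Map<String, String> out = new LinkedHashMap<>(headers);
        out.keySet().removeIf(k -> k.equalsIgnoreCase("Cookie")
                                || k.equalsIgnoreCase("Set-Cookie"));
        return out;
    }
}
```

The proxy would apply each configured filter to every request and response passing through it, which is what makes the filter set composable.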
PAW (Pro-Active Webfilter)
PAW (Pro-Active Webfilter) is an open-source filtering HTTP proxy based on the Brazil Framework, provided as an open-source project by Sun. Because the Brazil Framework and PAW are written in Java, the software is highly portable.
PAW allows for the easy plugging-in of Handlers (which filter outgoing requests) and Filters (which filter incoming data, i.e. the HTML response), and offers a GUI for end users. All the configuration files are in XML format and thus easy to modify (even without the GUI).
Its aim is to provide an easy-to-use interface for end users and to be easily extendable by developers. PAW consists of the following components:
* PAW Server which implements the filtering HTTP Proxy.
* PAW GUI for easy PAW Server administration.
[old: last file release on SourceForge.net on “January 17, 2003” – Apache Software License – uses the Sun Brazil Web Application framework]
Privoxy is a web proxy with advanced filtering capabilities for protecting privacy, modifying web page content, managing cookies, controlling access, and removing ads, banners, pop-ups and other obnoxious Internet junk. Privoxy has a very flexible configuration and can be customized to suit individual needs and tastes. Privoxy has application for both stand-alone systems and multi-user networks.
Privoxy is based on Internet Junkbuster.
[still active but last file release on SourceForge.net on “January 30, 2004” – GPL – coded in C]
The Proxomitron (also there, more information on Proxomitron.info)
For those who have not yet been introduced, meet the Proxomitron: a free, highly flexible, user-configurable, small but very powerful, local HTTP web-filtering proxy.
[old and dead: “There were two separate releases of Proxomitron 4.5, one in May of 2003 and one in June.” – for Windows]
Jon Udell wrote an article entitled SSL Proxying – Opening a window onto secure client/server conversations, inspired by the SSL support in Proxomitron. In it, he showed how to code a very simple web proxy in Perl (libwww-perl is a powerful library which can be used to develop web applications with Perl).
Amit's Web Proxy Project [coded in Python]: Proxy 2 [dead: “1997”], Proxy 3 [dead: “1998”], Proxy 4 [dead: “2000”], and Proxy 5 [dead: “[2005-04-12] A lot of the HTML-modifying tricks I wanted to implement are easier to implement in GreaseMonkey, so I haven't had much motivation to work on a proxy to do these things. See a list of GreaseMonkey plugins.”].
Amit J. Patel worked on this subject while doing his thesis (more here). On his webpage, he links to A list of open-source HTTP proxies written in python, a very complete list of Web proxies in Python.
FilterProxy is a generic HTTP proxy with the capability to modify proxied content on the fly. It has a modular system of filters which can modify web pages; many filters can be applied in succession to a page, and configuration is easy and flexible. FilterProxy can proxy any data served over the HTTP protocol (i.e. anything off the web) and filter any recognizable MIME type. All configuration is done via web-based forms, or by editing a configuration file. It was created to fix some of the annoyances of poor web design by rewriting pages. It can also improve the web for you, in both speed (Compress) and quality (Rewrite/XSLT). After ads (and their graphics) are stripped out and the HTML is compressed, surfing over a modem is much faster. Compare it to Muffin (a similar project in Java) and WebCleaner (a similar project in Python) in purpose and functionality. FilterProxy is written in Perl, and is quite fast.
[old: last file release on SourceForge.net on “January 12, 2002” – GPL – coded in Perl]
V6 is to the Web what pipes are in Unix systems: a compositional device to combine document processing. To be easily integrated in the Web architecture, V6 is available as a personal proxy. Relying on a common skeleton architecture and Web related libraries, V6 can be easily configured to support various sets of filters while remaining portable and browser independent. The filters may act on the requests emitted by the browser (or other web client) or on the document returned by a server, or both.
In the current release, the available filters include
* flexible caching
* request redirection
* HTML filtering (based on NoShit)
* global history
* on-the-fly full text indexing
V6 can be used to support many other navigation aids and Web-related tools in a uniform, browser independent way. In addition, V6 can also be used as a traditional http server: this is particularly useful to serve private files without needing access to the site-wide http server, or to interface to local, private applications (mail, ...) through the CGI interface.
Useful for very old references: the position paper for example.
A new way of filtering web content is through Greasemonkey:
Greasemonkey is a Firefox extension which lets you add bits of DHTML ("user scripts") to any web page to change its behavior. In much the same way that user CSS lets you take control of a web page's style, user scripts let you easily control any aspect of a web page's design or interaction.
Downside: Firefox needed.
Web 1.0 experience augmentation
ThemeStream is an online "personal interest" site. It works on a self-publishing model; authors may post articles freely in a wide variety of categories. Unfortunately, its reader-based rating system is not particularly reliable, nor is it customizable. The MeStream Proxy allows users to customize how they view ThemeStream and rate ThemeStream content.
[very old: last file release on SourceForge.net on “July 31, 2000” – GPL – MeStream was developed using the WBI development kit]
Note: ThemeStream is dead.
Identity management (à la RoboForm)
Super Proxy System is the combination of a proxy server and a mail server.
A special mail server is built in alongside the proxy server; this is necessary in cases where a confirmation email must be answered when registering an account through a form.
Super Proxy System makes your web surfing easy and secure.
Super Proxy System can be run in a local area network or individually.
[quite old: “Last Updated Jan. 25, 2003”]
Squid
* a full-featured Web proxy cache
* designed to run on Unix systems
* free, open-source software
* the result of many contributions by unpaid (and paid) volunteers
* proxying and caching of HTTP, FTP, and other URLs
* proxying for SSL
* cache hierarchies
* ICP, HTCP, CARP, Cache Digests
* transparent caching
* WCCP (Squid v2.3 and above)
* extensive access controls
* HTTP server acceleration
* caching of DNS lookups
[active ;-) – GPL – coded in C]
The reference in the UNIX world (not written in Java). There is an interesting Related Software webpage on the Squid website.
RabbIT is a web proxy that speeds up web surfing over slow links by:
* Compressing text pages to gzip streams. This reduces size by up to 75%.
* Compressing images to 10%-quality JPEGs. This reduces size by up to 95%.
* Removing advertising.
* Removing background images.
* Caching filtered pages and images.
* Using keepalive if possible.
It also offers:
* Easy and powerful configuration.
* A multi-threaded implementation written in Java.
* A modular and easily extended design.
* Complete HTTP/1.1 compliance.
RabbIT is a proxy for HTTP; it is HTTP/1.1 compliant (testing is done with the Co-Advisor test, http://coad.measurement-factory.com/) and should hopefully support the latest HTTP/x.x in the future. Its main goal is to speed up surfing over slow links by removing unnecessary parts (like background images) while still showing the page mostly as it is. For example, it tries not to ruin the page layout completely when it removes unwanted advertising banners. The page may sometimes even look better after filtering, as you get rid of pointless animated GIF images.
Since filtering the pages is a "heavy" process, RabbIT caches the pages it filters but still tries to respect cache-control headers and the old-style "pragma: no-cache". RabbIT also accepts requests for unfiltered pages by prepending "noproxy" to the address (like http://noproxy.www.altavista.com/). Optionally, a link to the unfiltered page can be inserted at the top of each page automatically.
RabbIT is developed and tested under Solaris and Linux. Since the whole package is written in Java, the basic proxy should run on any platform that supports Java. Image processing is done by an external program; the recommended program is convert (found in ImageMagick). RabbIT can of course be run without image processing enabled, but then you lose a lot of the time savings it gives.
RabbIT works best if it is run on a computer with a fast link (typically at your ISP). Since every large image is compressed before it is sent from the ISP to you, surfing becomes much faster at the price of some decrease in image quality. If some parts of the page are already cached by the proxy, the speedup will often be quite amazing. For 1275 random images, only 22% of the bytes (2974108 out of a total of 13402112) were sent to the client. That is 17 minutes instead of 75 over a 28.8k modem.
RabbIT works by modifying the pages you visit so that your browser never sees the advertising images; it only sees one fixed image tag (that image is cached by the browser the first time it is downloaded, so subsequent requests for it are served from the browser's cache, giving a nice speedup). For images, RabbIT fetches the image and runs it through a processor, producing a low-quality JPEG instead of the animated GIF image. This image is very much smaller, and downloading it should be quick even over a slow (modem) link.
[active: last file release on SourceForge.net on “January 11, 2005” – BSD License]
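RabbIT's gzip compression of text pages can be illustrated with the JDK's own GZIP streams. This sketch shows the underlying technique only, not RabbIT's actual code; the `TextCompressor` class is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TextCompressor {
    // Compress a text page into a gzip stream, as a proxy would before
    // sending it over a slow link.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Decompress, as the client side (or a gzip-capable browser) would.
    static String gunzip(byte[] data) throws IOException {
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Because HTML is highly repetitive markup, reductions of the order RabbIT claims (up to 75%) are quite plausible for typical pages.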
WWWOFFLE (World Wide Web Offline Explorer)
The wwwoffled program is a simple proxy server with special features for use with dial-up internet links. This means that it is possible to browse web pages and read them without having to remain connected.
[old: “Version 2.8 of WWWOFFLE released on Mon Oct 6 2003” – GPL – coded in C]
HTTP debugger and HTTP/HTML awareness
WebScarab is a framework for analysing applications that communicate using the HTTP and HTTPS protocols. It is written in Java, and is thus portable to many platforms. In its simplest form, WebScarab records the conversations (requests and responses) that it observes, and allows the operator to review them in various ways.
WebScarab is designed to be a tool for anyone who needs to expose the workings of an HTTP(S) based application, whether to allow the developer to debug otherwise difficult problems, or to allow a security specialist to identify vulnerabilities in the way that the application has been designed or implemented.
A framework without any functions is worthless, of course, and so WebScarab provides a number of plugins, mainly aimed at the security functionality for the moment. Those plugins include:
* Fragments - extracts Scripts and HTML comments from HTML pages as they are seen via the proxy, or other plugins
* Proxy - observes traffic between the browser and the web server. The WebScarab proxy is able to observe both HTTP and encrypted HTTPS traffic, by negotiating an SSL connection between WebScarab and the browser instead of simply connecting the browser to the server and allowing an encrypted stream to pass through it. Various proxy plugins have also been developed to allow the operator to control the requests and responses that pass through the proxy.
o Manual intercept - allows the user to modify HTTP and HTTPS requests and responses on the fly, before they reach the server or browser.
o Beanshell - allows for the execution of arbitrarily complex operations on requests and responses. Anything that can be expressed in Java can be executed.
o Reveal hidden fields - sometimes it is easier to modify a hidden field in the page itself, rather than intercepting the request after it has been sent. This plugin simply changes all hidden fields found in HTML pages to text fields, making them visible, and editable.
o Bandwidth simulator - allows the user to emulate a slower network, in order to observe how their website would perform when accessed over, say, a modem.
* Spider - identifies new URLs on the target site, and fetches them on command.
* Manual request - Allows editing and replay of previous requests, or creation of entirely new requests.
* SessionID analysis - collects and analyses a number of cookies (and eventually URL-based parameters too) to visually determine the degree of randomness and unpredictability.
* Scripted - operators can use BeanShell to write a script to create requests and fetch them from the server. The script can then perform some analysis on the responses, with all the power of the WebScarab Request and Response object model to simplify things.
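As an illustration of what the "reveal hidden fields" plugin does, a response-rewriting filter of that kind can boil down to a regular-expression substitution. This is a simplified sketch of the technique, not WebScarab's implementation, and the `HiddenFieldRevealer` name is invented.

```java
import java.util.regex.Pattern;

class HiddenFieldRevealer {
    // Matches type="hidden" or type='hidden' in an <input> tag,
    // remembering which quote character was used.
    private static final Pattern HIDDEN = Pattern.compile(
            "type\\s*=\\s*([\"'])hidden\\1", Pattern.CASE_INSENSITIVE);

    // Rewrite hidden inputs as visible, editable text fields.
    static String reveal(String html) {
        return HIDDEN.matcher(html).replaceAll("type=$1text$1");
    }
}
```

A proxy plugin applying this to every HTML response makes the hidden fields visible and editable directly in the browser, which is often easier than intercepting and modifying the request afterwards.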
Future development will probably include:
* Parameter fuzzer - performs automated substitution of parameter values that are likely to expose incomplete parameter validation, leading to vulnerabilities like Cross Site Scripting (XSS) and SQL Injection.
* WAS-XML Static Tests - leveraging the OASIS WAS-XML format to provide a mechanism for checking known vulnerabilities.
As a framework, WebScarab is extensible. Each feature above is implemented as a plugin, and can be removed or replaced. New features can be easily implemented as well. The sky is the limit! If you have a great idea for a plugin, please let us know about it on the list.
There is no shiny red button on WebScarab; it is a tool primarily designed to be used by people who can write code themselves, or at least have a pretty good understanding of the HTTP protocol. If that sounds like you, welcome! Download WebScarab, sign up on the subscription page, and enjoy!
[active: last release on “20050222”]
Charles is an HTTP proxy / HTTP monitor / Reverse Proxy that enables a developer to view all of the HTTP traffic between their machine and the Internet. This includes requests, responses and the HTTP headers (which contain the cookies and caching information).
Charles can act as a man-in-the-middle for HTTP/SSL communication, enabling you to debug the content of your HTTPS sessions.
Charles simulates modem speeds by effectively throttling your bandwidth and introducing latency, so that you can experience an entire website as a modem user might (bandwidth simulator).
Charles is especially useful for Macromedia Flash developers as you can view the contents of LoadVariables, LoadMovie and XML loads.
[seems still active: last update on freshmeat.net on “25-Dec-2004”]
Surfboard is a filtering HTTP 1.1 proxy. It features dynamic filter management through an interactive HTML console, IP tunneling, WindowMaker applets, and a suite of filters. See the Features page for details.
Who should use this? Surfboard is a "personal proxy", intended to be used by individuals rather than organizations. Its purpose is not to censor or monitor surfing activity, nor is it intended to implement caching within the proxy. Filters could be written to do these things, but it's not something I'm personally interested in doing, and it's already available in other proxies. My goal with Surfboard is to make a proxy that covers new ground and lets you "surf in style" by adding visual feedback, interaction, and network load balancing to make websurfing more enjoyable.
Why another filtering proxy? A long time ago, I wanted a simple way to examine HTTP headers for a project I was working on. All the existing proxies I found were overkill for what I wanted, and were nontrivial to configure. So instead, I spent a lunch break writing a very simple proxy in Java that did everything I needed. Later I modified it to remove certain types of banner ads, but I was unhappy with the code -- it was ugly and difficult to maintain. I imagined that someday I would re-write it and "do it right", making everything dynamic with a browser-enabled console, some WindowMaker applets to visualize HTTP activity and to toggle filters on/off on the fly, etc. The typical second-system effect, in other words :-)
[old: last file release on SourceForge.net on “January 12, 2002” – GPL – mainly in Java, but frontend parts coded in C]
A lightweight Java TCP proxy (from the Axis project).
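Such a TCP relay is small enough to sketch in a few dozen lines. The following is an assumed, minimal version written for this entry, not the Axis tool's actual code: it listens on a local port and copies bytes in both directions to a fixed remote endpoint.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

class TcpRelay {
    private final ServerSocket server;
    private final String remoteHost;
    private final int remotePort;

    TcpRelay(int localPort, String remoteHost, int remotePort) throws IOException {
        this.server = new ServerSocket(localPort); // pass 0 to pick a free port
        this.remoteHost = remoteHost;
        this.remotePort = remotePort;
    }

    int port() {
        return server.getLocalPort();
    }

    void start() {
        new Thread(() -> {
            try {
                while (true) {
                    Socket client = server.accept();
                    Socket upstream = new Socket(remoteHost, remotePort);
                    pump(client, upstream); // client -> remote
                    pump(upstream, client); // remote -> client
                }
            } catch (IOException ignored) {
                // server socket closed: stop relaying
            }
        }).start();
    }

    // Copy bytes from one socket to the other on a background thread.
    // End-of-stream is signalled with shutdownOutput rather than close,
    // so the opposite direction can finish independently.
    private static void pump(Socket from, Socket to) {
        new Thread(() -> {
            try {
                from.getInputStream().transferTo(to.getOutputStream());
                to.shutdownOutput();
            } catch (IOException ignored) {
            }
        }).start();
    }
}
```

Tools like this are handy for eyeballing raw protocol traffic (SOAP, HTTP) by relaying it through a port you can observe.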
WebMate is part of the Intelligent Software Agents project headed by Katia Sycara. It is a personal agent for World-Wide Web browsing and searching developed by Liren Chen. It accompanies you when you travel on the internet and provides you what you want.
* Searching enhancement, including parallel search (it can send search requests to the currently popular search engines, get results from them, and reorder them according to how much overlap there is among the different engines), search-keyword refinement using our relevant-keyword extraction technology, relevance feedback, etc.
* Browsing assistant, including learning your current interests, recommending new URLs according to your profile and selected resources, giving URLs a short name or alias, monitoring Netscape or IE bookmarks, getting more pages like the current one, sending the current page to your friends, prefetching the hyperlinks on the current page, etc.
* Offline browsing, including downloading the pages linked from the current page for offline browsing, and getting the references of some pages and printing them out.
* Filtering HTTP headers, including recording HTTP headers and all the transactions between your browser and WWW servers, filtering cookies to protect your privacy, blocking animated GIF files to speed up your browsing, etc.
* Checking HTML pages to find errors in them, and checking embedded links to find dead links, which is useful when learning to write HTML pages or maintaining your web site, etc.
* Dynamically setting up all kinds of resources, including search engines, dictionaries available on the WWW, online translation systems available on the WWW, etc.
* Programmed in Java, independent of the operating system, running multi-threaded.
[dead: downloadable file from March 2000]
There is a paper about WebMate here (other – older – papers there). The developer, Liren Chen, wrote other interesting personal agents. He or she (?) works in The Intelligent Software Agents Lab of The Robotics Institute, School of Computer Science, Carnegie Mellon University, headed by Katia Sycara (a lot of publications).
Knowledge augmentation and retrieval
Scone – “A Java Framework to Build Web Navigation Tools”:
Scone is a Java framework published under the GNU GPL, which was designed to allow the quick development and evaluation of new Web enhancements for research and educational purposes. Scone is focused on tools which help to improve navigation and orientation on the Web.
Scone has a modular architecture and offers several components, which can be used, enhanced and programmed using a plugin concept. Scone plugins can augment Web browsers or servers in many ways. They can:
* generate completely new views of Web documents,
* show extra navigation tools inside an extra window next to the browser,
* offer workgroup tools to support collaborative navigation,
* enrich web pages with new navigational elements,
* help to evaluate such prototypes in controlled experiments etc.
[latest version: “Version 1.1.34 from 13. Nov 2004” – Scone uses IBM's WBI (Web Based Intermediary) as proxy]
On the Related Projects page, many interesting tools are mentioned; among them: HTMLStreamTokenizer (“HtmlStreamTokenizer is an HTML parser written in Java. The parser classifies the HTML stream into three broad token types: tags, comments, and text.”), HTTPClient (“This package provides a complete http client library. It currently implements most of the relevant parts of the HTTP/1.0 and HTTP/1.1 protocols, including the request methods HEAD, GET, POST and PUT, and automatic handling of authorization, redirection requests, and cookies. Furthermore the included Codecs class contains coders and decoders for the base64, quoted-printable, URL-encoding, chunked and the multipart/form-data encodings.” – there is other interesting stuff on the webpage) and WebSPHINX: A Personal, Customizable Web Crawler:
WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
On the WebSPHINX webpage, one can find a list of other web crawlers and some references.
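To make the tokenizer description above concrete, here is a rough sketch of what a tool like HtmlStreamTokenizer does: split an HTML stream into the three broad token types the quote mentions (tags, comments, text). This is a simplified illustration of the idea, not the library's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyHtmlTokenizer {
    public enum Kind { TAG, COMMENT, TEXT }

    public static class Token {
        public final Kind kind;
        public final String value;
        Token(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    // Scan the input once, classifying each run of characters.
    public static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.startsWith("<!--", i)) {            // comment: <!-- ... -->
                int end = html.indexOf("-->", i + 4);
                end = (end < 0) ? html.length() : end + 3;
                tokens.add(new Token(Kind.COMMENT, html.substring(i, end)));
                i = end;
            } else if (html.charAt(i) == '<') {          // tag: < ... >
                int end = html.indexOf('>', i);
                end = (end < 0) ? html.length() : end + 1;
                tokens.add(new Token(Kind.TAG, html.substring(i, end)));
                i = end;
            } else {                                     // text: up to the next '<'
                int end = html.indexOf('<', i);
                if (end < 0) end = html.length();
                tokens.add(new Token(Kind.TEXT, html.substring(i, end)));
                i = end;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<p>hi<!-- note --></p>")) {
            System.out.println(t.kind + ": " + t.value);
        }
    }
}
```

A crawler like WebSPHINX builds on exactly this kind of token stream: once tags are isolated, extracting `href` attributes to find the next pages to visit is straightforward.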
Scone has a remarkable architecture, “developed as a research project at the Distributed Systems and Information Systems Group (VSIS) [from the Department of Informatics] of the University of Hamburg”. In the Documentation section, there are many papers and theses (the list of people in the project is on the main page). Many prototypes were also developed; BrowsingIcons being one of the most impressive: “BrowsingIcons is a tool to support revisitation of Web pages. To do this, it dynamically draws dynamic graphs of the paths of users as they surf the Web. Compared to using a plain browser, people can revisit web pages faster when they use these visualizations. A study showed that they also enjoy the visualizations more than Netscape alone.”
The goal of Agent Frank is to be a personal intelligent intermediary and companion to internet infovores during their daily hunter/gatherer excursions. Whew. Okay, so what does that mean? Well, let's take it one buzzword at a time:
Personal - While employing many traditionally server-side technologies, Agent Frank is intended to reside near the user, on the desktop or the laptop.
Intelligent - Agent Frank wants to learn about the user, observe preferences and habits, and become capable of automating many of the tedious tasks infovores face. Eventually, this will come to involve various forms of machine learning and analysis, & etc.
Intermediary - Amongst Agent Frank's facilities are network proxies that can be placed between local clients and remote servers. Using these, Agent Frank can tap into the user's online activities in order to monitor, archive, analyze, and alter information as it flows. For example, using a web proxy, Agent Frank can log sites visited, analyze content, filter out ads or harmful scripting.
Companion - Agent Frank's ultimate purpose is to accompany an infovore and assist along the way.
Agent Frank is, at least initially, a laboratory for hacker/infovores to implement and play with technologies and techniques to fulfill the above goals. At its core, Agent Frank is a patchwork of technologies stitched together into a single environment intended to enable this experimentation. At the edges, Agent Frank is open to plugins and scripting to facilitate quick development and playing with ideas.
Agent Frank wants to be slick & clean one day, but not today. Instead, it is a large and lumbering creature with all the bolts, sockets, and stitches still showing. This is a feature, not a bug.
[old: the last release was on “20030215” – GPL]
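The “intermediary” role described above boils down to decision logic applied to each request before the proxy forwards it: log it, block it, or pass it through. Here is a sketch of that filtering step in plain Java; the pattern syntax and method names are invented for illustration, and the actual socket-forwarding loop is omitted.

```java
import java.util.Arrays;
import java.util.List;

public class RequestFilter {
    // Hypothetical blocklist: '*' matches any run of characters.
    private final List<String> blockedPatterns;

    public RequestFilter(List<String> blockedPatterns) {
        this.blockedPatterns = blockedPatterns;
    }

    // Extract the URL from an HTTP request line like "GET http://... HTTP/1.0".
    public static String urlOf(String requestLine) {
        String[] parts = requestLine.split(" ");
        return parts.length >= 2 ? parts[1] : "";
    }

    // True if the URL matches any blocked pattern.
    public boolean isBlocked(String url) {
        for (String p : blockedPatterns) {
            // Turn the glob pattern into a regex: escape dots, expand '*'.
            String regex = p.replace(".", "\\.").replace("*", ".*");
            if (url.matches(regex)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RequestFilter f = new RequestFilter(
                Arrays.asList("http://ads.*", "*/banner/*"));
        System.out.println(f.isBlocked(urlOf("GET http://ads.example.com/x.gif HTTP/1.0")));
        System.out.println(f.isBlocked(urlOf("GET http://example.com/page.html HTTP/1.0")));
    }
}
```

In a full intermediary, the same tap point that runs `isBlocked()` is where logging, archiving, and content analysis would hook in, which is what makes a single proxy useful for so many of the applications listed above.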
Very impressive work by Leslie Michael Orchard! This platform uses many open source tools:
* Implemented in Java, with an intent to stick to 100% pure Java
* Makes use of Jetty for an embedded browser-based user interface
* Employs BeanShell to provide a shell prompt interface and scripting facilities
* RDF metadata is employed via the Jena toolkit
* Web proxy services are provided via the Muffin web proxy
* Text indexing and searching enabled by Jakarta Lucene
* Exploring use of HSQL and/or Jisp for data storage
an open collaborative hypermedia system for the Web [old]
Web 1.0 annotation
mprox is not a 'product' - we don't give a shit about business!
mprox is not 'art' - we don't waste time being at the right parties talking shit about our work!
mprox is simply an experiment.
it is an experiment about how the web could be used for not only passive viewing of information, but active communication on top of (and below) this information.
it will also be an experiment how people will develop ways to deal with these possibilities, since there is no censorship, control or administration involved.
by using mprox, a second layer of consciousness is created on every web page you visit, that can be used to communicate, post messages, manipulate the content of the page or transform the web page into an art object. possibilities are unlimited and uncontrollable due to an easily expandable "plugin"-system.
[very old: “v0.3, 2000/03/22”]
The YACY project is a new approach to build a p2p-based Web indexing network.
* Crawl your own pages or start distributed crawling
* Search your own or the global index
* Built-in caching http proxy, but usage of the proxy is not a requisite
* Indexing benefits from the proxy cache; private information is not stored or indexed
* Filter unwanted content like ad- or spyware; share your web-blacklist with other peers
* Extension to DNS: use your peer name as domain name!
* Easy to install! No additional database required!
* No central server!
* GPL'ed, freeware
[active: “The latest YaCy-release is 0.37” (on “20050502”) – GPL]
Very clear architecture, explained on the technology webpage.
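The core p2p-indexing idea can be illustrated very roughly: every word is hashed, and each peer is responsible for a slice of the hash space, so a query for a word can be routed to the peer holding its posting list. YaCy's actual word-hash and routing scheme differs; this sketch only shows the concept, with an invented hash function.

```java
import java.util.Arrays;
import java.util.List;

public class WordHashRouting {
    // Stable, non-negative hash of a word (illustrative, not YaCy's hash).
    static long wordHash(String word) {
        long h = 1125899906842597L; // arbitrary seed
        for (int i = 0; i < word.length(); i++) {
            h = 31 * h + word.charAt(i);
        }
        return h & Long.MAX_VALUE; // clear the sign bit
    }

    // Pick the peer responsible for a word by partitioning the hash space.
    static String peerFor(String word, List<String> peers) {
        return peers.get((int) (wordHash(word) % peers.size()));
    }

    public static void main(String[] args) {
        List<String> peers = Arrays.asList("peerA", "peerB", "peerC");
        for (String w : Arrays.asList("java", "proxy", "index")) {
            System.out.println(w + " -> " + peerFor(w, peers));
        }
    }
}
```

Because every peer computes the same hash, no central server is needed to know who holds what, which is exactly the property the YaCy feature list advertises.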
This is the first entry on this subject. More to come.