NAME
News::Archive - archive news articles for later use
SYNOPSIS
use News::Archive;
use News::Article;
my $archive = News::Archive->new(
'basedir' => '/home/tskirvin/kiboze' );
# Get a news article
my $article = News::Article->new(\*STDIN);
my $messageid = $article->header('message-id');
die "Already processed '$messageid'\n"
if ($archive->article( $messageid ));
# Get the list of groups we're supposed to be saving the article into
my @groups = split('\s*,\s*', $article->header('newsgroups') );
s/\s+//g for @groups;
# Make sure we're subscribed to these groups
foreach (@groups) { $archive->subscribe($_) }
# Actually save the article.
my $ret = $archive->save_article(
[ @{$article->rawheaders}, '', @{$article->body} ], @groups );
$ret ? print "Accepted article $messageid\n"
: print "Couldn't save article $messageid\n";
See below for more options.
DESCRIPTION
News::Archive is a package for storing news articles in an accessible
form. Articles are stored one-per-file, and are accessible by either
message-ID or overview information. The files are then accessible through
a Net::NNTP-compatible interface, for easy access by other packages.
News::Archive keeps several files to keep track of its archives:
active file
Keeps track of all newsgroups we are "subscribed" to and all of the
information that changes regularly - the number of articles we have
archived, the current first and last article numbers, etc.
Watched over with News::Active.
history database
A simple database keeping track of articles by Message-ID. Makes
access by ID easy, and ensures that we don't save the same article
twice. The database chosen to maintain these is user-determined.
newsgroup file
Keeps track of more static information about the newsgroups we are
subscribed to - descriptions, creation dates, etc.
Watched over with News::GroupInfo.
archive directory
Directory tree holding all articles. Each article is saved as a
single text file, and each dot-separated section of the group name
becomes one directory level, such as "rec/games/mecha".
Crossposts are hardlinked to other directory structures.
Articles are actually divided into sub-directories containing up to
500 articles, to avoid Unix directory size performance limitations.
Individual files are thus stored in a file such as
"rec/games/mecha/1.500/1".
Each newsgroup also contains overview information, watched over with
News::Overview. This overview file goes in the top of the structure,
such as "rec/games/mecha/.overview".
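The layout described above can be sketched in a few lines of Perl. This is only an illustration, not the module's own code: "article_path()" is a hypothetical helper, and the naming of buckets past the first one (for example "501.1000") is an assumption extrapolated from the "1.500" example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of News::Archive: map a group name and an
# article number to a relative path inside the archive directory.
sub article_path {
    my ( $group, $number, $hash ) = @_;
    $hash ||= 500;                       # articles per sub-directory

    # One directory level per dot-separated section of the group name.
    ( my $dir = $group ) =~ tr{.}{/};

    # Assumed bucket naming: "<first>.<last>", with numbering starting at 1.
    my $first = int( ( $number - 1 ) / $hash ) * $hash + 1;
    my $last  = $first + $hash - 1;

    return "$dir/$first.$last/$number";
}

print article_path( 'rec.games.mecha', 1 ),   "\n";   # rec/games/mecha/1.500/1
print article_path( 'rec.games.mecha', 501 ), "\n";   # rec/games/mecha/501.1000/501
```

Keeping the bucket size fixed is what makes the path computable from the article number alone, which is presumably why changing the hash size after archiving is discouraged.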
You may note that these files are very similar to how INN does its work.
This is intentional - this package is meant to act in many ways like a
lighter-weight INN.
USAGE
Global Variables
The following variables are set within News::Archive, and are global
throughout all invocations.
$News::Archive::DEBUG
Default value for "debug()" in new objects.
$News::Archive::HOSTNAME
Default value for "hostname()" in new objects. Obtained using
"Sys::Hostname::hostname()".
$News::Archive::HASH
The number of articles to keep in each directory. Default is 500;
change this at your own peril, since things may get screwed up later
if you change it after archiving any articles!
Basic Functions
These functions create and deal with the object itself.
new ( HASHREF )
Creates the News::Archive object. "HASHREF" contains initialization
information for this object; currently supported options:
basedir        Base directory for this object to work with.
               Required; we will fail without this.
archives       Location of the post archives. Defaults to
               $basedir/archives
historyfile    Location of the history database. Defaults to
               $basedir/historyfile
activefile     Location of the active file. Defaults to
               $basedir/active
overfilename   File name for the overview database files in each
               newsgroup hierarchy. Defaults to ".overview".
db_type        The type of perl database we will use to store
               files that need that level of service. Defaults
               to 'DB_File'
groupinfofile  Location of the groupinfo file. Defaults to
               $basedir/newsgroups.
hostname       String to use when a local hostname is required.
               Defaults to $News::Archive::HOSTNAME.
debug          Should we print debugging information? Defaults to
               $News::Archive::DEBUG.
readonly       Should we open this read-only?
The History, Active, and GroupInfo objects are loaded, and the
archive is locked, with News::Lock.
Returns the blessed object on success, or undef on failure.
reload ( )
Closes and re-opens all of the objects in the archive - History,
Active file, and GroupInfo, at present - and locks the archive with
a read lock. Necessary for News::Lock compatibility.
activefile ()
Returns the News::Active object based on "activefile", set in new().
If this object has not already been opened and created, creates it;
otherwise, just returns the existing object. Passes on the
'readonly' flag.
activeclose ()
Writes out and closes the News::Active object.
groupinfo ()
Returns the News::GroupInfo object based on "groupinfofile", set in
new(). If this object has not already been opened and created,
creates it; otherwise, just returns the existing object. Passes on
the 'readonly' flag.
groupclose ()
Writes out and closes the News::GroupInfo object.
history ()
Returns a tied hashref based on "historyfile", set in new(). If this
object has not already been opened and created, creates it;
otherwise, just returns the existing object.
debug ()
Returns true if we want to print debugging information, false
otherwise. Used a lot internally, may also be used externally.
activeentry ( GROUP )
Returns the News::Active::Entry information for the given "GROUP".
groupentry ( GROUP )
Returns the News::GroupInfo::Entry information for the given
"GROUP".
close ()
Close all open files.
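To make the defaults documented for "new()" concrete, here is a minimal sketch of how they could be derived from "basedir". The helper "archive_options()" is hypothetical and is not how the module is known to do it; the "hostname" and "debug" defaults are left out because they come from the package globals.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of News::Archive: fill in the documented
# default for every option the caller did not supply.
sub archive_options {
    my (%opts) = @_;
    my $basedir = $opts{basedir};
    return unless defined $basedir;      # basedir is required

    my %defaults = (
        archives      => "$basedir/archives",
        historyfile   => "$basedir/historyfile",
        activefile    => "$basedir/active",
        groupinfofile => "$basedir/newsgroups",
        overfilename  => '.overview',
        db_type       => 'DB_File',
    );

    # Caller-supplied values override the defaults.
    return { %defaults, %opts };
}

my $opts = archive_options(
    basedir     => '/home/tskirvin/kiboze',
    activefile  => '/var/tmp/active',    # example override
);
print "$_ = $opts->{$_}\n" for sort keys %$opts;
```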
Error Functions
These functions deal with the global error variable, which is currently
not being used very effectively.
error ( [ERROR] )
Returns the text (a scalar) describing the last error message. If
"ERROR" is offered, then it sets the error message to this first.
clear_error ()
Clears the error message.
Net::NNTP Equivalents
The following functions are the equivalent of the Net::NNTP commands;
they are provided for compatibility with News::Web and other news
functions. More information on their use is available in those manual
pages.
article ( [ MSGID|MSGNUM ], [FH] )
Retrieves the article indicated by "MSGID" or "MSGNUM" (Net::NNTP) as
the headers, a blank line, and then the body of the article. Either
prints it to "FH" (if offered) or returns an array reference
containing the text.
Returns undef if the article is not found.
head ( [ MSGID|MSGNUM ], [FH] )
As with "article()", but only returns the header of the article.
body ( [ MSGID|MSGNUM ], [FH] )
As with "article()", but only returns the body of the article.
nntpstat ( [ MSGID|MSGNUM ] )
As with "article()", but only returns the article's message-id.
Returns undef if not set or the article didn't exist.
group ( [GROUP] )
Sets the current group pointer; necessary if we want to use
"article()" or its ilk by message number and not message-ID. In
array context, returns the active information of the group as a list
(number of articles, first article number, last article number,
group name). In scalar context, just returns the group name.
ihave ( MSGID, MESSAGE )
Writes an article to the archive with Message-ID "MSGID". "MESSAGE"
is the actual message. Invokes "save_article()".
(Note that this is preferred to "post()", at least here, because it
lets us tell much earlier if we don't want the article.)
last ()
Unimplemented.
date ()
Returns the local time (in seconds since the epoch).
postok ()
Returns 0; we don't want anything to get the idea that it can post.
authinfo ()
Unimplemented.
list ()
Same as "active('*')", listing all active groups.
newgroups ()
Unimplemented.
newnews ()
Unimplemented.
post ( MESSAGE )
Writes an article to the archive. "MESSAGE" is the actual message.
Invokes "save_article()".
slave ()
Unimplemented.
quit ()
Close the current connection; clear the current group, and reset the
pointer. Returns 1.
newsgroups ( [PATTERN] )
Returns a hashref where the keys are the newsgroups that match the
pattern "PATTERN" (uses "active()"), and the values are the description
text for each newsgroup.
distributions
Not implemented.
subscriptions ()
Returns a listref to all groups that we are subscribed to. This is
not ideal; we may only want the ones that we have descriptions for,
or a specific flag set in News::GroupInfo, or something. It works
for now, though.
overview_fmt ()
Returns the overview format information from News::Overview, since
that's what we're currently using.
active_times ( [PATTERN] )
Returns a hashref where the keys are the group names, and the values
are the results from "News::GroupInfo::Entry->arrayref()".
active ( [PATTERN] )
Returns a hashref where the keys are the group names, and the values
are the results from "News::Active::Entry->arrayref()".
xgtitle ( [PATTERN] )
Same as "newsgroups()"
xhdr ( HEADER, SPEC [, PATTERN] )
xover ( MATCH, HDR )
Gets information from the stored overview database. See
News::Overview for more information on how this works.
xpath ( MID )
Returns the full path name on the server of the location of the
given article.
xpat ( HEADER, SPEC [, PATTERN] )
Same as "xhdr()".
xrover ( SPEC )
Same as $self->xhdr('References', SPEC)
listgroup
Unimplemented.
reader ()
Unimplemented.
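The context-sensitive return value described for "group()" above is the usual Perl "wantarray" idiom. The sketch below is a stand-alone stand-in, not the module's implementation: the "%active" table and its numbers are made up for illustration.

```perl
use strict;
use warnings;

# Made-up active data: group name => [ article count, first, last ].
my %active = ( 'rec.games.mecha' => [ 42, 1, 42 ] );
my $current;                             # the "current group pointer"

# Hypothetical stand-in for group(); illustrates the documented behaviour.
sub group {
    my ($name) = @_;
    $current = $name if defined $name && exists $active{$name};
    return unless defined $current;

    # List context: (count, first, last, name). Scalar context: name only.
    return wantarray ? ( @{ $active{$current} }, $current ) : $current;
}

my ( $count, $first, $last, $name ) = group('rec.games.mecha');
my $scalar = group();
print "$name: $count articles ($first-$last)\n";
print "current group: $scalar\n";
```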
Archive Functions
The following functions actually deal with the archive itself.
save_article ( LINES [, GROUPS] )
Saves an article into the archive. "LINES" is an arrayref that is
passed to News::Article; "GROUPS" is an array of groups that we want
to save the article to, if not those listed in the Newsgroups:
header.
The article is modified by adding "hostname()" onto the Path: header
and creating a new Xref: header to match where we will save the
article. The file is primarily linked to a single location, and
hardlinks are made to the other locations. Overview information is
generated for each group, history information is saved to ensure
that we don't save the same article twice, and directories are
created as needed.
Note that there are currently some race conditions possible with
this function, which should be partially solved by adding file and
directory locking.
article_is_in_archive ( MSGID )
Returns 1 if the article is in the archive, 0 otherwise.
subscribe ( GROUP )
Subscribe to the given "GROUP", by adding information about the
group to the active and groupinfo files and starting the directory
tree.
unsubscribe ( GROUP )
Unsubscribe from "GROUP", by removing information about it from the
active and groupinfo files.
subscribed ( GROUP )
Returns 1 if we are subscribed to "GROUP", 0 otherwise.
overview_add ( NUMBER, GROUP, ARTICLE )
Add information to "GROUP"'s overview information regarding article
"NUMBER", which is "ARTICLE". Just appends the information to the
overview database; we don't need to do anything more at this point.
overview_read ( GROUP, MESSAGE-SPEC [, HDR ] )
Get the overview information from "GROUP" for the articles specified
by "MESSAGE-SPEC" (see Net::NNTP). If "HDR" is offered, only return
that header information. Mostly invokes "xover()".
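The header changes that "save_article()" is described as making can be sketched as follows. Both helpers are hypothetical and not taken from News::Archive; they assume the conventional Usenet formats, where Path: entries are joined with "!" and Xref: is the hostname followed by "group:number" pairs. The host and group names are examples.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend our hostname to an existing Path: value.
sub add_to_path {
    my ( $hostname, $path ) = @_;
    return ( defined($path) && length($path) ) ? "$hostname!$path" : $hostname;
}

# Hypothetical helper: build an Xref: value from [ group, number ] pairs.
sub make_xref {
    my ( $hostname, @locations ) = @_;
    return join ' ', $hostname, map { join ':', @$_ } @locations;
}

print 'Path: ',
    add_to_path( 'news.example.org', 'upstream.example.com!not-for-mail' ), "\n";
print 'Xref: ',
    make_xref( 'news.example.org',
        [ 'rec.games.mecha', 17 ], [ 'rec.games.misc', 203 ] ), "\n";
```

Each "group:number" pair in the Xref: value corresponds to one location in the archive, so the same list could drive the hardlinking of crossposts.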
NOTES
This module has grown out of my original kiboze.pl scripts, which
accomplished essentially the same writing functions but none of the
reading ones. While a write-only interface has been somewhat beneficial,
this should be much more helpful.
TODO
Start using the AutoLoader (or something like it)
Close and re-open the databases periodically, to write stuff out while
in the middle of an operation.
While we currently have basic hashing taking place on the newsgroups to
prevent the directories from getting too large, it would be nice if this
were instead done as a time-hash - that is, if the article was from 28
Apr 2004, we could make directories that looked like 2004.01.01 (yearly
hashing), 2004.04.01 (monthly), or 2004.04.28 (daily).
More News::Web changes to better connect with News::Archive would be
nice.
Using a different Overview format may make sense.
Offer some functions to rebuild overview information later.
Offer something to make default ~/.kibozerc files.
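The time-hash idea in the TODO list above could look something like the sketch below. "time_bucket()" is a hypothetical helper, and the choice of UTC for the article date is an assumption; only the three directory name formats come from the text.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Hypothetical helper: name the directory bucket for an article's date.
sub time_bucket {
    my ( $epoch, $granularity ) = @_;
    my %format = (
        yearly  => '%Y.01.01',
        monthly => '%Y.%m.01',
        daily   => '%Y.%m.%d',
    );
    my $fmt = $format{$granularity}
        or die "Unknown granularity '$granularity'\n";
    return strftime( $fmt, gmtime($epoch) );
}

# 28 Apr 2004, 12:00 UTC (months are zero-based in timegm).
my $posted = timegm( 0, 0, 12, 28, 3, 2004 );
print time_bucket( $posted, $_ ), "\n" for ( 'yearly', 'monthly', 'daily' );
# Prints 2004.01.01, 2004.04.01 and 2004.04.28, one per line.
```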
REQUIREMENTS
"Net::NNTP::Functions", News::Article, News::Overview, News::Lock,
News::Active, News::GroupInfo, DB_File
SEE ALSO
Modules: News::Active, News::GroupInfo, News::Article, News::Web,
newslib, newsrecurse.pl
Scripts: kiboze.pl, newsarchive.pl, mbox2news.pl
AUTHOR
Tim Skirvin <tskirvin@killfile.org>
HOMEPAGE
http://www.killfile.org/~tskirvin/software/news-archive/
LICENSE
This code may be redistributed under the same terms as Perl itself.
COPYRIGHT
Copyright 2003-2007, Tim Skirvin.