Citation graphs, alchemistry
Aug. 21st, 2007 09:51 am
Things I'm thinking of coding that I've not found anyone else has done.
1. A program that takes a group of articles and produces a citation graph. More specifically, each article gets a vertex/node in the graph, with incoming edges from the articles that cite it and outgoing edges to the articles it cites.
It would be in several parts:
* Core: takes article data in a chosen form, gives a graph (a minimal sketch follows this list)
* Graphics: takes graph, displays it (possibly in a way that can be interacted with, but that'd be another part)
* Parser (optional): takes article title and bibliography in text form, gives article data in a form the core can cope with
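To make the core concrete, here's a minimal sketch in Python of what "takes article data, gives a graph" might look like. The plain-dict storage and the field names are my assumptions, not a settled design:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    """One vertex in the citation graph."""
    key: str                                     # any unique id, e.g. "chomsky1965"
    title: str = ""
    cites: set = field(default_factory=set)      # keys of articles this one cites
    cited_by: set = field(default_factory=set)   # keys of articles that cite this one

def build_graph(citation_pairs):
    """Build the graph from (citing_key, cited_key) pairs."""
    graph = {}
    for src, dst in citation_pairs:
        graph.setdefault(src, Article(key=src)).cites.add(dst)
        graph.setdefault(dst, Article(key=dst)).cited_by.add(src)
    return graph
```

Storing the edges in both directions means both questions I care about - what does this article cite, and what cites it - are single lookups.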
*Characterizing and mining citation graph of computer science literature* - someone has implemented this as part of their MCSc thesis (Masters of Computer Science, I assume). A few moments more searching found the thesis itself.
There seem to be a fair few papers written about it, which I'm planning to read more of.
However, my interest isn't in document clustering or how well-spread-out citations are (though that could tie into one of the themes for my MPhil, so it might be worth looking at), but in finding and recording, in a form I can refer back to, which papers have been cited on a topic and which papers I may want to read later.
What I'd like to make, in essence, will do the following things:
* Keep a graph of articles and their citations
* Allow me to mark, in some way, which articles I have read and/or found a copy of
* Keep enough data about each article that I can cite it properly in a later paper (title, authors, date, publisher, pages, and an abstract)
* Allow me to import article data for articles that have not yet been added to the graph, either by typing the article data in manually or by pulling it from an online database such as CiteSeer
* When new articles are imported, link them to existing articles by their citations (a rough sketch of this step follows the list)
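For that import-and-link step, continuing the Article sketch above, something like the following might do. Matching citations by normalized title is my assumption; real bibliographic matching is messier than this:

```python
def normalise(title):
    """Crude title normalisation; real matching needs more care."""
    return " ".join(title.lower().split())

def import_article(graph, key, title, cited_titles, read=False):
    """Add a new article to the graph and link it, by title match,
    to any existing articles that appear in its bibliography."""
    art = graph.setdefault(key, Article(key=key, title=title))
    art.read = read                       # the mark-as-read flag from the list above
    by_title = {normalise(a.title): a.key for a in graph.values() if a.title}
    for cited in cited_titles:
        dst_key = by_title.get(normalise(cited))
        if dst_key is not None:           # only link citations we already know about
            art.cites.add(dst_key)
            graph[dst_key].cited_by.add(key)
    return art
```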
NOTE: Automated harvesting is bloody annoying for citation websites. arXiv.org, for example, has a warning that they do not allow it. This is the general attitude to such things - robots can hit websites much harder than human users can - and even sites such as Amazon that do let you interface with their database limit how many requests you can make at a time.
With that in mind, the approach I'd take would be to provide the user with a built-in browser with links to various citation sites, and help them find the appropriate page, which the program could then pull the data from (the page is already loaded into memory, so the only page loads are from searches the user does manually).
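As a sketch of that "pull the data off the page" step, using Python's standard html.parser on a page already in memory; the `class="citation"` markup is invented for illustration, since every citation site would need its own extraction rules:

```python
from html.parser import HTMLParser

class CitationExtractor(HTMLParser):
    """Collect the text of elements carrying a (hypothetical)
    class="citation" attribute. Crude: doesn't handle nested tags."""
    def __init__(self):
        super().__init__()
        self._tag = None        # tag we're currently collecting text from
        self.citations = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and ("class", "citation") in attrs:
            self._tag = tag
            self.citations.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self.citations[-1] += data

page_html = '<li class="citation">Author (2001). Some Paper Title.</li>'
extractor = CitationExtractor()
extractor.feed(page_html)       # the page the built-in browser already fetched
print(extractor.citations)      # ['Author (2001). Some Paper Title.']
```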
(EDIT: CiteSeer does have a way to search its records programmatically, which is quite neat. I'll look into what citation websites there are, and what their policies are, when I start planning this more thoroughly and get to the citation-harvester part.)
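If that programmatic interface is OAI-PMH (the standard harvesting protocol many repositories expose), a single request might look like the sketch below. The endpoint URL is my assumption - check the site's current documentation and rate limits before relying on it:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://citeseerx.ist.psu.edu/oai2"   # assumed endpoint
DC = "{http://purl.org/dc/elements/1.1/}"            # Dublin Core XML namespace

def list_titles(from_date):
    """Fetch one page of records changed since from_date, yield their titles."""
    query = urlencode({"verb": "ListRecords",
                       "metadataPrefix": "oai_dc",
                       "from": from_date})
    with urlopen(OAI_ENDPOINT + "?" + query) as response:
        tree = ET.parse(response)
    for title in tree.iter(DC + "title"):
        yield title.text

for title in list_titles("2007-01-01"):
    print(title)
```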
The example use for this I have in mind is the literature I looked at for my undergraduate dissertation, which was mainly in linguistics journals or books, but occasionally went into anthropology, psychology and neurolinguistics.
I have a limited number of articles I cited for the dissertation, and a slightly larger number that I actually read; to catch up with the current state of research, I'd like to find out what's been written since. I'd have to find the articles in question first, but a citation graph would let me track who's been citing what, and thus whose work is being read.
2. A program that lets you play around with Alchemy boards from Puzzle Pirates.
This would consist of several parts:
* Core - Takes a screenshot of the board as input, and splits it into individual hex blocks, inputs (where the colour comes from) and outputs (the bottles).
* User interaction - Takes a board state (hex blocks, their rotations, inputs and outputs) and displays it in a way similar to Puzzle Pirates, letting the user play around with moves (a rough model of the board state is sketched below).
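I haven't pinned down the game's exact piece shapes, but a minimal model might be: each hex block connects pairs of its six edges with pipes, and a move rotates those connections around the hex. A sketch, with the edge-pair representation entirely my guess:

```python
from dataclasses import dataclass

@dataclass
class HexBlock:
    """One hex piece: pipes join pairs of its six edges,
    numbered 0-5 clockwise from the top edge."""
    connections: tuple      # e.g. ((0, 3), (1, 2)) -- two pipes
    rotation: int = 0       # clockwise sixth-turns applied so far

    def rotate(self, turns=1):
        """A user move: turn the piece clockwise."""
        self.rotation = (self.rotation + turns) % 6

    def current_connections(self):
        """Edge pairs after applying the current rotation."""
        return tuple(tuple((edge + self.rotation) % 6 for edge in pair)
                     for pair in self.connections)

block = HexBlock(connections=((0, 3), (1, 2)))
block.rotate()
print(block.current_connections())   # ((1, 4), (2, 3))
```

A board state would then be a mapping from hex coordinates to HexBlock instances, plus the input and output positions, and "does this colour reach that bottle" becomes a path search over the rotated connections.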