In what ways do networks come?
Or more accurately, what can I load in Gephi?
There are a number of ways that you can store network data and Gephi will accept most of those
formats. Here I'll try to give you an overview of the different file types and how they are stored.
This list isn't exhaustive, but is directed towards what Gephi can load.
To show the differences I've created a very simple sample network with only four nodes. The network
looks like this:
Told you it was simple! I'll go through each of these file formats and show what the network would
look like in each. What type of file format you need depends on how much information you want to
store. If you want to store what a graph should visually look like, then you should use a more
complex file type (like GEXF or GraphML). If you want it to be easily edited or used by someone else
then something like an edgelist is best. If you're going to geek out on some graph statistics, then
most people prefer an adjacency matrix. The Gephi website has a great
table that compares all of the formats
and their ability to encode different information.
For graph formats that are very long you won't be
able to see the details of the format in these small thumbnails, but if you open them in a new tab
you will see the image at full resolution. Also, I'm not going to go over every single format, just
the most important ones, which between them cover the main use cases for storing different kinds of
information. As an example, GraphML and GEXF both store network data in a similar manner, but GEXF
can also encode dynamic (time-varying) data, so I show it and skip GraphML.
One thing I do suggest is to save your graph in some format other than the Gephi file format,
one that has the features that you need. That way if
another program ever comes along that is better to use for you than Gephi, it'll be easy to load
your data into it.
CSV files
A CSV file is a very generic file format and you've probably already seen it elsewhere. If you open
a CSV file in Excel you'll see a spreadsheet; if you open it in a text editor you'll just see rows
of values separated by commas (hence the name: CSV = 'comma-separated values'). You can
store as many values as you want in each row, but the simplest example of a network stored in a CSV
is one where you have two columns of node names. Each row represents an edge and establishes a connection
from one node to another node.
A CSV file can be useful because you can store a large number of attributes for the edges by just
having additional columns. So you could label an edge as 'friend' or 'enemy' if you had a high
school social network for example.
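As a rough sketch of that idea, here is how you might write such a file from Python using the standard `csv` module. The four node names, the edges, and the `Relationship` column are stand-ins I made up for illustration (they are not the sample network pictured above), and the `Source`/`Target` header names are simply my choice of labels.

```python
import csv
import io

# Hypothetical four-node network: node names and edge labels are assumptions.
edges = [
    ("A", "B", "friend"),
    ("A", "C", "friend"),
    ("B", "C", "enemy"),
    ("C", "D", "friend"),
]


def edges_to_csv(rows):
    """Serialize edge rows to CSV text, one edge per line, with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The first two columns are the connected nodes; any extra column is an edge attribute.
    writer.writerow(["Source", "Target", "Relationship"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = edges_to_csv(edges)
print(csv_text)
```

Adding another attribute is just a matter of adding another column to the header and to each row.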
Edgelists
An edgelist is very simple and is similar to a very basic CSV file. Each edge is a single line in
the file and the two connected nodes are listed. The difference is that a space separates the node
names instead of a comma. Typically this will be the only information contained in an edgelist,
although sometimes there is a third field (again separated by a space) that holds the numeric
weight of the edge. Edgelists are very simple (both to read and to make by hand), which is why
they are so widely available, but it also means they can't hold more complex information (like
hierarchy or where nodes should be placed in an image).
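Because the format is so simple, reading it takes only a few lines of code. The sketch below parses an edgelist with an optional third weight field. The sample text is a made-up four-node network, and falling back to a weight of 1.0 when the field is missing is my own assumption, not part of any standard.

```python
def parse_edgelist(text):
    """Parse space-separated edge lines into (source, target, weight) tuples."""
    edges = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"line {line_number}: expected 2 or 3 fields, got {len(fields)}")
        source, target = fields[0], fields[1]
        # Assumption: an edge without an explicit weight gets weight 1.0.
        weight = float(fields[2]) if len(fields) == 3 else 1.0
        edges.append((source, target, weight))
    return edges


# Hypothetical sample: one edge per line, one of them weighted.
sample = """A B
A C
B C 2.5
C D
"""

parsed = parse_edgelist(sample)
for edge in parsed:
    print(edge)
```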
GEXF
A GEXF file has a much more complicated structure. It uses XML (a markup format for storing data), which isn't very friendly for a human to write but is very easy for a computer to read and generate. Because of this it can store lots of additional information (like how the graph should be visualized), but you wouldn't want to write it by hand.
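Since you wouldn't want to write it by hand, here is a minimal sketch of generating a bare-bones GEXF document with Python's built-in `xml.etree.ElementTree`. The node names and edges are a hypothetical four-node network, and the namespace and version strings are the ones I understand the GEXF 1.2 draft to use, so check them against the format documentation before relying on this. It stores only nodes and edges, none of the visual or dynamic extras.

```python
import xml.etree.ElementTree as ET

# Hypothetical four-node network: names and edges are assumptions.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_gexf(node_labels, edge_pairs):
    """Build a minimal static, undirected GEXF document and return it as a string."""
    # Assumed namespace/version values for the GEXF 1.2 draft.
    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "undirected"})

    # Every node gets a string id plus a human-readable label.
    nodes_el = ET.SubElement(graph, "nodes")
    ids = {}
    for index, label in enumerate(node_labels):
        ids[label] = str(index)
        ET.SubElement(nodes_el, "node", {"id": ids[label], "label": label})

    # Edges refer to nodes by id, not by label.
    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edge_pairs):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(index), "source": ids[source], "target": ids[target]},
        )
    return ET.tostring(gexf, encoding="unicode")


gexf_text = build_gexf(nodes, edges)
print(gexf_text)
```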
GML
Graph Modelling Language (GML) is one of the earlier attempts to create a file that could save
additional attributes and visual representation. It's a bit verbose but it still could be written
relatively easily by hand. You have a graph object that you declare and the parts of the graph go
inside the brackets. To specify a node or an edge you say that it is a `node` or `edge` and put its
attributes inside the brackets. When you specify the nodes that have an edge you use their integer
id instead of their label (even if you already have integer labels, a separate numeric id is still
created for the nodes). GML works fairly well, but in the past I have had a few networks in the GML
format be unreadable when moved between different programs, so I don't typically recommend using it.
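To make the bracket structure concrete, here is a small sketch that writes GML text for a hypothetical four-node network (the names and edges are my own stand-ins). It emits only `id` and `label` for nodes and `source` and `target` for edges, which is the bare minimum described above.

```python
# Hypothetical four-node network: names and edges are assumptions.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def to_gml(node_labels, edge_pairs):
    """Render nodes and edges as minimal GML text."""
    # Each label is mapped to an integer id; edges use the ids, not the labels.
    ids = {label: index for index, label in enumerate(node_labels)}
    lines = ["graph ["]
    for label, node_id in ids.items():
        lines += ["  node [", f"    id {node_id}", f'    label "{label}"', "  ]"]
    for source, target in edge_pairs:
        lines += [
            "  edge [",
            f"    source {ids[source]}",
            f"    target {ids[target]}",
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines)


gml_text = to_gml(nodes, edges)
print(gml_text)
```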
JSON
JSON is not supported by Gephi as an input currently, but I am including it because it is both (i)
my current favorite and (ii) a popular format to use for web visuals (see the
D3 website). JSON can be very compact (and ugly)
or have whitespace added to make it very readable. Similar to GML, it uses very basic syntax and has
different entries for nodes and edges. In the same manner, it also numerically labels nodes in
the order that they are listed and uses those numeric labels when defining edges. Unlike GML though
(and confusingly for people the first time they see it), this labelling is not made explicit! It
doesn't say anywhere that the first node listed has an id of 0; you just need to know it.
In any case, it's a really nice portable format that can be directly used on the web.
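Here is a short sketch of that implicit numbering, using Python's `json` module. The `nodes`, `links`, `name`, `source` and `target` keys follow the layout I have seen in D3 examples, so treat them as an assumption, and the four-node network is again made up.

```python
import json

# Hypothetical four-node network: names and edges are assumptions.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def to_node_link(node_labels, edge_pairs):
    """Build a node-link dictionary where edges refer to nodes by list position."""
    # The position of a node in the list is its implicit id: first node is 0.
    index_of = {label: index for index, label in enumerate(node_labels)}
    return {
        "nodes": [{"name": label} for label in node_labels],
        "links": [
            {"source": index_of[source], "target": index_of[target]}
            for source, target in edge_pairs
        ],
    }


data = to_node_link(nodes, edges)

# The same data, once compact (and ugly) and once with whitespace for readability.
compact = json.dumps(data, separators=(",", ":"))
readable = json.dumps(data, indent=2)
print(compact)
print(readable)
```

Notice that nothing in the output says node `A` is 0. Reordering the `nodes` list without updating the links would silently change which nodes are connected.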