Doing a wordle-like word cloud.
I know, word clouds are a bit out of style, but I kind of like them anyway. My motivation for thinking about word clouds was that they could be combined with topic models to give somewhat more interesting visualizations.
So I looked around to find a nice open-source implementation of word clouds ... only to find none. (This was a while ago; maybe that has changed since.)
While I was bored in the train last week, I came up with this code.
A little today-themed taste:
The first step is to get some document. I used the Constitution of the United States for the above.
```python
with open("constitution.txt") as f:
    lines = f.readlines()
text = "".join(lines)
```
The next step is to extract words and give the words some weighting - for example how often they occur in the document. I used scikit-learn's CountVectorizer for that as it is convenient and fast, but you could also use nltk or just some regexp.
I get the counts of the 200 most common non-stopwords and normalize by the maximum count (to be somewhat invariant to document size).
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0, charset_error="ignore",
                     stop_words="english", max_features=200)
counts = cv.fit_transform([text]).toarray().ravel()
words = np.array(cv.get_feature_names())
# normalize
counts = counts / float(counts.max())
```
Now the real work starts. The basic idea is to randomly sample a place on the canvas and draw a word with a size related to its importance (frequency).
We have to take care not to make the words overlap, though.
There seems to be no good alternative to the Python image library (PIL), which is really, really horrible. There are no docstrings. You specify colors using strings. There is a weird module structure. There are no docstrings.
Anyway, we can get a canvas and a drawing object like this:
```python
from PIL import Image, ImageDraw, ImageFont

img_grey = Image.new("L", (width, height))
draw = ImageDraw.Draw(img_grey)
```

We can then write in the image using

```python
font = ImageFont.truetype(font_path, font_size)
draw.setfont(font)
draw.text((y, x), "Text that will appear in white", fill="white")
```

The `font_path` here is an absolute path to a TrueType font on your system. I found no way to get around this (didn't look very hard, though).
OK, now we can draw random positions and see if we can draw there without touching any other words.
There is a handy function, `ImageDraw.textsize`, which tells you how large a piece of text will be once rendered. We can use that to test whether there is any overlap.
Unfortunately, random sampling any place in the image turns out to be very inefficient: if a lot of the room is already taken, we have to try quite often to find some space.
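For reference, the naive rejection-sampling step might look like the following sketch. It treats the canvas as a numpy array; `find_position` is a hypothetical helper of my own, not the code from this post:

```python
import random
import numpy as np

def find_position(img_array, box_w, box_h, max_tries=1000):
    """Sample random spots until the text box fits on empty canvas."""
    height, width = img_array.shape
    for _ in range(max_tries):
        x = random.randint(0, width - box_w)
        y = random.randint(0, height - box_h)
        # the spot is free if nothing has been drawn there yet
        if not img_array[y:y + box_h, x:x + box_w].any():
            return x, y
    return None  # gave up after too many collisions
```

The `max_tries` cutoff is exactly the problem: on a crowded canvas most draws collide.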
My next idea was to first find all possible free places in the image and then sample randomly from those. The easiest way to find free positions is to convolve the current image with a box of size `ImageDraw.textsize(next_word)`. The places where the result is zero are exactly the places that have enough room for the text. Using `scipy.ndimage.uniform_filter`, that worked quite nicely.
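A sketch of that convolution approach, assuming the canvas is available as a numpy array. Note that `uniform_filter` centers its window, so the zero entries mark window centers rather than top-left corners (which is fine for sampling):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def free_centers(img_array, box_w, box_h):
    """Return the centers of all (box_h, box_w) windows that contain
    no drawn pixel at all."""
    occupied = (img_array > 0).astype(float)
    # mean occupancy over a box-sized window around each pixel
    window_mean = uniform_filter(occupied, size=(box_h, box_w))
    # a tiny tolerance guards against floating-point residue
    ys, xs = np.nonzero(window_mean < 1e-12)
    return xs, ys
```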
But what do we do if there is not enough room to draw a word in the size we want?
Then we have to make the font smaller and try again. Which means convolving the image again, this time with a somewhat smaller box.
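The shrink-and-retry loop might be sketched like this, where `find_free_positions` is a hypothetical callback standing in for the convolution step above:

```python
import random

def place_word(find_free_positions, font_size, min_font_size=4):
    """Shrink the font until the word fits somewhere, or give up."""
    while font_size >= min_font_size:
        positions = find_free_positions(font_size)
        if positions:
            return random.choice(positions), font_size
        # no room at this size: shrink and try again
        font_size = int(font_size * 0.9)
    return None, None
```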
The code wasn't very fast and this seemed pretty wasteful, so I wanted to use another approach: integral images! Integral images are a way to pre-compute a simple 2d structure from which it is possible to extract the sum over arbitrary rectangles in the image in constant time.
The integral image is basically a 2d cumulative sum and can be computed as

```python
integral_image = np.cumsum(np.cumsum(image, axis=0), axis=1)
```

This can be done once, and then we can look up rectangles of any size very fast. If we are interested in windows of size `(w, h)`, we can find the sum over all possible windows of this size via

```python
area = (integral_image[w:, h:] + integral_image[:-w, :-h]
        - integral_image[w:, :-h] - integral_image[:-w, h:])
```

This is a combination of the integral image query (see Wikipedia) and my favorite numpy trick to query all positions simultaneously.
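To see that the slicing trick works, here is a tiny self-contained check. I pad the integral image with a zero row and column so that each entry of `area` lines up exactly with the window's top-left corner (without the padding there is a one-pixel offset, which does not matter for the word cloud):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 5, size=(6, 8))

# pad with a zero row/column: integral_image[i, j] == image[:i, :j].sum()
integral_image = np.zeros((7, 9), dtype=np.int64)
integral_image[1:, 1:] = np.cumsum(np.cumsum(image, axis=0), axis=1)

w, h = 3, 2
# one constant-time query per possible top-left corner, all at once
area = (integral_image[w:, h:] + integral_image[:-w, :-h]
        - integral_image[w:, :-h] - integral_image[:-w, h:])

# agrees with the naive sum for every window position
for i in range(area.shape[0]):
    for j in range(area.shape[1]):
        assert area[i, j] == image[i:i + w, j:j + h].sum()
```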
So basically this does the same as the convolution above, only it precomputes a structure so that we can query for all possible window sizes.
After drawing a word, we have to compute the integral image again.
Unfortunately, the fancy indexing with the integral image was a bit sluggish.
On the other hand, that was a great opportunity to try out typed memory views in cython, which I learned about from Stefan Behnel at PyCon DE :)
```cython
def query_integral_image(unsigned int[:, :] integral_image,
                         int size_x, int size_y):
    cdef int x = integral_image.shape[0]
    cdef int y = integral_image.shape[1]
    cdef int area, i, j
    x_pos, y_pos = [], []
    for i in xrange(x - size_x):
        for j in xrange(y - size_y):
            area = integral_image[i, j] + integral_image[i + size_x, j + size_y]
            area -= integral_image[i + size_x, j] + integral_image[i, j + size_y]
            if not area:
                x_pos.append(i)
                y_pos.append(j)
```

Awesome! Easy to write down and straight to C speed.
Except for the last two lines ... lists are not fast.
I couldn't get it that much faster (the array module doesn't have a C API, afaik).
I wanted to sample from all possible positions anyway, so I just ran the above code twice: once counting how many possible positions there are, then sampling, then walking to the position that I sampled.
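In pure Python the two-pass trick looks like this (a sketch; `is_free` is a hypothetical stand-in for the `area == 0` test from the Cython loop):

```python
import random

def sample_free_position(is_free, n_x, n_y):
    """Count the free positions, draw an index, then walk to it,
    avoiding any list building in the hot loop."""
    hits = 0
    for i in range(n_x):
        for j in range(n_y):
            if is_free(i, j):
                hits += 1
    if hits == 0:
        return None
    goal = random.randrange(hits)
    for i in range(n_x):
        for j in range(n_y):
            if is_free(i, j):
                if goal == 0:
                    return i, j
                goal -= 1
```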
Using C++ lists would probably be easier, but I was too lazy to try...
Anyhow, now I had pretty decent integral images :)
The building still took some time, though... so I lazily recomputed only the part that changed after drawing a new word.
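The partial recomputation can be sketched like this (my own reconstruction, not the code from the post): after drawing into the region below and to the right of `(x, y)`, only that part of the integral image needs refreshing, because the sums for the rows above and the columns to the left are unchanged.

```python
import numpy as np

def update_integral_image(integral_image, img_array, x, y):
    """Recompute integral_image[x:, y:] after img_array changed only in
    that region; integral_image[i, j] == img_array[:i+1, :j+1].sum()."""
    # cumulative sums of the changed region on its own
    partial = np.cumsum(np.cumsum(img_array[x:, y:], axis=0), axis=1)
    region = partial.astype(integral_image.dtype)
    if x > 0:
        # add the sums of the untouched rows above
        region += integral_image[x - 1, y:][np.newaxis, :]
    if y > 0:
        # add the sums of the untouched columns to the left
        region += integral_image[x:, y - 1][:, np.newaxis]
    if x > 0 and y > 0:
        # the top-left block was added twice
        region -= integral_image[x - 1, y - 1]
    integral_image[x:, y:] = region
```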
Check out the full code on github.
It is not very pretty, but I think it should be quite readable.
Less talk more pictures:
To scale the fonts I used some arbitrary logarithmic dependency on the frequency, which I felt looked decent.
It is also possible to just let words become smaller if there is no more room.
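For illustration, one possible logarithmic scaling might look like this. It is not necessarily the exact formula I used, which I picked by eye:

```python
import math

def font_size_for(count, max_font_size=120):
    """Map a normalized frequency in (0, 1] to a font size using an
    arbitrary logarithmic curve (illustrative only)."""
    return max(1, int(max_font_size * math.log2(1 + 7 * count) / 3))
```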
Oh, and of course I allowed flipping of the words :) I also played with using arbitrary colors. I didn't see anything like colormaps in PIL, so I just used the HSL space and sampled only the hue. More elaborate schemes are obviously possible.
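The hue sampling can be done with the standard library's `colorsys` (a sketch; the particular saturation and lightness values here are my own choice, not necessarily the ones I used):

```python
import colorsys
import random

def random_color(hue=None):
    """Sample a random hue at fixed saturation and lightness and
    return a PIL-compatible "rgb(r, g, b)" color string."""
    if hue is None:
        hue = random.random()
    # colorsys uses HLS argument order: hue, lightness, saturation
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.8)
    return "rgb(%d, %d, %d)" % (int(255 * r), int(255 * g), int(255 * b))
```

The returned string can be passed directly as the `fill` argument of `draw.text`.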
Again, I used a slight trick for a bit more speed: I first computed everything in grey-scale, saved all the positions and then re-did it in color.
One more, this time a bit more with the theme of the blog (can you guess what this is?)
And with less saturation:
There is definitely some room for improvement w.r.t. the look of it, but I feel this is already a nice start if you want to play around.
One last comment: I thought about improving performance (apparently the only thing on my mind during this little project) by doing the whole thing at a lower resolution and then recreating it at a higher one.
This has two problems: if you use too small a resolution, some text might actually become invisible because it is too small. The other problem is that PIL's font sizes don't scale linearly, so it is not possible to say "I want this font four times larger".
You can work around that but it's not pretty.
So I went with the cython / integral image way, which I think is kind of cool :)
If you scrolled down for the code, it is here.
PS: yes, this doesn't generate CSS / HTML. But as you get the text sizes and positions, it should be easy to use this as a backend to generate an HTML page. PR welcome ;)