| ⭅ Previous (Best NDS Emulators for Android) |
Programming Language Complexity
How complex is your programming language? In this article we’ll look at a few ways of quantifying complexity, and use it to look at a variety of programming languages in frequent use today.
Why? When building emulators, we ultimately have to capture the complexity of another (hardare) system in software. Having some good way of looking at and comparing complexity is useful.
Similarly, the tools (languages) we use for building these emulators impacts the complexity of the resulting work. If you’ve ever wondered about what it would take to make your own programming language, some of the investigation in this article should help you get an idea.
How to measure complexity
One way of measuring complex is known as Kolmogorov Complexity. This states that the Kolmogorov Complexity (K-Complexity) is the length of the thing, whether a string of text, a graphic, or even some algorith, for example, is just the length of the shortest program that can be used to recreate it.
This is a sort of theoretical bound for complexity. We don’t have any way to generate the shortest program for some description, short of essentially doing a depth first search for all programs starting at length 1.
Kolmogorov Complexity of Programming Languages
However, in cases where we have multiple solutions to the same problem, we have perhaps a better sampling of the space of possible solutions to some problem. With one solution to a problem taking 1000 lines, we might think its complexity is about that of 1000 lines. But if we can also implement the same functionality in the same language in 100 lines, we have a new, lower, upper bound on the complexity of the problem.
Programming languages provide us this sort of opportunity. The top programming languages often have several implementations. While the features between these vary somewhat, we can still use the metric of lines of code as a rough proxy for complexity.
Ideally, we could have several language implementations by the same small group of people, as this would remove variation caused by the brevity of different groups of programmers. Alternatively, if we had a competition to produce the shortest correct implementation, we might better approach the “shortest solution” for the implementation of these programming languages.
That said, the data still gives us something interesting to look at. In a future post, we’ll also look at how the choice of implementation language impacts the length of the solution.
Size of Programming Language Implementations
We’ve taken a variety of programming languages, as well as a variety of implementations for each language. We’ve chosen to select (language, implementation) pairs based on a mix of language popularity, size of implementation, and reputation of the programming language.
Implementations are grouped together. Without further ado:
| Language | Implementation | Implementation Language | Lines of Code |
|---|---|---|---|
| C | tinycc | C | 112806 |
| C | gcc (everything) | C++ | 13375674 |
| C | gcc/gcc/c (frontend) | C++ | 54107 |
| C++ | gcc/gcc/cp (frontend) | C++ | 222130 |
| C | gcc/gcc/config (backend) | C++ | 1207017 |
| C | clang | C++ | 1586720 |
| Python | CPython 3.15 | C+Python | 2007986 |
| Python | CPython 2.7.18 | C+Python | 1128575 |
| Python(JIT) | PyPy | Python | 1823348 |
| Ruby | Ruby | C | 1741273 |
| Lua | Lua 5.5.0 | C | 36496 |
| Lua(JIT) | LuaJit | C | 93920 |
| Javascript(JIT) | JavaScriptCore(apple/webkit) | C++ | 591170 |
| Javascript(JIT) | V8 (google/chromium) | C++ | 3151551 |
A few notes
For all languages, we’ve done a git checkout followed by cloc, and included the ’total lines of code' metric. This may overestimate slightly, by including tests and other metadata which is also code.
GCC is both the “GNU Compiler Collection”, and the “GNU C Compiler”. We were interested in just the C compiler for this investigation, but in this monorepo it is difficult to isolate just the c compiler code. Each language in GCC has “frontend” which handles language parsing and rules, before delegating to a shared middle/backend used by the other languages. So the C compiler implementation ends up being the frontend+backend.
We included TCC, another compiler written with the emphasis on being small and fast, for comparison. We also include clang, a newer C compiler written more heavily in C++ than C.
For Python, we include CPython, which is the official implementation. We have included the latest 2.x version, as well as the newest version overall which happened to be 3.15 at the time of writing. It has often been stated that Python 2.x was a simpler language, and the numbers strongly support this claim (about half as much code).
Lua was included as it has a reputation for being a small language. And indeed, it had the smallest implementation, despite also being implemented in C.
Finally, we include Javascript, as it is both the most popular language today and also deceptively complex.
Also note that while the C implementations are full compilers, and target native code. The other scripting languages (Python / Ruby / Lua) are interpretters, which eliminate the need for much of the backend. And PyPy / Luajit / JS contain just-in-time compilers, which are essentially interpretters AND native compilers.
Conclusion
Hopefully you found these numbers interesting and or informative. In our next investigation we’ll focus strictly on the impact of implementation language on the length of the resulting program.
Did we miss something or make a mistake? Feel free to email us at [email protected] and we’ll do our best to update the article.
| ⭅ Previous (Best NDS Emulators for Android) |