Are sets and symbols the building blocks of mathematics?
A formal language is defined as a set of strings of symbols. I want to know whether "symbol" is a primitive notion in mathematics, i.e., whether we don't define what a symbol is. If it is the case that in mathematics everything (every object) is a set and the members of a set are themselves sets, shouldn't we define symbols as sets? I'm confused about what comes first, set theory or formal languages.
The things you actually write on paper or some other medium are not definable as mathematical objects of any kind. Mathematical structures can at most be used to model (or approximate) real-world structures. For example we might say that we can have strings of symbols of arbitrary length, but in the real world we would run out of paper or ink or atoms or whatever it is we use to store our physical representations of strings.
So let's see what we can build non-circularly in what order.
Natural language
Ultimately everything boils down to natural language. We simply cannot define everything before we use it. For example we cannot define "define"... What we hope to do, however, is to use as few and as intuitive concepts as possible (described in natural language) to bootstrap to formal systems that are more 'powerful'. So let's begin.
Natural numbers and strings
We simply assume the usual properties of natural numbers (arithmetic and ordering) and strings (symbol extraction, length and concatenation). If we don't even assume these, we can't do string manipulation and can't define any syntax whatsoever. It is convenient to assume that every natural number is a string (say using binary encoding).
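To make that convention concrete, here is a minimal sketch in Python; the helper names are my own illustrative choices, not part of the assumed machinery.

```python
# A hedged sketch of the convention "every natural number is a string",
# via binary encoding; helper names are illustrative only.
def to_string(n: int) -> str:
    """Encode a natural number as a binary string, e.g. 6 -> "110"."""
    return bin(n)[2:]

def to_number(s: str) -> int:
    """Decode a binary string back to a natural number."""
    return int(s, 2)

assert to_string(6) == "110" and to_number("110") == 6
# Arithmetic and ordering on numbers then become string operations,
# alongside symbol extraction, length, and concatenation on strings.
```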
Program specification
Choose any reasonable programming language. A program is a string that specifies a sequence of actions, each of which is either a basic string manipulation step or a conditional jump. In a basic string manipulation step we can refer to any strings by name. Initially all strings named in the program are empty, except for the string named $input$, which contains the input to the program. A conditional jump allows us to test whether some basic condition is true (say that a number is nonzero) and jump to another step in the program iff it is so.

We can easily implement a $k$-fold iteration of a sequence of actions by using a natural number counter that is set to $k$ before that sequence and decreased by $1$ after it, jumping back to the start of the sequence as long as the counter is nonzero.

The execution of a program on an input is simply following the program (with $input$ containing the input at the start) until we reach the end, at which point the program is said to halt, and whatever is stored in the string named $output$ is taken as the output of the program. (It is possible that the program never reaches the end, in which case it does not halt. Note that at this point we do not (yet) want to affirm that every program execution either halts or does not halt. In special cases we might be able to observe that a program will not halt, but if we cannot tell, we will just say "We don't know." for now.)
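To make the model concrete, here is a minimal interpreter sketch in Python; the particular instruction set, register names, and step bound are my own illustrative choices, not part of the model above.

```python
# A hedged sketch of the program model: named string registers (all
# initially empty except `input`), basic string steps, and conditional
# jumps; falling past the last step means the program halts.
def run(program, input_str, max_steps=10**6):
    regs = {"input": input_str}
    pc, steps = 0, 0
    while 0 <= pc < len(program):
        steps += 1
        if steps > max_steps:  # we cannot decide halting in general,
            return None        # so this sketch gives up after a bound
        op, *args = program[pc]
        if op == "append":                 # append a literal to a register
            name, lit = args
            regs[name] = regs.get(name, "") + lit
        elif op == "chop":                 # drop the last symbol, if any
            (name,) = args
            regs[name] = regs.get(name, "")[:-1]
        elif op == "jump_if_empty":        # conditional jump
            name, target = args
            if regs.get(name, "") == "":
                pc = target
                continue
        pc += 1
    return regs.get("output", "")

# k-fold iteration in the manner described above: `input` acts as a
# unary counter that is decreased by one symbol on each pass.
double = [
    ("jump_if_empty", "input", 4),    # 0: counter exhausted -> halt
    ("chop", "input"),                # 1: decrease the counter
    ("append", "output", "11"),       # 2: the iterated action
    ("jump_if_empty", "scratch", 0),  # 3: `scratch` is never written,
]                                     #    so this always loops back

assert run(double, "111") == "111111"   # unary doubling: 3 -> 6
```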
One special class of programs consists of those where conditional jumps are only used to perform iteration (in the manner described above). These programs terminate on every input, and so they are in some sense the most primitive; indeed they are called primitive recursive. They are also the most acceptable in the sense that you can 'see clearly' that they always halt, and hence it is 'well-defined' to talk about the collection of strings that such a program accepts (i.e., on which its output is not the empty string), since on every input it halts and either accepts or does not accept. We call such collections primitive recursive as well. (As a side note, there are programs that always halt but are not primitive recursive.)
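In a modern language, this 'iteration only' discipline corresponds roughly to allowing only for-loops whose bounds are computed before the loop starts and never modified inside it; such programs visibly halt. A hedged sketch:

```python
# Primitive-recursive-style arithmetic built from bounded iteration
# only: each loop runs a number of times fixed before it starts, so
# termination on every input is evident.
def add(m: int, n: int) -> int:
    result = m
    for _ in range(n):          # n-fold iteration of the successor step
        result = result + 1
    return result

def mul(m: int, n: int) -> int:
    result = 0
    for _ in range(n):          # n-fold iteration of addition
        result = add(result, m)
    return result

def power(m: int, n: int) -> int:
    result = 1
    for _ in range(n):          # n-fold iteration of multiplication
        result = mul(result, m)
    return result

assert power(2, 10) == 1024
# An unrestricted while-loop (a general conditional jump) carries no
# such guarantee; whether it halts may be hard or impossible to tell.
```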
Formal system specification
We can now use programs to represent a formal system. Specifically, a useful formal system $T$ has a language $L$, which is a primitive recursive collection of strings, here called sentences over $T$, some of which are said to be provable over $T$. Often $T$ comes with a deductive system, which consists of rules governing what sentences you can prove given sentences that you have already proven. We might express each rule in the form "$φ_1,φ_2,...,φ_k \vdash ψ$", which says that if you have already proven $φ_1,φ_2,...,φ_k$ then you can prove $ψ$. There may even be infinitely many rules, but the key feature of a useful $T$ is that there is one single primitive recursive program that can be used to check a single deductive step, namely a single application of any one of the rules. Specifically, for such a $T$ there is a primitive recursive program $P$ that accepts a string $x$ iff $x$ encodes a sequence of sentences $φ_1,φ_2,...,φ_k,ψ$ such that $φ_1,φ_2,...,φ_k \vdash ψ$.
Since all useful formal systems have such a program, we can verify anyone's claim that a sentence $φ$ is provable over $T$, as long as they provide the whole sequence of deductive steps, which is one possible form of proof.
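As an illustration, here is a toy system of my own (not any real deductive system): sentences are strings over a made-up connective ">", the only rule is modus ponens, and the axiom set is a fixed finite set. The step checker plays the role of the program $P$ above, and the verifier checks a whole claimed proof.

```python
# A hedged toy example: the step checker P and a proof verifier for a
# made-up system whose only rule is modus ponens ("p", "p>q" |- "q").
AXIOMS = {"p", "p>q", "q>r"}        # hypothetical axioms for illustration

def step_ok(premises, conclusion):
    """Accept iff the conclusion follows from the premises by a single
    application of the (one) rule -- the role of the program P."""
    return any(b == a + ">" + conclusion
               for a in premises for b in premises)

def verify(proof):
    """Verify a claimed proof: each line must be an axiom or follow
    from earlier lines by one deductive step."""
    return all(line in AXIOMS or step_ok(proof[:i], line)
               for i, line in enumerate(proof))

assert verify(["p", "p>q", "q", "q>r", "r"])   # a valid derivation of "r"
assert not verify(["r"])                       # "r" alone is not provable
```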
Up to now we see that all we need to be philosophically committed to is the ability to perform (finitely many) string manipulations, and we can get to the point where we can verify proofs over any useful formal system. This includes the first-order systems PA and ZFC. In this sense we clearly can do whatever ZFC can do, but whether or not our string manipulations have any meaning whatsoever cannot be answered without stronger ontological commitment.
Gödel's incompleteness theorems
At this point we can already 'obtain' Gödel's incompleteness theorems, in both external and internal forms. In both we are given a useful formal system $T$ that can also prove whatever PA can prove (under a suitable translation). Given any sentence $P$ over $T$, we can construct a sentence $Prov_T(P)$ over $T$ that is intended to say "$P$ is provable over $T$". Then we let $Con(T) = ¬Prov_T(\bot)$. To 'obtain' the external form (if $T$ proves $Con(T)$ then $T$ proves $\bot$), we can explicitly write down a program that, given as input any proof of $Con(T)$ over $T$, produces as output a proof of $\bot$ over $T$. And to 'obtain' the internal form, we can explicitly write down a proof over $T$ of the sentence "$Con(T) \rightarrow \neg Prov_T(Con(T))$". (See this for more precise statements of this kind of result.)
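Collecting what this paragraph says in symbols (just a restatement, in the notation above):

$$\text{external form:}\quad \text{if } T \vdash Con(T) \text{ then } T \vdash \bot; \qquad \text{internal form:}\quad T \vdash Con(T) \rightarrow \neg Prov_T(Con(T)).$$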
The catch is that the sentence "$Prov_T(P)$" is completely meaningless unless we have some notion of interpretation of a sentence over $T$, which we have completely avoided so far so that everything is purely syntactic. We will get to a basic form of meaning in the next section.
Basic model theory
Let's say we want to be able to affirm that any given program on a given input either halts or does not halt. We can do so if we accept LEM (the law of excluded middle). With this we can now express basic properties of $T$, for example whether it is consistent (does not prove both a sentence and its negation), and whether it is complete (always proves either a sentence or its negation). This gives meaning to Gödel's incompleteness theorems: by the external form, if $T$ is really consistent then it cannot prove $Con(T)$, even though $Con(T)$ corresponds via the translation to an assertion about the natural numbers that is true iff $T$ is consistent.
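In symbols, the two properties just named are (again only restating the parentheticals):

$$T \text{ is consistent} \iff \text{for no sentence } φ \text{ do we have both } T \vdash φ \text{ and } T \vdash ¬φ,$$
$$T \text{ is complete} \iff \text{for every sentence } φ, \text{ either } T \vdash φ \text{ or } T \vdash ¬φ.$$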
But if we further want to be able to talk about the collection of strings accepted by a program (not just primitive recursive ones), we are essentially asking for a stronger set-comprehension axiom, in this case $Σ^0_1$-comprehension (not just $Δ^0_0$-comprehension). The area of Reverse Mathematics includes the study of the distinctions between such weak set-theoretic axioms; the linked Wikipedia article mentions these concepts and others that I talk about later, but a much better reference is this short document by Henry Towsner. With $Σ^0_1$-comprehension we can talk about the collection of all sentences that are provable over $T$, whereas previously we could talk about any one such sentence but not about the whole collection as a single object.
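For concreteness, here is one standard rendering of the schema (my formulation, with $θ$ ranging over $Δ^0_0$ formulas):

$$\exists X \; \forall n \; \big( n \in X \leftrightarrow \exists k \; θ(n,k) \big).$$

The collection of sentences provable over $T$ has exactly this shape: $φ$ is provable iff $\exists p$ ($p$ encodes a proof of $φ$), and checking that matrix is primitive recursive by the previous sections.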
Now to prove the compactness theorem, even for (classical) propositional logic, we need even more, namely WKL (weak König's lemma). And since the compactness theorem is a trivial consequence of the completeness theorem (say for natural deduction), WKL is also required to prove the completeness theorem. The same goes for first-order logic.
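For reference, the statement being invoked here is the standard reverse-mathematics one:

$$\text{WKL:}\quad \text{every infinite subtree of } 2^{<\omega} \text{ (the tree of finite binary strings) has an infinite path.}$$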
Turing jumps
Now it does not really make sense from a philosophical viewpoint to have only $Σ^0_1$-comprehension. After all, that is in some sense equivalent to having an oracle for the halting problem (for ordinary programs), which is the first Turing jump. The halting problem is undecidable, meaning that there is no program that always halts on every input $(P,x)$ and accepts iff $P$ halts on $x$. By allowing $Σ^0_1$-comprehension we are in a sense getting access to such an oracle. But then if we consider augmented programs that are allowed to query the first Turing jump (each query is answered in one step), the halting problem for these augmented programs is again undecidable by any such program, yet we can conceive of an oracle for that too, which is the second Turing jump. Since we allowed the first one, there is no really good reason to ban the second. And so on.
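The undecidability claim has a short classical diagonal argument, sketched here in Python; the decider `halts` is hypothetical (no correct implementation can exist), and the stub is only there so the sketch loads.

```python
# A hedged sketch of the diagonal argument for the halting problem.
def halts(P, x):
    """Hypothetical: always halts, returns True iff P halts on input x.
    No such program exists; this stub only makes the file loadable."""
    raise NotImplementedError("no such decider exists")

def diagonal(P):
    # Do the opposite of whatever `halts` predicts about P run on itself.
    if halts(P, P):
        while True:          # predicted to halt -> loop forever
            pass
    return "done"            # predicted to loop -> halt immediately

# If `halts` were a real program, diagonal(diagonal) would halt iff
# halts(diagonal, diagonal) returned False, contradicting the claim that
# `halts` answers that very question correctly.  An oracle that answers
# it anyway is exactly the first Turing jump.
```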
The end result is that we might as well accept full arithmetical comprehension (we can construct any set of strings or natural numbers definable by a formula in which all quantifiers range over natural numbers or strings). And from a meta-logical perspective, we ought to have the full second-order induction schema too, because we have only been accepting assumptions that hold for the standard natural numbers, namely those expressible in the form "$1+1+\cdots+1$".
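In symbols (a standard rendering, not quoted from any source), arithmetical comprehension and second-order induction read:

$$\exists X \, \forall n \, \big( n \in X \leftrightarrow φ(n) \big) \quad (φ \text{ arithmetical, } X \text{ not free in } φ),$$
$$\forall X \, \big( 0 \in X \wedge \forall n \, ( n \in X \rightarrow n+1 \in X ) \rightarrow \forall n \, ( n \in X ) \big).$$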
Note that everything up to this point can be considered predicative, in the sense that at no point do we construct any object whose definition depends on the truth value of some assertion involving itself (such as via some quantifier whose range includes the object to be constructed). Thus most constructively-inclined logicians are perfectly happy up to here.
Higher model theory
If you only accept countable predicative sets as ontologically justified, specifically predicative sets of strings (or equivalently subsets of the natural numbers), then the above is pretty much all you need. Note that from the beginning we have implicitly assumed a finite alphabet for all strings. This implies that we have only countably many strings, and hence we cannot have things like formal systems with uncountably many symbols. These occur frequently in higher model theory, so if we want to be able to construct anything uncountable we would need much more, such as ZFC.
One example of the power of ZFC is the construction of non-standard models via ultrapowers, which requires a weak form of the axiom of choice. The nice thing about this construction is that it is elegant: for instance, the resulting non-standard model of the reals captures quite nicely the idea of using sequences of reals modulo some equivalence relation as a model for the first-order theory of the real numbers, where eventually consistent behaviour of a sequence forces the corresponding property to hold of its equivalence class. The non-constructive ultrafilter is needed to decide whether the property holds in the cases without eventually consistent behaviour.
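A minimal sketch of the construction alluded to here: fix a non-principal ultrafilter $U$ on $\mathbb{N}$ (this is where weak choice enters) and identify sequences that agree $U$-often,

$$(a_n) \sim_U (b_n) \iff \{\, n : a_n = b_n \,\} \in U,$$

and more generally declare a first-order property to hold of the class $[(a_n)]$ iff the set of indices $n$ at which it holds of $a_n$ belongs to $U$ (Łoś's theorem guarantees this is coherent). Since every cofinite set belongs to $U$, eventually consistent behaviour decides the property via the tail; $U$ decides the remaining cases.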
I hope I have shown convincingly that although we need very little to get started with defining and using a formal system, even one as strong as ZFC, all the symbol-pushing is devoid of meaning unless we assume more, and the more meaning we want to express or prove, the stronger the assumptions we need. ZFC (excluding the axiom of foundation) is historically the first sufficiently strong system that can do everything mathematicians had been doing, so it is not surprising that it is also used as a meta-system to study logic. But you're not going to be able to ontologically justify ZFC, if that is what you're looking for.
Sets in set theories
Finally, your question might be based on the common misconception that in ZFC you have a notion of "set". Not really. ZFC is just another formal system and has no symbol representing "set". It is simply that the axioms of ZFC were designed so that it seems reasonable to assume that they hold for some vague notion of "sets" in natural language. Inside ZFC every quantifier ranges over the entire domain, so one cannot talk about sets as if there were other kinds of objects. If we use a meta-system that does not have sets, then a model of ZFC might not have any 'sets' at all, whatever "set" might mean!
In ZFC, one cannot talk about "the Russell set", since the comprehension axiom does not allow us to construct such a collection. In MK (Morse-Kelley) set theory, there is an internal notion of sets, and one can construct any class of sets definable by some formula, but one cannot construct anything that resembles a "class of classes" for the same reason as Russell's paradox.
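To make the ZFC half of this concrete (a standard formulation, not quoted from anywhere): the separation schema only carves subsets out of an already-given set $a$,

$$\forall a \, \exists y \, \forall x \, \big( x \in y \leftrightarrow ( x \in a \wedge φ(x) ) \big),$$

so taking $φ(x)$ to be $x \notin x$ merely yields $y = \{ x \in a : x \notin x \}$, from which Russell's argument only proves the harmless fact that $y \notin a$, rather than a contradiction.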
In the non-mainstream set theory NFU, one has both sets and urelements (extensionality only applying to sets), and so one can make sense of talking about sets here. But NFU is not a very user-friendly system anyway.
And this is also where my answer shall stop.